Charades-STA


Charades-STA is a benchmark dataset for temporal activity localization via language queries. It extends the Charades dataset with sentence-level temporal annotations: each video segment is paired with a natural language description and precise start and end boundaries. It contains 12,408 training and 3,720 testing segment-sentence pairs.
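This page does not state which metric the leaderboard reports. As an illustration only, localization tasks of this kind are commonly scored by the temporal intersection-over-union (IoU) between a predicted segment and the annotated one, either averaged over all queries or thresholded into a recall figure. The sketch below assumes segments are (start, end) pairs in seconds; the function names and the 0.5 threshold are hypothetical choices, not taken from llm-stats or the dataset release.

```python
from typing import Sequence, Tuple

# Assumption: a segment is (start_seconds, end_seconds) with start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals from the same video."""
    # Overlap is the distance between the later start and the earlier end,
    # clamped at zero when the intervals do not touch.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Segment], gts: Sequence[Segment]) -> float:
    """Average temporal IoU over paired predictions and annotations."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def recall_at_iou(
    preds: Sequence[Segment], gts: Sequence[Segment], threshold: float = 0.5
) -> float:
    """Fraction of queries whose predicted segment reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return hits / len(preds)
```

For example, a prediction of (0.0, 10.0) against an annotation of (5.0, 15.0) overlaps for 5 seconds out of a 15-second union, giving an IoU of one third, which would not count as a hit at a 0.5 threshold.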

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1 (leaderboard scores are shown as percentages of this maximum). Categories: language, multimodal, video, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen3 VL 235B A22B Instruct: 64.8% (self-reported, llm-stats)
  2. Qwen3 VL 235B A22B Thinking: 63.5% (self-reported, llm-stats)
  3. Qwen3 VL 30B A3B Instruct: 63.5% (self-reported, llm-stats)
  4. Qwen3 VL 32B Thinking: 62.8% (self-reported, llm-stats)
  5. Qwen3 VL 30B A3B Thinking: 62.7% (self-reported, llm-stats)
  6. Qwen3 VL 32B Instruct: 61.2% (self-reported, llm-stats)
  7. Qwen3 VL 8B Thinking: 59.9% (self-reported, llm-stats)
  8. Qwen3 VL 4B Thinking: 59.0% (self-reported, llm-stats)
  9. Qwen3 VL 8B Instruct: 56.0% (self-reported, llm-stats)
  10. Qwen3 VL 4B Instruct: 55.5% (self-reported, llm-stats)
  11. Qwen2.5 VL 32B Instruct: 54.2% (self-reported, llm-stats)
  12. Qwen2.5 VL 7B Instruct: 43.6% (self-reported, llm-stats)