MVBench

A comprehensive multi-modal video understanding benchmark covering 20 challenging video tasks that require temporal understanding beyond single-frame analysis. Tasks range from perception to cognition and include action recognition, temporal reasoning, spatial reasoning, object interaction, scene transition, and counterfactual inference. The benchmark uses a static-to-dynamic method to systematically generate video tasks from existing annotations.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, reasoning, spatial_reasoning, video, vision. Language: en. Verified by llm-stats: no.
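
The metadata gives a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of the conversion follows, assuming the displayed percentage is simply the raw score divided by the maximum score. The function name to_percent and the one-decimal rounding are illustrative choices, not part of llm-stats.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of max_score, to one decimal place.

    Assumption: the leaderboard percentage is score / max_score * 100.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{100 * score / max_score:.1f}%"


if __name__ == "__main__":
    # A raw score of 0.766 on a 0-to-1 scale is shown as 76.6%.
    print(to_percent(0.766))
```

Under this assumption, a raw score of 0.766 corresponds to the 76.6% shown for the top entry.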

Leaderboard

  1. Qwen3.5-122B-A10B: 76.6% (self-reported, llm-stats)
  2. Qwen3.6-27B: 75.5% (self-reported, llm-stats)
  3. Qwen3.5-35B-A3B: 74.8% (self-reported, llm-stats)
  4. Qwen3.5-27B: 74.6% (self-reported, llm-stats)
  5. Qwen2-VL-72B-Instruct: 73.6% (self-reported, llm-stats)
  6. Qwen3 VL 32B Thinking: 73.2% (self-reported, llm-stats)
  7. Qwen3 VL 32B Instruct: 72.8% (self-reported, llm-stats)
  8. Qwen3 VL 30B A3B Instruct: 72.3% (self-reported, llm-stats)
  9. Qwen3 VL 30B A3B Thinking: 72.0% (self-reported, llm-stats)
  10. Qwen2.5 VL 72B Instruct: 70.4% (self-reported, llm-stats)
  11. Qwen2.5-Omni-7B: 70.3% (self-reported, llm-stats)
  12. Qwen2.5 VL 7B Instruct: 69.6% (self-reported, llm-stats)
  13. Qwen3 VL 4B Thinking: 69.3% (self-reported, llm-stats)
  14. Qwen3 VL 8B Thinking: 69.0% (self-reported, llm-stats)
  15. Qwen3 VL 4B Instruct: 68.9% (self-reported, llm-stats)
  16. Qwen3 VL 8B Instruct: 68.7% (self-reported, llm-stats)