MVBench

reasoning

A comprehensive multi-modal video understanding benchmark covering 20 challenging video tasks that require temporal understanding beyond single-frame analysis. Tasks range from perception to cognition and include action recognition, temporal reasoning, spatial reasoning, object interaction, scene transition, and counterfactual inference. The benchmark uses a novel static-to-dynamic method to systematically generate video tasks from existing annotations.
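The page does not state how each model's single percentage is derived from the 20 tasks. One common convention for multi-task benchmarks, assumed here and not confirmed by this page, is to compute accuracy per task and then take the unweighted mean across tasks. A minimal sketch of that convention follows; the function name, the record layout, and the toy task labels are all hypothetical.

```python
from collections import defaultdict


def macro_average_accuracy(records):
    """Compute per-task accuracy and its unweighted mean, both in percent.

    `records` is an iterable of (task, predicted, gold) tuples.
    This layout is an assumption for illustration, not the benchmark's format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)

    # Accuracy per task, as a percentage.
    per_task = {task: 100.0 * correct[task] / total[task] for task in total}
    # Unweighted mean over tasks, so every task counts equally.
    overall = sum(per_task.values()) / len(per_task)
    return per_task, overall


# Hypothetical toy data: two tasks with two questions each.
records = [
    ("action_recognition", "A", "A"),
    ("action_recognition", "B", "C"),
    ("scene_transition", "D", "D"),
    ("scene_transition", "A", "A"),
]
per_task, overall = macro_average_accuracy(records)
print(per_task, overall)
```

Averaging over tasks rather than over individual questions keeps a task with many questions from dominating the overall score; if the tasks are equally sized, both approaches give the same number.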

Leaderboard

Showing 17 of 17 results

Model                        Score
Qwen3.5-122B-A10B            76.6%
Qwen3.6-27B                  75.5%
Qwen3.5-35B-A3B              74.8%
Qwen3.5-27B                  74.6%
Qwen3.6-35B-A3B              74.6%
Qwen2-VL-72B-Instruct        73.6%
Qwen3 VL 32B Thinking        73.2%
Qwen3 VL 32B Instruct        72.8%
Qwen3 VL 30B A3B Instruct    72.3%
Qwen3 VL 30B A3B Thinking    72.0%
Qwen2.5 VL 72B Instruct      70.4%
Qwen2.5-Omni-7B              70.3%
Qwen2.5 VL 7B Instruct       69.6%
Qwen3 VL 4B Thinking         69.3%
Qwen3 VL 8B Thinking         69.0%
Qwen3 VL 4B Instruct         68.9%
Qwen3 VL 8B Instruct         68.7%