VideoMME w/o sub.

multimodal

Video-MME is a comprehensive evaluation benchmark for multimodal large language models (MLLMs) in video analysis. It comprises 900 videos spanning 6 primary visual domains and 30 subfields, with durations ranging from 11 seconds to 1 hour, and 2,700 question-answer pairs. The benchmark evaluates how well MLLMs process sequential visual data and multimodal content, including video frames, subtitles, and audio. The results listed here are for the setting without subtitles.
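As a rough illustration of how a percentage score of this kind might be derived, the sketch below assumes that each question has a single correct option letter and that the reported score is plain accuracy over all questions. Neither assumption is stated on this page. The function names `questions_per_video` and `accuracy_percent` and the sample answers are hypothetical and not part of any official Video-MME tooling.

```python
from typing import Sequence

# Dataset sizes taken from the benchmark description above.
NUM_VIDEOS = 900
NUM_QUESTIONS = 2700


def questions_per_video(num_questions: int = NUM_QUESTIONS,
                        num_videos: int = NUM_VIDEOS) -> float:
    """Average number of question-answer pairs per video."""
    return num_questions / num_videos


def accuracy_percent(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Percentage of predictions matching the reference answers.

    Assumption: answers are single option letters compared
    case-insensitively, and the score is simple accuracy.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return round(100.0 * correct / len(answers), 1)


if __name__ == "__main__":
    print(questions_per_video())  # 3.0
    # Hypothetical toy data: three of four answers match.
    print(accuracy_percent(["A", "b", "C", "D"], ["A", "B", "C", "A"]))  # 75.0
```

With 2,700 questions over 900 videos, the average works out to three questions per video; under the accuracy assumption, each question contributes about 0.04 percentage points to the final score.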

Leaderboard


Model                          Score
Qwen3.5-122B-A10B              83.9%
Qwen3.5-27B                    82.8%
Qwen3.5-35B-A3B                82.5%
Qwen3.6-35B-A3B                82.5%
Qwen3 VL 235B A22B Instruct    79.2%
Qwen3 VL 235B A22B Thinking    79.0%
Qwen3 VL 32B Thinking          77.3%
Qwen2.5 VL 72B Instruct        73.3%
Qwen2.5 VL 32B Instruct        70.5%
Qwen2.5 VL 7B Instruct         65.1%