VideoMME (with subtitles)

multimodal

VideoMME is billed as the first comprehensive evaluation benchmark for multimodal LLMs in video analysis. It contains 900 videos (254 hours in total) with 2,700 question-answer pairs, covering 6 primary visual domains and 30 subfields. It evaluates temporal understanding on videos ranging from short (11 seconds) to long (1 hour), with multimodal inputs that include video frames, subtitles, and audio.
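The headline figures in the description imply two derived quantities: questions per video and average video length. The following is a minimal sketch of that arithmetic, assuming the 254 hours is the combined duration of all 900 videos; the function and variable names are illustrative and not part of the benchmark.

```python
# Headline figures quoted in the benchmark description.
NUM_VIDEOS = 900
TOTAL_HOURS = 254      # assumed to be the combined duration of all videos
NUM_QA_PAIRS = 2700


def derived_stats(num_videos, total_hours, num_qa_pairs):
    """Return (questions per video, mean video length in minutes)."""
    questions_per_video = num_qa_pairs / num_videos
    mean_minutes = total_hours * 60 / num_videos
    return questions_per_video, mean_minutes


qa_per_video, mean_minutes = derived_stats(NUM_VIDEOS, TOTAL_HOURS, NUM_QA_PAIRS)
print(f"{qa_per_video:.1f} questions per video")               # 3.0 questions per video
print(f"{mean_minutes:.1f} minutes per video on average")      # 16.9 minutes per video on average
```

The mean is only an average: the description gives individual video lengths ranging from 11 seconds to 1 hour, so the actual distribution may be far from uniform.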

Leaderboard

Model                      Score
Qwen3.6-27B                87.7%
Qwen3.5-122B-A10B          87.3%
Qwen3.5-27B                87.0%
GPT-5                      86.7%
Qwen3.5-35B-A3B            86.6%
Qwen3.6-35B-A3B            86.6%
Qwen2.5 VL 32B Instruct    77.9%
Qwen2.5-Omni-7B            72.4%
Qwen2.5 VL 7B Instruct     71.6%