Video-MME

reasoning

Video-MME is presented as the first comprehensive benchmark for evaluating Multi-modal Large Language Models (MLLMs) on video analysis. It comprises 900 videos totaling 254 hours, with 2,700 human-annotated question-answer pairs spanning 6 primary visual domains (Knowledge, Film & Television, Sports Competition, Life Record, Multilingual, and others) and 30 subfields. Videos range from 11 seconds to 1 hour in length, model inputs can include video frames, subtitles, and audio, and every question is manually labeled by expert annotators.
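As a rough sanity check on the scale of the benchmark, the averages implied by the figures above can be derived with a few lines of Python. This is only an illustrative sketch: the constant and function names are hypothetical, and the numbers are the ones quoted in the description, not values read from the dataset itself.

```python
# Figures quoted in the benchmark description (not read from the dataset).
NUM_VIDEOS = 900
TOTAL_HOURS = 254
NUM_QA_PAIRS = 2700


def per_video_stats(num_videos, total_hours, num_qa_pairs):
    """Return (questions per video, mean video length in minutes)."""
    questions_per_video = num_qa_pairs / num_videos
    mean_minutes = total_hours * 60 / num_videos
    return questions_per_video, mean_minutes


questions, minutes = per_video_stats(NUM_VIDEOS, TOTAL_HOURS, NUM_QA_PAIRS)
print(f"{questions:.1f} questions per video, {minutes:.1f} minutes per video on average")
```

Under these figures each video carries three questions on average, and the mean video length is a little under 17 minutes, even though individual videos range from 11 seconds to 1 hour.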

Leaderboard


Model                        Score
MiMo-V2.5                    87.7%
Kimi K2.5                    87.4%
MiniMax M3                   85.4%
Gemini 2.5 Pro               84.8%
Qwen3.6 Plus                 84.2%
Gemini 1.5 Pro               78.6%
Nova 2 Omni                  77.9%
Gemini 1.5 Flash             76.1%
Qwen3 VL 30B A3B Instruct    74.5%
Qwen3 VL 30B A3B Thinking    73.3%
Qwen3 VL 8B Thinking         71.8%
Qwen3 VL 8B Instruct         71.4%
Gemini 1.5 Flash 8B          66.2%
Phi-4-multimodal-instruct    55.0%