Video-MME


Video-MME is presented as the first comprehensive benchmark for evaluating Multi-modal Large Language Models (MLLMs) on video analysis. It comprises 900 videos totaling 254 hours, with 2,700 human-annotated question-answer pairs spanning 6 primary visual domains (Knowledge, Film & Television, Sports Competition, Artistic Performance, Life Record, and Multilingual) and 30 subfields. Videos range from 11 seconds to 1 hour, so models are tested across a wide span of temporal contexts. Beyond video frames, the benchmark supplies subtitles and audio as additional inputs, and all questions are manually labeled by expert annotators.
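As a quick sanity check on the dataset figures quoted above, the per-video averages can be derived with simple arithmetic. This is only a sketch: the variable names are illustrative, and the averages are computed from the stated totals, not taken from the benchmark's own documentation.

```python
# Figures as stated in the description above.
NUM_VIDEOS = 900
TOTAL_HOURS = 254
NUM_QA_PAIRS = 2700

# Average number of question-answer pairs per video.
questions_per_video = NUM_QA_PAIRS / NUM_VIDEOS

# Average video length in minutes, derived from the total duration.
avg_minutes_per_video = TOTAL_HOURS * 60 / NUM_VIDEOS

print(f"QA pairs per video: {questions_per_video:.0f}")
print(f"Average video length: {avg_minutes_per_video:.1f} minutes")
```

The totals work out to three question-answer pairs per video and an average length of roughly 17 minutes, which sits between the 11-second and 1-hour extremes.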

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.
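The metadata gives a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of how the two relate, assuming the score is plain accuracy over the question-answer pairs (an assumption, not something the metadata states); the function names `accuracy` and `as_percent` are hypothetical:

```python
def accuracy(predictions, answers):
    """Fraction of predictions that match the reference answers (0 to 1)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def as_percent(score):
    """Format a 0-to-1 score the way the leaderboard displays it."""
    return f"{score * 100:.1f}%"


# Toy example with four questions, three answered correctly.
score = accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"])
print(score, as_percent(score))
```

Under this reading, a leaderboard entry of 87.4% corresponds to a stored score of 0.874.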

Leaderboard

  1. Kimi K2.5: 87.4% (self-reported, llm-stats)
  2. MiniMax M3: 85.4% (self-reported, llm-stats)
  3. Gemini 2.5 Pro: 84.8% (self-reported, llm-stats)
  4. Qwen3.6 Plus: 84.2% (self-reported, llm-stats)
  5. Gemini 1.5 Pro: 78.6% (self-reported, llm-stats)
  6. Nova 2 Omni: 77.9% (self-reported, llm-stats)
  7. Gemini 1.5 Flash: 76.1% (self-reported, llm-stats)
  8. Qwen3 VL 30B A3B Instruct: 74.5% (self-reported, llm-stats)
  9. Qwen3 VL 30B A3B Thinking: 73.3% (self-reported, llm-stats)
  10. Qwen3 VL 8B Thinking: 71.8% (self-reported, llm-stats)
  11. Qwen3 VL 8B Instruct: 71.4% (self-reported, llm-stats)
  12. Gemini 1.5 Flash 8B: 66.2% (self-reported, llm-stats)
  13. Phi-4-multimodal-instruct: 55.0% (self-reported, llm-stats)