VideoMME w/o sub.


Video-MME is a comprehensive evaluation benchmark for multimodal large language models (MLLMs) in video analysis. It comprises 900 videos, ranging from 11 seconds to 1 hour in duration, that span 6 primary visual domains with 30 subfields, together with 2,700 question-answer pairs. The benchmark evaluates how well MLLMs process sequential visual data and multimodal content, including video frames, subtitles, and audio. The "w/o sub." variant listed here reports results obtained without subtitles.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, video, vision. Language: en. Verified by llm-stats: no.
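The metadata gives a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of that conversion follows, assuming scores are stored as fractions in the range 0 to 1. The function name `to_percent` and the raw values used here are hypothetical and are not taken from llm-stats.

```python
MAX_SCORE = 1.0  # from the metadata: "Max score: 1"


def to_percent(score: float, max_score: float = MAX_SCORE) -> str:
    """Format a raw score on the [0, max_score] scale as a percentage string."""
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return f"{100.0 * score / max_score:.1f}%"


# Hypothetical raw value chosen to match the top leaderboard entry.
print(to_percent(0.839))
```

Validating the range before formatting catches a common mix-up in which a value already expressed as a percentage (for example 83.9) is passed in as if it were a fraction.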

Leaderboard

  1. Qwen3.5-122B-A10B: 83.9% (self-reported, via llm-stats)
  2. Qwen3.5-27B: 82.8% (self-reported, via llm-stats)
  3. Qwen3.5-35B-A3B: 82.5% (self-reported, via llm-stats)
  4. Qwen3 VL 235B A22B Instruct: 79.2% (self-reported, via llm-stats)
  5. Qwen3 VL 235B A22B Thinking: 79.0% (self-reported, via llm-stats)
  6. Qwen3 VL 32B Thinking: 77.3% (self-reported, via llm-stats)
  7. Qwen2.5 VL 72B Instruct: 73.3% (self-reported, via llm-stats)
  8. Qwen2.5 VL 32B Instruct: 70.5% (self-reported, via llm-stats)
  9. Qwen2.5 VL 7B Instruct: 65.1% (self-reported, via llm-stats)