MMMU-Pro


MMMU-Pro is a more robust version of the multi-discipline multimodal understanding benchmark MMMU. It strengthens MMMU through a three-step process: filtering out questions that text-only models can answer, augmenting the candidate options, and introducing a vision-only input setting in which the question is embedded in the image. Models score substantially lower on MMMU-Pro than on the original MMMU, by 16.8% to 26.9% across models, which makes it a more rigorous evaluation that more closely mimics real-world scenarios.
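The first two steps of the process described above can be illustrated with a minimal sketch. It is not the benchmark authors' pipeline: the `Question` record, the solver callables, the `min_correct` threshold, and the distractor list are all hypothetical names introduced here for illustration. The third step (rendering the question into an image for the vision-only setting) needs an image library and is left out.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass(frozen=True)
class Question:
    """Hypothetical record for one multiple-choice item."""
    text: str
    options: List[str]
    answer: str   # text of the correct option
    image: str    # path or id of the associated image

# Hypothetical text-only solver: sees the question text and options,
# never the image, and returns the option it picks.
TextOnlySolver = Callable[[str, List[str]], str]

def is_text_only_answerable(q: Question,
                            solvers: List[TextOnlySolver],
                            min_correct: int) -> bool:
    """True if at least `min_correct` solvers answer correctly without the image."""
    correct = sum(1 for solve in solvers if solve(q.text, q.options) == q.answer)
    return correct >= min_correct

def filter_questions(questions: List[Question],
                     solvers: List[TextOnlySolver],
                     min_correct: int) -> List[Question]:
    """Step 1: drop questions that do not actually require the image."""
    return [q for q in questions
            if not is_text_only_answerable(q, solvers, min_correct)]

def augment_options(q: Question, distractors: List[str]) -> Question:
    """Step 2: add extra candidate options, skipping ones already present."""
    new = [d for d in distractors if d not in q.options]
    return replace(q, options=q.options + new)

def chance_accuracy(q: Question) -> float:
    """Expected accuracy of uniform random guessing on this question."""
    return 1.0 / len(q.options)
```

More candidate options lower the random-guess baseline, which is one reason scores drop after augmentation: under this sketch, a question with four options has a chance accuracy of 0.25, and one with ten options has 0.1.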

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: general, multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.
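The metadata gives a maximum score of 1, while the leaderboard shows percentages. A plausible reading, assumed here rather than confirmed by the source, is that scores are stored as fractions of the maximum and displayed as percentages. The helper `to_percent` below is a hypothetical illustration of that conversion.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of max_score with one decimal place."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return f"{100 * score / max_score:.1f}%"
```

Under this assumption, a stored value of 0.836 would be displayed as 83.6%.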

Leaderboard

Source for all entries: self-reported, via llm-stats.

  1. Gemini 3.5 Flash: 83.6%
  2. GPT-5.5: 83.2%
  3. Gemini 3 Flash: 81.2%
  4. GPT-5.4: 81.2%
  5. Gemini 3 Pro: 81.0%
  6. Gemini 3.1 Pro: 80.5%
  7. Muse Spark: 80.4%
  8. Kimi K2.6: 80.1%
  9. GPT-5.2: 79.5%
  10. Qwen3.6 Plus: 78.8%
  11. Kimi K2.5: 78.5%
  12. GPT-5: 78.4%
  13. MiniMax M3: 78.1%
  14. Claude Opus 4.6: 77.3%
  15. Gemma 4 31B: 76.9%
  16. Qwen3.5-122B-A10B: 76.9%
  17. Gemini 3.1 Flash-Lite: 76.8%
  18. GPT-5.4 mini: 76.6%
  19. o3: 76.4%
  20. GPT-5.5 Instant: 76.0%