MMLU-Pro


MMLU-Pro is a more robust and challenging multi-task language understanding benchmark. It extends MMLU by expanding the multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains, and accuracy on it drops by 16-33% compared to the original MMLU.
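One consequence of moving from 4 to 10 options is a lower random-guessing baseline. The sketch below illustrates that arithmetic, along with a plain accuracy computation. It assumes every question has the same number of options and is scored as exact match on the answer letter. The function names `chance_accuracy` and `accuracy` are hypothetical, and this is not the benchmark's official scoring code.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on one question.

    Assumes exactly one of `num_options` choices is correct.
    """
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions whose predicted letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    # 4 options (MMLU-style) versus 10 options (MMLU-Pro-style).
    print(float(chance_accuracy(4)))   # 0.25
    print(float(chance_accuracy(10)))  # 0.1
    # Toy example with made-up predictions, not real benchmark data.
    print(accuracy(["A", "J", "C"], ["A", "B", "C"]))
```

Under these assumptions, random guessing scores 25% with 4 options and 10% with 10 options, so a given score on the 10-option format is less likely to be inflated by lucky guesses.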

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: finance, general, healthcare, language, legal, math, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen3.7 Max (self-reported, llm-stats): 89.6%
  2. Qwen3.6 Plus (self-reported, llm-stats): 88.5%
  3. MiniMax M2.1 (self-reported, llm-stats): 88.0%
  4. Qwen3.5-397B-A17B (self-reported, llm-stats): 87.8%
  5. DeepSeek-V4-Pro-Max (self-reported, llm-stats): 87.5%
  6. Kimi K2.5 (self-reported, llm-stats): 87.1%
  7. ERNIE 5.0 (self-reported, llm-stats): 87.0%
  8. Qwen3.5-122B-A10B (self-reported, llm-stats): 86.7%
  9. DeepSeek-V4-Flash-Max (self-reported, llm-stats): 86.2%
  10. Qwen3.6-27B (self-reported, llm-stats): 86.2%
  11. Qwen3.5-27B (self-reported, llm-stats): 86.1%
  12. Qwen3.5-35B-A3B (self-reported, llm-stats): 85.3%
  13. Gemma 4 31B (self-reported, llm-stats): 85.2%
  14. DeepSeek-R1-0528 (self-reported, llm-stats): 85.0%
  15. DeepSeek-V3.2-Exp (self-reported, llm-stats): 85.0%
  16. DeepSeek-V3.2 (Thinking) (self-reported, llm-stats): 85.0%
  17. DeepSeek-V3.2 (self-reported, llm-stats): 85.0%
  18. MAI-Thinking-1 (self-reported, llm-stats): 85.0%
  19. MiMo-V2-Flash (self-reported, llm-stats): 84.9%
  20. GLM-4.5 (self-reported, llm-stats): 84.6%