MMLU-ProX

MMLU-ProX is an extended version of MMLU-Pro that adds challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. It builds on the Massive Multitask Language Understanding (MMLU) benchmark framework.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: finance, general, healthcare, language, legal, math, reasoning. Language: en. Verified by llm-stats: no.
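
The metadata gives a maximum score of 1, while the leaderboard lists percentages. A minimal sketch of how the two scales could relate, assuming the score is plain accuracy over multiple-choice questions (the source does not state the exact metric). The function names and the sample answers are hypothetical and used only for illustration:

```python
def accuracy(predictions, answers):
    """Return the fraction of questions answered correctly, in [0, 1].

    Assumption: the benchmark score is simple accuracy; the source only
    says the questions are multiple-choice and the max score is 1.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def as_percent(score):
    """Format a score on the 0-to-1 scale as a percentage with one decimal."""
    return f"{score * 100:.1f}%"


# Hypothetical example: four questions, three answered correctly.
predicted = ["A", "C", "B", "D"]
expected = ["A", "C", "D", "D"]
print(as_percent(accuracy(predicted, expected)))  # prints 75.0%
```

Under this reading, a listed value of 87.0% would correspond to a score of 0.87 on the 0-to-1 scale.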

Leaderboard

  1. Qwen3.7 Max: 87.0% (self-reported, llm-stats)
  2. Qwen3.5-397B-A17B: 84.7% (self-reported, llm-stats)
  3. Qwen3.6 Plus: 84.7% (self-reported, llm-stats)
  4. Qwen3.5-122B-A10B: 82.2% (self-reported, llm-stats)
  5. Qwen3.5-27B: 82.2% (self-reported, llm-stats)
  6. Qwen3-235B-A22B-Thinking-2507: 81.0% (self-reported, llm-stats)
  7. Qwen3.5-35B-A3B: 81.0% (self-reported, llm-stats)
  8. Qwen3 VL 235B A22B Thinking: 80.6% (self-reported, llm-stats)
  9. Qwen3-235B-A22B-Instruct-2507: 79.4% (self-reported, llm-stats)
  10. Nemotron 3 Super (120B A12B): 79.4% (self-reported, llm-stats)
  11. Qwen3-Next-80B-A3B-Thinking: 78.7% (self-reported, llm-stats)
  12. Qwen3 VL 235B A22B Instruct: 77.8% (self-reported, llm-stats)
  13. Qwen3 VL 32B Thinking: 77.2% (self-reported, llm-stats)
  14. Qwen3-Next-80B-A3B-Instruct: 76.7% (self-reported, llm-stats)
  15. Qwen3 VL 30B A3B Thinking: 76.1% (self-reported, llm-stats)
  16. Qwen3 VL 32B Instruct: 73.4% (self-reported, llm-stats)
  17. Qwen3 VL 30B A3B Instruct: 70.9% (self-reported, llm-stats)
  18. Qwen3 VL 8B Thinking: 70.7% (self-reported, llm-stats)
  19. Qwen3 VL 8B Instruct: 65.4% (self-reported, llm-stats)
  20. Qwen3 VL 4B Thinking: 65.0% (self-reported, llm-stats)