MMLU

Massive Multitask Language Understanding (MMLU) is a benchmark that tests knowledge across 57 subjects, including STEM, humanities, social sciences, and professional domains.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: finance, general, healthcare, language, legal, math, reasoning. Language: en. Verified by llm-stats: no.
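
The metadata above gives a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of how the two might relate, assuming the score is plain accuracy on multiple-choice questions and the displayed figure is that fraction times 100 (neither is confirmed by the metadata). The names `accuracy` and `as_percent` and the sample answers are hypothetical, for illustration only.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions answered correctly, in [0, 1]."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def as_percent(score: float, max_score: float = 1.0) -> str:
    """Format a score on a 0..max_score scale as a one-decimal percentage.

    Assumption: the leaderboard percentage is score / max_score * 100.
    """
    return f"{100 * score / max_score:.1f}%"


# Hypothetical predictions and gold answers (answer letters A-D).
preds = ["A", "C", "B", "D"]
gold = ["A", "C", "D", "D"]

score = accuracy(preds, gold)  # 3 of 4 correct
print(score, as_percent(score))
```

Under these assumptions, a score of 0.925 would be displayed as 92.5%.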

Leaderboard

  1. GPT-5 self-reported llm-stats
    92.5%
  2. o1 self-reported llm-stats
    91.8%
  3. GPT-4.5 self-reported llm-stats
    90.8%
  4. o1-preview self-reported llm-stats
    90.8%
  5. Qwen3 VL 235B A22B Thinking self-reported llm-stats
    90.6%
  6. Sarvam-105B self-reported llm-stats
    90.6%
  7. Claude 3.5 Sonnet self-reported llm-stats
    90.4%
  8. Claude 3.5 Sonnet self-reported llm-stats
    90.4%
  9. Kimi K2 0905 self-reported llm-stats
    90.2%
  10. GPT-4.1 self-reported llm-stats
    90.2%
  11. GPT OSS 120B self-reported llm-stats
    90.0%
  12. LongCat-Flash-Chat self-reported llm-stats
    89.7%
  13. Kimi K2 Instruct self-reported llm-stats
    89.5%
  14. Kimi K2-Instruct-0905 self-reported llm-stats
    89.5%
  15. Qwen3 VL 235B A22B Instruct self-reported llm-stats
    88.8%
  16. 88.7%
  17. 88.7%
  18. GPT-4o self-reported llm-stats
    88.7%
  19. Qwen3 VL 32B Thinking self-reported llm-stats
    88.7%
  20. DeepSeek-V3 self-reported llm-stats
    88.5%