MMLU-Redux


MMLU-Redux is a revised version of the MMLU benchmark in which questions were manually re-annotated to identify and correct errors in the original dataset. By addressing these dataset quality issues, it aims to provide more reliable evaluation metrics for language models.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, language, math, reasoning. Language: en. Verified by llm-stats: no.
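The metadata gives a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of that conversion follows, assuming raw scores are stored as fractions of the maximum; the function name `to_percent` and the sample value 0.95 are hypothetical and are not taken from llm-stats.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a one-decimal percentage of max_score.

    Assumption: raw scores lie on a 0..max_score scale (max_score is 1
    per the metadata above). The function name is illustrative only.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{score / max_score * 100:.1f}%"


# Hypothetical raw value, chosen only to illustrate the formatting.
print(to_percent(0.95))
```

Keeping `max_score` as a parameter lets the same helper handle benchmarks whose metadata lists a different maximum.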

Leaderboard

  1. Qwen3.7 Max (self-reported, llm-stats): 95.0%
  2. Qwen3.5-397B-A17B (self-reported, llm-stats): 94.9%
  3. Qwen3.6 Plus (self-reported, llm-stats): 94.5%
  4. Kimi K2-Thinking-0905 (self-reported, llm-stats): 94.4%
  5. Qwen3.5-122B-A10B (self-reported, llm-stats): 94.0%
  6. Qwen3-235B-A22B-Thinking-2507 (self-reported, llm-stats): 93.8%
  7. Qwen3 VL 235B A22B Thinking (self-reported, llm-stats): 93.7%
  8. Qwen3.6-27B (self-reported, llm-stats): 93.5%
  9. DeepSeek-R1-0528 (self-reported, llm-stats): 93.4%
  10. Qwen3.5-35B-A3B (self-reported, llm-stats): 93.3%
  11. Qwen3.5-27B (self-reported, llm-stats): 93.2%
  12. Qwen3-235B-A22B-Instruct-2507 (self-reported, llm-stats): 93.1%
  13. Kimi K2 Instruct (self-reported, llm-stats): 92.7%
  14. Kimi K2-Instruct-0905 (self-reported, llm-stats): 92.7%
  15. Qwen3-Next-80B-A3B-Thinking (self-reported, llm-stats): 92.5%
  16. Qwen3 VL 235B A22B Instruct (self-reported, llm-stats): 92.2%
  17. Qwen3 VL 32B Thinking (self-reported, llm-stats): 91.9%
  18. DeepSeek-V3.1 (self-reported, llm-stats): 91.8%
  19. Qwen3-Next-80B-A3B-Instruct (self-reported, llm-stats): 90.9%
  20. Qwen3 VL 30B A3B Thinking (self-reported, llm-stats): 90.9%