MMLU-Redux

math

MMLU-Redux is an improved version of the MMLU benchmark in which questions were manually re-annotated to identify and correct errors in the original dataset. By addressing these data-quality issues, it provides more reliable evaluation metrics for language models than the original MMLU.
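The scores below are percentages, but this page does not state how they are computed. As a rough illustration only, the sketch below assumes plain multiple-choice accuracy and shows how re-annotation can change a score when questions flagged as erroneous are excluded. The `Item` class, the `error_type` field, and its `"ok"` label are hypothetical names chosen for this example, not a confirmed schema or scoring procedure.

```python
from dataclasses import dataclass


@dataclass
class Item:
    answer: str       # gold answer letter, e.g. "B"
    prediction: str   # answer letter chosen by the model
    error_type: str   # hypothetical annotation label; "ok" means no error was found


def accuracy(items, keep_only_ok=True):
    """Return the percentage of correct predictions.

    When keep_only_ok is True, items whose annotation is not "ok"
    are dropped before scoring (an assumption made for this sketch).
    """
    scored = [it for it in items if not keep_only_ok or it.error_type == "ok"]
    if not scored:
        raise ValueError("no items left to score")
    correct = sum(it.prediction == it.answer for it in scored)
    return 100.0 * correct / len(scored)


items = [
    Item(answer="A", prediction="A", error_type="ok"),
    Item(answer="B", prediction="C", error_type="ok"),
    Item(answer="C", prediction="C", error_type="wrong_groundtruth"),
    Item(answer="D", prediction="D", error_type="ok"),
]

filtered = accuracy(items)                        # flagged item excluded
unfiltered = accuracy(items, keep_only_ok=False)  # every item counted
```

In this toy data the flagged question happens to be one the model "got right" against a wrong label, so the filtered score comes out lower than the unfiltered one. That gap is the kind of distortion re-annotation is meant to remove.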

Leaderboard

Showing 20 of 47 results

Model                           Score
Qwen3.7 Max                     95.0%
Qwen3.5-397B-A17B               94.9%
Qwen3.6 Plus                    94.5%
Kimi K2-Thinking-0905           94.4%
Qwen3.5-122B-A10B               94.0%
Qwen3-235B-A22B-Thinking-2507   93.8%
Qwen3 VL 235B A22B Thinking     93.7%
Qwen3.6-27B                     93.5%
DeepSeek-R1-0528                93.4%
Qwen3.5-35B-A3B                 93.3%
Qwen3.6-35B-A3B                 93.3%
Qwen3.5-27B                     93.2%
Qwen3-235B-A22B-Instruct-2507   93.1%
MiMo-V2.5-Pro                   92.8%
Kimi K2 Instruct                92.7%
Kimi K2-Instruct-0905           92.7%
Qwen3-Next-80B-A3B-Thinking     92.5%
Qwen3 VL 235B A22B Instruct     92.2%
Qwen3 VL 32B Thinking           91.9%
DeepSeek-V3.1                   91.8%