MMLU-Pro

MMLU-Pro is a more robust and challenging multi-task language understanding benchmark. It extends MMLU by expanding the multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains, and models score 16% to 33% lower on it than on the original MMLU.
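One effect of moving from 4 to 10 options is a lower random-guess floor, which matters when reading the scores listed here. The sketch below is illustrative only: the function names `chance_accuracy` and `accuracy` are hypothetical and not part of any MMLU-Pro tooling, and it assumes every question has exactly one correct option and is scored by plain accuracy.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing when exactly one
    of `num_options` choices is correct (an assumption of this sketch)."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    print(float(chance_accuracy(4)))   # 0.25 (MMLU-style, 4 options)
    print(float(chance_accuracy(10)))  # 0.1  (MMLU-Pro-style, 10 options)
```

Under these assumptions, guessing drops from an expected 25% to 10%, so a given score on the 10-option format reflects less luck than the same score on the 4-option format.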

Leaderboard

Showing 20 of 127 results

Model                     Score
Qwen3.7 Max               89.6%
Qwen3.6 Plus              88.5%
MiniMax M2.1              88.0%
Qwen3.5-397B-A17B         87.8%
DeepSeek-V4-Pro-Max       87.5%
Kimi K2.5                 87.1%
ERNIE 5.0                 87.0%
Qwen3.5-122B-A10B         86.7%
DeepSeek-V4-Flash-Max     86.2%
Qwen3.6-27B               86.2%
Qwen3.5-27B               86.1%
Qwen3.5-35B-A3B           85.3%
Gemma 4 31B               85.2%
Qwen3.6-35B-A3B           85.2%
DeepSeek-R1-0528          85.0%
DeepSeek-V3.2-Exp         85.0%
DeepSeek-V3.2 (Thinking)  85.0%
DeepSeek-V3.2             85.0%
MAI-Thinking-1            85.0%
MiMo-V2-Flash             84.9%