MMMU-Pro

reasoning

MMMU-Pro is a more robust version of the MMMU multi-discipline multimodal understanding benchmark. It strengthens MMMU through a three-step process: filtering out questions that can be answered from text alone, augmenting the candidate answer options, and introducing a vision-only input setting. Models score significantly lower on MMMU-Pro than on the original MMMU, with drops of 16.8% to 26.9% across models, which makes it a more rigorous evaluation that more closely mimics real-world scenarios.
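The three construction steps can be illustrated with a minimal Python sketch. This is not the benchmark authors' code: the `Question` type, the function names, the `max_correct` threshold, the `target` option count, and the shape of the vision-only payload are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical question record; the field names are assumptions for illustration.
@dataclass
class Question:
    text: str
    options: List[str]
    answer: str
    image_id: str

# A text-only solver sees the question text and options, but never the image.
TextOnlySolver = Callable[[str, List[str]], str]

def filter_text_only_answerable(
    questions: List[Question],
    solvers: List[TextOnlySolver],
    max_correct: int = 0,  # assumed threshold, not taken from the source
) -> List[Question]:
    """Step 1: drop questions that text-only solvers can answer without the image."""
    kept = []
    for q in questions:
        correct = sum(1 for solve in solvers if solve(q.text, q.options) == q.answer)
        if correct <= max_correct:
            kept.append(q)
    return kept

def augment_options(q: Question, distractors: List[str], target: int = 10) -> Question:
    """Step 2: add distractors until the question has up to `target` options."""
    extra = [d for d in distractors if d not in q.options]
    needed = max(0, target - len(q.options))
    return Question(q.text, q.options + extra[:needed], q.answer, q.image_id)

def to_vision_only(q: Question) -> Dict[str, str]:
    """Step 3: build an input with no question text.

    Assumes the question and options have been rendered into the image
    referenced by `image_id`, so the model must read them visually.
    """
    return {"image_id": q.image_id, "prompt": ""}
```

A usage example with a trivial solver that always picks the first option: given two questions whose answers are the first and second option respectively, `filter_text_only_answerable` keeps only the second, since the first can be "solved" without looking at the image.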

Leaderboard

Showing 20 of 56 results

Model                    Score
Gemini 3.5 Flash         83.6%
GPT-5.5                  83.2%
Gemini 3 Flash           81.2%
GPT-5.4                  81.2%
Gemini 3 Pro             81.0%
Gemini 3.1 Pro           80.5%
Muse Spark               80.4%
Kimi K2.6                80.1%
GPT-5.2                  79.5%
Qwen3.6 Plus             78.8%
Kimi K2.5                78.5%
GPT-5                    78.4%
MiniMax M3               78.1%
MiMo-V2.5                77.9%
Claude Opus 4.6          77.3%
Gemma 4 31B              76.9%
Qwen3.5-122B-A10B        76.9%
Gemini 3.1 Flash-Lite    76.8%
GPT-5.4 mini             76.6%
o3                       76.4%