MMBench

reasoning

A bilingual benchmark that assesses the multi-modal capabilities of vision-language models using multiple-choice questions in English and Chinese. It evaluates models systematically across diverse vision-language tasks.
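
As a rough illustration of how a percentage score on a multiple-choice benchmark can be derived, here is a minimal sketch. It assumes plain accuracy over predicted option letters, which this page does not confirm; the benchmark's actual scoring protocol may differ. The function name and the toy data are hypothetical.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predicted option letters that match the answer key.

    Assumption: plain accuracy; not necessarily the benchmark's official metric.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four questions, three answered correctly.
score = multiple_choice_accuracy(["A", "c", "B", "D"], ["A", "C", "B", "A"])
print(f"{score:.1f}%")  # prints "75.0%"
```

The same function could be applied separately to the English and Chinese question sets to report one score per language.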

Leaderboard


Model                        Score
Step3-VL-10B                 91.8%
Qwen2.5 VL 72B Instruct      88.0%
Phi-4-multimodal-instruct    86.7%
Qwen2-VL-72B-Instruct        86.5%
Qwen2.5 VL 7B Instruct       84.3%
Phi-3.5-vision-instruct      81.9%
DeepSeek VL2 Small           80.3%
DeepSeek VL2                 79.6%
DeepSeek VL2 Tiny            69.2%