MMBench-V1.1

MMBench-V1.1 is version 1.1 of MMBench, an improved bilingual benchmark that assesses the multimodal capabilities of vision-language models through multiple-choice questions in English and Chinese. It provides systematic evaluation across diverse vision-language tasks.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.
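
The metadata gives a maximum score of 1, while the leaderboard shows percentages. As a rough illustration of how a multiple-choice score on that scale could be computed and displayed, here is a minimal sketch. It assumes plain accuracy over answer letters; the official MMBench scoring protocol may differ, and the function names `accuracy` and `as_percent` are hypothetical, not part of any llm-stats or MMBench API.

```python
def accuracy(predictions, answers):
    """Fraction of questions answered correctly, in the range [0, 1].

    Assumption: each prediction and answer is a choice letter such as "A".
    Comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def as_percent(score, max_score=1.0):
    """Format a score on a [0, max_score] scale as a one-decimal percentage."""
    return f"{100 * score / max_score:.1f}%"


if __name__ == "__main__":
    # Made-up answers for illustration only, not benchmark data.
    preds = ["A", "b", "C", "D"]
    gold = ["A", "B", "D", "D"]
    print(as_percent(accuracy(preds, gold)))
```

Under this assumption, a leaderboard entry of 92.8% corresponds to a raw score of 0.928 on the 0 to 1 scale.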

Leaderboard

  All scores are self-reported; source: llm-stats.

  1. Qwen3.5-122B-A10B: 92.8%
  2. Qwen3.5-27B: 92.6%
  3. Qwen3.6-27B: 92.3%
  4. Qwen3.5-35B-A3B: 91.5%
  5. Qwen3 VL 32B Thinking: 90.8%
  6. Qwen3 VL 235B A22B Thinking: 90.6%
  7. Qwen3 VL 235B A22B Instruct: 89.9%
  8. Qwen3 VL 30B A3B Thinking: 88.9%
  9. Qwen3 VL 8B Thinking: 87.5%
  10. Qwen3 VL 30B A3B Instruct: 87.0%
  11. Qwen3 VL 4B Thinking: 86.7%
  12. Qwen3 VL 4B Instruct: 85.1%
  13. Qwen3 VL 8B Instruct: 85.0%
  14. Qwen2.5-Omni-7B: 81.8%
  15. DeepSeek VL2 Small: 79.3%
  16. DeepSeek VL2: 79.2%
  17. DeepSeek VL2 Tiny: 68.3%