MuirBench


MuirBench is a comprehensive benchmark for robust multi-image understanding in multimodal LLMs. It consists of 12 diverse multi-image tasks covering 10 categories of multi-image relations (e.g., multiview, temporal, narrative, complementary). It comprises 11,264 images and 2,600 multiple-choice questions created in a pairwise manner: each standard instance is paired with an unanswerable variant for reliable assessment.
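The pairwise design can be illustrated with a small scoring sketch. This is only an illustration under stated assumptions, not the benchmark's official evaluation code: the `Result` record, the `pair_id` field, and the "None of the above" answer string are hypothetical, and the stricter `pairwise_accuracy` score is an extra shown here for intuition rather than a metric the benchmark is known to report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One model answer to one multiple-choice question (hypothetical schema)."""
    pair_id: str      # assumed id shared by a standard instance and its unanswerable variant
    answerable: bool  # False for the unanswerable variant
    predicted: str    # option chosen by the model
    gold: str         # correct option


def accuracy(results):
    """Fraction of questions answered correctly, in [0, 1]."""
    if not results:
        return 0.0
    return sum(r.predicted == r.gold for r in results) / len(results)


def pairwise_accuracy(results):
    """Fraction of complete pairs where BOTH variants are answered correctly.

    Assumption: this stricter score is illustrative only.
    """
    by_pair = defaultdict(list)
    for r in results:
        by_pair[r.pair_id].append(r)
    complete = [rs for rs in by_pair.values() if len(rs) == 2]
    if not complete:
        return 0.0
    both_right = sum(all(r.predicted == r.gold for r in rs) for rs in complete)
    return both_right / len(complete)


# Toy data: two pairs, i.e. four questions in total.
results = [
    Result("q1", True, "B", "B"),
    Result("q1", False, "None of the above", "None of the above"),
    Result("q2", True, "A", "A"),
    Result("q2", False, "C", "None of the above"),  # model fails to abstain
]

overall = accuracy(results)
paired = pairwise_accuracy(results)
print(f"overall accuracy: {overall:.1%}")
print(f"pairwise accuracy: {paired:.1%}")
```

In this toy run the model picks the right option on both answerable questions but fails to recognize one unanswerable variant. Pairing exists to expose exactly this failure, which accuracy on answerable questions alone would hide.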

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen3 VL 32B Thinking: 80.3% (self-reported, llm-stats)
  2. Qwen3 VL 235B A22B Thinking: 80.1% (self-reported, llm-stats)
  3. Qwen3 VL 30B A3B Thinking: 77.6% (self-reported, llm-stats)
  4. Qwen3 VL 8B Thinking: 76.8% (self-reported, llm-stats)
  5. Qwen3 VL 4B Thinking: 75.0% (self-reported, llm-stats)
  6. Qwen3 VL 235B A22B Instruct: 72.8% (self-reported, llm-stats)
  7. Qwen3 VL 32B Instruct: 72.8% (self-reported, llm-stats)
  8. Qwen3 VL 8B Instruct: 64.4% (self-reported, llm-stats)
  9. Qwen3 VL 4B Instruct: 63.8% (self-reported, llm-stats)
  10. Qwen3 VL 30B A3B Instruct: 62.9% (self-reported, llm-stats)
  11. Qwen2.5-Omni-7B: 59.2% (self-reported, llm-stats)