MuirBench

reasoning

MuirBench is a comprehensive benchmark for evaluating how robustly multimodal LLMs understand multiple images. It consists of 12 diverse multi-image tasks covering 10 categories of multi-image relations (e.g., multiview, temporal, narrative, complementary). It comprises 11,264 images and 2,600 multiple-choice questions created in a pairwise manner: each standard instance is paired with an unanswerable variant for reliable assessment.
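The pairwise design can be illustrated with a minimal scoring sketch. It assumes each question record carries a pair identifier, a gold option, and a model prediction. The `Item` fields and both function names are hypothetical and not taken from the benchmark's own tooling. Plain accuracy is the usual multiple-choice metric. The pair-level score is only an illustration of why pairing helps, since it credits a model only when it answers the standard instance correctly and also recognises the unanswerable variant. It is not presented here as the metric behind the leaderboard numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical record layout)."""
    pair_id: str      # assumed key linking a standard item to its unanswerable variant
    answerable: bool  # False for the unanswerable variant
    gold: str         # correct option letter
    predicted: str    # option letter chosen by the model


def accuracy(items: list[Item]) -> float:
    """Fraction of questions answered correctly."""
    if not items:
        raise ValueError("no items to score")
    return sum(it.predicted == it.gold for it in items) / len(items)


def pair_accuracy(items: list[Item]) -> float:
    """Fraction of pairs in which every member is answered correctly."""
    if not items:
        raise ValueError("no items to score")
    pairs: dict[str, list[bool]] = {}
    for it in items:
        pairs.setdefault(it.pair_id, []).append(it.predicted == it.gold)
    return sum(all(results) for results in pairs.values()) / len(pairs)


if __name__ == "__main__":
    # Toy data: option "D" stands in for a "none of the above" choice.
    demo = [
        Item("p1", True, "B", "B"),
        Item("p1", False, "D", "A"),  # model fails to notice the question is unanswerable
        Item("p2", True, "C", "C"),
        Item("p2", False, "D", "D"),
    ]
    print(f"accuracy:      {accuracy(demo):.2f}")
    print(f"pair accuracy: {pair_accuracy(demo):.2f}")
```

In the toy data the model gets three of four questions right but only one of two pairs, which shows how an unanswerable variant can expose an answer that was correct for the wrong reasons.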

Leaderboard

Qwen3 VL 32B Thinking: 80.3%
Qwen3 VL 235B A22B Thinking: 80.1%
Qwen3 VL 30B A3B Thinking: 77.6%
Qwen3 VL 8B Thinking: 76.8%
Qwen3 VL 4B Thinking: 75.0%
Qwen3 VL 235B A22B Instruct: 72.8%
Qwen3 VL 32B Instruct: 72.8%
Qwen3 VL 8B Instruct: 64.4%
Qwen3 VL 4B Instruct: 63.8%
Qwen3 VL 30B A3B Instruct: 62.9%
Qwen2.5-Omni-7B: 59.2%