BLINK

reasoning

BLINK: Multimodal Large Language Models Can See but Not Perceive. BLINK is a benchmark for multimodal language models that focuses on core visual perception abilities. It reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, each paired with one or more images and visual prompts. The tasks, which humans can solve 'within a blink', include relative depth estimation, visual correspondence, forensics detection, multi-view reasoning, counting, object localization, and spatial reasoning.
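Because the benchmark is multiple-choice, a score on it can be read as the share of questions answered with the correct option. The sketch below shows that calculation under stated assumptions: the function name `multiple_choice_accuracy`, the letter-based answer format, and the toy data are all hypothetical, and the source does not confirm that the leaderboard percentages are computed exactly this way.

```python
def multiple_choice_accuracy(predictions, answers):
    """Return the percentage of questions answered correctly.

    Assumes each prediction and each reference answer is an option
    letter such as "A" or "B" (a hypothetical format for illustration).
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Compare option letters case-insensitively, ignoring stray whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data, not taken from BLINK.
predictions = ["A", "b", "C", "D"]
answers = ["A", "B", "D", "D"]
score = multiple_choice_accuracy(predictions, answers)
print(f"{score:.1f}%")
```

A real evaluation would also need a rule for responses that contain no recognizable option letter; this sketch simply counts them as wrong.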

Leaderboard

Model                          Score
Qwen3 VL 235B A22B Instruct    70.7%
Qwen3 VL 8B Instruct           69.1%
Qwen3 VL 8B Thinking           68.7%
Qwen3 VL 32B Thinking          68.5%
Qwen3 VL 30B A3B Instruct      67.7%
Qwen3 VL 32B Instruct          67.3%
Qwen3 VL 235B A22B Thinking    67.1%
Qwen3 VL 4B Instruct           65.8%
Qwen3 VL 30B A3B Thinking      65.4%
Qwen3 VL 4B Thinking           63.4%
Phi-4-multimodal-instruct      61.3%