BLINK


BLINK ("BLINK: Multimodal Large Language Models Can See but Not Perceive") is a benchmark for multimodal language models that focuses on core visual perception abilities. It reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, each paired with one or more images and visual prompting. The tasks, which include relative depth estimation, visual correspondence, forensics detection, multi-view reasoning, counting, object localization, and spatial reasoning, are ones that humans can solve 'within a blink'.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: 3d, multimodal, reasoning, spatial_reasoning, vision. Language: en. Verified by llm-stats: no.
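Because the benchmark is multiple-choice with a maximum score of 1, a score can be read as plain accuracy, which the leaderboard shows as a percentage. The following is a minimal sketch of that arithmetic under stated assumptions: the `Item` record, its field names, and the case-insensitive label matching are hypothetical illustrations, not the official BLINK or llm-stats scoring code.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One scored question (hypothetical record layout, for illustration only)."""
    question_id: str
    answer: str      # gold choice label, e.g. "A"
    prediction: str  # choice label the model selected


def accuracy(items):
    """Return the fraction of items answered correctly, on a 0-1 scale."""
    if not items:
        raise ValueError("no items to score")
    # Assumption: labels are compared after trimming whitespace and ignoring case.
    correct = sum(
        1 for it in items
        if it.prediction.strip().upper() == it.answer.strip().upper()
    )
    return correct / len(items)


def as_percent(score):
    """Format a 0-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


# Made-up example data: three of four predictions match the gold label.
items = [
    Item("q1", "A", "A"),
    Item("q2", "B", "b"),
    Item("q3", "C", "D"),
    Item("q4", "A", " A "),
]
score = accuracy(items)
print(score, as_percent(score))
```

Under this reading, a leaderboard entry of 70.7% would correspond to a score of about 0.707 on the 0-1 scale.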

Leaderboard

  1. Qwen3 VL 235B A22B Instruct: 70.7% (self-reported, llm-stats)
  2. Qwen3 VL 8B Instruct: 69.1% (self-reported, llm-stats)
  3. Qwen3 VL 8B Thinking: 68.7% (self-reported, llm-stats)
  4. Qwen3 VL 32B Thinking: 68.5% (self-reported, llm-stats)
  5. Qwen3 VL 30B A3B Instruct: 67.7% (self-reported, llm-stats)
  6. Qwen3 VL 32B Instruct: 67.3% (self-reported, llm-stats)
  7. Qwen3 VL 235B A22B Thinking: 67.1% (self-reported, llm-stats)
  8. Qwen3 VL 4B Instruct: 65.8% (self-reported, llm-stats)
  9. Qwen3 VL 30B A3B Thinking: 65.4% (self-reported, llm-stats)
  10. Qwen3 VL 4B Thinking: 63.4% (self-reported, llm-stats)
  11. Phi-4-multimodal-instruct: 61.3% (self-reported, llm-stats)