HallusionBench

A comprehensive benchmark designed to evaluate image-context reasoning in large visual-language models (LVLMs). It challenges models with 346 images and 1,129 carefully crafted questions to assess two failure modes: language hallucination and visual illusion.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: reasoning, vision. Language: en. Verified by llm-stats: no.
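
The metadata above lists a max score of 1, while the leaderboard reports percentages. A minimal sketch of how the two might relate, assuming a displayed percentage is simply the raw score divided by the max score and multiplied by 100. The `BenchmarkMetadata` class and `to_percent` helper are hypothetical names used for illustration; they are not part of any llm-stats API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BenchmarkMetadata:
    """Hypothetical container mirroring the metadata fields listed above."""
    name: str
    modality: str
    max_score: float
    categories: Tuple[str, ...]
    language: str
    verified: bool


# Field values copied from the metadata line; the structure itself is an assumption.
HALLUSION_BENCH = BenchmarkMetadata(
    name="HallusionBench",
    modality="multimodal",
    max_score=1.0,
    categories=("reasoning", "vision"),
    language="en",
    verified=False,
)


def to_percent(raw_score: float, meta: BenchmarkMetadata) -> float:
    """Convert a raw score on the [0, max_score] scale to a percentage.

    Assumes the displayed percentage is raw_score / max_score * 100,
    rounded to one decimal place.
    """
    if not 0.0 <= raw_score <= meta.max_score:
        raise ValueError(
            f"score {raw_score} is outside the range [0, {meta.max_score}]"
        )
    return round(100.0 * raw_score / meta.max_score, 1)


if __name__ == "__main__":
    # Under this assumption, a raw score of 0.70 is displayed as 70.0.
    print(to_percent(0.70, HALLUSION_BENCH))
```

Under this reading, a raw score of 0.70 on the 0 to 1 scale corresponds to a leaderboard entry of 70.0%.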

Leaderboard

  1. Qwen3.5-27B (self-reported, llm-stats): 70.0%
  2. Qwen3.5-35B-A3B (self-reported, llm-stats): 67.9%
  3. Qwen3.5-122B-A10B (self-reported, llm-stats): 67.6%
  4. Qwen3 VL 32B Thinking (self-reported, llm-stats): 67.4%
  5. Qwen3 VL 235B A22B Thinking (self-reported, llm-stats): 66.7%
  6. Qwen3 VL 30B A3B Thinking (self-reported, llm-stats): 66.0%
  7. Qwen3 VL 8B Thinking (self-reported, llm-stats): 65.4%
  8. Qwen3 VL 4B Thinking (self-reported, llm-stats): 64.1%
  9. Qwen3 VL 32B Instruct (self-reported, llm-stats): 63.8%
  10. Qwen3 VL 235B A22B Instruct (self-reported, llm-stats): 63.2%
  11. Qwen3 VL 30B A3B Instruct (self-reported, llm-stats): 61.5%
  12. Qwen3 VL 8B Instruct (self-reported, llm-stats): 61.1%
  13. Qwen3 VL 4B Instruct (self-reported, llm-stats): 57.6%
  14. Qwen2.5 VL 72B Instruct (self-reported, llm-stats): 55.2%
  15. Qwen2.5 VL 7B Instruct (self-reported, llm-stats): 52.9%