AI2D


AI2D is a dataset of 4,903 illustrative diagrams from grade-school natural science (such as food webs, human physiology, and life cycles) with over 15,000 multiple-choice questions and answers. The benchmark evaluates diagram understanding and visual reasoning: to answer questions about the scientific concepts a diagram depicts, a model must interpret the diagram's elements, their relationships, and its overall structure.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1 (the leaderboard shows scores as percentages). Categories: multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.
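The metadata gives a maximum score of 1 while the leaderboard lists percentages. A minimal sketch of how the two could relate follows, assuming the score is plain accuracy over multiple-choice items. The source does not state the scoring rule, so that is an assumption here. The names `Item`, `accuracy`, and `as_percent` are hypothetical, and the sample items are invented for illustration rather than taken from AI2D.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One hypothetical multiple-choice item about a diagram."""
    question: str
    choices: list[str]
    answer_index: int  # index of the correct choice in `choices`


def accuracy(items: list[Item], predictions: list[int]) -> float:
    """Return the fraction of correctly answered items, between 0 and 1.

    Assumption: the benchmark score is simple accuracy; the source
    does not state the scoring rule.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot score an empty set of items")
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


def as_percent(score: float) -> str:
    """Format a 0-to-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


# Invented items for illustration only (not AI2D data).
sample_items = [
    Item("Which organism is a producer?", ["grass", "hawk", "snake"], 0),
    Item("What stage follows the larva?", ["egg", "pupa", "adult"], 1),
    Item("Which organ pumps blood?", ["lung", "liver", "heart"], 2),
    Item("What eats the grasshopper?", ["grass", "frog", "sun"], 1),
]
sample_predictions = [0, 1, 2, 0]  # the last prediction is wrong

score = accuracy(sample_items, sample_predictions)
print(score, as_percent(score))
```

Under this assumption, a leaderboard entry of 94.7% would correspond to a score of 0.947 on the 0-to-1 scale.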

Leaderboard

  1. Claude 3.5 Sonnet (self-reported, llm-stats): 94.7%
  2. Qwen3.6 Plus (self-reported, llm-stats): 94.4%
  3. GPT-4o (self-reported, llm-stats): 94.2%
  4. Pixtral Large (self-reported, llm-stats): 93.8%
  5. Qwen3.5-122B-A10B (self-reported, llm-stats): 93.3%
  6. Mistral Small 3.2 24B Instruct (self-reported, llm-stats): 92.9%
  7. Qwen3.5-27B (self-reported, llm-stats): 92.9%
  8. Qwen3.5-35B-A3B (self-reported, llm-stats): 92.6%
  9. Llama 3.2 90B Instruct (self-reported, llm-stats): 92.3%
  10. Llama 3.2 11B Instruct (self-reported, llm-stats): 91.1%
  11. Qwen3 VL 235B A22B Instruct (self-reported, llm-stats): 89.7%
  12. Qwen3 VL 32B Instruct (self-reported, llm-stats): 89.5%
  13. Qwen3 VL 235B A22B Thinking (self-reported, llm-stats): 89.2%
  14. Qwen3 VL 32B Thinking (self-reported, llm-stats): 88.9%
  15. Qwen2.5 VL 72B Instruct (self-reported, llm-stats): 88.4%
  16. Grok-1.5V (self-reported, llm-stats): 88.3%
  17. Qwen3 VL 30B A3B Thinking (self-reported, llm-stats): 86.9%
  18. Qwen3 VL 8B Instruct (self-reported, llm-stats): 85.7%
  19. Qwen3 VL 30B A3B Instruct (self-reported, llm-stats): 85.0%
  20. Qwen3 VL 4B Thinking (self-reported, llm-stats): 84.9%