ARC-AGI v2


ARC-AGI-2 is an upgraded version of the ARC-AGI benchmark for measuring abstract reasoning and problem-solving ability in AI systems through visual grid transformation tasks. It evaluates fluid intelligence using input-output grid pairs, ranging in size from 1x1 to 30x30, whose cells each hold a color encoded as an integer from 0 to 9. A model must infer the underlying transformation rule from the demonstration pairs and apply it to the test inputs. The tasks are designed to be easy for humans but challenging for AI, and they focus on core cognitive abilities such as spatial reasoning, pattern recognition, and compositional generalization.
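The task structure described above can be illustrated with a minimal sketch. It assumes a task is a dictionary with "train" and "test" lists of input/output grid pairs, and it uses a toy task made up for illustration rather than one taken from the benchmark. The helper names (is_valid_grid, flip_horizontal, rule_fits_demos, solve) are hypothetical, and the brute-force search over candidate rules only illustrates the "infer from demonstrations, apply to test" loop. It is not how any listed model works.

```python
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]
Rule = Callable[[Grid], Grid]


def is_valid_grid(grid: Grid) -> bool:
    """Check the stated constraints: rectangular, 1x1 to 30x30, cell values 0-9."""
    if not 1 <= len(grid) <= 30:
        return False
    width = len(grid[0])
    if not 1 <= width <= 30:
        return False
    return all(
        len(row) == width and all(isinstance(c, int) and 0 <= c <= 9 for c in row)
        for row in grid
    )


def flip_horizontal(grid: Grid) -> Grid:
    """Hypothetical candidate rule: mirror every row left to right."""
    return [list(reversed(row)) for row in grid]


def identity(grid: Grid) -> Grid:
    """Hypothetical candidate rule: return a copy of the grid unchanged."""
    return [list(row) for row in grid]


def rule_fits_demos(rule: Rule, demos: List[Dict[str, Grid]]) -> bool:
    """A rule is kept only if it reproduces every demonstration output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in demos)


def solve(task: Dict[str, List[Dict[str, Grid]]], rules: List[Rule]) -> Optional[List[Grid]]:
    """Apply the first rule consistent with all demonstrations to each test input."""
    for rule in rules:
        if rule_fits_demos(rule, task["train"]):
            return [rule(pair["input"]) for pair in task["test"]]
    return None  # no candidate rule explains the demonstrations


# Toy task (made up for illustration): the hidden rule is a horizontal flip.
task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[3, 4]], "output": [[4, 3]]},
    ],
    "test": [{"input": [[5, 0], [0, 6]], "output": [[0, 5], [6, 0]]}],
}

predictions = solve(task, [identity, flip_horizontal])
```

In this sketch a prediction counts as correct only if it equals the expected grid cell for cell, so a single wrong cell fails the test case.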

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1 (the leaderboard shows scores as percentages). Categories: reasoning, spatial_reasoning, vision. Language: en. Verified by llm-stats: no.

Leaderboard

All scores are self-reported, as listed by llm-stats.

  1. GPT-5.5: 85.0%
  2. Gemini 3.1 Pro: 77.1%
  3. GPT-5.4: 73.3%
  4. Gemini 3.5 Flash: 72.1%
  5. Claude Opus 4.6: 68.8%
  6. Claude Sonnet 4.6: 58.3%
  7. GPT-5.2 Pro: 54.2%
  8. GPT-5.2: 52.9%
  9. Muse Spark: 42.5%
  10. Claude Opus 4.5: 37.6%
  11. Gemini 3 Flash: 33.6%
  12. Gemini 3 Pro: 31.1%
  13. Grok-4: 15.9%
  14. Claude Opus 4: 8.6%
  15. o3: 6.5%
  16. Gemini 2.5 Pro: 4.9%