Graphwalks parents >128k

reasoning

A graph reasoning benchmark that evaluates a language model's ability to find the parent nodes of a given node in a graph, where the prompt exceeds 128k tokens. It tests long-context reasoning and understanding of graph structure.
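As a rough illustration of the underlying task, the sketch below finds the parents of a node in a small directed graph. It assumes the graph is given as a list of (source, destination) edge pairs and that a "parent" of a node is any node with an edge pointing into it. The function name `find_parents` and the toy edge list are hypothetical, and the benchmark's actual prompt format and scoring are not described here.

```python
from collections import defaultdict


def find_parents(edges, target):
    """Return the sorted list of nodes that have a directed edge into `target`.

    Assumption: `edges` is an iterable of (source, destination) pairs.
    """
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
    # .get() avoids creating an empty entry for nodes with no incoming edges.
    return sorted(parents.get(target, set()))


# Toy graph (hypothetical): a -> b, c -> b, b -> d, a -> d
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("a", "d")]
print(find_parents(edges, "b"))  # ['a', 'c']
```

On a graph this small the lookup is trivial. The difficulty the benchmark targets comes from doing the same lookup when the edge list fills more than 128k tokens of context.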

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: long_context, reasoning, spatial_reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Claude Opus 4.6: 95.4% (self-reported, llm-stats)
  2. Claude Opus 4.8: 83.3% (self-reported, llm-stats)
  3. GPT-5.5: 58.5% (self-reported, llm-stats)
  4. GPT-5.4: 32.4% (self-reported, llm-stats)
  5. GPT-4.1: 25.0% (self-reported, llm-stats)
  6. GPT-4.1 mini: 11.0% (self-reported, llm-stats)
  7. GPT-4.1 nano: 5.6% (self-reported, llm-stats)