Graphwalks parents <128k

reasoning

A graph-reasoning benchmark that evaluates a language model's ability to find the parent nodes of a node in a graph, with prompts kept under 128k tokens of context. Answering correctly requires understanding the graph's structure and its edge relationships.
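
To make the task concrete, here is a minimal sketch of what "finding parent nodes" means, assuming the graph is given as a list of directed (source, destination) edges. The edge-list format, the node labels, and the name `find_parents` are illustrative assumptions; the benchmark's actual prompt and answer format are not described on this page.

```python
from collections import defaultdict


def find_parents(edges, target):
    """Return the sorted nodes that have a directed edge into `target`.

    `edges` is an iterable of (source, destination) pairs. This input
    format is an assumption made for illustration only.
    """
    parents = defaultdict(set)
    for src, dst in edges:
        # Record src as a parent of dst; a set ignores duplicate edges.
        parents[dst].add(src)
    # .get avoids creating an entry for nodes with no incoming edges.
    return sorted(parents.get(target, set()))


# Hypothetical toy graph: a -> b, c -> b, b -> d, a -> d
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("a", "d")]
print(find_parents(edges, "b"))  # ['a', 'c']
```

A program solves this with a single pass over the edges. The benchmark instead asks a model to do the same lookup over a graph written out in a long context, which is presumably where the difficulty comes from.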

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (the leaderboard reports scores as percentages). Categories: reasoning, spatial_reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GPT-5.4 (self-reported, llm-stats): 89.8%
  2. GPT-5.2 (self-reported, llm-stats): 89.0%
  3. GPT-5 (self-reported, llm-stats): 73.3%
  4. GPT-4.5 (self-reported, llm-stats): 72.6%
  5. GPT-5.4 mini (self-reported, llm-stats): 71.5%
  6. GPT-4.1 mini (self-reported, llm-stats): 60.5%
  7. o3-mini (self-reported, llm-stats): 58.3%
  8. GPT-4.1 (self-reported, llm-stats): 58.0%
  9. GPT-5.4 nano (self-reported, llm-stats): 50.8%
  10. GPT-4o (self-reported, llm-stats): 35.4%
  11. GPT-4.1 nano (self-reported, llm-stats): 9.4%