Graphwalks BFS <128k

reasoning

A graph reasoning benchmark that evaluates whether a language model can perform breadth-first search (BFS) on a graph given in its context and return the nodes reachable at a specified depth. This variant covers inputs with a context length under 128k tokens.
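As a rough illustration of the operation being tested, here is a minimal Python sketch of a depth-limited BFS. It is not the benchmark's reference implementation. The function name `nodes_at_depth`, the directed edge-list input format, and the reading of "reachable at a specified depth" as "shortest distance exactly equal to that depth" are all assumptions made for this example.

```python
from collections import deque


def nodes_at_depth(edges, start, depth):
    """Return the nodes whose shortest distance from `start` is exactly `depth`.

    Assumptions (not taken from the benchmark itself): `edges` is a list of
    directed (source, target) pairs, and "at depth d" means exactly d hops.
    """
    # Build an adjacency list from the edge pairs.
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    # Standard BFS, recording the first (shortest) distance to each node.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # do not expand past the requested depth
        for nxt in adjacency.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)

    return {n for n, d in dist.items() if d == depth}


# Hypothetical toy graph for illustration only.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(nodes_at_depth(edges, "a", 2)))  # ['d']
```

If the task is instead read as "all nodes within the given depth", the final filter would become `d <= depth` (optionally excluding the start node).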

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning, spatial_reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GPT-5.2 (self-reported, llm-stats): 94.0%
  2. GPT-5.4 (self-reported, llm-stats): 93.0%
  3. GPT-5 (self-reported, llm-stats): 78.3%
  4. GPT-5.4 mini (self-reported, llm-stats): 76.3%
  5. GPT-5.4 nano (self-reported, llm-stats): 73.4%
  6. GPT-4.5 (self-reported, llm-stats): 72.3%
  7. GPT-4.1 (self-reported, llm-stats): 61.7%
  8. GPT-4.1 mini (self-reported, llm-stats): 61.7%
  9. o3-mini (self-reported, llm-stats): 51.0%
  10. GPT-4o (self-reported, llm-stats): 41.7%
  11. GPT-4.1 nano (self-reported, llm-stats): 25.0%