Graphwalks BFS >128k

reasoning

A graph-reasoning benchmark that evaluates a language model's ability to perform breadth-first search (BFS) on graphs presented in contexts longer than 128k tokens, testing long-context reasoning.
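For readers unfamiliar with the operation being tested, the following is a minimal sketch of a depth-limited BFS. The benchmark's actual prompt and answer format are not described here, so the input shape (a list of directed `(source, target)` edge pairs), the function name `bfs_frontier`, and the choice to return only the nodes at exactly the requested depth are all assumptions made for illustration.

```python
from collections import deque


def bfs_frontier(edges, start, depth):
    """Return the set of nodes whose shortest distance from `start` is exactly `depth`.

    Hypothetical helper: the edge-list input and the "exact depth" output are
    illustrative assumptions, not the benchmark's documented format.
    """
    # Build a directed adjacency list from (source, target) pairs.
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    visited = {start}
    frontier = deque([start])
    # Expand one BFS level per iteration.
    for _ in range(depth):
        next_frontier = deque()
        while frontier:
            node = frontier.popleft()
            for neighbor in adjacency.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return set(frontier)


# Small toy graph; real benchmark graphs fill more than 128k tokens of context.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(bfs_frontier(edges, "a", 2)))  # prints ['d']
```

The algorithm itself is simple. What makes the task hard for a language model is presumably that it has to track the visited set and the frontier over an edge list spread across a very long context.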

Methodology

Imported from the public llm-stats benchmark metadata. Modality: text. Max score: 1 (leaderboard scores are shown as percentages of this maximum). Categories: long_context, reasoning, spatial_reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Claude Mythos Preview: 80.0% (self-reported, via llm-stats)
  2. Claude Opus 4.8: 68.1% (self-reported, via llm-stats)
  3. Claude Opus 4.6: 61.5% (self-reported, via llm-stats)
  4. GPT-5.5: 45.4% (self-reported, via llm-stats)
  5. GPT-5.4: 21.4% (self-reported, via llm-stats)
  6. GPT-4.1: 19.0% (self-reported, via llm-stats)
  7. GPT-4.1 mini: 15.0% (self-reported, via llm-stats)
  8. GPT-4.1 nano: 2.9% (self-reported, via llm-stats)