Graphwalks parents >128k

reasoning

A graph reasoning benchmark that evaluates a language model's ability to find the parent nodes of a given node when the graph is presented in a context of more than 128k tokens. It tests long-context reasoning and understanding of graph structure.
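To make the task concrete, here is a minimal sketch of what "finding parent nodes" means. It assumes the graph is directed and given as a list of (source, destination) edges; the benchmark's actual prompt format is not described on this page, and the function name `find_parents` and the sample edges are hypothetical, chosen only for illustration.

```python
def find_parents(edges, target):
    """Return, in sorted order, every node with a directed edge into `target`.

    `edges` is assumed to be an iterable of (source, destination) pairs;
    this representation is an illustrative assumption, not the benchmark's format.
    """
    return sorted({src for src, dst in edges if dst == target})


# Hypothetical toy graph: a -> b, c -> b, b -> d, a -> d
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("a", "d")]

print(find_parents(edges, "b"))  # ['a', 'c']
print(find_parents(edges, "a"))  # [] (no incoming edges)
```

On a toy graph this is a single pass over the edge list. The difficulty the benchmark targets presumably comes from scale: when the graph fills more than 128k tokens, the relevant edges are scattered across a very long context and the model must find all of them without missing any.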

Leaderboard

Model              Score
Claude Opus 4.6    95.4%
Claude Opus 4.8    83.3%
GPT-5.5            58.5%
GPT-5.4            32.4%
GPT-4.1            25.0%
GPT-4.1 mini       11.0%
GPT-4.1 nano        5.6%