Graphwalks BFS >128k

reasoning

A graph reasoning benchmark that evaluates a language model's ability to perform breadth-first search (BFS) over graphs presented in prompts longer than 128k tokens, testing long-context reasoning.
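For readers unfamiliar with the operation being tested, here is a minimal sketch of BFS over an edge list. It assumes a directed graph and a task of the form "return the nodes at exactly a given depth from a start node". The function name `bfs_frontier`, the edge format, and the toy graph are illustrative assumptions, not the benchmark's actual prompt format or grader.

```python
from collections import defaultdict


def bfs_frontier(edges, start, depth):
    """Return the set of nodes whose shortest distance from `start` is exactly `depth`.

    `edges` is an iterable of (source, target) pairs. Treating them as
    directed is an assumption made for this sketch.
    """
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    visited = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            # .get avoids creating empty entries for nodes with no outgoing edges
            for neighbor in adjacency.get(node, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return frontier


# Hypothetical toy graph, for illustration only.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(bfs_frontier(edges, "a", 1)))
print(sorted(bfs_frontier(edges, "a", 2)))
```

A program does this in a few lines. The difficulty for a language model is that the edge list is spread across more than 128k tokens of context, so every expansion step requires locating the relevant edges without losing track of which nodes have already been visited.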

Leaderboard

Showing 8 of 8 results

Model                    Score
Claude Mythos Preview    80.0%
Claude Opus 4.8          68.1%
Claude Opus 4.6          61.5%
GPT-5.5                  45.4%
GPT-5.4                  21.4%
GPT-4.1                  19.0%
GPT-4.1 mini             15.0%
GPT-4.1 nano             2.9%