Graphwalks BFS <128k

reasoning

A graph-reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) on graphs supplied in the prompt, with context lengths under 128k tokens. The model must return the nodes reachable at specified depths.
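To make the task concrete, here is a minimal Python sketch of the BFS operation the description refers to. It is an illustration only, not the benchmark's reference implementation: the edge-list input format, the treatment of edges as directed, the reading of "at a specified depth" as "exactly that many hops from the start node", and the function names `bfs_depths` and `nodes_at_depth` are all assumptions made for this example.

```python
from collections import deque


def bfs_depths(edges, start):
    """Map every node reachable from `start` to its BFS depth (hop count).

    Assumption: `edges` is an iterable of directed (source, target) pairs.
    """
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            # The first visit to a node is along a shortest path in BFS.
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths


def nodes_at_depth(edges, start, depth):
    """Return the set of nodes exactly `depth` hops from `start`."""
    return {n for n, d in bfs_depths(edges, start).items() if d == depth}


# Hypothetical toy graph, not taken from the benchmark.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e"), ("e", "a")]
print(nodes_at_depth(edges, "a", 1))
print(nodes_at_depth(edges, "a", 2))
```

On this toy graph, `b` and `c` sit at depth 1 and `d` at depth 2. The difficulty for a language model is presumably not the algorithm itself but carrying out the same bookkeeping over an edge list that fills a long context window.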

Leaderboard

Model           Score
GPT-5.2         94.0%
GPT-5.4         93.0%
GPT-5           78.3%
GPT-5.4 mini    76.3%
GPT-5.4 nano    73.4%
GPT-4.5         72.3%
GPT-4.1         61.7%
GPT-4.1 mini    61.7%
o3-mini         51.0%
GPT-4o          41.7%
GPT-4.1 nano    25.0%