BrowseComp Long Context 128k

reasoning

A challenging benchmark for evaluating the ability of web-browsing agents to persistently navigate the internet and find hard-to-locate, entangled information. It comprises 1,266 questions that require strategic reasoning, creative search, and interpretation of retrieved content, each with a short, easily verifiable answer.

Leaderboard

Model               Score
GPT-5.2             92.0%
GPT-5               90.0%
GPT-5.1             90.0%
GPT-5.1 Instant     90.0%
GPT-5.1 Thinking    90.0%