BrowseComp

reasoning

BrowseComp is a benchmark of 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. It measures an agent's persistence in gathering information, its creativity in navigating the web, and its ability to arrive at concise, verifiable answers. Although the questions are difficult, the benchmark is simple to use, because predicted answers are short and easily checked against reference answers.
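To illustrate how short predicted answers can be scored against reference answers, here is a minimal sketch. It assumes a normalized exact-match comparison, which is an illustrative choice and not necessarily the grader BrowseComp itself uses; the function names `normalize` and `accuracy` and the sample questions are hypothetical.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def accuracy(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Fraction of reference questions answered correctly.

    Assumption: a prediction counts as correct when it equals the
    reference after normalization. Missing predictions count as wrong.
    """
    if not references:
        raise ValueError("references must not be empty")
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(ref)
        for qid, ref in references.items()
    )
    return correct / len(references)


# Hypothetical data, invented purely for illustration.
references = {"q1": "Marie Curie", "q2": "1987", "q3": "Lake Baikal"}
predictions = {"q1": "marie curie.", "q2": "1978", "q3": "Lake  Baikal"}

score = accuracy(predictions, references)
print(f"{score:.1%}")  # prints 66.7%
```

A percentage like those in the leaderboard would then be this fraction computed over the full question set. Exact match is deliberately strict; a real grader may accept paraphrased or reformatted answers that this sketch would reject.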

Leaderboard

Showing 20 of 48 results

Model                    Score
GPT-5.5 Pro              90.1%
Claude Mythos Preview    86.9%
Kimi K2.6                86.3%
Gemini 3.1 Pro           85.9%
GPT-5.5                  84.4%
Claude Opus 4.8          84.3%
Claude Opus 4.6          84.0%
MiniMax M3               83.5%
DeepSeek-V4-Pro-Max      83.4%
GPT-5.4                  82.7%
Claude Opus 4.7          79.3%
GLM-5.1                  79.3%
GPT-5.2 Pro              77.9%
Seed 2.0 Pro             77.3%
MiniMax M2.5             76.3%
GLM-5                    75.9%
Kimi K2.5                74.9%
Claude Sonnet 4.6        74.7%
DeepSeek-V4-Flash-Max    73.2%
Qwen3.5-397B-A17B        69.0%