BrowseComp-zh

reasoning

A high-difficulty benchmark purpose-built to evaluate LLM agents on the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Each question is reverse-engineered from a short, objective, easily verifiable answer, so answering it requires multi-step reasoning and information reconciliation beyond basic retrieval. The benchmark addresses the linguistic, infrastructural, and censorship-related complexities of the Chinese web.

Leaderboard


Model                        Score
Qwen3.5-397B-A17B            70.3%
Qwen3.5-122B-A10B            69.9%
Qwen3.5-35B-A3B              69.5%
LongCat-Flash-Thinking-2601  69.0%
GLM-4.7                      66.6%
DeepSeek-V3.2 (Thinking)     65.0%
DeepSeek-V3.2                65.0%
Kimi K2-Thinking-0905        62.3%
Qwen3.5-27B                  62.1%
DeepSeek-V3.1                49.2%
MiniMax M2                   48.5%
DeepSeek-V3.2-Exp            47.9%
DeepSeek-R1-0528             35.7%