BrowseComp-zh
reasoning official site →
A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning, search. Language: zh. Verified by llm-stats: no.