BrowseComp-zh

A high-difficulty benchmark built to evaluate LLM agents on the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, easily verifiable answers, so solving them requires multi-step reasoning and reconciling information across sources rather than basic retrieval. The benchmark addresses the linguistic, infrastructural, and censorship-related complexities of Chinese web environments.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning, search. Language: zh. Verified by llm-stats: no.
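
The metadata gives a maximum score of 1, while the leaderboard shows percentages. The sketch below illustrates how the two could relate, assuming (this is not stated in the source) that the score is plain accuracy, i.e. the fraction of the 289 questions answered correctly. The function names are hypothetical and chosen only for illustration.

```python
TOTAL_QUESTIONS = 289  # question count stated in the benchmark description


def to_percent(score: float) -> float:
    """Convert a normalized score in [0, 1] to a percentage with one decimal."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return round(score * 100, 1)


def approx_correct(percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Estimate how many questions a percentage corresponds to.

    Assumes the score is simple accuracy over `total` questions.
    """
    return round(percent / 100 * total)


if __name__ == "__main__":
    print(to_percent(0.703))     # normalized score -> percentage
    print(approx_correct(70.3))  # percentage -> approximate question count
```

Under that assumption, a reported 70.3% corresponds to roughly 203 of the 289 questions, and 35.7% to roughly 103.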

Leaderboard

  1. Qwen3.5-397B-A17B (self-reported, llm-stats): 70.3%
  2. Qwen3.5-122B-A10B (self-reported, llm-stats): 69.9%
  3. Qwen3.5-35B-A3B (self-reported, llm-stats): 69.5%
  4. LongCat-Flash-Thinking-2601 (self-reported, llm-stats): 69.0%
  5. GLM-4.7 (self-reported, llm-stats): 66.6%
  6. DeepSeek-V3.2 (Thinking) (self-reported, llm-stats): 65.0%
  7. DeepSeek-V3.2 (self-reported, llm-stats): 65.0%
  8. Kimi K2-Thinking-0905 (self-reported, llm-stats): 62.3%
  9. Qwen3.5-27B (self-reported, llm-stats): 62.1%
  10. DeepSeek-V3.1 (self-reported, llm-stats): 49.2%
  11. MiniMax M2 (self-reported, llm-stats): 48.5%
  12. DeepSeek-V3.2-Exp (self-reported, llm-stats): 47.9%
  13. DeepSeek-R1-0528 (self-reported, llm-stats): 35.7%
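
The list numbers DeepSeek-V3.2 (Thinking) and DeepSeek-V3.2 as 6 and 7 although both report 65.0%. When reusing these figures, one option is to give tied entries the same rank. The sketch below applies standard competition ranking (1, 2, 2, 4, ...) to the scores copied from the leaderboard. The helper name `competition_ranks` is hypothetical, and this is not necessarily how llm-stats orders its entries.

```python
from typing import List, Tuple

# (model, score in percent), copied from the leaderboard above.
LEADERBOARD: List[Tuple[str, float]] = [
    ("Qwen3.5-397B-A17B", 70.3),
    ("Qwen3.5-122B-A10B", 69.9),
    ("Qwen3.5-35B-A3B", 69.5),
    ("LongCat-Flash-Thinking-2601", 69.0),
    ("GLM-4.7", 66.6),
    ("DeepSeek-V3.2 (Thinking)", 65.0),
    ("DeepSeek-V3.2", 65.0),
    ("Kimi K2-Thinking-0905", 62.3),
    ("Qwen3.5-27B", 62.1),
    ("DeepSeek-V3.1", 49.2),
    ("MiniMax M2", 48.5),
    ("DeepSeek-V3.2-Exp", 47.9),
    ("DeepSeek-R1-0528", 35.7),
]


def competition_ranks(
    entries: List[Tuple[str, float]],
) -> List[Tuple[int, str, float]]:
    """Return (rank, model, score) tuples; equal scores share a rank."""
    # sorted() is stable, so tied entries keep their original relative order.
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for i, (name, score) in enumerate(ordered):
        if i > 0 and score == ordered[i - 1][1]:
            rank = ranked[-1][0]  # tie: reuse the previous entry's rank
        else:
            rank = i + 1          # otherwise rank is the 1-based position
        ranked.append((rank, name, score))
    return ranked


if __name__ == "__main__":
    for rank, name, score in competition_ranks(LEADERBOARD):
        print(f"{rank:>2}. {name}: {score:.1f}%")
```

With this scheme both DeepSeek-V3.2 entries share rank 6 and Kimi K2-Thinking-0905 remains at rank 8.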