BrowseComp


BrowseComp is a benchmark of 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. It measures an agent's ability to persist in gathering information, navigate the web creatively, and return concise, verifiable answers. Despite the difficulty of the questions, the benchmark is simple to use because predicted answers are short and easy to check against reference answers.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: agents, reasoning, search. Language: en. Verified by llm-stats: no.
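
Because answers are short and compared against reference answers, and the maximum score is 1, a result can be read as the fraction of questions answered correctly. The following is a minimal sketch of that idea, not the benchmark's actual grader: it assumes a normalized exact-match comparison, and the function names (`normalize`, `accuracy`) and the sample answers are hypothetical, chosen only for illustration.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions matching their reference, in [0, 1].

    Assumption: exact match after normalization stands in for whatever
    grading procedure a real evaluation uses.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical sample data, not drawn from the benchmark.
preds = ["Marie Curie", " 1987 ", "the Eiffel Tower"]
refs = ["marie curie.", "1987", "Tokyo Tower"]

score = accuracy(preds, refs)
print(f"{score:.3f} ({score:.1%})")
```

Under this reading, a leaderboard entry of 90.1% corresponds to a score of 0.901 on the 0 to 1 scale.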

Leaderboard

All scores are self-reported (source: llm-stats).

  1. GPT-5.5 Pro: 90.1%
  2. Claude Mythos Preview: 86.9%
  3. Kimi K2.6: 86.3%
  4. Gemini 3.1 Pro: 85.9%
  5. GPT-5.5: 84.4%
  6. Claude Opus 4.8: 84.3%
  7. Claude Opus 4.6: 84.0%
  8. MiniMax M3: 83.5%
  9. DeepSeek-V4-Pro-Max: 83.4%
  10. GPT-5.4: 82.7%
  11. Claude Opus 4.7: 79.3%
  12. GLM-5.1: 79.3%
  13. GPT-5.2 Pro: 77.9%
  14. Seed 2.0 Pro: 77.3%
  15. MiniMax M2.5: 76.3%
  16. GLM-5: 75.9%
  17. Kimi K2.5: 74.9%
  18. Claude Sonnet 4.6: 74.7%
  19. DeepSeek-V4-Flash-Max: 73.2%
  20. Qwen3.5-397B-A17B: 69.0%