BrowseComp Long Context 128k

A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. Comprises 1,266 questions requiring strategic reasoning, creative search, and interpretation of retrieved content, with short and easily verifiable answers.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning, search. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GPT-5.2: 92.0% (self-reported, via llm-stats)
  2. GPT-5: 90.0% (self-reported, via llm-stats)
  3. GPT-5.1: 90.0% (self-reported, via llm-stats)
  4. GPT-5.1 Instant: 90.0% (self-reported, via llm-stats)
  5. GPT-5.1 Thinking: 90.0% (self-reported, via llm-stats)
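
The metadata gives a maximum score of 1 while the leaderboard shows percentages, and four models share 90.0% yet occupy positions 2 through 5. The Python sketch below shows one way to read these figures. It assumes that a displayed percentage equals the raw score divided by the maximum score, times 100, which the page does not state. The names `Entry`, `to_fraction`, and `competition_ranks` are hypothetical and are not part of any llm-stats API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    percent: float  # score as displayed on the leaderboard


# Values copied from the leaderboard above (all self-reported).
ENTRIES = [
    Entry("GPT-5.2", 92.0),
    Entry("GPT-5", 90.0),
    Entry("GPT-5.1", 90.0),
    Entry("GPT-5.1 Instant", 90.0),
    Entry("GPT-5.1 Thinking", 90.0),
]

MAX_SCORE = 1.0  # "Max score: 1" from the methodology metadata


def to_fraction(percent: float, max_score: float = MAX_SCORE) -> float:
    """Assumed mapping: displayed percent = score / max_score * 100."""
    return percent / 100.0 * max_score


def competition_ranks(entries):
    """Give tied entries the same rank (1, 2, 2, 2, 2 style)."""
    ordered = sorted(entries, key=lambda e: e.percent, reverse=True)
    ranks = {}
    for position, entry in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and entry.percent == previous.percent:
            # Same score as the entry above: reuse its rank.
            ranks[entry.model] = ranks[previous.model]
        else:
            ranks[entry.model] = position
    return ranks


if __name__ == "__main__":
    ranks = competition_ranks(ENTRIES)
    for entry in ENTRIES:
        print(ranks[entry.model], entry.model, to_fraction(entry.percent))
```

Under tie-aware ranking, the four models at 90.0% share rank 2, so their listed order (positions 2 through 5) should not be read as a performance difference.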