BrowseComp Long Context 256k


BrowseComp is a benchmark that measures how well agents browse the web. It comprises 1,266 questions that require persistent navigation of the internet to find hard-to-locate, entangled information. Despite the difficulty of the questions, the benchmark is simple to use: predicted answers are short and easy to verify against reference answers. It focuses on questions whose answers are obscure, time-invariant, and well supported by evidence scattered across the open web.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning, search. Language: en. Verified by llm-stats: no.
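
Because predicted answers are short and checked against reference answers, and the maximum score is 1, a score can be read as the fraction of questions answered correctly. The sketch below illustrates that idea only. It assumes a normalized exact-match comparison, which is a simplification chosen for illustration and not the benchmark's documented grading procedure; the function names (`normalize`, `is_correct`, `accuracy`, `as_percentage`) are hypothetical.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def is_correct(predicted: str, reference: str) -> bool:
    # Assumption: exact match after normalization stands in for the real grader.
    return normalize(predicted) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of correct answers, a value in [0, 1]."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


def as_percentage(score: float) -> str:
    """Format a [0, 1] score the way the leaderboard displays it."""
    return f"{score * 100:.1f}%"
```

Under this reading, a score of 0.898 on the 0-to-1 scale would be displayed as 89.8%. A real evaluation would likely need a more tolerant comparison, since short answers can be phrased in several equivalent ways.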

Leaderboard

  1. GPT-5.2 (self-reported, llm-stats)
    89.8%
  2. GPT-5 (self-reported, llm-stats)
    88.8%