What Is BrowseComp? Measuring Agentic Web Search
BrowseComp tests whether AI agents can find hard-to-locate facts buried deep on the open web using multi-step browsing. Learn how the benchmark works and why it matters.
Most retrieval benchmarks ask a model to look something up. BrowseComp asks whether a model can actually find information — the kind of fact buried on page seven of a niche forum, behind a redirect, or spread across sources that must be cross-referenced before the answer becomes clear.
What BrowseComp measures
BrowseComp is an agentic web-search benchmark introduced by OpenAI. Each task presents an agent with a question whose answer exists somewhere on the open web but is genuinely hard to retrieve. The questions are designed so that a single keyword search does not surface the answer — finding it requires navigating multiple pages, synthesising clues across sources, and sometimes backtracking when a promising lead turns out to be wrong.
Each answer is checked against a fixed ground truth. The metric is simply accuracy: the fraction of questions the agent answered correctly. Because questions are adversarially chosen for difficulty, even frontier models score well below 100%, which keeps the benchmark from saturating quickly.
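As a rough illustration of the metric, the sketch below computes accuracy over a set of questions. The function names, the question IDs, and the normalised exact-match comparison are assumptions made for this example; the benchmark's real grading procedure may differ.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of questions whose prediction matches the fixed reference answer.

    Assumption for this sketch: a normalised exact match counts as correct,
    and a question with no prediction counts as incorrect.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1
        for question_id, reference in ground_truth.items()
        if normalize(predictions.get(question_id, "")) == normalize(reference)
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    # Made-up data, purely illustrative.
    reference_answers = {"q1": "Paris", "q2": "42", "q3": "Ada Lovelace"}
    agent_answers = {"q1": " paris ", "q2": "41"}  # q3 was never answered
    print(accuracy(agent_answers, reference_answers))
```

Counting unanswered questions as wrong keeps the denominator fixed, so scores stay comparable between agents that attempt different numbers of questions.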
You can inspect current model scores on the BrowseComp benchmark page or compare it to all other evaluations in the live benchmark comparison table.
Single-agent vs. multi-agent variants
BrowseComp results are reported under two conditions that reveal very different things about a model's capability:
- Single-agent: one model instance controls the browser, issues searches, reads pages, and returns a final answer. This tests raw browsing skill and the model's ability to maintain a coherent research thread.
- Multi-agent: an orchestrator spawns parallel subagents, each exploring a different search thread simultaneously, then merges their findings. This typically produces higher accuracy because dead ends are explored in parallel rather than sequentially.
The gap between single- and multi-agent scores is itself informative: a model that improves greatly in the multi-agent setting is leaning on parallelism, while one that improves only slightly probably already has a strong single-pass browsing strategy. See our primer on agentic evaluations for context on why this distinction matters.
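The multi-agent condition can be pictured with a minimal sketch like the one below. It is only an illustration of the orchestrator pattern, not the harness used to produce reported scores: the `Subagent` signature, the `orchestrate` function, and the majority-vote merge are all assumptions for this example, and the subagent is a stub rather than a real browsing model.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

# Hypothetical subagent interface: takes a description of one search thread
# and returns a candidate answer, or None if the thread was a dead end.
Subagent = Callable[[str], Optional[str]]


def orchestrate(search_threads: list[str], subagent: Subagent) -> Optional[str]:
    """Run one subagent per search thread in parallel, then merge the findings.

    Assumption for this sketch: findings are merged by majority vote over the
    non-empty answers. A real orchestrator could merge in many other ways.
    """
    if not search_threads:
        return None
    with ThreadPoolExecutor(max_workers=len(search_threads)) as pool:
        findings = list(pool.map(subagent, search_threads))
    answers = [answer for answer in findings if answer is not None]
    if not answers:
        return None  # every thread was a dead end
    return Counter(answers).most_common(1)[0][0]


def stub_subagent(search_thread: str) -> Optional[str]:
    """Stand-in for a browsing model; returns canned, made-up results."""
    canned = {
        "forum archive": "1987",
        "press release": "1987",
        "wiki mirror": None,  # dead end
        "personal blog": "1989",
    }
    return canned.get(search_thread)


if __name__ == "__main__":
    threads = ["forum archive", "press release", "wiki mirror", "personal blog"]
    print(orchestrate(threads, stub_subagent))
```

The dead-end thread costs no extra wall-clock time here because it runs alongside the productive ones, which is the advantage the multi-agent setting is meant to capture.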
Why multi-step browsing is hard
Simple retrieval, meaning typing a query and reading the top result, is largely routine for current LLMs. BrowseComp is hard for several compounding reasons:
- Query formulation — the model must choose search terms that surface relevant but non-obvious pages, iterating when early searches fail.
- State tracking — across many pages and redirects, the model must remember which leads were promising, which were dead ends, and what partial information has been gathered.
- Evidence synthesis — the final answer often requires combining facts from several pages, none of which alone is sufficient.
- Hallucination pressure — when genuine evidence is scarce, a model under pressure to produce an answer may confabulate instead of correctly saying it could not find the information.
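The state-tracking, synthesis, and abstention points above can be made concrete with a small data structure. This is a sketch under stated assumptions, not how any particular agent is implemented: the `ResearchState` class, its fields, and the `min_sources` threshold are hypothetical, and the URLs in the usage example are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResearchState:
    """Hypothetical record of what a browsing agent has tried and learned so far."""

    question: str
    queries_tried: list[str] = field(default_factory=list)
    dead_ends: set[str] = field(default_factory=set)
    # Maps each candidate answer to the URLs that support it.
    evidence: dict[str, list[str]] = field(default_factory=dict)

    def record_query(self, query: str) -> None:
        """Remember a search query so the agent does not repeat it."""
        if query not in self.queries_tried:
            self.queries_tried.append(query)

    def mark_dead_end(self, url: str) -> None:
        """Remember a page that turned out to be unhelpful."""
        self.dead_ends.add(url)

    def add_evidence(self, candidate: str, url: str) -> None:
        """Attach a supporting URL to a candidate answer, skipping known dead ends."""
        if url in self.dead_ends:
            return
        urls = self.evidence.setdefault(candidate, [])
        if url not in urls:
            urls.append(url)

    def final_answer(self, min_sources: int = 2) -> Optional[str]:
        """Return the best-supported candidate, or None to abstain.

        Assumption: an answer needs at least `min_sources` distinct URLs.
        Abstaining stands in for saying "I could not find this" rather
        than confabulating an answer.
        """
        if not self.evidence:
            return None
        best, urls = max(self.evidence.items(), key=lambda item: len(item[1]))
        return best if len(urls) >= min_sources else None


if __name__ == "__main__":
    state = ResearchState(question="In which year was the (made-up) club founded?")
    state.record_query("club founding year")
    state.add_evidence("1987", "https://example.com/forum-thread")
    print(state.final_answer())  # one source only, so the agent abstains
    state.add_evidence("1987", "https://example.com/newsletter")
    print(state.final_answer())
```

Requiring more than one source before committing to an answer is one simple way to trade a little recall for fewer confabulated answers; the right threshold would depend on the agent and the task.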
How BrowseComp relates to other agentic benchmarks
BrowseComp sits in the same family of evaluations as MCP-Atlas (tool orchestration) and OSWorld-Verified (GUI computer use), all of which measure an agent completing multi-step tasks in an environment rather than answering a one-shot question. The distinguishing feature of BrowseComp is that the environment is the live open web, with all its noise, inconsistency, and adversarial SEO.
For a broader map of how benchmark categories fit together, read the complete guide to LLM benchmarks. If you want to understand a complementary eval that tests reasoning over large retrieved documents rather than search itself, see what is AA-LCR.
Key takeaways
- BrowseComp tests whether an AI agent can locate genuinely obscure facts on the open web, not just retrieve obvious answers.
- Questions require multi-step navigation, evidence synthesis, and tolerance for dead ends — skills distinct from single-turn question answering.
- The single-agent vs. multi-agent split shows how much a model benefits from parallelism versus raw browsing ability.
- Because questions are adversarially chosen for difficulty, the benchmark remains discriminative even as frontier models improve.
- BrowseComp is best read alongside other agentic benchmarks (MCP-Atlas, OSWorld-Verified) to build a full picture of an agent's real-world capability.