BrowseComp

Agentic search

BrowseComp measures an agent’s ability to find hard-to-locate information on the open web, requiring multi-step browsing, query reformulation and cross-referencing of sources to arrive at a verifiable answer.

Model scores

Fable 5: n/a
Opus 4.8: 84.3%
Sonnet 5: 84.7%
GPT-5.6 Sol: 90.4% (default) / 92.2% (ultra)
GPT-5.5: 84.4%
Composer 2.5: n/a
Opus 4.7: 79.8%
Gemini 3.1 Pro: 85.9%
Mythos Preview: 86.9%

Official source: BrowseComp paper (arXiv)
