BrowseComp

Agentic search

BrowseComp measures an agent’s ability to find hard-to-locate information on the open web, requiring multi-step browsing, query reformulation and cross-referencing of sources to arrive at a verifiable answer.

Model scores

  • Opus 4.8: 84.3%
  • Opus 4.7: 79.8%
  • GPT-5.5: 84.4%
  • Gemini 3.1 Pro: 85.9%
  • Mythos Preview: 86.9%

Official source: BrowseComp paper (arXiv)

Related reading