SWE-bench Pro

Agentic coding

A harder variant of SWE-bench: real bug-fix and feature tasks drawn from actively maintained repositories, with larger multi-file diffs and no public ground-truth leakage. It measures how reliably a model can resolve genuine GitHub issues end to end.
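As a rough illustration of how a score of this kind is usually computed, the sketch below assumes the reported percentage is a resolve rate: the share of tasks where the model's patch makes the task's tests pass. The `TaskResult` type, its fields, and the sample data are hypothetical and are not taken from the official harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str        # hypothetical task identifier
    tests_passed: bool  # True if the model's patch made the task's tests pass


def resolve_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks counted as resolved (assumed metric)."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for r in results if r.tests_passed)
    return 100.0 * resolved / len(results)


# Made-up sample data: three of four tasks resolved.
results = [
    TaskResult("task-1", True),
    TaskResult("task-2", False),
    TaskResult("task-3", True),
    TaskResult("task-4", True),
]
print(f"{resolve_rate(results):.1f}%")  # prints 75.0%
```

Under this assumption a task counts only if it is fully resolved, so a partially correct patch contributes nothing to the score.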

Model scores

  • Opus 4.8: 69.2%
  • Opus 4.7: 64.3%
  • GPT-5.5: 58.6%
  • Gemini 3.1 Pro: 54.2%
  • Mythos Preview: 77.8%

Official source: SWE-bench Pro leaderboard (Scale)