SWE-bench Verified

Agentic coding

A 500-problem subset of SWE-bench, each task hand-verified by human engineers as solvable. It is the most widely cited coding benchmark and a standard proxy for real-world software engineering ability.

Model scores

  • Opus 4.8: 88.6%
  • Opus 4.7: 87.6%
  • GPT-5.5: 88.7%
  • Gemini 3.1 Pro: 80.6%
  • Mythos Preview: 93.9%

Official source: swebench.com

Related reading