SWE-bench Pro

Agentic coding

A harder variant of SWE-bench: real bug-fix and feature tasks drawn from actively maintained repositories, with larger multi-file diffs and no public ground-truth leakage. It measures how reliably a model can resolve genuine GitHub issues end to end.

Model scores

Fable 5: 80.3%
Opus 4.8: 69.2%
Sonnet 5: 63.2%
GPT-5.6 Sol: 64.6%
GPT-5.5: 59.4%
Composer 2.5: —
Opus 4.7: 64.3%
Gemini 3.1 Pro: 54.2%
Mythos Preview: 77.8%

Official source: SWE-bench Pro leaderboard (Scale)

Related reading