SWE-bench Verified

Agentic coding

A 500-problem subset of SWE-bench in which each task has been hand-verified by human engineers as solvable. It is the most widely cited coding benchmark and a standard proxy for real-world software engineering ability.
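Because the subset has a fixed size of 500 tasks, a percentage score can be read as a count of tasks. The sketch below assumes a score is simply the share of the 500 tasks a model resolves; that scoring rule, the helper names, and the sample value 82.0 are illustrative assumptions and are not taken from the leaderboard itself.

```python
TOTAL_TASKS = 500  # size of SWE-bench Verified, per the description above


def resolved_count(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Convert a percentage score into a task count.

    Assumes score = resolved / total * 100 (an assumed scoring rule).
    """
    return round(score_percent * total / 100)


def score_percent(resolved: int, total: int = TOTAL_TASKS) -> float:
    """Convert a resolved-task count back into a one-decimal percentage."""
    return round(100 * resolved / total, 1)


if __name__ == "__main__":
    example = 82.0  # hypothetical score, for illustration only
    print(resolved_count(example))                 # tasks resolved out of 500
    print(score_percent(resolved_count(example)))  # round-trips to the same score
```

Under this assumption, one task is worth 0.2 percentage points, so differences smaller than that between two scores cannot come from the task count alone.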

Model scores

Fable 5           95.0%
Opus 4.8          88.6%
Sonnet 5          79.6%
GPT-5.6 Sol       96.2%
GPT-5.5           82.6%
Composer 2.5      79.6%
Opus 4.7          82.0%
Gemini 3.1 Pro    78.8%
Mythos Preview    —

Official source: Vals.ai SWE-bench Verified leaderboard

Related reading