SWE-bench Multilingual

Agentic coding

SWE-bench Multilingual extends SWE-bench beyond Python to real bug-fix tasks across many programming languages, each scored by whether the model’s patch passes the repository’s hidden tests. It is the headline agentic-coding benchmark Cursor reports for its Composer models.
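
As a rough illustration of how a headline percentage like the ones below could be derived from per-task outcomes, here is a minimal sketch. It assumes each task yields a single pass/fail verdict from the hidden tests; `TaskResult`, `tests_passed`, and `resolve_rate` are hypothetical names used only for this example and do not come from the official harness.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskResult:
    """Outcome of one bug-fix task (hypothetical structure, for illustration)."""
    task_id: str        # illustrative identifier, not a real benchmark ID
    tests_passed: bool  # True only if the patch passes the hidden tests


def resolve_rate(results: Sequence[TaskResult]) -> float:
    """Return the percentage of tasks whose patch passed the hidden tests."""
    if not results:
        raise ValueError("at least one task result is required")
    resolved = sum(1 for r in results if r.tests_passed)
    return 100.0 * resolved / len(results)


# Made-up outcomes: three of four tasks resolved.
example = [
    TaskResult("task-a", True),
    TaskResult("task-b", True),
    TaskResult("task-c", False),
    TaskResult("task-d", True),
]
print(f"{resolve_rate(example):.1f}%")
```

Under this assumption a score is simply the share of tasks resolved, so a partially correct patch that still fails the hidden tests counts the same as no patch at all.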

Model scores

Model             Score
Fable 5           not listed
Opus 4.8          not listed
Sonnet 5          not listed
GPT-5.6 Sol       not listed
GPT-5.5           77.8%
Composer 2.5      79.8%
Opus 4.7          80.5%
Gemini 3.1 Pro    not listed
Mythos Preview    not listed

Official source: Cursor — Introducing Composer 2.5

Related reading