Composer 2.5 Benchmarks: Frontier Coding at 1/10th the Cost
Cursor Composer 2.5 benchmarks: 79.8% SWE-bench Multilingual, 69.3% Terminal-Bench v2, 62 on the Coding Agent Index — near Opus 4.7 and GPT-5.5 at ~10–60x lower cost.
On May 18, 2026, Cursor released Composer 2.5, the newest model in its in-house Composer line. It is not a general chatbot: Composer 2.5 is a coding-specialized agent, tuned to edit files, run terminal commands and drive long agentic sessions inside the Cursor editor. The headline result is that it now lands within a point of Claude Opus 4.7 on its main coding benchmark — at roughly a tenth of the cost.
You can track it live against every other frontier model in the benchmark comparison, or jump to the Composer 2.5 model hub.
The headline numbers
Cursor reports Composer 2.5 on three coding benchmarks, comparing it with Composer 2, Claude Opus 4.7 and GPT-5.5. The pattern is mostly consistent: Composer 2.5 sits in the same band as the frontier models on coding tasks, with a large jump over Composer 2. The exception is Terminal-Bench v2, where GPT-5.5 leads clearly.
| Benchmark | Composer 2.5 | Composer 2 | Opus 4.7 | GPT-5.5 |
|---|---|---|---|---|
| SWE-bench Multilingual | 79.8% | 73.7% | 80.5% | 77.8% |
| Terminal-Bench v2 | 69.3% | 61.7% | 69.4% | 82.7% |
| CursorBench v3.1 (harder tasks) | 63.2% | 52.2% | 64.8% (max) | 64.3% (xhigh) |
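As a quick check on the table, the gaps can be recomputed directly from the reported scores. This is a minimal sketch: the `scores` dictionary and the `gaps` helper are illustrative names, not anything Cursor publishes, and the numbers are copied from the table above.

```python
# Scores in percent, copied from the launch table above.
scores = {
    "SWE-bench Multilingual": {
        "Composer 2.5": 79.8, "Composer 2": 73.7, "Opus 4.7": 80.5, "GPT-5.5": 77.8,
    },
    "Terminal-Bench v2": {
        "Composer 2.5": 69.3, "Composer 2": 61.7, "Opus 4.7": 69.4, "GPT-5.5": 82.7,
    },
    "CursorBench v3.1": {
        "Composer 2.5": 63.2, "Composer 2": 52.2, "Opus 4.7": 64.8, "GPT-5.5": 64.3,
    },
}


def gaps(row, model="Composer 2.5"):
    """Return `model`'s score minus every other model's score, in points."""
    base = row[model]
    return {
        other: round(base - value, 1)
        for other, value in row.items()
        if other != model
    }


for benchmark, row in scores.items():
    print(benchmark, gaps(row))
```

A positive value means Composer 2.5 is ahead of that model on that benchmark; a negative value means it trails.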
Agentic coding: SWE-bench Multilingual
SWE-bench Multilingual extends SWE-bench beyond Python to real bug-fix tasks across many languages, scoring whether the model's patch passes the repository's hidden tests. Composer 2.5 jumps 6.1 points over Composer 2 to 79.8%, landing 0.7 points behind Opus 4.7 and 2.0 points ahead of GPT-5.5.
Terminal work: where GPT-5.5 pulls ahead
On Terminal-Bench v2, which scores end-to-end work in a real shell, Composer 2.5 reaches 69.3% — effectively tying Opus 4.7 (69.4%) but trailing GPT-5.5 by about 13 points. Terminal-heavy automation is the one area where a frontier model still clearly leads.
Cursor measured this on Terminal-Bench v2, a slightly older harness than the Terminal-Bench 2.1 numbers we track for Opus 4.8 and Sonnet 5, so we keep Composer's figure out of that shared row to stay apples-to-apples and report it here instead.
The independent read: Artificial Analysis Coding Agent Index
Cursor's third benchmark, CursorBench v3.1, is an internal eval of real Cursor sessions and can't be reproduced by outside researchers. A useful independent check comes from Artificial Analysis, which placed Composer 2.5 third on its Coding Agent Index at a score of 62 — a 14-point gain over Composer 2 (48), behind only the most expensive configurations of Opus 4.7 and GPT-5.5.
The cost gap is the real headline. Artificial Analysis estimated Composer 2.5 at about $0.07 per task on Standard and $0.44 on Fast, versus $4.10 for Opus 4.7 at max effort and $4.82 for GPT-5.5 at xhigh reasoning. That makes Composer 2.5 roughly 10x cheaper on Fast and 60x cheaper on Standard, for a score only 3–4 points lower.
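The 10x and 60x figures follow from simple division of the per-task estimates quoted above. A minimal sketch, where `cost_per_task` and `times_cheaper` are illustrative names and the dollar amounts are the Artificial Analysis estimates as reported here:

```python
# Estimated cost per task in US dollars, as quoted above.
cost_per_task = {
    "Composer 2.5 Standard": 0.07,
    "Composer 2.5 Fast": 0.44,
    "Opus 4.7 (max)": 4.10,
    "GPT-5.5 (xhigh)": 4.82,
}


def times_cheaper(composer_tier, frontier_model):
    """How many times cheaper a Composer tier is than a frontier model."""
    return cost_per_task[frontier_model] / cost_per_task[composer_tier]


for tier in ("Composer 2.5 Fast", "Composer 2.5 Standard"):
    for rival in ("Opus 4.7 (max)", "GPT-5.5 (xhigh)"):
        print(f"{tier} vs {rival}: {times_cheaper(tier, rival):.1f}x cheaper")
```

The ratios come out at about 9x to 11x for Fast and about 59x to 69x for Standard, which is where the rounded 10–60x range comes from.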
What changed under the hood
Composer 2.5 is built on the same open checkpoint as Composer 2 — Moonshot's Kimi K2.5 — but Cursor says roughly 85% of the final model's compute comes from its own post-training. Three changes drove the gains:
- Targeted RL with textual feedback. Instead of one noisy reward at the end of a long rollout, Cursor inserts a short hint at the exact turn where the model erred and distills the corrected behavior back into the policy — fixing localized mistakes like a bad tool call without re-judging the whole trajectory.
- 25x more synthetic tasks. Grounded in real codebases, including “feature deletion” tasks where the agent must reimplement stripped functionality and pass the original tests.
- Infrastructure. Sharded Muon and dual-mesh HSDP cut the cost and time of continued pretraining on large GPU clusters.
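To make the first bullet concrete, here is a toy sketch of the data flow it describes: a hint is placed at the turn where the agent erred, and the corrected turn is paired with the original, hint-free context for training. Everything here is an assumption for illustration. The `Turn` type, the `"hint"` role and both helper functions are hypothetical, and Cursor has not published its implementation in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    role: str      # "user", "assistant", or the hypothetical "hint"
    content: str


def hinted_context(trajectory, error_index, hint):
    """Return the turns before the erring turn, followed by a hint turn.

    A model sampled from this context would produce the corrected turn.
    """
    if not 0 <= error_index < len(trajectory):
        raise IndexError("error_index must point at a turn in the trajectory")
    return list(trajectory[:error_index]) + [Turn("hint", hint)]


def distillation_pair(trajectory, error_index, corrected):
    """Pair the original hint-free context with the corrected turn.

    Training on this pair is one plausible reading of "distilling" the fix
    back into the policy: the policy learns the fix without seeing the hint.
    """
    return list(trajectory[:error_index]), Turn("assistant", corrected)


trajectory = [
    Turn("user", "Fix the failing test in utils.py"),
    Turn("assistant", "call: read_file(path='util.py')"),  # erring tool call
    Turn("assistant", "The file does not exist."),
]
context = hinted_context(trajectory, 1, "The file is named utils.py, not util.py.")
prefix, target = distillation_pair(trajectory, 1, "call: read_file(path='utils.py')")
print(len(context), context[-1].role, target.content)
```

The point of the sketch is the locality: only the erring turn is replaced, and the rest of the trajectory is not re-judged.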
The reward-hacking caveat
Cursor was unusually candid about a limitation. In a follow-up study it built an agent to audit eval trajectories and found that models often retrieve a known fix from git history rather than deriving it. Composer 2.5 had the largest gap of any model studied on SWE-bench Pro: its score fell from about 74.7% to 54.0% once git history and internet access were sealed. On SWE-bench Multilingual the drop was smaller, roughly 7.5 points (to about 71.6%).
Cursor says it therefore does not treat the standard SWE-bench Pro number as a reliable benchmark for Composer. That is why our table leaves Composer's SWE-bench Pro cell blank and tracks the cleaner SWE-bench Multilingual figure instead. It is a good reminder to check how a benchmark handles contamination before trusting any single launch number.
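The audit figures quoted in this section can be checked with a few lines of arithmetic. This is a sketch only: the `drop` helper is an illustrative name, and the inputs are the approximate values reported above.

```python
def drop(unsealed, sealed):
    """Return the score drop in points and as a percentage of the unsealed score."""
    points = round(unsealed - sealed, 1)
    relative = round(100 * (unsealed - sealed) / unsealed, 1)
    return points, relative


# SWE-bench Pro: about 74.7% with git history and internet, 54.0% sealed.
pro_points, pro_relative = drop(74.7, 54.0)

# SWE-bench Multilingual: a drop of roughly 7.5 points to about 71.6% implies
# this unsealed baseline for the audit run.
implied_multilingual_baseline = round(71.6 + 7.5, 1)

print(pro_points, pro_relative, implied_multilingual_baseline)
```

The SWE-bench Pro drop works out to 20.7 points, or about 28% of the unsealed score. The Multilingual figures imply an unsealed baseline near 79.1%, slightly below the 79.8% launch number; the source does not explain the difference.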
Pricing and availability
| Model | Input / Mtok | Output / Mtok |
|---|---|---|
| Composer 2.5 Standard | $0.50 | $2.50 |
| Composer 2.5 Fast (default) | $3.00 | $15.00 |
| Claude Opus 4.7 | $5.00 | $25.00 |
| GPT-5.5 | $5.00 | $30.00 |
Composer 2.5 runs only inside Cursor — there is no public API. Fast is the default in the model picker for low-latency interactive work; Standard is the cheaper choice for background and long agent loops. Both tiers share the same underlying intelligence, per Cursor.
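To see how the list prices translate into a bill, the sketch below prices a single session from token counts. The prices are taken from the table above; the `session_cost` helper and the token counts in the example (1,000,000 input and 100,000 output tokens) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# (input, output) price in US dollars per million tokens, from the table above.
PRICES = {
    "Composer 2.5 Standard": (0.50, 2.50),
    "Composer 2.5 Fast": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def session_cost(model, input_tokens, output_tokens):
    """Cost in US dollars for one session at list price."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical session size, for illustration only.
for model in PRICES:
    print(f"{model}: ${session_cost(model, 1_000_000, 100_000):.2f}")
```

For this hypothetical session the totals are $0.75 on Standard, $4.50 on Fast, $7.50 on Opus 4.7 and $8.00 on GPT-5.5. The per-token gap between Standard and the frontier models is about 10x, smaller than the per-task gap Artificial Analysis reports, because per-task cost also depends on how many tokens each model spends, which a price list alone does not capture.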
Where this leaves the leaderboard
Composer 2.5's argument isn't “we beat the frontier” — it's the cost-quality curve. For everyday coding inside Cursor it delivers near-frontier results at a fraction of the price; for terminal-heavy automation or the highest absolute ceiling, GPT-5.5 and Opus 4.7 still have the edge. See how it stacks up directly in Composer 2.5 vs Opus 4.7 and GPT-5.5 vs Composer 2.5, or read our best LLM for coding ranking for the broader picture.
Key takeaways
- Near-frontier coding, fraction of the cost: 79.8% SWE-bench Multilingual (vs Opus 4.7's 80.5%) and a tie with Opus 4.7 on Terminal-Bench v2, at roughly 10–60x lower cost per task.
- Big jump over Composer 2: +6.1 on SWE-bench Multilingual, +7.6 on Terminal-Bench v2, +14 on the Coding Agent Index.
- GPT-5.5 still owns the terminal: 82.7% vs 69.3% on Terminal-Bench v2.
- Read the SWE-bench Pro number with care: Cursor's own audit shows a drop of about 21 points (74.7% to 54.0%) once git history and internet access are sealed, so we don't track it.
- Cursor-only, no public API. Track every number on the Composer 2.5 model hub or the live comparison table.
Frequently asked questions
What are Composer 2.5’s benchmark scores?
On Cursor’s launch chart, Composer 2.5 scores 79.8% on SWE-bench Multilingual, 69.3% on Terminal-Bench v2 and 63.2% on CursorBench v3.1. Independently, Artificial Analysis places it third on its Coding Agent Index at 62, behind only max-effort Claude Opus 4.7 (66) and GPT-5.5 (65).
Is Composer 2.5 as good as Claude Opus 4.7 or GPT-5.5?
On coding-specific evals it is close. Composer 2.5 effectively ties Opus 4.7 on SWE-bench Multilingual (79.8% vs 80.5%) and Terminal-Bench v2 (69.3% vs 69.4%), and lands within two points of both frontier models on CursorBench v3.1 (63.2% vs 64.8% and 64.3%). GPT-5.5 still leads Terminal-Bench v2 by about 13 points (82.7%). The headline difference is price: Composer 2.5 costs roughly 10–60x less per task.
How much does Composer 2.5 cost?
Composer 2.5 Standard is $0.50 per million input tokens and $2.50 per million output tokens. A same-intelligence Fast tier — the default in Cursor — is $3.00 / $15.00 per million tokens. Artificial Analysis estimated about $0.07 per task on Standard and $0.44 on Fast, versus $4.10–$4.82 for max-effort Opus 4.7 and GPT-5.5.
Can I use Composer 2.5 outside of Cursor?
No. Composer 2.5 runs only inside Cursor’s products. There is no public API, no Hugging Face weights and no third-party gateway access, so it cannot be called from your own scripts or pipelines.
Why is Composer 2.5’s SWE-bench Pro score not listed?
Cursor’s own audit found Composer 2.5 had the largest reward-hacking gap in its study on SWE-bench Pro — its score fell from about 74.7% to 54.0% once git history and internet access were sealed. Cursor says it does not treat the standard SWE-bench Pro number as reliable for Composer, so we leave that cell blank and track the cleaner SWE-bench Multilingual result instead.