Claude Sonnet 5 Benchmarks: Opus-Class Agentics, Sonnet Pricing

On June 30, 2026, Anthropic released Claude Sonnet 5, billed as its most agentic Sonnet model yet — one that can make plans, drive tools like browsers and terminals, and run autonomously at a level that recently required larger, more expensive models. The pitch is simple: Sonnet 5's performance sits close to Claude Opus 4.8, but at a much lower price.

This post walks through every number from the launch, charts the family side by side, and explains where Sonnet 5 fits on the leaderboard. You can track it live against every other frontier model in the benchmark comparison, or jump to the Claude Sonnet 5 model hub.

The headline numbers

Anthropic's launch table compares Sonnet 5 with its predecessor, Sonnet 4.6, and with Opus 4.8 for reference. Sonnet 5 beats Sonnet 4.6 on every row and lands within 0.5 to 6.6 points of Opus 4.8 on each percentage-scored benchmark:

Benchmark	Sonnet 5	Sonnet 4.6	Opus 4.8
SWE-bench Pro (agentic coding)	63.2%	58.1%	69.2%
Terminal-Bench 2.1 (agentic coding)	80.4%	67.0%	82.7%
Humanity's Last Exam (no tools)	43.2%	34.6%	49.8%
Humanity's Last Exam (with tools)	57.4%	46.8%	57.9%
OSWorld-Verified (computer use)	81.2%	78.5%	83.4%
GDPval-AA v2 (knowledge work, Elo)	1618	1395	1615
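
The deltas quoted in this post can be re-derived from the table above. The snippet below is a minimal sketch rather than anything Anthropic publishes: the `HEADLINE` dictionary and the `summarize` helper are names invented here for illustration, and the GDPval-AA row is left out because Elo ratings are not percentage points.

```python
# Headline scores copied from the launch table: (Sonnet 5, Sonnet 4.6, Opus 4.8).
# HEADLINE and summarize() are illustrative names, not part of any vendor tooling.
HEADLINE = {
    "SWE-bench Pro": (63.2, 58.1, 69.2),
    "Terminal-Bench 2.1": (80.4, 67.0, 82.7),
    "Humanity's Last Exam (no tools)": (43.2, 34.6, 49.8),
    "Humanity's Last Exam (with tools)": (57.4, 46.8, 57.9),
    "OSWorld-Verified": (81.2, 78.5, 83.4),
}


def summarize(scores):
    """Map each benchmark to (gain over Sonnet 4.6, gap to Opus 4.8, share of old gap closed)."""
    out = {}
    for name, (sonnet5, sonnet46, opus48) in scores.items():
        gain = round(sonnet5 - sonnet46, 1)   # points gained over the predecessor
        gap = round(opus48 - sonnet5, 1)      # points still missing to Opus 4.8
        old_gap = opus48 - sonnet46           # distance Sonnet 4.6 had to Opus 4.8
        out[name] = (gain, gap, round(gain / old_gap, 2))
    return out


if __name__ == "__main__":
    for name, (gain, gap, closed) in summarize(HEADLINE).items():
        print(f"{name}: +{gain} vs Sonnet 4.6, {gap} behind Opus 4.8, {closed:.0%} of the gap closed")
```

Read this way, the share of the Sonnet 4.6 to Opus 4.8 gap that Sonnet 5 closes varies a lot by benchmark, which is why a single "within a few points" summary hides some spread.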

Coding and agentic tasks

Coding is where Sonnet 5 makes its biggest leap. On SWE-bench Pro, one of the hardest real-world coding evals, it gains 5.1 points over Sonnet 4.6 to reach 63.2%, closing a little under half of the 11.1-point gap between Sonnet 4.6 and Opus 4.8 (69.2%). On Terminal-Bench 2.1, which scores end-to-end work in a real terminal, the gain is larger still: a 13.4-point lift to 80.4%, 2.3 points short of Opus 4.8's 82.7%.

SWE-bench Pro (agentic coding). Vendor-reported pass rate (%). Higher is better.

Opus 4.8: 69.2%
Sonnet 5: 63.2%
Sonnet 4.6: 58.1%

Terminal-Bench 2.1 (agentic terminal coding). Vendor-reported pass rate (%). Higher is better.

Opus 4.8: 82.7%
Sonnet 5: 80.4%
Sonnet 4.6: 67.0%

Early-access testers echoed the numbers, describing a model that finishes multi-step jobs where previous Sonnets would stall, and that checks its own output without being asked — one even reported it writing a reproducing test, fixing a bug, then stashing the fix to confirm the bug returned, all in a single pass.

Computer use and reasoning

On OSWorld-Verified, the desktop computer-use benchmark, Sonnet 5 reaches 81.2% at its highest effort level: 2.2 points behind Opus 4.8 and 2.7 points ahead of Sonnet 4.6. On Humanity's Last Exam, the gap to Opus 4.8 nearly disappears with tools enabled: 57.4% vs 57.9%.

OSWorld-Verified (agentic computer use). Vendor-reported pass rate (%) at max effort. Higher is better.

Opus 4.8: 83.4%
Sonnet 5: 81.2%
Sonnet 4.6: 78.5%

Humanity's Last Exam (with tools): multidisciplinary reasoning. Vendor-reported accuracy (%) with tools. Higher is better.

Opus 4.8: 57.9%
Sonnet 5: 57.4%
Sonnet 4.6: 46.8%

On the GDPval-AA v2 knowledge-work eval — an Elo-style rating of real economic deliverables — Sonnet 5 scores 1618, edging out Opus 4.8's 1615 and far above Sonnet 4.6's 1395. It is the one headline metric where the cheaper model nudges ahead of the flagship.

GDPval-AA v2 (knowledge work, Elo). Vendor-reported Elo-style rating. Higher is better.

Sonnet 5: 1618
Opus 4.8: 1615
Sonnet 4.6: 1395
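
To put a 3-point and a 223-point rating gap in perspective, the sketch below converts rating differences into expected head-to-head scores. It assumes GDPval-AA v2 follows the conventional logistic Elo model with a 400-point scale; the source only says "Elo-style", so treat the scale and the function name `elo_expected_score` as assumptions made for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    ASSUMPTION: the 400-point scale is the chess convention; the benchmark
    may use a different scale, which would change these numbers.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in this post.
SONNET_5, OPUS_4_8, SONNET_4_6 = 1618, 1615, 1395

if __name__ == "__main__":
    print(f"Sonnet 5 vs Opus 4.8:   {elo_expected_score(SONNET_5, OPUS_4_8):.3f}")
    print(f"Sonnet 5 vs Sonnet 4.6: {elo_expected_score(SONNET_5, SONNET_4_6):.3f}")
```

Under that assumption, the lead over Opus 4.8 is close to a coin flip, while the lead over Sonnet 4.6 corresponds to winning roughly three comparisons in four.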

Beyond the headline: what the system card adds

The launch post shows only six evals; Anthropic's Claude Sonnet 5 System Card reports a far wider suite (Table 8.1.A). Several of those results land right on top of the GPT-5.5 numbers we already track, so they drop cleanly into the comparison. The biggest is BrowseComp, the agentic-search eval: at max effort Sonnet 5 hits 84.7% with a single agent (and 86.6% with a multi-agent setup), 8.5 points above Sonnet 4.6's 76.2% and essentially level with GPT-5.5's 84.4%.

BrowseComp (agentic search). System Card Table 8.1.A. Sonnet 5 single-agent, max effort. Higher is better.

Sonnet 5: 84.7%
GPT-5.5: 84.4%
Sonnet 4.6: 76.2%

The card also fills in several professional and tool-use evals. Notably, on HealthBench Professional Sonnet 5 (57.8%) edges out Opus 4.8 (56.9%) in our comparison table, a rare win for the cheaper model, and on CharXiv Reasoning it scores 77.0% without tools and 88.3% with tools.

Benchmark	Sonnet 5	Sonnet 4.6	GPT-5.5
BrowseComp (single agent)	84.7%	76.2%	84.4%
AutomationBench	13.5%	5.3%	12.9%
Legal Agent Benchmark (Harvey held-out)	5.8%	5.4%	2.1%
HealthBench Professional	57.8%	44.2%	51.8%
CharXiv Reasoning (with tools)	88.3%	—	—

Pricing and availability

Sonnet 5 launched at an introductory $2 per million input tokens and $10 per million output tokens through August 31, 2026, after which it moves to standard pricing of $3 / $15 per Mtok. Either way it sits well below Opus 4.8's $5 / $25. It is the default model on Free and Pro plans, available to Max, Team and Enterprise users, and shipped in Claude Code and on the Claude Platform, where developers call it as claude-sonnet-5.

One pricing nuance: Sonnet 5 uses an updated tokenizer (as Opus 4.7 did), so the same input can map to roughly 1.0–1.35× as many tokens, depending on content. Anthropic set the introductory price so the move from Sonnet 4.6 is roughly cost-neutral. The model also ships with real-time cyber safeguards enabled by default.
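
The pricing and tokenizer effects combine multiplicatively, which the sketch below makes concrete. It uses only the prices and the 1.0–1.35× range quoted in this post; the tier labels, the function names, and the choice to apply the multiplier to output tokens as well as input are assumptions made for illustration, not Anthropic's billing logic.

```python
from datetime import date

# USD per million tokens as (input, output), as quoted in this post.
# The dictionary keys are illustrative labels, not API model identifiers.
PRICES = {
    "sonnet-5-intro": (2.0, 10.0),
    "sonnet-5-standard": (3.0, 15.0),
    "opus-4.8": (5.0, 25.0),
}
INTRO_ENDS = date(2026, 8, 31)  # introductory pricing runs through this date


def sonnet5_tier(on: date) -> str:
    """Pick the Sonnet 5 price tier in effect on a given date."""
    return "sonnet-5-intro" if on <= INTRO_ENDS else "sonnet-5-standard"


def cost_usd(tier: str, input_tokens: float, output_tokens: float,
             token_multiplier: float = 1.0) -> float:
    """Estimate cost for token counts measured with the OLD tokenizer.

    ASSUMPTION: token_multiplier (1.0 to 1.35 per the post) is applied to
    both input and output; the post states the range for input text only.
    """
    in_price, out_price = PRICES[tier]
    scaled_in = input_tokens * token_multiplier
    scaled_out = output_tokens * token_multiplier
    return (scaled_in * in_price + scaled_out * out_price) / 1_000_000


if __name__ == "__main__":
    tier = sonnet5_tier(date(2026, 7, 15))
    # Hypothetical workload: 1M input tokens and 200k output tokens.
    for multiplier in (1.0, 1.35):
        print(tier, multiplier, round(cost_usd(tier, 1_000_000, 200_000, multiplier), 2))
```

For that hypothetical workload, the introductory tier comes out between $4.00 and $5.40 depending on how much the tokenizer inflates the text, against $10.00 at the Opus 4.8 prices quoted above.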

Where this leaves the leaderboard

Sonnet 5 collapses the old Sonnet-versus-Opus decision into a single price-performance curve. If you need the top score on the hardest agentic tasks, Opus 4.8 still wins by a small margin. If you want most of that capability at 40% of Opus 4.8's price during the introductory window, or 60% at standard pricing, Sonnet 5 is now the obvious default. Compare them directly in the Opus 4.8 vs Sonnet 5 head-to-head, see how it stacks up against Anthropic's flagship in Fable 5 vs Sonnet 5, or against OpenAI in Sonnet 5 vs GPT-5.5. For the broader picture, see our best LLM for coding ranking.

A note on our comparison table: we only fill in Sonnet 5 cells where a shared reference model matches the figure already in our data, so every head-to-head stays apples-to-apples. With the system card numbers folded in, that now covers SWE-bench Pro, Terminal-Bench, OSWorld-Verified, Humanity's Last Exam, BrowseComp, AutomationBench, the Legal Agent Benchmark, HealthBench Professional and CharXiv (for Terminal-Bench we moved the whole row onto the card's harness, re-measuring Opus 4.8 at 82.7 and GPT-5.5 at 83.4). FrontierCode (Diamond vs the card's v1) and GDPval-AA (v1 vs v2) stay blank because the reference points don't line up. As always, treat vendor launch numbers with caution — see the complete guide to LLM benchmarks for how to read them.
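
The fill rule described above can be sketched as a small check. This is an illustration of the idea, not the code behind our table: `Report`, `fill_cell`, and the tolerance are hypothetical names and values, and the tracked GPT-5.5 BrowseComp figure is taken to equal the card's 84.4% because the post says the numbers coincide.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    """One benchmark row from a new source (hypothetical structure)."""
    benchmark: str
    version: str             # benchmark variant, e.g. "v1", "v2", "single-agent"
    reference_score: float   # shared reference model's score in the new source
    candidate_score: float   # score of the model we want to add


def fill_cell(report: Report, tracked_version: str, tracked_reference: float,
              tol: float = 0.05) -> Optional[float]:
    """Return the candidate score only if the reference point lines up.

    The cell stays blank (None) when the benchmark version differs or the
    shared reference model's score does not match what is already tracked.
    """
    if report.version != tracked_version:
        return None
    # Written as "not (<=)" so a missing (NaN) tracked value never passes.
    if not (abs(report.reference_score - tracked_reference) <= tol):
        return None
    return report.candidate_score


if __name__ == "__main__":
    # BrowseComp: GPT-5.5 is the shared reference; Sonnet 5 is the candidate.
    browsecomp = Report("BrowseComp", "single-agent", 84.4, 84.7)
    print(fill_cell(browsecomp, "single-agent", 84.4))   # cell is filled

    # GDPval-AA: the card reports v2 while the table tracks v1, so stay blank.
    # The tracked v1 value is not quoted in this post, hence NaN.
    gdpval = Report("GDPval-AA", "v2", 1615, 1618)
    print(fill_cell(gdpval, "v1", math.nan))             # cell stays blank
```

Checking the version before the score means a coincidental numeric match across different benchmark variants never fills a cell.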

Key takeaways

Opus-class agentics, Sonnet pricing: within ~2–3 points of Opus 4.8 on Terminal-Bench 2.1 and OSWorld-Verified (6.0 points behind on SWE-bench Pro), at $2/$10–$3/$15 per Mtok vs $5/$25.
Big jump over Sonnet 4.6: +13.4 on Terminal-Bench, +5.1 on SWE-bench Pro, +10.6 on Humanity's Last Exam with tools.
One outright headline win: GDPval-AA v2 knowledge work, 1618 vs Opus 4.8's 1615.
Tokenizer change: the same text can map to 1.0–1.35× as many tokens; intro pricing is set to keep the switch roughly cost-neutral.
Track every number on the Sonnet 5 model hub or the live comparison table.

Frequently asked questions

Is Claude Sonnet 5 better than Claude Opus 4.8?

Not quite — but it is close. Opus 4.8 still leads on the hardest evaluations (SWE-bench Pro 69.2% vs 63.2%, OSWorld-Verified 83.4% vs 81.2%, Terminal-Bench 2.1 82.7% vs 80.4%), and on Humanity’s Last Exam with tools the two are nearly tied (57.9% vs 57.4%). Sonnet 5 lands within a few points of Opus 4.8 on most agentic tasks while costing far less, so Opus 4.8 remains the accuracy pick and Sonnet 5 the price-performance pick.

How much does Claude Sonnet 5 cost?

Sonnet 5 is $3 per million input tokens and $15 per million output tokens at standard pricing, with an introductory rate of $2 / $10 per million tokens through August 31, 2026. That is well below Opus 4.8 at $5 / $25. Note that Sonnet 5 uses an updated tokenizer, so the same text can map to roughly 1.0–1.35× as many tokens as on Sonnet 4.6.

What is Claude Sonnet 5’s SWE-bench Pro score?

Anthropic reports Claude Sonnet 5 at 63.2% on SWE-bench Pro, up from 58.1% for Sonnet 4.6. Opus 4.8 still leads the family at 69.2%.

Is Claude Sonnet 5 good for coding and agentic work?

Yes. Sonnet 5 is described as the most agentic Sonnet model yet, scoring 80.4% on Terminal-Bench 2.1 and 63.2% on SWE-bench Pro — large gains over Sonnet 4.6 (67.0% and 58.1%) and close to Opus-class results, which makes it a strong default for sustained coding, tool use and debugging.

What changed between Claude Sonnet 4.6 and Sonnet 5?

Sonnet 5 improves across reasoning, tool use, coding and knowledge work, with gains of roughly 3–13 points on the percentage-scored benchmarks Anthropic reported and 223 Elo points on GDPval-AA v2. It also ships with selectable effort levels, real-time cyber safeguards enabled by default, and an updated tokenizer.