Claude Opus 4.8 Benchmarks: A Full Breakdown

Claude Opus 4.8 is Anthropic's current flagship model and one of the strongest general-purpose LLMs available. Across nine major benchmarks it consistently places at or near the top among models that are actually shipping to developers. This post walks through every score, explains what each benchmark measures, and tells you when Opus 4.8 is the right choice — and when it is not.

Update (June 9, 2026): Anthropic has released Claude Fable 5, a generally available Mythos-class model that now tops the SWE-bench leaderboards. Opus 4.8 remains the stronger value pick at half the price — see the Claude Fable 5 benchmark breakdown for how the two compare.

You can verify all figures in the live benchmark comparison or read the complete guide to LLM benchmarks to understand how these numbers are produced.

Agentic coding: where Opus 4.8 shines brightest

Coding evals are the clearest signal for production engineering teams. On SWE-bench Pro, the hardest real-world software-engineering benchmark available, Opus 4.8 scores 69.2%. That comfortably beats the other established flagships — GPT-5.5 scores 58.6%, Gemini 3.1 Pro scores 54.2%, and Claude Opus 4.7 scores 64.3%. The models above it are Anthropic's own Mythos-class tier: Mythos Preview at 77.8% and the newly released Claude Fable 5 at 80.3%, which costs twice as much per token.

On SWE-bench Verified, the easier, human-validated subset of the original SWE-bench, Opus 4.8 scores 88.6%, nearly identical to GPT-5.5's 88.7% and well above Gemini 3.1 Pro's 80.6%. SWE-bench Verified is saturating at the top (the two leaders are 0.1 points apart), so SWE-bench Pro is where real differentiation shows. For a deeper comparison against GPT-5.5 specifically, see Claude Opus 4.8 vs GPT-5.5.
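
To make the saturation point concrete, here is a minimal Python sketch that uses only the scores quoted above and compares how far apart the three established flagships sit on each benchmark. The dictionary layout and the `spread` helper are illustrative names chosen for this post, not part of any benchmark tooling.

```python
# Scores (percent) as quoted in this post for the three established
# flagships. Mythos-class models are left out on purpose.
SCORES = {
    "SWE-bench Verified": {
        "Claude Opus 4.8": 88.6,
        "GPT-5.5": 88.7,
        "Gemini 3.1 Pro": 80.6,
    },
    "SWE-bench Pro": {
        "Claude Opus 4.8": 69.2,
        "GPT-5.5": 58.6,
        "Gemini 3.1 Pro": 54.2,
    },
}


def spread(scores: dict[str, float]) -> tuple[float, float]:
    """Return (gap between the top two models, gap between best and worst)."""
    ranked = sorted(scores.values(), reverse=True)
    return round(ranked[0] - ranked[1], 1), round(ranked[0] - ranked[-1], 1)


for benchmark, scores in SCORES.items():
    top_two, full = spread(scores)
    print(f"{benchmark}: top-two gap {top_two} pts, full spread {full} pts")
```

On these numbers the top two models are separated by 0.1 points on SWE-bench Verified and by 10.6 points on SWE-bench Pro, which is the sense in which the harder benchmark differentiates better.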

The bottom line for engineering teams: if your agentic pipeline needs to resolve complex GitHub issues with minimal scaffolding, Opus 4.8 is the strongest model at its price point — and Fable 5 is the option above it if budget allows. See the best LLM for coding for a broader comparison including more models.
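
One rough way to reason about "strongest at its price point" is to divide each score by a relative price. The sketch below assumes Opus 4.8 as the price baseline (1.0) and Fable 5 at 2.0, because this post only states that Fable 5 costs twice as much per token and gives no absolute prices. The metric itself (`points_per_price_unit`) is a hypothetical illustration, not an established measure.

```python
# Assumption: prices are expressed relative to Opus 4.8 (= 1.0). The post only
# says Fable 5 costs twice as much per token; absolute prices are not given.
MODELS = {
    "Claude Opus 4.8": {"swe_bench_pro": 69.2, "relative_price": 1.0},
    "Claude Fable 5": {"swe_bench_pro": 80.3, "relative_price": 2.0},
}


def points_per_price_unit(model: str) -> float:
    """Crude value metric: SWE-bench Pro points per unit of relative token price."""
    entry = MODELS[model]
    return round(entry["swe_bench_pro"] / entry["relative_price"], 2)


# Extra SWE-bench Pro points gained by moving from Opus 4.8 to Fable 5.
extra_points = round(
    MODELS["Claude Fable 5"]["swe_bench_pro"]
    - MODELS["Claude Opus 4.8"]["swe_bench_pro"],
    1,
)

for name in MODELS:
    print(f"{name}: {points_per_price_unit(name)} points per price unit")
print(f"Fable 5 adds {extra_points} points for 2x the per-token price")
```

Per-token price is not the same as cost per resolved issue, since models can use different numbers of tokens on the same task, so treat this as a starting point for your own measurements rather than a verdict.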

Tool use and MCP integration

MCP-Atlas tests structured tool-calling workflows — the kind of multi-step API orchestration that agentic applications depend on. Opus 4.8 scores 82.2%, the highest among all benchmarked models. GPT-5.5 scores 75.3% and Gemini 3.1 Pro scores 78.2%. Opus 4.7 scores 79.1%.

A 6.9-point lead over GPT-5.5 on this benchmark is material for teams building systems that chain together external APIs, search, and code execution. The lead narrows to 4.0 points against Gemini 3.1 Pro, which otherwise competes closely on several other benchmarks, but it still holds.

Computer use and GUI automation

OSWorld-Verified evaluates a model's ability to control a desktop GUI — clicking, typing, navigating applications. Opus 4.8 scores 83.4%, well clear of GPT-5.5 (78.7%) and Gemini 3.1 Pro (76.2%). Only Anthropic's Mythos-class models score higher — Mythos Preview at 85.4% and Fable 5 at 85.0% per the June 2026 launch table.

This matters for browser automation, RPA-style workflows, and agents that need to interact with software that does not expose an API. Opus 4.8 leads GPT-5.5 by 4.7 points and Gemini 3.1 Pro by 7.2 points here.

Reasoning and knowledge

On GPQA Diamond (graduate-level science questions), Opus 4.8 scores 93.6%. That ties GPT-5.5 exactly and falls just below Gemini 3.1 Pro's 94.3% and Mythos's 94.6%. All models with reported scores are clustered tightly here: the full spread from lowest to highest is 1.0 point.

Humanity's Last Exam is the more discriminating reasoning benchmark. Opus 4.8 scores 49.8% without tools and 57.9% with tools. GPT-5.5 scores 41.4% / 52.2%, and Gemini 3.1 Pro scores 44.4% / 51.4%. On the hardest reasoning frontier, Opus 4.8 leads its nearest rival by 5.4 points without tools and 5.7 points with tools. Read the best LLM for reasoning breakdown for more analysis.
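
The with-tools and without-tools figures invite two questions: how much does each model gain from tools, and how large is the Opus 4.8 lead in each mode? A small Python sketch over the scores quoted above answers both. The helper names `tool_uplift` and `lead_over` are made up for this illustration.

```python
# Humanity's Last Exam scores quoted above: (without tools, with tools).
HLE = {
    "Claude Opus 4.8": (49.8, 57.9),
    "GPT-5.5": (41.4, 52.2),
    "Gemini 3.1 Pro": (44.4, 51.4),
}


def tool_uplift(model: str) -> float:
    """Points gained on HLE when the model is allowed to use tools."""
    without_tools, with_tools = HLE[model]
    return round(with_tools - without_tools, 1)


def lead_over(model: str, rival: str) -> tuple[float, ...]:
    """Lead of `model` over `rival` as (without tools, with tools)."""
    return tuple(round(a - b, 1) for a, b in zip(HLE[model], HLE[rival]))


for name in HLE:
    print(f"{name}: +{tool_uplift(name)} pts from tools")
for rival in ("GPT-5.5", "Gemini 3.1 Pro"):
    print(f"Opus 4.8 lead over {rival}: {lead_over('Claude Opus 4.8', rival)}")
```

On these figures GPT-5.5 gains the most from tools (10.8 points) yet still trails Opus 4.8 by 5.7 points with tools enabled.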

Where Opus 4.8 falls short

Not every benchmark favors Opus 4.8. On Terminal-Bench 2.1, GPT-5.5 scores 83.4% versus Opus 4.8's 82.7% — a gap that all but vanished once Anthropic re-measured Opus 4.8 on the newer harness (it previously reported 74.6%). It is now effectively a tie for shell-heavy automation.

On AA-LCR (long-context retrieval), GPT-5.5 scores 74.3% against Opus 4.8's 67.7%, a 6.6-point gap. If your workload centers on searching through very large documents or codebases, GPT-5.5 has a real edge.

On BrowseComp (web search), Opus 4.8 scores 84.3% — essentially identical to GPT-5.5's 84.4% and just below Gemini 3.1 Pro's 85.9%. This is a near-tie across all three flagship models.
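
The head-to-head numbers quoted in this post can be collected into one lookup that reports the leader per benchmark among the three established flagships. This is a sketch under stated assumptions: Mythos-class models are excluded, benchmarks where no Gemini 3.1 Pro score was quoted list only two models, and `TIE_THRESHOLD` is a hypothetical 2.0-point cut-off chosen to mirror what the post calls a near-tie.

```python
# Scores (percent) quoted in this post. A missing model means the post
# did not quote a score for it on that benchmark.
SCORES = {
    "SWE-bench Pro": {"Claude Opus 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "SWE-bench Verified": {"Claude Opus 4.8": 88.6, "GPT-5.5": 88.7, "Gemini 3.1 Pro": 80.6},
    "MCP-Atlas": {"Claude Opus 4.8": 82.2, "GPT-5.5": 75.3, "Gemini 3.1 Pro": 78.2},
    "OSWorld-Verified": {"Claude Opus 4.8": 83.4, "GPT-5.5": 78.7, "Gemini 3.1 Pro": 76.2},
    "GPQA Diamond": {"Claude Opus 4.8": 93.6, "GPT-5.5": 93.6, "Gemini 3.1 Pro": 94.3},
    "HLE (no tools)": {"Claude Opus 4.8": 49.8, "GPT-5.5": 41.4, "Gemini 3.1 Pro": 44.4},
    "HLE (with tools)": {"Claude Opus 4.8": 57.9, "GPT-5.5": 52.2, "Gemini 3.1 Pro": 51.4},
    "Terminal-Bench 2.1": {"Claude Opus 4.8": 82.7, "GPT-5.5": 83.4},
    "AA-LCR": {"Claude Opus 4.8": 67.7, "GPT-5.5": 74.3},
    "BrowseComp": {"Claude Opus 4.8": 84.3, "GPT-5.5": 84.4, "Gemini 3.1 Pro": 85.9},
}

TIE_THRESHOLD = 2.0  # hypothetical cut-off: smaller margins count as a tie


def leader(benchmark: str) -> tuple[str, float]:
    """Return the top-scoring model and its margin over the runner-up."""
    ranked = sorted(SCORES[benchmark].items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    return best, round(best_score - second_score, 1)


for name in SCORES:
    model, margin = leader(name)
    verdict = "effective tie" if margin < TIE_THRESHOLD else f"{model} by {margin} pts"
    print(f"{name}: {verdict}")
```

Changing `TIE_THRESHOLD` is the main design lever: a stricter value such as 1.0 would count Gemini 3.1 Pro's 1.5-point BrowseComp margin as a win rather than a tie.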

Key takeaways

Best value for agentic coding: Opus 4.8 scores 69.2% on SWE-bench Pro, 10.6 points ahead of GPT-5.5 and 15 points ahead of Gemini 3.1 Pro — only Anthropic's pricier Mythos-class models (Fable 5, Mythos Preview) score higher.
Best-in-class tool use: 82.2% on MCP-Atlas is the highest score among all benchmarked models, including research previews.
Strong in computer use: 83.4% on OSWorld-Verified beats GPT-5.5 and Gemini by 4.7+ points; only the Mythos-class models edge ahead.
Strong but not dominant on reasoning: 93.6% on GPQA Diamond ties GPT-5.5, and the lead on HLE is real but not massive.
Behind on long-context retrieval, level on terminal tasks: GPT-5.5 leads AA-LCR by 6.6 points, while its 0.7-point edge on Terminal-Bench 2.1 is effectively a tie. Choose accordingly.

Explore the full model profile on the Claude Opus 4.8 hub page or compare directly in the live benchmark comparison table.