Claude Opus 4.8 Benchmarks: A Full Breakdown

A full breakdown of Claude Opus 4.8 benchmark scores across coding, agentic tasks, reasoning, and tool use. See where Anthropic's flagship leads and where it trails.

Claude Opus 4.8 is Anthropic's current flagship model and one of the strongest general-purpose LLMs available. Across nine major benchmarks it consistently places at or near the top among models that are actually shipping to developers. This post walks through every score, explains what each benchmark measures, and tells you when Opus 4.8 is the right choice — and when it is not.

You can verify all figures in the live benchmark comparison or read the complete guide to LLM benchmarks to understand how these numbers are produced.

Agentic coding: where Opus 4.8 shines brightest

Coding evals are the clearest signal for production engineering teams. On SWE-bench Pro, the hardest real-world software-engineering benchmark available, Opus 4.8 scores 69.2%. That is the highest score among shipping models — GPT-5.5 scores 58.6%, Gemini 3.1 Pro scores 54.2%, and Claude Opus 4.7 scores 64.3%. The only model that beats it is Mythos Preview at 77.8%, which is a research preview rather than a production release.

On SWE-bench Verified, the easier subset, Opus 4.8 scores 88.6% — nearly identical to GPT-5.5's 88.7% and well above Gemini 3.1 Pro's 80.6%. SWE-bench Verified is saturating; SWE-bench Pro is where real differentiation shows. For a deeper comparison against GPT-5.5 specifically, see Claude Opus 4.8 vs GPT-5.5.

The bottom line for engineering teams: if your agentic pipeline needs to resolve complex GitHub issues with minimal scaffolding, Opus 4.8 is the strongest shipping model available. See the best LLM for coding for a broader comparison including more models.
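The SWE-bench Pro margins can be rechecked directly from the scores quoted in this section. The snippet below is a minimal sketch: the dictionary and the `margins` helper are illustrative names, not part of any benchmark tooling, and the only data used is the set of figures stated above.

```python
# SWE-bench Pro scores as quoted in this article (percent of tasks resolved).
SWE_BENCH_PRO = {
    "Claude Opus 4.8": 69.2,
    "Claude Opus 4.7": 64.3,
    "GPT-5.5": 58.6,
    "Gemini 3.1 Pro": 54.2,
    "Mythos Preview": 77.8,  # research preview, not a shipping model
}


def margins(scores, baseline):
    """Return the baseline's lead over every other model, in percentage points."""
    base = scores[baseline]
    return {
        model: round(base - score, 1)  # round to hide float noise
        for model, score in scores.items()
        if model != baseline
    }


opus_margins = margins(SWE_BENCH_PRO, "Claude Opus 4.8")
for model, gap in sorted(opus_margins.items(), key=lambda kv: kv[1], reverse=True):
    # A negative gap means the other model scores higher than Opus 4.8.
    print(f"Opus 4.8 vs {model}: {gap:+.1f} points")
```

Rounding to one decimal place matches the precision of the published scores and avoids binary floating-point artifacts in the differences.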

Tool use and MCP integration

MCP-Atlas tests structured tool-calling workflows — the kind of multi-step API orchestration that agentic applications depend on. Opus 4.8 scores 82.2%, the highest among all benchmarked models. GPT-5.5 scores 75.3% and Gemini 3.1 Pro scores 78.2%. Opus 4.7 scores 79.1%.

A 6.9-point lead over GPT-5.5 on this benchmark is material for teams building systems that chain together external APIs, search, and code execution. The lead also holds against Gemini 3.1 Pro, at 4.0 points, even though Gemini competes closely on benchmarks such as GPQA Diamond and BrowseComp.

Computer use and GUI automation

OSWorld-Verified evaluates a model's ability to control a desktop GUI — clicking, typing, navigating applications. Opus 4.8 scores 83.4%, the highest among all five benchmarked models. GPT-5.5 trails at 78.7%, Gemini 3.1 Pro at 76.2%, and Mythos Preview at 79.6%.

This matters for browser automation, RPA-style workflows, and agents that need to interact with software that does not expose an API. Opus 4.8 leads the closest competitor quoted here, Mythos Preview, by 3.8 points.

Reasoning and knowledge

On GPQA Diamond (graduate-level science questions), Opus 4.8 scores 93.6%. That ties GPT-5.5 exactly and falls just below Gemini 3.1 Pro's 94.3% and Mythos's 94.6%. All five models cluster tightly here: less than 1.1 points separates the highest score from the lowest.

Humanity's Last Exam is the more discriminating reasoning benchmark. Opus 4.8 scores 49.8% without tools and 57.9% with tools. GPT-5.5 scores 41.4% / 52.2%, and Gemini 3.1 Pro scores 44.4% / 51.4%. On this harder benchmark, Opus 4.8 leads its nearest competitor by 5.4 points without tools (over Gemini 3.1 Pro) and by 5.7 points with tools (over GPT-5.5). Read the best LLM for reasoning breakdown for more analysis.
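The paired Humanity's Last Exam scores also show how much each model gains from tool access. The sketch below computes that uplift from the figures quoted above; the `HLE` table and the `tool_uplift` function are hypothetical names introduced only for this illustration.

```python
# Humanity's Last Exam scores as quoted in this article:
# (percent without tools, percent with tools).
HLE = {
    "Claude Opus 4.8": (49.8, 57.9),
    "GPT-5.5": (41.4, 52.2),
    "Gemini 3.1 Pro": (44.4, 51.4),
}


def tool_uplift(scores):
    """Return the with-tools minus without-tools gain per model, in points."""
    return {
        model: round(with_tools - without_tools, 1)
        for model, (without_tools, with_tools) in scores.items()
    }


uplift = tool_uplift(HLE)
for model, gain in uplift.items():
    print(f"{model}: +{gain:.1f} points from tool access")
```

On these figures GPT-5.5 gains the most from tools (10.8 points), which is why its gap to Opus 4.8 is smaller in the with-tools column than in the without-tools column.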

Where Opus 4.8 falls short

Not every benchmark favors Opus 4.8. On Terminal-Bench 2.1, GPT-5.5 scores 78.2% versus Opus 4.8's 74.6% — a 3.6-point gap that matters for teams building agentic CLI pipelines or shell-heavy automation.

On AA-LCR (long-context retrieval), GPT-5.5 scores 74.3% against Opus 4.8's 67.7%, a 6.6-point gap. If your workload centers on searching through very large documents or codebases, GPT-5.5 has a real edge.

On BrowseComp (web search), Opus 4.8 scores 84.3% — essentially identical to GPT-5.5's 84.4% and just below Gemini 3.1 Pro's 85.9%. This is a near-tie across all three flagship models.

Key takeaways

  • Strongest shipping model for agentic coding: Opus 4.8 scores 69.2% on SWE-bench Pro, 10.6 points ahead of GPT-5.5 and 15 points ahead of Gemini 3.1 Pro.
  • Best-in-class tool use: 82.2% on MCP-Atlas is the highest score among all benchmarked models, including research previews.
  • Clear leader in computer use: 83.4% on OSWorld-Verified beats every competitor by at least 3.8 points.
  • Strong but not dominant on reasoning: 93.6% on GPQA Diamond ties GPT-5.5, and the lead on HLE is real but not massive.
  • Weaker on terminal tasks and long-context retrieval: GPT-5.5 leads on Terminal-Bench 2.1 and AA-LCR — choose accordingly.
  • Explore the full model profile on the Claude Opus 4.8 hub page or compare directly in the live benchmark comparison table.
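The "choose accordingly" advice can be sketched as a lookup from workload to the benchmark leader. This is only an illustration under stated assumptions: the workload labels, the `SCORES` table, and the `leader` function are hypothetical, the data is limited to scores quoted in this article for the three shipping flagships, and a single benchmark is a rough proxy for a real workload.

```python
# Workload label -> (benchmark name, scores quoted in this article).
# Only the shipping flagship models are included; the labels are illustrative.
SCORES = {
    "agentic coding": ("SWE-bench Pro", {
        "Claude Opus 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2}),
    "tool use": ("MCP-Atlas", {
        "Claude Opus 4.8": 82.2, "GPT-5.5": 75.3, "Gemini 3.1 Pro": 78.2}),
    "computer use": ("OSWorld-Verified", {
        "Claude Opus 4.8": 83.4, "GPT-5.5": 78.7, "Gemini 3.1 Pro": 76.2}),
    "terminal": ("Terminal-Bench 2.1", {
        "Claude Opus 4.8": 74.6, "GPT-5.5": 78.2}),
    "long-context retrieval": ("AA-LCR", {
        "Claude Opus 4.8": 67.7, "GPT-5.5": 74.3}),
    "web search": ("BrowseComp", {
        "Claude Opus 4.8": 84.3, "GPT-5.5": 84.4, "Gemini 3.1 Pro": 85.9}),
}


def leader(workload):
    """Return (benchmark, top model, lead over the runner-up in points)."""
    benchmark, scores = SCORES[workload]
    top_model = max(scores, key=scores.get)
    runner_up_score = sorted(scores.values(), reverse=True)[1]
    return benchmark, top_model, round(scores[top_model] - runner_up_score, 1)


for workload in SCORES:
    benchmark, model, lead = leader(workload)
    print(f"{workload}: {model} leads {benchmark} by {lead:.1f} points")
```

A small lead, such as the one on BrowseComp, is better read as a tie than as a reason to switch models.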
