GPT-5.5 Benchmarks: A Full Breakdown

GPT-5.5 is OpenAI's current flagship model and one of the most capable LLMs available for production use. It leads on terminal automation, long-context retrieval, and routine agentic coding, and stays competitive on reasoning, though it trails on harder coding tasks and structured tool-calling. This post covers every major score, what the numbers mean in practice, and when GPT-5.5 is the right model for your work.

All figures can be verified in the live benchmark comparison. For an introduction to how benchmarks work, see the complete guide to LLM benchmarks.

Terminal automation: GPT-5.5's strongest category

Terminal-Bench 2.1 simulates real shell sessions: writing scripts, navigating filesystems, debugging pipelines, and running multi-step CLI workflows. GPT-5.5 scores 83.4%, the highest among the three major shipping models (GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro). Claude Opus 4.8 scores 82.7% and Gemini 3.1 Pro scores 70.3%. For reference, Opus 4.7 scores 66.1%.

GPT-5.5 still edges Opus 4.8, though only by 0.7 points now that Anthropic has re-measured Opus 4.8 on the newer harness (82.7%, up from 74.6%), and it stays 13.1 points clear of Gemini 3.1 Pro. If your agent spends most of its time in a shell rather than a GUI, GPT-5.5 and Opus 4.8 are effectively tied at the top.
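
The margins above are simple subtraction, but it helps to keep them reproducible when scores get re-measured. Here is a minimal sketch in Python that recomputes them from the Terminal-Bench 2.1 figures quoted in this post. The `TERMINAL_BENCH` dictionary and the `margins_over` helper are names made up for this illustration, not part of any benchmark tooling.

```python
# Terminal-Bench 2.1 scores as quoted in this post (percent).
TERMINAL_BENCH = {
    "GPT-5.5": 83.4,
    "Claude Opus 4.8": 82.7,
    "Gemini 3.1 Pro": 70.3,
    "Claude Opus 4.7": 66.1,
}


def margins_over(scores, leader):
    """Return the leader's lead over every other model, in percentage points.

    Hypothetical helper for this post; rounds to one decimal to hide
    floating-point noise from the subtraction.
    """
    top = scores[leader]
    return {
        model: round(top - score, 1)
        for model, score in scores.items()
        if model != leader
    }


if __name__ == "__main__":
    for model, gap in margins_over(TERMINAL_BENCH, "GPT-5.5").items():
        print(f"GPT-5.5 leads {model} by {gap} points")
```

If a lab publishes a re-measured score, as happened with Opus 4.8 here, updating one dictionary entry refreshes every margin.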

Long-context retrieval: a clear architecture advantage

AA-LCR (Long Context Retrieval) measures how well a model can find and reason over information buried in very large documents. GPT-5.5 scores 74.3%, the highest among the models with reported scores. Opus 4.7 scores 70.3% and Opus 4.8 scores 67.7%. Gemini 3.1 Pro and Mythos Preview are not reported on this benchmark.

A 6.6-point lead over Claude Opus 4.8 (and a 4.0-point lead over Opus 4.7) is hard to ignore for workloads that involve searching through large codebases, long legal documents, or extended conversation histories. This is where GPT-5.5's long-context architecture earns its keep.
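
Because two models have no reported AA-LCR score, any ranking has to treat "not reported" differently from a low score. The sketch below shows one way to do that in Python, using only the figures quoted in this post. The `AA_LCR` dictionary and the `rank_reported` function are illustrative names, and using `None` for a missing score is an assumption of this example.

```python
from typing import Optional

# AA-LCR scores as quoted in this post (percent). None means "not reported".
AA_LCR: dict[str, Optional[float]] = {
    "GPT-5.5": 74.3,
    "Claude Opus 4.8": 67.7,
    "Claude Opus 4.7": 70.3,
    "Gemini 3.1 Pro": None,
    "Mythos Preview": None,
}


def rank_reported(scores):
    """Return (ranked, missing): reported scores sorted high to low,
    plus the names of models with no reported score."""
    reported = [(model, s) for model, s in scores.items() if s is not None]
    missing = [model for model, s in scores.items() if s is None]
    reported.sort(key=lambda pair: pair[1], reverse=True)
    return reported, missing


ranked, unreported = rank_reported(AA_LCR)

if __name__ == "__main__":
    for model, score in ranked:
        print(f"{model}: {score}%")
    print("Not reported:", ", ".join(unreported))
```

Keeping the unreported models in a separate list makes it obvious that GPT-5.5's lead holds only among models that have published a number.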

Agentic coding: strong but not dominant

On SWE-bench Verified, GPT-5.5 scores 88.7% — the highest among all five benchmarked models and just ahead of Claude Opus 4.8's 88.6%. This is essentially a tie; both models perform exceptionally well on the verified subset.

The picture changes on SWE-bench Pro, the harder real-world eval. GPT-5.5 scores 58.6%, which is 10.6 points behind Opus 4.8's 69.2%. SWE-bench Verified is saturating, with the top two models only 0.1 points apart above 88%; SWE-bench Pro is where the differences become meaningful. For complex, production-grade issue resolution, Opus 4.8 has a real advantage.

For a full head-to-head breakdown, see the Claude Opus 4.8 vs GPT-5.5 comparison or the best LLM for coding.

Reasoning and knowledge

On GPQA Diamond, GPT-5.5 scores 93.6%, exactly tied with Opus 4.8 and just below Gemini 3.1 Pro's 94.3% and Mythos Preview's 94.6%. All models are tightly clustered in the 93–95% range. GPQA Diamond no longer separates the top tier.

Humanity's Last Exam is the harder test. GPT-5.5 scores 41.4% without tools and 52.2% with tools. Opus 4.8 scores 49.8% / 57.9% and Gemini 3.1 Pro scores 44.4% / 51.4%. On the hardest reasoning tasks, GPT-5.5 trails Opus 4.8 by 8.4 points without tools and 5.7 points with tools. Against Gemini 3.1 Pro, it trails by 3.0 points without tools and leads by 0.8 points with them. Read the best LLM for reasoning for a deeper analysis.
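
One way to read the paired HLE numbers is to look at how much each model gains once tools are switched on. The Python sketch below computes that uplift from the scores quoted above. The `HLE` dictionary and the `tool_uplift` function are hypothetical names for this illustration, and the result is plain arithmetic on published figures, not a separate measurement.

```python
# Humanity's Last Exam scores as quoted in this post:
# (without tools, with tools), in percent.
HLE = {
    "GPT-5.5": (41.4, 52.2),
    "Claude Opus 4.8": (49.8, 57.9),
    "Gemini 3.1 Pro": (44.4, 51.4),
}


def tool_uplift(scores):
    """Return each model's gain from tool access, in percentage points."""
    return {
        model: round(with_tools - without_tools, 1)
        for model, (without_tools, with_tools) in scores.items()
    }


if __name__ == "__main__":
    for model, gain in tool_uplift(HLE).items():
        print(f"{model}: +{gain} points with tools")
```

On these figures GPT-5.5 gains 10.8 points from tools, compared with 8.1 for Opus 4.8 and 7.0 for Gemini 3.1 Pro, which is why its gap to Opus 4.8 narrows in the with-tools column.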

Tool use and web search

On BrowseComp (web search and research), GPT-5.5 scores 84.4% — essentially tied with Opus 4.8's 84.3% and just below Gemini 3.1 Pro's 85.9%. Web search quality is competitive across all three flagship models.

On MCP-Atlas (structured tool-calling), GPT-5.5 scores 75.3%, which is 6.9 points below Opus 4.8's 82.2% and 2.9 points below Gemini 3.1 Pro's 78.2%. For agentic systems that chain together external APIs and tools, this gap matters.

Key takeaways

Best shipping model for terminal automation: 83.4% on Terminal-Bench 2.1, narrowly leading a re-measured Opus 4.8 (82.7%) and 13.1 points clear of Gemini 3.1 Pro.
Strongest long-context retrieval: 74.3% on AA-LCR, 6.6 points ahead of Opus 4.8 — a real advantage for large-document workloads.
Ties on SWE-bench Verified: 88.7% is effectively tied with Opus 4.8's 88.6%; use SWE-bench Pro, where GPT-5.5 trails by 10.6 points, to see meaningful differences on harder coding tasks.
Solid but not leading on reasoning: 93.6% on GPQA Diamond ties Opus 4.8; HLE scores trail Opus 4.8 by 8.4 points without tools and 5.7 points with tools.
Trails on tool use and computer use: MCP-Atlas favors Opus 4.8 by 6.9 points (82.2% vs. 75.3%), and OSWorld-Verified also favors Opus 4.8; see the live benchmark comparison for the OSWorld-Verified figures.
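
To turn these takeaways into a quick lookup, here is a rough Python sketch that lists which models sit at or near the top of each benchmark, using only scores quoted in this post. The `SCORES` table, the `leaders` function, and the 1.0-point `TIE_MARGIN` are all assumptions made for this example. The threshold in particular is an arbitrary choice, so pick one that matches how much noise you expect in the evals.

```python
# Scores as quoted in this post (percent). Models with no quoted score
# on a benchmark are simply left out of that benchmark's entry.
SCORES = {
    "Terminal-Bench 2.1": {
        "GPT-5.5": 83.4, "Claude Opus 4.8": 82.7,
        "Gemini 3.1 Pro": 70.3, "Claude Opus 4.7": 66.1,
    },
    "AA-LCR": {"GPT-5.5": 74.3, "Claude Opus 4.8": 67.7, "Claude Opus 4.7": 70.3},
    "SWE-bench Verified": {"GPT-5.5": 88.7, "Claude Opus 4.8": 88.6},
    "SWE-bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.8": 69.2},
    "MCP-Atlas": {"GPT-5.5": 75.3, "Claude Opus 4.8": 82.2, "Gemini 3.1 Pro": 78.2},
}

# Hypothetical threshold: scores within this many points of the best
# score are treated as an effective tie.
TIE_MARGIN = 1.0


def leaders(benchmark, tie_margin=TIE_MARGIN):
    """Return every model within tie_margin points of the best score, best first."""
    scores = SCORES[benchmark]
    best = max(scores.values())
    close = [(model, s) for model, s in scores.items() if best - s <= tie_margin]
    close.sort(key=lambda pair: pair[1], reverse=True)
    return [model for model, _ in close]


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: {', '.join(leaders(name))}")
```

With a 1.0-point margin, Terminal-Bench 2.1 and SWE-bench Verified both come out as GPT-5.5 and Opus 4.8 together, AA-LCR goes to GPT-5.5 alone, and SWE-bench Pro and MCP-Atlas go to Opus 4.8 alone.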

Explore the full model profile on the GPT-5.5 hub page or browse all scores in the live benchmark comparison table.