GPT-5.5 Benchmarks: A Full Breakdown
A complete breakdown of GPT-5.5 benchmark scores across coding, terminal use, long-context retrieval, and reasoning. See where OpenAI's flagship leads the field.
GPT-5.5 is OpenAI's current flagship model and one of the most capable LLMs available for production use. It leads on several critical benchmarks — particularly terminal automation, long-context retrieval, and routine agentic coding — while remaining highly competitive on reasoning. This post covers every major score, what the numbers mean in practice, and when GPT-5.5 is the right model for your work.
All figures can be verified in the live benchmark comparison. For an introduction to how benchmarks work, see the complete guide to LLM benchmarks.
Terminal automation: GPT-5.5's clearest lead
Terminal-Bench 2.1 simulates real shell sessions: writing scripts, navigating filesystems, debugging pipelines, and running multi-step CLI workflows. GPT-5.5 scores 78.2%, the highest among the three major shipping models. Claude Opus 4.8 scores 74.6% and Gemini 3.1 Pro scores 70.3%. For reference, Opus 4.7 scores 66.1%.
A 3.6-point lead over Opus 4.8 and nearly 8 points over Gemini 3.1 Pro is meaningful for engineering teams building terminal-heavy automation. If your agent spends most of its time in a shell rather than a GUI, GPT-5.5 is the strongest shipping choice.
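The margins quoted above are simple differences in percentage points. A minimal sketch of that arithmetic follows, using only the Terminal-Bench 2.1 scores from this post; the function name `margins_over` is an illustrative choice, not part of any benchmark tooling.

```python
# Terminal-Bench 2.1 scores as quoted in this post (percent).
TERMINAL_BENCH = {
    "GPT-5.5": 78.2,
    "Claude Opus 4.8": 74.6,
    "Gemini 3.1 Pro": 70.3,
    "Opus 4.7": 66.1,
}


def margins_over(scores, leader):
    """Return the leader's lead over every other model, in percentage points."""
    base = scores[leader]
    return {
        model: round(base - score, 1)  # round away float noise such as 3.6000000000000085
        for model, score in scores.items()
        if model != leader
    }


if __name__ == "__main__":
    for model, gap in margins_over(TERMINAL_BENCH, "GPT-5.5").items():
        print(f"GPT-5.5 leads {model} by {gap} points")
```

Note that these are percentage-point gaps, not relative improvements; a 3.6-point lead at this score level is roughly a 5% relative difference.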
Long-context retrieval: a clear architecture advantage
AA-LCR (Long Context Retrieval) measures how well a model can find and reason over information buried in very large documents. GPT-5.5 scores 74.3%, the highest among the models with a reported score. Opus 4.7 scores 70.3% and Opus 4.8 scores 67.7%. Gemini 3.1 Pro and Mythos Preview are not reported on this benchmark.
Leads of 6.6 points over Claude Opus 4.8 and 4.0 points over Opus 4.7 are hard to ignore for workloads that involve searching through large codebases, long legal documents, or extended conversation histories. This is where GPT-5.5's long-context architecture earns its keep.
Agentic coding: strong but not dominant
On SWE-bench Verified, GPT-5.5 scores 88.7% — the highest among all five benchmarked models and just ahead of Claude Opus 4.8's 88.6%. This is essentially a tie; both models perform exceptionally well on the verified subset.
The picture changes on SWE-bench Pro, the harder real-world eval. GPT-5.5 scores 58.6%, which is 10.6 points behind Opus 4.8's 69.2%. SWE-bench Verified is saturating above 88%; SWE-bench Pro is where differentiation is meaningful. For complex, production-grade issue resolution, Opus 4.8 has a real advantage.
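One rough way to see why a saturated benchmark stops separating models is to compare the spread between models with the headroom left below 100%. The sketch below does this for the two SWE-bench variants, using only the GPT-5.5 and Opus 4.8 scores quoted in this post; `spread` and `headroom` are illustrative helper names, and a two-model spread is only a crude proxy for how discriminating a benchmark is.

```python
# SWE-bench scores as quoted in this post (percent), for the two models compared here.
SWE_BENCH = {
    "SWE-bench Verified": {"GPT-5.5": 88.7, "Claude Opus 4.8": 88.6},
    "SWE-bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.8": 69.2},
}


def spread(scores):
    """Gap between the best and worst quoted score, in percentage points."""
    return round(max(scores.values()) - min(scores.values()), 1)


def headroom(scores):
    """Distance from the best quoted score to a perfect 100%."""
    return round(100.0 - max(scores.values()), 1)


if __name__ == "__main__":
    for name, scores in SWE_BENCH.items():
        print(f"{name}: spread {spread(scores)} pts, headroom {headroom(scores)} pts")
```

On these figures, SWE-bench Verified shows a 0.1-point spread with 11.3 points of headroom, while SWE-bench Pro shows a 10.6-point spread with 30.8 points of headroom.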
For a full head-to-head breakdown, see the Claude Opus 4.8 vs GPT-5.5 comparison or the best LLM for coding.
Reasoning and knowledge
On GPQA Diamond, GPT-5.5 scores 93.6% — exactly tied with Opus 4.8 and just below Gemini 3.1 Pro's 94.3% and Mythos's 94.6%. All models are tightly clustered in the 93–95% range. GPQA Diamond no longer separates the top tier.
Humanity's Last Exam is the harder test. GPT-5.5 scores 41.4% without tools and 52.2% with tools. Opus 4.8 scores 49.8% / 57.9% and Gemini 3.1 Pro scores 44.4% / 51.4%. On the hardest reasoning tasks, GPT-5.5 trails Opus 4.8 by 8.4 points unaided and 5.7 points with tools. Against Gemini 3.1 Pro it trails by 3.0 points unaided but leads by 0.8 points with tools. Read the best LLM for reasoning for a deeper analysis.
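The paired HLE scores also show how much each model gains from tool access. The following sketch derives that uplift and the deficit to Opus 4.8 from the figures quoted in this post; the helper names `tool_uplift` and `deficit` are hypothetical and exist only for this illustration.

```python
# Humanity's Last Exam scores quoted in this post: (without tools, with tools), in percent.
HLE = {
    "GPT-5.5": (41.4, 52.2),
    "Claude Opus 4.8": (49.8, 57.9),
    "Gemini 3.1 Pro": (44.4, 51.4),
}


def tool_uplift(model):
    """Points gained on HLE when tools are enabled."""
    without_tools, with_tools = HLE[model]
    return round(with_tools - without_tools, 1)


def deficit(model, reference):
    """How far `model` trails `reference`, as (without tools, with tools) in points."""
    return tuple(round(r - m, 1) for m, r in zip(HLE[model], HLE[reference]))


if __name__ == "__main__":
    for name in HLE:
        print(f"{name}: +{tool_uplift(name)} points with tools")
    print("GPT-5.5 deficit to Opus 4.8:", deficit("GPT-5.5", "Claude Opus 4.8"))
```

By this measure GPT-5.5 gains 10.8 points from tools, compared with 8.1 for Opus 4.8 and 7.0 for Gemini 3.1 Pro, which is why its deficit narrows in the with-tools column.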
Tool use and web search
On BrowseComp (web search and research), GPT-5.5 scores 84.4% — essentially tied with Opus 4.8's 84.3% and just below Gemini 3.1 Pro's 85.9%. Web search quality is competitive across all three flagship models.
On MCP-Atlas (structured tool-calling), GPT-5.5 scores 75.3%, which is 6.9 points below Opus 4.8's 82.2% and 2.9 points below Gemini 3.1 Pro's 78.2%. For agentic systems that chain together external APIs and tools, this gap matters.
Key takeaways
- Best shipping model for terminal automation: 78.2% on Terminal-Bench 2.1, leading Opus 4.8 by 3.6 points and Gemini by nearly 8 points.
- Strongest long-context retrieval: 74.3% on AA-LCR, 6.6 points ahead of Opus 4.8 — a real advantage for large-document workloads.
- Ties on SWE-bench Verified: 88.7% is effectively tied with Opus 4.8's 88.6%; on the harder SWE-bench Pro, GPT-5.5 trails Opus 4.8 by 10.6 points (58.6% vs. 69.2%).
- Solid but not leading on reasoning: 93.6% on GPQA Diamond ties Opus 4.8; HLE scores trail Opus 4.8 by 8.4 points without tools and 5.7 points with tools.
- Trails on tool use and computer use: MCP-Atlas favors Opus 4.8 by 6.9 points (82.2% vs. 75.3%), and OSWorld-Verified also favors Opus 4.8.
- Explore the full model profile on the GPT-5.5 hub page or browse all scores in the live benchmark comparison table.
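The takeaways above amount to a lookup from workload to benchmark leader. A minimal sketch of that lookup follows, using only scores quoted in this post. The workload labels and their mapping to benchmarks are assumptions made for illustration, and "leader" means the top score among the models quoted here, not among every model that exists.

```python
# Scores quoted in this post (percent). Models that a benchmark section does not
# quote are left out, so "leader" means leader among the quoted scores only.
SCORES = {
    "Terminal-Bench 2.1": {"GPT-5.5": 78.2, "Claude Opus 4.8": 74.6, "Gemini 3.1 Pro": 70.3},
    "AA-LCR": {"GPT-5.5": 74.3, "Claude Opus 4.8": 67.7, "Opus 4.7": 70.3},
    "SWE-bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.8": 69.2},
    "BrowseComp": {"GPT-5.5": 84.4, "Claude Opus 4.8": 84.3, "Gemini 3.1 Pro": 85.9},
    "MCP-Atlas": {"GPT-5.5": 75.3, "Claude Opus 4.8": 82.2, "Gemini 3.1 Pro": 78.2},
}

# Hypothetical workload labels; the mapping to benchmarks is an assumption.
WORKLOAD_TO_BENCHMARK = {
    "shell automation": "Terminal-Bench 2.1",
    "large-document search": "AA-LCR",
    "hard issue resolution": "SWE-bench Pro",
    "web research": "BrowseComp",
    "multi-tool agents": "MCP-Atlas",
}


def leader_for(workload):
    """Return (benchmark, model, score) for the top quoted score on the mapped benchmark."""
    benchmark = WORKLOAD_TO_BENCHMARK[workload]
    scores = SCORES[benchmark]
    model = max(scores, key=scores.get)
    return benchmark, model, scores[model]


if __name__ == "__main__":
    for workload in WORKLOAD_TO_BENCHMARK:
        benchmark, model, score = leader_for(workload)
        print(f"{workload}: {model} ({score}% on {benchmark})")
```

A single benchmark is a coarse proxy for a real workload, so a lookup like this is best treated as a starting point for your own evaluation rather than a final answer.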