GPT-5.5 vs Opus 4.7

GPT-5.5 and Opus 4.7 are evenly matched: each wins 6 of the directly comparable benchmarks.

Head-to-head record: GPT-5.5 6, Opus 4.7 6

Models and prices:
Fable 5 ($10/$50)
Opus 4.8 ($5/$25)
GPT-5.5 ($5/$30)
Opus 4.7 ($5/$25)
Gemini 3.1 Pro ($2/$12)
Mythos Preview
Agentic coding: SWE-bench Pro
Fable 5: 80.3%, Opus 4.8: 69.2%, GPT-5.5: 58.6%, Opus 4.7: 64.3%, Gemini 3.1 Pro: 54.2%, Mythos Preview: 77.8%
Agentic coding: SWE-bench Verified
Fable 5: 95.0%, Opus 4.8: 88.6%, GPT-5.5: 88.7%, Opus 4.7: 87.6%, Gemini 3.1 Pro: 80.6%, Mythos Preview: 93.9%
Agentic coding: FrontierCode (Diamond)
29.3%, 13.4%, 5.7%
Long context reasoning: AA-LCR
67.7%, 74.3%, 70.3%
Agentic terminal coding: Terminal-Bench 2.1
Fable 5: 88.0%, Opus 4.8: 74.6%, GPT-5.5: 78.2%, Opus 4.7: 66.1%, Gemini 3.1 Pro: 70.3%, Mythos Preview: 82.0%
Multidisciplinary reasoning: Humanity's Last Exam
Fable 5: 59.0% (no tools), 64.5% (with tools)
Opus 4.8: 49.8% (no tools), 57.9% (with tools)
GPT-5.5: 41.4% (no tools, Pro), 52.2% (with tools, Pro)
Opus 4.7: 46.9% (no tools), 54.7% (with tools)
Gemini 3.1 Pro: 44.4% (no tools), 51.4% (with tools)
Mythos Preview: 56.8% (no tools), 64.7% (with tools)
Agentic search: BrowseComp
84.3%, 84.4%, 79.8%, 85.9%, 86.9%
Scaled tool use: MCP-Atlas
82.2%, 75.3%, 79.1%, 78.2%
Tool use: AutomationBench
17.4%, 15.5%, 12.9%, 9.6%
Agentic computer use: OSWorld-Verified
Fable 5: 85.0%, Opus 4.8: 83.4%, GPT-5.5: 78.7%, Opus 4.7: 82.8%, Gemini 3.1 Pro: 76.2%, Mythos Preview: 85.4%
Spatial reasoning: Blueprint-Bench 2
38.6%, 14.5%, 36.2%, 26.5%
Agentic financial analysis: Finance Agent v2
53.9%, 51.8%, 51.5%, 43.0%
Knowledge work: GDPval-AA
1932, 1890, 1769, 1314
Knowledge work vision: GDPpdf
29.8%, 22.5%, 24.9%, 16.7%
Legal: Legal Agent Benchmark
13.3%, 10.4%, 2.1%, 0.0%
Cybersecurity vulnerability reproduction: CyberGym
83.8%, 78.8%, 81.8%, 73.1%, 83.1%
Cybersecurity: ExploitBench (Cap%)
78.0%, 40.0%, 34.0%, 69.0%
Biology: BioMysteryBench
46.1% (hard), 83.9% (human solved)
40.0% (hard), 80.4% (human solved)
29.6% (hard), 82.6% (human solved)
Health: HealthBench Professional
66.0%, 56.9%, 51.8%, 64.7%
Graduate-level reasoning: GPQA Diamond
93.6%, 93.6%, 94.2%, 94.3%, 94.6%
Visual reasoning: CharXiv Reasoning
80.5% (no tools), 89.9% (with tools)
81.3% (no tools), 90.1% (with tools)
86.1% (no tools), 93.2% (with tools)
Multilingual Q&A: MMMLU
83.2%, 91.5%, 92.6%
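
The head-to-head record at the top of the page is a count of benchmark wins. The page does not say how that count is computed, so the following is only a minimal sketch of one plausible tally, restricted to the rows that list a score for all six models (where the GPT-5.5 and Opus 4.7 values can be read off unambiguously). It assumes scores appear in the same order as the model list, that a higher score wins, and that the two Humanity's Last Exam settings count separately. The names SCORES and head_to_head are hypothetical, introduced here for illustration.

```python
# Scores as (GPT-5.5, Opus 4.7), taken only from rows with a value for every model.
# Assumption: values are listed in the same order as the models in the header.
# Note: the GPT-5.5 Humanity's Last Exam figures are the ones marked "(Pro)".
SCORES = {
    "SWE-bench Pro": (58.6, 64.3),
    "SWE-bench Verified": (88.7, 87.6),
    "Terminal-Bench 2.1": (78.2, 66.1),
    "Humanity's Last Exam (no tools)": (41.4, 46.9),
    "Humanity's Last Exam (with tools)": (52.2, 54.7),
    "OSWorld-Verified": (78.7, 82.8),
}


def head_to_head(scores):
    """Count wins per model; the higher score wins, equal scores are ties."""
    tally = {"GPT-5.5": 0, "Opus 4.7": 0, "tie": 0}
    for gpt, opus in scores.values():
        if gpt > opus:
            tally["GPT-5.5"] += 1
        elif opus > gpt:
            tally["Opus 4.7"] += 1
        else:
            tally["tie"] += 1
    return tally


print(head_to_head(SCORES))
```

On this subset the tally is 2 wins for GPT-5.5 and 4 for Opus 4.7. That is not the page's full 6 to 6 record, which also draws on benchmarks where only some models report a score.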