GPT-5.5 vs Opus 4.7

GPT-5.5 and Opus 4.7 are evenly matched: each wins 6 of the 12 benchmarks on which both models report a score.

Head-to-head record: GPT-5.5 6, Opus 4.7 6
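The head-to-head record can be reproduced with a short tally. This is a minimal sketch, not the page's own method: the scores are copied from the GPT-5.5 and Opus 4.7 figures reported on this page, the function name `head_to_head` is illustrative, and Humanity's Last Exam is assumed to count once, using the no-tools figures (the GPT-5.5 figure there is its Pro variant). Opus 4.7 also leads on the with-tools figures, so that choice does not change the result.

```python
from typing import Optional

# (GPT-5.5, Opus 4.7) scores copied from this page; None means "not reported".
Scores = dict[str, tuple[Optional[float], Optional[float]]]

SCORES: Scores = {
    "SWE-bench Pro": (58.6, 64.3),
    "SWE-bench Verified": (88.7, 87.6),
    "FrontierCode (Diamond)": (5.7, None),
    "AA-LCR": (74.3, 70.3),
    "Terminal-Bench 2.1": (78.2, 66.1),
    # Assumption: HLE counts once, using the no-tools figures.
    "Humanity's Last Exam (no tools)": (41.4, 46.9),
    "BrowseComp": (84.4, 79.8),
    "MCP-Atlas": (75.3, 79.1),
    "AutomationBench": (12.9, None),
    "OSWorld-Verified": (78.7, 82.8),
    "Blueprint-Bench 2": (36.2, None),
    "Finance Agent v2": (51.8, 51.5),
    "GDPval-AA": (1769, None),
    "GDPpdf": (24.9, None),
    "Legal Agent Benchmark": (2.1, None),
    "CyberGym": (81.8, 73.1),
    "ExploitBench (Cap%)": (34.0, None),
    "BioMysteryBench": (None, None),
    "HealthBench Professional": (51.8, None),
    "GPQA Diamond": (93.6, 94.2),
    "CharXiv Reasoning (no tools)": (None, 81.3),
    "MMMLU": (83.2, 91.5),
}


def head_to_head(scores: Scores) -> tuple[int, int, int]:
    """Return (wins for first model, wins for second model, ties)."""
    wins_a = wins_b = ties = 0
    for a, b in scores.values():
        if a is None or b is None:
            continue  # skip benchmarks where either model has no score
        if a > b:
            wins_a += 1
        elif b > a:
            wins_b += 1
        else:
            ties += 1
    return wins_a, wins_b, ties


print(head_to_head(SCORES))  # (6, 6, 0)
```

Benchmarks where either model has no reported score are skipped, which is why only 12 of the listed benchmarks enter the count.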


| Category | Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Opus 4.7 | Gemini 3.1 Pro | Mythos Preview |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Price | per Mtok | $10/$50 | $5/$25 | $5/$30 | $5/$25 | $2/$12 | — |
| Agentic coding | SWE-bench Pro | 80.3% | 69.2% | 58.6% | 64.3% | 54.2% | 77.8% |
| Agentic coding | SWE-bench Verified | 95.0% | 88.6% | 88.7% | 87.6% | 80.6% | 93.9% |
| Agentic coding | FrontierCode (Diamond) | 29.3% | 13.4% | 5.7% | — | — | — |
| Long context reasoning | AA-LCR | — | 67.7% | 74.3% | 70.3% | — | — |
| Agentic terminal coding | Terminal-Bench 2.1 | 88.0% | 74.6% | 78.2% | 66.1% | 70.3% | 82.0% |
| Multidisciplinary reasoning | Humanity's Last Exam (no tools) | 59.0% | 49.8% | 41.4% (Pro) | 46.9% | 44.4% | 56.8% |
| Multidisciplinary reasoning | Humanity's Last Exam (with tools) | 64.5% | 57.9% | 52.2% (Pro) | 54.7% | 51.4% | 64.7% |
| Agentic search | BrowseComp | — | 84.3% | 84.4% | 79.8% | 85.9% | 86.9% |
| Scaled tool use | MCP-Atlas | — | 82.2% | 75.3% | 79.1% | 78.2% | — |
| Tool use | AutomationBench | 17.4% | 15.5% | 12.9% | — | 9.6% | — |
| Agentic computer use | OSWorld-Verified | 85.0% | 83.4% | 78.7% | 82.8% | 76.2% | 85.4% |
| Spatial reasoning | Blueprint-Bench 2 | 38.6% | 14.5% | 36.2% | — | 26.5% | — |
| Agentic financial analysis | Finance Agent v2 | — | 53.9% | 51.8% | 51.5% | 43.0% | — |
| Knowledge work | GDPval-AA | 1932 | 1890 | 1769 | — | 1314 | — |
| Knowledge work vision | GDPpdf | 29.8% | 22.5% | 24.9% | — | 16.7% | — |
| Legal | Legal Agent Benchmark | 13.3% | 10.4% | 2.1% | — | 0.0% | — |
| Cybersecurity vulnerability reproduction | CyberGym | 83.8% | 78.8% | 81.8% | 73.1% | — | 83.1% |
| Cybersecurity | ExploitBench (Cap%) | 78.0% | 40.0% | 34.0% | — | — | 69.0% |
| Biology | BioMysteryBench (hard) | 46.1% | 40.0% | — | — | — | 29.6% |
| Biology | BioMysteryBench (human solved) | 83.9% | 80.4% | — | — | — | 82.6% |
| Health | HealthBench Professional | 66.0% | 56.9% | 51.8% | — | — | 64.7% |
| Graduate-level reasoning | GPQA Diamond | — | 93.6% | 93.6% | 94.2% | 94.3% | 94.6% |
| Visual reasoning | CharXiv Reasoning (no tools) | — | 80.5% | — | 81.3% | — | 86.1% |
| Visual reasoning | CharXiv Reasoning (with tools) | — | 89.9% | — | 90.1% | — | 93.2% |
| Multilingual Q&A | MMMLU | — | — | 83.2% | 91.5% | 92.6% | — |
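The per-Mtok prices listed above can be turned into a rough per-request cost. The sketch below assumes that each price pair is input price / output price in US dollars per million tokens, which the page does not state explicitly. The function name `request_cost` and the 200,000-input / 20,000-output token workload are hypothetical, chosen only for illustration. Mythos Preview is left out because no price is listed for it.

```python
# USD per million tokens, copied from this page.
# Assumption: the first figure is the input price, the second the output price.
PRICES: dict[str, tuple[float, float]] = {
    "Fable 5": (10.0, 50.0),
    "Opus 4.8": (5.0, 25.0),
    "GPT-5.5": (5.0, 30.0),
    "Opus 4.7": (5.0, 25.0),
    "Gemini 3.1 Pro": (2.0, 12.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of one request at the listed per-Mtok prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 200,000 input tokens and 20,000 output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 200_000, 20_000):.2f}")
```

Under these assumptions GPT-5.5 and Opus 4.7 cost the same for input-heavy requests and differ only in the output term, so the gap between them grows with the share of output tokens.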