GPT-5.5 vs Mythos Preview

Mythos Preview leads GPT-5.5, winning 8 of the 9 benchmarks on which both models report a score.

Head-to-head record: GPT-5.5 1, Mythos Preview 8
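The head-to-head record can be reproduced from the scores this page lists for GPT-5.5 and Mythos Preview. The following is a minimal sketch, not the site's actual scoring method. It assumes that the higher score wins each benchmark, that only benchmarks where both models report a value are counted, and that Humanity's Last Exam counts once using the no-tools figures (the GPT-5.5 figure is the one labeled "Pro"; the with-tools figures, 52.2% vs 64.7%, give the same winner). The names `SCORES` and `head_to_head` are hypothetical.

```python
# Scores copied from this page as (GPT-5.5, Mythos Preview), in percent.
# Only benchmarks where both models report a value are included.
SCORES = {
    "SWE-bench Pro": (59.4, 77.8),
    "Terminal-Bench 2.1": (83.4, 82.0),
    # Assumption: HLE is counted once, using the no-tools figures.
    "Humanity's Last Exam (no tools)": (41.4, 56.8),
    "BrowseComp": (84.4, 86.9),
    "OSWorld-Verified": (78.7, 85.4),
    "CyberGym": (81.8, 83.1),
    "ExploitBench (Cap%)": (47.9, 69.0),
    "HealthBench Professional": (49.5, 64.7),
    "GPQA Diamond": (93.6, 94.6),
}


def head_to_head(scores):
    """Return (wins_a, wins_b, ties), treating the higher score as the winner."""
    wins_a = sum(1 for a, b in scores.values() if a > b)
    wins_b = sum(1 for a, b in scores.values() if b > a)
    ties = len(scores) - wins_a - wins_b
    return wins_a, wins_b, ties


gpt_wins, mythos_wins, ties = head_to_head(SCORES)
print(f"GPT-5.5 {gpt_wins}, Mythos Preview {mythos_wins}, ties {ties}")
# prints "GPT-5.5 1, Mythos Preview 8, ties 0"
```

Under these assumptions the only benchmark GPT-5.5 wins is Terminal-Bench 2.1 (83.4% vs 82.0%).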

| Category | Benchmark | Fable 5 ($10/$50 per Mtok) | Opus 4.8 ($5/$25 per Mtok) | Sonnet 5 ($3/$15 per Mtok) | GPT-5.6 Sol ($5/$30 per Mtok) | GPT-5.5 ($5/$30 per Mtok) | Composer 2.5 ($0.5/$2.5 per Mtok) | Opus 4.7 ($5/$25 per Mtok) | Gemini 3.1 Pro ($2/$12 per Mtok) | Mythos Preview |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Agentic coding | DeepSWE v1.1 | 69.7% | 59.0% | 54.0% | 72.7% | 67.0% | — | — | 11.8% | — |
| Agentic coding | SWE-bench Pro | 80.3% | 69.2% | 63.2% | 64.6% | 59.4% | — | 64.3% | 54.2% | 77.8% |
| Agentic coding | SWE-bench Verified | 95.0% | 88.6% | 79.6% | 96.2% | 82.6% | 79.6% | 82.0% | 78.8% | — |
| Agentic coding | SWE-bench Multilingual | — | — | — | — | 77.8% | 79.8% | 80.5% | — | — |
| Agentic coding | FrontierCode (Diamond) | 29.3% | 13.4% | — | — | 5.7% | — | — | — | — |
| Long context reasoning | AA-LCR | — | 67.7% | — | — | 74.3% | — | 70.3% | — | — |
| Agentic terminal coding | Terminal-Bench 2.1 | 88.0% | 82.7% | 80.4% | 88.8% (max), 91.9% (ultra) | 83.4% | — | 66.1% | 70.3% | 82.0% |
| Multidisciplinary reasoning | Humanity's Last Exam | 59.0% (no tools), 64.5% (with tools) | 49.8% (no tools), 57.9% (with tools) | 43.2% (no tools), 57.4% (with tools) | — | 41.4% (no tools, Pro), 52.2% (with tools, Pro) | — | 46.9% (no tools), 54.7% (with tools) | 44.4% (no tools), 51.4% (with tools) | 56.8% (no tools), 64.7% (with tools) |
| Agentic search | BrowseComp | — | 84.3% | 84.7% | 90.4% (default), 92.2% (ultra) | 84.4% | — | 79.8% | 85.9% | 86.9% |
| Scaled tool use | MCP-Atlas | — | 82.2% | — | — | 75.3% | — | 79.1% | 78.2% | — |
| Tool use | AutomationBench | 17.4% | 15.5% | 13.5% | 18.1% | 12.9% | — | — | 9.6% | — |
| Agentic computer use | OSWorld-Verified | 85.0% | 83.4% | 81.2% | — | 78.7% | — | 82.8% | 76.2% | 85.4% |
| Spatial reasoning | Blueprint-Bench 2 | 38.6% | 14.5% | — | — | 36.2% | — | — | 26.5% | — |
| Agentic financial analysis | Finance Agent v2 | — | 53.9% | — | — | 51.8% | — | 51.5% | 43.0% | — |
| Knowledge work | GDPval-AA | 1932 | 1890 | — | — | 1769 | — | — | 1314 | — |
| Knowledge work vision | GDPpdf | 29.8% | 22.5% | — | 30.7% | 26.0% | — | — | 16.7% | — |
| Legal | Legal Agent Benchmark | 13.3% | 10.4% | 5.8% | — | 2.1% | — | — | 0.0% | — |
| Cybersecurity vulnerability reproduction | CyberGym | 83.8% | 78.8% | — | 84.5% | 81.8% | — | 73.1% | — | 83.1% |
| Cybersecurity | ExploitBench (Cap%) | 78.0% | 40.0% | — | 73.5% | 47.9% | — | — | — | 69.0% |
| Biology | BioMysteryBench | 46.1% (hard), 83.9% (human solved) | 40.0% (hard), 80.4% (human solved) | — | — | — | — | — | — | 29.6% (hard), 82.6% (human solved) |
| Health | HealthBench Professional | 66.0% | 56.9% | 57.8% | 60.5% | 49.5% | — | — | — | 64.7% |
| Graduate-level reasoning | GPQA Diamond | 94.1% | 93.6% | — | 94.6% | 93.6% | — | 94.2% | 94.3% | 94.6% |
| Visual reasoning | CharXiv Reasoning | — | 80.5% (no tools), 89.9% (with tools) | 77.0% (no tools), 88.3% (with tools) | — | — | — | 81.3% (no tools), 90.1% (with tools) | — | 86.1% (no tools), 93.2% (with tools) |
| Multilingual Q&A | MMMLU | — | — | — | — | 83.2% | — | 91.5% | 92.6% | — |