LLM Boss — Frontier LLM Benchmarks Compared

Fable 5 vs GPT-5.6 Sol

| Category | Benchmark | Fable 5 ($10/$50 per Mtok) | Opus 4.8 ($5/$25 per Mtok) | Sonnet 5 ($3/$15 per Mtok) | GPT-5.6 Sol ($5/$30 per Mtok) |
| --- | --- | --- | --- | --- | --- |
| Agentic coding | DeepSWE v1.1 | 69.7% | 59.0% | 54.0% | 72.7% |
| Agentic coding | SWE-bench Pro | 80.3% | 69.2% | 63.2% | 64.6% |
| Agentic coding | SWE-bench Verified | 95.0% | 88.6% | 79.6% | 96.2% |
| Agentic coding | SWE-bench Multilingual | — | — | — | — |
| Agentic coding | FrontierCode (Diamond) | 29.3% | 13.4% | — | — |
| Long context reasoning | AA-LCR | — | 67.7% | — | — |
| Agentic terminal coding | Terminal-Bench 2.1 | 88.0% | 82.7% | 80.4% | 88.8% (max), 91.9% (ultra) |
| Multidisciplinary reasoning | Humanity's Last Exam | 59.0% (no tools), 64.5% (with tools) | 49.8% (no tools), 57.9% (with tools) | 43.2% (no tools), 57.4% (with tools) | — |
| Agentic search | BrowseComp | — | 84.3% | 84.7% | 90.4% (default), 92.2% (ultra) |
| Scaled tool use | MCP-Atlas | — | 82.2% | — | — |
| Tool use | AutomationBench | 17.4% | 15.5% | 13.5% | 18.1% |
| Agentic computer use | OSWorld-Verified | 85.0% | 83.4% | 81.2% | — |
| Spatial reasoning | Blueprint-Bench 2 | 38.6% | 14.5% | — | — |
| Agentic financial analysis | Finance Agent v2 | — | 53.9% | — | — |
| Knowledge work | GDPval-AA | 1932 | 1890 | — | — |
| Knowledge work vision | GDPpdf | 29.8% | 22.5% | — | 30.7% |
| Legal | Legal Agent Benchmark | 13.3% | 10.4% | 5.8% | — |
| Cybersecurity vulnerability reproduction | CyberGym | 83.8% | 78.8% | — | 84.5% |
| Cybersecurity | ExploitBench (Cap%) | 78.0% | 40.0% | — | 73.5% |
| Biology | BioMysteryBench | 46.1% (hard), 83.9% (human solved) | 40.0% (hard), 80.4% (human solved) | — | — |
| Health | HealthBench Professional | 66.0% | 56.9% | 57.8% | 60.5% |
| Graduate-level reasoning | GPQA Diamond | 94.1% | 93.6% | — | 94.6% |
| Visual reasoning | CharXiv Reasoning | — | 80.5% (no tools), 89.9% (with tools) | 77.0% (no tools), 88.3% (with tools) | — |
| Multilingual Q&A | MMMLU | — | — | — | — |