GPT-5.6 Sol Benchmarks: Terminal-Bench SOTA and a New Tiered Lineup

OpenAI previewed GPT-5.6 Sol, Terra, and Luna. Sol sets a Terminal-Bench 2.1 record with new max and ultra modes. Every published number so far, with context.

On June 26, 2026, OpenAI began a limited preview of the GPT-5.6 series: Sol, the flagship; Terra, a balanced model for everyday work; and Luna, a fast, low-cost tier. Access starts with a small group of trusted partners through the API and Codex, with broader availability promised "in the coming weeks." This post covers every number OpenAI actually published, the new reasoning modes, pricing, and — just as importantly — what the preview did not disclose.

You can see GPT-5.6 Sol slotted into the live benchmark comparison alongside Claude Fable 5, Opus 4.8, Gemini 3.1 Pro, and Mythos Preview. For a refresher on how these evals work, start with the complete guide to LLM benchmarks.

A new naming system: generation plus capability tier

GPT-5.6 introduces a naming change worth understanding. The number (5.6) marks the generation, while Sol, Terra, and Luna are durable capability tiers (roughly flagship, balanced, and fast) that can advance on their own cadence. OpenAI says Terra matches GPT-5.5 quality at about half the price, and Luna brings strong capability at the lowest cost. Our comparison tracks the flagship, GPT-5.6 Sol.

Two new reasoning modes: max and ultra

GPT-5.6 ships with a new max reasoning effort that gives the model the most time to think within a single agent, plus an ultra mode that goes beyond one agent by spinning up subagents to parallelize complex work. These two modes explain the split scores you will see below: max is the deep single-agent number, and ultra is the multi-agent ceiling.
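OpenAI has not published how ultra coordinates its subagents. As a generic illustration of the fan-out pattern that description suggests, the Python sketch below contrasts working through subtasks one at a time with handing them to parallel workers. Everything here is hypothetical: `run_subagent` is a stand-in that makes no API calls, and no GPT-5.6 parameter names are assumed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(subtask: str) -> str:
    """Hypothetical stand-in for one subagent; a real one would call a model."""
    return f"done: {subtask}"


def run_sequential(subtasks: list[str]) -> list[str]:
    """One worker handles every subtask in turn."""
    return [run_subagent(task) for task in subtasks]


def run_fan_out(subtasks: list[str], workers: int = 4) -> list[str]:
    """A coordinator hands subtasks to parallel workers.

    pool.map returns results in input order, so the caller can merge
    them exactly as it would merge sequential results.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_subagent, subtasks))


if __name__ == "__main__":
    tasks = ["reproduce the failure", "bisect the commits", "draft the patch"]
    print(run_fan_out(tasks))
```

Both functions return the same results in the same order; only the wall-clock time differs when subtasks are slow. That is one plausible reason a multi-agent score can sit above a single-agent one, though the source does not say why the two numbers differ.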

Terminal-Bench 2.1: the headline result

The clearest win is on Terminal-Bench 2.1, which tests command-line workflows that require planning, iteration, and tool coordination. GPT-5.6 Sol sets a new state of the art: 88.8% at the new max effort and 91.9% with ultra mode. That tops every other model in the table, including Claude Fable 5 (88.0%), Mythos Preview (82.0%), GPT-5.5 (78.2%), and Opus 4.8 (74.6%).
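To make the margins concrete, here is a small Python sketch that computes Sol's lead over each model in percentage points, using only the scores quoted above. The variable and function names are illustrative and not part of any published tooling.

```python
# Terminal-Bench 2.1 scores (percent) as quoted in this post.
SOL_MAX = 88.8
SOL_ULTRA = 91.9
OTHERS = {
    "Claude Fable 5": 88.0,
    "Mythos Preview": 82.0,
    "GPT-5.5": 78.2,
    "Opus 4.8": 74.6,
}


def margins(sol_score: float) -> dict[str, float]:
    """Lead of a Sol score over each other model, in percentage points."""
    return {name: round(sol_score - score, 1) for name, score in OTHERS.items()}


max_margins = margins(SOL_MAX)
ultra_margins = margins(SOL_ULTRA)

for name in OTHERS:
    print(f"{name:15s} max +{max_margins[name]:.1f} pts, ultra +{ultra_margins[name]:.1f} pts")
```

At max effort the lead over Claude Fable 5 is 0.8 points; ultra widens it to 3.9. The gap to GPT-5.5 is 10.6 points at max and 13.7 with ultra.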

Terminal-Bench was already GPT-5.5's strongest category (see the GPT-5.5 benchmark breakdown), so a generational jump here lands squarely in OpenAI's wheelhouse. For teams building CI/CD automation, DevOps pipelines, or shell-driven agents, this is the number that matters most.

Cybersecurity: a real step change

OpenAI frames GPT-5.6 Sol as its most capable cybersecurity model yet, with a focus on defensive work: vulnerability research, patching, and code review. On ExploitBench, OpenAI reports Sol as competitive with Mythos Preview while using only about one-third of the output tokens, a large efficiency gain at a similar capability level (roughly 69% on the Cap% metric, about double GPT-5.5's 34%). Sol also showed strong gains that scale with reasoning effort on ExploitGym, a benchmark built by UC Berkeley researchers with OpenAI and other labs.

Critically, OpenAI says Sol does not cross the "Cyber Critical" threshold of its Preparedness Framework: in Chromium and Firefox tests it found bugs and exploitation primitives but did not autonomously produce a full-chain exploit. The model launches with a layered safeguard stack and a phased release, backed by more than 700,000 A100-equivalent GPU hours of automated red-teaming.

Biology and health

OpenAI highlighted biology gains on GeneBench v1, a long-horizon genomics and quantitative-biology eval, where Sol reportedly beats GPT-5.5 while using fewer tokens (OpenAI did not publish a specific score). On HealthBench Professional, the physician-graded tier, Sol scores 60.5 (length-adjusted). That is up 8.7 points from GPT-5.5's 51.8, and it leaves Sol 4.2 points behind Mythos Preview (64.7) and 5.5 points behind Fable 5 (66.0).

What OpenAI did not publish

This is a preview, not a full launch, and OpenAI was explicit that it would "share an expanded suite of evaluation results" at general availability. As of the preview, there are no published GPT-5.6 Sol numbers for SWE-bench Verified, SWE-bench Pro, GPQA Diamond, Humanity's Last Exam, MMMLU, or most agentic and tool-use evals. We have left those rows blank in the table rather than guess — any specific figure circulating for those benchmarks today is unsourced. We will fill them in as OpenAI releases verified results.

Pricing and availability

GPT-5.6 is priced per million tokens across the three tiers: Sol at $5 input / $30 output (matching GPT-5.5), Terra at $2.50 / $15, and Luna at $1 / $6. The release also brings more predictable prompt caching — explicit cache breakpoints and a 30-minute minimum cache life, with cache writes billed at 1.25x the uncached input rate and cache reads keeping the 90% discount. OpenAI also plans to launch Sol on Cerebras at up to 750 tokens per second in July.
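As a rough illustration of how those rates combine, the Python sketch below estimates the cost of a single request. The prices and cache multipliers come from the figures above; the split into three input billing buckets and the `request_cost` helper are assumptions made for the example, and OpenAI's actual billing rules (minimum cacheable prefix size, for instance) may differ.

```python
# Per-million-token list prices in USD as quoted in this post: (input, output).
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}
CACHE_WRITE_MULTIPLIER = 1.25  # cache writes: 1.25x the uncached input rate
CACHE_READ_MULTIPLIER = 0.10   # cache reads: 90% discount on the input rate


def request_cost(tier: str, uncached_in: int = 0, cache_write: int = 0,
                 cache_read: int = 0, output: int = 0) -> float:
    """Estimated USD cost of one request, given token counts per bucket.

    Assumption: every input token falls into exactly one of the three
    input buckets (uncached, cache write, cache read).
    """
    in_rate, out_rate = PRICES[tier]
    micro_dollars = (
        uncached_in * in_rate
        + cache_write * in_rate * CACHE_WRITE_MULTIPLIER
        + cache_read * in_rate * CACHE_READ_MULTIPLIER
        + output * out_rate
    )
    return micro_dollars / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a 100,000-token shared prefix, 2,000 output tokens.
    first_call = request_cost("sol", cache_write=100_000, output=2_000)
    cached_call = request_cost("sol", cache_read=100_000, output=2_000)
    no_cache = request_cost("sol", uncached_in=100_000, output=2_000)
    print(f"first call (writes cache): ${first_call:.3f}")
    print(f"repeat call (reads cache): ${cached_call:.3f}")
    print(f"same call without caching: ${no_cache:.3f}")
```

For this hypothetical workload on Sol, the first call comes to about $0.685, a cached repeat to about $0.11, and the same request without caching to about $0.56. Under these assumptions the 25% write premium ($0.125 here) is recovered on the first cache hit, which saves $0.45.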

During the preview, the models are available only to select partners via the API and Codex. For head-to-head context, see Claude Opus 4.8 vs GPT-5.6 Sol and Claude Fable 5 vs GPT-5.6 Sol.

Key takeaways

  • New Terminal-Bench 2.1 record: 88.8% (max) and 91.9% (ultra), the top of the table and ahead of Fable 5, Mythos Preview, and GPT-5.5.
  • Cybersecurity efficiency leap: competitive with Mythos Preview on ExploitBench using roughly a third of the output tokens, without crossing OpenAI's Cyber Critical threshold.
  • Health gains: 60.5 on HealthBench Professional, up 8.7 points over GPT-5.5.
  • New modes: a deeper max reasoning effort and a multi-agent ultra mode, plus a Sol / Terra / Luna tier system.
  • Narrow preview: coding, cyber, biology, health, and safety results only. There are no SWE-bench, GPQA, HLE, or MMMLU numbers yet; a fuller suite arrives at general availability.
  • Explore the full profile on the GPT-5.6 Sol hub page or browse every score in the live benchmark comparison table.
