Frontier models we benchmark

Per-model scorecards across coding, agentic, reasoning and multilingual evaluations.

Anthropic / Fable 5: Anthropic’s Mythos-class flagship and its most capable generally available model, with safeguards that fall back to Opus 4.8 in sensitive domains.
Anthropic / Opus 4.8: Anthropic’s Opus-class flagship, strongest on agentic coding, tool use and computer use.
Anthropic / Sonnet 5: Anthropic’s most agentic Sonnet yet, narrowing the gap to Opus 4.8 on coding, tool use and reasoning while staying far cheaper.
OpenAI / GPT-5.6 Sol: OpenAI’s GPT-5.6 flagship (Sol tier), with a deeper “max” reasoning effort and a multi-agent “ultra” mode; sets new state-of-the-art results on terminal coding, agentic browsing and coding-agent evals.
OpenAI / GPT-5.5: OpenAI’s previous-generation frontier model, strong on terminal coding and several agentic evals.
Cursor / Composer 2.5: Cursor’s in-house, coding-specialized agent model, built on Moonshot’s open Kimi K2.5 checkpoint and tuned with targeted RL for fast, low-cost agentic work inside the Cursor editor.
Anthropic / Opus 4.7: The previous-generation Claude Opus, still highly competitive on reasoning and coding.
Google DeepMind / Gemini 3.1 Pro: Google DeepMind’s frontier model, strong on search and graduate-level reasoning.
Anthropic / Mythos Preview: A preview-class frontier model, topping many benchmarks where results are reported.