FrontierCode (Diamond)

Agentic coding

Cognition’s FrontierCode evaluation tests whether models can complete difficult coding tasks while meeting the standards of high-quality production codebases: the code must be correct, maintainable, and reviewable, not merely pass tests. Diamond is the hardest tier.

Model scores

  • Fable 5: 29.3%
  • Opus 4.8: 13.4%
  • GPT-5.5: 5.7%
  • Opus 4.7
  • Gemini 3.1 Pro
  • Mythos Preview

Official source: Anthropic — Fable 5 / Mythos 5 announcement

Related reading