Terminal-Bench 2.1

Agentic terminal coding

Terminal-Bench evaluates models on real-world tasks in a terminal and command-line environment (installing dependencies, debugging, running builds, and orchestrating tools), where each step depends on the result of the last.

Model scores

  • Opus 4.8: 74.6%
  • Opus 4.7: 66.1%
  • GPT-5.5: 78.2%
  • Gemini 3.1 Pro: 70.3%
  • Mythos Preview: 82.0%

Official source: Terminal-Bench (tbench.ai)

Related reading