What Is Terminal-Bench? Benchmarking Agents in the Shell

Terminal-Bench evaluates AI agents on real command-line tasks inside a live shell environment. Learn what it measures, why it tests long-horizon agentic behavior, and how setup affects scores.

Terminal-Bench puts an AI agent inside a real Linux shell and gives it tasks that take many sequential commands to complete. That makes it a direct test of whether a model can operate a computing environment, rather than only describe how it would.

What Terminal-Bench evaluates

Each Terminal-Bench task is a self-contained challenge delivered in a sandboxed shell session. The agent can run arbitrary commands: it can inspect files, install packages, write scripts, call system utilities, and check its own output. Success is determined programmatically by one of three kinds of check: a specific file exists with the correct content, a service is running in the expected state, or a program produces the right output when called.
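The three kinds of programmatic check described above could be sketched in Python as follows. This is a minimal illustration, not Terminal-Bench's actual verification code: the function names (file_has_content, port_is_open, command_prints) are hypothetical, and treating "a service is running" as "a TCP port accepts connections" is a simplifying assumption.

```python
import socket
import subprocess
import sys
import tempfile
from pathlib import Path


def file_has_content(path: str, expected: str) -> bool:
    """Pass if the file exists and its text matches, ignoring trailing whitespace."""
    p = Path(path)
    return p.is_file() and p.read_text().rstrip() == expected.rstrip()


def port_is_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Pass if something accepts TCP connections on host:port (assumed proxy for 'service is up')."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def command_prints(argv: list[str], expected_stdout: str, timeout_s: float = 10.0) -> bool:
    """Pass if the command exits with status 0 and prints the expected text."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


if __name__ == "__main__":
    # Hypothetical task: "write 42 to result.txt".
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "result.txt"
        target.write_text("42\n")
        print("file check:", file_has_content(str(target), "42"))
    print("output check:", command_prints([sys.executable, "-c", "print('ok')"], "ok"))
```

Each check returns a plain boolean, so a task passes or fails without any human judgment involved.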

Tasks span a wide range of difficulty: some require a single correct command, while harder instances demand a multi-step workflow spanning dozens of interactions with the shell, compilers, package managers, or network utilities. This range is intentional — it lets the benchmark measure both basic competency and long-horizon planning.

Current model scores are available on the Terminal-Bench benchmark page, and you can compare models side by side on our live benchmark comparison table.

Why long-horizon agentic behavior is hard to measure

Most benchmarks evaluate a model on a single turn: one prompt, one response. Terminal-Bench is different because the model must maintain a coherent goal across many turns, remember what it has already tried, recover from failed commands, and know when it has finished. These skills (planning, error recovery, and state tracking) are what separate a useful coding agent from a capable autocomplete tool.

For an overview of how agentic evaluations differ from static benchmarks, see our agentic evals explainer. For a direct comparison with code-level agentic tasks, see What is SWE-bench?.

How inference setup affects Terminal-Bench scores

Terminal-Bench scores are extremely sensitive to the agent harness surrounding the model. The same base model can produce very different results depending on:

  • Context window management — long shell sessions accumulate output quickly; truncating the context at the wrong moment causes the agent to lose track of earlier steps.
  • Tool call budget — limits on the number of shell commands allowed per task cap the complexity of problems a model can solve, regardless of its reasoning ability.
  • Error handling strategy — whether the harness re-prompts on a command error, or lets the model see the raw stderr and self-correct, significantly changes outcomes on tasks that require debugging.
  • Timeout and retry policies — tasks involving compilation or network calls introduce real-world latency; overly aggressive timeouts disqualify otherwise correct solutions.
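To make the four factors above concrete, here is a minimal sketch of a harness loop in Python. It is an illustration under stated assumptions, not the Terminal-Bench harness: HarnessConfig, trim_context, run_episode, and every default value are hypothetical, and the "model" is any callable that maps the current context to the next shell command (or None when it considers the task finished). Retries are left out for brevity.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class HarnessConfig:
    # All names and defaults are illustrative assumptions.
    max_commands: int = 20           # tool call budget
    command_timeout_s: float = 30.0  # per-command timeout
    max_context_chars: int = 8000    # context window, approximated in characters
    show_raw_stderr: bool = True     # error handling strategy


def trim_context(task: str, history: list[str], limit: int) -> str:
    """Always keep the task statement, then as many of the most recent entries as fit."""
    kept: list[str] = []
    used = len(task)
    for entry in reversed(history):
        if used + len(entry) + 1 > limit:
            break  # older entries are dropped; this is where earlier steps get lost
        kept.append(entry)
        used += len(entry) + 1
    return "\n".join([task, *reversed(kept)])


def run_episode(
    task: str,
    next_command: Callable[[str], Optional[str]],
    cfg: HarnessConfig,
) -> list[str]:
    """Run commands until the model stops or the budget runs out; return the transcript."""
    history: list[str] = []
    for _ in range(cfg.max_commands):
        cmd = next_command(trim_context(task, history, cfg.max_context_chars))
        if cmd is None:
            break
        try:
            proc = subprocess.run(
                cmd, shell=True, capture_output=True, text=True,
                timeout=cfg.command_timeout_s,
            )
            out = proc.stdout
            if proc.returncode != 0:
                # Either expose the raw error text or hide it behind a generic marker.
                out += proc.stderr if cfg.show_raw_stderr else "[command failed]"
        except subprocess.TimeoutExpired:
            out = "[timed out]"
        history.append(f"$ {cmd}\n{out.rstrip()}")
    return history
```

Changing any single field of HarnessConfig changes what the model sees or how far it can get, even though the model itself is untouched. For example, with max_commands=2 a scripted policy that wants to run three commands is cut off after the second.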

This sensitivity means that comparing two Terminal-Bench scores from different evaluation setups is unreliable. Always look for scores produced under the same harness configuration.
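One way to apply this rule is to compare the harness settings behind two reported scores before comparing the scores themselves. The sketch below assumes each result is published with a dictionary of harness settings; the function name harness_differences and the setting names are hypothetical.

```python
def harness_differences(a: dict, b: dict) -> dict:
    """Return {setting: (value_in_a, value_in_b)} for every setting that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Hypothetical settings for two evaluation runs.
    run_a = {"max_commands": 50, "command_timeout_s": 60, "show_raw_stderr": True}
    run_b = {"max_commands": 20, "command_timeout_s": 60, "show_raw_stderr": True}
    diff = harness_differences(run_a, run_b)
    print("comparable" if not diff else f"not comparable, settings differ: {diff}")
```

An empty result means the two runs used the same settings, so their scores can reasonably be set side by side.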

Terminal-Bench vs SWE-bench and OSWorld

Terminal-Bench, SWE-bench, and OSWorld all evaluate agents in real environments, but they target different skill sets. SWE-bench focuses on generating Python patches in a software repository. OSWorld (see What is OSWorld?) tests graphical computer use: clicking and typing in GUI applications. Terminal-Bench sits in between: it requires real environment interaction but focuses on the command line rather than a graphical desktop or a single codebase. Of the three, it is the most relevant for evaluating DevOps automation, shell scripting agents, and infrastructure tooling.

For the full picture of how these evaluations relate to each other, read the complete guide to LLM benchmarks.

Key takeaways

  • Terminal-Bench runs an AI agent in a live shell, evaluating it on tasks that require many sequential commands and long-horizon planning.
  • Success is determined programmatically — file state, process state, or program output — not by a human rater.
  • Scores are highly sensitive to harness configuration: context management, tool call budgets, and error-handling policies all materially affect results.
  • Compared with SWE-bench and OSWorld, Terminal-Bench is the most relevant eval for shell scripting and DevOps automation use cases; pair it with SWE-bench for a rounded view of coding agent capability.
