Agentic Evals: How We Benchmark Tool-Using LLMs

Most classic benchmarks treat the model as a function: give it text, read its text output, score it. That design cannot measure whether a model can navigate a real codebase, execute shell commands, or retrieve live information from the web. Agentic evaluations were built to close that gap.

What makes an evaluation "agentic"

An agentic eval is one where the model must take a sequence of actions — typically tool calls — to accomplish a goal that cannot be reached in a single inference step. The model may read files, write and run code, open a browser, call an API, or send commands to an operating system. The eval harness captures what happened, then checks whether the goal was achieved.
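The act, observe, check cycle described above can be sketched in a few lines. This is a minimal illustration, not any benchmark's actual harness: `ToyEnvironment`, `run_episode`, and `scripted_policy` are hypothetical names, the "environment" is an in-memory dictionary rather than a real sandbox, and the policy is a hard-coded stand-in for a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ToyEnvironment:
    """Hypothetical in-memory 'filesystem' standing in for a real sandbox."""
    files: Dict[str, str] = field(default_factory=dict)

    def step(self, action: dict) -> str:
        """Apply one tool call and return the observation the model would see."""
        if action["tool"] == "write_file":
            self.files[action["path"]] = action["content"]
            return "ok"
        if action["tool"] == "read_file":
            return self.files.get(action["path"], "ERROR: no such file")
        return f"ERROR: unknown tool {action['tool']}"


def run_episode(
    policy: Callable[[str, list], Optional[dict]],
    env: ToyEnvironment,
    goal_check: Callable[[ToyEnvironment], bool],
    max_steps: int = 10,
) -> Tuple[bool, List[tuple]]:
    """Let the policy act until it stops or hits the step budget, then check the goal."""
    trajectory: List[tuple] = []
    observation = "start"
    for _ in range(max_steps):
        action = policy(observation, trajectory)
        if action is None:  # the policy signals that it is done
            break
        observation = env.step(action)
        trajectory.append((action, observation))
    return goal_check(env), trajectory


def scripted_policy(observation: str, trajectory: list) -> Optional[dict]:
    """Stand-in for a model: writes one file, then stops."""
    if not trajectory:
        return {"tool": "write_file", "path": "hello.txt", "content": "hi"}
    return None


env = ToyEnvironment()
success, trajectory = run_episode(
    scripted_policy, env, lambda e: e.files.get("hello.txt") == "hi"
)
```

The goal check reads the environment, not the policy's final message, which is the property that separates this setup from static QA.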

The distinction from static QA is fundamental. In static QA, the model's world is frozen at inference time: it can only use knowledge encoded in its weights plus whatever fits in the context window. In an agentic eval, the world is live and mutable. The model can fetch new information, create side effects, and respond to feedback from the environment over many turns.

Environments and harnesses

Every agentic benchmark pairs a task set with an environment that the model operates in. The environment determines what actions are available and provides an oracle for success checking.

Terminal environments: the model runs shell commands in a sandboxed Linux container. Terminal-Bench uses this design, scoring the model on whether each terminal-native task (system administration, file manipulation, scripting) reaches the specified end state.
Desktop and GUI environments: the model controls a full operating system via screenshots and action primitives. OSWorld-Verified snapshots real application states and checks whether the model reaches a target configuration across apps like Calc, Chrome and VS Code.
Tool-call environments: the model calls structured functions (search, read, write, HTTP) and the harness intercepts and logs each call. MCP-Atlas evaluates whether models use the Model Context Protocol correctly across a diverse set of tool-calling tasks.
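As a rough illustration of the "intercepts and logs each call" idea in a tool-call environment, the sketch below routes every call through one method that records the tool name, arguments, and outcome. `LoggingToolHarness` and the `add` tool are hypothetical; this is not the Model Context Protocol or any benchmark's real interface.

```python
from typing import Any, Callable, Dict, List


class LoggingToolHarness:
    """Hypothetical harness: dispatches tool calls and keeps a log of each one."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self._tools = tools
        self.log: List[Dict[str, Any]] = []

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        """Run one tool call; failures are logged and returned, never raised."""
        record: Dict[str, Any] = {"tool": name, "arguments": dict(arguments), "ok": False}
        tool = self._tools.get(name)
        if tool is None:
            record["result"] = f"unknown tool: {name}"
        else:
            try:
                record["result"] = tool(**arguments)
                record["ok"] = True
            except Exception as exc:  # surface the error text to the model instead of crashing
                record["result"] = f"{type(exc).__name__}: {exc}"
        self.log.append(record)
        return record["result"]


harness = LoggingToolHarness({"add": lambda a, b: a + b})
first = harness.call("add", {"a": 2, "b": 3})        # valid call
second = harness.call("search", {"query": "x"})      # tool not registered
third = harness.call("add", {"a": 1})                # missing argument
```

Returning errors as observations rather than raising them lets the episode continue, so the log also captures how the model recovers from a bad call.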

How scoring works in agentic settings

Agentic scoring is harder than static answer-matching. The model's final output is rarely a single string that can be compared to a gold answer. Instead, evaluators check end state: did the file get created, does the unit test pass, does the webpage render correctly, does the database contain the expected row?
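End-state checks of this kind are usually small predicates over the environment. The sketch below shows two, assuming a task whose goal is "a report file exists and a user row was inserted". The helper names, file name, and table schema are invented for illustration, and the agent's actions are simulated by writing the expected state directly.

```python
import sqlite3
import tempfile
from pathlib import Path


def file_contains(path: Path, expected: str) -> bool:
    """True if the file exists and contains the expected text."""
    return path.is_file() and expected in path.read_text()


def row_exists(conn: sqlite3.Connection, table: str, column: str, value: str) -> bool:
    """True if at least one row matches.

    Table and column names are interpolated into the query, so they must come
    from trusted harness configuration, never from model output.
    """
    cursor = conn.execute(
        f"SELECT 1 FROM {table} WHERE {column} = ? LIMIT 1", (value,)
    )
    return cursor.fetchone() is not None


# Simulate what an agent might have left behind, then check it.
with tempfile.TemporaryDirectory() as workdir:
    report = Path(workdir) / "report.txt"
    report.write_text("status: done\n")
    file_ok = file_contains(report, "status: done")
    missing_ok = file_contains(Path(workdir) / "absent.txt", "status: done")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("INSERT INTO users VALUES (?)", ("a@example.com",))
row_ok = row_exists(conn, "users", "email", "a@example.com")

task_passed = file_ok and row_ok
```

Neither check looks at the model's transcript: any sequence of actions that produces the right state passes.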

This yields a binary outcome per task, success or failure, which maps cleanly to a percentage of tasks solved. Some evaluations add partial credit for correct sub-steps. Others track efficiency (steps taken, tokens consumed, wall-clock time) as a secondary dimension. Because each task may take dozens of model calls, the cost of an evaluation run can be orders of magnitude higher than for a static benchmark. That cost limits how many trials most labs can afford, which is a direct reason why pass@k analysis is less common in agentic settings.
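The arithmetic behind these scores is simple enough to sketch. The snippet below computes a success rate from binary outcomes, estimates pass@k, and counts model calls for a run. The pass@k formula is the unbiased estimator commonly used in code-generation evaluation, included here as an assumption since the article does not specify one, and the task, step, and trial counts are made-up numbers.

```python
from math import comb
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks solved, given one binary outcome per task."""
    return sum(outcomes) / len(outcomes)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled trials succeeds,
    given c successes observed in n trials: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def total_model_calls(tasks: int, calls_per_task: int, trials: int) -> int:
    """Model calls needed to run every task `trials` times."""
    return tasks * calls_per_task * trials


rate = success_rate([True, False, True, True])   # 0.75
p1 = pass_at_k(n=5, c=1, k=1)                    # 0.2
calls = total_model_calls(tasks=100, calls_per_task=30, trials=5)  # 15000
```

The last line shows why repeated trials are costly: with these illustrative numbers, five trials per task means 15,000 model calls, against 500 for a single-call static benchmark of the same size.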

Why agentic evals differ from static QA

Several properties set agentic evals apart in ways that matter for interpretation. First, they are sensitive to the scaffold: the code that wraps the model, formats tool outputs, and decides when to stop. Two labs using the same model with different scaffolds can report meaningfully different numbers on the same benchmark. Second, they are expensive, so sample sizes are smaller and each reported score carries more statistical uncertainty. Third, they are harder to contaminate: an environment that runs real code cannot be "memorised" as easily as a static question, although task descriptions and reference solutions can still leak into training data. For more on contamination in static benchmarks, see benchmark contamination.
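The effect of small sample sizes can be made concrete with a confidence interval on the pass rate. The sketch below uses the Wilson score interval for a binomial proportion, which is one reasonable choice and not something the benchmarks named here are known to prescribe. The task counts are illustrative.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# The same 60% pass rate measured on 50 tasks and on 5,000 tasks.
small = wilson_interval(30, 50)
large = wilson_interval(3000, 5000)
small_width = small[1] - small[0]
large_width = large[1] - large[0]
```

With 50 tasks the interval spans roughly a quarter of the scale, so a gap of a few points between two models on a small agentic benchmark may not be meaningful.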

The practical upshot is that agentic scores are more predictive of real agent deployments, but they require more careful reading. Always check which scaffold was used, whether tool access was restricted, and how task success was defined. The full checklist is in how to read LLM benchmark scores without being fooled. For a full map of every benchmark category, see the complete guide to LLM benchmarks.

Key takeaways

Agentic evals require the model to act over multiple steps in a live environment, not just produce text.
Environments range from terminal sandboxes to full GUI desktops to structured tool-call harnesses.
Scoring checks end state rather than string matching — harder to game but also harder to standardise.
Scaffold choice significantly affects results; compare only scores from identical scaffolds.
Agentic benchmarks are more expensive to run, so sample sizes are smaller and variance is higher.
See the live benchmark comparison table to compare agentic scores across models.
