What Is an LLM Agent? Tools, Planning and Evaluation

When most people think of a large language model they picture a chat interface: you type a message, the model replies, done. An LLM agent is something different. It is a model that perceives its environment, selects actions, executes them, observes the results, and repeats — all in service of a goal that may take dozens or hundreds of steps to reach. Understanding what an agent is, how it plans, and how we measure its capabilities is increasingly important as more software relies on autonomous AI behaviour.

What separates an agent from a chatbot

A conversational model answers questions. An agent does things. The key difference is the presence of an action loop: the model outputs not just text for a human to read but structured actions for a system to execute — searching the web, reading a file, running code, calling an API, clicking a button in a GUI. After each action the environment returns an observation (search results, file contents, command output, an error message) which the model folds into its next decision.

This loop of observing, deciding, acting, and observing again is usually called an agent loop. ReAct (Reasoning + Acting) is a well-known variant in which the model interleaves written reasoning with each action. Either way, the loop transforms the model from a passive text generator into an active participant that changes the state of the world. The scaffold that implements the loop (the code deciding when to call the model, how to format tool outputs, when to stop) is just as important as the model itself.
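
The loop is small enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the names run_agent and scripted_policy and the action dictionary format are hypothetical, and a hard-coded policy function stands in for the model call that a real scaffold would make.

```python
def run_agent(policy, tools, goal, max_steps=10):
    """Run an observe-decide-act loop until the policy finishes or steps run out."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action = policy(history)                 # decide (a real scaffold calls the model here)
        if action["type"] == "finish":
            return action["answer"], history
        tool = tools[action["tool"]]
        try:
            observation = tool(**action["args"])  # act
        except Exception as exc:
            observation = f"error: {exc}"         # errors are observations too
        history.append((action["tool"], observation))  # observe
    return None, history                          # step budget exhausted


def scripted_policy(history):
    """Hypothetical stand-in for a model: call one tool, then finish."""
    if len(history) == 1:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": str(history[-1][1])}


answer, history = run_agent(scripted_policy, {"add": lambda a, b: a + b}, "add 2 and 3")
```

The max_steps budget is one of the scaffold decisions mentioned above: without it, a policy that never emits a finish action would loop forever.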

The three pillars: tools, memory, and planning

Most agent architectures are built from three components.

Tools — functions the model can call. Common examples include web search, code execution, file read/write, HTTP requests, and database queries. The model chooses which tool to call and with what arguments; the scaffold executes the call and returns the result. The Model Context Protocol (MCP) is an emerging standard for defining and exposing tools — see MCP-Atlas for how models are evaluated on it.
Memory — information that persists across steps. In its simplest form this is just the context window, accumulating tool outputs as the session grows. More sophisticated designs use external stores (vector databases, key-value caches) to handle tasks that exceed what fits in a single context.
Planning — the ability to decompose a high-level goal into a sequence of smaller steps. Some models do this implicitly, producing the next action by continuing the conversation. Others use explicit plan generation: the model first writes out a numbered plan, then executes each step, checking success along the way.
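
To make the tools component concrete, here is a minimal sketch of a tool registry and dispatcher. It assumes the model emits a JSON object with "name" and "arguments" fields; that format, and the names TOOLS, tool, word_count, and dispatch, are illustrative choices and not the MCP wire format or any particular vendor's API.

```python
import json

TOOLS = {}  # hypothetical in-process registry: tool name -> function and description


def tool(name, description):
    """Decorator that registers a function as a callable tool."""
    def register(fn):
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return register


@tool("word_count", "Count the words in a text string.")
def word_count(text: str) -> int:
    return len(text.split())


def dispatch(call_json: str) -> dict:
    """Execute one model-issued tool call and return a structured observation."""
    call = json.loads(call_json)
    entry = TOOLS.get(call["name"])
    if entry is None:
        return {"ok": False, "error": f"unknown tool: {call['name']}"}
    try:
        return {"ok": True, "result": entry["fn"](**call.get("arguments", {}))}
    except TypeError as exc:  # wrong or missing arguments
        return {"ok": False, "error": str(exc)}


observation = dispatch('{"name": "word_count", "arguments": {"text": "observe decide act"}}')
```

Returning errors as data rather than raising lets the scaffold hand the failure back to the model as an ordinary observation.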

How multi-step planning works in practice

Consider asking an agent to "find the five highest-grossing films released in 2024 and save their Rotten Tomatoes scores to a CSV." A competent agent might plan: search for a box-office ranking, extract film titles, search for each film's Rotten Tomatoes score, format the results, and write the file. Each step depends on the previous one, and errors compound — if step two returns the wrong titles, every subsequent step is wrong.
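
The dependency between steps can be sketched as a small plan executor that feeds each step's output into the next and checks it before moving on. Everything here is a stand-in: the search and lookup functions are stubs, the film titles and scores are placeholders and not real data, and the CSV is built in memory instead of being written to disk.

```python
import csv
import io


def search_box_office(query):
    # Stub for a web search tool; returns placeholder text, not real rankings.
    return "1. Film A\n2. Film B\n3. Film C"


def extract_titles(page):
    return [line.split(". ", 1)[1] for line in page.splitlines()]


def lookup_scores(titles):
    placeholder_scores = {"Film A": 90, "Film B": 75, "Film C": 60}  # invented values
    return [(title, placeholder_scores[title]) for title in titles]


def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "score"])
    writer.writerows(rows)
    return buffer.getvalue()


def run_plan(steps, initial):
    """Run steps in order, passing each output forward and checking it."""
    value = initial
    for number, (name, fn, check) in enumerate(steps, start=1):
        value = fn(value)
        if not check(value):
            raise RuntimeError(f"step {number} ({name}) failed its check")
    return value


steps = [
    ("search", search_box_office, lambda page: bool(page.strip())),
    ("extract", extract_titles, lambda titles: len(titles) == 3),
    ("scores", lookup_scores, lambda rows: all(0 <= s <= 100 for _, s in rows)),
    ("format", to_csv, lambda text: text.startswith("title,score")),
]
csv_text = run_plan(steps, "highest-grossing films")
```

The per-step checks catch a bad intermediate result at the step that produced it, before later steps build on it.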

This error compounding is why multi-step planning is hard. Models that perform well on single-step benchmarks often degrade sharply as task length increases. Researchers quantify this with task-completion curves: success rate as a function of the number of required actions. Strong agents maintain high success rates out to 20, 30, or more steps; weaker ones fall off quickly. See agentic evals explained for a deeper look at how these evaluations are structured.
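
A back-of-the-envelope model shows why the curve falls. Assume, purely as a simplification that the text above does not claim, that every step succeeds independently with the same probability; the chance of finishing an n-step task is then that probability raised to the power n. The second function below computes an empirical task-completion curve from (steps, succeeded) records. Both function names are hypothetical.

```python
from collections import defaultdict


def chain_success(per_step: float, steps: int) -> float:
    """Success probability of a task if every step must succeed independently."""
    return per_step ** steps


def completion_curve(records):
    """Map task length to observed success rate from (steps, succeeded) pairs."""
    totals = defaultdict(lambda: [0, 0])  # steps -> [successes, attempts]
    for steps, succeeded in records:
        totals[steps][0] += int(succeeded)
        totals[steps][1] += 1
    return {steps: wins / attempts for steps, (wins, attempts) in sorted(totals.items())}


# With 95% per-step reliability, success drops to about 36% at 20 steps.
theoretical = {n: round(chain_success(0.95, n), 3) for n in (1, 5, 10, 20, 30)}
# theoretical == {1: 0.95, 5: 0.774, 10: 0.599, 20: 0.358, 30: 0.215}
```

Real agents do not fail independently from step to step, so measured curves can sit above or below this estimate.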

How LLM agents are evaluated

Evaluating agents requires a fundamentally different approach from scoring a multiple-choice test. The most widely used method is end-state verification: the harness checks whether the world is in the target configuration after the agent finishes, not whether any particular intermediate action was taken. Did the file appear? Does the test suite pass? Is the target webpage rendered correctly?
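
In code, end-state verification amounts to running the agent in a sandbox and then inspecting what it left behind. The sketch below assumes a hypothetical task whose target state is a file named scores.csv with a title,score header and at least one data row. The agent functions are stubs, and the harness never looks at which actions were taken.

```python
import csv
import tempfile
from pathlib import Path


def verify_end_state(workdir: Path) -> bool:
    """Check the final state only: the target file exists and has the expected shape."""
    target = workdir / "scores.csv"
    if not target.is_file():
        return False
    with target.open(newline="") as handle:
        rows = list(csv.reader(handle))
    return len(rows) >= 2 and rows[0] == ["title", "score"]


def evaluate(agent, task_prompt: str) -> bool:
    """Run the agent in a fresh temporary directory, then verify the end state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        try:
            agent(task_prompt, workdir)
        except Exception:
            pass  # a crash does not decide the result; only the final state is checked
        return verify_end_state(workdir)


def good_agent(prompt, workdir):
    # Stub agent that writes the file with placeholder content.
    (workdir / "scores.csv").write_text("title,score\nFilm A,90\n")


def lazy_agent(prompt, workdir):
    # Stub agent that does nothing.
    return None


results = {"good": evaluate(good_agent, "save scores"), "lazy": evaluate(lazy_agent, "save scores")}
```

Because the check ignores the path taken, two agents that use entirely different tools receive the same score if they reach the same final state.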

OSWorld-Verified is a good example: it places the agent inside a real operating system with real applications (LibreOffice, Chrome, VS Code) and checks whether a natural-language instruction was carried out correctly. Tasks span file management, spreadsheet editing, web browsing, and coding. Because the environment is live, not static, it is much harder to "memorise" than a written question — contamination risk is lower, which makes the scores more trustworthy. For a broader map of agent benchmarks, consult the complete guide to LLM benchmarks.

It is also worth reading "What is MCP-Atlas" to understand how tool-calling specifically is measured, and checking the glossary for definitions of terms such as scaffold, harness, and end-state verification, which recur throughout agent evaluation.

Limitations and open challenges

Agents are powerful but brittle in ways that simple chat models are not. Long-horizon tasks require the model to keep its goals coherent as earlier context is truncated, summarised, or otherwise refreshed, which today's models handle imperfectly. Tool errors can derail a plan entirely; most current agents lack robust recovery strategies. And because scaffold design matters so much, scores reported by different labs for the same model are often not comparable, a gap that the research community is actively working to close with standardised harnesses.

To compare agent-capable models side by side on benchmarks like OSWorld-Verified and MCP-Atlas, visit the live benchmark comparison. Model pages such as Claude Opus 4.8 show which agentic benchmarks each model has been evaluated on.

Key takeaways

An LLM agent runs an observe-decide-act loop rather than producing a single reply.
Tools, memory, and planning are the three core components of most agent architectures.
Multi-step planning is hard because errors compound; success rate typically declines as task length grows.
Agent evaluation most often uses end-state verification rather than answer-matching, which makes it both harder to game and harder to standardise.
Scaffold design significantly affects reported scores; compare results only when they come from the same scaffold.
OSWorld-Verified and MCP-Atlas are leading benchmarks that test real-world agentic capability.