LLM benchmark glossary

Plain-English definitions of the terms you'll meet when reading LLM benchmark results.

Benchmark: A fixed set of tasks with a defined scoring rule, run against a model under controlled conditions to produce a score that can be compared across models.
pass@k: The probability that at least one of k sampled attempts solves a task. pass@1 measures single-shot reliability; higher k rewards models that can succeed given multiple tries.
maj@k: Majority voting over k samples: the model answers k times and the most common answer is scored. It smooths out sampling noise on tasks with a single correct answer.
Agentic evaluation: A benchmark where the model acts over multiple steps in an environment — calling tools, running commands, browsing — rather than answering a single static question.
Benchmark contamination: When test questions (or close variants) leak into a model’s training data, inflating scores because the model has effectively seen the answers.
Saturation: When top models cluster near the maximum score on a benchmark, leaving little headroom and making the benchmark poor at separating frontier models.
LLM-as-judge: Using a language model to grade another model’s open-ended outputs against a rubric, in place of (or alongside) exact-match scoring.
Elo rating: A rating derived from pairwise win/loss outcomes (often human preference votes), used to rank models on a single relative scale.
Tool use: A model’s ability to call external functions or services (search, code execution, APIs) and incorporate their results, for example through the Model Context Protocol (MCP).
Long-context reasoning: Reasoning that requires synthesizing information spread across a very large input, rather than retrieving a single passage.
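
The three numeric terms above (pass@k, maj@k, Elo rating) are easier to follow with a little arithmetic. The Python sketch below is illustrative only: the function names are made up for this glossary, and individual benchmarks may compute these quantities differently. It assumes pass@k is estimated from n sampled attempts of which c passed, using the common combinatorial estimator 1 - C(n-c, k) / C(n, k), that maj@k compares answers as exact strings, and that Elo uses the classic logistic formula with a 400-point scale (some leaderboards fit a related statistical model instead).

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which passed.

    Computes the chance that a random subset of k of the n attempts
    contains at least one passing attempt.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the subsets of size k made only of failures;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(answers: list[str], reference: str) -> int:
    """Score 1 if the most common of the k answers matches the reference.

    Assumption: ties go to whichever answer appeared first.
    """
    if not answers:
        raise ValueError("need at least one answer")
    top_answer, _count = Counter(answers).most_common(1)[0]
    return int(top_answer == reference)


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A over B under the classic Elo formula."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # 10 attempts, 3 passed: a single try succeeds about 30% of the time.
    print(pass_at_k(n=10, c=3, k=1))
    # With 5 tries allowed, the same model looks much stronger.
    print(pass_at_k(n=10, c=3, k=5))
    # Three samples, two agree on "42".
    print(maj_at_k(["42", "41", "42"], reference="42"))
    # A 100-point rating gap is roughly a 64% expected win rate.
    print(elo_expected_score(1600, 1500))
```

A benchmark-level pass@k is usually the average of the per-task estimate over all tasks, and a benchmark-level maj@k is the average of the per-task 0/1 scores.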

New to benchmarks? Start with the complete guide to LLM benchmarks.