- Benchmark
- A fixed set of tasks with a defined scoring rule, run against a model under controlled conditions to produce a single comparable number.
- pass@k
- The probability that at least one of k sampled attempts solves a task. pass@1 measures single-shot reliability; higher k rewards models that can succeed given multiple tries.
- maj@k
- Majority voting over k samples: the model answers k times and the most common answer is scored. It smooths out sampling noise on tasks with a single correct answer.
- Agentic evaluation
- An evaluation in which the model acts over multiple steps in an environment (calling tools, running commands, browsing) rather than answering a single static question.
- Benchmark contamination
- Leakage of test questions (or close variants) into a model’s training data, which inflates scores because the model has effectively seen the answers.
- Saturation
- The state in which top models cluster near the maximum score on a benchmark, leaving little headroom and making the benchmark poor at separating frontier models.
- LLM-as-judge
- Using a language model to grade another model’s open-ended outputs against a rubric, in place of (or alongside) exact-match scoring.
- Elo rating
- A rating derived from pairwise win/loss outcomes (often human preference votes), used to rank models on a single relative scale.
- Tool use
- A model’s ability to call external functions or services (search, code execution, APIs) and incorporate their results, for example through a protocol such as the Model Context Protocol (MCP).
- Long-context reasoning
- Reasoning that requires synthesizing information spread across a very large input, rather than retrieving a single passage.
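
Three of the entries above (pass@k, maj@k, and Elo rating) reduce to short formulas, so a minimal Python sketch may help make them concrete. The function names are hypothetical and do not come from any particular evaluation harness. `pass_at_k` uses the commonly cited unbiased estimator computed from n >= k samples of which c are correct. `elo_update` applies the classic online Elo rule, and its K-factor of 32 is an assumed value; real leaderboards may fit ratings differently.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(answers: list[str], reference: str) -> bool:
    """Score the most common of the k sampled answers against the reference."""
    if not answers:
        raise ValueError("need at least one answer")
    # Counter.most_common breaks ties in favour of the answer seen first.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner == reference


def elo_update(r_a: float, r_b: float, score_a: float,
               k_factor: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k_factor=32 is an assumption for illustration, not a standard.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k_factor * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

For example, with 4 samples of which 1 is correct, `pass_at_k(4, 1, 2)` gives 0.5, and a win between two models both rated 1500 moves them to 1516 and 1484. A benchmark-level pass@k would then be the mean of the per-task estimates.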