pass@k, maj@k and Sampling: LLM Eval Metrics Explained
Understand pass@1, pass@k, and maj@k: what each metric measures, how temperature and sampling affect them, and what the choice of metric reveals about real-world performance.
When a benchmark reports a coding accuracy figure, it is rarely obvious whether the model succeeded on its first attempt or needed several tries. The metrics pass@1, pass@k, and maj@k make that distinction explicit — and the difference matters more than most leaderboard readers realise.
What pass@1 means
pass@1 is the probability that a single model sample solves a task correctly. The model is given a problem, generates one response, and that response is either right or wrong. The benchmark score is the fraction of problems where the single attempt passed.
This is the metric closest to real-world single-turn usage: if you send one message and expect one answer, pass@1 is the relevant figure. It is also the most conservative of the three metrics, because the model gets no second chance and a single unlucky sample counts as a failure. Most leaderboards default to pass@1, which makes it the natural baseline for comparison.
What pass@k means
pass@k asks: if the model is allowed k independent samples, what is the probability that at least one of them is correct? The estimator computes 1 minus the probability that all k samples fail; if each sample succeeds independently with probability p, that is 1 - (1 - p)^k. As k grows, pass@k approaches 100% for any problem the model can occasionally solve.
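In practice, pass@k is usually estimated from a larger pool of samples rather than from exactly k. A commonly used form draws n ≥ k samples per problem, counts the c that pass, and computes 1 - C(n - c, k) / C(n, k). The sketch below assumes that form; the function names `pass_at_k` and `benchmark_pass_at_k` are illustrative, not taken from any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random subset of k samples contains no passing one.
    # math.comb returns 0 when n - c < k, so the estimate is then exactly 1.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the plain fraction of passing samples, so the same function covers pass@1 when it is measured by sampling.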
pass@k captures a model's ceiling rather than its floor. A model with a wide distribution of outputs — sometimes brilliant, often mediocre — will show a large gap between pass@1 and pass@10. A model that is consistently correct shows a small gap. This spread is itself informative: wide gaps suggest the model "knows" the answer in some sense but has noisy decoding, while narrow gaps indicate reliable generation. Be cautious when a lab reports only pass@10 without pass@1 — the higher number may overstate the real single-shot experience. For more on reading scores carefully, see how to read LLM benchmark scores without being fooled.
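To make the pass@1 versus pass@10 gap concrete, here is a small sketch with two hypothetical models. It assumes samples are independent and that each problem has a fixed per-sample success probability; the probabilities are invented purely for illustration.

```python
def expected_pass_at_k(per_problem_p: list[float], k: int) -> float:
    """Expected pass@k when problem i is solved with probability p_i per sample."""
    return sum(1 - (1 - p) ** k for p in per_problem_p) / len(per_problem_p)


# Hypothetical model A: always solves half the problems, never the rest.
consistent = [1.0, 1.0, 0.0, 0.0]
# Hypothetical model B: solves every problem half of the time.
noisy = [0.5, 0.5, 0.5, 0.5]

for name, probs in [("consistent", consistent), ("noisy", noisy)]:
    print(name, expected_pass_at_k(probs, 1), expected_pass_at_k(probs, 10))
```

Both models score 0.5 at pass@1, but at pass@10 the consistent model stays at 0.5 while the noisy one climbs to about 0.999. The identical pass@1 hides very different behaviour, which is why reporting both figures is more informative than either alone.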
Majority voting: maj@k
maj@k (also written as majority@k) is a middle ground between pass@1 and pass@k. The model generates k samples, the most common answer among them is selected as the final answer, and that answer is checked for correctness. This mirrors a realistic deployment pattern: generate several outputs, self-aggregate, report the consensus.
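The voting procedure just described can be sketched in a few lines. This assumes each sample has already been reduced to a comparable final answer (a letter, a number as a string); the helper names are illustrative, and ties are broken in favour of the answer seen first, which is one arbitrary choice among several.

```python
from collections import Counter


def majority_answer(samples: list[str]) -> str:
    """Return the most common answer among the k samples."""
    # most_common orders equal counts by first appearance, so ties go to
    # whichever answer was sampled first.
    answer, _count = Counter(samples).most_common(1)[0]
    return answer


def maj_at_k(problems: list[tuple[list[str], str]]) -> float:
    """Fraction of problems whose voted answer equals the reference.

    Each entry in problems is (samples, reference_answer).
    """
    hits = sum(majority_answer(samples) == ref for samples, ref in problems)
    return hits / len(problems)
```

Note that a problem where the correct answer appears in only a minority of samples counts as a failure here, even though pass@k would score it as a success.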
Majority voting works well on problems with discrete answers that can be compared directly (multiple choice, short numeric outputs). For code, two correct programs rarely match character for character, so samples must first be grouped by behaviour, for example by their output on shared test inputs. Voting works poorly on open-ended generation where answers are diverse by nature. On reasoning benchmarks like GPQA Diamond, maj@32 or maj@64 is sometimes reported because the task is multiple choice and voting is tractable. On agentic benchmarks like Terminal-Bench, majority voting is rarely practical: the model must act in an environment, not just select among options.
How temperature and sampling affect these metrics
pass@1 is typically measured with greedy decoding (temperature 0) or a very low temperature, so that the single output is at or near the most probable one. pass@k and maj@k require diversity among samples, so a higher temperature is standard, commonly 0.6 to 0.8 for coding tasks. Higher temperature tends to improve pass@k by exploring more of the output distribution, but may hurt pass@1 by introducing noise.
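As a rough illustration of why temperature changes sample diversity, the sketch below applies temperature scaling to a toy set of logits before the softmax. The logit values are made up, and real decoders usually combine temperature with other controls such as top-p, which are not modelled here.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to sampling probabilities at the given temperature."""
    if temperature <= 0:
        # Temperature 0 corresponds to greedy decoding (always pick the
        # argmax); it is not handled by this formula.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, 0.2)
flat = softmax_with_temperature(toy_logits, 0.8)
```

At temperature 0.2 almost all probability mass sits on the top token, so repeated samples are nearly identical; at 0.8 the lower-ranked tokens are drawn often enough for k samples to differ meaningfully.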
When a paper is silent on temperature, be sceptical. Reporting pass@k at temperature 1.0 versus 0.4 can shift the metric by several points. This is another reason why cross-lab comparisons require identical evaluation conditions — a point covered in the complete guide to LLM benchmarks.
Which metric to care about
The answer depends on your deployment scenario. If users interact with the model once per query and accept the first response, optimise for pass@1. If you run an automated pipeline that can generate multiple candidates and pick the best via a verifier or test suite, pass@k is more relevant. If you use a voting ensemble, maj@k reflects your actual setup.
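For the multi-candidate pipeline case, a minimal sketch of verifier-based selection looks like this. The `verifier` callable is a stand-in for whatever check the pipeline has available (a test suite, a type checker, an answer validator); it and the function name are assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional


def first_passing(
    candidates: Iterable[str],
    verifier: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None if all fail."""
    for candidate in candidates:
        if verifier(candidate):
            return candidate
    return None
```

A pipeline built this way succeeds whenever at least one of its k candidates is acceptable, which is the event pass@k measures. That correspondence only holds if the verifier is reliable; a weak verifier that accepts wrong answers will deliver less than the pass@k figure suggests.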
Agentic workflows blur these categories further — a model that can retry a failed tool call is effectively running pass@k in production. Our guide to agentic evals explains how multi-step benchmarks handle this. See also benchmark contamination for why even accurate pass@1 numbers can mislead. The live benchmark comparison table notes the sampling conditions used for each reported figure.
Key takeaways
- pass@1 is the single-attempt success rate — the most conservative and realistic metric for single-turn use.
- pass@k measures the ceiling: the chance at least one of k samples is correct.
- maj@k selects the plurality answer across k samples — practical for verifiable tasks.
- Temperature and sampling settings must match when comparing metrics across papers.
- Match the metric to your deployment: single-shot, multi-sample, or ensemble.