How to Read LLM Benchmark Scores Without Being Fooled

A benchmark number without context is marketing. Before you let a leaderboard percentage influence a model decision, there are five questions you need to answer about how that number was produced.

Which subset was actually tested?

Most benchmarks ship in multiple tiers. SWE-bench has a full set, a Verified subset and a Pro variant, and their scores are not interchangeable. SWE-bench Verified keeps only tasks that human engineers confirmed are well specified and solvable, so it is a cleaner signal than the raw set, not a harder one; scores on it typically run higher than on the full set. When a lab publishes "scored 45% on SWE-bench" without specifying which split, the number is ambiguous. Always look for the exact split name and sample size before comparing.

The same trap appears on GPQA (Diamond is the hardest tier, not the default) and on MMLU (5-shot and 0-shot protocols produce visibly different numbers). Treat benchmark names as families, not single measurements.

Did the model have access to tools?

Many modern evaluations allow models to execute code, browse the web or call external APIs. A model scored with tool access and one scored in a plain chat-completion context are not comparable, even on the same task set. Check whether the evaluation used: code execution, a web-search harness, file system access, or a multi-agent scaffold. If the lab is silent on this, assume the advantageous condition was used. Our guide to agentic evals goes deeper on what tool-enabled benchmarks actually measure.

How many trials were averaged?

LLM outputs are stochastic. A single pass over a test set can swing by several percentage points. Responsible evaluations report the mean of multiple independent runs (typically three to five) and include a standard deviation or confidence interval. When you see only a single number with no variance information, you cannot tell whether it is an expected value or a best-of pick, so treat it with caution. The pass@k metric makes the sampling question explicit: pass@1 measures success on a single attempt, while pass@k counts a task as solved if any of k sampled attempts succeeds. Make sure you know which regime the published score uses.
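As a rough illustration of the two ideas in this section, the sketch below summarises several runs as a mean and sample standard deviation, and estimates pass@k for one task using the combinatorial estimator commonly used for code benchmarks. The function names and the example numbers are made up for illustration; they do not come from any particular lab's harness.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for a list of run scores.

    Needs at least two runs; a single run carries no variance information.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)


def pass_at_k(n, c, k):
    """Estimate pass@k for one task from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    draw of k samples out of the n contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical numbers: three runs of the same evaluation.
mean, stdev = summarize_runs([0.44, 0.46, 0.45])
print(f"score = {mean:.3f} +/- {stdev:.3f}")

# Hypothetical task: 10 samples drawn, 3 passed.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

With the same underlying samples, pass@5 comes out far higher than pass@1, which is why a score quoted without its k is hard to interpret.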

Is the benchmark saturated?

Saturation happens when the top models all cluster near the ceiling, typically 90% or above. At that point differences between models are within noise, and the benchmark no longer discriminates well. MMLU is now widely considered saturated: frontier models score roughly 88-91%, and the gap between them is smaller than the variance from run to run. When every model scores high, look for a harder variant or a newer benchmark. Humanity's Last Exam was explicitly designed as a post-saturation replacement; when it was released, even the best models scored well below 30%.
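One way to check whether a gap is "within noise" is to compare it against the sampling error implied by the test-set size. The sketch below uses a normal approximation and treats the two scores as independent proportions, which is a simplifying assumption (models scored on the same items are really paired, and run-to-run variance is not included). The function names and the 500-item example are hypothetical.

```python
import math


def accuracy_margin(accuracy, n_items, z=1.96):
    """Half-width of an approximate 95% interval for one accuracy score."""
    if not 0.0 <= accuracy <= 1.0 or n_items <= 0:
        raise ValueError("accuracy must be in [0, 1] and n_items positive")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def gap_is_within_noise(acc_a, acc_b, n_items, z=1.96):
    """True if the gap is no larger than z standard errors of the difference.

    Assumes both models were scored on n_items questions and treats the
    two accuracies as independent binomial proportions.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    se_diff = math.sqrt(
        acc_a * (1.0 - acc_a) / n_items + acc_b * (1.0 - acc_b) / n_items
    )
    return abs(acc_a - acc_b) <= z * se_diff


# Hypothetical scores on a 500-question benchmark.
print(f"margin at 90%: +/- {accuracy_margin(0.90, 500):.3f}")
print(gap_is_within_noise(0.90, 0.89, 500))  # one-point gap
print(gap_is_within_noise(0.90, 0.80, 500))  # ten-point gap
```

Under these assumptions a one-point gap near 90% on a few hundred items is not distinguishable from sampling error, while a ten-point gap is.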

Are scores comparable across labs?

Even identical benchmark names hide evaluation differences: prompt format, system prompt presence, temperature, number of few-shot examples, and whether answers are greedy-decoded or sampled. A gap of 2 percentage points between two models reported by different labs can plausibly be explained by prompt format alone. The most reliable comparison comes from a single third-party evaluation run on both models under identical conditions. The live benchmark comparison table on this site standardises conditions where possible and flags when numbers come from different sources. For a broader framework, see the complete guide to LLM benchmarks.
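A lightweight way to apply this check is to write down the conditions behind each published score and compare them before comparing the numbers. The sketch below is one possible shape for that record; the class name, field names and example values are hypothetical and not taken from any lab's reporting format.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConditions:
    """Conditions under which a reported score was produced (illustrative)."""
    prompt_format: str
    has_system_prompt: bool
    temperature: float
    n_shots: int
    decoding: str  # "greedy" or "sampled"


def differing_conditions(a, b):
    """Map each field that differs to its (a, b) pair of values."""
    da, db = asdict(a), asdict(b)
    return {name: (da[name], db[name]) for name in da if da[name] != db[name]}


# Hypothetical reports of the same benchmark from two labs.
lab_a = EvalConditions("chat", True, 0.0, 5, "greedy")
lab_b = EvalConditions("chat", False, 0.0, 0, "greedy")

diff = differing_conditions(lab_a, lab_b)
if diff:
    print("Not directly comparable; conditions differ:", diff)
else:
    print("Conditions match on every recorded field.")
```

An empty result only means the recorded fields match; anything a lab did not disclose still has to be treated as unknown.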

Key takeaways

Always identify the exact benchmark split, not just the family name.
Tool-assisted and plain-text scores are not comparable.
Demand variance information; a single-run number is not trustworthy.
Small gaps between high scores on a saturated benchmark are noise, not signal.
Cross-lab comparisons require identical evaluation conditions to be valid.