How LLM Leaderboards Work: Elo, Arenas and Pitfalls
LLM leaderboards use Elo ratings, human preference arenas, and automated evals to rank models. Learn how Elo and Bradley-Terry work, why rankings shift, and what pitfalls to watch for.
A leaderboard appears to answer a simple question: which model is best? In practice, each leaderboard answers a slightly different question depending on how it was built — what tasks it covers, how it scores responses, and how it aggregates results into a single ranking. Before you trust a position on any leaderboard, it helps to understand the machinery underneath it.
Elo and Bradley-Terry: the maths behind rankings
Many leaderboards borrow their ranking system from competitive games. Elo was originally designed for chess: each player has a rating, and after each match both ratings update based on the result and the pre-match probability of winning. Beat a much stronger opponent and your rating rises a lot; beat a weaker one and it rises a little. Lose to a weaker opponent and your rating drops sharply.
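The update rule described above can be sketched in a few lines. This is a minimal illustration of the standard Elo formula, not any particular leaderboard's implementation; the step size `k=32` and the example ratings are assumptions chosen for readability.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for an A win, 0.0 for an A loss, 0.5 for a tie.
    k is an assumed step size; real systems tune it.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# An underdog (1400) beating a favourite (1600) gains more than the
# favourite would gain from the expected win.
upset = elo_update(1400, 1600, score_a=1.0)
routine = elo_update(1600, 1400, score_a=1.0)
print(upset, routine)
```

Because each update depends on the ratings at the moment of the match, replaying the same results in a different order gives slightly different final ratings.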
Applied to LLMs, each "match" is a pairwise comparison between two model responses to the same prompt, judged by a human or another model. The Bradley-Terry model is a statistically cleaner alternative: rather than updating ratings sequentially, it fits strengths to all pairwise outcomes at once, so the result does not depend on the order in which votes arrived. Both systems place models on a numeric scale (Chatbot Arena started with online Elo and later moved to a Bradley-Terry fit reported on an Elo-like scale), and the gap between two ratings can be read as the expected probability that one model's response is preferred over the other's.
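To make the "fit everything at once" idea concrete, here is a small sketch of a Bradley-Terry fit using the classic iterative (minorise-maximise) update. It is an illustration under simplifying assumptions: no ties, every model has at least one win and one loss, and the model names and vote counts are made up.

```python
import math
from collections import defaultdict


def fit_bradley_terry(matches, iterations=200):
    """Fit Bradley-Terry strengths from a list of (winner, loser) pairs.

    Assumes no ties and that every model has at least one win and one loss.
    """
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    models = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Strengths are only defined up to scale: fix the geometric mean at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Bradley-Terry probability that a is preferred over b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical vote log for three models.
matches = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
strength = fit_bradley_terry(matches)
print(sorted(strength, key=strength.get, reverse=True))
print(round(win_probability(strength, "A", "C"), 2))
```

Shuffling `matches` leaves the fitted strengths unchanged, which is the practical difference from a sequential Elo update.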
For a plain-language definition of Elo and how it differs from raw accuracy scores, check the Elo entry in the glossary.
Human-preference arenas
The most influential human-preference arena is Chatbot Arena (LMSYS), which collects millions of real user votes. A user submits a prompt, receives two anonymous responses from randomly selected models, and votes for whichever they prefer (or calls it a tie). Votes feed into an Elo-style rating system. Because the prompts come from real users, their distribution is closer to actual use than a curated test set typically is.
Human arenas have real advantages: they are hard to game with targeted fine-tuning, they reflect user preferences directly, and they accumulate enormous sample sizes over time. They also have real weaknesses. The voter population is not representative: it skews toward the technically sophisticated users who seek out such a platform. Preferences for style (longer, more confident answers) can swamp preferences for accuracy. And because the same user both writes the prompt and votes on the result, selection effects are baked in: the prompts reflect what that user chose to ask, and the vote reflects their own taste.
Automated leaderboards and static benchmarks
Alongside arenas, a second class of leaderboard runs models against fixed test sets and reports accuracy on each. These are faster, cheaper, and reproducible — you can rerun them and get the same number. Their weakness is that fixed test sets age: as the research community focuses on a benchmark, models improve on it, and eventually the benchmark stops distinguishing between strong models. This is the saturation problem, explored in depth in why LLM benchmarks saturate.
The way scores are reported also matters enormously. Different prompting strategies (zero-shot, few-shot, chain-of-thought), different answer parsers, and different normalisation choices can shift a score by several percentage points on the same underlying model. The full reading guide is at how to read LLM benchmark scores without being fooled.
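A toy example shows how much the answer parser alone can matter. The model outputs, gold answers, and both parsers below are hypothetical; the point is only that the same responses can score very differently depending on how answers are extracted.

```python
import re

# Hypothetical raw model outputs for four multiple-choice questions.
outputs = ["B", "The answer is (C).", "I think it's A, because ...", "Answer: d"]
gold = ["B", "C", "A", "D"]


def strict_parse(text):
    """Accept only an output that is exactly one letter A-D."""
    text = text.strip()
    return text if text in {"A", "B", "C", "D"} else None


def lenient_parse(text):
    """Take the last standalone letter A-D (any case) anywhere in the output."""
    found = re.findall(r"\b([A-Da-d])\b", text)
    return found[-1].upper() if found else None


def accuracy(parse):
    """Share of outputs whose parsed answer matches the gold label."""
    hits = sum(parse(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)


print(accuracy(strict_parse), accuracy(lenient_parse))  # 0.25 1.0
```

Neither parser is "correct": the lenient one would also pick up the article "a" in ordinary prose. That is why a reported score is only comparable to another score produced with the same extraction rules.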
Why rankings shift and what contamination means for leaderboards
Leaderboard rankings are not stable facts about models; they are measurements that change as the evaluation ecosystem evolves. A ranking can shift because a new model enters the arena: its wins and losses against established models feed into their ratings too. It can shift because the test set is refreshed. It can shift because a model's training data inadvertently included benchmark examples, a problem known as contamination.
Contamination is particularly damaging for static leaderboards. If a model was trained on data that overlaps with the test set, its score reflects memorisation rather than generalisation. Human arenas are more robust to this because real user prompts are not published in advance, but they are not immune: a model tuned specifically to produce outputs that human voters prefer can inflate its arena score without being more capable in any principled sense. See benchmark contamination for a detailed treatment of detection methods and the research community's responses.
Benchmarks like GPQA Diamond were designed with contamination resistance in mind — questions written by domain experts, never published online before the benchmark launched. Even so, as the benchmark ages and its contents circulate, contamination risk rises over time.
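One simple heuristic for spotting overlap between a benchmark item and training text is to count shared word n-grams. The sketch below is an assumption-laden illustration of that idea, not the detection method of any particular benchmark; the whitespace tokenisation, the n-gram length, and the example strings are all arbitrary choices.

```python
def ngrams(text, n):
    """Set of word n-grams in a lowercased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, training_text, n=5):
    """Fraction of the item's n-grams that also appear in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical training snippets.
item = "Which enzyme unwinds the DNA double helix during replication"
leaked = "Quiz dump: which enzyme unwinds the DNA double helix during replication? Helicase."
clean = "Helicase is an enzyme that separates the two strands of DNA."

print(overlap_fraction(item, leaked), overlap_fraction(item, clean))
```

A high overlap fraction is a reason to look closer, not proof of contamination, and paraphrased or translated leaks will slip past a check this simple.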
How to read a leaderboard position critically
A few questions to ask before acting on any leaderboard ranking:
- What tasks does the leaderboard cover? A coding leaderboard and a general-knowledge leaderboard will rank the same models differently.
- Who or what is doing the judging? Human votes, an LLM judge, and automated answer-matching each introduce different biases.
- How large is the sample? Arena Elo ratings stabilise after thousands of votes; a model with only a few hundred comparisons has a wide confidence interval.
- When was the benchmark created relative to the model? A model released after the benchmark was published has a higher contamination risk.
- Are confidence intervals shown? A one-point Elo difference with wide error bars is statistically meaningless.
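The sample-size and confidence-interval questions above can be made concrete with a percentile bootstrap over votes. This is a rough sketch with made-up vote counts; the resample count, the fixed seed, and the 95% level are assumptions, and real leaderboards may compute their intervals differently.

```python
import random


def bootstrap_win_rate_ci(outcomes, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a win rate.

    outcomes is a list of 1 (model A preferred) and 0 (model B preferred).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the votes with replacement and recompute the win rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(resamples)
    )
    lower = rates[int((alpha / 2) * resamples)]
    upper = rates[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper


# Same 55% observed win rate, very different certainty.
small = bootstrap_win_rate_ci([1] * 55 + [0] * 45)      # 100 votes
large = bootstrap_win_rate_ci([1] * 2200 + [0] * 1800)  # 4,000 votes
print(small, large)
```

With only 100 votes the interval is wide enough to include an even split, so the apparent lead may be noise; with 4,000 votes the same win rate is clearly above 50%.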
For a full breakdown of every major benchmark category — and guidance on which ones to trust for which use cases — see the complete guide to LLM benchmarks. Browse current standings in the live benchmark comparison or dive into specific model pages like GPT-5.5 to see how rankings play out across different evaluation systems.
Key takeaways
- Elo and Bradley-Terry turn pairwise comparisons into a single numerical ranking; Elo comes from chess ratings, and Bradley-Terry from the statistics of paired comparisons.
- Human-preference arenas collect real user votes and are harder to game, but they reflect user style preferences as much as capability.
- Automated benchmarks are reproducible and fast but vulnerable to saturation and contamination as they age.
- Rankings shift when new models enter the arena, test sets refresh, or contamination inflates a model's score.
- Always check sample size, confidence intervals, task coverage, and judging method before acting on a leaderboard position.