LLM-as-Judge: How Models Grade Models

Grading a multiple-choice answer is trivial — compare the model's letter to the answer key. Grading an essay, a code review, a long-form analysis, or a nuanced conversation is much harder. Human annotators are expensive, slow, and inconsistent. LLM-as-judge is a technique that uses a capable language model to score open-ended outputs at scale, bridging the gap between the richness of real tasks and the practicality of automated evaluation.

What LLM-as-judge means

In its simplest form, LLM-as-judge works like this: you have a student model that produced some output, and you pass that output — along with the original prompt, optional reference material, and a scoring rubric — to a separate judge model. The judge reads everything and returns a score, a verdict, or a pairwise preference (response A is better than response B). The judge's output is then used as a proxy for human judgement.

This technique powers some of the most widely cited evaluation systems in use today, including MT-Bench, AlpacaEval, and the Arena-Hard leaderboard. It also underlies many internal evaluation pipelines at AI labs, where automated quality checks need to run across millions of model responses. Understanding how it works, and where it fails, is essential for reading benchmark results correctly. For the full context on interpreting scores, see how to read LLM benchmark scores without being fooled.

How rubrics and prompts are structured

The quality of an LLM judge depends heavily on the rubric it receives. A weak rubric asks the judge to "rate this response from 1 to 10." A strong rubric specifies what each score means, which dimensions to evaluate (accuracy, clarity, completeness, safety), and how to handle edge cases like partially correct answers or responses that excel on one dimension while failing on another.
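A minimal sketch of how a structured rubric might be rendered into a judge prompt, and how the score might be read back. The `Rubric` class, its field names, the prompt wording, and the `Score: <integer>` output convention are all assumptions made for illustration; no particular benchmark is being reproduced, and the call to the judge model itself is left out.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rubric:
    """Hypothetical rubric container; the field names are illustrative."""
    dimensions: List[str]                 # e.g. accuracy, clarity, completeness, safety
    score_meanings: Dict[int, str]        # what each numeric score means
    edge_cases: List[str] = field(default_factory=list)


def build_judge_prompt(rubric: Rubric, question: str, response: str) -> str:
    """Render the rubric, the original prompt, and the response into one judge prompt."""
    lines = ["You are grading a response. Evaluate these dimensions:"]
    lines += [f"- {d}" for d in rubric.dimensions]
    lines.append("What each score means:")
    lines += [f"{s}: {m}" for s, m in sorted(rubric.score_meanings.items())]
    if rubric.edge_cases:
        lines.append("Edge cases:")
        lines += [f"- {e}" for e in rubric.edge_cases]
    lines += [
        "",
        f"Question:\n{question}",
        "",
        f"Response:\n{response}",
        "",
        "Explain your reasoning, then finish with a line 'Score: <integer>'.",
    ]
    return "\n".join(lines)


def parse_score(judge_output: str, rubric: Rubric) -> Optional[int]:
    """Return the last 'Score: N' in the judge output, or None if missing or not in the rubric."""
    matches = re.findall(r"Score:\s*(\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if score in rubric.score_meanings else None
```

Returning `None` for a missing or out-of-range score, rather than guessing, keeps malformed judge outputs from silently entering an average.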

Common rubric structures include:

Absolute scoring: the judge assigns a numeric score or Likert rating to a single response against explicit criteria. This is useful when you want a comparable score across many responses.
Pairwise preference: the judge is shown two responses (A and B) and asked which is better, or whether they are tied. This is less sensitive to scale calibration, but the verdicts are harder to aggregate into a single ranking without a rating system such as Elo.
Reference-guided scoring: a gold reference answer is provided alongside the student response, and the judge evaluates factual accuracy relative to it. This reduces the risk that the judge itself hallucinates facts.
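One way to turn pairwise verdicts into a single ranking is an Elo-style update, sketched below. The K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions, not values taken from any specific leaderboard.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    model_a: str,
    model_b: str,
    outcome_a: float,          # 1.0 if the judge preferred A, 0.0 if B, 0.5 for a tie
    k: float = 32.0,           # assumed K-factor
    initial: float = 1000.0,   # assumed starting rating for unseen models
) -> None:
    """Update both ratings in place after one pairwise judgement."""
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (outcome_a - ea)
    # B's change is the mirror image of A's, so total rating is conserved.
    ratings[model_b] = rb - k * (outcome_a - ea)


ratings: Dict[str, float] = {}
# Hypothetical verdicts: (model shown as A, model shown as B, outcome for A).
for a, b, outcome in [("model-x", "model-y", 1.0), ("model-y", "model-z", 0.5)]:
    update_elo(ratings, a, b, outcome)
```

Because Elo updates are applied sequentially, the final ratings depend on the order in which comparisons are processed; shuffling the comparisons and averaging over several passes is one way to reduce that sensitivity.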

Known biases and failure modes

LLM judges are not neutral arbiters. Several systematic biases have been documented in the research literature.

Self-enhancement bias: models tend to prefer responses that match their own style and training distribution. A GPT-4 judge may systematically favour GPT-family outputs, and a judge from another model family may likewise favour its own.
Verbosity bias: longer responses are often scored higher even when concision would be more appropriate. Judges seem to interpret length as effort or completeness.
Position bias: in pairwise comparisons, some judges prefer whichever response appears first (or second) in the prompt, independent of quality. This can be mitigated by averaging over both orderings.
Sycophancy: if the judge model was trained with RLHF optimised for human approval, it may rate confident-sounding but incorrect responses highly.
Hallucination in the judge: the judge itself can produce inaccurate reasoning about factual claims, especially if no reference answer is provided.
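The position-bias mitigation mentioned above, averaging over both orderings, can be sketched as follows. The `Judge` callable and its `"first"` / `"second"` / `"tie"` return values are a hypothetical interface chosen for this example; the two toy judges stand in for a real model call.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_response, second_response)
# -> "first", "second", or "tie". A real judge would call a model here.
Judge = Callable[[str, str, str], str]

# Score credited to the response in the first slot for each verdict.
_FIRST_SLOT_SCORE = {"first": 1.0, "second": 0.0, "tie": 0.5}


def preference_for_a(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> float:
    """Average A's score over both presentation orders (1.0 = A wins, 0.0 = B wins)."""
    a_shown_first = _FIRST_SLOT_SCORE[judge(prompt, resp_a, resp_b)]
    # With B in the first slot the verdict scores B, so invert it to get A's score.
    b_shown_first = 1.0 - _FIRST_SLOT_SCORE[judge(prompt, resp_b, resp_a)]
    return (a_shown_first + b_shown_first) / 2.0


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias: always prefers the first slot."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy judge that is order-independent but has a pure verbosity bias."""
    if len(first) > len(second):
        return "first"
    if len(first) < len(second):
        return "second"
    return "tie"
```

With `always_first`, the two orderings contradict each other and the averaged score collapses to 0.5, so the position preference cancels out. With `prefers_longer`, the averaged score is unchanged by swapping, which shows that this step removes position bias only; other biases such as verbosity need separate controls.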

For definitions of these terms and others, see the glossary. The overlap between judge bias and the broader contamination problem is explored in benchmark contamination.

When LLM-as-judge is reliable

Despite its limitations, LLM-as-judge is a genuinely useful tool when applied carefully. Agreement with human raters is highest in tasks where the judge can verify correctness directly — code that runs, maths that checks out, factual claims that can be looked up. It degrades on subjective tasks (tone, creativity, cultural nuance) and on domains where the judge model has weak expertise.

Practical guidelines for trustworthy LLM-as-judge setups:

Use a judge that is substantially stronger than the student model being evaluated.
Provide explicit, detailed rubrics rather than open-ended rating instructions.
Run pairwise comparisons in both orderings and average the results to cancel position bias.
Cross-validate a random sample against human raters to measure judge–human agreement.
Avoid using a model as its own judge; always use a separate model family where possible.
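Judge–human agreement on a validation sample can be summarised in several ways. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance; choosing kappa is an assumption of this example rather than something the guidelines above prescribe, and the labels shown are made up.

```python
from collections import Counter
from typing import Sequence


def raw_agreement(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items on which the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    observed = raw_agreement(judge_labels, human_labels)
    n = len(judge_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement if both raters labelled independently at their own base rates.
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up pairwise verdicts on the same four comparisons.
judge = ["A", "A", "B", "B"]
human = ["A", "A", "B", "A"]
print(raw_agreement(judge, human), cohens_kappa(judge, human))
```

Raw agreement can look high simply because one label dominates the sample, which is why a chance-corrected statistic is often reported alongside it.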

LLM-as-judge in modern leaderboards

Several of the most prominent leaderboards rely on LLM-as-judge either entirely or in part. Chatbot Arena uses human pairwise votes rather than a model judge, but Arena-Hard replaces human voters with GPT-4 judgements to reduce cost. AlpacaEval uses a GPT-4 judge to decide whether a student model's response is preferred over a fixed reference output.

The practical implication: when you see a leaderboard score, check whether humans or a model did the grading. Model-judged leaderboards are faster and cheaper but carry the biases above. Human-judged leaderboards are slower and noisier in different ways — driven by annotator fatigue and inconsistency rather than systematic model bias. Neither is perfect; both are useful. Compare current model standings in the live benchmark comparison and consult the complete guide to LLM benchmarks for context on which benchmarks use which grading methods.

Key takeaways

LLM-as-judge uses a model to score open-ended outputs at scale, standing in for slow and expensive human annotation.
Rubric quality is critical; well-structured rubrics with explicit criteria produce far more reliable scores.
Known biases include self-enhancement, verbosity, position, and sycophancy — each can be mitigated with careful design.
LLM judges work best when the task has verifiable correct answers and the judge model is significantly stronger than the student.
Cross-validating a sample against human raters is the only reliable way to measure how trustworthy a given judge setup is.