What Is GPQA Diamond? Graduate-Level Reasoning Explained

GPQA Diamond tests LLMs on expert-written graduate-level science questions that even domain PhDs struggle with. Learn what the benchmark measures and why scores are nearing saturation.

GPQA Diamond is the benchmark that researchers reach for when they need to stress-test frontier reasoning: its questions are hard enough that the PhD-level experts who validated them answered only about 65% correctly.

What GPQA stands for and how it works

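GPQA stands for Graduate-Level Google-Proof Q&A. The "Google-proof" label is important: each question was specifically designed so that a smart non-expert cannot answer it by searching the web. In the original study, skilled non-expert validators reached only about 34% accuracy despite unrestricted web access and more than 30 minutes per question on average. Getting a question right demands genuine understanding of the underlying science, not retrieval skill.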

Questions are multiple-choice with four options. Each was written by a domain expert (typically a PhD student or postdoc), then validated by other experts who confirmed the answer and rated difficulty. The dataset covers biology, chemistry, and physics at a level typically reached in the second or third year of a PhD programme.
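Because every question has four options, scoring reduces to plain accuracy against an answer key, with random guessing at 25%. The snippet below is a minimal sketch of that scoring step; the function name and the five-question answer key are made up for illustration and are not real GPQA data or an official harness.

```python
NUM_OPTIONS = 4
CHANCE_BASELINE = 1 / NUM_OPTIONS  # random guessing on four options


def accuracy(predictions, answer_key):
    """Fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)


# Hypothetical five-question example, not real GPQA items.
answer_key = ["A", "C", "B", "D", "A"]
predictions = ["A", "C", "D", "D", "B"]

score = accuracy(predictions, answer_key)
print(f"accuracy={score:.2f}, chance={CHANCE_BASELINE:.2f}")
# prints: accuracy=0.60, chance=0.25
```

Any reported score is best read against that 25% floor rather than against zero.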

What the Diamond subset is

The main GPQA dataset has 448 questions, and the release is organised into three subsets of increasing strictness. GPQA Diamond is the hardest, highest-quality subset, with 198 questions: roughly, items that the expert validators answered correctly and that most non-expert validators got wrong, which marks them as both difficult and unambiguous. It is the Diamond subset that leaderboards almost always report, because the easier subsets now offer too little separation between strong models.

You can see how current models stack up on GPQA Diamond in our live benchmark comparison.

Why expert-level difficulty matters

Most knowledge benchmarks saturate quickly. When every frontier model scores above 85%, the benchmark stops telling you which model is better. GPQA Diamond was deliberately calibrated so that expert humans sit at roughly 65%, giving models a realistic ceiling to chase rather than a floor they have already cleared.

The practical implication: a model that scores well on GPQA Diamond is likely to handle complex multi-step scientific reasoning in production — literature review, hypothesis generation, identifying errors in technical documents. It does not guarantee accuracy on all science questions, because 198 items is a small sample, but it is a meaningful signal for reasoning depth.
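To put a number on the small-sample caveat, here is a rough sketch of the uncertainty around a single score on 198 questions. It uses the standard normal approximation for a proportion and assumes the questions behave like independent samples; the 70% score is a hypothetical input, not a reported result.

```python
import math


def accuracy_interval(score, n_questions, z=1.96):
    """Approximate 95% interval for an accuracy (normal approximation)."""
    standard_error = math.sqrt(score * (1 - score) / n_questions)
    return score - z * standard_error, score + z * standard_error


# Hypothetical model scoring 70% on the 198 Diamond questions.
low, high = accuracy_interval(0.70, 198)
print(f"{low:.3f} to {high:.3f}")
# prints: 0.636 to 0.764
```

Under these assumptions a 70% score is compatible with a true accuracy anywhere from about 64% to 76%, so differences of a few points deserve caution.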

The saturation problem

Frontier models have improved rapidly on GPQA Diamond. Several now score above 70%, which puts them at or above the roughly 65% expert baseline. Once a benchmark is saturated, meaning even the weakest contenders score near the ceiling, it loses discriminative power. The research community is already developing harder successors; see Humanity's Last Exam for the most extreme version of that impulse. For a broader discussion of what saturation means for interpretation, read how to read LLM benchmark scores.
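One way to see the loss of discriminative power is to ask whether a gap between two closely matched models is larger than the noise on 198 questions. The sketch below uses an unpaired two-proportion z statistic with hypothetical scores of 72% and 76%. Because both models answer the same questions, a paired test such as McNemar's would be the more careful choice, so treat this as a rough illustration only.

```python
import math

N_QUESTIONS = 198  # size of GPQA Diamond


def z_score_for_gap(score_a, score_b, n=N_QUESTIONS):
    """z statistic for the gap between two accuracies (unpaired approximation)."""
    standard_error = math.sqrt(
        score_a * (1 - score_a) / n + score_b * (1 - score_b) / n
    )
    return (score_b - score_a) / standard_error


# Hypothetical scores for two models, not taken from any leaderboard.
z = z_score_for_gap(0.72, 0.76)
print(f"z = {z:.2f}")
# prints: z = 0.91
```

A z value below about 1.96 means a four-point gap at this sample size is not distinguishable from chance at the usual 95% level.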

GPQA Diamond vs other reasoning benchmarks

GPQA Diamond occupies a specific niche: closed-book, multiple-choice, graduate-level science. It is harder than MMLU but easier than Humanity's Last Exam, and unlike agentic evals it requires no tool use or multi-turn reasoning — just a single well-formed answer. That makes it a clean, reproducible signal for pure reasoning quality. For context on where it fits in the wider evaluation ecosystem, see the complete guide to LLM benchmarks.

Key takeaways

  • GPQA Diamond is 198 expert-validated, Google-proof science questions at PhD level; random chance scores 25%, expert humans average ~65%.
  • The "Diamond" label refers to the hardest, highest-quality subset of the 448-question GPQA dataset, and it is the subset that leaderboards report.
  • Good scores signal genuine multi-step scientific reasoning, not just pattern matching or retrieval.
  • Frontier models now match or exceed the ~65% expert baseline, pushing the benchmark toward saturation; harder successors like Humanity's Last Exam are taking over the frontier role.
