
What Is Humanity’s Last Exam? The Frontier Reasoning Benchmark

Humanity's Last Exam (HLE) is a 2,500-question benchmark of extreme-difficulty problems authored by domain experts. Learn how it works, the with-tools vs no-tools split, and why it resists saturation.


Humanity's Last Exam was built on a provocative premise: collect the hardest questions that credentialed experts across every academic discipline can write, and see whether AI can solve them. The answer, so far, is "not yet."

What Humanity's Last Exam is

Humanity's Last Exam (HLE) is a benchmark of 2,500 questions sourced from domain experts (mathematicians, physicists, chemists, historians, economists, linguists, and many others) through an open submission process. Each question was accepted only if it met three criteria: it had an unambiguous correct answer that could be verified automatically, it was not easily answerable by a web search, and it sat at the outer edge of what the submitting expert considered solvable.

The result is a dataset that spans mathematics, natural sciences, humanities, law, medicine, and engineering at a level of difficulty well beyond standard graduate coursework. Some questions require combining insights from multiple disciplines; others demand calculations that an expert would take hours to verify.

You can see how current models perform on Humanity's Last Exam in our live benchmark comparison table.

With-tools vs no-tools scoring

HLE is evaluated under two conditions, and the distinction matters enormously for interpreting a score:

  • No-tools (closed-book): the model receives only the question text. It must rely entirely on parametric knowledge encoded during training and on its own reasoning over that knowledge. This is the purest test of unaided ability, and scores in this setting are substantially lower than with-tools scores.
  • With-tools: the model can use a code interpreter, a web search tool, or a calculator. Many HLE questions involve computation or fact-checking steps that a tool handles trivially once the model knows the right approach. Scores in this setting can be significantly higher, but the benchmark then partially measures tool-use strategy rather than raw reasoning.

When comparing HLE scores across models or papers, always confirm which condition applies. A with-tools score and a no-tools score measure different things and should not be placed on the same scale.
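One way to enforce that rule in your own tracking is to store the evaluation condition next to every score and refuse to compare across conditions. The following is a minimal sketch, not part of any official HLE tooling: the `HLEScore` record, the `no_tools` / `with_tools` labels, and both helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels for the two evaluation conditions (an assumption,
# not an official HLE naming scheme).
CONDITIONS = ("no_tools", "with_tools")


@dataclass(frozen=True)
class HLEScore:
    model: str       # placeholder model name
    condition: str   # one of CONDITIONS
    accuracy: float  # fraction of questions answered correctly, 0.0 to 1.0


def rank_within_condition(scores, condition):
    """Return only the scores reported under `condition`, best first."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    same = [s for s in scores if s.condition == condition]
    return sorted(same, key=lambda s: s.accuracy, reverse=True)


def accuracy_gap(a, b):
    """Return a.accuracy - b.accuracy, refusing to mix conditions."""
    if a.condition != b.condition:
        raise ValueError(
            f"cannot compare {a.condition} score with {b.condition} score"
        )
    return a.accuracy - b.accuracy
```

Raising an error on a mixed comparison is a deliberate choice in this sketch: it makes the mistake visible at the point where two numbers are about to be put on the same scale, rather than leaving it to the reader of a chart.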

Why HLE resists saturation

Most benchmarks saturate within a few years of release: training data expands, models improve, and scores converge near the ceiling. HLE was engineered to delay that outcome in several ways.

First, the questions are novel: they were written for the benchmark and submitted after the training cutoffs of most models available at release, which kept contamination low at launch. Second, the difficulty is calibrated to what experts find hard, not what students find hard. Third, correct answers require multi-step reasoning chains that are hard to shortcut by pattern matching.

Early frontier model scores on HLE (no-tools) clustered in the single digits. As of 2026, the best models score roughly 20-30% on no-tools, and no model is yet close to expert-level performance across the full dataset. For context on why saturation undermines benchmarks generally, read how to read LLM benchmark scores.

How HLE compares to GPQA Diamond

GPQA Diamond and Humanity's Last Exam both target expert-level difficulty, but they differ in scope and structure. GPQA Diamond is 198 questions focused narrowly on biology, chemistry, and physics at PhD level. HLE is 2,500 questions spanning a much wider range of academic disciplines, with no subject weighting: a model that excels at science but struggles in law or history still loses every point available in its weak subjects, and the overall score reflects that.
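To make the effect of an unweighted score concrete, here is a small sketch. It assumes that "no subject weighting" means every question counts equally toward the overall accuracy, and the function name and the `(subject, is_correct)` input format are hypothetical, not an official HLE results schema.

```python
from collections import defaultdict


def overall_and_per_subject(results):
    """Compute unweighted overall accuracy plus a per-subject breakdown.

    `results` is an iterable of (subject, is_correct) pairs, one per
    question. Every question counts equally toward the overall score
    (an assumption about what "no subject weighting" means).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, is_correct in results:
        totals[subject] += 1
        correct[subject] += int(is_correct)

    n_questions = sum(totals.values())
    if n_questions == 0:
        raise ValueError("no results to score")

    overall = sum(correct.values()) / n_questions
    per_subject = {s: correct[s] / totals[s] for s in totals}
    return overall, per_subject
```

With ten toy questions where a model answers 3 of 4 physics questions correctly and misses all 4 law and both history questions, the function returns an overall accuracy of 0.3 next to a physics accuracy of 0.75. The same arithmetic also means that subjects with more questions contribute more to the headline number, so a per-subject breakdown is worth reporting alongside it.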

GPQA Diamond is now close to saturation for the top frontier models; HLE is not. If you are evaluating a model's reasoning ceiling in 2026, HLE is the more informative benchmark. For a survey of where both fit in the broader evaluation landscape, see the complete guide to LLM benchmarks.

Limitations of Humanity's Last Exam

HLE is not perfect. The question set skews toward subjects where a precise, verifiable answer exists, so mathematics and formal sciences are over-represented relative to, say, interpretive humanities. The automated verification requirement also rules out open-ended questions where expert judgment would be needed to score responses, which excludes some of the hardest intellectual tasks humans perform. Additionally, if AI labs train against HLE specifically, or if its public questions leak into training data, it will eventually face the same contamination pressure as every other benchmark.

Key takeaways

  • Humanity's Last Exam is 2,500 expert-authored questions spanning all major academic disciplines, calibrated to the outer edge of what the submitting experts consider solvable.
  • Scores must be read with the evaluation condition in mind: no-tools and with-tools results are not comparable.
  • HLE resists saturation through novelty, breadth, and calibrated difficulty — early frontier models scored in single digits and the best 2026 models are still below 30% on no-tools.
  • It complements rather than replaces GPQA Diamond: GPQA is narrower and more saturated; HLE is broader and further from the ceiling.
  • Contamination remains a future risk as training data expands to include public benchmark questions.
