What Is HumanEval? The Classic Code-Generation Benchmark

HumanEval was the benchmark that made automated code evaluation mainstream: when OpenAI published it alongside Codex in 2021, it gave the field a reproducible, executable way to measure whether a model could write correct Python functions from a docstring alone.

How HumanEval works

HumanEval consists of 164 handwritten Python programming problems. Each problem provides a function signature and a docstring describing what the function should do; the model must complete the function body. A set of unit tests that the model never sees in its prompt is then run against the generated code, and a completion counts as correct only if every test passes.
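To make the format concrete, here is a minimal sketch of one HumanEval-style task and a check of a candidate completion. The `running_max` problem, its tests, and the `passes` helper are hypothetical illustrations written for this article, not an item from the dataset or the official harness. A real harness would also sandbox execution and enforce timeouts.

```python
# A hypothetical HumanEval-style task (not taken from the dataset).
PROMPT = '''def running_max(numbers):
    """Return a list where element i is the largest value in numbers[:i + 1].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
'''

# What a model might return: only the function body.
COMPLETION = '''    result = []
    best = None
    for n in numbers:
        best = n if best is None or n > best else best
        result.append(best)
    return result
'''

# Unit tests that are never shown to the model in its prompt.
TESTS = '''def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -5, -1]) == [-2, -2, -1]
'''


def passes(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Run prompt + completion, then the tests; True only if every assert holds.

    Assumption: plain exec() is used for illustration only. Never run
    untrusted model output this way outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)   # define the candidate function
        exec(tests, namespace)                 # define check()
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any failed assert, runtime error, or syntax error counts as a fail.
        return False


print(passes(PROMPT, COMPLETION, TESTS, "running_max"))  # True
```

The all-or-nothing return value mirrors how the benchmark scores a sample: one failed assertion, exception, or syntax error makes the whole completion incorrect.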

The problems cover basic algorithms, data structures, string manipulation, and simple mathematics — roughly the level of an introductory programming course or a straightforward coding interview question. There are no external libraries, no multi-file projects, and no interaction with a runtime environment beyond running Python.

The pass@k metric: HumanEval's lasting contribution

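HumanEval established pass@k as the standard reporting metric for code generation. The metric itself appeared in earlier program-synthesis work; the Codex paper popularized it and supplied an unbiased estimator. The idea is simple: pass@k is the probability that at least one of k independent samples for a problem passes all tests, averaged over all problems. Drawing exactly k samples and checking them gives a high-variance estimate, so in practice you generate n ≥ k samples per problem, count the c samples that pass, and compute 1 - C(n - c, k) / C(n, k), which is the chance that a random size-k subset of the n samples contains at least one correct solution.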
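The estimator can be sketched in a few lines of Python. This is a minimal version using only the standard library; the function names and the sample counts are illustrative assumptions, and the reference implementation released with HumanEval uses a numerically stable product form rather than calling `math.comb` directly.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every size-k subset must include a passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, 20 samples each.
results = [(20, 20), (20, 3), (20, 0)]
print(round(benchmark_pass_at_k(results, 1), 3))   # 0.383
print(round(benchmark_pass_at_k(results, 10), 3))  # 0.632
```

In this made-up example the second problem is solved by only 3 of 20 samples, so it contributes 0.15 to pass@1 but about 0.89 to pass@10. That gap is why larger k reads as a ceiling rather than as typical first-try behavior.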

pass@1 (a single attempt succeeds) is the most practical metric for real use. pass@10 or pass@100 tells you about the model's ceiling: how often the right solution is somewhere in the distribution, even if it does not surface on the first try. For a thorough explanation of the math and the tradeoffs, see our pass@k explainer. This metric spread far beyond HumanEval and is now used in math benchmarks like AIME — covered in What Is AIME? — and agentic evaluations alike.

Why HumanEval saturated

By 2023, frontier models were regularly scoring above 85% pass@1 on HumanEval. By 2024, scores above 90% were common, and the benchmark had effectively ceased to differentiate between the best models. Saturation happened for two reasons:

Training data overlap: HumanEval problems and their solutions circulated widely on GitHub, Stack Overflow, and tutorials, making it likely that models had seen near-identical problems during pretraining.
Problem difficulty: The 164 problems were designed for introductory programmers, not for testing the limits of a system that can read millions of lines of code. Frontier models outgrew the benchmark quickly.

Saturation is a recurring challenge across the evaluation landscape. Our post on why benchmarks saturate explores the general pattern, and the benchmark contamination explainer goes deeper on the training-data overlap problem specifically.

From HumanEval to SWE-bench: raising the bar

The community responded to saturation by building harder coding evals. The most significant successor is SWE-bench Verified, which replaces self-contained function completion with resolving real GitHub issues in real codebases. Where HumanEval asks a model to write roughly ten lines of code from a docstring, SWE-bench asks it to navigate a project that can run to tens of thousands of lines, identify the root cause of a reported bug, and produce a patch that makes the project's tests pass, with no hints about which files to touch. For the full story, see our SWE-bench explainer.

LiveCodeBench takes a different approach to the saturation problem: rather than increasing task complexity, it continuously sources fresh problems to avoid training-data leakage. You can read more in What Is LiveCodeBench?.
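One way to picture the fresh-problem idea is a date filter: score a model only on problems published after its training cutoff. The sketch below assumes each problem carries a release date. The `Problem` record, the example entries, and the cutoff date are all hypothetical, and this is not LiveCodeBench's actual tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    """Hypothetical problem record; only the release date matters here."""
    task_id: str
    released: date


def uncontaminated(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


problems = [
    Problem("old-1", date(2023, 1, 15)),
    Problem("old-2", date(2023, 9, 1)),
    Problem("new-1", date(2024, 3, 10)),
]

fresh = uncontaminated(problems, training_cutoff=date(2023, 12, 31))
print([p.task_id for p in fresh])  # ['new-1']
```

A static benchmark like HumanEval has no equivalent of this filter, because all of its problems predate every current model's training data.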

Is HumanEval still worth reporting?

Despite saturation, HumanEval scores remain useful as a sanity-check baseline. A model that struggles on HumanEval will almost certainly struggle everywhere; a model that aces it has cleared a minimum bar but nothing more. Research papers still report it for comparability with historical results. Practitioners choosing a model today should pair it with a harder eval, such as SWE-bench for agentic coding tasks, or consult our best LLM for coding roundup, which aggregates multiple signals. You can browse current scores for models like Claude Opus 4.8 and GPT-5.5 on the live benchmark comparison. For the broader context of how coding benchmarks fit into the evaluation ecosystem, the complete guide to LLM benchmarks is the best starting point.

Key takeaways

HumanEval is 164 Python function-completion problems scored by unit tests the model never sees in its prompt. It is the benchmark that established automated code eval.
It popularized pass@k and its unbiased estimator, which became the standard metric for code generation and spread to math and agentic benchmarks.
Frontier models now score above 90% pass@1, making HumanEval a floor-check rather than a differentiator.
Saturation was driven by both problem simplicity and training-data overlap; the benchmark was not designed to resist either.
SWE-bench and LiveCodeBench are the current standards for rigorous coding evaluation; HumanEval is retained mainly for historical comparability.