What Is HumanEval? The Classic Code-Generation Benchmark
HumanEval is OpenAI's original function-synthesis benchmark that popularized pass@k evaluation. Learn its design, why it saturated, and how SWE-bench replaced it as the coding standard.
HumanEval was the benchmark that made automated code evaluation mainstream: when OpenAI published it alongside Codex in 2021, it gave the field a reproducible, executable way to measure whether a model could write correct Python functions from a docstring alone.
How HumanEval works
HumanEval consists of 164 handwritten Python programming problems. Each problem provides a function signature and a docstring describing what the function should do; the model must complete the function body. A set of unit tests, which ships with the dataset but is not shown to the model in the prompt, then executes the generated code and checks whether it produces the correct output for a range of inputs.
The problems cover basic algorithms, data structures, string manipulation, and simple mathematics — roughly the level of an introductory programming course or a straightforward coding interview question. There are no external libraries, no multi-file projects, and no interaction with a runtime environment beyond running Python.
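The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the official harness: the field names prompt, test, and entry_point follow the published dataset, but the running_max problem, the completion, and the passes helper are made up for this example.

```python
# Hypothetical record in the style of a HumanEval problem (content is invented).
problem = {
    "prompt": (
        "def running_max(xs):\n"
        '    """Return a list whose i-th item is the max of xs[:i+1]."""\n'
    ),
    "entry_point": "running_max",
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]\n"
        "    assert candidate([]) == []\n"
    ),
}

# A model completion: only the function body, indented to sit under the prompt.
completion = (
    "    out, best = [], None\n"
    "    for x in xs:\n"
    "        best = x if best is None else max(best, x)\n"
    "        out.append(best)\n"
    "    return out\n"
)


def passes(problem, completion):
    """Return True if prompt + completion passes every assertion in the test."""
    program = (
        problem["prompt"] + completion + "\n"
        + problem["test"] + "\n"
        + f"check({problem['entry_point']})\n"
    )
    namespace = {}
    try:
        # Illustration only: a real harness should run untrusted code in a
        # sandboxed subprocess with a timeout, not a bare exec().
        exec(program, namespace)
    except Exception:
        return False
    return True
```

Calling passes(problem, completion) returns True here, while a wrong body such as "    return xs\n" fails the first assertion and returns False. The key design point is that correctness is decided by execution, not by textual similarity to a reference solution.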
The pass@k metric: HumanEval's lasting contribution
HumanEval popularized pass@k as the standard reporting metric for code generation. The idea is simple: generate k independent code samples for each problem, and report the fraction of problems for which at least one of those k samples passes all tests. Drawing exactly k samples per problem gives a noisy estimate, so the Codex paper instead generates n ≥ k samples, counts how many pass, and uses a combinatorial formula to compute an unbiased estimate of the probability that at least one of k randomly chosen samples would pass.
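For a problem with n generated samples of which c pass all tests, the unbiased estimator from the Codex paper is 1 - C(n - c, k) / C(n, k), averaged over problems. The sketch below implements that formula with exact integer binomials; the function names and the (n, c) input format are assumptions chosen for this example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample. math.comb returns 0 when
    # k > n - c, so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average the per-problem estimates; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = 10 and c = 3, pass_at_k(10, 3, 1) is about 0.30 (the raw pass rate), while pass_at_k(10, 3, 5) is about 0.92, which shows how quickly the score rises once several attempts are allowed.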
pass@1 (a single attempt succeeds) is the most practical metric for real use. pass@10 or pass@100 tells you about the model's ceiling: how often the right solution is somewhere in the distribution, even if it does not surface on the first try. For a thorough explanation of the math and the tradeoffs, see our pass@k explainer. This metric spread far beyond HumanEval and is now used in math benchmarks like AIME — covered in What Is AIME? — and agentic evaluations alike.
Why HumanEval saturated
By 2023, frontier models were regularly scoring above 85% pass@1 on HumanEval. By 2024, scores above 90% were common, and the benchmark had effectively ceased to differentiate between the best models. Saturation happened for two reasons:
- Training data overlap: HumanEval problems and their solutions circulated widely on GitHub, Stack Overflow, and tutorials, making it likely that models had seen near-identical problems during pretraining.
- Problem difficulty: The 164 problems were designed for introductory programmers, not for testing the limits of a system that can read millions of lines of code. Frontier models outgrew the benchmark quickly.
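One rough way to probe for the training-data overlap described above is to measure how many of a benchmark item's n-grams also appear in a sample of corpus text. The sketch below is a naive illustration only: the function names, whitespace tokenization, and default window of 8 tokens are assumptions for this example, not a description of how any particular lab checks for contamination.

```python
def ngrams(text, n):
    """Return the set of n-token windows in text, split on whitespace."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_text, corpus_text, n=8):
    """Fraction of the benchmark item's n-grams that also occur in corpus_text."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)
```

A value near 1.0 suggests the item appears close to verbatim in the corpus sample; a low value does not rule out paraphrased or translated copies, which exact matching cannot detect.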
Saturation is a recurring challenge across the evaluation landscape. Our post on why benchmarks saturate explores the general pattern, and the benchmark contamination explainer goes deeper on the training-data overlap problem specifically.
From HumanEval to SWE-bench: raising the bar
The community responded to saturation by building harder coding evals. The most significant successor is SWE-bench Verified, which replaces self-contained function completion with resolving real GitHub issues in real codebases. Where HumanEval asks a model to write roughly 10 lines of code given a docstring, SWE-bench asks it to navigate a 50,000-line project, identify the root cause of a reported bug, and produce a patch that makes the tests tied to the issue pass without breaking tests that already passed, with no hints about which files to touch. For the full story, see our SWE-bench explainer.
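The pass criterion also changes shape. SWE-bench task records list two groups of tests, FAIL_TO_PASS (tests the patch must fix) and PASS_TO_PASS (tests that must keep passing). A minimal sketch of that decision rule follows; the is_resolved function and the dictionary of per-test outcomes are hypothetical names for this example, not part of the official harness.

```python
def is_resolved(test_status, fail_to_pass, pass_to_pass):
    """Decide whether a patch resolves a task.

    test_status maps a test name to True if it passed after the patch was
    applied; a test missing from the mapping is treated as failed.
    """
    fixed = all(test_status.get(name, False) for name in fail_to_pass)
    unbroken = all(test_status.get(name, False) for name in pass_to_pass)
    return fixed and unbroken
```

Unlike a HumanEval check, a patch that fixes the reported bug but breaks an unrelated, previously passing test counts as a failure under this rule.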
LiveCodeBench takes a different approach to the saturation problem: rather than increasing task complexity, it continuously sources fresh problems to avoid training-data leakage. You can read more in What Is LiveCodeBench?.
Is HumanEval still worth reporting?
Despite saturation, HumanEval scores remain useful as a sanity-check baseline. A model that struggles on HumanEval will almost certainly struggle on harder coding benchmarks; a model that aces it has cleared a minimum bar but nothing more. Research papers still report it for comparability with historical results. For practitioners choosing a model today, pair it with a harder eval: SWE-bench for agentic coding tasks, or our best LLM for coding roundup, which aggregates multiple signals. You can browse current scores for models like Claude Opus 4.8 and GPT-5.5 on the live benchmark comparison. For the broader context of how coding benchmarks fit into the evaluation ecosystem, the complete guide to LLM benchmarks is the best starting point.
Key takeaways
- HumanEval is 164 Python function-completion problems scored by executing unit tests the model never sees in its prompt; it is the benchmark that established automated code eval.
- It popularized pass@k, along with an unbiased way to estimate it, which became the standard metric for code generation and spread to math and agentic benchmarks.
- Frontier models now score above 90% pass@1, making HumanEval a floor-check rather than a differentiator.
- Saturation was driven by both problem simplicity and training-data overlap — the benchmark was not designed to resist either.
- SWE-bench and LiveCodeBench are the current standards for rigorous coding evaluation; HumanEval is retained mainly for historical comparability.