What Is HumanEval? The Classic Code-Generation Benchmark

HumanEval is OpenAI's original function-synthesis benchmark, the one that made pass@k the standard way to report code-generation results. Learn its design, why it saturated, and how SWE-bench replaced it as the coding standard.

HumanEval was the benchmark that made automated code evaluation mainstream: when OpenAI published it alongside Codex in 2021, it gave the field a reproducible, executable way to measure whether a model could write correct Python functions from a docstring alone.

How HumanEval works

HumanEval consists of 164 handwritten Python programming problems. Each problem provides a function signature and a docstring describing what the function should do; the model must complete the function body. A set of unit tests then executes the generated code and checks whether it produces the correct output for a range of inputs. The tests are not shown to the model in the prompt, although they are published with the dataset.
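To make the format concrete, here is a minimal sketch of one task and a toy checker. The problem (`running_max`), its tests, and the `passes` helper are invented for illustration and are not taken from the benchmark; only the overall shape (a prompt, a completed body, a `check(candidate)` test function, and a named entry point) follows the released dataset.

```python
# Hypothetical HumanEval-style task: the prompt is a signature plus docstring.
PROMPT = '''def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i + 1].
    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
'''

# What a model might return: only the function body.
COMPLETION = '''    result = []
    best = None
    for n in numbers:
        best = n if best is None else max(best, n)
        result.append(best)
    return result
'''

# Unit tests in the style of the benchmark: a check() function of assertions.
TEST = '''def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Run prompt + completion against the tests; True if every assertion holds.

    WARNING: exec() on model output is unsafe. A real harness would run this
    in a sandboxed subprocess with a timeout.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion + "\n" + test, namespace)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


print(passes(PROMPT, COMPLETION, TEST, "running_max"))  # True
```

A completion counts as correct only if every assertion holds; there is no partial credit for passing some of the tests.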

The problems cover basic algorithms, data structures, string manipulation, and simple mathematics — roughly the level of an introductory programming course or a straightforward coding interview question. There are no external libraries, no multi-file projects, and no interaction with a runtime environment beyond running Python.

The pass@k metric: HumanEval's lasting contribution

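HumanEval established pass@k as the standard reporting metric for code generation. The metric itself predates the benchmark, but the Codex paper popularized it and introduced the unbiased estimator now in common use. The idea is simple: pass@k is the fraction of problems for which at least one of k sampled solutions passes all tests. Drawing exactly k samples per problem gives a noisy estimate, so the evaluator instead generates n ≥ k samples, counts the c that pass, and computes 1 - C(n - c, k) / C(n, k), the probability that a random size-k subset of the n samples contains at least one passing solution. The benchmark score is the average of that value over all problems.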
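The combinatorial estimate can be sketched in a few lines. For one problem with n generated samples, of which c pass, the estimate is 1 - C(n - c, k) / C(n, k). The function and variable names below are our own, and this is an illustrative version rather than the reference implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem.

    n: samples generated, c: samples that passed all tests, k: budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# One problem, 10 samples, 3 of them correct.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

With 3 correct samples out of 10, a single attempt succeeds 30% of the time, while a budget of five attempts finds a passing solution about 92% of the time.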

pass@1 (a single attempt succeeds) is the most practical metric for real use. pass@10 or pass@100 tells you about the model's ceiling: how often the right solution is somewhere in the distribution, even if it does not surface on the first try. For a thorough explanation of the math and the tradeoffs, see our pass@k explainer. This metric spread far beyond HumanEval and is now used in math benchmarks like AIME — covered in What Is AIME? — and agentic evaluations alike.

Why HumanEval saturated

By 2023, frontier models were regularly scoring above 85% pass@1 on HumanEval. By 2024, scores above 90% were common, and the benchmark had effectively ceased to differentiate between the best models. Saturation happened for two reasons:

  • Training data overlap: HumanEval problems and their solutions circulated widely on GitHub, Stack Overflow, and tutorials, making it likely that models had seen near-identical problems during pretraining.
  • Problem difficulty: The 164 problems were designed for introductory programmers, not for testing the limits of a system that can read millions of lines of code. Frontier models outgrew the benchmark quickly.
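One common heuristic for spotting training-data overlap is to measure how many of a benchmark item's word n-grams also appear in a training corpus. The sketch below is a simplified illustration under our own assumptions (whitespace tokenization, an 8-word window, hypothetical function names); it is not the procedure any particular lab uses.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All distinct runs of n whitespace-separated tokens in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0  # item is shorter than n tokens; nothing to compare
    return len(bench & ngrams(corpus_text, n)) / len(bench)
```

A ratio near 1.0 suggests the item appears almost verbatim in the corpus. A low ratio does not prove the item is clean, since paraphrased or reformatted copies share few exact n-grams.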

Saturation is a recurring challenge across the evaluation landscape. Our post on why benchmarks saturate explores the general pattern, and the benchmark contamination explainer goes deeper on the training-data overlap problem specifically.

From HumanEval to SWE-bench: raising the bar

The community responded to saturation by building harder coding evals. The most significant successor is SWE-bench Verified, a human-validated subset of SWE-bench, which replaces self-contained function completion with resolving real GitHub issues in real codebases. Where HumanEval asks a model to write a few lines of code given a docstring, SWE-bench asks it to navigate a large multi-file repository, identify the root cause of a reported issue, and produce a patch that makes the issue's failing tests pass without breaking tests that already passed, with no hints about which files to touch. For the full story, see our SWE-bench explainer.
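The pass criterion is stricter than HumanEval's as well. A SWE-bench task lists the tests that the fix should turn from failing to passing and the tests that must keep passing. A minimal sketch of that check, with a hypothetical `is_resolved` helper and a simplified results format of our own, might look like this:

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """True if the patch fixes the target tests without breaking others.

    test_results maps test name -> whether it passed after applying the patch.
    A test missing from the results is treated as failing.
    """
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    unbroken = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and unbroken


# Hypothetical run: the bug is fixed, but a previously passing test now fails.
results = {"test_bugfix": True, "test_existing_a": True, "test_existing_b": False}
print(is_resolved(results, ["test_bugfix"], ["test_existing_a", "test_existing_b"]))  # False
```

A patch that fixes the reported bug but introduces a regression does not count, which is part of what makes the task harder than completing a single function.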

LiveCodeBench takes a different approach to the saturation problem: rather than increasing task complexity, it continuously sources fresh problems to avoid training-data leakage. You can read more in What Is LiveCodeBench?

Is HumanEval still worth reporting?

Despite saturation, HumanEval scores remain useful as a sanity-check baseline. A model that struggles on HumanEval will almost certainly struggle everywhere; a model that aces it has cleared a minimum bar but nothing more. Research papers still report it for comparability with historical results. Practitioners choosing a model today should pair it with a harder eval: SWE-bench for agentic coding tasks, or our best LLM for coding roundup, which aggregates multiple signals. You can browse current scores for models like Claude Opus 4.8 and GPT-5.5 on the live benchmark comparison. For the broader context of how coding benchmarks fit into the evaluation ecosystem, the complete guide to LLM benchmarks is the best starting point.

Key takeaways

  • HumanEval is 164 Python function-completion problems evaluated by unit tests that the model does not see in the prompt; it is the benchmark that established automated code eval.
  • It popularized pass@k and its unbiased estimator, which became the standard metric for code generation and spread to math and agentic benchmarks.
  • Frontier models now score above 90% pass@1, making HumanEval a floor-check rather than a differentiator.
  • Saturation was driven by both problem simplicity and training-data overlap — the benchmark was not designed to resist either.
  • SWE-bench and LiveCodeBench are the current standards for rigorous coding evaluation; HumanEval is retained mainly for historical comparability.
