
What Is LiveCodeBench? Contamination-Free Coding Evals

LiveCodeBench uses a rolling time window of competitive programming problems to prevent training-data leakage. Learn how it works, why contamination matters, and how it compares to SWE-bench.


LiveCodeBench was built to answer a simple but uncomfortable question: if a model's training data contains the solutions to your benchmark problems, are you measuring reasoning or recall?

The contamination problem LiveCodeBench solves

Static coding benchmarks like HumanEval publish their problem sets once. Over time those problems, and their solutions, spread across GitHub, forums, and tutorials. A model trained on data collected after a benchmark's release may have seen many of those solutions, so its score becomes a mix of genuine capability and memorization. This is the core concern explained in depth in our benchmark contamination explainer.

LiveCodeBench sidesteps the problem structurally: it sources problems exclusively from competitive programming contests (LeetCode, Codeforces, AtCoder) that were published after a configurable cutoff date. When a model's training data has a knowledge cutoff of, say, April 2024, evaluating it on problems from May 2024 onward removes most overlap by construction. The benchmark is "live" in the sense that the problem pool grows continuously as new contests are held.
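The cutoff-based selection described above can be sketched in a few lines. This is a minimal illustration, not LiveCodeBench's actual loader: the `Problem` record, its field names, the problem IDs, and the `problems_after_cutoff` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the field names are illustrative,
    # not LiveCodeBench's real schema.
    problem_id: str
    platform: str
    released: date


def problems_after_cutoff(problems, training_cutoff):
    """Keep only problems released strictly after the training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up pool spanning the three contest platforms named in the article.
pool = [
    Problem("lc-001", "LeetCode", date(2024, 3, 17)),
    Problem("cf-002", "Codeforces", date(2024, 5, 4)),
    Problem("ac-003", "AtCoder", date(2024, 6, 22)),
]

# A model with an April 2024 cutoff is scored on May 2024 onward only.
eligible = problems_after_cutoff(pool, training_cutoff=date(2024, 4, 30))
print([p.problem_id for p in eligible])  # ['cf-002', 'ac-003']
```

The comparison is strict on purpose: a problem released on the cutoff date itself could plausibly be in the training data, so it is excluded.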

How the time-windowed evaluation works

Each LiveCodeBench evaluation specifies a date range, for example problems published between 2024-05-01 and 2024-10-31. This has two consequences:

  • Scores are only comparable when the same date range is used. A model evaluated on a 6-month window ending in late 2024 cannot be directly compared to one evaluated on a 12-month window ending in early 2025.
  • As new contests are held, the problem pool refreshes. Researchers can re-run evaluations on the latest window to check whether a model's relative standing has changed.
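One way to enforce the comparability rule in your own tooling is to store the date window alongside every score and refuse to compare mismatched windows. The sketch below assumes a hypothetical `WindowedScore` container and `score_gap` helper; the model names and scores are placeholders, not real results.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WindowedScore:
    # Hypothetical container: a score always travels with its date window.
    model: str
    window_start: date
    window_end: date
    pass_at_1: float


def score_gap(a, b):
    """Return a.pass_at_1 - b.pass_at_1, refusing mismatched windows."""
    if (a.window_start, a.window_end) != (b.window_start, b.window_end):
        raise ValueError("scores use different date windows")
    return a.pass_at_1 - b.pass_at_1


# Placeholder scores for illustration only.
same_a = WindowedScore("model-a", date(2024, 5, 1), date(2024, 10, 31), 0.62)
same_b = WindowedScore("model-b", date(2024, 5, 1), date(2024, 10, 31), 0.58)
other = WindowedScore("model-c", date(2024, 5, 1), date(2025, 4, 30), 0.60)

# Same window: the comparison is allowed.
print(round(score_gap(same_a, same_b), 2))

# Different window: the comparison is rejected instead of silently misleading.
try:
    score_gap(same_a, other)
except ValueError as err:
    print(f"refused: {err}")
```

Raising an error rather than returning a number makes a mismatched comparison impossible to overlook in a report or dashboard.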

Each contest problem has several parts: a natural-language description, example inputs and outputs, and hidden judge test cases. Evaluation is pass@1 against those hidden tests: a solution counts only if it passes every one of them, the same binary pass/fail logic that SWE-bench Verified uses.
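The scoring logic can be illustrated with a toy harness. This is a simplified sketch: real judges run submitted programs in a sandbox with time and memory limits, whereas here candidates are plain Python callables. The `solves` and `pass_at_1` helpers and the sample problem are hypothetical.

```python
def solves(candidate, hidden_tests):
    """True only if the candidate matches the expected output on every test."""
    for test_input, expected in hidden_tests:
        try:
            if candidate(test_input) != expected:
                return False
        except Exception:
            # A crash on any hidden test counts as a failure.
            return False
    return True


def pass_at_1(results):
    """Fraction of problems whose single sampled solution passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


# Toy problem: return the sum of a list. The second test is an edge case.
hidden_tests = [([1, 2, 3], 6), ([], 0)]


def correct(xs):
    return sum(xs)


def buggy(xs):
    # Right answer on the first test, IndexError on the empty list.
    return xs[0] + sum(xs[1:])


results = [solves(correct, hidden_tests), solves(buggy, hidden_tests)]
print(results)             # [True, False]
print(pass_at_1(results))  # 0.5
```

The buggy candidate passes one of two hidden tests but still scores zero for that problem, which is what binary pass/fail means: there is no partial credit.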

LiveCodeBench vs SWE-bench: different failure modes

LiveCodeBench and SWE-bench both evaluate code correctness via automated tests, but they stress different skills:

  • LiveCodeBench favors algorithmic problem-solving: dynamic programming, graph algorithms, combinatorics. Problems are self-contained and typically solved in under 100 lines of code. The skill tested is closest to a competitive programmer's ability to identify the right algorithm quickly.
  • SWE-bench favors software engineering in context: navigating large existing codebases, understanding test infrastructure, and making targeted edits. The skill tested is closer to what a software engineer does day-to-day. For a detailed comparison, see our SWE-bench explainer.

A model that excels at one does not necessarily excel at the other. Some reasoning-heavy models score very well on LiveCodeBench's algorithm puzzles but drop on SWE-bench because they struggle with long-context codebase navigation. Checking both gives a more complete picture of coding capability, which you can do on the live benchmark comparison.

Current model performance and what to watch

As of mid-2026, top frontier models score between 55% and 75% on recent LiveCodeBench windows, substantially lower than their near-perfect HumanEval scores. The gap suggests the time-windowed design keeps the benchmark from saturating, although part of it reflects harder problems rather than reduced contamination alone. Models like Claude Opus 4.8 and Gemini 3.1 Pro compete closely on this benchmark, making it one of the more informative current coding evals. The best LLM for coding roundup uses LiveCodeBench as one of its primary signals.

One limitation worth noting: competitive programming problems skew toward mathematical algorithm design and away from system-level or web-development coding. A high LiveCodeBench score does not imply proficiency with frameworks, APIs, or large-scale software architecture. For the broader context of what coding evals do and do not cover, the complete guide to LLM benchmarks is the right starting point.

Key takeaways

  • LiveCodeBench draws problems exclusively from competitive programming contests and evaluates a model only on those published after its training cutoff, structurally preventing most training-data contamination.
  • The "live" aspect means the problem pool refreshes continuously; always specify the date window when comparing scores across reports.
  • It tests algorithmic problem-solving rather than codebase navigation, so it complements SWE-bench rather than replacing it.
  • Frontier models score 55–75% on recent windows, well below their HumanEval ceilings, which suggests the benchmark still discriminates.
  • High LiveCodeBench scores signal algorithmic reasoning but do not predict performance on framework-heavy or systems-level coding tasks.
