What Is SWE-bench? Agentic Coding Benchmarks Explained

SWE-bench has become the benchmark that separates genuine coding agents from demos: it asks a model to actually fix a real bug from a real open-source repository, and the only thing that counts is whether the test suite passes afterwards.

What SWE-bench measures

Each SWE-bench instance pairs a GitHub issue from a popular Python project with the tests from the pull request that fixed it. The model receives the issue description and a snapshot of the repository source code, but not those tests, and must produce a patch. Evaluation is binary: the patch counts only if the tests that failed because of the bug now pass and the tests that passed before still pass. There are no partial scores and no rubric for "close" answers.
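
The pass/fail rule above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the InstanceResult class, its field names, and the resolve_rate helper are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Test outcomes for one task after the candidate patch is applied.

    Hypothetical structure for illustration; not the official harness schema.
    """
    instance_id: str
    fail_to_pass: dict[str, bool]  # bug-reproducing tests -> passed after patch?
    pass_to_pass: dict[str, bool]  # previously passing tests -> still passing?


def is_resolved(result: InstanceResult) -> bool:
    """Binary verdict: every target test passes and nothing regresses."""
    # An instance with no bug-reproducing tests cannot count as resolved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[InstanceResult]) -> float:
    """Fraction of instances resolved, i.e. the headline percentage."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Note that a patch that fixes the bug but breaks one unrelated test scores exactly the same as a patch that does nothing, which is what "no partial scores" means in practice.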

This mirrors what a junior engineer actually does on their first ticket: understand a real codebase, reproduce a problem, and ship a fix. That authenticity is why the benchmark quickly became the de facto standard for agentic coding capability.

SWE-bench Verified vs SWE-bench Pro

The original SWE-bench, released in 2023, contained 2,294 tasks, but later analysis showed that some instances had under-specified issue descriptions or tests that rejected valid fixes. SWE-bench Verified is a curated subset of 500 problems that human software engineers confirmed to be solvable and well specified. Most leaderboard comparisons now report Verified scores because they are more reliable.

SWE-bench Pro raises the stakes further. It sources problems from newer repository commits that postdate most model training cutoffs, reducing the risk that a model simply recalls a patch it saw during pretraining. Pro instances tend to require multi-file edits and a deeper understanding of a codebase's conventions, so scores are substantially lower than on Verified. You can see current model scores on SWE-bench Verified side by side with Pro on our live benchmark comparison.

Multilingual SWE-bench

The original benchmark only covered Python. Multilingual SWE-bench extends the framework to repositories written in Java, JavaScript, TypeScript, Go, Rust, and C++. Because most frontier models are trained on far less non-Python code than Python code, multilingual scores reveal coverage gaps that the Python-only version hides. A model that scores 60% on Python tasks may score below 30% on equivalent Go tasks.
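
To see how an aggregate number can hide a language gap, here is a small sketch that breaks a resolve rate down by language. The function name and the sample records are invented for illustration; they are not real benchmark results.

```python
from collections import defaultdict


def resolve_rate_by_language(records):
    """Group (language, resolved) pairs and return a resolve rate per language."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for language, resolved in records:
        totals[language] += 1
        solved[language] += int(resolved)
    return {lang: solved[lang] / totals[lang] for lang in totals}


# Made-up outcomes: 3 of 5 Python tasks resolved, 1 of 4 Go tasks resolved.
sample = (
    [("python", True)] * 3 + [("python", False)] * 2
    + [("go", True)] * 1 + [("go", False)] * 3
)

per_language = resolve_rate_by_language(sample)
overall = sum(resolved for _, resolved in sample) / len(sample)
```

With this sample, the overall rate is 4 of 9 (about 44%), while the per-language view shows 60% for Python and 25% for Go. Reporting only the blended figure would mask that spread.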

Why inference setup matters so much

SWE-bench is an agentic eval: rather than completing a single prompt, the model runs a loop of tool calls such as reading files, running tests, editing code, and checking output. The scaffolding around that loop (which tools are available, how many steps are allowed, whether the model can browse documentation) has an enormous effect on the final score. A model run with a well-tuned agent harness can score 10-15 percentage points higher than the same model run with a basic scaffold. Always check whether a reported score came from a basic, publicly documented scaffold or from a proprietary agent framework. For a deeper look at how agentic setups affect results, see our agentic evals explainer.
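
The loop described above can be sketched as follows. This is a toy scaffold under stated assumptions, not any real agent framework: run_agent, the tool names, and the scripted policy standing in for a model are all hypothetical.

```python
def run_agent(policy, tools, issue, max_steps):
    """Run a tool-calling loop until the policy submits a patch or the budget ends.

    policy: callable taking the transcript (list of str) and returning
            a (tool_name, argument) pair. Hypothetical stand-in for a model.
    tools:  dict mapping tool names to callables that take and return a str.
    Returns (patch_or_None, steps_used).
    """
    transcript = [f"ISSUE: {issue}"]
    for step in range(1, max_steps + 1):
        name, arg = policy(transcript)
        if name == "submit":
            return arg, step  # arg is the final patch text
        if name not in tools:
            observation = f"error: unknown tool {name!r}"
        else:
            observation = tools[name](arg)
        transcript.append(f"{name}({arg}) -> {observation}")
    return None, max_steps  # step budget exhausted without a submission


# Toy tools and a scripted policy, used only to exercise the loop.
toy_tools = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda _: "1 failed, 12 passed",
}


def scripted_policy(transcript):
    script = [
        ("read_file", "calc.py"),
        ("run_tests", ""),
        ("submit", "patch: handle empty input in calc.py"),
    ]
    # The transcript starts with one entry (the issue), so index from there.
    return script[len(transcript) - 1]


tight = run_agent(scripted_policy, toy_tools, "crash on empty input", max_steps=2)
roomy = run_agent(scripted_policy, toy_tools, "crash on empty input", max_steps=5)
```

Here the same policy fails with a budget of two steps (tight is (None, 2)) and succeeds on the third step with a budget of five. The step limit is only one of the scaffold settings that can change an outcome without any change to the model.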

Limitations and what SWE-bench does not cover

SWE-bench evaluates patch correctness as judged by existing tests. It does not measure whether the model writes good tests, whether the fix is readable, or whether the approach would pass a code review. It also skews toward well-maintained open-source Python libraries; enterprise codebases with proprietary frameworks or weak test coverage would produce very different results. For general knowledge breadth, pair it with MMLU; for shell and environment tasks, see Terminal-Bench. For the broader context of how coding benchmarks fit into the evaluation landscape, read the complete guide to LLM benchmarks.

Key takeaways

SWE-bench tasks models with resolving real GitHub issues; success is measured by whether the test suite passes.
Verified (500 curated problems) is the standard leaderboard variant; Pro uses post-cutoff issues for a harder, contamination-resistant evaluation.
Multilingual SWE-bench exposes language coverage gaps that Python-only scores hide.
Scaffolding and tooling choices can shift scores by 10-15 percentage points — always check the evaluation setup behind a number.
No benchmark score replaces testing a model on your own codebase and your own test suite.