
Why SWE-bench Verified no longer measures frontier coding capabilities


SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.

Original article: OpenAI
Full article excerpt:

February 23, 2026 · Research · Publication

Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks. After its release, SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases. Tracking and forecasting progress of these capabilities is also an important part of OpenAI's Preparedness Framework. When we first created the Verified benchmark, we attempted to fix issues in the original evaluation that made certain tasks in the SWE-bench dataset impossible to solve.

After initial leaps, state-of-the-art progress on SWE-bench Verified has slowed, improving from 74.9% to 80.9% over the last six months. This raises the question: do the remaining failures reflect model limitations or properties of the dataset itself?

In a new analysis, we found two major issues with the Verified set that indicate the benchmark is no longer suitable for measuring progress on autonomous software engineering capabilities for frontier launches at today's performance levels:

- Tests reject correct solutions: We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions, despite our best efforts to address this when we first created SWE-bench Verified.

- Training on solutions: Because large frontier models can learn information from their training data, it is important that they are never trained on the problems and solutions they are evaluated on. This is akin to sharing the problems and solutions for an upcoming test with students beforehand: they may not memorize the answers, but students who have seen the answers will certainly do better than those who have not. SWE-bench problems are sourced from open-source repositories that many model providers use for training. In our analysis, we found that for certain tasks, all frontier models we tested were able to reproduce the original, human-written bug fix used as the ground-truth reference (known as the gold patch), or verbatim specifics of the problem statement, indicating that all of them have seen at least some of the problems and solutions during training.

(Illustrative sketches of both failure modes appear after the excerpt.)

We also found evidence that models that have seen the problems during training are more likely to succeed, because they have the additional information needed to pass the underspecified tests.

This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time. This is why we have stopped reporting SWE-bench Verified scores, and we recommend that other model developers do the same.

We're building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area of focus for the wider research community. Until those are ready, OpenAI recommends reporting results on SWE-bench Pro.

Background

The original SWE-bench evaluation was released in 2023. Each problem is sourced from a resolved GitHub issue in one of 12 open-source…

This excerpt is published under fair use for community discussion. Read the full article at OpenAI.
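The two findings above are easy to make concrete. First, a minimal, hypothetical sketch of the kind of over-specified test the audit describes. This is not an actual SWE-bench task and all names are invented: the test pins the exact wording of an error message, so a functionally correct patch that raises the right error type with different wording is scored as a failure.

```python
import pytest

def parse_port(value: str) -> int:
    """A patched function: reject ports outside the valid TCP range."""
    port = int(value)
    if not (0 < port < 65536):
        # A functionally correct fix that words the message differently...
        raise ValueError(f"port {port} is out of range")
    return port

def test_rejects_invalid_port():
    # ...fails here, because the test asserts the gold patch's exact
    # message rather than the behavior (raising ValueError on bad input).
    with pytest.raises(ValueError, match=r"^port must be between 1 and 65535$"):
        parse_port("70000")
```

Second, a minimal sketch of a memorization probe in the spirit of the contamination finding: generate a patch from the problem statement alone, then measure how close the output is to the human-written gold patch. Here `query_model` is a hypothetical stand-in for whatever completion API you use, and the prompt is an assumption, not OpenAI's actual methodology; near-verbatim reproduction (a ratio close to 1.0) is the signature of training leakage the article describes.

```python
import difflib

def query_model(prompt: str) -> str:
    """Hypothetical stand-in: call a model and return its patch text."""
    raise NotImplementedError

def gold_patch_similarity(problem_statement: str, gold_patch: str) -> float:
    """Score how closely a model's patch reproduces the gold patch."""
    prompt = (
        "Write a unified diff that fixes the following issue.\n\n"
        + problem_statement
    )
    candidate = query_model(prompt)
    # SequenceMatcher ratio is in [0, 1]; 1.0 means byte-identical output.
    return difflib.SequenceMatcher(None, candidate, gold_patch).ratio()
```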
