Second, many of the benchmark tasks and their solutions have reportedly ended up in the training data of leading AI models. OpenAI says models such as GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview were able to reproduce near-identical fixes from memory. As a result, progress on SWE-bench Verified may reflect how much a model has already seen rather than how well it can actually program. OpenAI now recommends SWE-bench Pro instead and says it is developing its own non-public evaluations.

There may also be strategic incentives behind OpenAI’s criticism. A “contaminated” benchmark can make rivals, especially open-source models, appear stronger than they are and distort leaderboard rankings. SWE-bench Verified long served as a key coding benchmark, with OpenAI, Anthropic, and Google competing over marginal gains. More broadly, the episode highlights that while AI benchmarks remain useful, their explanatory power is inherently limited.