Second, many of the benchmark's tasks and their solutions have reportedly made their way into the training data of leading AI models. According to OpenAI, models such as GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview could reproduce near-identical fixes from memory. As a result, progress on SWE-bench Verified may reflect how much a model has already seen rather than how well it can actually program. OpenAI now recommends SWE-bench Pro instead and says it is developing its own non-public evaluation suites.
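To make the contamination concern concrete, here is a minimal, hypothetical sketch of the kind of check used to flag memorization: comparing a model's generated patch against the benchmark's reference ("gold") fix. The function name, sample strings, and threshold are illustrative assumptions, not drawn from OpenAI's actual methodology.

```python
import difflib

def memorization_score(generated_patch: str, gold_patch: str) -> float:
    """Return a 0-1 similarity ratio between a model-generated patch and
    the benchmark's reference patch, using difflib's Ratcliff-Obershelp
    matcher. A ratio near 1.0 on a blind generation suggests the fix may
    have been memorized from training data rather than derived."""
    return difflib.SequenceMatcher(None, generated_patch, gold_patch).ratio()

# Hypothetical usage: in practice the patches would come from a benchmark run.
generated = "def add(a, b):\n    return a + b\n"
gold = "def add(a, b):\n    return a + b\n"
if memorization_score(generated, gold) > 0.95:  # illustrative cutoff
    print("Near-identical to the gold patch: possible contamination")
```

In practice, such string-level checks are only a first filter; a correct fix can legitimately match the reference on trivial tasks, so suspicious cases need manual review.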
There may also be strategic incentives behind OpenAI's criticism: a contaminated benchmark can make rivals, especially open-source models, appear stronger and distort leaderboard rankings. SWE-bench Verified long served as a key coding benchmark, with OpenAI, Anthropic, and Google competing over marginal gains. More broadly, the episode underscores that AI benchmarks, while still useful, have inherently limited explanatory power.