Second, many of the benchmark tasks and their solutions have reportedly ended up in the training data of leading AI models. OpenAI says models such as GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview were able to reproduce near-identical fixes from memory. As a result, progress on SWE-bench Verified may reflect how much a model has already seen rather than how well it can actually program. OpenAI now recommends SWE-bench Pro instead and says it is developing its own non-public evaluations.

There may also be strategic incentives behind OpenAI’s criticism. A “contaminated” benchmark can make rivals, especially open-source models, appear stronger than they are and distort leaderboard rankings. SWE-bench Verified long served as a key coding benchmark, with OpenAI, Anthropic, and Google competing over marginal gains. More broadly, the episode highlights that while AI benchmarks remain useful, their explanatory power is inherently limited.