OpenAI launched EVMbench in mid-February in partnership with investment firm Paradigm to evaluate how well AI agents can find, fix, and exploit vulnerabilities in smart contracts.

OpenZeppelin welcomed the initiative but chose to review it against the same standards it applies to the protocols it helps secure, including Aave, Lido, and Uniswap.

Key shortcomings

The main issue concerns training data contamination. EVMbench is built on a set of 120 vulnerabilities identified during audits conducted in 2024 and 2025.

However, the leading models tested on the benchmark have knowledge cutoffs as late as August 2025. Any audit report published before a model's cutoff could be part of its training data, so the models could potentially “remember” those vulnerabilities rather than discover them. Disabling internet access does not remove that risk, which casts doubt on the validity of the experiment: it is unclear whether the AI can actually detect genuinely new threats.
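The contamination concern comes down to a date comparison, which can be sketched as follows. This is only an illustration: the `Finding` class, the `possibly_contaminated` helper, the sample identifiers, and the sample dates are all hypothetical and are not taken from EVMbench, and "August 2025" is assumed here to mean the end of that month.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Finding:
    """A benchmark entry with its public disclosure date (hypothetical schema)."""
    finding_id: str
    disclosed: date


def possibly_contaminated(findings, knowledge_cutoff):
    """Return findings disclosed on or before the model's knowledge cutoff.

    Such findings may have appeared in training data, so a correct answer
    on them does not prove the model can detect unseen vulnerabilities.
    """
    return [f for f in findings if f.disclosed <= knowledge_cutoff]


# Made-up sample data, for illustration only.
findings = [
    Finding("example-a", date(2024, 6, 1)),
    Finding("example-b", date(2025, 3, 15)),
    Finding("example-c", date(2025, 10, 1)),
]

# Assumption: a cutoff "up to August 2025" is modeled as 31 August 2025.
cutoff = date(2025, 8, 31)

flagged = possibly_contaminated(findings, cutoff)
for f in flagged:
    print(f"{f.finding_id}: disclosed {f.disclosed}, may be in training data")
```

A check like this only shows that contamination is possible. It cannot show that a given model actually memorized a specific report.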

OpenZeppelin also pointed to factual errors in the EVMbench dataset. At least four of the vulnerabilities classified as “high risk” turned out to be non-exploitable. Despite that, AI agents still received full credit for supposedly identifying them correctly.
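The effect of invalid entries on a score can be illustrated with a simple hit rate. This sketch assumes a plain "fraction of vulnerabilities detected" metric, which may differ from how EVMbench actually scores agents. The `detection_rate` function, the `V1` to `V120` identifiers, and the hypothetical agent's results are invented for illustration. Only the totals (120 entries, at least four of them non-exploitable) come from the article.

```python
def detection_rate(detected, ground_truth, invalid=frozenset()):
    """Fraction of valid ground-truth entries that the agent detected.

    Entries listed in `invalid` are removed from both the numerator and
    the denominator, so no credit is given for "finding" them.
    """
    valid = set(ground_truth) - set(invalid)
    if not valid:
        raise ValueError("no valid ground-truth entries to score against")
    hits = set(detected) & valid
    return len(hits) / len(valid)


# 120 dataset entries, labeled V1..V120 (hypothetical identifiers).
ground_truth = {f"V{i}" for i in range(1, 121)}

# Assumption: four entries are non-exploitable; which four is invented here.
invalid = {"V1", "V2", "V3", "V4"}

# Hypothetical agent that flags V1..V60, including the four invalid entries.
detected = {f"V{i}" for i in range(1, 61)}

raw = detection_rate(detected, ground_truth)                # 60 of 120
adjusted = detection_rate(detected, ground_truth, invalid)  # 56 of 116

print(f"raw: {raw:.3f}, adjusted: {adjusted:.3f}")
```

In this made-up case the score drops once the invalid entries are excluded, because the agent was previously credited for four attacks that do not work.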

“These are not subjective disagreements about severity; these are cases where the described attack simply does not work,” the experts said.

OpenZeppelin acknowledged that AI will play a major role in the future of blockchain security. At the same time, the firm warned that speed of adoption should not come at the expense of data quality and testing standards.

“The question is not whether AI will transform smart contract security — it will. The question is whether the benchmarks and datasets we use to build these tools will be held to the same standards as the contracts they are meant to protect,” OpenZeppelin concluded.