Benchmarks are meant to objectively measure how capable AI models are. However, according to a new analysis by Epoch AI, results depend heavily on how tests are conducted. The research organization identifies numerous variables that are rarely disclosed but can have a significant impact on outcomes.
The researchers divide the sources of distortion into two main categories: benchmark setup—how the test is run—and model access—how the evaluated model is queried. According to Epoch AI, both areas leave substantial degrees of freedom that can skew final results.
A diagram published by the researchers illustrates two stages of the benchmarking pipeline: Benchmark Setup (prompts, scaffolds, execution environment, scoring) and Model Access (API, aggregator, provider, deployment). Scaffolds and model providers are marked as “high impact,” highlighting their outsized influence.
Same Benchmark, Different Implementation
Even for well-known tests such as GPQA-Diamond, different libraries use different prompt formulations and temperature settings. Epoch AI compared four popular benchmark libraries and found systematic discrepancies: EleutherAI's lm-evaluation-harness uses a temperature of 0.0, OpenAI's simple-evals runs at 0.5, while OpenAI's gpt-oss defaults to 1.0. As a result, the same model produced scores ranging from 74% to 80%, depending solely on configuration.
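To see why such defaults matter, here is a minimal sketch of the same multiple-choice question posed at the three temperatures cited above. It assumes an OpenAI-compatible Python client; the model name and question are placeholders, not the configurations Epoch AI tested.

```python
# Minimal sketch: querying the same question under different sampling temperatures.
# Assumes an OpenAI-compatible API; model and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "Which quantity is conserved in an elastic collision?\n"
    "A) Kinetic energy only\nB) Momentum only\n"
    "C) Both kinetic energy and momentum\nD) Neither\n\n"
    "Answer with a single letter."
)

def ask(temperature: float) -> str:
    """Query the model once at a given temperature and return its raw answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model identifier
        messages=[{"role": "user", "content": QUESTION}],
        temperature=temperature,
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()

# The defaults cited above: 0.0 (lm-evaluation-harness), 0.5 (simple-evals), 1.0 (gpt-oss).
for t in (0.0, 0.5, 1.0):
    print(f"temperature={t}: {ask(t)}")
```

Run across a full question set, these sampling settings alone can shift aggregate accuracy by several percentage points, which is the gap Epoch AI observed.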
The effect is even more pronounced in complex agentic benchmarks such as SWE-bench Verified. Here, the scaffold—the software layer that orchestrates the AI agent and provides tools—plays a central role. According to Epoch AI, simply switching scaffolds can change results by up to 11 percentage points for GPT-5 and up to 15 points for Kimi K2 Thinking. Scaffold choice, the researchers conclude, has the “largest single impact on overall performance.”
A bar chart comparing SWE-bench Verified results for GPT-5 and Kimi K2 across three different scaffolds shows scores fluctuating between roughly 55% and 72%, depending on the scaffold used.
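What a scaffold actually decides is easiest to see in code. The sketch below is a generic agent loop, not a reconstruction of any real SWE-bench scaffold: the tool set, prompt format, and stop condition are all invented for illustration, yet each is exactly the kind of choice that moved scores by double digits in Epoch AI's comparison.

```python
# Illustrative sketch of what a "scaffold" does: it wraps the model in a loop,
# exposes tools, and decides when the task counts as finished. None of this
# mirrors a specific SWE-bench scaffold; names and conventions are invented.
import subprocess

def run_shell(command: str) -> str:
    """Hypothetical tool the scaffold exposes to the agent."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def scaffold_loop(model_call, task: str, max_steps: int = 10) -> str:
    """Drive the model step by step. Prompt format, tool behavior, and the
    stop condition are all scaffold choices, not model properties."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        action = model_call(transcript)      # scaffold decides what context the model sees
        if action.startswith("SUBMIT"):      # scaffold decides what counts as "done"
            return action
        observation = run_shell(action)      # scaffold decides which tools exist
        transcript += f"\n$ {action}\n{observation}\n"
    return "GAVE_UP"
```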
API Providers Distort Results the Most
The largest source of variation, however, comes from the API provider. Epoch AI evaluated several open-source models across different providers and consistently observed widely divergent results for the same underlying model.
A scatter plot shows GPQA-Diamond scores for GLM-4.6 across 15 API providers. Accuracy ranges from around 80% with providers such as Together and Fireworks to below 40% with Mancer and AtlasCloud.
The reasons for these discrepancies are varied: rate limits, empty or truncated responses, lower token limits than advertised, and incorrectly passed parameters. MiniMax reports differences of up to 23 percentage points on tau-bench between its own API implementation and standard interfaces.
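An evaluator can guard against some of these failure modes with basic checks before a response is scored. The following sketch assumes the OpenAI Python client pointed at a placeholder provider endpoint; the retry policy and model identifier are illustrative, not MiniMax's or Epoch AI's actual setup.

```python
# Sketch of sanity checks before scoring a provider's response: retry on rate
# limits, reject empty completions, and flag truncation instead of silently
# recording a wrong answer. Thresholds and endpoint are arbitrary placeholders.
import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://example-provider.invalid/v1")  # placeholder endpoint

def robust_query(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="glm-4.6",  # placeholder model identifier
                messages=[{"role": "user", "content": prompt}],
                max_tokens=2048,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # back off instead of scoring the failure as an error
            continue
        choice = response.choices[0]
        if not choice.message.content:
            continue                  # empty response: retry rather than record a zero
        if choice.finish_reason == "length":
            continue                  # truncated output: the real token limit may be lower than advertised
        return choice.message.content
    raise RuntimeError("Provider never returned a usable completion")
```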
Especially problematic, according to the researchers, is that newer models such as GLM-4.6 tend to be served less reliably than more established models like Qwen3. This complicates rapid evaluation immediately after a model’s release—precisely when interest is highest.
Test Environments Can Be Exploited
The execution environment itself introduces additional risks. OpenAI reported that during evaluations of its o3 and o4-mini models, only 477 out of 500 SWE-bench problems could be run due to “infrastructure challenges.” In some cases, Epoch AI found that test environments contained critical flaws that allowed agents to “hack” the evaluation. In other cases, bugs prevented agents from completing tasks at all.
Evaluations that grant agents web access are particularly vulnerable. In the worst case, an agent can locate the original dataset or web pages that republish parts of the benchmark problems.
A recent example involves the coding model IQuest-Coder. The 40-billion-parameter model outperformed much larger competitors on SWE-bench, which tests whether AI systems can fix real software bugs from GitHub repositories. As developer Xeophon later revealed on X, the test environment was apparently misconfigured and included the full Git history, including future commits.
The model exploited this flaw by simply reading the existing solutions out of the version history instead of solving the problems independently. Despite this, IQuest-Coder gained significant attention in the days following its release—an illustration of how impressive benchmark results can go viral before methodological weaknesses are uncovered.
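One way evaluators could catch this class of misconfiguration is to verify that a task's repository snapshot contains no commits newer than the issue it is based on. The sketch below is an illustrative check, not part of the official SWE-bench harness; the repository path and cutoff date are placeholders.

```python
# Illustrative sanity check on a task's repository snapshot: if any reachable
# commit postdates the issue, the fix may already be sitting in the history.
import subprocess
from datetime import datetime, timezone

def newest_commit_date(repo_path: str) -> datetime:
    """Return the author date of the most recent commit reachable from any ref."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "-1", "--format=%aI"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return datetime.fromisoformat(out)

def history_leaks_future_commits(repo_path: str, issue_created: datetime) -> bool:
    return newest_commit_date(repo_path) > issue_created

# Example usage with placeholder values:
# cutoff = datetime(2023, 6, 1, tzinfo=timezone.utc)
# print(history_leaks_future_commits("/tmp/task_repo", cutoff))
```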
A Long-Standing Problem
Issues with AI benchmarks are not new. Previous independent investigations showed that OpenAI’s o1 model produced widely varying programming test results depending on the framework used. A broader study of 445 benchmark papers also uncovered fundamental methodological flaws: nearly all examined benchmarks suffered from problems related to definitions, task selection, or statistical evaluation.
Epoch AI warns that many small variables accumulate across the entire evaluation stack. The result is benchmark scores that can differ substantially from the figures reported by model developers. For evaluators, this means time-consuming and costly experimentation to replicate known results—one of the main reasons why independent evaluations of open-source models remain so slow and resource-intensive.
The Epoch AI findings underscore a structural problem in how the AI industry measures progress. Benchmark scores are often treated as objective truth, yet they can vary dramatically based on hidden implementation choices rather than real model capability. As AI systems become more agentic and commercially consequential, credible evaluation will require far greater transparency, standardized testing pipelines, and independent verification—otherwise, benchmarks risk becoming marketing tools instead of reliable signals for researchers, investors, and policymakers.