Researchers from Tsinghua University and Microsoft have developed a method for training AI models for advanced programming tasks using exclusively synthetic data. Their 7-billion-parameter model, X-Coder, outperforms competitors twice its size on the LiveCodeBench benchmark.
The experiments demonstrate a clear relationship between dataset size and benchmark performance. With 32,000 synthetic programming tasks, the model achieves a pass rate of 43.7%. At 64,000 tasks, performance rises to 51.3%, at 128,000 tasks to 57.2%, and at 192,000 tasks it reaches 62.7%.
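As a rough sanity check, not something reported in the paper, those four data points follow an approximately log-linear trend: each doubling of the dataset adds about seven points of pass rate. A quick fit in Python illustrates this:

```python
# Back-of-envelope fit of the pass rates reported in the article.
# The log-linear form is an assumption for illustration, not the paper's claim.
import numpy as np

tasks = np.array([32_000, 64_000, 128_000, 192_000])   # synthetic tasks
pass_rate = np.array([43.7, 51.3, 57.2, 62.7])          # reported pass rates (%)

# Fit pass_rate ~ slope * log2(tasks) + intercept.
slope, intercept = np.polyfit(np.log2(tasks), pass_rate, deg=1)
print(f"~{slope:.1f} points per doubling of dataset size")
# Prints roughly "~7.1 points per doubling", consistent with the observed gains.
```

With only four data points, this is an illustration of the trend rather than a scaling law.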
At equal computational budgets, task diversity proves more important than the number of solutions per task. A dataset with 64,000 distinct tasks and one solution each performs better than datasets containing 16,000 tasks with four solutions each or 8,000 tasks with eight solutions each.
Tasks built from modular components
Efforts to build high-performance code models are often held back by limited training data. Existing collections of competitive programming problems have been heavily reused and are no longer enough to drive further gains. Previous synthetic approaches typically rewrite existing problems, which limits diversity.
The new pipeline, called SynthSmith, generates tasks, solutions, and test cases entirely from scratch. The process begins by extracting algorithmic features — including algorithms, data structures, and optimization techniques — from 10,000 existing code samples. Through an evolutionary process, the system expands this feature pool from 27,400 to nearly 177,000 algorithmic components, which are then recombined into new programming tasks of varying styles.
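The paper's exact prompts and evolution operators are not reproduced here, but the overall flow can be sketched roughly as follows. Everything in this snippet, including function names, prompts, and the LLM wrapper, is a hypothetical placeholder rather than the authors' implementation:

```python
# Rough sketch of a feature-extract / evolve / recombine loop in the spirit of
# SynthSmith. All prompts and names are hypothetical; "llm" stands in for any
# text-in, text-out model call.
import random
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str], str]

@dataclass
class SyntheticTask:
    prompt: str
    features: tuple[str, ...]

def extract_features(seed_code: list[str], llm: LLM) -> set[str]:
    """Name the algorithms, data structures, and optimizations in seed programs."""
    features: set[str] = set()
    for code in seed_code:
        reply = llm("List the algorithms and data structures used, comma-separated:\n" + code)
        features.update(f.strip().lower() for f in reply.split(",") if f.strip())
    return features

def evolve_features(features: set[str], llm: LLM, rounds: int = 5) -> set[str]:
    """Grow the feature pool by asking for harder or related variants."""
    pool = set(features)
    for _ in range(rounds):
        seed = random.sample(sorted(pool), k=min(3, len(pool)))
        reply = llm("Propose related or harder algorithmic techniques to: " + ", ".join(seed))
        pool.update(f.strip().lower() for f in reply.split(",") if f.strip())
    return pool

def synthesize_tasks(pool: set[str], llm: LLM, n_tasks: int) -> list[SyntheticTask]:
    """Recombine random feature subsets into brand-new problem statements."""
    tasks = []
    for _ in range(n_tasks):
        combo = tuple(random.sample(sorted(pool), k=min(3, len(pool))))
        statement = llm("Write a competitive programming problem that combines: " + ", ".join(combo))
        tasks.append(SyntheticTask(prompt=statement, features=combo))
    return tasks
```

The design choice this illustrates is that novelty comes from recombining low-level algorithmic features rather than rewriting whole existing problems.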
Quality control occurs in two stages. First, majority voting across multiple candidate solutions determines the correct outputs. Then, the best solution is validated on a held-out test set to prevent overfitting.
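A minimal sketch of that two-stage filter might look like the following; the selection rule and the all-or-nothing threshold on the held-out split are assumptions, not details taken from the paper:

```python
# Minimal sketch of the two-stage check described above (details assumed).
# "Solution" is any runnable candidate program wrapped as input -> output.
from collections import Counter
from typing import Callable, Optional

Solution = Callable[[str], str]

def vote_outputs(candidates: list[Solution], inputs: list[str]) -> list[str]:
    """Stage 1: the most common output per input becomes the expected answer."""
    return [Counter(c(x) for c in candidates).most_common(1)[0][0] for x in inputs]

def pass_rate(sol: Solution, inputs: list[str], expected: list[str]) -> float:
    return sum(sol(x) == e for x, e in zip(inputs, expected)) / len(inputs)

def select_and_validate(candidates: list[Solution],
                        train_inputs: list[str],
                        held_out_inputs: list[str]) -> Optional[Solution]:
    """Stage 2: pick the best candidate against the voted outputs, then require
    it to also agree with the vote on a held-out split, otherwise drop the task."""
    train_expected = vote_outputs(candidates, train_inputs)
    best = max(candidates, key=lambda c: pass_rate(c, train_inputs, train_expected))
    held_out_expected = vote_outputs(candidates, held_out_inputs)
    return best if pass_rate(best, held_out_inputs, held_out_expected) == 1.0 else None
```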
7B model beats 14B competitors
The X-Coder 7B model achieves an average pass rate of 62.9% on LiveCodeBench v5 and 55.8% on v6, outperforming larger models such as DeepCoder-14B-Preview and AReal-boba²-14B, both of which rely on stronger base models.
Compared with the largest publicly available dataset for code reasoning, SynthSmith delivers a 6.7-point improvement, attributed to more complex tasks that require longer reasoning chains. The average reasoning length reaches 17,700 tokens, compared with 8,000 tokens in the reference dataset.
An additional reinforcement-learning phase boosts performance by 4.6 percentage points. Training remains effective even when roughly five percent of the synthetic test cases contain errors. According to the paper, supervised fine-tuning required 128 H20 GPUs for 220 hours, and reinforcement learning ran on 32 H200 GPUs for seven days.
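A toy simulation, unrelated to the authors' actual reward design, helps show why a roughly five percent test-error rate still leaves a usable reward signal: a correct program scores around 0.95 while a broken one stays near zero, so the ranking that the reinforcement-learning update depends on is preserved.

```python
# Toy simulation (not from the paper): reward as the fraction of tests passed
# when each expected output is wrong with probability label_error.
import random

def noisy_reward(program_is_correct: bool, n_tests: int = 100,
                 label_error: float = 0.05) -> float:
    passed = 0
    for _ in range(n_tests):
        label_is_wrong = random.random() < label_error
        if program_is_correct:
            passed += not label_is_wrong          # fails only mislabeled tests
        else:
            passed += random.random() < 0.02      # rare accidental matches
    return passed / n_tests

random.seed(0)
print(noisy_reward(True), noisy_reward(False))    # roughly 0.95 vs 0.02
```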
Reduced benchmark contamination
A key advantage of the synthetic approach shows up when comparing benchmark versions. The reference model Qwen3-8B dropped from 88.1 to 57.5 between the older and newer LiveCodeBench versions, a fall of 30.6 points. X-Coder declined from 78.2 to 62.9, a smaller drop of 15.3 points, suggesting reduced memorization of benchmark tasks.
Because X-Coder was trained exclusively on synthetic data, it could not have memorized earlier benchmarks. The researchers plan to release the model weights, and the data processing code is already available on GitHub.
Interest in synthetic training data continues to grow across the AI industry. Last year, startup Datology AI introduced BeyondWeb, a framework that rewrites web documents to generate denser training data, while Nvidia increasingly relies on synthetic data in robotics to offset the scarcity of real-world datasets — effectively turning a data problem into a compute problem.
Conclusion
The results show that synthetic data can rival and even outperform traditional training approaches for advanced coding models. This opens the door to faster, cheaper, and more scalable AI development without dependence on massive real-world datasets. AI Wire Media will continue to track how synthetic training reshapes the future of AI research and deployment.