Daniel Mercer

Researchers from Tsinghua University and Microsoft have developed a method for training AI models for advanced programming tasks using exclusively synthetic data. Their 7-billion-parameter model, X-Coder, outperforms competitors twice its size on the LiveCodeBench benchmark.

The experiments demonstrate a clear relationship between dataset size and benchmark performance. With 32,000 synthetic programming tasks, the model achieves a pass rate of 43.7%. At 64,000 tasks, performance rises to 51.3%, at 128,000 tasks to 57.2%, and at 192,000 tasks it reaches 62.7%.
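To get a feel for the trend, here is a minimal sketch (ours, not the paper's) that fits a log-linear curve to the four reported data points. The extrapolation to 256,000 tasks is purely illustrative; the paper reports no result at that scale.

```python
import numpy as np

# Reported data points: synthetic task count vs. LiveCodeBench pass rate (%).
tasks = np.array([32_000, 64_000, 128_000, 192_000])
pass_rate = np.array([43.7, 51.3, 57.2, 62.7])

# Least-squares fit of: pass_rate ~= a * log2(tasks) + b
a, b = np.polyfit(np.log2(tasks), pass_rate, deg=1)
print(f"fit: pass_rate ~= {a:.1f} * log2(tasks) {b:+.1f}")

# Hypothetical extrapolation -- illustrative only, not a reported result.
print(f"extrapolated pass rate at 256k tasks: {a * np.log2(256_000) + b:.1f}%")
```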

Model performance increases steadily with the number of synthetic tasks. | Image: Wu et al.

At equal computational budgets, task diversity proves more important than the number of solutions per task. A dataset with 64,000 distinct tasks and one solution each performs better than datasets containing 16,000 tasks with four solutions each or 8,000 tasks with eight solutions each.
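The arithmetic behind "equal computational budgets" is simple: each configuration yields the same number of solution traces. A quick illustration, assuming the budget is measured in total traces, as the task-times-solutions counts suggest:

```python
# Three compute-matched configurations: each yields 64,000 solution traces,
# trading task diversity against solutions per task.
configs = [
    (64_000, 1),  # most diverse: one solution per task -- performs best
    (16_000, 4),
    (8_000, 8),
]

for n_tasks, sols_per_task in configs:
    print(f"{n_tasks:>6} tasks x {sols_per_task} solutions = {n_tasks * sols_per_task:,} traces")
```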

Tasks built from modular components

Developing high-performance code models often fails due to limited training data. Existing collections of competitive programming tasks are heavily reused and no longer sufficient to drive further improvements. Previous synthetic approaches typically rewrite existing problems, limiting diversity.

The system generates high-quality training data in four steps. After extracting and evolving programming features (1), tasks are created for which solutions (2) and test cases (3) are generated using LLMs. A two-stage validation ("dual verification") ensures the correctness of the synthetic data. | Image: Wu et al.

The new pipeline, called SynthSmith, generates tasks, solutions, and test cases entirely from scratch. The process begins by extracting algorithmic features — including algorithms, data structures, and optimization techniques — from 10,000 existing code samples. Through an evolutionary process, the system expands this feature pool from 27,400 to nearly 177,000 algorithmic components, which are then recombined into new programming tasks of varying styles.
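As a rough mental model of the recombination step, consider the following sketch. The feature pool, category names, and prompt wording are hypothetical stand-ins; the actual SynthSmith pipeline uses an LLM-driven evolutionary process over roughly 177,000 extracted features.

```python
import random

# Hypothetical stand-ins for the extracted feature pool described in the
# article (algorithms, data structures, optimization techniques).
FEATURE_POOL = {
    "algorithm": ["binary search", "dynamic programming", "Dijkstra"],
    "data_structure": ["segment tree", "hash map", "monotonic stack"],
    "optimization": ["memoization", "bitmasking", "two pointers"],
}

def sample_feature_combo(rng: random.Random) -> dict:
    """Recombine one feature from each category into a task spec."""
    return {cat: rng.choice(opts) for cat, opts in FEATURE_POOL.items()}

def task_prompt(combo: dict) -> str:
    """Render a task-generation prompt for an LLM (wording is illustrative)."""
    return (
        "Write a competitive programming problem that requires "
        f"{combo['algorithm']} over a {combo['data_structure']}, "
        f"where an efficient solution uses {combo['optimization']}."
    )

rng = random.Random(0)
for _ in range(3):
    print(task_prompt(sample_feature_combo(rng)))
```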

Quality control occurs in two stages. First, majority voting across multiple candidate solutions determines the correct outputs. Then, the best solution is validated on a held-out test set to prevent overfitting.
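A minimal sketch of the majority-voting stage might look like this. The candidate solutions, test inputs, and `run` helper are hypothetical; the real pipeline executes LLM-generated programs in a sandbox.

```python
from collections import Counter

def majority_outputs(candidates, test_inputs, run):
    """For each test input, take the most common output across candidate
    solutions as the presumed-correct answer (majority voting)."""
    labels = []
    for x in test_inputs:
        outputs = [run(solution, x) for solution in candidates]
        winner, votes = Counter(outputs).most_common(1)[0]
        labels.append((x, winner, votes / len(candidates)))
    return labels

# Toy demonstration: three "solutions" to squaring a number, one buggy.
candidates = [lambda n: n * n, lambda n: n ** 2, lambda n: n + n]  # last is wrong
run = lambda sol, x: sol(x)
for x, y, agreement in majority_outputs(candidates, [3, 4, 5], run):
    print(f"input={x} -> majority output={y} (agreement {agreement:.0%})")
```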

7B model beats 14B competitors

The X-Coder 7B model achieves an average pass rate of 62.9% on LiveCodeBench v5 and 55.8% on v6, outperforming larger models such as DeepCoder-14B-Preview and AReal-boba²-14B, both of which rely on stronger base models.

X-Coder relies exclusively on synthetic data for both supervised fine-tuning (SFT) and reinforcement learning (RL). On LiveCodeBench (v5 and v6), the 7B model significantly outperforms larger and more established competitors such as Mimo-7B and Qwen3-8B. | Image: Wu et al.

Compared with the largest publicly available dataset for code reasoning, SynthSmith delivers a 6.7-point improvement, attributed to more complex tasks that require longer reasoning chains. The average reasoning length reaches 17,700 tokens, compared with 8,000 tokens in the reference dataset.

An additional reinforcement-learning phase boosts performance by 4.6 percentage points. Training remains effective even with synthetic test cases that have an error rate of around 5 percent. According to the paper, supervised fine-tuning required 128 H20 GPUs for 220 hours, and reinforcement learning ran on 32 H200 GPUs for seven days.

Reduced benchmark contamination

A key advantage of the synthetic approach appears in comparisons across benchmark versions. The reference model Qwen3-8B dropped from 88.1 to 57.5 between the older and newer LiveCodeBench versions, a decline of 30.6 points. X-Coder declined from 78.2 to 62.9, a smaller drop of 15.3 points, suggesting reduced memorization of benchmark tasks.

Because X-Coder was trained exclusively on synthetic data, it could not have memorized earlier benchmarks. The researchers plan to release the model weights, and the data processing code is already available on GitHub.

Interest in synthetic training data continues to grow across the AI industry. Last year, startup Datology AI introduced BeyondWeb, a framework that rewrites web documents to generate denser training data, while Nvidia increasingly relies on synthetic data in robotics to offset the scarcity of real-world datasets — effectively turning a data problem into a compute problem.

Conclusion

The results show that synthetic data can rival and even outperform traditional training approaches for advanced coding models. This opens the door to faster, cheaper, and more scalable AI development without dependence on massive real-world datasets. AI Wire Media will continue to track how synthetic training reshapes the future of AI research and deployment.

AI Research Contributor
Daniel Mercer is an AI research contributor specializing in large language models, benchmarking, and multimodal systems. He writes about model capabilities, limitations, and real-world performance across leading AI assistants and platforms.
