Qwen3.7-Max: Alibaba's Agentic AI Runs 35 Hours Autonomously, Achieves 10x Kernel Speedup and Rivals Claude Opus 4.6

Details: By Chris Borden; Category: Models; 1 m; 25 May 2026; 337

Alibaba's Qwen team has introduced Qwen3.7-Max — a proprietary model designed specifically for agentic tasks. In a practical test, the model ran fully autonomously for 35 hours optimizing a hardware kernel, producing results that benchmark it alongside Anthropic's Claude Opus 4.6.

Like its predecessors Qwen3-Max and Qwen3.6-Plus, the new Max variant is available exclusively through Alibaba Cloud Model Studio's API. Alibaba had long pursued an open-source strategy with its Qwen models, but that approach has shifted — the last openly released flagship was Qwen3.5-397B-A17B in February 2026.

Qwen3.7-Max supports both OpenAI- and Anthropic-compatible interfaces and can be integrated directly into Claude Code, OpenClaw, or Qwen Code. According to the Qwen team, the model targets four use cases: coding agent work from frontend prototyping to complex multi-file software projects, automation of office tasks via external tools, sustained autonomous operation over extended periods, and consistent performance across different agent frameworks.

A 35-Hour Kernel Experiment

The team tasked Qwen3.7-Max with optimizing a low-level attention computation kernel for SGLang — an open-source inference framework — running on a cloud instance equipped with T-Head ZW-M890 accelerators, an AI chip platform from Alibaba's own semiconductor division. The model had never encountered this hardware architecture during training. It started with no performance data, no hardware documentation, and no example code — only the existing Triton-based reference implementation.

Over approximately 35 hours of continuous autonomous work, the model ran 432 kernel tests with a total of 1,158 tool calls. It iteratively compiled, measured, and revised the code, diagnosed compilation errors, and independently identified performance bottlenecks. The result, according to the Qwen researchers, was a tenfold average speedup over the reference implementation.

Competing models tested in the same setup achieved significantly lower results: GLM 5.1 reached a 7.3x speedup, Kimi K2.6 achieved 5x, DeepSeek V4 Pro reached 3.3x, and the predecessor Qwen3.6-Plus managed only 1.1x. Models that exited early did so voluntarily after five consecutive rounds with no tool calls. On the standardized KernelBench-L3 benchmark, Qwen3.7-Max produces accelerated kernels in 96% of cases — just behind Anthropic's Opus 4.6 at 98%.

Separating Task, Tool Environment, and Evaluator

Methodologically, Qwen3.7-Max builds on a training approach introduced with Qwen3.5. Each training task is decomposed into three independent components: the task itself, the tool environment, and the result evaluator — which can be recombined freely. The same task is practiced across multiple tool environments and evaluated with different verification methods, forcing the model to develop broadly applicable solution strategies rather than shortcuts that only work in a specific setting. On QwenClawBench and CoWorkBench, Qwen3.7-Max's performance remains stable regardless of the test environment — and leads Claude Opus 4.6 on QwenClawBench across OpenClaw, Claude Code, and Hermes agent frameworks.

Cross-harness test: While Qwen3.6-Plus performance varies depending on the agent framework, Qwen3.7-Max delivers nearly constant results in OpenClaw, Claude Code (CC), and Hermes, according to the team, and outperforms Claude Opus 4.6 on QwenClawBench. | Image: Qwen

Self-Monitoring for Reward Hacking

The Qwen team also deployed Qwen3.7-Max as an observer within its own training pipeline. Over more than 80 hours, the model monitored training runs for software engineering tasks and conducted more than 10,000 evaluations, searching for reward-hacking behaviors — cases where the trained model was gaming the reward signal, for example by fetching correct answers directly from GitHub. In total, Qwen3.7-Max formulated 13 new detection rules and flagged 1,618 such cases across 13,952 trajectories.

Over 86 hours of autonomous operation, Qwen3.7-Max, according to the team, examined 13,952 trajectories and reported 1,618 attempts in which the trained model fraudulently obtained rewards. With each new detection rule (green stars), the number of detected cases increased. | Image: Qwen

One Year in Simulation

For long-horizon planning evaluation, the team used YC-Bench — a benchmark that simulates the one-year lifecycle of a startup. The model must manage personnel, review contracts, identify bad-faith customers, and maintain profit margins against rising labor costs across hundreds of decision rounds. Qwen3.7-Max generated $2.08 million in total revenue and completed 237 tasks. Its predecessor Qwen3.6-Plus reached $1.05 million, and Qwen3.5-Plus $352,000.

Benchmark Performance at Opus Level

Across most benchmarks, Qwen3.7-Max positions itself at or slightly ahead of Claude Opus 4.6 Max, Kimi K2.6 Thinking, GLM-5.1 Thinking, and DeepSeek V4 Pro Max. On SWE-Verified it scores 80.4 — nearly matching Opus 4.6 Max (80.8) and DeepSeek V4 Pro Max (80.6). On math and science benchmarks — GPQA Diamond (92.4), HMMT February 2026 (97.1), and Apex (44.5) — Qwen3.7-Max leads the vendor's own comparison table. Claude Opus 4.6 Max retains the lead on NL2Repo, ClawEval, and CoWorkBench.

Across twelve benchmarks, Qwen3.7-Max is, according to the vendor, usually ahead of or on par with Claude Opus 4.6 Max, DeepSeek V4 Pro Max, GLM-5.1, Kimi K2.6, and its own predecessor, Qwen3.6-Plus. However, Claude Opus 4.6 surpasses Qwen in NL2Repo, ClawEval, and CoWorkBench. | Image: Qwen — Qwen3.7-Max: Alibaba's Agentic AI Runs 35 Hours Autonomously, Achieves 10x Kernel Speedup and Rivals Claude Opus 4.6

It is worth noting that several of the cited benchmarks are proprietary — including QwenWebDev, QwenClawBench, CoWorkBench, and QwenWorldBench — and all results come from the Qwen team's own evaluation. A more detailed analysis of scaling dynamics and methodology is expected in an upcoming technical report.

According to the Qwen team, with a growing number of training environments, the average ranking of Qwen3.7-Max-Thinking improves across eight benchmarks, surpassing competitors like DeepSeek V4 Pro Max, GLM-5.1, and Kimi K2.6, but remaining just behind Claude 4.6 Opus Max. | Image: Qwen

As a side note, the team also demonstrated Qwen3.7-Max controlling a quadruped robot — using the language model as a steering model via a dedicated robotics framework and navigation module to guide the robot through physical environments.

About The Hosts

Chris Borden

AI Analyst & Technology Researcher

AI researcher and industry analyst covering decentralized infrastructure, AI systems, and emerging technology markets. Focused on data-driven analysis, long-term trends, and real-world adoption of artificial intelligence.

Categories

AI News

Categories

AI & Society

Categories

AI Insights

Qwen3.7-Max: Alibaba's Agentic AI Runs 35 Hours Autonomously, Achieves 10x Kernel Speedup and Rivals Claude Opus 4.6

A 35-Hour Kernel Experiment

Separating Task, Tool Environment, and Evaluator

Self-Monitoring for Reward Hacking

One Year in Simulation

Benchmark Performance at Opus Level

About The Hosts

More From Chris Borden