AI startup Inception has launched what it calls the first diffusion-based reasoning model. The new model, Mercury 2, does not generate text sequentially, word by word, like conventional language models. Instead, it refines multiple text segments in parallel, an approach the company compares to an editor revising an entire draft at once rather than fixing individual words.
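To make the contrast concrete, here is a toy sketch of the general idea behind diffusion-style text generation: start from a fully masked sequence and refine every position in parallel over a few passes. The vocabulary, confidence function, and schedule below are illustrative stand-ins, not details of Mercury 2's actual architecture.

```python
# Toy sketch of parallel iterative refinement, the general idea behind
# diffusion language models. Illustration only, not Inception's method:
# the confidence function is a stand-in for real model logits.
import random

VOCAB = ["the", "editor", "revises", "whole", "draft", "at", "once"]
MASK = "<mask>"

def toy_confidence(token: str, position: int) -> float:
    """Stand-in for a model's confidence in `token` at `position`."""
    return random.Random(hash((token, position))).random()

def refine(length: int = 7, steps: int = 4) -> list[str]:
    # Start fully masked, then fill all positions in parallel over a few
    # refinement steps, committing only tokens the "model" is sure about.
    seq = [MASK] * length
    for step in range(1, steps + 1):
        threshold = 1.0 - step / steps  # accept lower confidence each step
        for i, tok in enumerate(seq):
            if tok != MASK:
                continue
            best = max(VOCAB, key=lambda t: toy_confidence(t, i))
            if toy_confidence(best, i) >= threshold:
                seq[i] = best
    return seq

print(" ".join(refine()))
```

Unlike autoregressive decoding, where each token waits on the one before it, every still-masked position here is updated within the same pass, which is where the claimed speed advantage comes from.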
According to Inception, this architecture makes Mercury 2 more than five times faster than traditional models. It reportedly reaches 1,009 tokens per second on Nvidia Blackwell GPUs, with an end-to-end latency of just 1.7 seconds. By comparison, Gemini 3 Flash takes 14.4 seconds, while Claude Haiku 4.5 with reasoning enabled takes 23.4 seconds. Inception claims output quality comparable to leading speed-optimized models.
Pricing is also positioned aggressively. Mercury 2 costs $0.25 per million input tokens and $0.75 per million output tokens, half the price of Gemini 3 Flash for input and a quarter of its price for output. Against Claude Haiku 4.5, Mercury 2's input tokens cost roughly a quarter as much, and its output tokens less than 40 percent as much.
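For a rough sense of what those rates mean in practice, a short calculation using only the Mercury 2 prices quoted above; the token volumes are made-up example values:

```python
# Back-of-the-envelope cost estimate using the published Mercury 2 prices
# (USD per million tokens). Workload figures are hypothetical examples.
MERCURY_INPUT = 0.25
MERCURY_OUTPUT = 0.75

input_tokens = 40_000_000   # hypothetical monthly input volume
output_tokens = 10_000_000  # hypothetical monthly output volume

cost = (input_tokens / 1e6) * MERCURY_INPUT + (output_tokens / 1e6) * MERCURY_OUTPUT
print(f"Estimated monthly cost: ${cost:.2f}")  # $17.50
```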
The model targets latency-sensitive enterprise use cases such as voice assistants, coding tools, and search systems. Mercury 2 supports a 128K context window, tool use, and structured JSON output, and is available via an OpenAI-compatible API. Companies can apply for early access or test the model directly in a chat interface.
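Since the API is OpenAI-compatible, a request could plausibly look like the sketch below, written with the official openai Python client. The base URL and model identifier are assumptions for illustration, not values confirmed by Inception's documentation.

```python
# Hypothetical usage sketch against an OpenAI-compatible endpoint.
# Base URL and model name below are assumed, not confirmed values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="mercury-2",  # assumed model identifier
    messages=[
        {"role": "user", "content": "Summarize this support ticket as JSON."}
    ],
    response_format={"type": "json_object"},  # structured JSON output
)
print(response.choices[0].message.content)
```

Because the interface mirrors OpenAI's, existing applications could in principle switch models by changing only the base URL and model name.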
Inception raised $50 million in funding last November from investors including Microsoft, Nvidia, and Snowflake. While Google DeepMind has also experimented with diffusion-based language models, interest in transformer alternatives remains early-stage. Whether diffusion-based text generation can challenge the dominance of transformers long term is still an open question.