Mistral AI aims to undercut competitors on price in speech recognition with Voxtral Transcribe 2. The second generation of its speech-to-text models starts at $0.003 per minute and, according to Mistral, delivers higher accuracy than models such as GPT-4o mini Transcribe, Gemini 2.5 Flash, and Deepgram Nova. The model family includes two variants: Voxtral Mini Transcribe V2, designed for processing large audio files, and Voxtral Realtime, built for real-time applications with latency under 200 milliseconds. Voxtral Realtime costs twice as much and uses a dedicated streaming architecture that transcribes audio as it arrives, targeting use cases such as voice assistants, live captions, and call center analytics.
Both new models support 13 languages, including German, English, and Chinese. New features include speaker diarization, word-level timestamps, and support for recordings of up to three hours. Voxtral Realtime is available as open weights under the Apache 2.0 license on Hugging Face as well as via API, while Voxtral Mini Transcribe V2 is accessible only through Le Chat, the Mistral API, and a playground. Mistral introduced the first generation of Voxtral in July 2025.
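To put the per-minute pricing in perspective, here is a rough back-of-the-envelope sketch of what a maximum-length, three-hour recording would cost. It assumes the $0.003 starting rate applies to Voxtral Mini Transcribe V2 and that "twice as much" puts Voxtral Realtime at $0.006 per minute; the model keys and the `estimate_cost` helper are hypothetical names used only for illustration, and actual billing may differ.

```python
from decimal import Decimal

# Per-minute prices in USD. The base rate is the starting price quoted in
# the article; the Realtime rate is inferred from "costs twice as much"
# (an assumption, not an official price list).
PRICE_PER_MINUTE = {
    "mini_transcribe_v2": Decimal("0.003"),
    "realtime": Decimal("0.006"),
}


def estimate_cost(minutes: int, model: str) -> Decimal:
    """Return the estimated transcription cost in USD for whole minutes of audio."""
    return PRICE_PER_MINUTE[model] * Decimal(minutes)


three_hours = 3 * 60  # maximum supported recording length, in minutes
print(estimate_cost(three_hours, "mini_transcribe_v2"))  # 0.540
print(estimate_cost(three_hours, "realtime"))  # 1.080
```

Under these assumptions, a full three-hour recording would cost roughly $0.54 with the batch model and about $1.08 with the real-time model.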
AI Research Contributor
Daniel Mercer is an AI research contributor specializing in large language models, benchmarking, and multimodal systems. He writes about model capabilities, limitations, and real-world performance across leading AI assistants and platforms.