Key Takeaways
- Voxtral Mini Transcribe V2 from Mistral AI is now available in production on Raisetalk
- Native diarization: automatic speaker identification, with no additional step required
- Benchmark-leading performance: approximately 4% WER on FLEURS, outperforming GPT-4o mini Transcribe, Gemini 2.5 Flash, and Deepgram Nova
- 13 natively supported languages, 100% French technology (Mistral AI, Paris)
- Ultra-competitive pricing: simply the lowest price available on Raisetalk
New: Voxtral Transcribe 2 Available on Raisetalk
In January, we announced the integration of Mistral AI as the LLM engine for conversational analysis. Today, we are taking a new step forward: Mistral AI enters the Speech-to-Text race with Voxtral Transcribe 2, and we have deployed it in production on Raisetalk.
Voxtral Mini Transcribe V2 is a batch transcription model designed for demanding professional environments. It delivers a highly anticipated feature: native diarization — the ability to automatically identify who is speaking and when, built directly into the transcription process.
Voxtral Mini Transcribe V2: Key Specifications
| Feature | Details |
|---|---|
| Type | Batch transcription |
| Diarization | ✅ Native, built-in |
| Maximum duration | Up to 30 minutes per recording |
| Languages | 13 languages: FR, EN, ES, DE, IT, NL, PT, ZH, JA, KO, HI, AR, RU |
| Timestamps | Per segment and word-level |
| Context biasing | ✅ Steers toward domain-specific vocabulary |
| WER | ~4% on FLEURS |
| Translation | ❌ Not available |
| Pseudonymization | ❌ Not available |
Quick Glossary
- Diarization: the ability to identify and distinguish different speakers in a conversation ("who speaks when")
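- WER (Word Error Rate): the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The lower, the better. A 4% WER means roughly 4 errors per 100 reference words, or about 96 out of 100 words transcribed correctly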
- Context biasing: the ability to steer the model toward domain-specific vocabulary (product names, technical terms) to improve accuracy
- Per-segment timestamps: timestamping of each transcription segment, enabling precise synchronization with the audio
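To make the WER definition above concrete, here is a minimal sketch that computes it as the word-level edit distance divided by the number of reference words. The function name and the lowercase, whitespace-based tokenization are assumptions made for illustration; benchmark tooling typically applies its own text normalization, so published figures may not match what this sketch returns.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the customer asked for a refund"
    hypothesis = "the customer asked for refund"
    # One deleted word over six reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Because insertions count as errors, WER can exceed 100% on very poor transcripts, which is why "96 out of 100 words correct" is an approximation rather than an exact reading of a 4% score.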
Diarization: A Decisive Advantage
In our STT model comparison, we highlighted the importance of diarization for conversational analysis. For Quality Monitoring, you need to know whether the agent or the customer said a given sentence. For Sales Compliance, you need to attribute each commitment to the right speaker. For Voice of Customer, you need to isolate the customer's verbatims.
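As an illustration of why speaker labels matter downstream, the sketch below groups a diarized transcript by speaker to isolate customer verbatims and measure talk time. The `Segment` structure, the speaker labels, and the sample sentences are hypothetical; they do not represent the output format of Voxtral or Raisetalk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarized segment (hypothetical structure, not a real API schema)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def verbatims_by_speaker(segments: list[Segment]) -> dict[str, list[str]]:
    """Collect what each speaker said, in chronological order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for seg in sorted(segments, key=lambda s: s.start):
        grouped[seg.speaker].append(seg.text)
    return dict(grouped)


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total speaking time per speaker, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


if __name__ == "__main__":
    # Invented sample data for illustration only.
    sample = [
        Segment("customer", 0.0, 4.5, "I was charged twice this month."),
        Segment("agent", 4.5, 9.0, "I will refund the second payment today."),
        Segment("customer", 9.0, 12.5, "Thank you, that works."),
    ]
    print(verbatims_by_speaker(sample)["customer"])
    print(talk_time(sample))
```

Without speaker labels, the agent's refund commitment and the customer's complaint would sit in one undifferentiated block of text, and neither attribution nor talk-time analysis would be possible.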
Until now, only Parakeet and Gemini 2.5 Pro achieved 5/5 in diarization in our comparison — but Parakeet offers neither translation nor pseudonymization, and Gemini 2.5 Pro is the slowest and most expensive model.
Voxtral Mini V2 changes this: it combines top-tier diarization (5/5) with the lowest price available on Raisetalk. That makes it a compelling option for organizations that process large volumes of conversations and need reliable speaker identification.
Performance: The Numbers
Voxtral Transcribe 2 delivers impressive results in independent benchmarks:
| Criterion | Voxtral Mini V2 | Positioning |
|---|---|---|
| WER (FLEURS) | ~4% | Outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, Deepgram Nova |
| Max duration | 30 minutes | Average |
| Languages | 13 | Broad coverage including Asian languages |
Sovereignty and "Made in France"
The integration of Voxtral is a continuation of our technological independence strategy. Mistral AI is a French company, founded in Paris in 2023 by former researchers from Meta and Google DeepMind.
For French and European organizations, choosing Voxtral for audio transcription addresses the same concerns as for the LLM:
- Digital sovereignty: your audio data is processed by European technology
- Regulatory compliance: a provider subject to the GDPR and the AI Act
- Local ecosystem: supporting the development of European technology champions
Updated Comparison: 8 STT Models on Raisetalk
In January, we compared 7 STT models. Voxtral now joins the lineup as the eighth. Here is the updated table:
| # | Model | Region | Durations | Transl. | Pseudo. | Cost | Speed | Quality | Diarization | Drift | Bonus | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Voxtral | EU | < 30 min | ❌ | ❌ | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | +8 | 33 |
| 2 | Light | EU & US | All | ✅ | ✅ | 4/5 | 5/5 | 4/5 | 3/5 | 5/5 | +20 | 41 |
| 3 | Best | EU & US | All | ✅ | ✅ | 2/5 | 5/5 | 5/5 | 3/5 | 5/5 | +20 | 40 |
| 4 | Gemini 2 Flash | EU & US | < 10 min | ✅ | ✅ | 5/5 | 5/5 | 4/5 | 4/5 | 2/5 | +16 | 36 |
| 5 | Gemini 2.5 Flash | EU & US | < 25 min | ✅ | ✅ | 4/5 | 4/5 | 4/5 | 4/5 | 3/5 | +17 | 36 |
| 6 | Gemini 2.5 Pro | EU & US | < 45 min | ✅ | ✅ | 1/5 | 1/5 | 5/5 | 5/5 | 4/5 | +19 | 35 |
| 7 | Parakeet | EU | All | ❌ | ❌ | 3/5 | 5/5 | 4/5 | 5/5 | 5/5 | +8 | 30 |
| 8 | Whisper | EU | All | ❌ | ❌ | 3/5 | 5/5 | 3/5 | 5/5 | 5/5 | +8 | 29 |
Legend: all scores are out of 5, higher is better. The feature bonus rewards multi-region availability, long duration support, translation, and pseudonymization.
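The Total column appears to be the sum of the five scores (cost, speed, quality, diarization, drift) plus the feature bonus. That formula is inferred from the numbers in the table rather than stated explicitly, so treat the sketch below as a consistency check under that assumption. The class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    name: str
    cost: int
    speed: int
    quality: int
    diarization: int
    drift: int
    bonus: int

    @property
    def total(self) -> int:
        # Assumed formula: five scores out of 5, plus the feature bonus.
        return (self.cost + self.speed + self.quality
                + self.diarization + self.drift + self.bonus)


# Values copied from the comparison table, in the same row order.
SCORES = [
    ModelScore("Voxtral", 5, 5, 5, 5, 5, 8),
    ModelScore("Light", 4, 5, 4, 3, 5, 20),
    ModelScore("Best", 2, 5, 5, 3, 5, 20),
    ModelScore("Gemini 2 Flash", 5, 5, 4, 4, 2, 16),
    ModelScore("Gemini 2.5 Flash", 4, 4, 4, 4, 3, 17),
    ModelScore("Gemini 2.5 Pro", 1, 1, 5, 5, 4, 19),
    ModelScore("Parakeet", 3, 5, 4, 5, 5, 8),
    ModelScore("Whisper", 3, 5, 3, 5, 5, 8),
]

if __name__ == "__main__":
    for model in SCORES:
        print(f"{model.name}: {model.total}")
```

Under this reading, the bonus carries a lot of weight: a model without translation or pseudonymization can score 5/5 on every criterion and still total less than a model with a +20 bonus.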
Our recommendation: Voxtral is the ideal choice if you want the best combination of quality, diarization, and cost, and you do not need pseudonymization or translation. For use cases where pseudonymization is mandatory, stick with Light or Best.
When Should You Choose Voxtral?
Voxtral is particularly well-suited if:
- Diarization is important for your analyses (Quality Monitoring, Sales Compliance)
- You process large volumes and cost is a deciding factor
- You value technological sovereignty (French-made solution)
- You do not need pseudonymization or translation built into the STT
- Your recordings are moderately long (up to 30 minutes)
Voxtral is not recommended if:
- STT-level pseudonymization is mandatory in your context: use Light or Best instead
- You need built-in translation: use Light, Best, or the Gemini models instead
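The criteria above can be sketched as a small checklist function. It is a hypothetical helper, not part of Raisetalk: the function name, parameters, and messages are assumptions, and the 30-minute threshold comes from the specification table.

```python
VOXTRAL_MAX_MINUTES = 30  # "Up to 30 minutes per recording" in the specifications


def voxtral_blockers(
    needs_pseudonymization: bool,
    needs_translation: bool,
    longest_recording_min: float,
) -> list[str]:
    """Return the reasons Voxtral does not fit; an empty list means it fits."""
    blockers: list[str] = []
    if needs_pseudonymization:
        blockers.append("pseudonymization required: use Light or Best")
    if needs_translation:
        blockers.append(
            "translation required: use Light, Best, or a Gemini model"
        )
    if longest_recording_min > VOXTRAL_MAX_MINUTES:
        blockers.append(
            f"recordings exceed {VOXTRAL_MAX_MINUTES} minutes"
        )
    return blockers


if __name__ == "__main__":
    reasons = voxtral_blockers(
        needs_pseudonymization=False,
        needs_translation=False,
        longest_recording_min=18,
    )
    print("Voxtral fits" if not reasons else reasons)
```

Returning the list of reasons, rather than a single yes or no, keeps the alternative models visible when more than one constraint applies.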
How to Enable Voxtral on Raisetalk
On Raisetalk, you select the STT model each time you submit an analysis. You can:
- Test Voxtral on a sample of conversations and compare the results with your current models
- Mix and match approaches based on your use cases
Our team can also help you identify the optimal configuration based on your volumes, budget, and quality requirements.
Try It Yourself
The best way to judge is to test it firsthand.
Our trial space lets you transcribe your own conversations with the model of your choice: https://app.raisetalk.com/try

