Voxtral, FluidAudio, and Parakeet: A Deep Technical Map of the Modern Local Speech Stack

The speech stack has split into three very different shapes. One shape is a model family: Voxtral. It is Mistral’s audio line, with text-to-speech, speech-to-text, realtime transcription, and API-centered voice workflows. Another shape is a native Apple SDK: FluidAudio. It is not one model. It is a Swift/CoreML pipeline for local transcription, voice activity detection, diarization, and TTS on macOS and iOS. The third shape is a recognition engine: Parakeet. It is NVIDIA’s ASR family, built on FastConformer encoders with decoders such as TDT (Token-and-Duration Transducer), and optimized for fast, accurate speech-to-text. ...

May 29, 2026 · 25 min · 5175 words · Pavel Nasovich