
VibeVoice is a family of open-source voice AI models from Microsoft covering both text-to-speech (TTS) and automatic speech recognition (ASR). The repository has 42.7k stars, is MIT licensed, and is written in Python.
The core innovation is continuous speech tokenizers running at an ultra-low frame rate of 7.5 Hz, which preserves audio fidelity while keeping long-sequence processing efficient. The models use a next-token diffusion framework: an LLM handles the textual context and a diffusion head handles acoustic generation.
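To see why the 7.5 Hz frame rate matters for long audio, the arithmetic can be sketched as follows: 60 minutes of audio comes to 27,000 frames and 90 minutes to 40,500. This is a rough sketch, not the project's own accounting. It assumes each tokenizer frame occupies one position in the LLM context and that the "64K" context quoted for the ASR model means 65,536 tokens. The 50 Hz comparison rate is a hypothetical value chosen only for contrast.

```python
FRAME_RATE_HZ = 7.5          # tokenizer frame rate stated in this note
CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" read as 65,536 tokens
COMPARISON_RATE_HZ = 50.0    # hypothetical higher frame rate, for contrast only


def acoustic_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


for minutes in (60, 90):
    frames = acoustic_frames(minutes)
    share = frames / CONTEXT_TOKENS  # assumes one frame per context position
    print(f"{minutes} min at {FRAME_RATE_HZ} Hz: {frames:,} frames "
          f"({share:.0%} of the assumed context)")

# The same hour at the hypothetical comparison rate needs far more frames.
print(f"60 min at {COMPARISON_RATE_HZ} Hz: "
      f"{acoustic_frames(60, COMPARISON_RATE_HZ):,} frames")
```

Under these assumptions an hour of audio leaves a large part of the context free for text tokens, which would not be the case at the higher comparison rate.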
28 Apr 2026: implemented in Transcribee as a test. It transcribes at 1x real time (60 min of audio takes 60 min to process), so it is not ideal.
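The "1x" figure is a real-time factor (RTF): processing time divided by audio duration. A minimal sketch of that calculation is below. The function name is illustrative and is not part of VibeVoice or Transcribee.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# The test described above: 60 min of audio took 60 min to transcribe.
rtf = real_time_factor(processing_seconds=60 * 60, audio_seconds=60 * 60)
print(f"RTF = {rtf:.2f}")
```

An RTF of 1.0 means a backlog of recordings never shrinks faster than it was recorded, which is why this result is not ideal for batch transcription.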
Models
| Model | Size | What it does | Weights |
|---|---|---|---|
| VibeVoice-ASR | 7B | 60-min long-form speech-to-text in a single pass | HuggingFace |
| VibeVoice-TTS | 1.5B | Up to 90-min multi-speaker TTS (up to 4 speakers) | HuggingFace |
| VibeVoice-Realtime | 0.5B | Streaming real-time TTS, ~300ms first-audio latency | HuggingFace |
ASR highlights
- Processes up to 60 minutes of continuous audio in one pass (64K token context).
- Outputs structured transcription: who (speaker diarization), when (timestamps), what (content).
- Supports customized hotwords for domain-specific accuracy.
- Natively multilingual, 50+ languages.
- vLLM inference supported for faster throughput.
- Finetuning code available.
- Now integrated into HuggingFace Transformers.
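The who/when/what structure listed above maps naturally onto a small record type. The sketch below is a hypothetical container for downstream use, not VibeVoice's actual output schema. The `Segment` fields, the helper names, and the sample data are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript segment: who spoke, when, and what was said (hypothetical schema)."""
    speaker: str
    start_s: float
    end_s: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS, truncating fractions."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Produce a readable transcript, one segment per line."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)}-{format_timestamp(s.end_s)}] {s.speaker}: {s.text}"
        for s in segments
    )


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total speaking time in seconds per speaker."""
    totals: dict[str, float] = {}
    for s in segments:
        totals[s.speaker] = totals.get(s.speaker, 0.0) + (s.end_s - s.start_s)
    return totals


# Made-up sample data, for illustration only.
segments = [
    Segment("Speaker 1", 0.0, 4.0, "Welcome back to the show."),
    Segment("Speaker 2", 4.0, 9.5, "Thanks for having me."),
    Segment("Speaker 1", 9.5, 12.0, "Let's get started."),
]
print(render(segments))
print(talk_time(segments))
```

Keeping diarization, timestamps, and text in one record makes per-speaker summaries like `talk_time` a simple aggregation.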
TTS highlights
- Up to 90 minutes of speech in a single pass.
- Multi-speaker (4 distinct voices) with natural turn-taking.
- Cross-lingual support (EN, ZH, and others).
- Can even do spontaneous singing.
Realtime TTS
A lightweight 0.5B model aimed at deployment. It accepts streaming text input with ~300 ms first-audio latency, and supports experimental multilingual voices in 9 languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) plus 11 English style voices.
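First-audio latency can be checked on any streaming source by timing how long the first chunk takes to arrive. The sketch below is generic and does not call VibeVoice. `fake_stream` is a stand-in generator invented for the example, and a real measurement would replace it with whatever iterator the TTS client exposes.

```python
import time
from typing import Iterable, Iterator


def time_to_first_chunk(chunks: Iterable[bytes]) -> tuple[float, list[bytes]]:
    """Consume a stream and return (seconds until the first chunk, all chunks)."""
    start = time.perf_counter()
    first_latency = None
    collected: list[bytes] = []
    for chunk in chunks:
        if first_latency is None:
            first_latency = time.perf_counter() - start
        collected.append(chunk)
    if first_latency is None:
        raise ValueError("stream produced no chunks")
    return first_latency, collected


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Stand-in for a streaming TTS response (hypothetical, for illustration)."""
    for i in range(n_chunks):
        time.sleep(delay_s)      # simulated generation time per chunk
        yield bytes([i]) * 4     # dummy audio payload


latency, chunks = time_to_first_chunk(fake_stream())
print(f"first chunk after {latency * 1000:.0f} ms, {len(chunks)} chunks total")
```

Because the generator body only runs once iteration starts, the timer covers the full wait for the first chunk rather than just the loop overhead.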
Install
pip install vibevoice
Requires Python >= 3.10. CUDA is recommended; MPS/Apple Silicon support has been added for the ASR Gradio demo.
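Before installing, the requirements above can be checked with a short script. This is a sketch that assumes PyTorch is the backend in use. It falls back to "cpu" when `torch` is not installed, and the helper names are illustrative rather than part of the package.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this note


def python_ok(version=sys.version_info) -> bool:
    """True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


def pick_device() -> str:
    """Prefer CUDA, then Apple Silicon MPS, then CPU."""
    try:
        import torch  # assumption: PyTorch is the backend in use
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(f"Python {sys.version_info.major}.{sys.version_info.minor} ok: {python_ok()}")
print(f"Selected device: {pick_device()}")
```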
Value
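VibeVoice is relevant for anyone building voice-enabled products. The ASR model's ability to handle hour-long audio with speaker diarization in one shot is significant, because it avoids splitting the audio into chunks and reconciling speaker labels across them.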

