
VibeVoice is a family of open-source voice AI models from Microsoft covering both text-to-speech (TTS) and automatic speech recognition (ASR). The repository has 42.7k stars, is MIT licensed, and is written in Python.
The core innovation is continuous speech tokenizers running at an ultra-low frame rate of 7.5 Hz, which preserves audio fidelity while keeping long-sequence processing efficient. The models use a next-token diffusion framework, in which an LLM handles textual context and a diffusion head handles acoustic generation.
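A quick back-of-the-envelope calculation shows why the low frame rate matters for long audio. The sketch below assumes one acoustic token per tokenizer frame and ignores text tokens, reads "64K" as 65,536, and uses a 50 Hz rate purely as an illustrative comparison point; none of these three assumptions comes from the VibeVoice documentation.

```python
FRAME_RATE_HZ = 7.5         # tokenizer frame rate quoted above
COMPARISON_RATE_HZ = 50.0   # assumed rate, for comparison only
CONTEXT_TOKENS = 64 * 1024  # "64K" read as 65,536 (assumption)


def acoustic_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    for label, minutes in [("ASR, 60 min", 60), ("TTS, 90 min", 90)]:
        frames = acoustic_frames(minutes)
        fits = "fits" if frames < CONTEXT_TOKENS else "does not fit"
        print(f"{label}: {frames:,} frames ({fits} in {CONTEXT_TOKENS:,} tokens)")

    # Same 60 minutes at the assumed comparison rate.
    print(f"60 min at 50 Hz: {acoustic_frames(60, COMPARISON_RATE_HZ):,} frames")
```

At 7.5 Hz, 60 minutes of audio comes to 27,000 frames and 90 minutes to 40,500, both below 65,536. At the assumed 50 Hz, 60 minutes alone would need 180,000 frames.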
Models
| Model | Size | What it does | Weights |
|---|---|---|---|
| VibeVoice-ASR | 7B | 60-min long-form speech-to-text in a single pass | HuggingFace |
| VibeVoice-TTS | 1.5B | Up to 90-min multi-speaker TTS (up to 4 speakers) | HuggingFace |
| VibeVoice-Realtime | 0.5B | Streaming real-time TTS, ~300ms first-audio latency | HuggingFace |
ASR highlights
- Processes up to 60 minutes of continuous audio in one pass (64K token context).
- Outputs structured transcription: who (speaker diarization), when (timestamps), what (content).
- Supports customized hotwords for domain-specific accuracy.
- Natively multilingual, 50+ languages.
- vLLM inference supported for faster throughput.
- Finetuning code available.
- Now integrated into HuggingFace Transformers.
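The who/when/what structure maps naturally onto a list of speaker-tagged, timestamped segments. The sketch below shows one way downstream code might hold and render such a transcript. The `Segment` class, its field names, and the helper functions are hypothetical and chosen for illustration; they are not VibeVoice's actual output schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Segment:
    """Hypothetical container for one transcribed span."""
    speaker: str  # who
    start: float  # when: start time in seconds
    end: float    # when: end time in seconds
    text: str     # what


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractions of a second."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: Iterable[Segment]) -> str:
    """Produce a readable transcript ordered by start time."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(
        f"[{format_timestamp(seg.start)} - {format_timestamp(seg.end)}] "
        f"{seg.speaker}: {seg.text}"
        for seg in ordered
    )


def talk_time(segments: Iterable[Segment]) -> Dict[str, float]:
    """Total speaking time in seconds per speaker."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


if __name__ == "__main__":
    demo = [
        Segment("Speaker 2", 4.5, 9.0, "Thanks for having me."),
        Segment("Speaker 1", 0.0, 4.5, "Welcome to the show."),
    ]
    print(render(demo))
    print(talk_time(demo))
```

Keeping diarization, timing, and text in one record makes tasks such as per-speaker talk time or subtitle export simple aggregations over the same list.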
TTS highlights
- Up to 90 minutes of speech in a single pass.
- Multi-speaker (4 distinct voices) with natural turn-taking.
- Cross-lingual support (EN, ZH, and others).
- Can even do spontaneous singing.
Realtime TTS
A lightweight 0.5B model intended for deployment. It accepts streaming text input and produces first audio in about 300 ms. It supports experimental multilingual voices in 9 languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) plus 11 English style voices.
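First-audio latency is the time between sending text and receiving the first audio chunk, not the time to synthesize the whole utterance. The sketch below shows how that figure might be measured against any chunk-yielding stream. `fake_streaming_tts` is a hypothetical stand-in that sleeps and emits silence; it is not the VibeVoice-Realtime API, and its timing says nothing about the real model.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text_chunks: Iterable[str], delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS model.

    Yields one placeholder audio chunk per text chunk after a fixed delay.
    """
    for chunk in text_chunks:
        time.sleep(delay_s)               # pretend to synthesize
        yield b"\x00" * 320 * len(chunk)  # placeholder silence


def time_to_first_audio(audio_stream: Iterator[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, that chunk)."""
    start = time.perf_counter()
    first_chunk = next(audio_stream)
    return time.perf_counter() - start, first_chunk


if __name__ == "__main__":
    stream = fake_streaming_tts(["Hello, ", "this is ", "a streaming test."])
    latency, first = time_to_first_audio(stream)
    print(f"first audio after {latency * 1000:.0f} ms ({len(first)} bytes)")
    remaining = sum(len(chunk) for chunk in stream)
    print(f"remaining audio: {remaining} bytes")
```

Because the generator is lazy, the timer covers only the work needed for the first chunk, which is the quantity a streaming latency figure describes.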
Install
pip install vibevoice
Requires Python >= 3.10. CUDA is recommended; MPS (Apple Silicon) support has been added for the ASR Gradio demo.
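Before loading a model, it can help to confirm the interpreter version and see which accelerator is available. The following is a minimal sketch, assuming PyTorch is the backend and that falling back to CPU is acceptable; the helper names and the CPU fallback are assumptions for illustration, not part of the VibeVoice package.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated above


def python_ok(version=None) -> bool:
    """True if the given (or running) interpreter meets the minimum version."""
    v = sys.version_info if version is None else version
    return tuple(v[:2]) >= MIN_PYTHON


def pick_device() -> str:
    """Prefer CUDA, then MPS (Apple Silicon), then CPU as an assumed fallback."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed; nothing else to probe
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    if not python_ok():
        sys.exit("Python >= 3.10 is required")
    print(f"Using device: {pick_device()}")
```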
Value
Relevant for anyone building voice-enabled products. The ASR model stands out for handling hour-long audio with speaker diarization in a single pass, which avoids splitting long recordings into chunks. The TTS code was briefly pulled due to misuse concerns, but the models remain on HuggingFace. The TTS paper was accepted as an ICLR 2026 Oral.

