VibeVoice

Microsoft's open-source frontier voice AI with TTS and ASR models.

VibeVoice is a family of open-source voice AI models from Microsoft covering both text-to-speech (TTS) and automatic speech recognition (ASR). The repository is written in Python, is MIT licensed, and has 42.7k stars.

The core innovation is the use of continuous speech tokenizers running at an ultra-low frame rate of 7.5 Hz, which preserves audio fidelity while keeping long-sequence processing efficient. The models use a next-token diffusion framework: an LLM handles the textual context, and a diffusion head generates the acoustic output.
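To make the 7.5 Hz figure concrete, the arithmetic below counts how many tokenizer frames an hour or more of audio turns into. It is a back-of-the-envelope sketch, not code from the repository: the helper name `frames_for` is made up for illustration, and the 50 Hz comparison rate is an assumed value for a higher-rate tokenizer, not a number from the project.

```python
FRAME_RATE_HZ = 7.5        # tokenizer frame rate quoted in the text
COMPARISON_RATE_HZ = 50.0  # ASSUMPTION: illustrative higher-rate tokenizer


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `minutes` of audio at the given rate."""
    return round(minutes * 60 * frame_rate_hz)


for minutes in (60, 90):
    low = frames_for(minutes)
    high = frames_for(minutes, COMPARISON_RATE_HZ)
    print(f"{minutes} min -> {low:,} frames at 7.5 Hz, {high:,} at 50 Hz")

# 60 min -> 27,000 frames at 7.5 Hz, 180,000 at 50 Hz
# 90 min -> 40,500 frames at 7.5 Hz, 270,000 at 50 Hz
```

If each frame occupies roughly one sequence position (an assumption, since the exact token layout is not described here), an hour of audio stays in the tens of thousands of positions, which is why long recordings remain tractable for an LLM backbone.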

Models

Model               Size  What it does                                          Weights
VibeVoice-ASR       7B    60-min long-form speech-to-text in a single pass      HuggingFace
VibeVoice-TTS       1.5B  Up to 90-min multi-speaker TTS (up to 4 speakers)     HuggingFace
VibeVoice-Realtime  0.5B  Streaming real-time TTS, ~300 ms first-audio latency  HuggingFace

ASR highlights

  • Processes up to 60 minutes of continuous audio in one pass (64K token context).
  • Outputs structured transcription: who (speaker diarization), when (timestamps), what (content).
  • Supports customized hotwords for domain-specific accuracy.
  • Natively multilingual, 50+ languages.
  • vLLM inference supported for faster throughput.
  • Finetuning code available.
  • Now integrated into HuggingFace Transformers.
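The who/when/what structure in the list above maps naturally onto a small record type. The sketch below is purely illustrative: `Segment`, `to_clock`, and `format_transcript` are hypothetical names, and the field layout is an assumption about how one might hold such output downstream, not the model's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed span (HYPOTHETICAL layout, not the model's real schema)."""
    speaker: str  # who: diarization label
    start: float  # when: start time in seconds
    end: float    # when: end time in seconds
    text: str     # what: transcribed content


def to_clock(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions truncated)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(segments: list[Segment]) -> str:
    """Render segments in chronological order, one line per segment."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(
        f"[{to_clock(seg.start)}-{to_clock(seg.end)}] {seg.speaker}: {seg.text}"
        for seg in ordered
    )


demo = [
    Segment("Speaker 2", 3725.4, 3730.0, "Thanks, that covers it."),
    Segment("Speaker 1", 0.0, 4.2, "Welcome to the meeting."),
]
print(format_transcript(demo))
```

Keeping timestamps as plain seconds makes it easy to sort, merge, or filter segments from an hour-long recording before rendering them.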

TTS highlights

  • Up to 90 minutes of speech in a single pass.
  • Multi-speaker (4 distinct voices) with natural turn-taking.
  • Cross-lingual support (EN, ZH, and others).
  • Can even do spontaneous singing.

Realtime TTS

A lightweight 0.5B model intended for deployment. It accepts streaming text input and produces first audio in ~300 ms. Experimental multilingual voices are available in 9 languages (DE, FR, IT, JP, KR, NL, PL, PT, ES), alongside 11 English style voices.

Install

pip install vibevoice

Requires Python >= 3.10. CUDA is recommended; MPS (Apple Silicon) support has been added for the ASR Gradio demo.
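Before installing, it can help to confirm that the interpreter meets the version floor and to see which accelerator is visible. The following is a minimal sketch under stated assumptions: it assumes the models run on PyTorch, and the helper names `python_ok`, `pick_device`, and `detect_device` are made up here rather than taken from the package. Note that MPS support is only stated for the ASR Gradio demo, so an "mps" result does not guarantee every model will run there.

```python
import sys

MIN_PYTHON = (3, 10)  # version floor stated in the requirements


def python_ok(version=sys.version_info) -> bool:
    """True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple Silicon MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for accelerators (ASSUMPTION: PyTorch is the backend)."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed, so nothing to probe
    mps = getattr(torch.backends, "mps", None)
    mps_available = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_available)


print("Python OK:", python_ok())
print("Device:", detect_device())
```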

Value

Relevant for anyone building voice-enabled products. The ASR model's ability to handle hour-long audio with speaker diarization in one pass stands out, since the recording does not have to be split into chunks first. The TTS code was briefly pulled due to misuse concerns, but the models remain on HuggingFace. The TTS paper was accepted as an ICLR 2026 Oral.
