VibeVoice

Microsoft's open-source frontier voice AI with TTS and ASR models.

VibeVoice is a family of open-source voice AI models from Microsoft covering both text-to-speech (TTS) and automatic speech recognition (ASR). The repository is written in Python, is MIT licensed, and has 42.7k stars.

The core innovation is continuous speech tokenizers running at an ultra-low frame rate of 7.5 Hz, which preserves audio fidelity while keeping long-sequence processing efficient. The models use a next-token diffusion framework: an LLM handles the textual context and a diffusion head generates the acoustic output.
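As a rough sanity check on why the 7.5 Hz frame rate matters, the arithmetic below counts acoustic frames for the durations mentioned in this note. It assumes one token per frame and ignores text and prompt tokens, which is a simplification and not a description of the models' actual token accounting.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in this note


def acoustic_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of acoustic frames for audio of the given length."""
    return int(minutes * 60 * frame_rate_hz)


for minutes in (60, 90):
    print(f"{minutes} min -> {acoustic_frames(minutes):,} frames")
# 60 min -> 27,000 frames
# 90 min -> 40,500 frames
```

Under that assumption an hour of audio is 27,000 frames, which fits inside a 64K-token context; at 50 Hz the same hour would need 180,000 frames.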

28 Apr 2026: implemented in Transcribee as a test. It transcribes at 1x real time (60 min of audio takes 60 min to process), so it is not ideal.
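The "1x" figure is a real-time factor (processing time divided by audio duration). A minimal sketch of that calculation, using a hypothetical helper that is not part of Transcribee or VibeVoice:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# The test run in this note: 60 min of audio took 60 min to transcribe.
rtf = real_time_factor(processing_seconds=60 * 60, audio_seconds=60 * 60)
print(f"RTF = {rtf:.2f}")  # RTF = 1.00
```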

Models

Model               Size  What it does                                         Weights
VibeVoice-ASR       7B    60-min long-form speech-to-text in a single pass     HuggingFace
VibeVoice-TTS       1.5B  Up to 90-min multi-speaker TTS (up to 4 speakers)    HuggingFace
VibeVoice-Realtime  0.5B  Streaming real-time TTS, ~300ms first-audio latency  HuggingFace

ASR highlights

  • Processes up to 60 minutes of continuous audio in one pass (64K token context).
  • Outputs structured transcription: who (speaker diarization), when (timestamps), what (content).
  • Supports customized hotwords for domain-specific accuracy.
  • Natively multilingual, 50+ languages.
  • vLLM inference supported for faster throughput.
  • Finetuning code available.
  • Now integrated into HuggingFace Transformers.
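To make the who/when/what structure concrete, here is a minimal sketch of how such a transcript could be represented and rendered downstream. The `Segment` class and its field names are hypothetical and chosen for illustration; they are not VibeVoice's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str    # who: diarization label
    start_s: float  # when: segment start, in seconds
    end_s: float    # when: segment end, in seconds
    text: str       # what: transcribed content


def fmt_ts(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS, truncating fractions."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Render segments as one '[start - end] speaker: text' line each."""
    return "\n".join(
        f"[{fmt_ts(s.start_s)} - {fmt_ts(s.end_s)}] {s.speaker}: {s.text}"
        for s in segments
    )


# Made-up example data, not real model output.
demo = [
    Segment("Speaker 1", 0.0, 4.2, "Welcome back to the show."),
    Segment("Speaker 2", 4.2, 9.8, "Thanks for having me."),
]
print(render(demo))
```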

TTS highlights

  • Up to 90 minutes of speech in a single pass.
  • Multi-speaker (4 distinct voices) with natural turn-taking.
  • Cross-lingual support (EN, ZH, and others).
  • Can even do spontaneous singing.

Realtime TTS

A lightweight 0.5B model aimed at deployment. It accepts streaming text input and produces first audio in ~300ms. It supports experimental multilingual voices in 9 languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) plus 11 English style voices.

Install

pip install vibevoice

Requires Python >= 3.10. CUDA is recommended; MPS/Apple Silicon support has been added for the ASR Gradio demo.
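A small pre-flight check along these lines can confirm the Python version and choose a compute device before loading a model. This is a generic sketch using standard PyTorch availability checks, not code from the VibeVoice repository; `pick_device` and `detect_device` are hypothetical helper names, and this note only confirms MPS support for the ASR Gradio demo.

```python
import sys


def python_ok(version=sys.version_info) -> bool:
    """True if the interpreter meets the Python >= 3.10 requirement."""
    return tuple(version[:2]) >= (3, 10)


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then MPS (Apple Silicon), then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch if it is installed; report 'cpu' otherwise."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on old PyTorch builds
    return pick_device(torch.cuda.is_available(), bool(mps and mps.is_available()))


if not python_ok():
    print("Warning: Python >= 3.10 is required")
print(f"Selected device: {detect_device()}")
```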

Value

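Relevant for anyone building voice-enabled products. The ASR model's ability to handle hour-long audio with speaker diarization in one pass is the standout feature, since it removes the need to split long recordings into chunks and stitch the results back together.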