VibeVoice | Nic's notes

Table of Contents

Models
ASR highlights
TTS highlights
Realtime TTS
Install
Value

microsoft/VibeVoice

Open-Source Frontier Voice AI

VibeVoice is a family of open-source voice AI models from Microsoft covering both text-to-speech (TTS) and automatic speech recognition (ASR). The repository has 42.7k GitHub stars, is MIT licensed, and is written in Python.

The core innovation is continuous speech tokenizers running at an ultra-low frame rate of 7.5 Hz, which preserves audio fidelity while keeping long-sequence processing efficient. Generation uses a next-token diffusion framework: an LLM handles the textual context and a diffusion head produces the acoustic output.
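To make the 7.5 Hz figure concrete, the arithmetic below counts how many tokenizer frames a recording of a given length produces. It is only a back-of-the-envelope sketch: the function name is made up for illustration, and text tokens or any other per-sequence overhead are not modeled.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in these notes


def acoustic_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames produced by `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


for minutes in (60, 90):
    print(f"{minutes} min -> {acoustic_frames(minutes):,} frames")
# 60 min -> 27,000 frames
# 90 min -> 40,500 frames
```

At this rate an hour of audio is 27,000 frames, which fits inside the 64K-token context listed under ASR highlights.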

28 Apr 2026: implemented in Transcribee as a test. It transcribes at about 1x real time (60 min of audio takes 60 min to process), so not ideal.
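The "1x" observation is usually expressed as a real-time factor (RTF): processing time divided by audio duration. A minimal sketch, with a helper name chosen here for illustration:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# The case from the note: 60 min of audio took 60 min to transcribe.
rtf = real_time_factor(processing_seconds=60 * 60, audio_seconds=60 * 60)
print(f"RTF = {rtf:.2f}")  # RTF = 1.00
```

An RTF of 0.5 would mean the same hour of audio finishes in 30 minutes.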

Models

Model	Size	What it does	Weights
VibeVoice-ASR	7B	60-min long-form speech-to-text in a single pass	HuggingFace
VibeVoice-TTS	1.5B	Up to 90-min multi-speaker TTS (up to 4 speakers)	HuggingFace
VibeVoice-Realtime	0.5B	Streaming real-time TTS, ~300ms first-audio latency	HuggingFace

ASR highlights

Processes up to 60 minutes of continuous audio in one pass (64K token context).
Outputs structured transcription: who (speaker diarization), when (timestamps), what (content).
Supports customized hotwords for domain-specific accuracy.
Natively multilingual, 50+ languages.
vLLM inference supported for faster throughput.
Finetuning code available.
Now integrated into HuggingFace Transformers.
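The who / when / what structure described above maps naturally onto a small record type. The sketch below uses a hypothetical schema (the `Segment` class, its field names, and the sample data are all made up for illustration); the model's actual output format may differ.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str    # who
    start_s: float  # when: start time in seconds
    end_s: float    # when: end time in seconds
    text: str       # what


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS (fractions of a second are dropped)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """One line per segment: [start - end] speaker: text."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)} - {format_timestamp(s.end_s)}] {s.speaker}: {s.text}"
        for s in segments
    )


# Made-up sample data, not real model output.
segments = [
    Segment("Speaker 1", 0.0, 4.2, "Welcome back."),
    Segment("Speaker 2", 4.2, 9.8, "Thanks for having me."),
]
print(render(segments))
```

Keeping times as float seconds and formatting only at render time makes it easy to merge or filter segments by speaker before display.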

TTS highlights

Up to 90 minutes of speech in a single pass.
Multi-speaker (4 distinct voices) with natural turn-taking.
Cross-lingual support (EN, ZH, and others).
Can even do spontaneous singing.

Realtime TTS

Lightweight 0.5B model for deployment. Streaming text input, ~300ms latency. Supports experimental multilingual voices in 9 languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) plus 11 English style voices.
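First-audio latency is simply the time between issuing the request and receiving the first audio chunk. The sketch below measures it on a stand-in generator; `fake_tts_stream` is a placeholder invented for this example and does not call VibeVoice, so swap in whatever streaming iterator the real model exposes.

```python
import time
from typing import Iterable, Iterator


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Placeholder for a streaming TTS call; yields dummy audio chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulated generation time per chunk
        yield b"\x00" * 320


def time_to_first_chunk(stream: Iterable[bytes]) -> tuple[float, int]:
    """Return (seconds until the first chunk arrived, total chunks consumed)."""
    start = time.perf_counter()
    first_latency = None
    count = 0
    for _ in stream:
        if first_latency is None:
            first_latency = time.perf_counter() - start
        count += 1
    if first_latency is None:
        raise ValueError("stream produced no audio")
    return first_latency, count


latency_s, chunks = time_to_first_chunk(fake_tts_stream())
print(f"first audio after {latency_s * 1000:.0f} ms, {chunks} chunks")
```

Because a generator does no work until its first item is requested, starting the timer just before the loop captures the full wait for the first chunk.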

Install

pip install vibevoice

Requires Python >= 3.10. CUDA is recommended; MPS (Apple Silicon) support has been added for the ASR Gradio demo.
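A quick environment check before installing can save a failed run. The sketch below assumes the models run on PyTorch and uses the generic PyTorch device-selection pattern; it is not taken from the VibeVoice code, and both helper names are made up for illustration.

```python
import sys


def python_ok(min_version: tuple[int, int] = (3, 10)) -> bool:
    """True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" based on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed, so no accelerator can be used
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(f"Python >= 3.10: {python_ok()}, device: {pick_device()}")
```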

Value

Relevant for anyone building voice-enabled products. The ASR model's ability to handle hour-long audio with speaker diarization in one shot is significant.

VibeVoice Project Page

Demos, examples, and documentation

https://microsoft.github.io/VibeVoice

VibeVoice Collection on HuggingFace

All model weights and demos

https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f