Speech Synthesis Library
7.9k 2026-04-18

jaywalnut310/vits

VITS is an end-to-end text-to-speech model that generates natural-sounding audio with diverse rhythms. Its authors report that it outperforms the best publicly available two-stage TTS systems in listening tests.

Core Features

End-to-end single-stage training and parallel sampling.
Reports a mean opinion score (MOS) comparable to ground-truth recordings on the single-speaker LJ Speech dataset.
Uses variational inference, normalizing flows, and adversarial training.
Incorporates a stochastic duration predictor for diverse speech rhythms.
Models the one-to-many relationship between text and speech.
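
As a rough illustration of the normalizing-flow ingredient listed above, the sketch below implements one generic affine coupling layer in NumPy and checks that it inverts exactly. It is not the repository's code: the function names, the single-matrix "conditioner" (w, b), and the vector size are all assumptions chosen to keep the example small.

```python
import numpy as np

def coupling_forward(x, w, b):
    """Affine coupling: the first half passes through unchanged,
    the second half is scaled and shifted based on the first half."""
    xa, xb = np.split(x, 2)
    h = np.tanh(w @ xa + b)           # toy conditioner (assumption), stands in for a neural network
    log_s, t = np.split(h, 2)         # per-element log-scale and shift
    yb = xb * np.exp(log_s) + t
    log_det = log_s.sum()             # log |det Jacobian| of the transform
    return np.concatenate([xa, yb]), log_det

def coupling_inverse(y, w, b):
    """Exact inverse: the untouched half lets us recompute the same scale and shift."""
    ya, yb = np.split(y, 2)
    h = np.tanh(w @ ya + b)
    log_s, t = np.split(h, 2)
    xb = (yb - t) * np.exp(-log_s)
    return np.concatenate([ya, xb])

rng = np.random.default_rng(0)
dim = 4                                   # hypothetical feature size
w = rng.normal(size=(dim, dim // 2))      # maps the 2-element half to 4 outputs (log_s and t)
b = rng.normal(size=dim)
x = rng.normal(size=dim)

y, log_det = coupling_forward(x, w, b)
x_rec = coupling_inverse(y, w, b)
print("max reconstruction error:", np.abs(x_rec - x).max())
```

Because the layer is invertible and its log-determinant is cheap to compute, layers like this can be stacked to reshape a simple distribution into a more flexible one while keeping exact likelihoods.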

Quick Start

pip install -r requirements.txt

Detailed Introduction

VITS targets a known gap: end-to-end text-to-speech (TTS) models have often lagged behind two-stage systems in sample quality. It combines variational inference augmented with normalizing flows and an adversarial training process to increase the expressive power of its generative model. A stochastic duration predictor lets it synthesize the same text with different rhythms and, together with uncertainty modeling over latent variables, captures the one-to-many relationship in which a text can be spoken with different pitches and rhythms. In the authors' subjective evaluation (MOS on LJ Speech), VITS outperformed the best publicly available TTS systems of the time and scored comparably to ground truth.
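
To make the idea of stochastic durations concrete, here is a minimal sketch in NumPy. It assumes a simplified model in which each input token has a predicted mean log-duration and Gaussian noise is added before rounding up to whole frames. The actual predictor in VITS is flow-based, so the Gaussian perturbation is a stand-in, and every name here (sample_durations, noise_scale, length_scale, the example values) is hypothetical.

```python
import numpy as np

def sample_durations(log_dur_mean, noise_scale, length_scale, rng):
    """Draw one set of per-token durations (in frames).

    noise_scale = 0 gives a deterministic rhythm; larger values give more varied rhythms.
    length_scale stretches or compresses the overall speaking rate.
    """
    eps = rng.normal(size=log_dur_mean.shape)
    log_w = log_dur_mean + noise_scale * eps      # perturb in the log domain
    w = np.exp(log_w) * length_scale              # always positive
    return np.ceil(w).astype(int)                 # whole frames, at least 1 each

# Hypothetical mean log-durations for four tokens (assumption, not model output).
log_dur_mean = np.array([1.0, 1.6, 0.7, 2.1])
rng = np.random.default_rng(0)

fixed_a = sample_durations(log_dur_mean, 0.0, 1.0, rng)
fixed_b = sample_durations(log_dur_mean, 0.0, 1.0, rng)
slower = sample_durations(log_dur_mean, 0.0, 2.0, rng)
varied = [sample_durations(log_dur_mean, 0.8, 1.0, rng) for _ in range(5)]

print("deterministic:", fixed_a, fixed_b)
print("slower:", slower)
for d in varied:
    print("sampled:", d, "total frames:", d.sum())
```

Each sampled duration vector corresponds to a different rhythm for the same text, which is the one-to-many behavior the paragraph above describes.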
