yl4579/StyleTTS2
StyleTTS 2 is a text-to-speech model that achieves human-level speech synthesis by leveraging style diffusion and adversarial training with large speech language models.
Core Features
Quick Start
pip install -r requirements.txt
Detailed Introduction
StyleTTS 2 is a text-to-speech (TTS) model aimed at improving the naturalness of synthetic speech. It uses a style diffusion model to generate a speech style suited to the input text, so no reference audio is required. It also integrates large pre-trained speech language models (SLMs) such as WavLM into an adversarial training framework, where they serve as discriminators, which improves the naturalness of the synthesized speech. With this approach, StyleTTS 2 achieves human-level TTS quality on both single-speaker and multi-speaker datasets and performs strongly in zero-shot speaker adaptation.