index-tts/index-tts
An industrial-level, zero-shot text-to-speech system offering precise duration control and disentangled emotional expression for highly natural and controllable speech synthesis.
Detailed Introduction
IndexTTS2 is an autoregressive zero-shot text-to-speech system designed to address a limitation of existing autoregressive models: the lack of duration control, which matters for applications such as video dubbing. It introduces a method for precisely controlling speech duration and allows emotional expression and speaker identity to be controlled independently of each other. In zero-shot settings, the system reconstructs the target timbre while reproducing the specified emotional tone. By incorporating GPT latent representations and a three-stage training paradigm, IndexTTS2 improves speech clarity and stability, even for strongly emotional expressions. A soft instruction mechanism further simplifies emotional guidance, making the system suitable for a range of controllable speech synthesis tasks.
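To make the dubbing use case concrete, here is a minimal sketch of the arithmetic behind fitting speech to a fixed time slot. It assumes that duration is controlled by fixing how many speech tokens the autoregressive model generates, and that tokens are produced at a constant rate; neither detail is spelled out in this description. The function names and the `tokens_per_second` value are hypothetical and are not part of the IndexTTS2 API.

```python
def token_budget(target_seconds: float, tokens_per_second: float) -> int:
    """Return how many speech tokens fill a clip of the given length.

    Both arguments are illustrative assumptions: the real token rate
    depends on the model's codec and is not stated in this document.
    """
    if target_seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("duration and token rate must be positive")
    # Round to the nearest whole token, and always emit at least one.
    return max(1, round(target_seconds * tokens_per_second))


def duration_error(n_tokens: int, tokens_per_second: float,
                   target_seconds: float) -> float:
    """Return the signed gap (in seconds) between synthesized and target length."""
    return n_tokens / tokens_per_second - target_seconds


if __name__ == "__main__":
    rate = 25.0   # hypothetical tokens per second
    slot = 2.0    # length of the video segment to dub, in seconds
    budget = token_budget(slot, rate)
    print(budget, duration_error(budget, rate, slot))
```

Under these assumptions, the worst-case mismatch between the synthesized clip and the target slot is half a token period, which is why a fixed token budget is attractive for lip-synced dubbing.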