index-tts/index-tts
IndexTTS2 is an industrial-level, zero-shot text-to-speech system offering precise duration control and disentangled emotional expression for highly natural and controllable speech synthesis.
Core Features
Detailed Introduction
IndexTTS2 addresses a key limitation of existing autoregressive large-scale text-to-speech models: because they generate speech token by token, the duration of the output is difficult to control precisely. Precise duration matters for applications such as video dubbing, where the audio must stay synchronized with the picture. IndexTTS2 introduces a duration-control method with two generation modes. In the first, the number of generated tokens is specified explicitly. In the second, speech is generated freely in an autoregressive manner with no token count given. Prosodic features are preserved in both. IndexTTS2 also disentangles emotional expression from speaker identity, so the two can be controlled independently. To improve clarity in emotional speech, it incorporates GPT latent representations and uses a three-stage training paradigm. Emotional control is further simplified through text-based soft instructions.
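The two generation modes can be illustrated with a toy decoding loop. This is a minimal sketch and not the IndexTTS2 implementation or its API: the names `tokens_for_duration`, `generate`, `toy_step`, and `EOS` are hypothetical, the stand-in "model" is a trivial function, and the rate of 25 tokens per second is an assumed figure chosen only to make the arithmetic concrete.

```python
from typing import Callable, List, Optional

EOS = -1  # hypothetical end-of-speech token id (assumption for this sketch)


def tokens_for_duration(seconds: float, tokens_per_second: float) -> int:
    """Convert a target duration into a token budget (at least one token)."""
    return max(1, round(seconds * tokens_per_second))


def toy_step(history: List[int], allow_eos: bool) -> int:
    """Stand-in for a real model: it 'wants' to stop after 5 tokens."""
    if allow_eos and len(history) >= 5:
        return EOS
    return len(history) % 3


def generate(
    step: Callable[[List[int], bool], int],
    num_tokens: Optional[int] = None,
    max_tokens: int = 1000,
) -> List[int]:
    """Toy autoregressive loop with two modes.

    num_tokens given: emit exactly that many tokens (EOS is never allowed).
    num_tokens None:  let the model stop itself, up to a safety cap.
    """
    out: List[int] = []
    if num_tokens is not None:
        # Explicit mode: the length is fixed up front.
        while len(out) < num_tokens:
            out.append(step(out, False))
        return out
    # Free mode: stop when the model emits EOS or the cap is reached.
    while len(out) < max_tokens:
        tok = step(out, True)
        if tok == EOS:
            break
        out.append(tok)
    return out


# A 2-second dubbing slot at the assumed 25 tokens/second.
budget = tokens_for_duration(2.0, 25.0)
fixed = generate(toy_step, num_tokens=budget)
free = generate(toy_step)
```

In the explicit mode, the output length is set by the caller (here derived from a 2-second target), while in the free mode the length is whatever the model chooses. How IndexTTS2 conditions its model on the requested token count is not described in this summary and is not modeled here.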