AI-powered Text-to-Speech System
20.1k 2026-04-18

index-tts/index-tts

IndexTTS2 is an industrial-level, zero-shot text-to-speech system offering precise duration control and disentangled emotional expression for highly natural and controllable speech synthesis.

Core Features

Precise speech duration control with two generation modes.
Disentangled control over emotional expression and speaker identity.
Zero-shot timbre reconstruction and emotional tone reproduction.
Enhanced speech clarity through GPT latent representations and a three-stage training paradigm.
Soft instruction mechanism for emotional control via text descriptions.

Detailed Introduction

IndexTTS2 addresses key limitations of existing autoregressive large-scale text-to-speech models, chiefly the difficulty of controlling speech duration precisely, which is crucial for applications such as video dubbing. Its duration-control method supports two generation modes: explicit specification of the number of generated tokens, and free autoregressive generation, while preserving prosodic features. IndexTTS2 also disentangles emotional expression from speaker identity, so the two can be controlled independently. It improves clarity in emotional speech through GPT latent representations and a three-stage training paradigm, and it simplifies emotional control through a soft instruction mechanism driven by text descriptions.
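
The two generation modes and the separate speaker and emotion controls can be pictured with a minimal sketch. This is not the IndexTTS2 API: `SynthesisRequest`, `token_budget_for`, `build_request`, and the rate `ASSUMED_TOKENS_PER_SECOND` are hypothetical names and values chosen for illustration, since this listing does not state how tokens map to seconds.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: illustrative token rate only; the real rate is not given in this listing.
ASSUMED_TOKENS_PER_SECOND = 25.0


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical request object, not part of the IndexTTS2 codebase."""
    text: str
    speaker_prompt: str              # reference audio that sets the timbre
    emotion_prompt: Optional[str]    # separate audio or text description for emotion
    token_budget: Optional[int]      # None selects free autoregressive generation


def token_budget_for(duration_s: float,
                     tokens_per_second: float = ASSUMED_TOKENS_PER_SECOND) -> int:
    """Convert a target duration into an explicit token count (at least 1)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return max(1, round(duration_s * tokens_per_second))


def build_request(text: str,
                  speaker_prompt: str,
                  emotion_prompt: Optional[str] = None,
                  target_duration_s: Optional[float] = None) -> SynthesisRequest:
    """Pick the mode: a fixed token budget when a duration is given, free otherwise."""
    budget = None if target_duration_s is None else token_budget_for(target_duration_s)
    return SynthesisRequest(text, speaker_prompt, emotion_prompt, budget)


if __name__ == "__main__":
    # Dubbing case: the line must fit a 2-second slot, so a token budget is set.
    dubbed = build_request("Hello there.", "speaker.wav",
                           emotion_prompt="calm and warm", target_duration_s=2.0)
    # Free case: no duration constraint, so the budget stays unset.
    free = build_request("Hello there.", "speaker.wav")
    print(dubbed.token_budget, free.token_budget)
```

Keeping `speaker_prompt` and `emotion_prompt` as separate fields mirrors the disentangled control described above: either one can be changed without touching the other.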

