OSS Alternative - Discover Top Open Source Alternatives to Popular Software

FunAudioLLM/CosyVoice

CosyVoice is a multilingual text-to-speech system built on a large language model, offering state-of-the-art voice generation, voice cloning, and full-stack deployment capabilities.

Core Features

Zero-shot multilingual and cross-lingual voice cloning across 9 languages and 18+ Chinese dialects.

Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.

Supports pronunciation inpainting for fine-grained control over how words are spoken, and handles text normalization without a traditional frontend.

Features bi-streaming (streaming text input and streaming audio output) for real-time output with latency as low as 150 ms, plus instruct support for controlling various voice parameters.
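
To make the latency figure in the feature list concrete, here is a minimal Python sketch of how time-to-first-chunk could be measured for any streaming synthesizer. It does not use the CosyVoice API: `fake_streaming_tts` is a hypothetical stand-in that consumes text pieces as they arrive and yields placeholder bytes, and `measure_first_chunk_latency` is an illustrative helper name, not something shipped with the project.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def fake_streaming_tts(text_pieces: Iterable[str], chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a bi-streaming synthesizer (NOT the CosyVoice API).

    Consumes text pieces one at a time and yields one dummy chunk per piece,
    sleeping briefly to mimic the compute time of real synthesis.
    """
    for piece in text_pieces:
        time.sleep(chunk_delay_s)
        yield piece.encode("utf-8")  # placeholder bytes, not real audio


def measure_first_chunk_latency(
    stream: Iterator[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[bytes]]:
    """Return (seconds until the first chunk arrived, all chunks in order)."""
    start = clock()
    first_latency = None
    chunks: List[bytes] = []
    for chunk in stream:
        if first_latency is None:
            # Record the delay only once, when the first chunk shows up.
            first_latency = clock() - start
        chunks.append(chunk)
    if first_latency is None:
        raise ValueError("stream produced no chunks")
    return first_latency, chunks


latency_s, chunks = measure_first_chunk_latency(
    fake_streaming_tts(["hello", "streaming", "world"])
)
print(f"first chunk after {latency_s * 1000:.1f} ms, {len(chunks)} chunks total")
```

Because the generator is lazy, the timer starts before any synthesis work runs, so the measured value covers the full wait for the first chunk. With a real model, the stand-in generator would be swapped for the model's own streaming output.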

Quick Start

git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git

Detailed Introduction

CosyVoice is a text-to-speech (TTS) system built on large language models and designed for high-quality, zero-shot multilingual speech synthesis. It generates natural-sounding speech with strong content consistency and speaker similarity across 9 major languages and numerous Chinese dialects. The project provides inference, training, and deployment tools, which makes it suitable for production environments. Features such as pronunciation inpainting, text normalization, and bi-streaming for low-latency output make CosyVoice a potential open source replacement for traditional commercial TTS services.