A fork of https://github.com/2noise/ChatTTS, published to PyPI.

ChatTTS

For this fork

  • pip3 install chattts-fork
  • chattts "哈哈" -o test.wav
  • A fixed voice via seed is supported: chattts "哈哈" -o test.wav --seed 222
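The CLI calls above can also be assembled from Python. The following is a minimal sketch that only builds the argument list, using the `-o` and `--seed` flags shown in the list; the helper name `build_chattts_command` is hypothetical and not part of the package.

```python
import shlex
from typing import List, Optional


def build_chattts_command(text: str, out_path: str, seed: Optional[int] = None) -> List[str]:
    """Build the argv list for the `chattts` CLI (hypothetical helper)."""
    cmd = ["chattts", text, "-o", out_path]
    if seed is not None:
        # --seed fixes the voice, as described in the list above
        cmd += ["--seed", str(seed)]
    return cmd


cmd = build_chattts_command("哈哈", "test.wav", seed=222)
print(shlex.join(cmd))  # shell-quoted form of the command
```

Once the package is installed, the list can be passed to `subprocess.run(cmd, check=True)`.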

ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. It supports both English and Chinese. Our model is trained on more than 100,000 hours of Chinese and English audio. The open-source version on HuggingFace is a pretrained model trained on 40,000 hours, without SFT.

For formal inquiries about the model and roadmap, please contact us at open-source@2noise.com. You can join our QQ groups for discussion: 808364215 (full) and 230696694 (group 2). GitHub issues are always welcome.


Highlights

  1. Conversational TTS: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.
  2. Fine-grained Control: The model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections.
  3. Better Prosody: ChatTTS surpasses most open-source TTS models in terms of prosody. We provide pretrained models to support further research and development.

For a detailed description of the model, refer to the video on Bilibili.


Disclaimer

This repo is for academic purposes only. It is intended for educational and research use, and should not be used for any commercial or legal purposes. The authors do not guarantee the accuracy, completeness, or reliability of the information. The information and data used in this repo are for academic and research purposes only. The data was obtained from publicly available sources, and the authors do not claim any ownership or copyright over the data.

ChatTTS is a powerful text-to-speech system. However, it is very important to utilize this technology responsibly and ethically. To limit the use of ChatTTS, we added a small amount of high-frequency noise during the training of the 40,000-hour model, and compressed the audio quality as much as possible using MP3 format, to prevent malicious actors from potentially using it for criminal purposes. At the same time, we have internally trained a detection model and plan to open-source it in the future.


Usage

Basic usage

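import torch
import torchaudio
import ChatTTS
from IPython.display import Audio

chat = ChatTTS.Chat()
chat.load_models(compile=False)  # Set to True for better performance

texts = ["PUT YOUR TEXT HERE"]

# infer returns one waveform (a NumPy array) per input text
wavs = chat.infer(texts)

# Save the first waveform at a 24 kHz sample rate
torchaudio.save("output1.wav", torch.from_numpy(wavs[0]), 24000)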
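If torchaudio is not available, the waveform can in principle be written with the standard library instead. The sketch below is not part of ChatTTS; it assumes the samples are mono floats roughly in [-1.0, 1.0] and reuses the 24000 Hz rate from the call above. The helper name `save_wav_pcm16` and the demo tone are hypothetical.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Iterable

SAMPLE_RATE = 24000  # same rate as the torchaudio.save call above


def save_wav_pcm16(path: str, samples: Iterable[float], sample_rate: int = SAMPLE_RATE) -> int:
    """Write mono float samples as 16-bit PCM and return the number of frames."""
    frames = bytearray()
    count = 0
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # assumption: values near [-1, 1]
        frames += struct.pack("<h", int(round(clipped * 32767)))
        count += 1
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return count


# Demo with a synthetic 0.1 s, 440 Hz tone standing in for model output.
demo_path = os.path.join(tempfile.gettempdir(), "chattts_demo.wav")
tone = [0.2 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 10)]
written = save_wav_pcm16(demo_path, tone)
```

With real output, something like `save_wav_pcm16("output1.wav", wavs[0].flatten())` should work, assuming `wavs[0]` is a NumPy array as the `torch.from_numpy` call suggests.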

Advanced usage

###################################
# Sample a speaker from Gaussian.

rand_spk = chat.sample_random_speaker()

params_infer_code = {
  'spk_emb': rand_spk, # add sampled speaker 
  'temperature': .3, # using custom temperature
  'top_P': 0.7, # top P decode
  'top_K': 20, # top K decode
}

###################################
# For sentence level manual control.

# use oral_(0-9), laugh_(0-2), break_(0-7) 
# to generate special token in text to synthesize.
params_refine_text = {
  'prompt': '[oral_2][laugh_0][break_6]'
} 

wav = chat.infer(texts, params_refine_text=params_refine_text, params_infer_code=params_infer_code)

###################################
# For word level manual control.
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wav = chat.infer(text, skip_refine_text=True, params_refine_text=params_refine_text, params_infer_code=params_infer_code)
# Save the waveform produced by the call above
torchaudio.save("output2.wav", torch.from_numpy(wav[0]), 24000)
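The sentence-level prompt string can be assembled programmatically. This is a minimal sketch, assuming the oral_(0-9), laugh_(0-2), break_(0-7) ranges from the comment above are inclusive; the helper name `build_refine_prompt` is hypothetical and not a ChatTTS API.

```python
def build_refine_prompt(oral: int, laugh: int, break_: int) -> str:
    """Return a prompt such as '[oral_2][laugh_0][break_6]' after range checks."""
    # (value, inclusive upper bound) per control, taken from the comment above
    limits = {"oral": (oral, 9), "laugh": (laugh, 2), "break": (break_, 7)}
    parts = []
    for name, (value, upper) in limits.items():
        if not 0 <= value <= upper:
            raise ValueError(f"{name} must be in 0..{upper}, got {value}")
        parts.append(f"[{name}_{value}]")
    return "".join(parts)


params_refine_text = {"prompt": build_refine_prompt(oral=2, laugh=0, break_=6)}
```

The resulting dictionary has the same shape as the `params_refine_text` passed to `chat.infer` above.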

Example: self introduction

inputs_en = """
chat T T S is a text to speech model designed for dialogue applications. 
[uv_break]it supports mixed language input [uv_break]and offers multi speaker 
capabilities with precise control over prosodic elements [laugh]like like 
[uv_break]laughter[laugh], [uv_break]pauses, [uv_break]and intonation. 
[uv_break]it delivers natural and expressive speech,[uv_break]so please
[uv_break] use the project responsibly at your own risk.[uv_break]
""".replace('\n', '') # English is still experimental.

params_refine_text = {
  'prompt': '[oral_2][laugh_0][break_4]'
} 
# audio_array_cn = chat.infer(inputs_cn, params_refine_text=params_refine_text)
audio_array_en = chat.infer(inputs_en, params_refine_text=params_refine_text)
torchaudio.save("output3.wav", torch.from_numpy(audio_array_en[0]), 24000)


Roadmap

  • Open-source the 40k hour base model and spk_stats file
  • Open-source VQ encoder and LoRA training code
  • Streaming audio generation without refining the text
  • Open-source the 40k hour version with multi-emotion control
  • ChatTTS.cpp maybe? (A PR or a new repo is welcome.)

FAQ

How much VRAM do I need? What about inference speed?

For a 30-second audio clip, at least 4GB of GPU memory is required. On a 4090 GPU, it can generate audio corresponding to approximately 7 semantic tokens per second. The Real-Time Factor (RTF) is around 0.3.

Model stability is not good enough, with issues such as multiple speakers or poor audio quality.

This is a problem that typically occurs with autoregressive models (such as bark and valle). It is generally difficult to avoid. One can try multiple samples to find a suitable result.
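As a rough worked example of what that RTF means, the sketch below assumes the usual definition of RTF as generation time divided by audio duration. Under that assumption, a 30-second clip at RTF 0.3 takes roughly 9 seconds to generate. The function name is hypothetical.

```python
def estimated_generation_seconds(audio_seconds: float, rtf: float = 0.3) -> float:
    """Estimate wall-clock generation time, assuming RTF = generation time / audio duration."""
    if audio_seconds < 0 or rtf < 0:
        raise ValueError("audio_seconds and rtf must be non-negative")
    return audio_seconds * rtf


estimate = estimated_generation_seconds(30.0)  # 30-second clip at the quoted RTF
print(f"about {estimate:.1f} s of compute for 30 s of audio")
```

Actual timings depend on hardware and settings, so this is only a back-of-the-envelope estimate.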

Besides laughter, can we control anything else? Can we control other emotions?

In the current released model, the only token-level control units are [laugh], [uv_break], and [lbreak]. In future versions, we may open-source models with additional emotional control capabilities.
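Since only these three tokens are supported at the token level, it can help to check input text before synthesis. The following is a minimal sketch, not part of ChatTTS; it assumes control tokens are bracketed runs of letters, digits, and underscores, and the helper name `unsupported_tokens` is hypothetical.

```python
import re
from typing import List

# Token-level controls listed in the answer above
SUPPORTED_TOKENS = {"[laugh]", "[uv_break]", "[lbreak]"}
# Assumption: a control token is a bracketed run of letters, digits, or underscores
TOKEN_PATTERN = re.compile(r"\[[A-Za-z0-9_]+\]")


def unsupported_tokens(text: str) -> List[str]:
    """Return bracketed tokens in text that are not supported token-level controls."""
    return [tok for tok in TOKEN_PATTERN.findall(text) if tok not in SUPPORTED_TOKENS]


text = "What is [uv_break]your favorite english food?[laugh][lbreak]"
print(unsupported_tokens(text))
```

Sentence-level tokens such as `[oral_2]` would be flagged by this check, which fits their intended use in the `prompt` of `params_refine_text` rather than in the text itself.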


Acknowledgements

  • bark, XTTSv2 and valle demonstrate remarkable TTS results with an autoregressive-style system.
  • fish-speech reveals the capability of GVQ as an audio tokenizer for LLM modeling.
  • vocos, which is used as a pretrained vocoder.
