# FCH-TTS
**Repository Path**: axellance/FCH-TTS
## Basic Information
- **Project Name**: FCH-TTS
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: encoder
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-09
- **Last Updated**: 2025-04-09
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
[简体中文](./README.md) | English
# Parallel TTS
[TOC]
## What's New !
- 2021/04/09: the [wavegan](https://github.com/atomicoo/ParallelTTS/tree/wavegan) branch now supports the [PWG](https://arxiv.org/abs/1910.11480) / [MelGAN](https://arxiv.org/abs/1910.06711) / [Multi-band MelGAN](https://arxiv.org/abs/2005.05106) vocoders!
- 2021/04/05: added support for [ParallelText2Mel](https://github.com/atomicoo/ParallelTTS/blob/main/models/parallel.py) + the [MelGAN](https://arxiv.org/abs/1910.06711) vocoder!
- [Key Info] [Speed benchmarks](#Speed), [Samples](https://github.com/atomicoo/ParallelTTS/tree/main/samples/), [Web Demo](https://github.com/atomicoo/PTTS-WebAPP), [Communication](#Communication), ...
## Repo Structure
```
.
|--- config/                # configuration files
|     |--- default.yaml
|     |--- ...
|--- datasets/              # data processing
|--- encoder/               # voice encoder
|     |--- voice_encoder.py
|     |--- ...
|--- helpers/               # helper modules
|     |--- trainer.py
|     |--- synthesizer.py
|     |--- ...
|--- logdir/                # training log directory
|--- losses/                # loss functions
|--- models/                # synthesizer
|     |--- layers.py
|     |--- duration.py
|     |--- parallel.py
|--- pretrained/            # pretrained models (LJSpeech dataset)
|--- samples/               # synthesized samples
|--- utils/                 # common utilities
|--- vocoder/               # vocoder
|     |--- melgan.py
|     |--- ...
|--- wandb/                 # Wandb save directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py     # prepare the dataset
|--- README.md
|--- requirements.txt       # dependencies
|--- synthesize.py          # synthesis script
|--- train-duration.py      # training scripts
|--- train-parallel.py
```
## Samples
[Here](https://github.com/atomicoo/ParallelTTS/tree/main/samples/) are some synthesized samples.
## Pretrained
[Here](https://github.com/atomicoo/ParallelTTS/tree/main/pretrained/) are some pretrained models.
## Quick Start
**Step (1)**: clone the repo
```shell
$ git clone https://github.com/atomicoo/ParallelTTS.git
```
**Step (2)**: install dependencies
```shell
$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt
```
**Step (3)**: synthesize audio
```shell
$ python synthesize.py
```
## Training
**Step (1)**: prepare the dataset
```shell
$ python prepare-dataset.py
```
Use `--config` to specify the config file; the default ([`default.yaml`](https://github.com/atomicoo/ParallelTTS/blob/main/config/default.yaml)) targets the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset.
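For example, to run the preparation step against an explicit config file:
```shell
$ python prepare-dataset.py --config config/default.yaml
```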
**Step (2)**: train the alignment model
```shell
$ python train-duration.py
```
**Step (3)**: extract durations
```shell
$ python extract-duration.py
```
Use `--ground_truth` to control whether ground-truth spectrograms are generated.
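For example (assuming `--ground_truth` behaves as a boolean switch, per the description above):
```shell
$ python extract-duration.py --ground_truth
```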
**Step (4)**: train the synthesis model
```shell
$ python train-parallel.py
```
Use `--ground_truth` to control whether the model is trained on ground-truth spectrograms.
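For example, to train on the ground-truth spectrograms extracted above (same boolean-switch assumption):
```shell
$ python train-parallel.py --ground_truth
```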
## Training Log
If you use [TensorBoardX](https://github.com/lanpa/tensorboardX), run:
```shell
$ tensorboard --logdir logdir/[DIR]/
```
Using [Wandb](https://wandb.ai/) (Weights & Biases) is highly recommended; just pass `--enable_wandb` when training.
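For example, when training the synthesis model:
```shell
$ python train-parallel.py --enable_wandb
```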
## Datasets
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): English, Female, 22050 Hz, ~24 h
- [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut): Japanese, Female, 48000 Hz, ~10 h
- [BiaoBei](https://www.data-baker.com/open_source.html): Mandarin, Female, 48000 Hz, ~12 h
- [KSS](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset): Korean, Female, 44100 Hz, ~12 h
- [RuLS](https://www.openslr.org/96/): Russian, Multi-speakers (only use audios of single speaker), 16000 Hz, ~98 h
- [TWLSpeech](#) (non-public, poor quality): Tibetan, Female (multiple speakers with similar voices), 16000 Hz, ~23 h
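The corpora above span 16000 Hz to 48000 Hz, so pairing one of them with a vocoder trained at another rate (see the Known Issues section) requires resampling first. A minimal sketch using librosa, which is not part of this repo:
```python
import librosa

# Load at the file's native rate, then resample to the vocoder's
# training rate (22050 Hz for the LJSpeech vocoder mentioned below).
wav, sr = librosa.load("sample.wav", sr=None)
wav_22k = librosa.resample(wav, orig_sr=sr, target_sr=22050)
```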
## Quality
TODO: to be added.
## Speed
**Speed of Training**: on the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset with batch size 64, training on an 8 GB GTX 1080 GPU takes ~8 h (~300 epochs).
**Speed of Synthesizing**: tested on CPU @ Intel Core i7-8550U / GPU @ NVIDIA GeForce MX150; each synthesized clip is about 8 s of audio (roughly 20 words). The table below reports per-batch times in seconds.
| Batch Size | Spec (GPU) | Audio (GPU) | Spec (CPU) | Audio (CPU) |
| ---------- | ------------- | -------------- | ------------- | -------------- |
| 1 | 0.042 | 0.218 | 0.100 | 2.004 |
| 2 | 0.046 | 0.453 | 0.209 | 3.922 |
| 4 | 0.053 | 0.863 | 0.407 | 7.897 |
| 8 | 0.062 | 2.386 | 0.878 | 14.599 |
Note: these numbers are from single runs, for reference only.
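A real-time factor (RTF) can be estimated from these numbers: total synthesis time divided by the duration of the audio produced. A quick check in Python, assuming per-batch times in seconds and ~8 s clips as described above:
```python
# Batch size 1 on GPU: spectrogram time + vocoder time vs. ~8 s of audio.
spec_s, audio_s, clip_len_s = 0.042, 0.218, 8.0
rtf = (spec_s + audio_s) / clip_len_s
print(f"RTF ~= {rtf:.3f}")  # ~0.033, i.e. roughly 30x faster than real time
```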
## Known Issues
- In the [wavegan](https://github.com/atomicoo/ParallelTTS/tree/wavegan) branch, the `vocoder` code comes from [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN). Since the two projects extract acoustic features differently, the features must be transformed before being fed to the vocoder; see [here](https://github.com/atomicoo/ParallelTTS/blob/4eb44679271494f1d478da281ae474a07dfe77c6/synthesize.wave.py#L79-L85) and the sketch after this list.
- The input of the Mandarin model is pinyin. Because [BiaoBei](https://www.data-baker.com/open_source.html)'s raw pinyin sequences lack punctuation and the alignment model was not fully trained, the rhythm of the synthesized samples is somewhat off.
- No Korean vocoder was trained specifically; the LJSpeech vocoder (22050 Hz) is used instead, which may slightly degrade the quality of the synthesized Korean audio.
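For illustration only, a transform of the kind the first item describes might look like the following. The conventions assumed here (natural-log mel from the synthesizer, mean-variance-normalized log10 mel for the vocoder) are hypothetical; the actual conversion is the linked code:
```python
import numpy as np

def bridge_features(mel_ln: np.ndarray, mean: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Hypothetical bridge between two acoustic-feature conventions.

    Assumes the synthesizer emits natural-log mel spectrograms while the
    vocoder was trained on log10 mel normalized with dataset statistics.
    """
    mel_log10 = mel_ln / np.log(10)    # ln -> log10
    return (mel_log10 - mean) / scale  # mean-variance normalization
```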
## References
- [Kyubyong/tacotron](https://github.com/Kyubyong/tacotron)
- [r9y9/deepvoice3_pytorch](https://github.com/r9y9/deepvoice3_pytorch)
- [tugstugi/pytorch-dc-tts](https://github.com/tugstugi/pytorch-dc-tts)
- [janvainer/speedyspeech](https://github.com/janvainer/speedyspeech)
- [Po-Hsun-Su/pytorch-ssim](https://github.com/Po-Hsun-Su/pytorch-ssim)
- [Maghoumi/pytorch-softdtw-cuda](https://github.com/Maghoumi/pytorch-softdtw-cuda)
- [seungwonpark/melgan](https://github.com/seungwonpark/melgan)
- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
## TODO
- [ ] Synthetic speech quality assessment (MOS)
- [ ] More tests in different languages
- [ ] Speech style transfer (tone)
## Communication
- WeChat (VX): Joee1995
- QQ: 793071559