DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP
0. Contents
Abstract
Demos on TTS
- Conversational TTS (Chinese)
- Low-quality speech few-shot (Chinese)
- Unseen styles (Chinese)
- Emotional TTS (Chinese)
- Cross-lingual TTS (English)
- Unseen language (Thai)
- Unseen speakers (Chinese and English)
- Seen speakers (Chinese and English)
- Extra audio samples that do not appear in the paper
Demos on copy synthesis
1. Abstract
Recent development of neural vocoders based on generative adversarial networks (GANs) has shown their advantage of generating raw waveforms conditioned on mel-spectrograms with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech in various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in the training phase and the predicted spectrograms used in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the mel-spectrogram predicted by the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision for the GAN-based vocoder. We also utilize sine excitation as the time-domain supervision to improve harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experimental results show that DSPGAN significantly outperforms the compared approaches and can generate high-fidelity speech from diverse data in TTS.
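As a rough illustration of the time-domain supervision mentioned above, the sketch below builds a sine excitation signal from a frame-level F0 contour, in the spirit of source-filter/NSF-style excitation; the sample rate, hop size, and amplitude are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of building a sine excitation from a frame-level F0 contour;
# sample rate, hop size, and amplitude are illustrative assumptions.
import numpy as np

def sine_excitation(f0_frames, sr=16000, hop_length=200, amp=0.1):
    """Upsample frame-level F0 to sample level and integrate phase to a sine."""
    # Repeat each F0 value over its hop so F0 is defined at every sample.
    f0 = np.repeat(f0_frames, hop_length)
    # Instantaneous phase is the cumulative sum of per-sample angular steps.
    phase = 2 * np.pi * np.cumsum(f0 / sr)
    excitation = amp * np.sin(phase)
    # Unvoiced frames (F0 == 0) contribute no harmonic component.
    excitation[f0 == 0] = 0.0
    return excitation

# Example: 1 second of a flat 200 Hz contour (80 frames at a 200-sample hop).
exc = sine_excitation(np.full(80, 200.0))
```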

2. Demos on TTS
We train a multi-speaker, multi-lingual vocoder as a universal vocoder (without fine-tuning) on 295 hours of audio from 308 speakers covering Chinese and English. Below are demos that use this universal vocoder for various TTS tasks.
2.1 Conversational TTS
The expressive conversational speech synthesis task is evaluated on a conversational dataset, 10 hours in total, containing one male speaker and one female speaker.
Speakers:
male | female |
---|---|
Demos:
speaker | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
male | | | | | |
male | | | | | |
male | | | | | |
male | | | | | |
male | | | | | |
female | | | | | |
female | | | | | |
female | | | | | |
female | | | | | |
female | | | | | |
2.2 Low-quality speech few-shot
Speakers:
spk1 |
---|
Demos:
speaker | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
spk1 | | | | | |
spk1 | | | | | |
2.3 Unseen styles
The stylistic speech synthesis task is evaluated on a single-speaker multi-style dataset, 30 minutes in total, with 5 styles, i.e., poetry, fairy tale, joke, story, and thriller.
Styles:
poetry | fairy tale | joke | story | thriller |
---|---|---|---|---|
Demos:
style | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
sty1 | | | | | |
sty1 | | | | | |
sty2 | | | | | |
sty2 | | | | | |
sty3 | | | | | |
sty3 | | | | | |
sty4 | | | | | |
sty4 | | | | | |
sty5 | | | | | |
sty5 | | | | | |
2.4 Emotional TTS
The emotional speech synthesis task is evaluated on a single-speaker multi-emotion dataset, 12 hours in total, with 6 emotions, i.e., sad, angry, happy, disgusted, fearful, and surprised. We also apply controllable emotional intensity in the acoustic model.
Emotion:
happy | angry | sad | surprise | fearful | disgusted |
---|---|---|---|---|---|
Demos:
emotion | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
emo1 | | | | | |
emo2 | | | | | |
emo3 | | | | | |
emo3 | | | | | |
emo4 | | | | | |
emo4 | | | | | |
emo5 | | | | | |
emo5 | | | | | |
emo6 | | | | | |
emo6 | | | | | |
2.5 Cross-lingual TTS
The cross-lingual TTS task means using foreign text to synthesize speech in a language foreign to the speaker. We use "Chinese/English" text with a "Chinese/English" speaker ID and an "English/Chinese" language ID to implement cross-lingual TTS. We show the "Chinese speaker speaks English" demos here.
Chinese speakers' Chinese audio:
spk1 | spk2 |
---|---|
Demos:
speaker | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
spk1 | | | | | |
spk1 | | | | | |
spk1 | | | | | |
spk1 | | | | | |
spk1 | | | | | |
spk2 | | | | | |
spk2 | | | | | |
spk2 | | | | | |
spk2 | | | | | |
spk2 | | | | | |
2.6 Unseen language
Thai is a language unseen in vocoder training, so we use Thai speakers for testing. Furthermore, as a greater challenge, we show demos of "Chinese speaker speaks Thai with emotion" produced by cross-lingual TTS and emotion transfer.
Speakers (spk1 is a Thai speaker and spk2 is a Chinese speaker):
spk1 | spk2 |
---|---|
Demos on "Thai speaker say Thai":
speaker | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
spk1 | | | | | |
spk1 | | | | | |
spk1 | | | | | |
spk1 | | | | | |
spk1 | | | | | |
Demos on "Chinese speaker say Thai with emotion by Cross-lingual TTS and emotion transfer":
speaker | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
"angry" spk2 |
|||||
"angry" spk2 |
|||||
"fearful" spk2 |
|||||
"fearful" spk2 |
|||||
"sad" spk2 |
2.7 Demos on Unseen speakers
We use unseen speakers from AISHELL-3 (Chinese) and LibriTTS (English), the same as the test set in copy synthesis, to test the "unseen speaker with enough data in the acoustic model's training" scenario.
Speakers:
(Chinese) spk1 | (Chinese) spk2 | (English) spk3 | (English) spk4 |
---|---|---|---|
Demos:
speaker | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
spk1 | | | | | |
spk1 | | | | | |
spk1 | | | | | |
spk2 | | | | | |
spk2 | | | | | |
spk3 | | | | | |
spk3 | | | | | |
spk3 | | | | | |
spk4 | | | | | |
spk4 | | | | | |
2.8 Demos on Seen speakers
Speakers that are seen in vocoder training.
Speakers:
(Chinese) spk1 | (Chinese) spk2 | (English) spk3 | (English) spk4 |
---|---|---|---|
Demos:
speaker | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
spk1 | | | | | |
spk1 | | | | | |
spk1 | | | | | |
spk2 | | | | | |
spk2 | | | | | |
spk3 | | | | | |
spk3 | | | | | |
spk3 | | | | | |
spk4 | | | | | |
spk4 | | | | | |
2.9 Extra audio samples that do not appear in the paper
2.9.1 Unseen language -- Mongolian
A Mongolian speaker whose language is unseen in vocoder training:
Demos on a Tacotron 2 acoustic model:
HiFi-GAN | DSPGAN-mm |
---|---|
2.9.2 Audio super-resolution (16kHz to 48kHz)
We implement a 48kHz DSPGAN-mm by upsampling the 16kHz mel-spectrogram generated by the acoustic model (FastSpeech 2) to 48kHz audio. Here we show 48kHz audio demos on challenging TTS tasks such as unseen language and style transfer.
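As a rough sketch of the frame-rate arithmetic behind this 16kHz-to-48kHz setup (the hop length and the factorization of the upsampling path below are illustrative assumptions, not the values used in the paper):

```python
# A minimal sketch of the frame-rate arithmetic for 16 kHz -> 48 kHz
# super-resolution; hop_length and the per-stage factors are assumptions.

mel_sr = 16000          # sample rate the acoustic model's mel features assume
target_sr = 48000       # sample rate of the waveform the vocoder should emit
hop_length = 200        # assumed analysis hop of the 16 kHz mel-spectrogram

# Each mel frame covers hop_length / mel_sr seconds; to cover the same
# duration at 48 kHz the vocoder must emit 3x more samples per frame.
samples_per_frame = hop_length * target_sr // mel_sr   # 600

# The generator's cascaded upsampling factors must multiply to this value,
# e.g. 5 * 5 * 4 * 3 * 2 = 600 (an illustrative factorization).
factors = [5, 5, 4, 3, 2]
prod = 1
for f in factors:
    prod *= f
assert prod == samples_per_frame == 600
```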
A Mongolian speaker whose language is unseen in vocoder training:
speaker (16kHz) | DSPGAN-mm (48kHz) |
---|---|
Style transfer to a speaker with limited noisy data (10 utterances):
speaker (16kHz) | style (16kHz) | DSPGAN-mm (48kHz) |
---|---|---|
3. Demos on copy synthesis
Copy synthesis means using the mel-spectrogram extracted from the recording rather than one generated by the acoustic model.
3.1 Unseen speakers
Unseen speakers are randomly selected from AISHELL-3 and LibriTTS.
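As a minimal sketch of what copy synthesis does (the file name and the STFT/mel parameters below are illustrative assumptions), the ground-truth recording is converted to a log-mel-spectrogram, and that feature, rather than an acoustic-model prediction, is fed to the vocoder:

```python
# A minimal copy-synthesis sketch: extract a mel-spectrogram directly from a
# recording and hand it to a (hypothetical) vocoder; the path, sample rate,
# and STFT/mel parameters are illustrative assumptions.
import librosa
import numpy as np

wav, sr = librosa.load("recording.wav", sr=16000)   # hypothetical input file

mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=200, win_length=800, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log-compress

# In copy synthesis, this ground-truth-derived log-mel replaces the acoustic
# model's prediction as the vocoder input:
# wav_hat = vocoder(log_mel)   # pseudo-call; the vocoder is not defined here
```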
Recording | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|
3.2 Seen speakers
We reserve 50 utterances of 10 speakers randomly selected from the training data as the seen speakers.
Recording | multi-band MelGAN | DSPGAN-mm | HiFi-GAN | DSPGAN-hf | NHV |
---|---|---|---|---|---|