DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

Kun Song, Yongmao Zhang, Yi Lei, Jian Cong, Hanzhao Li, Lei Xie, Gang He, Jinfeng Bai
Northwestern Polytechnical University, Xi'an, China
TAL Education Group, Beijing, China

0. Contents

  1. Abstract

  2. Demos on TTS

    1. Conversational TTS (Chinese)
    2. Low-quality speech few-shot (Chinese)
    3. Unseen styles (Chinese)
    4. Emotional TTS (Chinese)
    5. Cross-lingual TTS (English)
    6. Unseen language (Thai)
    7. Unseen speakers (Chinese and English)
    8. Seen speakers (Chinese and English)
    9. Extra audio samples that do not appear in the paper
  3. Demos on copy synthesis

    1. Unseen speakers
    2. Seen speakers


1. Abstract

Recent neural vocoders based on generative adversarial networks (GANs) have shown advantages in generating raw waveforms conditioned on mel-spectrograms, with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech in various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in the training phase and the predicted spectrograms used in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the mel-spectrogram predicted by the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision for the GAN-based vocoder. We also utilize sine excitation as time-domain supervision to improve harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experimental results show that DSPGAN significantly outperforms the compared approaches and can generate high-fidelity speech for diverse TTS data.
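To make the two supervision signals concrete, here is a minimal Python sketch of the data flow described above. This is not the authors' implementation: `dsp_vocoder` (the DSP module), the sampling rate, hop length, and mel settings are placeholder assumptions.

```python
import numpy as np
import librosa


def sine_excitation(f0, sr=16000, hop_length=200):
    """Build a sine excitation from frame-level F0 (the time-domain supervision)."""
    f0_up = np.repeat(np.asarray(f0, dtype=float), hop_length)  # frame-level F0 -> sample level
    phase = 2 * np.pi * np.cumsum(f0_up / sr)                   # integrate instantaneous frequency
    excitation = np.sin(phase)
    excitation[f0_up == 0] = 0.0                                 # unvoiced frames carry no sine
    return excitation


def dspgan_supervision(pred_mel, f0, dsp_vocoder, sr=16000, hop_length=200, n_mels=80):
    """Prepare the two supervision signals for the GAN generator.

    The mel fed to the GAN is re-extracted from the DSP module's output waveform
    rather than taken from the acoustic model, which removes the train/inference
    mel mismatch (the time-frequency domain supervision).
    """
    dsp_wav = dsp_vocoder(pred_mel, f0)                          # waveform from the DSP module
    dsp_mel = librosa.feature.melspectrogram(
        y=dsp_wav, sr=sr, hop_length=hop_length, n_mels=n_mels)
    excitation = sine_excitation(f0, sr, hop_length)             # time-domain supervision
    return dsp_mel, excitation
```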




2. Demos on TTS

We train a multi-speaker, multi-language vocoder as a universal vocoder (without fine-tuning) on 295 hours of audio from 308 speakers covering Chinese and English. Below are demos that use this universal vocoder for various TTS tasks.

2.1 Conversational TTS

The expressive conversational speech synthesis task is evaluated on a conversational dataset, 10 hours in total, containing one male speaker and one female speaker.

Speakers:

male female

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

male

male

male

male

male

female

female

female

female

female

2.2 Low-quality speech few-shot

Speakers:

spk1

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

2.3 Unseen styles

The stylistic speech synthesis task is evaluated on a single-speaker multi-style dataset, 30 minutes in total, with 5 styles, i.e., poetry, fairy tale, joke, story, and thriller.

Styles:

poetry fairy tale joke story thriller

Demos:

style multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

sty1

sty1

sty2

sty2

sty3

sty3

sty4

sty4

sty5

sty5

2.4 Emotional TTS

The emotional speech synthesis task is evaluated on a single-speaker multi-emotion dataset, 12 hours in total, with 6 emotions, i.e., sad, angry, happy, disgusted, fearful, and surprised. We also apply controllable emotional intensity in the acoustic model.

Emotions:

happy angry sad surprise fearful disgusted

Demos:

emotion multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

emo1

emo2

emo3

emo3

emo4

emo4

emo5

emo5

emo6

emo6

2.5 Cross-lingual TTS

The cross-lingual TTS task uses text in a language foreign to the speaker to synthesize speech in that language. We pair "Chinese/English" text with a "Chinese/English" speaker ID and the "English/Chinese" language label to implement cross-lingual TTS. We show "Chinese speaker speaking English" demos here.

Chinese speakers' Chinese audio:

spk1 spk2

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

spk1

spk1

spk1

spk2

spk2

spk2

spk2

spk2

2.6 Unseen language

Thai is a language unseen during vocoder training, so we use Thai speakers for testing. As a further challenge, we also show demos of a "Chinese speaker speaking Thai with emotion" via cross-lingual TTS and emotion transfer.

Speakers (spk1 is a Thai speaker and spk2 is a Chinese speaker):

spk1 spk2

Demos on "Thai speaker say Thai":

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

spk1

spk1

spk1

Demos on "Chinese speaker say Thai with emotion by Cross-lingual TTS and emotion transfer":

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

"angry" spk2

"angry" spk2

"fearful" spk2

"fearful" spk2

"sad" spk2

2.7 Unseen speakers

We use unseen speakers from AISHELL-3 (Chinese) and LibriTTS (English), the same as the test set used in copy synthesis, to test the case of "unseen speakers with enough data in the acoustic model's training".

Speakers:

(Chinese) spk1 (Chinese) spk2 (English) spk3 (English) spk4

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

spk1

spk2

spk2

spk3

spk3

spk3

spk4

spk4

2.8 Seen speakers

These speakers are seen during vocoder training.

Speakers:

(Chinese) spk1 (Chinese) spk2 (English) spk3 (English) spk4

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

spk1

spk2

spk2

spk3

spk3

spk3

spk4

spk4

2.9 Extra audio samples that do not appear in the paper

2.9.1 Unseen language -- Mongolian

A Mongolian speaker, whose language is unseen during vocoder training:

Demos with a Tacotron 2 acoustic model:

HiFi-GAN DSPGAN-mm

2.9.2 Audio super-resolution (16kHz to 48kHz)

We implement a 48kHz DSPGAN-mm that upsamples the 16kHz mel-spectrogram generated by the acoustic model (FastSpeech 2) into 48kHz audio. Here we show 48kHz audio demos on challenging TTS tasks such as unseen language and style transfer.
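As a rough illustration of the sampling-rate arithmetic this involves, the snippet below assumes a hop length of 200 samples at 16kHz (a common but here hypothetical value) and works out how many 48kHz samples each mel frame must be expanded into.

```python
# Rough sampling-rate arithmetic for 16 kHz mel -> 48 kHz audio generation.
# The hop length of 200 samples is an assumed value, not taken from the paper.
sr_mel, sr_out = 16_000, 48_000
hop_16k = 200                                          # assumed hop length at 16 kHz
frame_period = hop_16k / sr_mel                        # 0.0125 s of audio per mel frame

# Keeping the same mel frame rate, each frame must now cover 3x more samples.
samples_per_frame_48k = int(frame_period * sr_out)     # 600 samples per frame at 48 kHz
rate_ratio = sr_out // sr_mel                          # 3
print(samples_per_frame_48k, rate_ratio)               # -> 600 3
```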

A Mongolian speaker, whose language is unseen during vocoder training:

speaker (16kHz) DSPGAN-mm (48kHz)

Style transfer to a speaker with limited noisy data (10 utterances):

speaker (16kHz) style (16kHz) DSPGAN-mm (48kHz)

3. Demos on copy synthesis

Copy synthesis means vocoding mel-spectrograms extracted directly from the recordings, rather than mel-spectrograms generated by the acoustic model.
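For reference, a minimal copy-synthesis sketch might look like the following; the feature settings and the `vocoder` callable are illustrative assumptions, not the exact configuration used in the paper.

```python
import librosa


def copy_synthesis(wav_path, vocoder, sr=16000, n_fft=1024, hop_length=200, n_mels=80):
    """Vocode a mel-spectrogram extracted directly from the recording."""
    wav, _ = librosa.load(wav_path, sr=sr)                        # load and resample
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                            # log-compress the mel
    return vocoder(log_mel)                                       # re-synthesize the waveform
```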

3.1 Unseen speakers

Unseen speakers are randomly selected from AISHELL-3 and LibriTTS.

Recording multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

3.2 Seen speakers

We reserve 50 utterances from 10 speakers randomly selected from the training data as the seen speakers.

Recording multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV