DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

Kun Song, Yongmao Zhang, Yi Lei, Jian Cong, Hanzhao Li, Lei Xie, Gang He, Jinfeng Bai
Northwestern Polytechnical University, Xi'an, China
TAL Education Group, Beijing, China

0. Contents

  1. Abstract

  2. Demos on TTS

    1. Conversational TTS (Chinese)
    2. Low-quality speech few-shot (Chinese)
    3. Unseen styles (Chinese)
    4. Emotional TTS (Chinese)
    5. Cross-lingual TTS (English)
    6. Unseen language (Thai)
    7. Unseen speakers (Chinese and English)
    8. Seen speakers (Chinese and English)
    9. Extra audio samples that do not appear in the paper
  3. Demos on copy synthesis

    1. Unseen speakers
    2. Seen speakers


1. Abstract

Recent neural vocoders based on generative adversarial networks (GANs) have shown advantages in generating raw waveforms conditioned on mel-spectrograms, with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech in various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in the training phase and the predicted spectrograms used in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the mel-spectrogram predicted by the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision for the GAN-based vocoder. We also utilize sine excitation as time-domain supervision to improve harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experimental results show that DSPGAN significantly outperforms the compared approaches and can generate high-fidelity speech for diverse TTS data.
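To make the two supervision signals concrete, here is a minimal Python sketch of the data flow described above. This is not the authors' implementation: `dsp_vocoder` (the DSP module), the sampling rate, hop length, and mel settings are placeholder assumptions.

```python
import numpy as np
import librosa


def sine_excitation(f0, sr=16000, hop_length=200):
    """Build a sine excitation from frame-level F0 (the time-domain supervision)."""
    f0_up = np.repeat(np.asarray(f0, dtype=float), hop_length)  # frame-level F0 -> sample level
    phase = 2 * np.pi * np.cumsum(f0_up / sr)                   # integrate instantaneous frequency
    excitation = np.sin(phase)
    excitation[f0_up == 0] = 0.0                                 # unvoiced frames carry no sine
    return excitation


def dspgan_supervision(pred_mel, f0, dsp_vocoder, sr=16000, hop_length=200, n_mels=80):
    """Prepare the two supervision signals for the GAN generator.

    The mel fed to the GAN is re-extracted from the DSP module's output waveform
    rather than taken from the acoustic model, which removes the train/inference
    mel mismatch (the time-frequency domain supervision).
    """
    dsp_wav = dsp_vocoder(pred_mel, f0)                          # waveform from the DSP module
    dsp_mel = librosa.feature.melspectrogram(
        y=dsp_wav, sr=sr, hop_length=hop_length, n_mels=n_mels)
    excitation = sine_excitation(f0, sr, hop_length)             # time-domain supervision
    return dsp_mel, excitation
```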




2. Demos on TTS

We train a multi-speaker, multi-language vocoder as a universal vocoder (without fine-tuning) on 295 hours of audio from 308 speakers covering Chinese and English. Below are demos that use this universal vocoder for various TTS tasks.

2.1 Conversational TTS

The expressive conversational speech synthesis task is evaluated on a conversational dataset, 10 hours in total, containing one male speaker and one female speaker.

Speakers:

male female

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

male

male

male

male

male

female

female

female

female

female

2.2 Low-quality speech few-shot

Speakers:

spk1

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

2.3 Unseen styles

The stylistic speech synthesis task is evaluated on a single-speaker multi-style dataset, 30 minutes in total, with 5 styles, i.e., poetry, fairy tale, joke, story, and thriller.

Styles:

poetry fairy tale joke story thriller

Demos:

style multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

sty1

sty1

sty2

sty2

sty3

sty3

sty4

sty4

sty5

sty5

2.4 Emotional TTS

The emotional speech synthesis task is evaluated on a single-speaker multi-emotion dataset, 12 hours in total, with 6 emotions, i.e., sad, angry, happy, disgusted, fearful, and surprised. We also apply controllable emotional intensity in the acoustic model.

Emotions:

happy angry sad surprise fearful disgusted

Demos:

emotion multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

emo1

emo2

emo3

emo3

emo4

emo4

emo5

emo5

emo6

emo6

2.5 Cross-lingual TTS

The cross-lingual TTS task uses text in a language foreign to the speaker to synthesize speech in that language. We pair "Chinese/English" text with a "Chinese/English" speaker ID and the "English/Chinese" language label to implement cross-lingual TTS. We show "Chinese speaker speaking English" demos here.

Chinese speakers' Chinese audio:

spk1 spk2

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

spk1

spk1

spk1

spk2

spk2

spk2

spk2

spk2

2.6 Unseen language

Thai is a language unseen during vocoder training, so we use Thai speakers for testing. As a further challenge, we also show demos of a "Chinese speaker speaking Thai with emotion" via cross-lingual TTS and emotion transfer.

Speakers (spk1 is a Thai speaker and spk2 is a Chinese speaker):

spk1 spk2

Demos on "Thai speaker say Thai":

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

spk1

spk1

spk1

Demos on "Chinese speaker say Thai with emotion by Cross-lingual TTS and emotion transfer":

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

"angry" spk2

"angry" spk2

"fearful" spk2

"fearful" spk2

"sad" spk2

2.7 Unseen speakers

We use unseen speakers from AISHELL-3 (Chinese) and LibriTTS (English), the same as the test set used in copy synthesis, to test the case of "unseen speakers with enough data in the acoustic model's training".

Speakers:

(Chinese) spk1 (Chinese) spk2 (English) spk3 (English) spk4

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

spk1

spk2

spk2

spk3

spk3

spk3

spk4

spk4

2.8 Seen speakers

These speakers are seen during vocoder training.

Speakers:

(Chinese) spk1 (Chinese) spk2 (English) spk3 (English) spk4

Demos:

speaker multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

spk1

spk1

spk1

spk2

spk2

spk3

spk3

spk3

spk4

spk4

2.9 Extra audio samples that do not appear in the paper

2.9.1 Unseen language -- Mongolian

A Mongolian speaker, whose language is unseen during vocoder training:

Demos with a Tacotron 2 acoustic model:

HiFi-GAN DSPGAN-mm

2.9.2 Audio super-resolution (16kHz to 48kHz)

We implement a 48kHz DSPGAN-mm that upsamples the 16kHz mel-spectrogram generated by the acoustic model (FastSpeech 2) into 48kHz audio. Here we show 48kHz audio demos on challenging TTS tasks such as unseen language and style transfer.
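As a rough illustration of the sampling-rate arithmetic this involves, the snippet below assumes a hop length of 200 samples at 16kHz (a common but here hypothetical value) and works out how many 48kHz samples each mel frame must be expanded into.

```python
# Rough sampling-rate arithmetic for 16 kHz mel -> 48 kHz audio generation.
# The hop length of 200 samples is an assumed value, not taken from the paper.
sr_mel, sr_out = 16_000, 48_000
hop_16k = 200                                          # assumed hop length at 16 kHz
frame_period = hop_16k / sr_mel                        # 0.0125 s of audio per mel frame

# Keeping the same mel frame rate, each frame must now cover 3x more samples.
samples_per_frame_48k = int(frame_period * sr_out)     # 600 samples per frame at 48 kHz
rate_ratio = sr_out // sr_mel                          # 3
print(samples_per_frame_48k, rate_ratio)               # -> 600 3
```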

A Mongolian speaker, whose language is unseen during vocoder training:

speaker (16kHz) DSPGAN-mm (48kHz)

Style transfer to a speaker with limited noisy data (10 utterances):

speaker (16kHz) style (16kHz) DSPGAN-mm (48kHz)

3. Demos on copy synthesis

Copy synthesis means vocoding mel-spectrograms extracted directly from the recordings, rather than mel-spectrograms generated by the acoustic model.
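For reference, a minimal copy-synthesis sketch might look like the following; the feature settings and the `vocoder` callable are illustrative assumptions, not the exact configuration used in the paper.

```python
import librosa


def copy_synthesis(wav_path, vocoder, sr=16000, n_fft=1024, hop_length=200, n_mels=80):
    """Vocode a mel-spectrogram extracted directly from the recording."""
    wav, _ = librosa.load(wav_path, sr=sr)                        # load and resample
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                            # log-compress the mel
    return vocoder(log_mel)                                       # re-synthesize the waveform
```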

3.1 Unseen speakers

Unseen speakers are randomly selected from AISHELL-3 and LibriTTS.

Recording multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV

3.2 Seen speakers

We reserve 50 utterances from 10 speakers randomly selected from the training data as the seen speakers.

Recording multi-band MelGAN DSPGAN-mm HiFi-GAN DSPGAN-hf NHV