MultiSpeech: Multi-Speaker Text to Speech with Transformer

06/08/2020
by Mingjian Chen, et al.

Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS <cit.>, FastSpeech <cit.>) have shown advantages in training and inference efficiency over RNN-based models (e.g., Tacotron <cit.>) due to parallel computation during training and/or inference. However, this parallel computation makes it harder for the Transformer to learn the alignment between text and speech; the difficulty is further magnified in the multi-speaker scenario, with noisy data and diverse speakers, and hinders the applicability of the Transformer to multi-speaker TTS. In this paper, we develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment: 1) a diagonal constraint on the weight matrix of encoder-decoder attention in both training and inference; 2) layer normalization on phoneme embeddings in the encoder to better preserve position information; 3) a bottleneck in the decoder pre-net to prevent copying between consecutive speech frames. Experiments on the VCTK and LibriTTS multi-speaker datasets demonstrate the effectiveness of MultiSpeech: 1) it synthesizes more robust and better-quality multi-speaker voice than a naive Transformer-based TTS; 2) with a MultiSpeech model as the teacher, we obtain a strong multi-speaker FastSpeech model with almost zero quality degradation while enjoying extremely fast inference speed.
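To illustrate the first technique, here is a minimal sketch of one common way to encourage diagonal encoder-decoder attention: a guided-attention-style penalty that grows with distance from the text-speech diagonal. The function name `diagonal_penalty` and the Gaussian-shaped weighting are illustrative assumptions; the paper's exact formulation of the diagonal constraint may differ.

```python
import numpy as np

def diagonal_penalty(attn, g=0.2):
    """Guided-attention-style diagonal penalty (illustrative sketch).

    attn: (T_dec, T_enc) encoder-decoder attention weights for one head.
    Returns a scalar that is small when attention mass lies near the
    text-speech diagonal and large when it strays from it.
    """
    T_dec, T_enc = attn.shape
    n = np.arange(T_dec)[:, None] / T_dec   # normalized decoder positions
    t = np.arange(T_enc)[None, :] / T_enc   # normalized encoder positions
    # Weight is 0 on the diagonal (n == t) and approaches 1 far from it.
    w = 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))
    return float((attn * w).mean())

# A perfectly diagonal attention matrix incurs less penalty than a
# uniform one, so minimizing this term pushes attention to monotonic
# near-diagonal alignments.
diag_attn = np.eye(8)                  # attention exactly on the diagonal
flat_attn = np.full((8, 8), 1.0 / 8)   # attention spread uniformly
assert diagonal_penalty(diag_attn) < diagonal_penalty(flat_attn)
```

In training, such a term would be added to the reconstruction loss; at inference, a related diagonal mask can restrict which encoder positions each decoder step may attend to.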
