SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech

06/24/2022
by   Hyunjae Cho, et al.
0

In this paper, we present SANE-TTS, a stable and natural end-to-end multilingual TTS model. By the difficulty of obtaining multilingual corpus for given speaker, training multilingual TTS model with monolingual corpora is unavoidable. We introduce speaker regularization loss that improves speech naturalness during cross-lingual synthesis as well as domain adversarial training, which is applied in other multilingual TTS models. Furthermore, by adding speaker regularization loss, replacing speaker embedding with zero vector in duration predictor stabilizes cross-lingual inference. With this replacement, our model generates speeches with moderate rhythm regardless of source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS achieves naturalness score above 3.80 both in cross-lingual and intralingual synthesis, where the ground truth score is 3.99. Also, SANE-TTS maintains speaker similarity close to that of ground truth even in cross-lingual inference. Audio samples are available on our web page.

READ FULL TEXT

page 3

page 4

research
11/07/2022

ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech

Speech representation learning has improved both speech understanding an...
research
06/25/2023

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

Although high-fidelity speech can be obtained for intralingual speech sy...
research
02/28/2023

CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis

While recent text-to-speech (TTS) systems have made remarkable strides t...
research
01/20/2022

Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training

In cross-lingual speech synthesis, the speech in various languages can b...
research
07/15/2020

Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization

In this paper we describe the top-scoring IDLab submission for the text-...
research
03/31/2022

Data-augmented cross-lingual synthesis in a teacher-student framework

Cross-lingual synthesis can be defined as the task of letting a speaker ...
research
08/03/2020

One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech

We introduce an approach to multilingual speech synthesis which uses the...

Please sign up or login with your details

Forgot password? Click here to reset