Improving Cross-lingual Speech Synthesis with Triplet Training Scheme

02/22/2022
by   Jianhao Ye, et al.
0

Recent advances in cross-lingual text-to-speech (TTS) made it possible to synthesize speech in a language foreign to a monolingual speaker. However, there is still a large gap between the pronunciation of generated cross-lingual speech and that of native speakers in terms of naturalness and intelligibility. In this paper, a triplet training scheme is proposed to enhance the cross-lingual pronunciation by allowing previously unseen content and speaker combinations to be seen during training. Proposed method introduces an extra fine-tune stage with triplet loss during training, which efficiently draws the pronunciation of the synthesized foreign speech closer to those from the native anchor speaker, while preserving the non-native speaker's timbre. Experiments are conducted based on a state-of-the-art baseline cross-lingual TTS system and its enhanced variants. All the objective and subjective evaluations show the proposed method brings significant improvement in both intelligibility and naturalness of the synthesized cross-lingual speech.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
01/20/2022

Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training

In cross-lingual speech synthesis, the speech in various languages can b...
research
10/14/2021

Revisiting IPA-based Cross-lingual Text-to-speech

International Phonetic Alphabet (IPA) has been widely used in cross-ling...
research
04/25/2023

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

In this paper, we describe the systems developed by the SJTU X-LANCE tea...
research
10/14/2021

Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech

In this paper, we present a FastPitch-based non-autoregressive cross-lin...
research
10/18/2021

Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information

This paper contains a post-challenge performance analysis on cross-lingu...
research
05/12/2020

AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN

This paper investigates how to leverage a DurIAN-based average model to ...
research
11/17/2021

Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

The idea of using phonological features instead of phonemes as input to ...

Please sign up or login with your details

Forgot password? Click here to reset