ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech

11/07/2022
by Xiaoran Fan, et al.

Speech representation learning has improved both speech understanding and speech synthesis for a single language. However, its effectiveness in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method to cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework in which, given a speech example and its transcription, we randomly mask the spectrogram and the phonemes. The model learns to reconstruct the masked parts of the input in different languages. Moreover, our framework is end-to-end for both training and inference and requires no finetuning. In cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing tasks, our experiments show that our model outperforms speaker-embedding-based multi-speaker TTS methods.
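The masking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mask token, the masking ratios, the choice of a single contiguous frame span, and the mean-absolute-error loss are all assumptions made here for illustration, and the function names are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical placeholder symbol, not from the paper


def mask_phonemes(phonemes, ratio, rng):
    """Replace a random subset of phonemes with MASK_TOKEN.

    Assumes a non-empty sequence and 0 < ratio <= 1.
    """
    n_mask = min(len(phonemes), max(1, round(len(phonemes) * ratio)))
    positions = set(rng.sample(range(len(phonemes)), n_mask))
    masked = [MASK_TOKEN if i in positions else p
              for i, p in enumerate(phonemes)]
    return masked, sorted(positions)


def mask_spectrogram(frames, ratio, rng):
    """Zero out one contiguous span of spectrogram frames (assumed strategy)."""
    n_frames = len(frames)
    span_len = min(n_frames, max(1, round(n_frames * ratio)))
    start = rng.randrange(0, n_frames - span_len + 1)
    end = start + span_len
    masked = [[0.0] * len(f) if start <= i < end else list(f)
              for i, f in enumerate(frames)]
    return masked, (start, end)


def reconstruction_loss(predicted, target, span):
    """Mean absolute error computed over the masked frames only."""
    start, end = span
    total, count = 0.0, 0
    for i in range(start, end):
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count


# Toy input: 8 phonemes and a 10-frame, 4-bin "spectrogram".
rng = random.Random(0)
phonemes = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
frames = [[float(i + j) for j in range(4)] for i in range(10)]

masked_phonemes, phoneme_positions = mask_phonemes(phonemes, 0.25, rng)
masked_frames, frame_span = mask_spectrogram(frames, 0.3, rng)

# A real model would predict the masked frames from the unmasked context;
# here the zeroed input stands in for an untrained prediction.
loss = reconstruction_loss(masked_frames, frames, frame_span)
```

Restricting the loss to the masked span reflects the stated objective of reconstructing the masked parts of the input; how the actual model weights or combines the speech and text reconstruction terms is not specified in the abstract.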

research
04/25/2023

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

In this paper, we describe the systems developed by the SJTU X-LANCE tea...
research
06/24/2022

SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech

In this paper, we present SANE-TTS, a stable and natural end-to-end mult...
research
09/02/2023

DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech – A Study between English and Mandarin

While the performance of cross-lingual TTS based on monolingual corpora ...
research
03/31/2022

Data-augmented cross-lingual synthesis in a teacher-student framework

Cross-lingual synthesis can be defined as the task of letting a speaker ...
research
03/18/2022

A^3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

Recently, speech representation learning has improved many speech-relate...
research
10/11/2022

Cross-Lingual Speaker Identification Using Distant Supervision

Speaker identification, determining which character said each utterance ...
research
01/20/2022

Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training

In cross-lingual speech synthesis, the speech in various languages can b...
