Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

10/27/2022
by Takaaki Saeki, et al.

This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS systems typically support tens of languages, a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained with Virtuoso achieve significantly better naturalness and intelligibility than baseline models in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages where no high-quality paired TTS data is available.
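
The four data types named in the abstract can be illustrated with a small routing sketch. This is not the authors' implementation: the `Example` record, the `studio_quality` flag, and the `classify` helper are hypothetical names introduced here only to show how a mixed corpus might be partitioned before each partition is handed to its own training scheme.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class DataKind(Enum):
    """The four data types listed in the abstract."""
    PAIRED_TTS = "paired_tts"                      # supervised
    PAIRED_ASR = "paired_asr"                      # supervised
    UNTRANSCRIBED_SPEECH = "untranscribed_speech"  # unsupervised
    UNSPOKEN_TEXT = "unspoken_text"                # unsupervised


@dataclass
class Example:
    # Hypothetical record layout, assumed for illustration only.
    speech: Optional[Sequence[float]] = None  # acoustic features, if any
    text: Optional[str] = None                # transcript or raw text, if any
    studio_quality: bool = False              # assumed marker for TTS-grade audio


def classify(example: Example) -> DataKind:
    """Assign an example to one of the four data types."""
    has_speech = example.speech is not None
    has_text = example.text is not None
    if has_speech and has_text:
        # Paired data: split by the assumed recording-quality flag.
        return DataKind.PAIRED_TTS if example.studio_quality else DataKind.PAIRED_ASR
    if has_speech:
        return DataKind.UNTRANSCRIBED_SPEECH
    if has_text:
        return DataKind.UNSPOKEN_TEXT
    raise ValueError("example has neither speech nor text")


def is_supervised(kind: DataKind) -> bool:
    """Paired TTS and ASR data are the supervised types."""
    return kind in (DataKind.PAIRED_TTS, DataKind.PAIRED_ASR)
```

A caller might group examples by `classify(example)` and dispatch each group to a different loss; which losses those are is specified in the full paper, not in this abstract.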
