Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech

09/15/2023
by   Dariusz Piotrowski, et al.

In this work, we introduce a framework for cross-lingual speech synthesis that involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model. The proposed framework consists of four stages. In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, and the result is used to train a single-speaker acoustic model. The final stage is the training of a locale-independent vocoder. Our evaluations show that the proposed paradigm outperforms state-of-the-art approaches that are based on training a large multilingual TTS model. In addition, our experiments demonstrate the robustness of our approach across different model architectures, languages, speakers and amounts of data. Moreover, our solution is especially beneficial in low-resource settings.
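
The data flow of the first three stages can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the first two stages are divided, so they are collapsed here into a single placeholder, and every name (`Utterance`, `TrainingExample`, `convert_voice`, `build_acoustic_training_set`) is hypothetical. No real VC or TTS model is involved; the code only tracks which source supplies which part of each training example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: str   # who is heard in the audio
    locale: str    # language/locale of the recording
    text: str      # stand-in for the linguistic features


@dataclass
class TrainingExample:
    voice: str                # speaker identity of the converted audio
    locale: str               # locale of the original recording
    linguistic_features: str  # taken from the target-language recording
    durations_from: str       # speaker whose recording supplies durations


def convert_voice(utt: Utterance, target_speaker: str) -> Utterance:
    """Placeholder for stages 1-2 (assumption: a trained VC model would
    re-render the audio in the target speaker's voice, keeping the content)."""
    return Utterance(speaker=target_speaker, locale=utt.locale, text=utt.text)


def build_acoustic_training_set(
    target_locale_utts: List[Utterance], target_speaker: str
) -> List[TrainingExample]:
    """Stage 3 data preparation: pair converted audio with the linguistic
    features and durations of the original target-language recordings."""
    examples = []
    for utt in target_locale_utts:
        converted = convert_voice(utt, target_speaker)
        examples.append(
            TrainingExample(
                voice=converted.speaker,
                locale=utt.locale,
                linguistic_features=utt.text,
                durations_from=utt.speaker,
            )
        )
    return examples


# Stage 4 (training a locale-independent vocoder) is outside this sketch.
```

Every resulting example carries the target speaker's voice, which is what allows the downstream acoustic model to be single-speaker even though the source recordings come from other speakers in the target locale.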

research
10/31/2022

Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

This paper presents a method for end-to-end cross-lingual text-to-speech...
research
10/08/2020

Latent linguistic embedding for cross-lingual text-to-speech and voice conversion

As the recently proposed voice cloning system, NAUTILUS, is capable of c...
research
10/14/2021

Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data

Recently, sequence-to-sequence (seq-to-seq) models have been successfull...
research
05/22/2020

NAUTILUS: a Versatile Voice Cloning System

We introduce a novel speech synthesis system, called NAUTILUS, that can ...
research
08/03/2020

One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech

We introduce an approach to multilingual speech synthesis which uses the...
research
07/04/2022

GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion

In this paper, we propose GlowVC: a multilingual multi-speaker flow-base...
research
10/29/2019

A novel cross-lingual voice cloning approach with a few text-free samples

In this paper, we present a cross-lingual voice cloning approach. BN fea...
