One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech

08/03/2020 ∙ by Tomáš Nekvinda, et al. ∙ Charles University in Prague

We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches. Our model is based on Tacotron 2 with a fully convolutional input text encoder whose weights are predicted by a separate parameter generator network. To boost voice cloning, the model uses an adversarial speaker classifier with a gradient reversal layer that removes speaker-specific information from the encoder. We arranged two experiments to compare our model with baselines using various levels of cross-lingual parameter sharing, in order to evaluate: (1) stability and performance when training on low amounts of data, (2) pronunciation accuracy and voice quality of code-switching synthesis. For training, we used the CSS10 dataset and our new small dataset based on Common Voice recordings in five languages. Our model is shown to effectively share information across languages and according to a subjective evaluation test, it produces more natural and accurate code-switching speech than the baselines.




1 Introduction

Contemporary end-to-end speech synthesis systems achieve great results and produce natural-sounding, human-like speech [shen18, oord16], even in real time [kalchbrenner18, kumar19]. They enable efficient training that does not place high demands on the quality, amount, and preprocessing of training data. Building on these advances, researchers aim at, for example, expressiveness [wang18], controllability [hsu19], or few-shot voice cloning [jia2018]. Extending these models to support multiple languages raises obstacles such as different input representations or pronunciations and imbalanced amounts of training data per language.

In this work, we examine cross-lingual knowledge-sharing aspects of multilingual text-to-speech (TTS). We experiment with more languages simultaneously than most previous TTS work known to us. We can summarize our contributions as follows: (1) We propose a scalable grapheme-based model that utilizes the idea of a contextual parameter generator network [platanios18] and we compare it with baseline models using different levels of parameter sharing. (2) We introduce a new small dataset based on Common Voice [ardila19] that includes data in five languages from 84 speakers. (3) We evaluate the effectiveness of the compared models on ten languages with three different scripts and we show their code-switching abilities on five languages. For the purposes of the evaluation, we created a new test set of 400 bilingual code-switching sentences.

Figure 1: Diagram of our model. The meta-network generates parameters of language-specific convolutional text encoders. Encoded text inputs enhanced with speaker embeddings are read by the decoder. The adversarial classifier suppresses speaker-dependent information in encoder outputs.

Our source code, hyper-parameters, training and evaluation data, samples, pre-trained models, and interactive demos are freely available on GitHub.

2 Related Work

So far, several works have explored training joint multilingual models in text-to-speech, following similar experiments in the field of neural machine translation [sachan18, platanios18]. Multilingual models offer a few key benefits:

  • Transfer learning: We can try to make use of high-resource languages for training TTS systems for low-resource languages, e.g., via transfer learning approaches [jui19, lee18].

  • Knowledge sharing: We may think of using multilingual data for joint training of a single shared text-to-speech model. Intuitively, this enables cross-lingual sharing of patterns learned from data. To our knowledge, the only work in this area is the study of [prakash2019] on TTS for related Indian languages using hand-built unified phoneme representations.

  • Voice cloning: Under certain circumstances, producing speech in multiple languages with the same voice, i.e., cross-lingual voice cloning, is desired. However, audio data where a single speaker speaks several languages is scarce. That is why multilingual voice-cloning systems should be trainable using mixtures of monolingual data. Here, [zhang19] used Tacotron 2 [shen18] conditioned on phonemes and showed voice-cloning abilities on English, Spanish, and Chinese. [nachmani19] extended Voice Loop [taigman18] and enabled voice conversion for English, Spanish, and German. [Chen2019] used a phoneme-based Tacotron 2 with a ResCNN-based speaker encoder [li2017deep] that enables massively multi-speaker speech synthesis, even with fictitious voices.

  • Code switching: In this task, which is closely related to cross-lingual voice cloning, we would like to alternate languages within sentences. This is useful for foreign names in navigation systems or news readers. In view of that, [cao2019] modified Tacotron; their model uses language-specific encoders. Code-switching itself is done by combining their outputs.

Overall, all recent multilingual text-to-speech systems were only tested in 2-3 languages simultaneously, or required vast amounts of data to be trained.

3 Model Architecture

We base our experiments on Tacotron 2 [shen18]. We focus on the spectrogram generation part here; for vocoding, we use WaveRNN [kalchbrenner18, fatchord19] in all our configurations. We first explain our new model that uses meta-learning for multilingual knowledge sharing in Sec. 3.1, then describe contrastive baseline models which are based on recent multilingual TTS architectures (Sec. 3.2).

3.1 Our Model: Generated (Gen)

We introduce a scalable multilingual text-to-speech model that follows the meta-learning approach of contextual parameter generation proposed by [platanios18] for NMT (see Fig. 1). We call the model generated (Gen) further in this text.

The backbone of our model is built on our own implementation of Tacotron 2, composed of these main components: (1) an input text encoder that includes a stack of convolutional layers and a bidirectional LSTM, (2) a location-sensitive attention mechanism [shen18] with the guided attention loss term [tachibana17] that supports faster convergence, (3) a decoder with two stacked LSTM layers where the first queries the attention mechanism and the second generates outputs. We increase tolerance of the guided attention loss exponentially during training.
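The guided attention term can be illustrated with a short sketch. It assumes the penalty matrix defined in the DCTTS paper [tachibana17], W[n][t] = 1 − exp(−(n/N − t/T)² / (2g²)), where g is the tolerance. The function names, the starting tolerance `g0`, and the growth rate in `tolerance_at` are hypothetical and only show what an exponentially growing tolerance could look like; they are not the values used in the experiments.

```python
import math


def guided_attention_weights(n_text, n_frames, g):
    """Penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2))."""
    return [
        [1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * g * g))
         for t in range(n_frames)]
        for n in range(n_text)
    ]


def guided_attention_loss(attention, g):
    """Mean of the attention weights multiplied element-wise by the penalty."""
    n_text, n_frames = len(attention), len(attention[0])
    w = guided_attention_weights(n_text, n_frames, g)
    total = sum(attention[n][t] * w[n][t]
                for n in range(n_text) for t in range(n_frames))
    return total / (n_text * n_frames)


def tolerance_at(step, g0=0.2, growth=1.00005):
    """Hypothetical schedule: the tolerance g grows exponentially with the step."""
    return g0 * growth ** step
```

A perfectly diagonal alignment receives no penalty, and a larger tolerance weakens the penalty for off-diagonal attention, which is why loosening it over time lets the model deviate from a strictly linear alignment once attention has formed.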

We propose the following changes to this basic architecture:

Convolutional Encoders:

We use multiple language-specific input text encoders. However, having a separate encoder with recurrent layers for each language is not practical as it involves passing the training batches (which should be balanced with respect to languages) through multiple encoders sequentially. Therefore, we use a fully convolutional encoder from DCTTS [tachibana17]. The encoders use grouped layers and are thus processed efficiently. We enhance the encoders with batch normalization and dropout with a very low rate. The normalization layers are placed before the activations and the dropout layers after them.

Encoder parameter generation:

To enable cross-lingual knowledge-sharing, parameters of the encoders are generated using a separate network conditioned on language embeddings. The parameter generator is composed of multiple site-specific generators, each of which takes a language embedding on the input and produces parameters for one layer of the convolutional encoder for the given language. The generators enable a controllable cross-lingual parameter sharing because reduction of their size prevents generation of highly language-specific parameters. We implement them as fully connected layers.
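A minimal sketch of one site-specific generator follows. It assumes a linear bottleneck followed by a linear projection to the flattened weights of a single convolutional layer; the class name, the initialization, and the reshape helper are hypothetical and written in plain Python for readability, not taken from the released code.

```python
import random


class LayerParameterGenerator:
    """Maps a language embedding to the flat weights of one conv layer.

    Assumption: two fully connected maps with a narrow bottleneck in between;
    the bottleneck width limits how language-specific the weights can become.
    """

    def __init__(self, embed_dim, bottleneck_dim, n_params, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.down = init(bottleneck_dim, embed_dim)  # embedding -> bottleneck
        self.up = init(n_params, bottleneck_dim)     # bottleneck -> weights

    @staticmethod
    def _matvec(matrix, vector):
        return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

    def __call__(self, language_embedding):
        hidden = self._matvec(self.down, language_embedding)
        return self._matvec(self.up, hidden)


def reshape_conv_weights(flat, out_channels, in_channels, kernel_size):
    """Reshape a flat parameter vector to nested lists [out][in][kernel]."""
    assert len(flat) == out_channels * in_channels * kernel_size
    it = iter(flat)
    return [[[next(it) for _ in range(kernel_size)]
             for _ in range(in_channels)]
            for _ in range(out_channels)]
```

Because every language passes through the same generator matrices, a smaller `bottleneck_dim` forces the generated encoders to differ only along a few shared directions, which is one way to read the controllable sharing described above.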

Training with multilingual batches:

We construct unusual training batches to fully utilize the potential of this architecture. We would like to have a batch of B examples that can be reshaped into a batch of size B/L, where L is the number of encoder groups or languages. This new batch should have a new dimension that groups all examples with the same language. Thus we use a batch sampler that creates batches where for each l < L and i < B/L, all (l + iL)-th examples are of the same language.
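One way to realize such a sampler is sketched below. It assumes B is the batch size, L the number of languages, and that position l + i·L of every batch holds an example of language l. The function name and the choice to stop when the smallest language is exhausted (no oversampling) are simplifications made for this illustration.

```python
import random


def language_grouped_batches(examples_by_language, batch_size, seed=0):
    """Yield batches in which position l + i*L holds an example of language l.

    examples_by_language: dict mapping language -> list of examples.
    batch_size must be divisible by the number of languages L.
    """
    languages = sorted(examples_by_language)
    n_langs = len(languages)
    if batch_size % n_langs != 0:
        raise ValueError("batch_size must be a multiple of the language count")
    per_lang = batch_size // n_langs

    rng = random.Random(seed)
    pools = {}
    for lang in languages:
        pool = list(examples_by_language[lang])
        rng.shuffle(pool)
        pools[lang] = pool

    # Stop when the smallest language runs out (no oversampling in this sketch).
    n_batches = min(len(pool) for pool in pools.values()) // per_lang
    for b in range(n_batches):
        batch = [None] * batch_size
        for l, lang in enumerate(languages):
            chunk = pools[lang][b * per_lang:(b + 1) * per_lang]
            for i, example in enumerate(chunk):
                batch[l + i * n_langs] = example
        yield batch
```

With this layout, a row-major reshape of the batch to (B/L, L, ...) places all examples of one language along a single index of the new dimension, so each language-specific encoder group can process its slice in parallel.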

Speaker embedding:

We extend the model with a speaker embedding which is concatenated with each element of the encoded sequence that is attended by the decoder while generating spectrogram frames. This makes the model multi-speaker and allows cross-lingual voice cloning.

Adversarial speaker classifier:

We combine the model with an adversarial speaker classifier [zhang19] to boost voice cloning. The classifier follows principles of domain adversarial training [ganin16] and is used to proactively remove speaker-specific information from the encoders. It includes a single hidden layer, a softmax layer, and a gradient reversal layer that scales the gradient flowing to the encoders by a factor −λ. The gradients are clipped to stabilize training. It is optimized to reduce the cross-entropy of speaker predictions. The predictions are done separately for each element of the encoders’ outputs.
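The behaviour of a gradient reversal layer can be sketched without any framework. The class below is a hypothetical illustration: `scale` plays the role of the reversal factor, and applying the clipping element-wise inside the layer is an assumption, since the text does not say where exactly the gradients are clipped.

```python
class GradientReversal:
    """Identity in the forward pass; scaled, sign-flipped gradient backward."""

    def __init__(self, scale, clip=None):
        self.scale = scale  # reversal factor (lambda in the usual notation)
        self.clip = clip    # optional per-element bound (assumed placement)

    def forward(self, encoder_outputs):
        # The classifier sees the encoder outputs unchanged.
        return list(encoder_outputs)

    def backward(self, grad_from_classifier):
        # The encoder receives the negated, scaled classifier gradient.
        reversed_grad = [-self.scale * g for g in grad_from_classifier]
        if self.clip is not None:
            reversed_grad = [max(-self.clip, min(self.clip, g))
                             for g in reversed_grad]
        return reversed_grad
```

The classifier itself is still trained to predict speakers well; only the encoders receive the reversed signal, so they are pushed towards outputs from which the speaker cannot be recovered.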

3.2 Baselines: Shared, Separate & Single

We compare Gen with baseline models called shared (Sha), separate (Sep), and single (Sgl). Sgl is a basic Tacotron 2 model, Sha and Sep follow the recent multilingual TTS works of zhang19 and cao2019, respectively, but were slightly adapted to our tasks for a fairer comparison to Gen – we use more languages and less data than the original works. In the following, we only describe their differences from Gen.

Single (Sgl):

Sgl represents a set of monolingual models that follow vanilla Tacotron 2 [shen18] with the original recurrent encoder and default settings. Sgl cannot be used for code-switching.

Shared (Sha):

Unlike Gen, Sha has a single encoder with the original Tacotron 2 architecture, so it fully shares all encoder parameters. This sharing implicitly leads to language-independent encoder outputs. The language-dependent processing happens in the decoder, so the speaker embeddings are explicitly factorized into speaker and language parts.

Separate (Sep):

Sep also uses multiple language-specific convolutional encoders, but their parameters are not generated. It also does not include the adversarial speaker classifier.

4 Dataset

We created a new dataset for our experiments, based on carefully cleaning and preprocessing freely available audio sources: CSS10 [park19] and a small fraction of Common Voice [ardila19]. Table 1 shows total durations of the used audio data per language.

      German  Greek  Spanish  Finnish  French  Hungarian  Japanese  Dutch  Russian  Chinese
CSS   15.4    3.5    20.9     9.7      16.9    9.5        14.3      11.7   17.7     5.6
CV    4.8     N/A    N/A      N/A      3.0     N/A        N/A       1.3    3.4      1.0
Table 1: Total data sizes per language (hours of audio data) in our cleaned CSS10 (CSS) and Common Voice (CV) subsets.

4.1 CSS10

CSS10 consists of mono-speaker data in German, Greek, Spanish, Finnish, French, Hungarian, Japanese, Dutch, Russian, and Chinese. It was created from audiobooks and contains various punctuation styles. We applied automated cleaning to normalize transcripts across languages, including punctuation and some spelling variants (e.g., “œ” → “oe”). We romanized Japanese with MeCab and Romkan [mecab, romkan] and Chinese using Pinyin [pinyin].

[Table 2 columns: GT, Sgl, Sha, Sep, Gen, Sha 600, Sha 900, Gen 600, Gen 900; cell values missing]
Table 2: Left: CERs of ground-truth recordings (GT) and recordings produced by the monolingual and the three examined multilingual models. Right: CERs of the recordings synthesized by Gen and Sha trained on just 600 or 900 training examples per language. Best results for the given language are shown in bold; “*” denotes statistical significance (established using a paired t-test).


We further filtered the data to remove any potentially problematic transcripts: we kept only examples with 0.5-10.1s of audio and 3-190 transcript characters. We computed the means and variances of audio durations within groups of examples that have the same transcript length, and removed examples whose durations fell outside an interval around the group mean derived from these statistics. In total, the resulting dataset includes 125.26 hours of recordings.
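The length-conditioned duration filter can be sketched as follows. The interval is assumed here to be the group mean plus or minus `k` standard deviations; `k` is a hypothetical parameter, as are the function name and the (transcript, duration) input format.

```python
from collections import defaultdict
from statistics import mean, pstdev


def filter_by_duration(examples, k):
    """Keep examples whose duration lies within mean +/- k * std of the
    durations of all examples that share the same transcript length.

    examples: list of (transcript, duration_seconds) pairs.
    k: half-width of the interval in standard deviations (hypothetical).
    """
    groups = defaultdict(list)
    for transcript, duration in examples:
        groups[len(transcript)].append(duration)

    stats = {length: (mean(ds), pstdev(ds)) for length, ds in groups.items()}

    kept = []
    for transcript, duration in examples:
        mu, sigma = stats[len(transcript)]
        if abs(duration - mu) <= k * sigma:
            kept.append((transcript, duration))
    return kept
```

Grouping by transcript length means a long recording is only suspicious relative to other recordings of equally long text, which catches misaligned or truncated audio without penalizing long sentences.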

4.2 Common Voice

To train code-switching models, multi-speaker data is required to disentangle the connection between languages and speakers. We thus enhanced CSS10 with data from Common Voice (CV) for languages included in both sets – the intersection covers German, French, Chinese, Dutch, Russian, Japanese, and Spanish.

Since CV is mainly aimed at speech recognition and is rather noisy, we performed extensive filtering: We removed recordings with a negative rating (as provided by CV for each example) and excluded any speakers with fewer than 50 recordings. We checked a sample of recordings for each speaker, and we removed all their data if we considered the sample to have poor quality. This resulted in a small dataset of 39 German, 22 French, 11 Dutch, 6 Chinese, and 6 Russian speakers. Japanese and Spanish data were removed completely. Many recordings in CV contain artifacts at the beginning or end. Thus we semi-automatically cleaned leading and trailing segments of all recordings. The dataset has 13.7 hours of audio data in total.
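The automatic part of this filtering might look like the sketch below. The dictionary keys `speaker`, `up_votes`, and `down_votes` are hypothetical, "negative rating" is assumed to mean more down-votes than up-votes, and counting recordings per speaker after the rating filter is also an assumption. The manual quality check and the trimming of leading and trailing segments are not covered.

```python
from collections import Counter


def filter_common_voice(clips, min_clips_per_speaker=50):
    """Filter clips given as dicts with the hypothetical keys
    'speaker', 'up_votes', and 'down_votes'."""
    # Drop clips with a negative rating (assumed: more down-votes than up-votes).
    rated_ok = [c for c in clips if c["down_votes"] <= c["up_votes"]]
    # Drop speakers that have too few remaining clips.
    counts = Counter(c["speaker"] for c in rated_ok)
    return [c for c in rated_ok
            if counts[c["speaker"]] >= min_clips_per_speaker]
```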

5 Experiments

We compare the models described in Section 3. The experiment in Section 5.1 was designed to show stability and the ability to train on lower amounts of data. We consider character error rate (CER) evaluation [soukoreff01] sufficient for this experiment. In Section 5.2, we test pronunciation accuracy and voice quality of code-switching synthesis. We used a subjective evaluation test as there are no straightforward objective metrics for this task.

We used the same vocoder for all models, i.e., the WaveRNN model trained on a training subset of the cleaned CSS10 dataset.

5.1 Multilingual training

Training setup:

We used our cleaned CSS10 dataset for training; 64 randomly selected samples per language were reserved for validation and another 64 for testing. We did not aim to clone voices in this experiment, so we switched off speaker classifiers for Sha and Gen (i.e., Sha was reduced to the vanilla Tacotron 2 model with a language embedding).

We trained the three multilingual models for 50k steps with the Adam optimizer and weight decay. We used a stepped learning rate that halves every 10k steps. In the case of Sep, we used a lower initial learning rate. For Sgl, the learning rate schedule was tuned individually per language. We stopped training early after validation data loss started increasing. Sha, Sep, and Gen used speaker embeddings of size 32 and Gen used language embeddings and parameter generators of size 10 and 8, respectively. We used language-balanced batches of size 60 for all models.
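The stepped schedule is simple enough to state as a one-line function. The sketch below takes the initial learning rate as an argument because its value is not given here; the function name and the default interval of 10k steps are the only other assumptions.

```python
def stepped_learning_rate(step, initial_lr, halve_every=10_000):
    """Learning rate that is halved after every `halve_every` training steps."""
    return initial_lr * 0.5 ** (step // halve_every)
```

Over a 50k-step run with the default interval, the rate is halved four times, so training ends at one sixteenth of the initial value.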


We synthesized evaluation data using all the models followed by WaveRNN and we sent the synthesized recordings to Google Cloud Platform ASR. Then we computed CERs between ground-truth and ASR-produced transcripts (we used the native symbols for Chinese and Japanese).
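For reference, a character error rate can be computed from the Levenshtein distance between the two transcripts. The sketch below normalizes by the length of the ground-truth transcript, which is an assumption: some definitions divide by the length of the longer string instead. The function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length (assumed normalization)."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the distance is computed over characters, the same function applies unchanged to transcripts in native Chinese or Japanese symbols.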


Table 2 summarizes the obtained CERs. The first column gives an idea of the performance of the ASR engine. The rates stay below 20% for all languages; higher CERs are mostly caused by noisy CSS10 recordings.

We were not able to train the Greek Sgl model due to the low amount of training data: the decoder started to overfit early, before the attention could be established. The performance of Sgl is similar to Sha except for Chinese, Finnish, and Greek. Sep performed noticeably worse than Sha or even Sgl. This may be caused by the imbalance between the batch sizes of the encoder and the decoder, as each language-specific encoder sees only its own language's share of the batch. (Our attempts to compensate for this using different encoder and decoder learning rates were not successful.)

Sharing of the data probably regularized the decoder, so the attention was established even in the case of Greek.

Gen seems to be significantly better than Sha on most languages. It fulfills our expectations as Gen should be more flexible.

Manual error analysis:

We manually inspected the outputs in German, French, Spanish, and Russian. In the case of Spanish, all the models work well; we noticed only differences in the treatment of punctuation. German outputs by Gen seem to be the best. Other models sometimes make unnatural pauses when reaching a punctuation mark. Right after the pauses, they often skip a few words. Gen is noticeably better on French and Russian, whereas the other models produce obvious mispronunciations.

Data-stress training:

To further test the models in data-stress situations, we chose random subsets of 600 and 900 examples per language from the training set (i.e., about 80 or 120 minutes of recordings, respectively). We trained all models on both reduced datasets, but were able to complete training only for Sha and Gen. While training on the bigger and smaller dataset, we decayed the learning rate every 7.5k and 5k training steps, respectively. The right half of Table 2 shows that Gen can work better even in data-stress situations. Compared to Sha models, Gen models have significantly better CER values on six languages.

5.2 Code-switching

Training setup:

In this experiment, we only used the five languages where both CSS10 and CV data are available (Table 1), and trained on all data in our cleaned sets; 64 and 4 randomly selected samples for each speaker from CSS10 and CV, respectively, were reserved for validation. The Sgl models are not applicable to the code-switching scenario. Sha, Sep, and Gen models were trained for 50k steps with the same learning rate and schedule settings as in Section 5.1, this time with the adversarial speaker classifiers enabled. The gradient reversal factor and the classifier loss weights (which differ between Gen and Sha) were set based on preliminary experiments on validation data. The classifiers include a hidden layer of size 256. We set the size of speaker embeddings to 32 and used a language embedding of size 4 in Sha. Gen uses language embeddings of size 10 and generator layers of size 4. We used mini-batches of size 50 for all models.

Figure 2: Language abilities of participants of our survey.

Code-switching evaluation dataset:

We created a new small-scale dataset especially for code-switching evaluation. We used bilingual sentences scraped from Wikipedia. For each language, we picked 80 sentences with a few foreign words (20 sentences for each of the 4 other languages); Chinese was romanized. We replaced foreign names with their native forms (see Fig. 3).

Figure 3: Examples of code-switching evaluation sentences.

Subjective evaluation:

We synthesized all evaluation sentences using speaker embedding of the CSS10 speaker for the base language of the sentence. We arranged a subjective evaluation test and used a rating method that combines five-point mean opinion score (MOS) with MUSHRA [itu_mushra]. For each sample, its transcript and systems’ outputs were shown at the same time. Participants were asked to rate them on a scale from 1 to 5 with 0.1 increments and with labels “Bad”, “Poor”, “Fair”, “Good”, “Excellent”. To distinguish different error types, we asked for two ratings: (1) fluency, naturalness, and stability of the voice (speaker similarity) – to check if foreign words cause any change to the speaker’s voice, and (2) accuracy – testing if all words are pronounced and the foreign word pronunciation is correct. Participants could leave a textual note at the end of the survey.

For each language, we recruited ten native speakers who spoke at least one other language fluently via the Prolific platform (Fig. 2); four participants who were registered as Chinese native speakers on Prolific reported only non-native fluency in our survey. They were given twelve sentences with the base language matching their native language, where each of the other languages was represented by three sentences. In three sentences, a random model output was distorted and used as a sanity check (expected to be rated lowest); all participants passed.


Table 3 summarizes results of the survey. The rows marked “All” show means and variances of the ratings of all 50 participants. Fig. 4 visualizes quantiles of the ratings (grouped by dominant languages).

Gen has significantly higher mean ratings on both scales. Unlike Sha or Sep, it allows cross-lingual mixing of the encoder outputs and enables smooth control over pronunciation. Sep scores consistently worst. The accuracy ratings are overall slightly higher than the fluency ratings; this might be caused by improper word stress, which several participants commented on.

             Sha      Sep      Gen
[Fluency ratings for German, French, Dutch, Russian, Chinese, and All: values missing]
[Accuracy ratings for German, French, Dutch, Russian, Chinese, and All: values missing]
Word skips   41/400   38/400   11/400
Table 3: Mean (with std. dev.) ratings of fluency, naturalness, voice stability (top) and pronunciation accuracy (middle). The bottom row shows the number of sentences with word skips.
Figure 4: Graphs showing distributions of fluency and accuracy ratings grouped by the dominant language of rated sentences.

Manual error analysis:

We found that the models sometimes skip words, especially when reaching foreign words in Chinese sentences. Therefore, we manually inspected all 400 outputs of all models and counted sentences where any word skip occurred; see the “Word skips” row in Table 3. We found that the Gen model makes far fewer of these errors than Sha and Sep.

6 Conclusion

We presented a new grapheme-based model that uses meta-learning for multilingual TTS. We showed that it significantly outperforms multiple strong baselines on two tasks: data-stress training and code-switching, where our model was favored in both voice fluency and pronunciation accuracy. Our code is available on GitHub. For future work, we consider changes to our model’s attention module to further improve accuracy.

7 Acknowledgements

This research was supported by the Charles University grant PRIMUS/19/SCI/10.