Recently, text-to-speech (TTS) has witnessed rapid development in synthesizing speech from single-language text due to the introduction of sequence-to-sequence models [DBLP:conf/interspeech/WangSSWWJYXCBLA17, shen2018natural, DBLP:conf/iclr/PingPGAKNRM18] and high-fidelity neural vocoders [vanwavenet, oord2018parallel, 48585, NEURIPS2019_6804c9bc]. However, multi-lingual TTS remains a challenging task. The main challenge lies in disentangling language attributes from speaker identities in order to achieve code-switching and cross-lingual voice cloning.
Usually, multi-lingual speech from a multi-lingual speaker is required to build a TTS system that can perform code-switching and cross-lingual voice cloning. For example, [Traber99frommultilingual] builds a bilingual TTS system using bilingual speech data from a bilingual speaker. However, it is hard to find a speaker who is proficient in multiple languages and has smooth articulation across different languages.
Thus, some studies have started to build cross-lingual TTS systems using mono-lingual TTS data. 
An early study uses a mixture of mono-lingual speech data to build a Hidden Markov Model (HMM) based code-switched TTS system, where HMM states are shared across different languages. [4730269, 5153557] construct an HMM-based Mandarin-English TTS system with shared context-dependent HMM states, where a state mapping is learned from bilingual speech. [He2012TurningAM] turns a mono-lingual speaker into a multi-lingual one for mixed-lingual TTS by formant-mapping-based frequency warping, adapting F0 dynamics, and adjusting speaking rates accordingly. [Sitaram2016ExperimentsWC] presents a code-mixed TTS system where mappings between the phonemes of the two languages are used to synthesize mixed-lingual text. [Chandu2017Speech] develops a bi-lingual TTS system for navigation instructions using a mixture of mono-lingual speech datasets and a unified phone set for the two languages.
With a successful application of sequence-to-sequence models in TTS [DBLP:conf/interspeech/WangSSWWJYXCBLA17, shen2018natural, gibiansky2017deep, DBLP:conf/iclr/PingPGAKNRM18], some researchers have begun to investigate sequence-to-sequence cross-lingual TTS.
[8682674, 8682927, Nekvinda2020OneMM, DBLP:conf/interspeech/XueSXXW19] build sequence-to-sequence code-switched TTS systems using a mixture of mono-lingual speech data. One of these works proposes to use bytes as model inputs instead of graphemes, which yields fluent code-switched speech, although the voice switches between languages. Another explores two kinds of encoders to handle alphabet inputs of different languages, namely (1) a shared multi-lingual encoder with explicit language embedding, and (2) separate mono-lingual encoders for each language. [Nekvinda2020OneMM] introduces meta-learning to further improve multi-lingual TTS. [DBLP:conf/interspeech/XueSXXW19] builds a mixed-lingual TTS system by pre-training an average voice model on multi-speaker mono-lingual data, and also studies how the position of the speaker embedding affects speaker consistency, and how the phoneme embedding affects intelligibility and naturalness.
Another work proposes to build a multi-lingual TTS system using hundreds of hours of high-quality TTS data in three languages, so that the system can perform code-switched TTS and cross-lingual voice cloning. However, it also shows that cross-lingual voice cloning is difficult to achieve when only one speaker per language is available in the training data, even when augmented with a speaker-adversarial loss that aims to disentangle the textual representation from speaker identities.
In this paper, we aim to achieve cross-lingual voice cloning using low-quality code-switched found data. As it is both laborious and expensive to record multi-lingual TTS data as in [Traber99frommultilingual] or large amounts of high-quality mono-lingual TTS data as in [DBLP:conf/interspeech/XueSXXW19, 48331], we propose to utilize abundant code-switched found speech data to achieve cross-lingual voice cloning for the target speakers. The contributions of this work are summarized as follows:
To the best of our knowledge, this is the first work to improve cross-lingual voice cloning using low-quality code-switched found data.
Our proposed method of using low-quality code-switched found data significantly improves the performance of cross-lingual voice cloning, achieving results comparable to the SOTA system.
Experiments show that our proposed method can also be combined with separate mono-lingual encoders to further improve the results, which indicates its compatibility with other cross-lingual TTS methods.
The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 describes the baseline models and our proposed approach, Section 4 details the experimental setup and result analysis, and Section 5 concludes the paper.
2 Related Works
There is some research on building TTS models from low-quality found data. [cooper2017utterance] investigates automatically selecting ASR speech data to improve the intelligibility of TTS systems. Similarly, [kuo2018data, kuo2019selection, baljekar2016utterance] also find that careful data selection can improve the performance of TTS models trained on found data.
This work is related to the works above in that it also investigates improving TTS performance using low-quality found data. However, the previous works focus on mono-lingual TTS with mono-lingual found data, whereas this work aims to improve cross-lingual voice cloning by utilizing code-switched found data. This task is more challenging because we must not only deal with the “low-quality” attribute of the found data, but also handle the data mismatch problems described in Section 3.3.1.
3.1 The baseline model
In this paper, we use Tacotron2 [DBLP:conf/interspeech/WangSSWWJYXCBLA17] as the baseline TTS model structure, a SOTA sequence-to-sequence model that generates a mel spectrogram given the text sequence. We make two modifications to the vanilla model: we replace the Location-Sensitive Attention (LSA) with Gaussian Mixture Model (GMM) attention for more robust sequence generation, and we augment the model with an additional speaker embedding to model speaker characteristics. The whole model is illustrated in Figure 1 and the equations below.
where $e$, $x$, and $s$ are the encoder output, phoneme input, and speaker embedding, respectively, and Concat denotes embedding concatenation. $h_{att}$ and $h_{dec}$ are the hidden representations of the attention RNN and the decoder RNN; $c_{att}$ and $o_{att}$ are the attention context vector and the output of the attention RNN; $m$ and $\hat{m}$ are the mel spectrograms before and after the postnet.
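As a rough illustration of the two modifications above, the sketch below computes GMM-attention alignment weights and concatenates a speaker embedding to every encoder frame. All dimensions and mixture parameters are illustrative stand-ins, not the paper's actual hyperparameters:

```python
import numpy as np

def gmm_attention_weights(mu, sigma, w, num_steps):
    """Alignment weights of Gaussian-mixture attention over encoder steps.
    mu, sigma, w: per-component mean positions, widths, and weights, shape (K,)."""
    j = np.arange(num_steps)[None, :]  # encoder positions, shape (1, T)
    phi = w[:, None] * np.exp(-0.5 * ((j - mu[:, None]) / sigma[:, None]) ** 2)
    return phi.sum(axis=0)             # shape (T,)

# Speaker conditioning: concatenate a speaker embedding to each encoder frame.
T, enc_dim, spk_dim = 8, 256, 64                 # illustrative sizes
enc_out = np.random.randn(T, enc_dim)            # stand-in encoder output
spk_emb = np.random.randn(spk_dim)               # stand-in speaker embedding
cond = np.concatenate([enc_out, np.tile(spk_emb, (T, 1))], axis=1)  # (8, 320)
```

Because the mixture means advance monotonically during decoding, GMM attention tends to produce more robust alignments on long or noisy utterances than content-based LSA.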
3.2 Separated mono-lingual encoder
We also implement the separate mono-lingual encoder (SPE) to verify the compatibility of our proposed approach with other methods proposed for cross-lingual TTS. We briefly review SPE below.
To avoid mutual interference between the representations of different languages, the separate mono-lingual encoder (SPE) system uses two separate encoders for inputs from the two languages. The structure of the encoder is illustrated in Figure 2. The model has an English encoder and a Chinese encoder. The input character sequence is fed into both encoders, and a language-ID sequence is used as a mask to extract the corresponding language portion from each encoder. The final encoder output is computed as follows:
where $\odot$ denotes element-wise multiplication and $\oplus$ element-wise addition. During training, each encoder block is effectively trained only on the corresponding mono-lingual data, since the language mask of exactly one encoder is set to 1 at each position. In the testing phase, however, it is better to feed the whole input sequence into both encoder blocks when computing the textual representation. Although each encoder mishandles inputs from the other language, maintaining the whole context helps reduce mismatches at language boundaries and learn better alignments for code-switched utterances. The rest of the SPE model is implemented as in the baseline Tacotron2 model.
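The mask-and-add combination above can be sketched as follows. This is a minimal NumPy illustration with made-up dimensions and constant stand-ins for the two encoder outputs:

```python
import numpy as np

# Stand-in per-language encoder outputs for a 6-token mixed sentence (dim 4).
h_en = np.full((6, 4), 1.0)   # placeholder for the English encoder output
h_cn = np.full((6, 4), 2.0)   # placeholder for the Chinese encoder output

# Language-ID sequence: 1 where the token is English, 0 where it is Chinese.
lang_id = np.array([1, 1, 0, 0, 1, 1])
m_en = lang_id[:, None].astype(float)   # English mask, shape (6, 1)
m_cn = 1.0 - m_en                       # Chinese mask

# Final encoder output: element-wise mask-and-add across the two encoders.
h = m_en * h_en + m_cn * h_cn
```

Each output position thus comes from exactly one encoder, while both encoders still see the full character sequence as context.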
3.3 Proposed Method
There is a data-distribution mismatch problem when performing cross-lingual voice cloning using mono-lingual data: we only have mono-lingual data from the target speakers at training time, while we want to synthesize multi-lingual speech in the target speakers' voices at test time. In this paper, we hypothesize that using code-switched data can mitigate this mismatch. Since it is hard to record code-switched data, we instead collect low-quality code-switched found data.
Although it is more convenient to collect low-quality mono-lingual speech data than code-switched speech data, previous work has found that even training the model on hundreds of hours of high-quality mono-lingual data cannot achieve satisfactory speaker similarity in cross-lingual voice cloning, especially when Chinese speakers are involved. That is to say, simply using abundant mono-lingual data cannot disentangle language attributes from speakers well. Thus, in this paper, we hypothesize that utilizing code-switched speech data can help disentangle language attributes from speakers better than mono-lingual data.
3.3.2 Training & filtering strategy
Due to the “low-quality” attribute of the code-switched found data, we adopt a two-stage pretrain-finetune training strategy. Inspired by [DBLP:conf/interspeech/XueSXXW19], we pretrain the TTS model using a mixture of the code-switched found data and the high-quality TTS data from the target speakers. During pretraining, we assign each speaker a speaker embedding for multi-speaker training, as our initial experiments show that this achieves better results than treating all the data as coming from a single speaker with no speaker embedding. When fine-tuning, we fix the encoder and only fine-tune the rest of the model, since fine-tuning the whole model causes catastrophic forgetting: the model forgets the disentanglement of language attributes from speaker identities, because the fine-tuning data are mono-lingual.
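The fine-tuning stage (freeze the encoder, update everything else) can be sketched in PyTorch. The module names below are illustrative stand-ins, not the actual Tacotron2 implementation:

```python
import torch
import torch.nn as nn

# Minimal stand-in for a TTS model; the real system is Tacotron2.
model = nn.ModuleDict({
    "encoder": nn.Linear(8, 8),   # frozen during fine-tuning
    "decoder": nn.Linear(8, 8),   # still trainable
})

# Fine-tuning stage: fix the encoder, update only the rest of the model.
for p in model["encoder"].parameters():
    p.requires_grad = False

# Pass only trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```

Keeping the encoder frozen preserves the language-speaker disentanglement learned during pretraining even though the fine-tuning data are mono-lingual.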
Moreover, training seq-to-seq TTS models on low-quality speech data is challenging, since the attention mechanism can fail to learn a robust alignment between text representations and acoustic features. Thus, inspired by [cooper2016data, cooper2017utterance], we filter out potentially “noisy” data using metrics including speaking rate, hypo-articulation, and signal-to-noise ratio (SNR).
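A minimal sketch of such a filter is given below, assuming a crude frame-energy SNR estimate and a phones-per-second speaking-rate threshold. The threshold values and the energy-percentile heuristic are assumptions for illustration; the paper's hypo-articulation metric is not reproduced here:

```python
import numpy as np

def snr_db(wav, frame=160, noise_floor=1e-4):
    """Crude SNR estimate: ratio of loud-frame to quiet-frame energy,
    assuming frame energy roughly separates speech from background."""
    frames = wav[: len(wav) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1) + noise_floor
    speech = np.percentile(energy, 90)   # energy of the loudest frames
    noise = np.percentile(energy, 10)    # energy of the quietest frames
    return 10 * np.log10(speech / noise)

def keep_utterance(wav, n_phones, sr=16000,
                   min_snr_db=15.0, max_phones_per_sec=25.0):
    """Reject potentially noisy utterances: low SNR or overly fast speech."""
    duration = len(wav) / sr
    speaking_rate = n_phones / duration
    return snr_db(wav) >= min_snr_db and speaking_rate <= max_phones_per_sec
```

In practice one would rank utterances per speaker by such scores and drop the bottom portion, as done for the found corpus in Section 4.1.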
(Table 2 fragment: MOS scores, mean ± 95% confidence interval; column headers not recovered.)
| Tac_Mix | 4.30 ± 0.07 | 4.18 ± 0.08 | 4.26 ± 0.07 | 3.73 ± 0.09 | 3.83 ± 0.08 |
| SPE_Mix | 4.02 ± 0.08 | 3.98 ± 0.06 | 4.14 ± 0.13 | 4.15 ± 0.08 | 3.29 ± 0.13 | 3.83 ± 0.09 |
| Tac_Mix | 4.05 ± 0.07 | 4.01 ± 0.05 | 3.83 ± 0.08 | 3.95 ± 0.08 | 4.08 ± 0.06 |
| SPE_Mix | 3.92 ± 0.05 | 3.82 ± 0.10 | 3.77 ± 0.07 | 3.92 ± 0.08 | 3.93 ± 0.07 |
4.1 Experimental setup
In this paper, we aim to utilize low-quality code-switched found data to enable our mono-lingual target speakers to speak foreign languages (a.k.a. cross-lingual voice cloning). We use only mono-lingual speech data from two female speakers. The first is a Chinese female speaker from our internal corpus, and the other is the English female speaker from [ljspeech17]. The number of training utterances for each speaker is 5000, with 10 hours of speech in total. For the low-quality code-switched data, we use the code-switched corpus of the ASRU 2019 Code-Switching Challenge. Since the corpus is designed for automatic speech recognition, its quality is significantly lower than that of TTS data. As described in Section 3.3.2, we filter the original corpus and select only 100 speakers for the experiments. For each speaker, we further discard the 10% lowest-quality utterances. As a result, we use 33000 code-switched utterances in total (about 27 hours of data). For evaluation, 200 utterances of each type (i.e., Chinese, English, and code-switched) are randomly selected; these are not used in training or development.
4.1.2 Training Setup
In this paper, we build several TTS systems as follows:
Tac : Tacotron2 model trained with mono-lingual data;
Tac_Mix : Tacotron2 pre-trained with mixture of data, then fine-tuned with mono-lingual data;
SPE : SPE TTS system trained with mono-lingual data;
SPE_Mix : SPE TTS system pre-trained with mixture of data, then fine-tuned with mono-lingual data;
For Models Tac and SPE, we train for 200k steps. For Models Tac_Mix and SPE_Mix, we pre-train for 100k steps and then fine-tune on data from the target speakers, with early stopping to avoid over-fitting. We train the models using the Adam [DBLP:journals/corr/KingmaB14] optimizer with a batch size of 32 and an initial learning rate of 1e-3, which is halved repeatedly until convergence.
Waveforms are synthesized by a WaveNet vocoder that generates 16-bit speech at a 16 kHz sample rate, conditioned on the predicted spectrograms. We use a single variance-bounded Gaussian distribution to model the waveform samples as in [ping2018clarinet], which alleviates the quantization errors in synthesized speech introduced by the previously used categorical distribution. We train one vocoder per target speaker: the Chinese speaker's vocoder is trained on 10 hours of her speech, and the English speaker's on 15 hours.
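The variance-bounded Gaussian output can be sketched as a per-sample negative log-likelihood with a lower bound on the predicted log-scale, the trick used in ClariNet-style vocoders to keep training stable. The bound value here is an assumption, not the paper's setting:

```python
import numpy as np

def gaussian_nll(y, mu, log_sigma, log_sigma_min=-7.0):
    """Negative log-likelihood of waveform samples under a single Gaussian.
    The predicted log-scale is clamped from below (variance bounding) so the
    loss cannot blow up as the predicted variance collapses toward zero."""
    log_sigma = np.maximum(log_sigma, log_sigma_min)
    return 0.5 * (np.log(2 * np.pi) + 2 * log_sigma
                  + ((y - mu) ** 2) * np.exp(-2 * log_sigma))
```

Unlike a categorical distribution over quantized amplitudes, this continuous likelihood models 16-bit samples directly and thus avoids quantization noise in the synthesized speech.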
4.2 Result Analysis
To verify whether using low-quality code-switched data is helpful, both objective and subjective tests are carried out. For the objective test, we use the iFLYTEK speech-to-text API (https://www.xfyun.cn/) to recognize the generated speech and use the word error rate (WER) as the intelligibility metric. For the subjective test, we conduct a formal listening test with 16 raters, all native Mandarin Chinese speakers who are proficient in English. The listeners evaluate the naturalness and speaker similarity of the speech synthesized by each model using the mean opinion score (MOS), on a scale from 1 to 5 with an interval of 0.5. The listening test covers three types of utterances, with 10 samples of each: 10 Mandarin-English code-switched utterances (CS-Target), 10 Chinese utterances (CN-Target), and 10 English utterances (EN-Target). Each utterance is rated by all 16 listeners for naturalness and similarity. The ground-truth speech is also rated as a reference. Speech demos are available at https://haitongzhang.github.io/Code-switch-TTS/.
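For reference, the WER metric used in the objective test is the standard word-level Levenshtein distance normalized by the reference length; a minimal sketch (in the experiments, the ASR transcript of the synthesized speech would be the hypothesis):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table of edit distances between prefixes.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

For Chinese and code-switched text, WER is typically computed after word segmentation (or per character for the Chinese portion); the sketch above leaves tokenization to the caller.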
4.2.1 Objective evaluation
For the objective evaluation, we use about 5-minute speech data for each type. The results are provided in Table 1.
It is clearly shown that using low-quality code-switched data hardly hurts synthesis of the source language (namely, synthesizing Chinese speech with the Chinese speaker’s voice and English speech with the English speaker’s voice).
We find that utilizing these low-quality data to pretrain the model significantly reduces the WER for code-switched text, although the improvement is larger for the Tacotron model than for SPE.
In the case of cross-lingual voice cloning, since Model Tac fails to perform it, we do not compute its WER. However, Model SPE_Mix significantly outperforms the SPE system, which indicates the effectiveness of using low-quality code-switched data.
Moreover, we find that Model Tac_Mix slightly outperforms SPE_Mix, because Tac_Mix uses phonemes as inputs while SPE_Mix uses characters: during training, SPE_Mix must simultaneously learn the actual pronunciations of the characters, which is more difficult.
4.2.2 Subjective evaluation
The results of subjective evaluation on naturalness and speaker similarity are provided in Table 2.
For synthesizing source-language speech, pre-training with low-quality code-switched data has almost no negative impact on the results, consistent with the objective evaluation.
In the case of code-switched speech, Model Tac can hardly generate intelligible code-switched speech regardless of the speaker, as reflected in its low naturalness MOS scores. With pre-training on low-quality code-switched data, Model Tac_Mix can synthesize natural code-switched speech: compared with Model Tac, it provides a clear increase in the naturalness MOS score for both the Chinese and the English speaker, along with a significant improvement in speaker similarity.
As shown in Table 2, using separate encoders for the two languages improves code-switched synthesis, consistent with previous findings. When further pre-trained on low-quality code-switched data, Model SPE_Mix outperforms Model SPE in both naturalness and speaker similarity.
As far as cross-lingual voice cloning is concerned, Model Tac fails: with the Chinese speaker embedding, the generated English utterances still sound like the English speaker, and vice versa. Model Tac_Mix, in contrast, achieves promising performance in cross-lingual voice cloning, which reflects the effectiveness of pre-training on low-quality code-switched data.
Besides, even with separate encoders, Model SPE can hardly perform cross-lingual voice cloning, as its performance is not satisfactory. But when pre-trained on low-quality code-switched data, Model SPE_Mix achieves significantly better results in both naturalness and similarity, which indicates that the proposed pre-training method can be incorporated into the SPE system for better results.
Moreover, we compare our cross-lingual cloning results with those of the SOTA system. Although the naturalness of our proposed system is not better (partially because of the quality of our training data), the speaker similarity of our best system is significantly better. Generally speaking, our system achieves performance comparable to the SOTA system, considering the differences in experimental settings (e.g., the quality and quantity of training data).
5 Conclusions
In this paper, we aim to achieve cross-lingual voice cloning using only mono-lingual data from the target speakers, leveraging abundant low-quality code-switched found data to pretrain the TTS model. We conclude our findings as follows:
When utilizing low-quality code-switched found data, the data filtering strategy and the pretrain-finetune training strategy help mitigate the “low-quality” attribute, as pre-training does not negatively affect synthesis of the source language.
Pretrained with low-quality code-switched data, the model improves in code-switched synthesis.
Pretrained with low-quality code-switched data, the model can achieve comparable performance to the SOTA model in cross-lingual voice cloning.
Pre-training with low-quality code-switched data can be combined with separate encoders to bring about a further improvement.
Although experiments have shown that pre-training with low-quality code-switched found data is effective for cross-lingual voice cloning, we only investigate paired code-switched data in this paper. There is, however, far more unlabelled code-switched data in the wild; future work should investigate unsupervised pre-training on such unlabelled data.