Revisiting IPA-based Cross-lingual Text-to-speech

10/14/2021 ∙ by Haitong Zhang, et al. ∙ NetEase, Inc.

The International Phonetic Alphabet (IPA) has been widely used in cross-lingual text-to-speech (TTS) to achieve cross-lingual voice cloning (CL VL). However, IPA itself has been understudied in cross-lingual TTS. In this paper, we report empirical findings from building a cross-lingual TTS model with IPA inputs. Experiments show that the way the IPA and suprasegmental sequences are processed has a negligible impact on the CL VL performance. Furthermore, we find that using a dataset with only one speaker per language to build an IPA-based TTS system fails at CL VL, since the language-unique IPA and tone/stress symbols leak speaker information. In addition, we experiment with different combinations of speakers in the training dataset to further investigate the effect of the number of speakers on the CL VL performance.




1 Introduction

Recently, text-to-speech (TTS) has witnessed rapid development in synthesizing mono-language speech with sequence-to-sequence models [21, 17, 20] and high-fidelity neural vocoders [16, 8, 9]. Meanwhile, researchers have begun to study cross-lingual TTS, whose main challenge lies in disentangling language attributes from speaker identities to achieve cross-lingual voice cloning (CL VL).

Normally, multi-lingual speech from a multi-lingual speaker is required to build a TTS system that can perform CL VL [23]. However, it is hard to find a speaker who is proficient in multiple languages and articulates smoothly across them [26]. Thus, researchers have turned to building cross-lingual TTS systems using mono-lingual data.

Researchers initially investigated code-switched TTS by sharing HMM states across languages [10, 18, 19], by formant-mapping-based frequency warping [6], and by using a unified phone set for multiple languages [4].

More recently, researchers have investigated sequence-to-sequence cross-lingual TTS. [2] proposes separate encoders to handle the alphabet inputs of different languages. [26] adopts a pretrain-and-finetune method to build a cross-lingual TTS system from mono-lingual data. [28, 12, 25] use a gradient reversal layer to disentangle speaker information from the textual encoder. [15] uses meta-learning to improve cross-lingual performance, with graphemes as the input representation, while [11] proposes bytes as model inputs; this yields fluent code-switched speech, but the voice switches between languages. [27] compares the CL VL performance of language-dependent phoneme and language-independent phoneme (IPA) based multi-lingual TTS systems. [3] uses bilingual phonetic posteriorgrams (PPG) to achieve code-switched TTS.

Figure 1: Model structure studied.

1.1 The contribution

Although IPA has been widely used in cross-lingual TTS [28, 27], IPA itself has been understudied in cross-lingual TTS. In this paper, we conduct an empirical study of IPA in cross-lingual TTS, with an attempt to answer the following questions:

  • Does the way to process IPA and suprasegmental sequences have a significant impact on the CL VL performance?

  • Is monolingual data from only two speakers (one speaker per language) sufficient to achieve a promising CL VL performance in the IPA-based cross-lingual model?

  • What is the impact of the number of speakers per language on the CL VL performance?

To answer these questions, we compare the performance of two IPA processing modules in a non-autoregressive TTS model. In addition, we analyze the cross-lingual TTS model trained with only one speaker per language by devising two input perturbation methods, and we vary the number of speakers per language to analyze its effect on the CL VL performance.

2 Framework

2.1 Model architecture

The core of the framework is the FastSpeech 2 model [20], a transformer-based non-autoregressive TTS model. It mainly consists of an encoder, a variance adaptor, and a mel-spectrogram decoder. The encoder converts the input embedding sequence into a hidden sequence; the variance adaptor adds variance information such as duration, pitch, and energy to the hidden sequence; finally, the mel-spectrogram decoder converts the adapted hidden sequence into a mel-spectrogram sequence in parallel. To support multi-speaker TTS, we extract the speaker embedding from a speaker-embedding look-up table and inject it at two positions: 1) added to the encoder output and 2) added to the decoder input. The overall structure is illustrated in Fig. 1.
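As a rough illustration (not the authors' code), the two speaker-embedding injection points can be sketched as follows; `encoder`, `variance_adaptor`, and `decoder` are placeholder callables, and sequences are plain lists of vectors:

```python
# Sketch of speaker conditioning in a FastSpeech 2-style pipeline:
# the speaker embedding is broadcast-added to every frame, once at the
# encoder output and once at the decoder input.

def add_speaker(hidden, spk_emb):
    """Broadcast-add a speaker embedding to every frame of a hidden sequence."""
    return [[h + s for h, s in zip(frame, spk_emb)] for frame in hidden]

def synthesize(phoneme_embeddings, spk_emb, encoder, variance_adaptor, decoder):
    hidden = encoder(phoneme_embeddings)
    hidden = add_speaker(hidden, spk_emb)   # position 1: encoder output
    hidden = variance_adaptor(hidden)       # adds duration/pitch/energy
    hidden = add_speaker(hidden, spk_emb)   # position 2: decoder input
    return decoder(hidden)                  # mel-spectrogram frames
```

With identity stand-ins for the three sub-networks, the speaker embedding is simply added twice to each frame.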


2.2 Input processing module

The input of the cross-lingual model usually includes IPA and suprasegmental symbols (tone/stress). To investigate whether the way they are processed impacts the CL VL performance, we consider two processing modules: 1) SEA: use Separate Embedding tables for IPA and tone/stress, then Add the two embedding sequences to form the final input embedding; 2) UEI: use a Unified Embedding table for IPA and tone/stress, and take each embedding as an Independent position in the final input sequence. We illustrate the two processing modules in Fig. 2.
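A minimal sketch of the two modules, using a made-up two-symbol inventory with 2-dimensional embeddings (the real tables, and the placement of the tone token directly after its phoneme under UEI, are our assumptions):

```python
# SEA: two embedding tables, summed position-wise -> sequence length
#      equals the number of IPA symbols.
# UEI: one shared table, each IPA and tone/stress symbol becomes its own
#      input position -> sequence is longer.

IPA_TABLE  = {"p": [1.0, 0.0], "a": [0.0, 1.0]}       # illustrative
TONE_TABLE = {"tone1": [0.5, 0.5], "none": [0.0, 0.0]}
UNIFIED    = {**IPA_TABLE, **TONE_TABLE}

def sea(ipa, tones):
    """SEA: separate embeddings for IPA and tone/stress, added element-wise."""
    return [[i + t for i, t in zip(IPA_TABLE[p], TONE_TABLE[s])]
            for p, s in zip(ipa, tones)]

def uei(ipa, tones):
    """UEI: unified table; each symbol is an independent input position."""
    seq = []
    for p, s in zip(ipa, tones):
        seq.append(UNIFIED[p])
        seq.append(UNIFIED[s])
    return seq
```

Note the length difference: SEA keeps one position per phoneme, while UEI doubles the sequence length for tonal input.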

3 Experimental setup

3.1 Data

In this paper, we experiment on Chinese (Mandarin) and English, using two datasets. Dataset1 consists of mono-lingual speech from two female speakers: a Chinese speaker [5] and an English speaker [7]. Each speaker has roughly ten hours of speech; we hold out 200 utterances for evaluation and use the rest for training. Dataset2 extends Dataset1 with mono-lingual data from four additional speakers (one female and one male from [1], and one female and one male from our proprietary speech corpus), each with about 5 to 10 hours of data.

3.2 Implementation details

We use a G2P converter to convert the text sequence into a language-dependent phoneme sequence, and then convert it into IPA and tone/stress sequences. We include five tones for Mandarin Chinese and two stresses (primary and secondary) for English; a special symbol is used where there is no tone or stress. We also include a word-boundary symbol in the input sequence. We use the Montreal Forced Aligner (MFA) [13] to extract phoneme durations. The duration of the word-boundary symbol is set to zero, and when the UEI input processing module is used, the duration of each tone/stress symbol is also zero.
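The zero-duration handling can be sketched as follows, assuming MFA supplies a frame count per real phoneme; the `<wb>` token name is hypothetical:

```python
# Assign per-symbol durations for a UEI input sequence: word boundaries
# and tone/stress symbols take zero frames, real phonemes take the
# frame counts produced by forced alignment.

def assign_durations(symbols, phone_durations, is_phone):
    """symbols: full input token sequence; phone_durations: MFA frame
    counts for the real phonemes, in order; is_phone: predicate that
    identifies IPA symbols (as opposed to tone/stress or boundaries)."""
    durs = iter(phone_durations)
    out = []
    for s in symbols:
        if s == "<wb>":            # word boundary: zero frames
            out.append(0)
        elif is_phone(s):          # real phoneme: consume an MFA duration
            out.append(next(durs))
        else:                      # tone/stress under UEI: zero frames
            out.append(0)
    return out
```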

We train the FastSpeech 2 models with a batch size of 24 utterances on one GPU, using the Adam optimizer [21] and the learning rate schedule in [22]. Training takes 200k steps to converge. We refer readers to [20] for further training details.

The generated speech is represented as a sequence of 80-dimensional log-mel spectrogram frames, computed with 40 ms windows shifted by 10 ms. Waveforms are synthesized with a HiFi-GAN [9] vocoder, which generates 16-bit signals sampled at 16 kHz, conditioned on the spectrograms predicted by the TTS model.
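For reference, these analysis parameters translate into sample counts at 16 kHz as follows (a back-of-the-envelope sketch, not the authors' feature extractor):

```python
# Frame geometry for 40 ms windows / 10 ms shift at a 16 kHz sampling rate.

SR = 16000                # sampling rate (Hz)
WIN = int(0.040 * SR)     # 40 ms analysis window -> 640 samples
HOP = int(0.010 * SR)     # 10 ms frame shift     -> 160 samples

def n_frames(n_samples: int) -> int:
    """Number of full analysis windows that fit in a signal (no padding)."""
    if n_samples < WIN:
        return 0
    return 1 + (n_samples - WIN) // HOP
```

So one second of audio yields 97 full frames without padding, i.e. roughly one mel frame per 10 ms of speech.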

3.3 Evaluation metrics

We designed listening tests to evaluate the naturalness (NAT) and speaker similarity (SIM) of the synthesized speech. Ten utterances were randomly chosen for evaluation in each scenario, and each utterance was rated by 14 listeners using the mean opinion score (MOS) on a five-point scale. Demos are available online.
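MOS values of this kind are conventionally reported as a mean with a 95% confidence interval; the following sketch shows that computation (our assumption about how the ± half-widths in Tables 1 and 2 are derived, not the authors' exact script):

```python
import math

def mos_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% confidence
    interval, i.e. the 'mean ± half-width' reporting style."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)                          # CI half-width
    return round(mean, 2), round(half, 2)
```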

Figure 2: Examples of the input processing modules, where ⊕ denotes element-wise addition. Prosody symbols are omitted here for brevity.
| Speaker | Model | CH text (NAT / SIM) | EN text (NAT / SIM) |
| CH | MUEI | 4.35 ± 0.11 / 4.52 ± 0.08 | 4.37 ± 0.10 / 2.09 ± 0.13 |
| EN | MSEA | 2.32 ± 0.14 / 3.83 ± 0.13 | – |
| EN | MUEI | 3.84 ± 0.13 / 2.06 ± 0.11 | 4.44 ± 0.09 / 3.75 ± 0.12 |
Table 1: Naturalness (NAT) and similarity (SIM) MOS (± 95% CI) of speech synthesized by models with the two different input processing modules.

4 Results and discussion

4.1 The impact of input processing modules

To study whether the two input processing modules impact cross-lingual voice cloning performance, we trained two model variants on Dataset1: MSEA (the model with SEA) and MUEI (the model with UEI). The subjective evaluation results are provided in Table 1, which clearly shows that the two modules perform comparably on both intra-lingual and cross-lingual voice cloning.

4.2 Why cross-lingual voice cloning fails

Table 1 shows that the speaker similarity of CL VL is significantly lower than the intra-lingual performance. An informal listening test revealed that many Chinese utterances synthesized with the English speaker’s voice sound like the Chinese speaker, and many English utterances synthesized with the Chinese speaker’s voice sound like the English speaker. In other words, using IPA alone does not guarantee disentanglement between speaker identities and language symbols.

We hypothesize that this result can be attributed to the facts that (1) some IPA symbols do not overlap between the two target languages, and (2) the suprasegmentals (tone and stress) are each unique to one of the target languages. To test these hypotheses, we devised two input perturbation methods.

  • IPA perturbation: Randomly replace all IPA symbols in testing sentences of one language with non-overlapped IPA symbols from the other language. To remove the potential effect of tone/stress, we also replace all tone/stress symbols with the special no-tone symbol.

  • Tone/stress perturbation: Replace all tone symbols in Chinese testing sentences with the English primary stress symbol, or replace all stress symbols in English testing sentences with the Chinese mid-tone. To remove the potential effect of non-overlapped IPA symbols, we replace them with their closest IPA symbols, as in [22].
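The two perturbations above can be sketched as follows; the symbol inventories here are illustrative stand-ins, not the actual language-unique IPA sets:

```python
import random

# Illustrative (made-up) language-unique inventories and prosody symbols.
CH_ONLY = {"ʂ", "ʐ"}                                     # assumed Mandarin-only IPA
EN_ONLY = {"θ", "ð"}                                     # assumed English-only IPA
TONES   = {"tone1", "tone2", "tone3", "tone4", "tone5"}  # Mandarin tones
STRESS  = {"stress1", "stress2"}                         # EN primary/secondary

def ipa_perturb(seq, own_unique, other_unique, rng=random):
    """IPA perturbation: swap language-unique IPA symbols for random unique
    symbols of the other language; neutralize tone/stress to 'none'."""
    pool = sorted(other_unique)
    out = []
    for s in seq:
        if s in own_unique:
            out.append(rng.choice(pool))
        elif s in TONES or s in STRESS:
            out.append("none")
        else:
            out.append(s)
    return out

def tone_perturb(seq):
    """Tone/stress perturbation, CH_TP direction: Mandarin tones become the
    English primary stress (non-overlapped IPA would be mapped to their
    closest counterparts, as in [22]; omitted here)."""
    return ["stress1" if s in TONES else s for s in seq]
```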

We use these two input perturbation methods to modify the original testing sentences, creating six test sets in total: CH and EN (the original Chinese and English test data), CH_IP and EN_IP (Chinese and English test data with IPA perturbation), and CH_TP and EN_SP (Chinese and English test data with tone/stress perturbation). We then use MUEI to synthesize all six test sets and run a speaker-similarity preference test, in which raters judge whether a synthesized utterance is closer to the voice of the Chinese speaker, the English speaker, or neither. Since the proposed IPA or tone/stress perturbations may yield unintelligible or accented speech, we ask the raters to focus on speaker similarity during the test. The results are illustrated in Fig. 3 and Fig. 4.

| Speaker | Model | CH text (NAT / SIM) | EN text (NAT / SIM) | CS text (NAT / SIM) |
| – | Ground-Truth | – | – | – |
| – | Analysis-Synthesis | – | – | – |
| | C4E1 | 4.54 ± 0.10 / 4.62 ± 0.08 | – | – |
| | C4E4 | 4.07 ± 0.11 / 4.17 ± 0.08 | 4.06 ± 0.12 / 4.06 ± 0.11 | – |
| | C4E4 | 4.07 ± 0.11 / 3.68 ± 0.14 | 4.46 ± 0.09 / 3.98 ± 0.13 | 4.11 ± 0.13 / 3.63 ± 0.14 |
Table 2: Naturalness (NAT) and similarity (SIM) MOS (± 95% CI) of speech synthesized by models with different training data.

4.2.1 The effect of non-overlapped IPA

As shown in Fig. 3 and Fig. 4, with IPA perturbation the speaker similarity of the Chinese synthesized utterances decreases significantly for the Chinese speaker and increases significantly for the English speaker (see CH_IP). When applying IPA perturbation to the English text, the speaker similarity for the Chinese speaker increases while that for the English speaker decreases (see EN_IP). These results support our hypothesis that the non-overlapped IPA symbols carry some speaker information.

4.2.2 The effect of tone/stress

With tone perturbation, the speaker similarity of the Chinese synthesized utterances decreases significantly for the Chinese speaker and increases significantly for the English speaker (see CH_TP). This indicates that the stress symbols in English contain speaker information of the English speaker. For the English text, stress perturbation significantly increases the speaker similarity for the Chinese speaker, while it decreases the speaker similarity for the English speaker by a large margin (see EN_SP). This reveals that the tone symbols in Chinese are also responsible for the speaker information leakage.

Figure 3: Speaker similarity preference of synthesized utterances of six test datasets using the Chinese speaker’s voice.
Figure 4: Speaker similarity preference of synthesized utterances of six test datasets using the English speaker’s voice.

4.3 The number of speakers

In Section 4.2, we found that both the non-overlapped (language-unique) IPA symbols and the tone/stress symbols are likely to carry speaker information, which causes the model to fail at cross-lingual voice cloning. In this section, we continue the investigation with the following hypotheses.

Hypothesis 1: A secondary, indirect reason our models fail at CL VL is that the training data contains only two speakers. In other words, as we increase the number of speakers, this failure can be avoided.

Hypothesis 2: Increasing the number of speakers in only one language would make CL VL succeed for speakers of that language but fail for the speaker of the other language.

To test these hypotheses, we compared several model variants trained on different subsets of Dataset2:

  • C1E1: Model trained with one Chinese speaker and one English speaker (MUEI in Section 4.1).

  • C1E4: Model trained with one Chinese speaker and four English speakers.

  • C4E1: Model trained with four Chinese speakers and one English speaker.

  • C4E4: Model trained with four Chinese speakers and four English speakers.

We use the UEI input processing module in this scenario for a fair comparison. The naturalness and speaker-similarity MOS results are given in Table 2.

As shown in Table 2, the speaker similarity of cross-lingual voice cloning tends to increase with the number of speakers. In addition, when increasing the number of speakers in only one target language (i.e., C1E4 or C4E1), the CL VL speaker-similarity improvement for speakers of that language is more significant than for speakers of the other language; however, the naturalness MOS for speakers of that language shows a decreasing trend. We suspect that the models learn to disentangle speaker identities from the language-unique symbols but fail to synthesize natural cross-lingual speech because of the imbalanced distribution of the training data. Hence, increasing the number of speakers in all languages gives the best CL VL performance.

Furthermore, we provide results on code-switched synthesized speech. Model C1E1 already performs decently on code-switched utterances. We suspect that when synthesizing code-switched sentences, the non-overlapped IPA symbols of the two languages compete with each other to leak speaker information, with Chinese tones and English stresses joining the competition; the results suggest that they effectively cancel each other out and fail to leak speaker information, so the speaker embedding plays its full role. In addition, we observe a steady improvement as the number of speakers increases, and Model C4E4 achieves the best code-switched performance.

Fig. 5 visualizes the speaker similarity of the synthesized speech. Model C4E4 yields the ideal clustering: the synthesized speech clusters close to the respective target speaker’s ground-truth utterances, and the synthesized speech of the two speakers is separated by a considerable distance.

Figure 5: Visualizing the effect of the number of speakers, using 2D UMAP [14] of speaker embeddings [24] computed from speech synthesized with different speaker and text combinations. Orange represents speech from the Chinese speaker and blue the English speaker; different markers denote Chinese, code-switched, and English text, as well as the ground-truth utterances.

5 Conclusions

In this paper, we presented an empirical study of building an IPA-based cross-lingual non-autoregressive TTS model. Our findings are as follows.

  • The way to process IPA and tone/stress sequences has a negligible impact on the CL VL performance.

  • IPA alone does not guarantee successful CL VL performance since the language-unique IPA and tone/stress symbols are likely to leak the speaker information.

  • One simple but effective method to improve the CL VL performance of IPA-based CL TTS is to increase the number of speakers in all languages.

Although our findings are based on a non-autoregressive TTS model, they should generalize well to other TTS frameworks.


  • [1] E. Bakhturina, V. Lavrukhin, B. Ginsburg, and Y. Zhang (2021) Hi-fi multi-speaker english tts dataset. arXiv preprint arXiv:2104.01497. Cited by: §3.1.
  • [2] Y. Cao, X. Wu, S. Liu, J. Yu, X. Li, Z. Wu, X. Liu, and H. Meng (2019-05) End-to-end code-switched tts with mix of monolingual recordings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 6935–6939. External Links: Document, ISSN 1520-6149 Cited by: §1.
  • [3] Y. Cao, S. Liu, X. Wu, S. Kang, P. Liu, Z. Wu, X. Liu, D. Su, D. Yu, and H. Meng (2020) Code-switched speech synthesis using bilingual phonetic posteriorgram with only monolingual corpora. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7619–7623. Cited by: §1.
  • [4] K. R. Chandu, S. K. Rallabandi, S. Sitaram, and A. W. Black (2017) Speech synthesis for mixed-language navigation instructions. In Interspeech 2017, Cited by: §1.
  • [5] D. B. China (2017) Chinese standard mandarin speech corpus. Note: Cited by: §3.1.
  • [6] J. He, Y. Qian, F. Soong, and S. Zhao (2012) Turning a monolingual speaker into multilingual for a mixed-language tts. In INTERSPEECH, Cited by: §1.
  • [7] K. Ito and L. Johnson (2017) The lj speech dataset. Note: Cited by: §3.1.
  • [8] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu (2017) Efficient neural audio synthesis. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 2410–2419. Cited by: §1.
  • [9] J. Kong, J. Kim, and J. Bae (2020) HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. arXiv preprint arXiv:2010.05646. Cited by: §1, §3.2.
  • [10] J. Latorre, K. Iwano, and S. Furui (2005-03) Polyglot synthesis using a mixture of monolingual corpora. In Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., Vol. 1, pp. I/1–I/4 Vol. 1. External Links: Document, ISSN 2379-190X Cited by: §1.
  • [11] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan (2019-05) Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 5621–5625. External Links: Document, ISSN 1520-6149 Cited by: §1.
  • [12] R. Liu, X. Wen, C. Lu, and X. Chen (2020) Tone learning in low-resource bilingual tts.. In INTERSPEECH, pp. 2952–2956. Cited by: §1.
  • [13] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger (2017) Montreal forced aligner: trainable text-speech alignment using kaldi.. In Interspeech, Vol. 2017, pp. 498–502. Cited by: §3.2.
  • [14] L. McInnes, J. Healy, and J. Melville (2018) Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Cited by: Figure 5.
  • [15] T. Nekvinda and O. Dusek (2020) One model, many languages: meta-learning for multilingual text-to-speech. In INTERSPEECH, Cited by: §1.
  • [16] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg, et al. (2018) Parallel wavenet: fast high-fidelity speech synthesis. In International conference on machine learning, pp. 3918–3926. Cited by: §1.
  • [17] W. Ping, K. Peng, A. Gibiansky, S. Ö. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller (2018) Deep voice 3: scaling text-to-speech with convolutional sequence learning. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §1.
  • [18] Y. Qian, H. Cao, and F. K. Soong (2008-12) HMM-based mixed-language (mandarin-english) speech synthesis. In 2008 6th International Symposium on Chinese Spoken Language Processing, Vol. , pp. 1–4. External Links: Document, ISSN null Cited by: §1.
  • [19] Y. Qian, H. Liang, and F. K. Soong (2009-08) A cross-language state sharing and mapping approach to bilingual (mandarin–english) tts. IEEE Transactions on Audio, Speech, and Language Processing 17 (6), pp. 1231–1239. External Links: Document, ISSN 1558-7924 Cited by: §1.
  • [20] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2020) Fastspeech 2: fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558. Cited by: §1, §2.1, §3.2.
  • [21] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al. (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. Cited by: §1.
  • [22] M. Staib, T. H. Teh, A. Torresquintero, D. S. R. Mohan, L. Foglianti, R. Lenain, and J. Gao (2020) Phonological features for 0-shot multilingual speech synthesis. arXiv preprint arXiv:2008.04107. Cited by: 2nd item.
  • [23] C. Traber, K. Huber, K. Nedir, B. Pfister, E. Keller, and B. Zellner (1999) From multilingual to polyglot speech synthesis. In In Proceedings of Eurospeech 99, pp. 835–838. Cited by: §1.
  • [24] L. Wan, Q. Wang, A. Papir, and I. L. Moreno (2018) Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883. Cited by: Figure 5.
  • [25] D. Xin, Y. Saito, S. Takamichi, T. Koriyama, and H. Saruwatari (2020) Cross-lingual text-to-speech synthesis via domain adaptation and perceptual similarity regression in speaker space.. In INTERSPEECH, pp. 2947–2951. Cited by: §1.
  • [26] L. Xue, W. Song, G. Xu, L. Xie, and Z. Wu (2019) Building a mixed-lingual neural TTS system with only monolingual data. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, G. Kubin and Z. Kacic (Eds.), pp. 2060–2064. External Links: Link, Document Cited by: §1, §1.
  • [27] H. Zhan, H. Zhang, W. Ou, and Y. Lin (2021) Improve cross-lingual text-to-speech synthesis on monolingual corpora with pitch contour information. Cited by: §1.1, §1.
  • [28] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran (2019) Learning to speak fluently in a foreign language: multilingual speech synthesis and cross-language voice cloning. In Interspeech, External Links: Link Cited by: §1.1, §1.