
An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space

by Jihwan Lee et al.

With recent developments in cross-lingual Text-to-Speech (TTS) systems, L2 (second-language, or foreign) accent problems arise. Moreover, running subjective evaluations for such cross-lingual TTS systems is troublesome. Vowel space analysis, which is often utilized to explore various aspects of language including L2 accents, is a good alternative analysis tool. In this study, we apply the vowel space analysis method to explore L2 accents of cross-lingual TTS systems. Through the vowel space analysis, we make the following three observations: a) a parallel architecture (Glow-TTS) is less L2-accented than an auto-regressive one (Tacotron); b) L2 accents are more dominant in the non-shared vowels of a language pair; and c) L2 accents of cross-lingual TTS systems share some phenomena with those of human L2 learners. Our findings imply that TTS systems should handle each language pair differently, depending on linguistic characteristics such as non-shared vowels. They also hint that we can further incorporate linguistic knowledge in developing cross-lingual TTS systems.





1 Introduction

Over recent years, various approaches have been proposed to build cross-lingual Text-to-Speech (TTS) systems [34, 23, 33]. However, L2 (second-language) accents frequently appear in such cross-lingual scenarios, and several attempts have been made to improve nativeness [34, 23, 33]. To make matters worse, subjective evaluation of less widely spoken languages is challenging, especially for researchers without access to diverse listener pools; running a subjective evaluation test for every experiment is a burden unless one is a fluent polyglot.

The vowel space analysis method can serve as a good alternative. The vowel space refers to the two-dimensional area depicting the jaw opening and the tongue positions of vowels [10]. It is accompanied by formant analysis, since the jaw opening and the horizontal position of the tongue are highly related to the first and second formant frequencies (F1 and F2), respectively [10, 27]. It has been utilized to explore various aspects of language, including L2 accents [7, 16, 17, 14, 24, 6].

Abeysinghe et al. [1] propose to utilize the vowel space analysis as an intermediary evaluation tool to assess accents or dialects of English. In this paper, we extend their vowel space analysis approach [1] to cross-lingual TTS systems. We propose a cross-lingual vowel space analysis method and probe L2 accents in cross-lingual TTS systems. To the best of our knowledge, this is the first attempt to utilize vowel space analysis to investigate the L2 accents of cross-lingual TTS systems.

In this work, we focus on the following research questions by utilizing the vowel space analysis in cross-lingual scenarios:

  1. Do L2 accents of cross-lingual TTS systems differ by model architectures?

  2. How do L2 accents of cross-lingual TTS systems differ depending on linguistic characteristics among languages?

  3. Are any L2 accent phenomena observed in cross-lingual TTS systems also reported in the linguistics literature on actual human L2 learners?

To answer the questions above, we choose to explore cross-lingual TTS systems based on Tacotron [31, 28, 8] and Glow-TTS [18] as representative backbone models of auto-regressive and parallel (non auto-regressive) architectures, respectively. We assess the following two factors that are commonly explored in the vowel space analysis [16, 17, 14, 24, 6]: the accuracy and compactness of vowel categories. It has been reported that these two factors are highly correlated with L2 accents [14, 17]. The vowel accuracy is the correctness of the position of a synthesized vowel, compared to its native one on the vowel space. The vowel compactness, or the vowel variability, is one way to assess acoustic stability among realizations of each vowel [17].

Evaluating these metrics, we compare the shared and non-shared vowels in a language pair. The shared vowels are the vowels that exist in both languages of a language pair, while the non-shared vowels do not. We inspect some vulnerable cases in cross-lingual synthesis with respect to these linguistic characteristics. In addition, we investigate some L2 accent characteristics of TTS systems that also appear in experimental results on actual human L2 learners.

2 Methodology

2.1 Database

We utilize part of the CSS10 datasets [25] along with the LJ Speech dataset [15] and our in-house datasets recorded by professional voice actors. The total database consists of three American English (EN) speakers, two Korean (KO) speakers, and one speaker from each of the following languages: German (DE), Spanish (ES), French (FR), and Japanese (JA). Table 1 shows the detailed database information. We randomly select 300 utterances from each speaker as the test set.

2.2 TTS systems

We investigate two mainstream TTS architectures: auto-regressive and parallel (non-auto-regressive). We choose a Tacotron [31, 28] variant [8] as the representative model for the former, and Glow-TTS [18] for the latter. Both are suitable for cross-lingual analysis, as neither requires external forced alignment. The Tacotron variant is slightly modified to use the same acoustic features as LPCNet [29]; Glow-TTS is likewise adjusted to take these acoustic features as input. For simplicity, we refer to this Tacotron variant and the modified Glow-TTS as Tacotron and Glow-TTS, respectively. We use Bunched LPCNet2 [26] as the neural vocoder to transform acoustic features into actual waveforms.

We train these systems in single-speaker and multi-lingual versions. In the single-speaker version, we use the base architecture of each model [8, 18] without any speaker or language embedding, trained separately on one speaker's data. This single-speaker version is essentially mono-lingual, so we assume no L2 accent is present in this condition and use it as the non-L2-accented anchor. The multi-lingual version extends the single-speaker version with speaker and language embeddings drawn from a look-up table. In Tacotron, the speaker and language embeddings are fed to the decoder as in [34]. In Glow-TTS, the speaker embedding is applied as in the original architecture [18], and the language embedding is concatenated to the speaker embedding. We train the Tacotron single-speaker version for 350k steps and the multi-lingual version for 500k steps. Glow-TTS generally requires more training steps, so we train its single-speaker version for 500k steps and the multi-lingual version for 1000k steps. Other training configurations are identical to the original architectures [8, 34, 18].
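The multi-lingual conditioning described above can be sketched as a pair of look-up tables whose outputs are concatenated, as in the Glow-TTS variant. The table sizes and embedding dimension below are illustrative, not taken from the paper:

```python
import numpy as np

# Hypothetical table sizes: 9 speakers, 6 languages, 64-dim embeddings.
rng = np.random.default_rng(0)
N_SPEAKERS, N_LANGUAGES, DIM = 9, 6, 64
speaker_table = rng.normal(size=(N_SPEAKERS, DIM))
language_table = rng.normal(size=(N_LANGUAGES, DIM))

def conditioning_vector(speaker_id: int, language_id: int) -> np.ndarray:
    """Look up the speaker and language embeddings and concatenate them."""
    return np.concatenate([speaker_table[speaker_id],
                           language_table[language_id]])
```

In a real model these tables would be trainable parameters; the sketch only shows how the two look-ups combine into one conditioning vector.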

We utilize the International Phonetic Alphabet (IPA) [3] as phoneme input to the TTS systems. In the multi-lingual versions, we add a language tag to each phoneme to differentiate the same IPA phoneme across languages, since one IPA phoneme's actual acoustic realization as a phone may differ in each language. For instance, the phoneme input /i/ in English and the phoneme input /i/ in Korean are considered different input tokens. We limit our inspection scope to monophthongs, since analyzing diphthongs requires a time-dynamic analysis.
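As a rough illustration of the language-tagged phoneme input, the token format ("@" separator) and vocabulary class below are hypothetical; only the idea that /i/ in English and /i/ in Korean map to distinct input tokens comes from the text:

```python
def make_token(ipa_symbol: str, language: str) -> str:
    """Combine an IPA symbol with a language tag into one unique token
    (the "@" separator is a hypothetical convention)."""
    return f"{ipa_symbol}@{language}"

class PhonemeVocab:
    """Toy vocabulary that assigns consecutive IDs to unseen tokens."""
    def __init__(self):
        self.token_to_id = {}

    def encode(self, ipa_symbols, language):
        ids = []
        for sym in ipa_symbols:
            tok = make_token(sym, language)
            if tok not in self.token_to_id:
                self.token_to_id[tok] = len(self.token_to_id)
            ids.append(self.token_to_id[tok])
        return ids

vocab = PhonemeVocab()
en_ids = vocab.encode(["i"], "EN")  # /i/ tagged as English
ko_ids = vocab.encode(["i"], "KO")  # the same IPA symbol tagged as Korean
# The two calls yield different token IDs for the same IPA symbol.
```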

Language  Code  Utterances  Duration (h)  # of Vowels
German    DE     7,427      16.7          17
English   EN    39,455      59.5          12
Spanish   ES    11,110      23.8           5
French    FR     8,649      19.2          14
Japanese  JA     6,841      14.9           5
Korean    KO    24,962      34.0           7
Table 1: The number of utterances, the duration, and the number of vowels (monophthongs) of each language in the database. The exact number of vowels may vary by region and reference [19, 5, 21, 20, 30].

2.3 Vowel space analysis method

We extend the previous vowel space analysis approach [1] from mono-lingual to cross-lingual scenarios. Our vowel analysis method comprises the following steps. First, we extract the first and second formants (F1 and F2) of synthesized vowels. Next, we apply speaker normalization to the formant values. Lastly, we measure vowel accuracy and compactness for analysis.

2.3.1 Formant estimation

We first extract formant values. To synthesize vowels from the TTS systems, we give text input as a sequence of tokens beginning with a target vowel, followed by a silence token and a random sentence starting with a voiceless plosive consonant such as /p/, /t/, or /k/. Afterwards, we slice the synthesized speech at the first silence boundary, with the energy threshold set to 10 dB, and take the first segment. We empirically observed that appending a sentence beginning with a voiceless plosive consonant induces a clearer silence boundary, easing the automatic slicing process. As in [1], the F1 and F2 values are extracted at the middle point of the chosen segment. We synthesize 100 samples per vowel and take the median as the representative value.
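The slicing step above can be sketched with frame energies; interpreting the 10 dB threshold as relative to the peak frame energy is our assumption, and the frame/hop sizes are illustrative:

```python
import numpy as np

def slice_first_segment(wav, sr, frame_ms=25, hop_ms=10, threshold_db=10.0):
    """Cut the waveform at the first silence boundary and return the first
    voiced segment (the target vowel) plus its midpoint time in seconds.
    Frames more than `threshold_db` below the peak frame energy count
    as silence (an assumption about the paper's 10 dB threshold)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = []
    for start in range(0, len(wav) - frame + 1, hop):
        seg = wav[start:start + frame]
        energies.append(10 * np.log10(np.mean(seg ** 2) + 1e-12))
    energies = np.array(energies)
    voiced = energies > energies.max() - threshold_db
    start_idx = int(np.argmax(voiced))          # first voiced frame
    rest = voiced[start_idx:]
    end_off = int(np.argmin(rest)) if not rest.all() else len(rest)
    end_idx = start_idx + end_off               # first silent frame after it
    start_s, end_s = start_idx * hop, end_idx * hop
    return wav[start_s:end_s], (start_s + end_s) / 2 / sr
```

A real pipeline would then run a formant tracker at the returned midpoint; that part is omitted here.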

Figure 1: The normalized vowel space of a male Korean speaker, which corresponds to Korean phonology [20].

Our approach has several advantages over the previous method [1]. First, it requires no external aligner or segmentation tool, so it is free from any errors such a tool may introduce, and it can be applied to low-resource languages for which no aligner or segmentation tool is publicly available. Second, it minimizes the co-articulation effect of nearby phonemes in cross-lingual scenarios. For example, the conventional hVd test [13] does not apply equally to some languages, such as French and Japanese, where syllable-initial /h/ and syllable-final /d/, respectively, are not allowed. Third, it is closer to the actual usage scenario of TTS systems. In the previous approach [1], multiple hVd words with in-between pauses are given as text input, which deviates slightly from what a TTS system is usually trained on; our method is more coherent with real-world TTS usage.

2.3.2 Speaker normalization

The raw numerical F1 and F2 values in the vowel space vary by speaker [22]. To extend Abeysinghe et al.'s approach [1] to cross-lingual TTS systems, a speaker normalization step must therefore come first. We apply the widely used Lobanov normalization method [22] to remove speaker-specific variation from the vowel space, as linguistics researchers have repeatedly demonstrated its effectiveness [2, 9].
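Lobanov normalization is a per-speaker z-score over each formant dimension; a minimal sketch:

```python
import numpy as np

def lobanov_normalize(formants):
    """Lobanov speaker normalization: z-score each formant dimension
    (F1, F2) using the mean and standard deviation computed over all
    of one speaker's vowel tokens."""
    f = np.asarray(formants, dtype=float)   # shape: (n_tokens, 2)
    return (f - f.mean(axis=0)) / f.std(axis=0)
```

After normalization, every speaker's vowel tokens have zero mean and unit variance per formant, making vowel positions comparable across speakers.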

2.3.3 Vowel space visualization

After the speaker normalization process, we visualize the vowels on the vowel space diagram, where the normalized F2 and F1 values are plotted on the reversed horizontal and vertical axes, respectively. Figure 1 shows the vowel space of a male Korean speaker after the formant estimation and speaker normalization steps. The positions of the vowels in the vowel space correspond to those from Korean phonology [20].

2.3.4 Vowel accuracy and compactness

We focus on the following two factors that are commonly investigated in cross-lingual analysis: accuracy and compactness of vowel categories [16, 17, 14, 24, 6]. It has been frequently reported that L2 accent perception is highly correlated with these two factors [14, 17]. The vowel accuracy, or correctness, is the distance to the native production of the target vowel [6]. Assuming that non-cross-lingual TTS systems are not L2-accented, we measure vowel accuracy by calculating the Euclidean distance between the non-accented target vowel and the corresponding cross-lingually synthesized vowel; a greater vowel distance indicates more L2-accented speech. The vowel compactness, or variability, is one way to evaluate acoustic stability among vowel productions [17]. It is measured by calculating the standard deviation of each vowel; a greater standard deviation, i.e., less compactness, implies more L2-accented speech.
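Both metrics can be computed directly on the normalized (F1, F2) values; averaging the per-dimension standard deviations is our assumption about the pooling, which the text does not spell out:

```python
import numpy as np

def vowel_distance(native, crosslingual):
    """Vowel accuracy: Euclidean distance on the normalized (F1, F2)
    plane between the native anchor vowel and its cross-lingually
    synthesized counterpart (larger = more L2-accented)."""
    return float(np.linalg.norm(np.asarray(native) - np.asarray(crosslingual)))

def vowel_compactness(tokens):
    """Vowel compactness/variability: standard deviation across the
    realizations of one vowel, averaged over the F1 and F2 dimensions
    (the averaging is our assumption)."""
    t = np.asarray(tokens, dtype=float)     # shape: (n_realizations, 2)
    return float(t.std(axis=0).mean())
```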

3 Results and discussion

We first examine whether L2 accents differ by model architecture. Table 2 shows the mean vowel distance and standard deviation of cross-lingually synthesized vowels from Tacotron and Glow-TTS. Glow-TTS exhibits lower values than Tacotron in both metrics, implying that the L2 accent is more severe in Tacotron. This phenomenon may arise from their different auto-regressive natures: auto-regressive systems are more influenced by preceding input tokens during training than their parallel counterparts. However, further investigation is required to understand more clearly why Tacotron performs worse than Glow-TTS in cross-lingual speech synthesis.

Model      Distance (↓)             Standard Deviation (↓)
           shared    non-shared     shared    non-shared
Tacotron   0.720     0.824          0.356     0.492
Glow-TTS   0.584     0.715          0.285     0.298
Table 2: The mean vowel distance and standard deviation across different TTS systems. The shared vowels are those that appear in both languages of a language pair, while the non-shared vowels exist in only one language of the pair. A smaller vowel distance and standard deviation indicate more accurate and more stable vowel production, respectively.
(a) Multi-lingual Tacotron
(b) Multi-lingual Glow-TTS
Figure 2: The mean vowel distances in each language pair. Darker blue indicates a greater vowel distance, i.e., lower vowel accuracy. Relatively low vowel accuracy is observed when the target language is German, or when the source language is Japanese or Spanish.

We also explore whether L2 accents of cross-lingual TTS systems are influenced by linguistic characteristics among languages, and whether any of our observations appear in the linguistics literature on actual human L2 learners. We begin by comparing shared and non-shared vowels in a language pair. The shared vowels are those that appear in both languages of a pair; for example, /i/ and /u/ are shared in the English-German pair. On the contrary, the non-shared vowels appear in only one language of a pair, such as the German umlaut /y/. As shown in Table 2, vowels that are not shared in a language pair exhibit lower vowel accuracy and compactness, indicating that non-shared vowels are more prone to L2 accents. One possible explanation is that unseen sounds are simply harder to process. Since the shared vowels in different languages are regarded as different text input tokens, it is unlikely that this result is caused by sharing the same input tokens. A similar phenomenon is observed in studies on human L2 learners: adult L2 learners exhibit difficulty with non-native phonological segments [11].
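The shared/non-shared split above is plain set algebra over two vowel inventories. The inventories below are simplified illustrations, not the exact ones used in the paper:

```python
# Illustrative monophthong inventories (simplified; real inventories
# vary by region and reference, as Table 1 notes).
english = {"i", "ɪ", "u", "ʊ", "ɛ", "æ", "ʌ", "ɑ", "ɔ", "ə"}
german = {"i", "ɪ", "u", "ʊ", "ɛ", "a", "o", "e", "ø", "œ", "y", "ʏ"}

shared = english & german        # vowels in both languages of the pair
non_shared = english ^ german    # vowels in only one language of the pair
```

Under these toy inventories, /i/ and /u/ land in the shared set, while the German umlaut /y/ falls in the non-shared set, matching the examples in the text.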

As shown in Figure 2, the vowel accuracy is lowest when the target language is German, regardless of the source language. It is also low when the source language is either Spanish or Japanese, especially in Tacotron. This result may be related to the number of vowels each language has: German has approximately 17 vowels (monophthongs), while Spanish and Japanese have only five. When German is the target language, more unseen vowel categories must be created from the source language, and this effect is most prominent for the five-vowel languages. Similarly, human L2 learners experience more difficulty as the number of new sound categories they need to learn increases [12, 32]. Poor performance is also observed in Glow-TTS when the target language is Japanese, as in Figure 2(b). This may arise from the lack of explicit pitch accent information, which such parallel systems often require to properly model Japanese.

We inspect some vowels that are especially prone to L2 accents. To list a few, the French nasal vowels and the Korean vowel /ɯ/ rank at the top, probably because these vowels exist uniquely in each language. Figure 3 depicts the Korean vowel /ɯ/ on the vowel space along with other cross-lingually synthesized /ɯ/ samples. All of the cross-lingually synthesized samples are located in the upper region compared to the native one, which indicates too much jaw closing. The cross-lingual Korean /ɯ/ is least distant when the source language is Japanese. One possible reason is that the Japanese /u/ is usually realized as [ɯᵝ] (a close back compressed vowel), which is slightly closer to the Korean /ɯ/ [30]. Hence, when the source language is Japanese, the Korean /ɯ/ is relatively less unseen than for the European languages. The realizations of /ɯ/ tend to be more distant from the Korean native one when the source language is one of the three European languages (German, English, and Spanish), as they have vowel systems distinct from Korean. In addition, a linguistics study suggests that Korean-English bilingual children often assimilate the Korean /ɯ/ to /u/ [4]; this partly corresponds to our findings, as the jaw opening value (F1) of the cross-lingual /ɯ/ from English is closer to /u/.

Our findings suggest that it may be necessary to consider linguistic characteristics among languages to mitigate the L2 accent issue. For instance, an additional module, embedding, or loss function properly reflecting the linguistic characteristics of a language pair may help reduce L2 accents in cross-lingual scenarios. These observations also imply that we can utilize linguistics studies conducted on humans to develop TTS systems. Our work may even serve as auxiliary information for building a new database or balancing an existing one.

Figure 3: Cross-lingual realizations of the Korean vowel /ɯ/ from various languages, shown on the vowel space diagram. Note that the diagram is zoomed in for better visualization.

4 Conclusion and Future Work

In this study, extending the vowel space analysis method to cross-lingual TTS systems, we explore L2 accents in such systems. We also compare some of our observations on cross-lingual TTS systems with linguistics studies conducted on humans. We hope our interdisciplinary study provides inspiration in relevant fields.

Building on these findings, we plan to mitigate the L2 accent problem in cross-lingual TTS models by proposing additional loss functions based on the linguistic characteristics of a language pair.


  • [1] B. N. Abeysinghe, J. James, C. Watson, and F. Marattukalam (2022) Visualising Model Training via Vowel Space for Text-To-Speech Systems. In Proc. Interspeech, pp. 511–515. External Links: Document Cited by: §1, §2.3.1, §2.3.1, §2.3.2, §2.3.
  • [2] P. Adank, R. Smits, and R. Van Hout (2004) A comparison of vowel normalization procedures for language variation research. The Journal of the Acoustical Society of America 116 (5), pp. 3099–3107. Cited by: §2.3.2.
  • [3] I. P. Association (1999) Handbook of the international phonetic association: a guide to the use of the international phonetic alphabet. Cambridge University Press. Cited by: §2.2.
  • [4] W. Baker and P. Trofimovich (2005) Interaction of native-and second-language vowel system (s) in early and late bilinguals. Language and speech 48 (1), pp. 1–27. Cited by: §3.
  • [5] B. Collins and I. M. Mees (2013) Practical phonetics and phonology: a resource book for students. Routledge. Cited by: Table 1.
  • [6] S. Dimov and A. Bradlow (2013) Non-native vowel production accuracy and variability in relation to overall intelligibility. The Journal of the Acoustical Society of America 134 (5), pp. 4107–4107. Cited by: §1, §1, §2.3.4.
  • [7] A. Dowd, J. Smith, and J. Wolfe (1998) Learning to pronounce vowel sounds in a foreign language using acoustic measurements of the vocal tract as feedback in real time. Language and Speech 41 (1), pp. 1–20. Cited by: §1.
  • [8] N. Ellinas, G. Vamvoukakis, K. Markopoulos, A. Chalamandaris, G. Maniati, P. Kakoulidis, S. Raptis, J. S. Sung, H. Park, and P. Tsiakoulis (2020) High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency. In Proc. Interspeech, pp. 2022–2026. External Links: Document Cited by: §1, §2.2, §2.2.
  • [9] A. H. Fabricius, D. Watt, and D. E. Johnson (2009) A comparison of three speaker-intrinsic vowel formant frequency normalization algorithms for sociophonetics. Language Variation and Change 21 (3), pp. 413–435. Cited by: §2.3.2.
  • [10] G. Fant (1973) Speech sounds and features. Cited by: §1.
  • [11] J. E. Flege (2003) Assessing constraints on second-language segmental production and perception. Phonetics and phonology in language comprehension and production: Differences and similarities 6, pp. 319–355. Cited by: §3.
  • [12] O. B. E. Flege (1997) Perception and production of a new vowel category. Second-language speech: Structure and process 13, pp. 53. Cited by: §3.
  • [13] J. Hillenbrand, L. A. Getty, M. J. Clark, and K. Wheeler (1995) Acoustic characteristics of american english vowels. The Journal of the Acoustical society of America 97 (5), pp. 3099–3111. Cited by: §2.3.1.
  • [14] M. K. Huffman and K. S. Schuhmann (2020) The relation between l1 and l2 category compactness and l2 vot learning. In Proc. of Meetings on Acoustics 179ASA, Vol. 42, pp. 060011. Cited by: §1, §1, §2.3.4.
  • [15] K. Ito and L. Johnson (2017) The LJ Speech dataset. Cited by: §2.1.
  • [16] N. Kartushina and U. H. Frauenfelder (2014) On the effects of l2 perception and of individual differences in l1 production on l2 pronunciation. Frontiers in psychology 5, pp. 1246. Cited by: §1, §1, §2.3.4.
  • [17] N. Kartushina, A. Hervais-Adelman, U. H. Frauenfelder, and N. Golestani (2016) Mutual influences between native and non-native vowels in production: evidence from short-term visual articulatory feedback training. Journal of Phonetics 57, pp. 21–39. Cited by: §1, §1, §2.3.4.
  • [18] J. Kim, S. Kim, J. Kong, and S. Yoon (2020) Glow-tts: a generative flow for text-to-speech via monotonic alignment search. NeurIPS 33, pp. 8067–8077. Cited by: §1, §2.2, §2.2.
  • [19] S. Kleiner (2015) Duden–das aussprachewörterbuch. bearbeitet von stefan kleiner und ralf knöbl in zusammenarbeit mit der dudenredaktion. 7., komplett überarb. und aktual. Aufl. Berlin. Cited by: Table 1.
  • [20] C. Kwak (2003) The vowel system of contemporary korean and direction of change. Journal of Korean Linguistics 41, pp. 59–91. Cited by: Figure 1, §2.3.3, Table 1.
  • [21] P. Ladefoged and K. Johnson (2014) A course in phonetics. Cengage learning. Cited by: Table 1.
  • [22] B. M. Lobanov (1971) Classification of russian vowels spoken by different speakers. The Journal of the Acoustical Society of America 49 (2B), pp. 606–608. Cited by: §2.3.2.
  • [23] E. Nachmani and L. Wolf (2019) Unsupervised polyglot text-to-speech. In Proc. ICASSP, pp. 7055–7059. Cited by: §1.
  • [24] J. Nycz and L. Hall-Lew (2013) Best practices in measuring vowel merger. In Proc. of Meetings on Acoustics 166ASA, Vol. 20, pp. 060008. Cited by: §1, §1, §2.3.4.
  • [25] K. Park and T. Mulc (2019) CSS10: a collection of single speaker speech datasets for 10 languages. Interspeech. Cited by: §2.1.
  • [26] S. Park, K. Choo, J. Lee, A. V. Porov, K. Osipov, and J. S. Sung (2022) Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge. In Proc. Interspeech, pp. 808–812. External Links: Document Cited by: §2.2.
  • [27] S. Sandoval, V. Berisha, R. L. Utianski, J. M. Liss, and A. Spanias (2013) Automatic assessment of vowel space area. The Journal of the Acoustical Society of America 134 (5), pp. EL477–EL483. Cited by: §1.
  • [28] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, R. A. Saurous, Y. Agiomvrgiannakis, and Y. Wu (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In Proc. ICASSP, Vol. , pp. 4779–4783. External Links: Document Cited by: §1, §2.2.
  • [29] J. Valin and J. Skoglund (2019) LPCNet: improving neural speech synthesis through linear prediction. In ICASSP, pp. 5891–5895. Cited by: §2.2.
  • [30] T. J. Vance (2008) The sounds of japanese with audio cd. Cambridge University Press. Cited by: Table 1, §3.
  • [31] Y. Wang, R.J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous (2017) Tacotron: Towards End-to-End Speech Synthesis. In Proc. Interspeech, pp. 4006–4010. External Links: Document Cited by: §1, §2.2.
  • [32] S. H. Weinberger (1997) Minimal segments in second language phonology. Second language speech: Structure and process, pp. 263–312. Cited by: §3.
  • [33] J. Ye, H. Zhou, Z. Su, W. He, K. Ren, L. Li, and H. Lu (2022) Improving cross-lingual speech synthesis with triplet training scheme. In ICASSP, pp. 6072–6076. Cited by: §1.
  • [34] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R.J. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran (2019) Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning. In Proc. Interspeech, pp. 2080–2084. Cited by: §1, §2.2.