In recent years, various approaches have been proposed to build cross-lingual Text-to-Speech (TTS) systems [34, 23, 33]. However, L2 (second-language) accents frequently appear in such cross-lingual scenarios, and several attempts have been made to improve their nativeness [34, 23, 33]. To make matters worse, subjective evaluation of less widely spoken languages is challenging, especially for researchers without easy access to native listeners; running a subjective evaluation test for every experiment is impractical unless one is a fluent polyglot.
The vowel space analysis method is a promising alternative. The vowel space refers to the two-dimensional area depicting the jaw opening and tongue positions of vowels. It is closely tied to formant analysis, since the jaw opening and the horizontal position of the tongue are highly correlated with the first and second formant frequencies (F1 and F2), respectively [10, 27]. It has been utilized to explore various aspects of language, including L2 accents [7, 16, 17, 14, 24, 6].
Abeysinghe et al. propose utilizing vowel space analysis as an intermediary evaluation tool to assess accents or dialects of English. In this paper, we extend their vowel space analysis approach to cross-lingual TTS systems. We propose a cross-lingual vowel space analysis method and probe L2 accents in cross-lingual TTS systems. To the best of our knowledge, this is the first attempt to utilize vowel space analysis to investigate the L2 accents of cross-lingual TTS systems.
In this work, we focus on the following research questions, by utilizing the vowel space analysis in cross-lingual scenarios:
Do L2 accents of cross-lingual TTS systems differ by model architecture?
How do L2 accents of cross-lingual TTS systems differ according to linguistic characteristics among languages?
Are any L2 accent observations in cross-lingual TTS systems also reported in the linguistics literature on actual human L2 learners?
To answer the questions above, we choose to explore cross-lingual TTS systems based on Tacotron [31, 28, 8] and Glow-TTS as representative backbone models of auto-regressive and parallel (non auto-regressive) architectures, respectively. We assess the following two factors that are commonly explored in vowel space analysis [16, 17, 14, 24, 6]: the accuracy and compactness of vowel categories. It has been reported that these two factors are highly correlated with L2 accents [14, 17]. The vowel accuracy is the correctness of the position of a synthesized vowel compared to its native counterpart on the vowel space. The vowel compactness, or vowel variability, is one way to assess acoustic stability among realizations of each vowel.
Evaluating these metrics, we compare the shared and non-shared vowels in a language pair. The shared vowels are those that exist in both languages of a language pair, while the non-shared vowels do not. We inspect some of the vulnerable cases in cross-lingual synthesis with regard to these linguistic characteristics. In addition, we investigate L2 accent characteristics of TTS systems that are also present in experimental results obtained from actual human L2 learners.
We utilize part of the CSS10 dataset along with the LJ Speech dataset and our in-house datasets recorded by professional voice actors. The total database consists of three American English (EN) speakers, two Korean (KO) speakers, and one speaker from each of the following languages: German (DE), Spanish (ES), French (FR), and Japanese (JA). Table 1 shows the detailed database information. We randomly select 300 utterances from each speaker as the test set.
2.2 TTS systems
We investigate two mainstream architectures of TTS systems: auto-regressive and parallel (non auto-regressive). We choose a Tacotron [31, 28] variant as the representative model for the former and Glow-TTS for the latter. They are suitable for cross-lingual analysis, as no external forced alignment is required. The Tacotron variant is slightly modified to utilize the same acoustic features as LPCNet. Glow-TTS is also slightly adjusted to take the same acoustic features as input. In this paper, for simplicity, we refer to this Tacotron variant and the modified Glow-TTS as Tacotron and Glow-TTS, respectively. We use Bunched LPCNet2 as a neural vocoder to transform acoustic features into actual waveforms.
We train these systems in single-speaker and multi-lingual versions. In the single-speaker version, we use the base architecture of each model [8, 18] without any speaker or language embedding, training separately on one speaker's data. This single-speaker version is essentially mono-lingual; therefore, we assume that no L2 accent is present in this condition and set it as the non-L2-accented anchor. The multi-lingual version extends the single-speaker version by utilizing a speaker embedding and a language embedding from a look-up table. In the case of Tacotron, the speaker and language embeddings are fed to the decoder. In Glow-TTS, the speaker embedding is applied as in the original architecture, and the language embedding is concatenated to the speaker embedding. We train the Tacotron single-speaker version for 350k steps and the multi-lingual version for 500k steps. Glow-TTS generally requires more training steps, so we train its single-speaker version for 500k steps and the multi-lingual version for 1000k steps. Other training configurations are identical to the original architectures [8, 34, 18].
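The conditioning described above can be sketched as a simple look-up-and-concatenate step. The table sizes and embedding dimension below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

# Hypothetical table sizes and embedding dimension (not specified in the paper).
N_SPEAKERS, N_LANGUAGES, DIM = 9, 6, 16
rng = np.random.default_rng(0)
speaker_table = rng.normal(size=(N_SPEAKERS, DIM))    # speaker look-up table
language_table = rng.normal(size=(N_LANGUAGES, DIM))  # language look-up table

def conditioning(speaker_id, language_id):
    """Look up both embeddings and concatenate them, mirroring the
    multi-lingual Glow-TTS conditioning described above (sketch)."""
    return np.concatenate([speaker_table[speaker_id],
                           language_table[language_id]])

cond = conditioning(2, 4)   # one conditioning vector of size 2 * DIM
```

In a real system these tables would be trainable parameters inside the model; the concatenated vector is what the decoder (Tacotron) or flow (Glow-TTS) would consume.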
We utilize the International Phonetic Alphabet (IPA) as phoneme input to the TTS systems. In the multi-lingual versions, we add a language tag to each phoneme to differentiate the same IPA phoneme across languages, as the actual acoustic realization of an IPA phoneme may differ in each language. For instance, the phoneme input /i/ in English and the same phoneme input /i/ in Korean are considered different input tokens. We limit our inspection scope to monophthongs, since analyzing diphthongs requires time-dynamic analysis.
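A minimal sketch of such language-tagged tokenization follows. The exact token format used in the paper is not specified, so the `lang:phoneme` scheme here is purely illustrative:

```python
def tag_phonemes(ipa_phonemes, lang_code):
    """Attach a language tag to each IPA phoneme so that identical IPA
    symbols from different languages become distinct input tokens
    (illustrative scheme; the paper's actual format is unspecified)."""
    return [f"{lang_code}:{p}" for p in ipa_phonemes]

# English /i/ and Korean /i/ now map to different tokens.
en_tokens = tag_phonemes(["i", "u"], "EN")
ko_tokens = tag_phonemes(["i", "u"], "KO")
```

Each distinct tagged token would then get its own row in the model's input embedding table, which is why shared IPA symbols do not share parameters across languages.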
| Language | Code | Utterances | Duration (h) | # of Vowels |
2.3 Vowel space analysis method
We extend the previous vowel space analysis approach from mono-lingual to cross-lingual scenarios. Our vowel analysis method comprises the following steps. First, we extract the first and second formants of synthesized vowels. Next, we apply speaker normalization to the formant values. Lastly, we measure vowel accuracy and compactness for analysis.
2.3.1 Formant estimation
We begin by extracting formant values. To synthesize vowels from the TTS systems, we provide text input as a sequence of tokens: a target vowel, followed by a silence token and a random sentence starting with a voiceless plosive consonant such as /p/, /t/, or /k/. Afterwards, we slice the synthesized speech at the first silence boundary and take the first segment, setting the threshold energy to 10 dB. We empirically observed that appending a sentence beginning with a voiceless plosive induces a clearer silence boundary, easing the automatic slicing process. As in previous work, the F1 and F2 values are extracted at the middle point of the chosen segment. We synthesize 100 samples per vowel and take the median as the representative value.
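The paper does not name the formant tracker it uses, so the following is only a self-contained sketch of the classic LPC-based approach to F1/F2 estimation (autocorrelation method, Levinson-Durbin, pole angles), demonstrated on a synthetic two-formant signal rather than on TTS output:

```python
import numpy as np

def lpc(signal, order):
    """LPC coefficients via the autocorrelation method + Levinson-Durbin."""
    n = len(signal)
    r = np.correlate(signal, signal, mode="full")[n - 1 : n + order]
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        acc = np.dot(a, r[i:0:-1])   # sum_j a[j] * r[i - j]
        k = -acc / err               # reflection coefficient
        a = np.append(a, 0.0)
        a = a + k * a[::-1]
        err *= 1.0 - k * k
    return a

def estimate_formants(frame, fs, order=8, max_bw=600.0):
    """Formant frequencies (Hz) from LPC pole angles, discarding real poles
    and resonances broader than max_bw."""
    a = lpc(frame * np.hamming(len(frame)), order)
    formants = []
    for z in np.roots(a):
        if z.imag <= 0:
            continue
        freq = np.angle(z) * fs / (2.0 * np.pi)
        bw = -np.log(np.abs(z)) * fs / np.pi
        if 90.0 < freq < fs / 2.0 - 50.0 and bw < max_bw:
            formants.append(freq)
    return sorted(formants)

# Demo: white noise driving two cascaded resonators at 500 Hz and 1500 Hz,
# a crude stand-in for a vowel with F1 = 500 Hz and F2 = 1500 Hz.
fs = 10_000

def resonator(freq, bw):
    r = np.exp(-np.pi * bw / fs)
    theta = 2.0 * np.pi * freq / fs
    return np.array([1.0, -2.0 * r * np.cos(theta), r * r])

den = np.convolve(resonator(500.0, 80.0), resonator(1500.0, 80.0))
rng = np.random.default_rng(0)
exc = rng.standard_normal(4096)
x = np.zeros_like(exc)
for n in range(len(x)):              # direct-form IIR filtering
    acc = exc[n]
    for k in range(1, len(den)):
        if n - k >= 0:
            acc -= den[k] * x[n - k]
    x[n] = acc

formants = estimate_formants(x, fs)  # should recover roughly 500 and 1500 Hz
```

In the paper's pipeline this estimation would run on the mid-point frame of the automatically sliced vowel segment, with the median over 100 syntheses taken as the representative (F1, F2).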
Our approach has several advantages over the previous method. First, it requires no external aligner or segmentation, so it is free from any potential errors caused by an external segmentation tool; it can also be applied to low-resource languages for which no aligner or segmentation tool is publicly available. Second, it minimizes co-articulation effects from nearby phonemes in cross-lingual scenarios. For example, the conventional hVd test does not apply equally to languages such as French and Japanese, where syllable-initial /h/ and syllable-final /d/, respectively, are not allowed. Third, it is closer to the actual usage scenario of TTS systems than the previous method, in which multiple hVd words with in-between pauses are given as text input; such input deviates from what a TTS system is usually trained on, whereas our method is more coherent with real-world usage.
2.3.2 Speaker normalization
The raw numerical F1 and F2 values in the vowel space vary by speaker. In order to extend Abeysinghe et al.'s approach to cross-lingual TTS systems, a speaker normalization process must precede the analysis. We apply the well-established Lobanov normalization method to remove speaker information from the vowel space, as many linguistic studies have demonstrated its effectiveness [2, 9].
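Lobanov normalization is simply a per-speaker z-score of each formant dimension. A minimal sketch, with hypothetical raw formant values standing in for measured ones:

```python
import numpy as np

def lobanov(formants):
    """Lobanov (1971) normalization: z-score each formant dimension within
    one speaker, removing speaker-specific vocal-tract differences."""
    f = np.asarray(formants, dtype=float)   # shape (n_tokens, 2) for (F1, F2)
    return (f - f.mean(axis=0)) / f.std(axis=0)

# Hypothetical raw (F1, F2) values in Hz for one speaker's vowel tokens.
raw = np.array([
    [300.0, 2300.0],   # roughly /i/-like
    [700.0, 1200.0],   # roughly /a/-like
    [350.0,  800.0],   # roughly /u/-like
])
z = lobanov(raw)       # each column now has zero mean and unit variance
```

Because each speaker's formants are standardized against that speaker's own mean and spread, vowels from different speakers (and languages) become directly comparable on one vowel space.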
2.3.3 Vowel space visualization
After the speaker normalization process, we visualize the vowels on the vowel space diagram, where the normalized F2 and F1 values are depicted on the reversed horizontal and vertical axes, respectively. Figure 1 shows the vowel space of a male Korean speaker after the formant estimation and speaker normalization processes. The positions of the vowels in the vowel space correspond to those from Korean phonology.
2.3.4 Vowel accuracy and compactness
We focus on the following two factors that are commonly investigated in cross-lingual analysis: accuracy and compactness of vowel categories [16, 17, 14, 24, 6]. It has been frequently reported that L2 accent perception is highly correlated with these two factors [14, 17]. The vowel accuracy, or correctness, is the distance to the native productions of the target vowels. Assuming that non-cross-lingual TTS systems are not L2-accented, we measure the vowel accuracy by calculating the Euclidean distance between the non-accented target vowel and the corresponding cross-lingually synthesized vowel. A greater vowel distance indicates more L2-accented speech. The vowel compactness, or variability, is one way to evaluate acoustic stability among vowel productions. It is measured by calculating the standard deviation of each vowel; a greater standard deviation, i.e., less compactness, implies more L2-accented speech.
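The two metrics above reduce to short computations in the normalized (F1, F2) space. A sketch, where the aggregation of per-dimension standard deviations into one compactness number is our assumption (the paper does not spell it out):

```python
import numpy as np

def vowel_accuracy(synth, native):
    """Euclidean distance between a cross-lingually synthesized vowel and
    its non-accented anchor in normalized (F1, F2) space; a larger
    distance indicates stronger L2 accent."""
    return float(np.linalg.norm(np.asarray(synth, float) - np.asarray(native, float)))

def vowel_compactness(realizations):
    """Variability among repeated realizations of one vowel: here the mean
    of the per-dimension standard deviations (one plausible reading of
    'the standard deviation of each vowel'); smaller = more compact."""
    r = np.asarray(realizations, dtype=float)
    return float(r.std(axis=0).mean())

dist = vowel_accuracy([0.0, 0.0], [3.0, 4.0])          # 5.0
spread = vowel_compactness([[1.0, 2.0], [1.0, 2.0]])   # 0.0 (identical tokens)
```

In the paper's setting, `native` would come from the single-speaker (mono-lingual) system and `synth`/`realizations` from the multi-lingual system's cross-lingual output.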
3 Results and discussion
We first examine whether L2 accents differ by model architecture. Table 2 shows the mean vowel distance and standard deviation of cross-lingually synthesized vowels from Tacotron and Glow-TTS. Glow-TTS exhibits lower values than Tacotron on both metrics, implying that the L2 accent is more severe in Tacotron. This phenomenon may arise from their different natures regarding auto-regressiveness: auto-regressive systems are more influenced by preceding input tokens during training than their parallel counterparts. However, further investigation is required to understand more clearly why Tacotron performs worse than Glow-TTS in cross-lingual speech synthesis.
| Model | Distance | Standard Deviation |
We also explore whether L2 accents of cross-lingual TTS systems are influenced by linguistic characteristics among languages, and whether any of the observations appear in linguistics literature on actual human L2 learners. We begin by comparing shared and non-shared vowels in a language pair. The shared vowels are the vowels that appear in both languages of a pair; for example, /i/ and /u/ are shared vowels in the English-German pair. On the contrary, the non-shared vowels appear in only one language of a pair, such as the German umlaut /y/. As shown in Table 2, the vowels that are not shared in a language pair exhibit lower vowel accuracy and compactness, indicating that the non-shared vowels are more prone to L2 accents. One possible explanation is that unseen sounds are usually harder to process. Considering that shared vowels in different languages are regarded as different text input tokens, it is unlikely that this result is caused by sharing the same input tokens. A similar phenomenon is also observed in studies on human L2 learners: adult L2 learners exhibit difficulty with non-native phonological segments.
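The shared/non-shared split is a plain set operation over two vowel inventories. The simplified monophthong inventories below are illustrative only and do not reproduce the paper's exact per-language sets:

```python
# Simplified monophthong inventories (illustrative; not the paper's Table 1).
EN = {"i", "ɪ", "u", "ʊ", "ɛ", "æ", "ʌ", "ɑ", "ɔ", "ə"}
DE = {"i", "ɪ", "y", "ʏ", "e", "ø", "œ", "ɛ", "a", "o", "ɔ", "u", "ʊ", "ə"}

shared = EN & DE        # vowels present in both languages of the pair
de_only = DE - EN       # non-shared, e.g. the front rounded /y/, /ø/
en_only = EN - DE       # non-shared in the other direction
```

Per-vowel accuracy and compactness scores can then be averaged separately over `shared` and the non-shared sets to reproduce the comparison described above.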
The vowel accuracy is lowest when the target language is German, regardless of the source language, as shown in Figure 2. It is also low when the source language is Spanish or Japanese, especially in Tacotron. This result may be related to the number of vowels in each language: German has approximately 17 vowels (monophthongs), while Spanish and Japanese have only five each. When German is the target language, more unseen vowel sound categories must be created from the source language, and this effect is most prominent for the five-vowel languages. Similarly, human L2 learners experience more difficulty as the number of new sound categories they need to learn increases [12, 32]. Poor performance is also observed when the target language is Japanese in Glow-TTS, as shown in Figure 2(b). This may arise from the lack of explicit pitch accent information, which such parallel systems often require to properly model the Japanese language.
We inspect some vowels that are especially prone to L2 accents. To list a few, the French nasal vowels and the Korean vowel /ɯ/ rank at the top, probably because these vowels exist uniquely in each language. Figure 3 depicts the Korean vowel /ɯ/ on the vowel space along with cross-lingually synthesized /ɯ/ samples. All of the cross-lingually synthesized samples are located in the upper region compared to the native one, indicating excessive jaw closing. The cross-lingual Korean /ɯ/ is least distant when the source language is Japanese. One possible reason is that the Japanese /u/ is usually realized as [ɯᵝ] (a close back compressed vowel), slightly closer to the Korean /ɯ/. Hence, when the source language is Japanese, the Korean /ɯ/ is relatively less unseen than with the European languages. The realizations of /ɯ/ tend to be more distant from the native Korean one when the source language is one of the three European languages (German, English, and Spanish), as their vowel systems are distinct from Korean. In addition, a linguistics study suggests that Korean-English bilingual children often assimilate the Korean /ɯ/ to /u/, which partly corresponds to our findings, as the jaw opening value (F1) of the cross-lingual /ɯ/ from English is closer to that of /u/.
Our findings suggest that it may be necessary to consider linguistic characteristics among languages to mitigate the L2 accent issue. For instance, an additional module, embedding, or loss function properly reflecting the linguistic characteristics of a language pair may help reduce L2 accents in cross-lingual scenarios. These observations also imply that linguistics studies conducted on humans can inform the development of TTS systems. Our work may further serve as auxiliary information for building a new database or balancing an existing one.
4 Conclusion and Future Work
In this study, we extend the vowel space analysis method to cross-lingual TTS systems and explore L2 accents in such systems. We also compare some of our observations on cross-lingual TTS systems with linguistics studies conducted on humans. We hope our interdisciplinary study on cross-lingual TTS systems provides inspiration in relevant fields.
Based on these findings, we plan to mitigate the L2 accent problem in cross-lingual TTS models by proposing additional loss functions based on the linguistic characteristics of a language pair.
-  (2022) Visualising model training via vowel space for text-to-speech systems. In Proc. Interspeech, pp. 511–515.
-  (2004) A comparison of vowel normalization procedures for language variation research. The Journal of the Acoustical Society of America 116 (5), pp. 3099–3107.
-  (1999) Handbook of the International Phonetic Association: a guide to the use of the International Phonetic Alphabet. Cambridge University Press.
-  (2005) Interaction of native- and second-language vowel system(s) in early and late bilinguals. Language and Speech 48 (1), pp. 1–27.
-  (2013) Practical phonetics and phonology: a resource book for students. Routledge.
-  (2013) Non-native vowel production accuracy and variability in relation to overall intelligibility. The Journal of the Acoustical Society of America 134 (5), pp. 4107.
-  (1998) Learning to pronounce vowel sounds in a foreign language using acoustic measurements of the vocal tract as feedback in real time. Language and Speech 41 (1), pp. 1–20.
-  (2020) High quality streaming speech synthesis with low, sentence-length-independent latency. In Proc. Interspeech, pp. 2022–2026.
-  (2009) A comparison of three speaker-intrinsic vowel formant frequency normalization algorithms for sociophonetics. Language Variation and Change 21 (3), pp. 413–435.
-  (1973) Speech sounds and features.
-  (2003) Assessing constraints on second-language segmental production and perception. Phonetics and Phonology in Language Comprehension and Production: Differences and Similarities 6, pp. 319–355.
-  (1997) Perception and production of a new vowel category. Second-Language Speech: Structure and Process 13, pp. 53.
-  (1995) Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America 97 (5), pp. 3099–3111.
-  (2020) The relation between L1 and L2 category compactness and L2 VOT learning. In Proc. of Meetings on Acoustics, Vol. 42, pp. 060011.
-  (2017) The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/
-  (2014) On the effects of L2 perception and of individual differences in L1 production on L2 pronunciation. Frontiers in Psychology 5, pp. 1246.
-  (2016) Mutual influences between native and non-native vowels in production: evidence from short-term visual articulatory feedback training. Journal of Phonetics 57, pp. 21–39.
-  (2020) Glow-TTS: a generative flow for text-to-speech via monotonic alignment search. NeurIPS 33, pp. 8067–8077.
-  (2015) Duden – Das Aussprachewörterbuch. Edited by Stefan Kleiner and Ralf Knöbl in collaboration with the Duden editorial office. 7th, completely revised and updated edition. Berlin.
-  (2003) The vowel system of contemporary Korean and direction of change. Journal of Korean Linguistics 41, pp. 59–91.
-  (2014) A course in phonetics. Cengage Learning.
-  (1971) Classification of Russian vowels spoken by different speakers. The Journal of the Acoustical Society of America 49 (2B), pp. 606–608.
-  (2019) Unsupervised polyglot text-to-speech. In Proc. ICASSP, pp. 7055–7059.
-  (2013) Best practices in measuring vowel merger. In Proc. of Meetings on Acoustics, Vol. 20, pp. 060008.
-  (2019) CSS10: a collection of single speaker speech datasets for 10 languages. In Proc. Interspeech.
-  (2022) Bunched LPCNet2: efficient neural vocoders covering devices from cloud to edge. In Proc. Interspeech, pp. 808–812.
-  (2013) Automatic assessment of vowel space area. The Journal of the Acoustical Society of America 134 (5), pp. EL477–EL483.
-  (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proc. ICASSP, pp. 4779–4783.
-  (2019) LPCNet: improving neural speech synthesis through linear prediction. In Proc. ICASSP, pp. 5891–5895.
-  (2008) The sounds of Japanese with audio CD. Cambridge University Press.
-  (2017) Tacotron: towards end-to-end speech synthesis. In Proc. Interspeech, pp. 4006–4010.
-  (1997) Minimal segments in second language phonology. Second Language Speech: Structure and Process, pp. 263–312.
-  (2022) Improving cross-lingual speech synthesis with triplet training scheme. In Proc. ICASSP, pp. 6072–6076.
-  (2019) Learning to speak fluently in a foreign language: multilingual speech synthesis and cross-language voice cloning. In Proc. Interspeech, pp. 2080–2084.