Code-switching (CS) speech is a common phenomenon in multilingual countries and is defined as speech that contains more than one language [auer2013code]. Recently, CS speech has received great attention in the speech community and has been identified as one of the most challenging tasks for automatic speech recognition (ASR), for both hybrid systems (e.g. [kaldi]) and end-to-end systems (e.g. [espnet]). One of the main challenges is data scarcity, especially in the context of language modeling.
Data augmentation offers a potential solution to data scarcity because it is less time-consuming than collecting and transcribing real speech data, and it has been shown in many contexts [ko2015audio, gorin2016language, shorten2019survey, park2019specaugment, Bao2019] to improve results. Text generation - a data augmentation method - has been proposed in [1stCS, Vu2014Generation, Yilmaz2018, Winata2018, Chang2019, Gao2019, Lee2019] with the aim of improving language models and therefore automatic speech recognition performance for CS speech. The first approach [1stCS, Chang2019, Gao2019, Lee2019] breaks the main task down into two consecutive steps: first, predict CS start and end points; then, replace a certain number of words or phrases in monolingual text with their translations in another language using either rules or machine learning. This approach suffers from error propagation. The second approach [Vu2014Generation, Yilmaz2018]
leverages recurrent neural network language models trained on CS transcriptions. Because of the small amount of CS transcriptions, the generated texts are often meaningless [Vu2014Generation]. The third approach [Winata2018] formulates CS text generation as a sequence-to-sequence problem, mapping monolingual text to CS text, and solves it in an end-to-end fashion using a sequence-to-sequence model [S2S, NMT-S2S]. This model, however, requires a large amount of parallel training data. Moreover, in other text generation contexts it tends to produce ill-formed text [li-etal-2016-diversity, luo-etal-2018-auto].
Recently, CycleGAN - a variant of generative adversarial networks (GANs) [goodfellow2014generative] for style transfer - was proposed for translating images [CycleGAN2017] and has since been applied in other domains [kaneko2018parallel, Bao2019]. The main idea of this model is to transfer one style of the source domain to another style of the target domain by learning a mapping between them without any parallel training data. The first attempt on text was made in [fu2018style], in which the authors interpreted text style in a broad sense and proposed models to modify the sentiment of a text and to transform a text title between paper and news styles.
In this work, we propose the novel idea of treating code-switching as a speaking style and therefore using CycleGANs [CycleGAN2017] to artificially generate text data, for two reasons: first, CycleGAN transfers styles without any parallel training data; second, CS text generation can then be solved in an end-to-end fashion, meaning that the two tasks - CS point prediction and short text translation - are jointly optimized to mimic the CS distributions in the real CS data and to preserve meaning when translating from one language to another. In order to successfully generate artificial CS text, we explore strategies for training this complex model by using a sequence-to-sequence model and by leveraging monolingual data, and we investigate the impact of the cycle losses in CycleGAN on language modeling and automatic speech recognition performance.
In sum, our contributions are as follows: 1) To the best of our knowledge, we are the first to propose a framework based on the CycleGAN architecture for generating artificial CS texts from monolingual texts for language modeling; 2) We show that our generated CS texts improve not only language modeling performance but also the ASR system on the SEAME corpus.
2 Proposed Method
2.1 Sequence-to-sequence (S2S) Model
The sequence-to-sequence model [S2S] is an end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. In the CS context, the S2S model takes a Chinese sentence as input and generates a Chinese-English CS sentence. For example, given an input sequence $(x_1, \ldots, x_T)$ (a Chinese sentence), the model estimates the conditional probability

$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1}),$

where $(y_1, \ldots, y_{T'})$ is the corresponding output sequence (Chinese-English CS sentence) whose length $T'$ may differ from $T$, and $v$ is the representation of the input sequence given by the last hidden state of the encoder. The model is trained in a supervised manner. In CS, the trick is to generate paired training data by utilizing Chinese texts and their partial dictionary-based translations, which form artificially generated CS texts [1stCS]. The idea of this simple method is to replace a certain number of words or phrases in monolingual texts with their translations based on a dictionary or, as in this work, the Google Translation API. Our intuition is that the S2S model will learn more generalized patterns of CS behaviour by taking advantage of the continuous latent semantic space and thereby generate more CS variations than the dictionary-based method.
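As a sketch of how such paired training data can be built: the dictionary and the replacement probability below are toy stand-ins for the Google Translation API and the actual selection scheme, which the paper does not specify.

```python
import random

# Hypothetical Chinese->English dictionary; in the paper the
# translations come from the Google Translation API.
ZH2EN = {"提高": "improve", "铁路": "railway", "市场": "market"}

def make_cs_pair(zh_tokens, zh2en, replace_prob=0.4, rng=None):
    """Return an (input, target) pair for S2S training: the Chinese
    sentence and an artificial CS version in which some dictionary
    words are swapped for their translations [1stCS]."""
    rng = rng or random.Random(0)
    cs = [zh2en[t] if t in zh2en and rng.random() < replace_prob else t
          for t in zh_tokens]
    return zh_tokens, cs
```

With `replace_prob=1.0` every dictionary word is swapped, so 提高 铁路 在 becomes "improve railway 在"; with a probability below one, different calls yield different CS variations of the same sentence.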
2.2 CycleGAN-based Model
The main disadvantage of the previously proposed method is that it does not make use of any prior knowledge encoded in an existing CS corpus. Therefore, we propose to employ the pretrained S2S models, which are $G$ and $F$ in Figure 1, in the CycleGAN architecture to take advantage of both methods: the S2S generation framework and the CS prior knowledge encoded in a 'Discriminator', without any real paired data (monolingual text and its corresponding CS text).
CycleGAN is a currently popular technique for image-to-image translation problems, where the goal is to learn a mapping $G: X \rightarrow Y$ between an input image from a source domain $X$ and an output image from a target domain $Y$ without using paired training data [CycleGAN2017]. The mapping is learnt such that the distribution of images $G(X)$ is indistinguishable from the distribution of $Y$ using an adversarial loss. Because this mapping is highly under-constrained, [CycleGAN2017] coupled it with an inverse mapping $F: Y \rightarrow X$ and introduced a cycle consistency loss to push $F(G(x)) \approx x$ (and vice versa). The full objective is

$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F),$
where $G$ tries to generate images $G(x)$ that look similar to images from domain $Y$, while $D_Y$ aims to distinguish between translated samples $G(x)$ and real samples $y$. The same holds for the Discriminator $D_X$, which tries to distinguish between $F(y)$ and real $x$. $\lambda$ controls the relative importance of the cycle consistency loss:

$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y \sim p_{data}(y)}[\lVert G(F(y)) - y \rVert_1].$
In the context of CS, $X$ refers to monolingual text (e.g. Chinese text) and $Y$ refers to CS text. We define the mappings accordingly: $G: X \rightarrow Y$ generates CS text from monolingual text, and $F: Y \rightarrow X$ maps CS text back to monolingual text.
Furthermore, we adapt the objective to the following equation:

$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda_1 \mathbb{E}_{x \sim p_{data}(x)}[\lVert F(G(x)) - x \rVert_1] + \lambda_2 \mathbb{E}_{y \sim p_{data}(y)}[\lVert G(F(y)) - y \rVert_1],$

where $\lambda_1$ and $\lambda_2$ are tunable parameters that control the importance of the two cycle consistency losses and allow us to tune the balance between them.
Figure 1 shows the network architecture of CycleGAN for CS text generation. There are two Generators, $G$ and $F$: one transforms a Chinese sentence into a CS one, the other converts a CS sentence into a Chinese one. The goal of the two Discriminators ($D_X$ and $D_Y$) is to predict whether a sample comes from the actual distribution ('real') or was produced by the Generator ('fake'). The cycle consistency constraint forces the network to use shared latent features that can map the output back to the original input; it discourages the Generator from producing the same target-domain sentence for quite different inputs. The upper part of Figure 1 describes how we compute the forward cycle consistency loss ($X \rightarrow Y \rightarrow X$) and the lower part the backward one ($Y \rightarrow X \rightarrow Y$).
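To make the two cycle terms concrete, here is a deliberately simplified sketch: the S2S Generators are replaced by dictionary lookups and the reconstruction loss by a token mismatch rate, purely for illustration; the weights 0.3 and 0.8 are the values reported in Section 4.3, and the GAN (Discriminator) terms are omitted.

```python
# Toy illustration of the forward (X->Y->X) and backward (Y->X->Y)
# cycle consistency losses. The real Generators are S2S networks;
# G and F here are dictionary lookups, a hypothetical stand-in.
ZH2EN = {"提高": "improve", "铁路": "railway"}
EN2ZH = {v: k for k, v in ZH2EN.items()}

def G(tokens):  # monolingual (X) -> CS (Y)
    return [ZH2EN.get(t, t) for t in tokens]

def F(tokens):  # CS (Y) -> monolingual (X)
    return [EN2ZH.get(t, t) for t in tokens]

def cycle_loss(x, x_rec):
    """Token mismatch rate: a crude stand-in for the reconstruction loss."""
    return sum(a != b for a, b in zip(x, x_rec)) / max(len(x), 1)

def cycle_objective(x, y, lam1=0.3, lam2=0.8):
    # The two cycle terms of the adapted objective, weighted by
    # lambda_1 and lambda_2 (best dev values 0.3 / 0.8, Section 4.3).
    return lam1 * cycle_loss(x, F(G(x))) + lam2 * cycle_loss(y, G(F(y)))
```

With perfectly inverse Generators both reconstructions match their inputs and the cycle objective is zero; any Generator that drifts away from the input is penalized in proportion to how much of the sentence it fails to recover.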
3.1 SEAME Dataset
We use SEAME, a Mandarin-English code-switching spontaneous speech corpus from South East Asia that contains 99 hours of spontaneous Chinese-English speech recorded from Singaporean and Malaysian speakers. All recordings were made with close-talk microphones in a quiet room. The speakers are aged between 19 and 33 and almost balanced in gender (49.7% female and 50.3% male). 16.96% of the utterances are English, 15.54% are Chinese and the rest (67%) are CS utterances [seame].
3.2 Code-mixing Index (CMI)
In order to compare the artificially generated CS sentences with the real CS corpus, we use the Code-mixing Index (CMI) at the utterance level, introduced in [CMI], as a measurement of the level of mixing between languages:

$CMI = 100 \times \left(1 - \frac{\max_i(t_i)}{N - U}\right) \quad \text{for } N > U,$

where $N$ is the total number of tokens, $U$ is the number of other, non-verbal tokens (e.g. people laughing) and $\max_i(t_i)$ is the number of tokens from the dominant language. If an utterance contains only non-verbal tokens (i.e., if $N = U$), it has an index of zero. For two languages, the CMI ranges from 0 to 50; the lower the value, the lower the level of mixing between languages. Because this index does not indicate the dominant language of an utterance, we prepend the dominant language tag to the CMI and group the utterances into ten groups, presented in Table 1. The table illustrates ten groups of different mixing levels and the proportion of each group in the SEAME data set. For example, EN-C2 denotes the portion of data with English as the dominant language and a CMI in (0, 15]; it makes up 18%, 16% and 15% of the SEAME train, dev and eval sets, respectively.
| Groups | CMI | train (%) | dev (%) | eval (%) |
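A minimal implementation of the utterance-level CMI as defined above; the language-tagging function passed in is left to the caller (the heuristic used in the test below is a toy).

```python
def cmi(tokens, lang_of):
    """Utterance-level Code-mixing Index [CMI]:
    100 * (1 - max_t / (N - U)), where lang_of returns a language tag
    per token, or None for non-verbal tokens (e.g. laughter)."""
    langs = [lang_of(t) for t in tokens]
    N = len(tokens)
    U = langs.count(None)
    if N == U:  # only non-verbal tokens -> index of zero
        return 0.0
    counts = {}
    for lang in langs:
        if lang is not None:
            counts[lang] = counts.get(lang, 0) + 1
    max_t = max(counts.values())  # token count of the dominant language
    return 100.0 * (1.0 - max_t / (N - U))
```

A monolingual utterance scores 0, while a two-language utterance split evenly between the languages scores 50, the maximum for a language pair.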
4 Experimental setup
4.1 Monolingual Datasets
We utilize two Chinese datasets, which were used to train the S2S RNN model and the pretrained generators in CycleGAN, to generate code-switching text: 1) Aishell-1 contains 96,078 sentences contributed by 400 people in China and covers the topics "Finance", "Science and Technology", "Sports", "Entertainment" and "News" [aishell-1]. 2) The National University of Singapore SMS Corpus is a corpus of Short Message Service (SMS) messages collected for research at the Department of Computer Science at the National University of Singapore. It consists of 29,031 SMS messages in Chinese; the messages largely originate from Singaporeans, mostly students attending the university [SMS].
4.2 Baseline Systems
The baseline language model is trained on the SEAME training text, which contains 94,504 sentences comprising monolingual (English or Chinese) and CS sentences. The word-based language model is a one-layer LSTM with 650 units, trained with a batch size of 20. It achieves perplexities of 84 and 75 on the "dev" and "eval" sets, respectively. The subword-based language model is a two-layer RNN with 650 units, trained with a batch size of 256 (Espnet default settings). It has a perplexity of 12.37 on the "dev" set and 10.90 on the "eval" set. The baseline end-to-end (E2E) ASR system is trained with Espnet [espnet] using our best recipe presented in [espnetcs]. The mixed error rates (MER) - the error rate over Chinese characters and English words - on the SEAME "dev" and "eval" sets are 30.7% and 23.0%, respectively, without fusing any external language model. Integrating the baseline language model into our E2E ASR using shallow fusion results in a MER of 30.4% on the "dev" set and 22.4% on the "eval" set.
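Shallow fusion ranks beam hypotheses by the E2E model's score plus a weighted external LM score. The sketch below uses made-up log-probabilities and a hypothetical LM weight (in practice the weight is tuned on the dev set); it is an illustration of the scoring rule, not the Espnet implementation.

```python
def shallow_fusion_best(hyps, lm_weight=0.3):
    """Pick the hypothesis maximizing log p_E2E(y|x) + w * log p_LM(y).
    hyps: list of (text, e2e_logprob, lm_logprob) tuples."""
    return max(hyps, key=lambda h: h[1] + lm_weight * h[2])[0]

# Made-up scores inspired by the example in Section 6.3: the LM
# prefers the code-switched reading over an acoustically similar one.
hyps = [("book concern about this", -4.0, -9.0),
        ("不 concern about this", -4.2, -5.0)]
best = shallow_fusion_best(hyps)
```

With the LM weight set to zero, the acoustically favored hypothesis wins; with the LM included, the code-switched hypothesis does, which is exactly the kind of correction a better CS language model enables.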
4.3 Implementation & Hyperparameters
The S2S and CycleGAN models are implemented in Python 3 and PyTorch. Both the S2S model and the two Generators in CycleGAN use an RNN encoder-decoder consisting of two RNNs with 650 LSTM units. These models are trained on 96,078 Chinese sentences from Aishell-1 and their corresponding artificially generated CS sentences. The two Discriminators in CycleGAN are trained on the SEAME training data (94,504 sentences) and the 96,078 Chinese sentences from Aishell-1. The dimension of the word embeddings is 300 and the batch size is 64. The optimizer is Adam [adam]. The best tunable parameters ($\lambda_1$ and $\lambda_2$) in the CycleGAN objective on the development set are 0.3 and 0.8, respectively.
5.1 Language Model Evaluation
Table 2 shows the perplexity of two different language models (word-based LM and subword-based LM) on the SEAME "dev" and "eval" sets. We use the subword-based LM because our best E2E ASR system [espnetcs] outputs subwords. Both LMs are trained on three different texts: the SEAME training text, the SEAME text plus the text generated by the S2S model, and the SEAME text plus the text generated by CycleGAN. The results show that, on the "eval" set, the language model trained on SEAME plus the CycleGAN-generated text outperforms both the baseline LM and the one trained on SEAME plus the S2S-generated text.
| Word-based LM | dev | eval |
| SEAME (baseline) | 84 | 75 |
| +text from S2S | 83.03 | 74.67 |
| +text from CycleGAN | 83.06 | 72.59 |

| Subword-based LM | dev | eval |
| SEAME (baseline) | 12.37 | 10.90 |
| +text from S2S | 12.13 | 10.80 |
| +text from CycleGAN | 10.39 | 9.59 |
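The perplexities reported here follow the usual definition, the exponentiated average negative log-likelihood per token; a minimal sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from the per-token natural-log probabilities that
    a language model assigns to a held-out text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

For intuition, an LM that assigns probability 1/10 to every token has a perplexity of exactly 10, so the drop from 12.37 to 10.39 on "dev" means the subword LM is, on average, choosing among noticeably fewer equally likely continuations.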
5.2 End-to-End Speech Recognition Evaluation
The subword-based LM is integrated into the E2E speech recognition system using shallow fusion [espnetcs], and the MERs on the SEAME "dev" and "eval" sets are presented in Tables 3 and 4. 'Overall' denotes the entire "dev" or "eval" set, and 'EN', 'ZH' and 'CS' denote the monolingual English sentences, monolingual Chinese sentences and Chinese-English CS sentences in the "dev" or "eval" set, respectively. The results on both sets show that the MER improves not only on CS utterances but also on monolingual ones when the language model is trained on the combination of the SEAME text and the CS text generated by CycleGAN.
| "dev" set | Overall | EN | ZH | CS |
| +text from S2S | 30.3 | 42.8 | 23.4 | 30.2 |
| +text from CycleGAN | 30.1 | 42.6 | 23.2 | 30.0 |

| "eval" set | Overall | EN | ZH | CS |
| +text from S2S | 22.1 | 33.0 | 23.3 | 21.2 |
| +text from CycleGAN | 21.9 | 32.5 | 22.1 | 21.0 |
6 Quantitative & Qualitative Analysis
6.1 Impact of $\lambda_1$ and $\lambda_2$
Table 5 shows the perplexity of the subword-based language model on the SEAME dev set for different values of $\lambda_1$ and $\lambda_2$. The setting $\lambda_1 = 0.3$, $\lambda_2 = 0.8$ seems to deliver the best performance in most cases. The E2E systems that integrate language models trained on text generated with the CycleGAN framework under the best settings of $\lambda_1$ and $\lambda_2$ achieve the best MER on both the SEAME "dev" and "eval" sets.
6.2 Generated Text
We analyzed 148,853 CS sentences generated by S2S and CycleGAN and show some interesting examples in this subsection. Table 6 shows that, when comparing the distribution of CMI groups, the texts generated by CycleGAN have a distribution more similar to SEAME than the texts generated by S2S, across all levels of code-mixing. This suggests that CycleGAN successfully pushes the generated texts to mimic the CS behaviour of the original corpus.
| Monolingual | 提高 铁路 在 物流 市场 中 的 竞争力 |
| S2S | improve 铁路 在 logistics market in 的 竞争力 |
| CycleGAN | improve railway in 2016 market in 的 竞争力 |

| take take 多 一个 then 就 then 就 这 |
| but 那 边 的 东西 最好 了 我 觉得 in terms of of 那 |
| and 如果 take take out of ten |
| 根据 one series of of of 问题 |
While we aim to generate texts with the same meaning as the inputs, we observed an interesting effect, shown in the example in Table 7. Whereas the output of S2S contains Chinese words replaced by their translations, CycleGAN produces words with different meanings ("2016", marked in bold, instead of the English noun "logistics"). In this case, CycleGAN transforms the input sentence into a different but meaningful CS sentence instead of just replacing Chinese words with their translations.
Furthermore, we observe that text generated with the CycleGAN framework can simulate speaking styles found in SEAME. Table 8 presents two examples in which CycleGAN generates text with repeated words, mimicking real speakers in the SEAME data who tend to stutter.
6.3 Analysis of Recognition Results
The top of Table 9 compares the recognition performance of the baseline and the improved ASR systems in terms of average substitution, deletion and insertion rates and MER. The baseline system has a lower deletion rate, while adding the artificially generated CS text from CycleGAN yields better substitution and insertion rates and a better overall MER. The bottom of Table 9 shows some cherry-picked ASR output examples that illustrate the positive impact of the artificially generated CS text data on ASR performance for monolingual (English and Chinese) as well as CS speech utterances.
| System | Sub | Del | Ins | MER |
| +text from S2S | 13.1 | 5.1 | 3.9 | 22.1 |
| +text from CycleGAN | 13.1 | 5.1 | 3.7 | 21.9 |
| Reference (EN example) | SHARED MICROSCOPE |
| Baseline | SHARED MY CROSCORD 吗 |
| +text from S2S | SHARED MY CROSCORD 吗 |
| +text from CycleGAN | SHARED MICROSCOPE |

| Reference (CS example) | 不 concern about this (the pronunciation (IPA) of the Mandarin character "不" is [bu]) |
| Baseline | book concern about this |
| +text from S2S | book concern about this |
| +text from CycleGAN | 不 concern about this |

| Reference (ZH example) | 不 厉 害 打 架 的 |
| Baseline | 不 利 太 大 家 的 |
| +text from S2S | 不 利 太 大 家 的 |
| +text from CycleGAN | 不 厉 害 大 家 的 |
7 Conclusion
We explored a data augmentation method based on CycleGAN to generate CS text with the goal of improving language modeling and, in turn, ASR performance. We considered CS as a speaking style and extended the CycleGAN framework to convert monolingual text into CS text. Our experiments showed that we can generate CS text that mimics not only the CS behaviour in terms of different levels of language mixing but also the speaking styles in the SEAME corpus. Furthermore, our experiments on SEAME show promising results, with consistent improvements in perplexity for language modeling as well as in MER for ASR. Future work could explore in depth the differences between the real CS corpus and the artificially generated one with respect to linguistic properties.