In bilingual or multilingual communities, speakers can easily switch between different languages within a conversation (Wang et al., 2009). People who code-switch mix languages in response to social factors as a way of communicating in a multicultural society. Generally, code-switching speakers switch languages by inserting words or phrases from the embedded language into the matrix language. This can occur within a sentence, which is known as intra-sentential code-switching, or between two matrix-language sentences, which is called inter-sentential code-switching (Heredia and Altarriba, 2001).
Learning a code-switching automatic speech recognition (ASR) model has been a challenging task for decades due to data scarcity and the difficulty of capturing similar phonemes across languages. Several approaches have focused on generating synthetic speech data from monolingual resources (Nakayama et al., 2018; Winata et al., 2019). However, these methods are not guaranteed to generate natural code-switching speech or text. Another line of work explores pre-training on large monolingual speech corpora and then fine-tuning the model on a limited amount of code-switching data, which has been found to improve performance (Li et al., 2011; Winata et al., 2019). However, these pre-training approaches are not optimized to extract the knowledge from each individual language that is useful in the code-switching context, and after the fine-tuning step, the model forgets the previously learned monolingual tasks.
In this paper, we introduce a new method, meta-transfer learning (the code is available at https://github.com/audioku/meta-transfer-learning), to learn to transfer knowledge from monolingual resources to a code-switching model. Our approach extends model-agnostic meta-learning (MAML) (Finn et al., 2017) to not only train on monolingual source language resources but also optimize the update on code-switching data. This allows the model to leverage monolingual resources in a way that is optimized for recognizing code-switched speech. Figure 1 illustrates the optimization flow of the model. Different from joint training, meta-transfer learning computes the first-order optimization using gradients from monolingual resources constrained by the code-switching validation set. Thus, instead of learning one model that generalizes to all tasks, we focus on judiciously extracting useful information from the monolingual resources.
The main contribution of this paper is a novel method to efficiently transfer information from monolingual resources to a code-switched speech recognition system. We show the effectiveness of our approach in terms of error rate, and that it also converges faster. We further show that our approach is applicable to other natural language tasks, such as code-switching language modeling.
2 Related Work
Our idea of learning to transfer knowledge from source monolingual resources to a code-switching model comes from MAML (Finn et al., 2017). Probabilistic MAML (Finn et al., 2018) is an extension of MAML with better classification coverage. Meta-learning has been applied to natural language and speech processing (Hospedales et al., 2020). Madotto et al. (2019) extend MAML to the personalized text generation domain and successfully produce more persona-consistent dialogue. Gu et al. (2018) and Qian and Yu (2019) propose to apply meta-learning to low-resource learning, and Lin et al. (2019) apply MAML to low-resource sales prediction. Several applications have also been proposed in speech, such as cross-lingual speech recognition (Hsu et al., 2019), speaker adaptation (Klejch et al., 2018, 2019), and cross-accent speech recognition (Winata et al., 2020).
Early work on code-switching language modeling explores syntactic and semantic features with recurrent neural networks (RNNs). Baheti et al. (2017) adapt curriculum learning by first training a network on the monolingual corpora of two languages and subsequently training on code-switched data. Pratapa et al. (2018) and Lee et al. (2019) propose methods to generate artificial code-switching data using linguistic constraints. Winata et al. (2018) leverage syntactic information to better identify the location of code-switching points and improve language model performance. Finally, Garg et al. (2018) and Winata et al. (2019) propose neural methods, SeqGAN and a pointer-generator (Pointer-Gen), to generate diverse synthetic code-switching sentences sampled from the real code-switching data distribution.
3 Meta-Transfer Learning
We aim to effectively transfer knowledge from source domains to a specific target domain. We denote our model by $f_\theta$ with parameters $\theta$. Our model accepts a set of speech inputs $X$ and generates a set of utterances $Y$. The training involves a set of speech datasets, each of which is treated as a task $\mathcal{T}_i$. Each task is distinguished as either a source task $\mathcal{T}^{src}$ or the target task $\mathcal{T}^{tgt}$. For each training iteration, we randomly sample a set of data as training data $D_{tra}$ and a set of data as validation data $D_{val}$. In this section, we present and formalize the method.
To facilitate a good generalization on the code-switching data, we sample the source dataset $D_{tra}$ from the monolingual English and Chinese corpora and the code-switching corpus, and choose the target dataset $D_{val}$ only from the code-switching corpus. The code-switching samples in $D_{tra}$ and $D_{val}$ are disjoint. In this case, we exploit the meta-learning update in meta-transfer learning to acquire knowledge from the monolingual English and Chinese corpora and optimize the learning process on the code-switching data. Then, we slowly fine-tune the trained model to move closer to the code-switching domain, avoiding aggressive updates that can push the model to a worse position.
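The sampling scheme above can be sketched as follows. This is a minimal illustration, assuming toy in-memory datasets with placeholder names (`cs_train`, `cs_val`, `sample_batches` are illustrative, not from the released code); the key property is that source batches mix monolingual and code-switching training data while the target batch comes only from a disjoint code-switching split.

```python
import random

random.seed(0)

# Toy code-switching corpus split into disjoint training/validation parts.
cs_data = [f"cs_{i}" for i in range(100)]
random.shuffle(cs_data)
cs_train, cs_val = cs_data[:80], cs_data[80:]

def sample_batches(en, zh, batch_size=4):
    """Draw one iteration's source batches (EN, ZH, CS train) and the
    target batch (CS validation only)."""
    sources = {
        "en": random.sample(en, batch_size),
        "zh": random.sample(zh, batch_size),
        "cs": random.sample(cs_train, batch_size),
    }
    target = random.sample(cs_val, batch_size)
    return sources, target

sources, target = sample_batches([f"en_{i}" for i in range(50)],
                                 [f"zh_{i}" for i in range(50)])
# The target batch never overlaps the code-switching training split.
```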
3.2 Meta-Transfer Learning Algorithm
Our approach extends the meta-learning paradigm to adapt knowledge learned from source domains to a specific target domain. This approach captures useful information from multiple resources for the target domain and updates the model accordingly. Figure 1 presents the general idea of meta-transfer learning. The goal of meta-transfer learning is not to generalize to all tasks, but to acquire crucial knowledge to transfer from monolingual resources to the code-switching domain. As shown in Algorithm 1, for each adaptation step on a source task $\mathcal{T}_i^{src}$, we compute updated parameters $\theta_i'$ via stochastic gradient descent (SGD) as follows:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i^{src}}(f_\theta),$$

where $\alpha$ is a learning-rate hyper-parameter of the inner optimization. Then, a cross-entropy loss $\mathcal{L}_{\mathcal{T}^{tgt}}$ is calculated from the learned model upon the generated text given the audio inputs on the target domain:

$$\mathcal{L}_{\mathcal{T}^{tgt}}(f_{\theta_i'}) = -\sum_{(x, y)} \log p(y \mid x; \theta_i').$$

We define the objective as follows:

$$\min_\theta \sum_{\mathcal{T}_i^{src}} \mathcal{L}_{\mathcal{T}^{tgt}}(f_{\theta_i'}) = \min_\theta \sum_{\mathcal{T}_i^{src}} \mathcal{L}_{\mathcal{T}^{tgt}}\!\left(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i^{src}}(f_\theta)}\right),$$

where $\mathcal{T}_i^{src} \sim D_{tra}$ and $\mathcal{T}^{tgt} \sim D_{val}$. We minimize the loss of $f_{\theta_i'}$ upon $D_{val}$. Then, we apply gradient descent on the meta-model parameters $\theta$ with a meta-learning rate $\beta$:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i^{src}} \mathcal{L}_{\mathcal{T}^{tgt}}(f_{\theta_i'}).$$
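The update rule can be illustrated with a toy scalar example. This is a minimal first-order sketch, assuming a one-parameter linear model $y = \theta x$ with squared loss; `alpha` is the inner (task) learning rate, `beta` the meta-learning rate, and all function names are illustrative rather than taken from the released code.

```python
import random

def grad(theta, batch):
    # Analytic gradient of the mean squared loss w.r.t. theta.
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)

def meta_transfer_step(theta, source_batches, target_val_batch,
                       alpha=0.05, beta=0.05):
    """Adapt on each source task, then accumulate the (first-order)
    meta-gradient measured on the target-domain validation batch."""
    meta_grad = 0.0
    for batch in source_batches:
        theta_i = theta - alpha * grad(theta, batch)   # inner SGD step on a source task
        meta_grad += grad(theta_i, target_val_batch)   # loss evaluated on the target domain
    return theta - beta * meta_grad

random.seed(0)

def make(slope, n=16):
    # Synthetic regression task: y = slope * x.
    return [(x, slope * x) for x in (random.uniform(-1, 1) for _ in range(n))]

theta = 0.0
for _ in range(200):
    # Source tasks have slopes 1.5 and 2.5; the target task has slope 2.0.
    theta = meta_transfer_step(theta, [make(1.5), make(2.5)], make(2.0))
# theta is driven toward the target slope because the outer loss is
# evaluated on the target-domain batch, not on the source tasks.
```

The design point this illustrates: joint training would pull the parameter toward the average of the source tasks, whereas the meta-transfer update directs the gradient through the target-domain loss.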
4 Code-Switched Speech Recognition
4.1 Model Description
Our model is a transformer-based sequence-to-sequence ASR model (Dong et al., 2018). The encoder employs a VGG network (Simonyan and Zisserman, 2015) to learn a language-agnostic audio representation and generate input embeddings. The decoder receives the encoder outputs and applies multi-head attention to the decoder input. We apply a mask to the decoder attention layer to avoid any information flow from future tokens. During training, we optimize next-character prediction by shifting the transcription by one. Then, we generate the prediction by maximizing the log probability of the sub-sequence using beam search.
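The decoder-side training setup described above can be sketched as follows. This is a minimal illustration of teacher forcing with shifted targets and a causal mask; the token values and function names are illustrative, not from the released code.

```python
def shift_targets(transcription, bos="<s>", eos="</s>"):
    """Build (decoder input, prediction target) pairs: the input is the
    transcription shifted right by one, so each position predicts the
    next character."""
    tokens = list(transcription)
    return [bos] + tokens, tokens + [eos]

def causal_mask(n):
    """n x n mask: True means position j is visible from position i
    (j <= i), blocking information flow from future tokens."""
    return [[j <= i for j in range(n)] for i in range(n)]

dec_in, dec_out = shift_targets("ab")
# dec_in  = ['<s>', 'a', 'b']
# dec_out = ['a', 'b', '</s>']
mask = causal_mask(3)
# mask[0] = [True, False, False]  -> the first position sees only itself
```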
4.2 Language Model Rescoring
To further improve the prediction, we incorporate a Pointer-Gen LM (Winata et al., 2019) into the beam-search process to select the best sub-sequence scored using the softmax probability of the characters. We define $P(Y)$ as the score of the predicted sentence, add the Pointer-Gen language model probability $P_{LM}(Y)$ to rescore the predictions, and include a word count $\mathrm{wc}(Y)$ to avoid generating very short sentences. $P(Y)$ is calculated as follows:

$$P(Y) = \alpha \log P_{dec}(Y \mid X) + \beta \log P_{LM}(Y) + \gamma \sqrt{\mathrm{wc}(Y)},$$

where $\alpha$ is the parameter to control the decoding probability, $\beta$ is the parameter to control the language model probability, and $\gamma$ is the parameter to control the effect of the word count.
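The rescoring score can be sketched as a small function. This assumes the log-linear combination described above; the function name and the default weight values are illustrative placeholders, not values from the paper.

```python
import math

def rescore(log_p_dec, log_p_lm, num_words, alpha=0.8, beta=0.2, gamma=0.1):
    """Beam-search score for one hypothesis: alpha weights the decoding
    probability, beta the language model probability, and gamma the
    word-count bonus."""
    return alpha * log_p_dec + beta * log_p_lm + gamma * math.sqrt(num_words)

# With identical model scores, the word-count term favors the longer
# hypothesis, which discourages very short outputs.
short = rescore(-4.0, -5.0, 4)
longer = rescore(-4.0, -5.0, 9)
```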
Table 1 (excerpt): SEAME data statistics.

| | Train | Dev | Test |
|---|---|---|---|
| Duration (hr) | 100.58 | 5.56 | 5.25 |
Table 2: Speech recognition results in CER.

| Model | CER |
|---|---|
| Winata et al. (2019) | 32.76% |
| + Pointer-Gen LM | 31.07% |
| Joint Training (EN + ZH) | 98.29% |
| Joint Training (EN + CS) | 34.77% |
| Joint Training (ZH + CS) | 33.93% |
| Joint Training (EN + ZH + CS) | 32.87% |
| + Pointer-Gen LM | 31.74% |
| Meta-Transfer Learning (EN + CS) | 32.35% |
| Meta-Transfer Learning (ZH + CS) | 31.57% |
| Meta-Transfer Learning (EN + ZH + CS) | 30.30% |
| + Pointer-Gen LM | 29.30% |
5 Experiments and Results
5.1 Dataset

We use SEAME Phase II, a conversational English-Mandarin Chinese code-switching speech corpus that consists of spontaneously spoken interviews and conversations (Nanyang Technological University, 2015). The data statistics and code-switching metrics, such as the code-mixing index (CMI) (Gambäck and Das, 2014) and switch-point fraction (Pratapa et al., 2018), are depicted in Table 1. For monolingual speech datasets, we use HKUST (Liu et al., 2006) as the monolingual Chinese dataset and Common Voice (Ardila et al., 2019) as the monolingual English dataset (we downloaded the Common Voice version 1 dataset from https://voice.mozilla.org/). We use 16 kHz audio inputs and up-sample the HKUST data from 8 kHz to 16 kHz.
5.2 Experiment Settings
Our transformer model consists of two encoder layers and four decoder layers with a hidden size of 512, an embedding size of 512, a key dimension of 64, and a value dimension of 64. The input to all experiments is a spectrogram computed with a 20 ms window and shifted every 10 ms. Our label set has 3765 characters and includes all of the English and Chinese characters from the corpora, spaces, and apostrophes. We optimize our model using Adam and start the training with a learning rate of 1e-4. We fine-tune our model using SGD with a learning rate of 1e-5, and apply early stopping on the validation set. We choose the rescoring parameters $\alpha$, $\beta$, and $\gamma$ based on validation performance. We draw each batch randomly with a uniform distribution at every iteration.
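The 20 ms window / 10 ms shift setting above maps directly to sample counts at 16 kHz. A minimal framing sketch (the FFT and log-magnitude steps of the spectrogram are omitted for brevity; names are illustrative):

```python
SAMPLE_RATE = 16_000
WIN = int(0.020 * SAMPLE_RATE)   # 20 ms window -> 320 samples
HOP = int(0.010 * SAMPLE_RATE)   # 10 ms shift  -> 160 samples

def frame(signal):
    """Split a waveform into overlapping analysis frames."""
    return [signal[i:i + WIN] for i in range(0, len(signal) - WIN + 1, HOP)]

frames = frame([0.0] * SAMPLE_RATE)  # one second of silence
# One second of 16 kHz audio yields (16000 - 320) // 160 + 1 = 99 frames.
```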
We conduct experiments with the following approaches: (a) only CS, (b) joint training on EN + ZH, (c) joint training on EN + ZH + CS, and (d) meta-transfer learning. Then, we fine-tune the (b), (c), and (d) models on CS. We apply LM rescoring to our best model. We evaluate our models using beam search with a beam width of 5 and a maximum sequence length of 300. The quality of our model is measured using character error rate (CER).
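The CER metric used above is the character-level Levenshtein edit distance divided by the reference length. A minimal sketch (not the evaluation script used in the paper):

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1]

def cer(ref, hyp):
    """Character error rate: edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)

# cer("abcd", "abed") == 0.25 (one substitution over four characters)
```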
5.3 Results

The results are shown in Table 2. Generally, adding the monolingual EN and ZH data to the training set is effective in reducing error rates. There is a significant margin between only CS and joint training (1.64% absolute) or meta-transfer learning (4.21% absolute). According to the experimental results, meta-transfer learning consistently outperforms the joint-training approaches, which shows its effectiveness in language adaptation.
The fine-tuning approach helps to improve the performance of the trained models, especially for joint training (EN + ZH). We observe that joint training (EN + ZH) without fine-tuning cannot predict mixed-language speech, while joint training on EN + ZH + CS is able to recognize it. However, according to Table 3, adding a fine-tuning step hurts the previously learned knowledge (e.g., EN: 11.84% → 63.85%, ZH: 31.30% → 78.07%). Interestingly, the model trained with meta-transfer learning does not suffer from catastrophic forgetting even without focusing the loss objective on learning both monolingual languages. As expected, joint training on EN + ZH + CS achieves decent performance on all tasks, but it does not optimally improve CS.
Table 3: CER improvement on CS over the only-CS baseline, and CER on the monolingual EN and ZH test sets.

| Model | CS (Δ CER) | EN | ZH |
|---|---|---|---|
| Joint Training (EN + ZH) | -63.78% | 11.84% | 31.30% |
| Joint Training (EN + ZH + CS) | 1.64% | 13.88% | 30.46% |
| Meta-Transfer Learning (EN + ZH + CS) | 4.21% | 16.22% | 31.39% |
Language model rescoring with the Pointer-Gen LM improves the performance of the meta-transfer learning model by choosing more precise code-switching sentences during beam search. With Pointer-Gen LM rescoring, our model outperforms the model trained only on CS by 5.21% and the previous state-of-the-art by 1.77%.
Figure 2 depicts the dynamics of the validation loss per iteration on CS, EN, and ZH. As we can see from the figure, meta-transfer learning converges faster than only CS and joint training, and reaches the lowest validation loss. For the validation losses on EN and ZH, both joint training (EN + ZH + CS) and meta-transfer learning achieve a similar loss at the same iteration, while only CS reaches a much higher validation loss. This shows that meta-transfer learning is not only optimized for the code-switching domain, but also preserves the generalization ability to monolingual domains, as depicted in Table 3.
5.4 Language Modeling Task
We further evaluate our meta-transfer learning approach on a language modeling task. We simply take the transcriptions of the same datasets and build a 2-layer LSTM-based language model following the model configuration in Winata et al. (2019). To further improve the performance, we apply fine-tuning with an SGD optimizer, using a learning rate of 1.0 and decaying it by 0.25x for every epoch without improvement on the validation performance. To prevent the model from over-fitting, we apply an early stop of 5 epochs.
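The fine-tuning schedule above can be sketched as a small function. This is a minimal illustration, assuming per-epoch validation perplexities are already available; the function name and the toy perplexity sequence are illustrative, not from the released code.

```python
def schedule(val_ppls, lr=1.0, decay=0.25, patience=5):
    """Return (final lr, epochs run): decay lr by `decay` after any epoch
    without validation improvement, and stop after `patience` such epochs."""
    best, bad = float("inf"), 0
    for epochs, ppl in enumerate(val_ppls, start=1):
        if ppl < best:
            best, bad = ppl, 0
        else:
            bad += 1
            lr *= decay            # decay on no improvement
            if bad >= patience:    # early stop
                break
    return lr, epochs

# Toy run: improvements at epochs 1, 2, and 5, then stagnation.
final_lr, epochs = schedule(
    [70.0, 68.0, 68.5, 69.0, 67.0, 67.5, 68.0, 68.1, 68.2, 68.3, 68.4])
```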
Table 4: Language model perplexity on the validation and test sets.

| Model | Valid PPL | Test PPL |
|---|---|---|
| Joint Training (EN + ZH + CS) | 70.99 | 63.73 |
| Meta-Transfer Learning (EN + ZH + CS) | 68.83 | 62.14 |
As shown in Table 4, the meta-transfer learning approach outperforms the joint-training approach. We find a trend for the language modeling task similar to that of the speech recognition task: meta-transfer learning without additional fine-tuning performs better than joint training with fine-tuning. Compared to our baseline model (only CS), meta-transfer learning reduces the test set perplexity by 3.57 points (65.71 → 62.14), and the post fine-tuning step reduces it even further, from 62.14 to 61.97.
6 Conclusion

We propose a novel method, meta-transfer learning, to learn to transfer knowledge to a code-switched speech recognition system in a low-resource setting by judiciously extracting information from high-resource monolingual datasets. Our model learns to recognize the individual languages and transfers this knowledge to better recognize mixed-language speech by conditioning the optimization objective on the code-switching domain. Based on experimental results, our training strategy outperforms joint training even without an additional fine-tuning step, and it requires fewer iterations to converge.
In this paper, we have shown that our approach can be effectively applied to both speech recognition and language modeling tasks. In future work, we will further explore the generalizability of our meta-transfer learning approach to more downstream multilingual tasks.
Acknowledgments

This work has been partially funded by ITF/319/16FP and MRP/055/18 of the Innovation Technology Commission, the Hong Kong SAR Government, the School of Engineering Ph.D. Fellowship Award of the Hong Kong University of Science and Technology, and RDC 1718050-0 of EMOS.AI.
References

- Recurrent neural network language modeling for code switching conversational speech. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8411–8415. Cited by: §2.
- Combination of recurrent neural networks and factored language models for code-switching language modeling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 206–211. Cited by: §2.
- Common Voice: a massively multilingual speech corpus. arXiv preprint arXiv:1912.06670. Cited by: §5.1.
- Curriculum design for code-switching: experiments with language identification and language modeling with deep neural networks. Proceedings of ICON, pp. 65–74. Cited by: §2.
- Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888. Cited by: §4.1.
- Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1126–1135. Cited by: §1, §2.
- Probabilistic model-agnostic meta-learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, USA, pp. 9537–9548. Cited by: §2.
- On measuring the complexity of code-mixing. In Proceedings of the 11th International Conference on Natural Language Processing, Goa, India, pp. 1–7. Cited by: §5.1.
- Code-switched language models using dual RNNs and same-source pretraining. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3078–3083. Cited by: §2.
- Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3622–3631. Cited by: §2.
- Bilingual language mixing: why do bilinguals code-switch?. Current Directions in Psychological Science 10 (5), pp. 164–168. Cited by: §1.
- Meta-learning in neural networks: a survey. arXiv preprint arXiv:2004.05439. Cited by: §2.
- Meta learning for end-to-end low-resource speech recognition. arXiv preprint arXiv:1910.12094. Cited by: §2.
- Speaker adaptive training using model agnostic meta-learning. arXiv preprint arXiv:1910.10605. Cited by: §2.
- Learning to adapt: a meta-learning approach for speaker adaptation. Proc. Interspeech 2018, pp. 867–871. Cited by: §2.
- Linguistically motivated parallel data augmentation for code-switch language modeling. In INTERSPEECH 2019, Cited by: §2.
- Asymmetric acoustic modeling of mixed language speech. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5004–5007. Cited by: §1.
- Code-switch language model with inversion constraints for mixed language speech recognition. Proceedings of COLING 2012, pp. 1671–1680. Cited by: §2.
- Learning to learn sales prediction with social media sentiment. In Proceedings of the First Workshop on Financial Technology and Natural Language Processing, pp. 47–53. Cited by: §2.
- HKUST/MTS: a very large scale Mandarin telephone speech corpus. In Chinese Spoken Language Processing, pp. 724–735. Cited by: §5.1.
- Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5454–5459. Cited by: §2.
- Speech chain for semi-supervised learning of Japanese-English code-switching ASR and TTS. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 182–189. Cited by: §1.
- Mandarin-English code-switching in South-East Asia, LDC2015S04. Web download. Philadelphia: Linguistic Data Consortium. Cited by: §5.1.
- Language modeling for code-mixing: the role of linguistic theory based synthetic data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1543–1553. Cited by: §2, §5.1.
- Domain adaptive dialog generation via meta learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2639–2649. Cited by: §2.
- Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §4.1.
- Sustained and transient language control in the bilingual brain. NeuroImage 47 (1), pp. 414–422. Cited by: §1.
- Learning fast adaptation on cross-accented speech recognition. arXiv preprint arXiv:2003.01901. Cited by: §2.
- Code-switching language modeling using syntax-aware multi-task learning. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pp. 62–67. Cited by: §2.
- Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 271–280. Cited by: §1, §2, §4.1, §4.2, Table 2, §5.4, Table 4.