Reducing language context confusion for end-to-end code-switching automatic speech recognition

01/28/2022
by Shuai Zhang, et al.
HUAWEI Technologies Co., Ltd.

Code-switching refers to alternating between languages within the communication process. Training end-to-end (E2E) automatic speech recognition (ASR) systems for code-switching is known to be a challenging problem because of the lack of data, compounded by the increased language context confusion due to the presence of more than one language. In this paper, we propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model, based on the Equivalence Constraint (EC) theory. This linguistic theory requires that any monolingual fragment occurring in a code-switching sentence must also occur in one of the monolingual sentences, and it thus establishes a bridge between monolingual data and code-switching data. By calculating the attention of each language separately, our method can efficiently transfer language knowledge from rich monolingual data. We evaluate our method on the ASRU 2019 Mandarin-English code-switching challenge dataset. Compared with the baseline model, the proposed method achieves an 11.37% relative mix error rate (MER) reduction.


1 Introduction

Code-switching is a common language phenomenon in which speakers alternate between languages within a sentence. With the progress of globalization, this kind of cross-language expression has become increasingly common [15]. For the code-switching ASR task, the end-to-end (E2E) model has become a widely studied paradigm because of its advantages over the traditional pipeline ASR system [11, 10, 9, 23]. The E2E approach integrates the acoustic, pronunciation, and language models into a single jointly optimized network [6, 7, 3, 5]. However, training an E2E code-switching ASR system is known to be a challenging problem because of the lack of data, compounded by the increased language context confusion due to the presence of more than one language [11, 10, 9, 23].

To alleviate the problem of multilingual context confusion, a natural idea is to use a large amount of monolingual audio-text data to train the code-switching ASR model. For monolingual tasks, some E2E models have achieved strong performance with large-scale data [7, 3, 5]. However, these models usually cannot handle code-switching speech well. One possible reason is that multilingual context information cannot be learned from monolingual data. In addition to using monolingual data, augmenting code-switching text data is another effective approach. Such methods often use translation systems to artificially generate code-switching text from monolingual text under various rules [24, 4]. The generated text can effectively increase the richness of the multilingual context, thereby improving the system's ability to model language knowledge. To generate more plausible code-switching text, various linguistic theories are used to constrain the generation strategy [12, 13, 14]. Among the many attempts to explain the grammatical constraints on code-switching, three of the most widely accepted are the Embedded Matrix (EM) [8], the Equivalence Constraint (EC) [18], and the Functional Head Constraint (FHC) [2] theories. Fusing linguistic theory can improve the performance of code-switching ASR. However, unpaired text data cannot be used directly to train an E2E ASR model, so methods such as language model fusion or knowledge distillation are used to indirectly assist training and decoding [22, 25, 1]. To obtain audio-text paired data, speech synthesis technology is necessary [20]. However, synthesized speech usually does not match real speech, which limits its benefit for code-switching ASR performance [19].

Figure 1: Three specific language-related attention mechanisms of our proposed method. (a) The self-attention scores are re-weighted according to language. (b) The embeddings of the two languages share the same self-attention parameters. (c) The embeddings of the two languages have their own independent self-attention parameters. For simplicity, we omit parts such as layer normalization, residual connections, linear projections, and the softmax layer.

In this paper, we propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model, based on the EC linguistic theory. Our method rests on two observations. First, language switching in code-switching is highly unpredictable, with almost infinitely many possible switching combinations; it is unrealistic to try to cover all combinations through text generation. Second, the EC linguistic theory requires that any monolingual fragment occurring in a code-switching sentence must also occur in some monolingual sentence. This theory establishes a bridge between monolingual data and code-switching data. On this basis, we propose a method that reduces multilingual context confusion by relying on monolingual data. Specifically, our method builds on the transformer structure, which has achieved outstanding performance in the field of ASR, and uses the self-attention mechanism to model language context information, as in natural language processing. Because of the randomness of language switching, attention calculated between different languages does not generalize and aggravates context confusion. Therefore, we adjust the attention calculation between different languages so that the model places more emphasis on context from the same language. This not only reduces multilingual context confusion but also lets the model learn contextual information more effectively from rich monolingual data. According to EC theory, our method can theoretically cover most language contexts.

In this work, our main contributions are as follows. First, we propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model based on the EC linguistic theory. Second, we propose three specific attention schemes and verify them with thorough experiments. Finally, the experimental results on the ASRU 2019 Mandarin-English code-switching challenge dataset show that our method uses monolingual data more effectively and is an effective strategy for the code-switching ASR task.

The rest of the paper is organized as follows. Section 2 briefly reviews the speech-transformer structure. Section 3 describes the language-related attention strategies of our method in detail. Section 4 introduces our experimental setup and discusses the experimental results in depth. Finally, Section 5 concludes our work and outlines future work.

2 Review of Speech-Transformer

In order to clearly present the details of our method, we first introduce the structure of the speech-transformer. It is a transformer variant designed for the ASR task [5]. The details of its encoder and decoder are as follows.

For the acoustic encoder, a 2D CNN layer is used to down-sample the acoustic feature sequence and obtain an initial hidden representation. After a linear layer, positional encoding is used to model relative positions. Then a stack of encoder blocks produces the final acoustic encoded representation. Each encoder block has two sub-blocks: the first is multi-head self attention, whose queries, keys, and values come from the outputs of the previous block; the second is a position-wise feed-forward network. Layer normalization and residual connections are applied to each sub-block to stabilize training and improve performance.
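To make the front-end concrete, here is a minimal PyTorch sketch of the convolutional down-sampling described above, using the two stride-2 layers and the 512-dimensional projection reported later in Section 4.2. The channel count of 32 is our own assumption; the paper does not state it.

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Sketch of the acoustic front-end: two stride-2 2D CNN layers
    down-sample the filter-bank features in time and frequency, followed
    by a linear projection to the model dimension of 512 (Section 4.2).
    The channel count (32) is a hypothetical choice, not from the paper."""

    def __init__(self, n_mels=40, d_model=512, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(channels * (n_mels // 4), d_model)

    def forward(self, feats):                  # feats: (batch, time, n_mels)
        x = self.conv(feats.unsqueeze(1))      # (batch, channels, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.proj(x)                    # (batch, time/4, d_model)
```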

For the decoder, a learnable target embedding and positional encoding are first applied to the target sequence. Then a stack of decoder blocks encodes the text context and subsequently interacts with the acoustic encoded representation. Each decoder block has three sub-blocks. The first is masked multi-head self attention, which models language context information and is the focus of this paper; its queries, keys, and values are the word embedding representations. The second is multi-head cross attention, whose keys and values come from the acoustic encoder outputs and whose queries come from the previous sub-block's outputs. The other parts are the same as in the acoustic encoder block. Finally, the output layer consists of a linear projection and a softmax layer.
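For reference, the following is a minimal sketch of one decoder block under the dimensions given later in Section 4.2 (model dimension 512, 4 heads, feed-forward dimension 1024). It illustrates the structure described above and is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal sketch of one speech-transformer decoder block.
    Layer normalization and residual connections wrap each sub-block."""

    def __init__(self, d_model=512, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.dropout = nn.Dropout(dropout)

    def forward(self, y, enc_out, causal_mask):
        # Masked multi-head self-attention over the target embeddings
        # (the sub-block this paper modifies).
        s, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norm1(y + self.dropout(s))
        # Cross-attention: queries from the decoder, keys/values from the
        # acoustic encoder outputs.
        c, _ = self.cross_attn(y, enc_out, enc_out)
        y = self.norm2(y + self.dropout(c))
        # Position-wise feed-forward network.
        return self.norm3(y + self.dropout(self.ffn(y)))
```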

3 Methods

3.1 Language-Related Attention Mechanism

According to the EC linguistic theory, any monolingual fragment that occurs in a code-switching sentence must also occur in some monolingual sentence. This linguistic theory establishes a bridge between monolingual data and code-switching data. In the transformer model, the multi-head self attention part of the decoder module learns language context information. However, the ordinary decoder structure cannot effectively transfer language information from monolingual data. This is because the self attention calculation distributes attention uniformly over the modeling units of all languages, and cross-language contextual information cannot be obtained from monolingual data; monolingual data can therefore sometimes even reduce the performance of the code-switching ASR model. We thus modify the self attention part of the decoder to strengthen the contextual relationship within the same language and weaken the contextual relationship between different languages. We implement three specific language-related attention mechanisms, whose model structures are shown in Fig. 1.

As shown in Fig. 1(a), we re-weight the self attention scores according to the language. Specifically, when the attention scores of a certain unit are calculated with respect to other units in the sentence, the attention scores belonging to the same language are increased while the attention scores of different language units are reduced. In this way, a stronger connection is established between the modeling units of the same language, and the mutual influence between different languages is suppressed. Therefore, the language context representation learned from the code-switching data is more compatible with the monolingual data, and the model can extract context information from monolingual data more effectively.

The attention calculation can be formalized as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$

where $Q$, $K$, and $V$ denote the query, key, and value respectively, and $d_k$ is the dimension of the key. For self attention, the query, key, and value are all the target text embedding.

Our method can be formally expressed as

$$\mathrm{Attention}(Q, K, V) = \left(\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \odot M\right)V \tag{2}$$

where $M$ denotes the re-weight matrix, which increases the scores of same-language pairs and decreases those of cross-language pairs.
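As an illustration of Eq. (2), the following sketch re-weights post-softmax attention scores with a language mask. The boost and damp factors and the row renormalization are our own assumptions made for the sake of a runnable example; the paper does not specify how $M$ is constructed, and the decoder's causal mask is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def language_reweighted_attention(q, k, v, lang_ids, boost=1.5, damp=0.5):
    """Sketch of Eq. (2): re-weight self-attention scores by language.
    q, k, v: (batch, seq, dim); lang_ids: (batch, seq) language label per token.
    boost/damp are hypothetical factors; the decoder causal mask is omitted."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5    # (B, T, T)
    attn = F.softmax(scores, dim=-1)
    # M from Eq. (2): boost same-language pairs, damp cross-language pairs.
    same_lang = lang_ids.unsqueeze(-1) == lang_ids.unsqueeze(-2)  # (B, T, T)
    m = same_lang.float() * boost + (~same_lang).float() * damp
    attn = attn * m
    attn = attn / attn.sum(dim=-1, keepdim=True)  # renormalize rows (our choice)
    return torch.matmul(attn, v)
```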

Different from the above method of manually adjusting the attention scores, we can instead extract the embedding representation sequences of the different languages and calculate the attention separately. As shown in Fig. 1(b), we separate the target embedding representation by language and obtain the embedding sequences of the two languages. These two sequences are then treated as monolingual cases for the attention calculation. In Fig. 1(b), they share the same attention parameters. After the corresponding context representations are obtained, the two languages are merged into a complete code-switching text representation. This approach also achieves the goal of strengthening connections within the same language while reducing interference between different languages.
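A minimal sketch of the Fig. 1(b) scheme follows, using the weighted soft separation with weight 0.1 that Section 4.2 later describes. The binary language labels and the position-wise merge are our own assumptions about details the paper leaves implicit.

```python
import torch
import torch.nn as nn

def shared_language_attention(y, lang_ids, shared_attn, suppress=0.1):
    """Sketch of Fig. 1(b): soft-separate the mixed embeddings by language,
    run one *shared* self-attention module per stream, then merge.
    y: (batch, seq, dim); lang_ids: (batch, seq) with 0/1 language labels;
    shared_attn: an nn.MultiheadAttention(batch_first=True) shared by both;
    suppress: the 0.1 soft-separation weight from Section 4.2."""
    mask_a = (lang_ids == 0).float().unsqueeze(-1)   # language-A positions
    mask_b = 1.0 - mask_a                            # language-B positions
    # Weighted soft separation: keep own-language embeddings, damp the other.
    y_a = y * (mask_a + suppress * mask_b)
    y_b = y * (mask_b + suppress * mask_a)
    ctx_a, _ = shared_attn(y_a, y_a, y_a)
    ctx_b, _ = shared_attn(y_b, y_b, y_b)
    # Merge back into one code-switching context representation by position.
    return ctx_a * mask_a + ctx_b * mask_b
```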

The method in Fig. 1(c) is similar to that in Fig. 1(b), except that the two language embedding sequences have their own independent self attention parameters, as sketched below. This model structure minimizes the mutual interference between the two languages. It is a more thorough way of calculating language-related attention and can theoretically make more effective use of monolingual data. After the self attention calculation, the context fusion of the two languages is the same as in Fig. 1(b). It is worth noting that the subsequent cross attention calculation is exactly the same as in the ordinary transformer model.
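Continuing the sketch above, the Fig. 1(c) variant only swaps the shared module for two independently parameterized ones; everything else, including the merge and the subsequent cross attention, is unchanged.

```python
import torch.nn as nn

# Fig. 1(c) variant of the sketch above: each language stream gets its own
# independently parameterized self-attention module.
attn_a = nn.MultiheadAttention(512, 4, batch_first=True)
attn_b = nn.MultiheadAttention(512, 4, batch_first=True)

def independent_language_attention(y_a, y_b, mask_a, mask_b):
    ctx_a, _ = attn_a(y_a, y_a, y_a)
    ctx_b, _ = attn_b(y_b, y_b, y_b)
    # Merge by position; the following cross attention is a standard one.
    return ctx_a * mask_a + ctx_b * mask_b
```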

4 Experiments

4.1 Datasets

All experiments are conducted on the ASRU 2019 Mandarin-English code-switching challenge dataset, which is designed for the Chinese-English code-switching ASR task. It consists of about 200 hours of code-switching data and 500 hours of Mandarin data [21]. The development set and test set each contain about 20 hours of data. All data were collected in quiet rooms with various Android phones and iPhones at a 16 kHz sampling rate. The transcripts cover many common domains, including entertainment, travel, daily life, and social interaction. In addition, we use a 460-hour subset of the Librispeech corpus [16] as the English data.

4.2 Experiment Setups

The input acoustic features of the acoustic encoder network are 40-dimensional filter-banks with a 25 ms window and a 10 ms frame shift. For the output targets, English word pieces and Chinese characters are adopted as the modeling units. The number of English word pieces is set to 1.5k. Chinese characters with more than 10 occurrences in the training set are kept as modeling units, about 3100 in total. Word pieces not only balance the granularity of the Chinese and English modeling units but also alleviate the out-of-vocabulary (OOV) problem given the limited English training data. The mix error rate (MER) is used to evaluate the experimental results; it is defined as the word error rate (WER) for English and the character error rate (CER) for Mandarin. This metric is widely adopted to evaluate Mandarin-English code-switching ASR systems.
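To make the metric concrete, here is a small sketch of how MER can be computed: Chinese is scored per character and English per word, with a standard edit distance over the mixed unit sequence. This is our own illustrative implementation, not the challenge's official scoring script.

```python
import re

def mixed_units(text):
    """Tokenize a Mandarin-English transcript into MER scoring units:
    one unit per Chinese character, one unit per English word."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z'\-]+", text)

def edit_distance(ref, hyp):
    """Standard Levenshtein distance over unit sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def mer(ref_text, hyp_text):
    """MER = edit distance over mixed units divided by reference length."""
    ref, hyp = mixed_units(ref_text), mixed_units(hyp_text)
    return edit_distance(ref, hyp) / max(len(ref), 1)
```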

All models are built on the transformer architecture. For the input acoustic features, two 2D CNN down-sampling layers with stride 2 are used, and the dimension of the subsequent linear layer is 512. Relative position encoding is used to model position information. The attention dimensions of the encoder and decoder are both 512 with 4 heads, and the dimension of the position-wise feed-forward networks is 1024. The numbers of acoustic encoder blocks and decoder blocks are 12 and 6 respectively. Our method needs to separate the two monolingual embeddings from the mixed embeddings. Among the many possible separation schemes, we adopt a weighted soft separation scheme in this work: one language's embedding is extracted by weighted suppression of the other language's embedding, with the weighting parameter set to 0.1. To further improve performance, the weighted sum of the connectionist temporal classification (CTC) loss and the cross-entropy loss is used as the final loss function. The weight of the CTC loss is set to 0.2, and the weight of the CTC score during decoding is set to 0.3.
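A minimal sketch of the joint loss follows, with the CTC weight 0.2 and label smoothing 0.1 from this section. Shapes follow PyTorch conventions; padding handling is omitted for brevity, and weighting the cross-entropy term by (1 − 0.2) is a common convention and our own assumption, so this is illustrative rather than the authors' exact code.

```python
import torch.nn as nn

ctc_weight = 0.2  # CTC loss weight from Section 4.2
ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
ce_criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # uniform smoothing, Section 4.2

def joint_loss(enc_log_probs, dec_logits, targets, input_lens, target_lens):
    """Weighted sum of CTC loss on the encoder and cross-entropy on the decoder.
    enc_log_probs: (T, batch, vocab) log-probabilities for CTC;
    dec_logits: (batch, S, vocab); targets: (batch, S).
    Padding positions are not masked here, for brevity."""
    ctc = ctc_criterion(enc_log_probs, targets, input_lens, target_lens)
    ce = ce_criterion(dec_logits.reshape(-1, dec_logits.size(-1)), targets.reshape(-1))
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce
```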

To avoid over-fitting, uniform label smoothing with a parameter of 0.1 is used. SpecAugment with frequency masking (F=30, mF=2) and time masking (T=40, mT=2) is applied to improve performance [17]. We also set the residual dropout to 0.1, applied to each sub-block before adding the residual information. We follow the optimization settings of [5] and train on 2 NVIDIA V100 GPUs with a batch size of 32. The learning rate follows a warm-up schedule. After training, the last 10 checkpoints are averaged to obtain the final model. We then decode with beam search using a beam size of 10.

model                 Dev                     Test
                      All     CH      EN      All     CH      EN
baseline              12.31   10.05   30.54   11.33   9.21    28.70
score re-weight       12.43   10.21   30.33   11.72   9.56    29.42
attention share       11.73   9.57    29.14   11.09   9.03    28.07
attention not share   11.49   9.38    28.51   10.97   8.91    27.91
Table 1: The MER/CER/WER (%) of different methods trained with only code-switching data. CH is the CER of the Chinese part and EN is the WER of the English part in both the dev and test data. "score re-weight", "attention share", and "attention not share" correspond to the three attention adjustment schemes described above.
model         Data        Dev                     Test
                          All     CH      EN      All     CH      EN
Transformer   200h CS     12.31   10.05   30.54   11.33   9.21    28.70
              + 500h CH   11.73   9.19    32.15   10.79   8.43    30.16
              + 460h EN   12.22   10.19   28.62   11.37   9.53    26.41
              All         11.42   9.22    29.17   10.73   8.70    27.33
Table 2: The MER/CER/WER (%) of the transformer baseline with external monolingual data. "200h CS", "500h CH", "460h EN", and "All" respectively refer to code-switching data, Chinese data, English data, and all three combined.
model             Data        Dev                     Test
                              All     CH      EN      All     CH      EN
attention share   200h CS     11.73   9.57    29.14   11.09   9.03    28.07
                  + 500h CH   10.57   8.21    29.64   10.18   7.98    28.21
                  + 460h EN   11.52   9.61    26.88   10.93   9.05    26.37
                  All         10.33   8.25    27.08   9.91    7.95    26.01
Table 3: The MER/CER/WER (%) of the shared attention strategy with external monolingual data.
model                 Data        Dev                     Test
                                  All     CH      EN      All     CH      EN
attention not share   200h CS     11.49   9.38    28.51   10.97   8.91    27.91
                      + 500h CH   10.27   8.05    28.14   9.74    7.52    27.93
                      + 460h EN   11.36   9.51    26.31   10.71   8.97    25.02
                      All         10.17   8.14    26.58   9.51    7.61    25.10
Table 4: The MER/CER/WER (%) of the attention independent strategy with external monolingual data.

4.3 Results with Code-switching Data

To verify the performance of the proposed method with code-switching data, we compare our three attention mechanisms with the transformer baseline trained on the code-switching data alone. The results are shown in Table 1. The attention score re-weighting strategy slightly degrades the performance of the model, which may be because the artificial post-adjustment of the attention scores itself introduces attention confusion. The other two attention adjustment strategies improve performance to a certain extent. Because the amount of code-switching data is small, this experiment cannot fully reflect the advantage of our method; as analyzed above, its main advantage lies in its ability to use monolingual data.

4.4 Results with External Monolingual Data

To verify our model's ability to use monolingual data, we compare the baseline model and our method with additional monolingual data. Table 2 shows the results of the baseline model with external monolingual training data. It is clear that external monolingual data reduces the error rate of the corresponding language in code-switching. However, monolingual data damages the recognition performance of the other language to a certain extent, possibly because of mutual interference between the two languages' data. These results demonstrate that monolingual data does not always improve code-switching performance for the E2E model; the ordinary transformer is relatively inefficient at using monolingual data. When trained with all the data, the baseline achieves a 5.30% relative MER reduction compared with training on only the code-switching data.

We then run our methods with the same data configuration. Since the attention score re-weighting strategy is not satisfactory and space is limited, we only report the results of the other two methods. Table 3 shows the results of the shared attention strategy. Our method clearly achieves a greater improvement from external monolingual data than the baseline model. More importantly, the damage that one language's monolingual data causes to the other language's performance is greatly reduced. This shows that our method reduces the context interference between the two languages and improves the performance of the code-switching ASR model, and it demonstrates that our method uses monolingual data more effectively. When trained with all the data, the shared attention model achieves a 10.64% relative MER reduction compared with training on only the code-switching data.

Table 4 shows the results of the attention independent strategy. From these results, we can draw conclusions similar to those from Table 3. However, the strategy with independent attention for each language is more effective and achieves the best recognition performance, because the two languages model their contexts independently, which further reduces mutual interference. Overall, the proposed model provides up to a 13.31% relative reduction in MER compared with training on only the code-switching data, and an 11.37% relative reduction in MER compared with the baseline model. These results show that our method is an effective solution for code-switching ASR tasks.

5 Conclusion and Future Work

In this paper, we propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model based on the EC linguistic theory. To this end, three specific language-related attention implementations are proposed. The experimental results on the ASRU 2019 Mandarin-English code-switching challenge dataset show that our method consistently improves over the baseline and is an effective strategy for code-switching ASR tasks. In future work, we will explore more attention adjustment strategies, such as whether cross attention should be shared or not.

References

  • [1] Y. Bai, J. Yi, J. Tao, Z. Tian, and Z. Wen (2019) Learn spelling from teachers: transferring knowledge from language models to sequence-to-sequence speech recognition. In Interspeech 2019, pp. 3795–3799.
  • [2] R. M. Bhatt (1995) Code-switching and the functional head constraint. In Proceedings of the Eleventh Eastern States Conference on Linguistics, pp. 1–12.
  • [3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964.
  • [4] C. Chang, S. Chuang, and H. Lee (2019) Code-switching sentence generation by generative adversarial networks and its application to data augmentation. In Interspeech 2019, pp. 554–558.
  • [5] L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888.
  • [6] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.
  • [7] A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649.
  • [8] A. Joshi (1982) Processing of sentences with intra-sentential code-switching. In Coling 1982: Proceedings of the Ninth International Conference on Computational Linguistics.
  • [9] S. Kim and M. L. Seltzer (2018) Towards language-universal end-to-end speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4914–4918.
  • [10] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan (2019) Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes. In ICASSP 2019, pp. 5621–5625.
  • [11] K. Li, J. Li, G. Ye, R. Zhao, and Y. Gong (2019) Towards code-switching ASR for end-to-end CTC models. In ICASSP 2019, pp. 6076–6080.
  • [12] Y. Li and P. Fung (2012) Code-switch language model with inversion constraints for mixed language speech recognition. In Proceedings of COLING 2012, pp. 1671–1680.
  • [13] Y. Li and P. Fung (2013) Improved mixed language speech recognition using asymmetric acoustic model and language model with code-switch inversion constraints. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7368–7372.
  • [14] Y. Li and P. Fung (2014) Language modeling with functional head constraint for code switching speech recognition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 907–916.
  • [15] P. Muysken et al. (2000) Bilingual speech: a typology of code-mixing. Cambridge University Press.
  • [16] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In ICASSP 2015, pp. 5206–5210.
  • [17] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Interspeech 2019, pp. 2613–2617.
  • [18] C. W. Pfaff (1979) Constraints on language mixing: intrasentential code-switching and borrowing in Spanish/English. Language, pp. 291–318.
  • [19] A. Rosenberg, Y. Zhang, B. Ramabhadran, Y. Jia, P. Moreno, Y. Wu, and Z. Wu (2019) Speech recognition with augmented synthesized speech. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 996–1002.
  • [20] S. Shah, B. Abraham, G. R. M, S. Sitaram, and V. Joshi (2020) Learning to recognize code-switched speech without forgetting monolingual speech recognition. CoRR abs/2006.00782.
  • [21] X. Shi, Q. Feng, and L. Xie (2020) The ASRU 2019 Mandarin-English code-switching speech recognition challenge: open datasets, tracks, methods and results. CoRR abs/2007.05916.
  • [22] A. Sriram, H. Jun, S. Satheesh, and A. Coates (2018) Cold fusion: training seq2seq models together with language models. In Interspeech 2018, pp. 387–391.
  • [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
  • [24] E. Yilmaz, H. van den Heuvel, and D. A. van Leeuwen (2018) Acoustic and textual data augmentation for improved ASR of code-switching speech. In Interspeech 2018, pp. 1933–1937.
  • [25] D. Zhao, T. N. Sainath, D. Rybach, P. Rondon, D. Bhatia, B. Li, and R. Pang (2019) Shallow-fusion end-to-end contextual biasing. In Interspeech 2019, pp. 1418–1422.