Code-Switching Detection Using ASR-Generated Language Posteriors

06/19/2019 ∙ by Qinyi Wang, et al. ∙ National University of Singapore

Code-switching (CS) detection refers to the automatic detection of language switches in code-mixed utterances. This task can be achieved by using a CS automatic speech recognition (ASR) system that can handle such language switches. In our previous work, we investigated the code-switching detection performance of a Frisian-Dutch CS ASR system by using the time alignment of the most likely hypothesis and found that this technique suffers from over-switching due to numerous very short spurious language switches. In this paper, we propose a novel method for CS detection aiming to remedy this shortcoming by using language posteriors, which are the sums of the frame-level posteriors of phones belonging to the same language. The CS ASR-generated language posteriors contain more complete language-specific information at the frame level than the time alignment of the ASR output. Hence, they are expected to yield more accurate and robust CS detection. The CS detection experiments demonstrate that the proposed language posterior-based approach provides higher detection accuracy than the baseline system in terms of equal error rate. Moreover, a detailed CS detection error analysis reveals that using language posteriors reduces the false alarms and results in more robust CS detection.

1 Introduction

Code-switching (CS), the alternating use of two or more languages in a single conversation, is a common phenomenon in multilingual communities. There is increasing research interest in developing CS automatic speech recognition (ASR) systems [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] as most of the off-the-shelf systems are monolingual and cannot handle code-switched speech. Our previous research has focused on developing an all-in-one CS ASR system using a Frisian-Dutch bilingual acoustic and language model that allows language switches [11, 15].

Performing CS detection on code-switched speech automatically determines the points of code-switching and the language identity of each word in the code-switched utterances, which can eventually be used for downstream tasks such as spoken term detection and improving CS ASR performance. Of the two main types of code-switching, namely inter-sentential (which occurs between sentences) and intra-sentential (which occurs within a single sentence) [16], the latter is more challenging to detect due to the shorter duration between the CS points.

Despite considerable research effort on language recognition (LR) and diarization, there is little previous work available on CS detection [17, 18, 15, 19, 20]. It is worth mentioning that the CS detection task is more challenging than the standard LR experimental setting due to: (1) considerably shorter monolingual segment duration (i.e., intra-sentential CS durations can be as short as a few seconds) and (2) uncertainty over the language boundaries.

One previous work on detecting language switches runs multiple monolingual ASR systems in parallel and assigns each word the language of the system with the highest likelihood score [20]. In our earlier work, we described a CS detection technique which uses a Frisian-Dutch CS ASR system to recognize the most likely transcription of each utterance and detect language switches based on the time alignment of the ASR output [15].

In our latest work, we investigated the code-switching detection performance of data-augmented CS ASR systems and observed that this technique suffers from over-switching due to numerous very short spurious language switches [21]. To cope with this shortcoming, this paper introduces a new method for code-switching detection which uses frame-level language posteriors generated by a CS ASR system. This technique obtains a posterior probability for each language by summing the posteriors of the phones belonging to that language. A frame-level decision is then made by choosing the language with the highest language posterior.

We present CS detection results on the FAME! corpus [22] using three detection systems: (1) the baseline system relying on the time alignment of the ASR output, (2) an intermediate system which makes a decision based on the maximum phone posterior, and (3) the proposed language posterior-based technique. After reporting the detection performance, we analyze the language switches hypothesized by each technique, reporting the total number of hypothesized switches and the duration distribution of the monolingual segments to give further insight into their quality. The proposed technique not only gives lower equal error rates than the baseline CS detection technique but also hypothesizes language switches that most closely resemble the human annotations.

The rest of the paper is organized as follows. Section 2 describes the Frisian-Dutch radio broadcast database. Section 3 details the baseline and proposed CS detection techniques, and Section 4 summarizes the experimental setup and implementation details. Section 5 presents and discusses the CS detection results and analyses before the conclusion in Section 6.

2 Frisian-Dutch Radio Broadcast Database

West Frisian is one of the three Frisian languages (together with East and North Frisian, spoken in Germany) and has approximately half a million speakers, most of whom live in the province of Fryslân in the northwest of the Netherlands. Native speakers of West Frisian (Frisian henceforth) are mostly bilingual and often code-switch in daily conversations due to the extensive influence of Dutch [23].

The bilingual FAME! speech database has been collected in the scope of the Frisian Audio Mining Enterprise project and contains radio broadcasts in Frisian and Dutch from the archive of the regional public broadcaster Omrop Fryslân (Frisian Broadcast Organization). This bilingual data contains Frisian-only and Dutch-only utterances as well as mixed utterances with inter-sentential, intra-sentential and intra-word CS. These recordings include language switching and speaker diversity, and cover a large time span (1966–2015). The longitudinal and bilingual nature of the material enables research into language variation in Frisian over the years, formal versus informal speech, language change across the life-span, dialectology, code-switching trends, and speaker tracking and diarization over a long time period. For further details, we refer the reader to [22].

3 CS detection techniques

The code-switching detection task involves detecting language boundaries (switching points) and identifying the language of the preceding and following subsegments in a code-switched utterance. The following subsections detail the baseline CS detection technique, which uses the time alignment of the most likely ASR hypothesis, and the techniques relying on phone and language posteriors.

3.1 Baseline approach: time alignment of CS ASR output

One way of obtaining frame-level language labels hypothesized by a CS ASR system is to align the most likely hypothesis and assign each frame a language label using the language tags appended to the words. The CS ASR system employs a bilingual acoustic model that captures the phonetic characteristics of both languages and a bilingual language model (LM) which can assign probabilities to code-mixed word sequences as well as monolingual word sequences from both languages. The current system uses the data-augmented models described in [24]. The acoustic model is trained on automatically transcribed data from the same archive and a large amount of monolingual data from the high-resourced language (Dutch), together with the manually transcribed data from the FAME! corpus. Moreover, since CS text is almost nonexistent, we have generated it in multiple ways, providing perplexity reductions on the development and test set transcriptions. The data-augmented models have been shown to provide better CS detection in terms of equal error rate, but tend to hypothesize many more language switches than found in the human annotations [21].
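For illustration, expanding such a time alignment into per-frame language labels can be sketched as below. The tuple-based alignment format, the helper name and the 10 ms frame shift are assumptions made for this sketch; the actual system reads Kaldi-produced phone-level .ctm alignments.

```python
def alignment_to_frame_labels(word_alignment, frame_shift=0.01):
    """Expand a time alignment of language-tagged words into per-frame
    language labels.

    word_alignment: list of (start_sec, duration_sec, lang) tuples,
    assumed sorted and starting at time 0 (hypothetical format).
    Frames not covered by any word keep a silence label.
    """
    last_start, last_dur, _ = word_alignment[-1]
    n_frames = round((last_start + last_dur) / frame_shift)
    labels = ["SIL"] * n_frames  # silence as the default third class
    for start, dur, lang in word_alignment:
        first = round(start / frame_shift)
        last = round((start + dur) / frame_shift)
        for f in range(first, last):
            labels[f] = lang
    return labels
```

A 20 ms Frisian word followed by a 30 ms Dutch word would thus yield two Frisian frames followed by three Dutch frames at a 10 ms frame shift.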

3.2 Proposed approach: language posteriors

Rather than relying on the most likely hypothesis, code-switching detection can also be performed at the frame level using phone posteriors. Since each phone carries a language tag, the posteriors of the phones tagged with the same language are summed to obtain a posterior probability for each language. The CS ASR-generated language posteriors contain more complete language-specific information at the frame level than the time alignment of the most likely hypothesis. Therefore, using language posteriors is expected to yield more accurate and robust CS detection. Silence is treated as a third class as it does not belong to either language. To reduce confusion between acoustically similar phones of the two languages, a phone LM is incorporated during phone posterior extraction.

The language decision per frame is made in two ways, based on: (1) the language tag of the phone with the highest posterior and (2) the highest language posterior. Figure 1 illustrates the phone posteriors for a code-switched utterance together with the corresponding language decisions based on the maximum phone and language posteriors. As can be seen from the example, CS detection based on the maximum phone posterior is more susceptible to uncertainty between phones than the maximum language posterior. Summing the posterior probabilities assigned to all phones of a language yields more reliable evidence for the language identity, resulting in more robust CS detection.
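A minimal sketch of both decision rules, assuming a toy phone inventory with language tags (the real system uses the full bilingual Frisian-Dutch phone set plus silence):

```python
import numpy as np

# Hypothetical phone-to-language map for illustration only.
PHONE_LANG = {"s_fry": "FRY", "a_fry": "FRY", "s_nl": "NL", "a_nl": "NL", "sil": "SIL"}

def language_posteriors(phone_post, phone_ids):
    """Sum frame-level phone posteriors per language.

    phone_post: (T, P) array of per-frame phone posteriors.
    phone_ids:  list of P phone names indexing the columns.
    """
    langs = sorted(set(PHONE_LANG.values()))
    lang_post = np.zeros((phone_post.shape[0], len(langs)))
    for p, phone in enumerate(phone_ids):
        lang_post[:, langs.index(PHONE_LANG[phone])] += phone_post[:, p]
    return langs, lang_post

def frame_decisions(phone_post, phone_ids):
    # (1) language tag of the phone with the highest posterior
    by_phone = [PHONE_LANG[phone_ids[i]] for i in phone_post.argmax(axis=1)]
    # (2) highest summed language posterior
    langs, lang_post = language_posteriors(phone_post, phone_ids)
    by_lang = [langs[i] for i in lang_post.argmax(axis=1)]
    return by_phone, by_lang
```

For a frame with posteriors 0.3 and 0.3 on two Frisian phones and 0.35 on a Dutch phone, rule (1) picks Dutch while rule (2) picks Frisian (0.6 vs. 0.35), illustrating why the summed language posterior is the more reliable evidence.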

(a) Phone posteriors
(b) Language posteriors
Figure 1: Phone posteriors of a code-switched utterance and the corresponding language posteriors obtained by summing all same-language phone posteriors
Figure 2: Performance of code-switching detection systems implemented with different methods on FAME! development and test data

4 Experimental Setup

4.1 Speech and text data

The training data of the FAME! speech corpus comprises 8.5 hours and 3 hours of speech from Frisian and Dutch speakers respectively. The development and test sets consist of 1 hour of speech from Frisian speakers and 20 minutes of speech from Dutch speakers each. All speech data has a sampling frequency of 16 kHz. The amount of automatically annotated speech data extracted from the target broadcast archive is 125.5 hours.

Monolingual Dutch speech data comprises the complete Dutch and Flemish (language variety spoken in Belgium) components of the Spoken Dutch Corpus (CGN) [25] that contains diverse speech material including conversations, interviews, lectures, debates, read speech and broadcast news. This corpus contains 442.5 and 307.5 hours of Dutch and Flemish data respectively.

The bilingual text corpus used for LM training contains 107.3M words in total: monolingual Frisian text with 37M words, monolingual Dutch text with 8.8M words, and automatically generated CS text with 61.5M words. Almost all Frisian text is extracted from monolingual resources such as Frisian novels, news articles and Wikipedia articles. The Dutch text is extracted from the transcriptions of the CGN speech corpus, which have been found to be very effective for LM training compared to text extracted from written sources. The transcriptions of the FAME! training data are the only source of real CS text and contain 140k words. The remaining CS text is automatically generated as described in [24].

4.2 Implementation details

The CS ASR system used in these experiments is trained using the Kaldi ASR toolkit [26]. We train a conventional context-dependent Gaussian mixture model-hidden Markov model (GMM-HMM) system with 40k Gaussians using 39-dimensional mel-frequency cepstral coefficient (MFCC) features including the deltas and delta-deltas to obtain the alignments for training a lattice-free maximum mutual information (LF-MMI) [27] TDNN-LSTM [28] acoustic model (1 standard, 6 time-delay and 3 LSTM layers). We use 40-dimensional MFCC features combined with i-vectors for speaker adaptation [29]. The LM used for the baseline CS detection system is a standard bilingual 3-gram with interpolated Kneser-Ney smoothing. Further details are provided in [24]. We compute phone posteriors from the denominator graph (created using a phone LM estimated from the phone alignments of the training data) of the chain model and map them to phones using the existing implementation in Kaldi (nnet3-chain-compute-post). The output obtained for each frame is normalized and the resulting normalized vectors are used for CS detection.

The bilingual lexicon contains 110k Frisian and Dutch words. The number of entries in the lexicon is approximately 160k due to words with multiple phonetic transcriptions. In this version of the CS ASR system, we have updated the spelling of certain Frisian words in the text corpora, pronunciation lexicon and transcriptions according to the latest spelling rules proposed in 2016 by the Fryske Akademy, which is the main difference compared to the previous system in [24]. The phonetic transcriptions of words that do not appear in the initial lexicons are learned by applying grapheme-to-phoneme (G2P) bootstrapping [30, 31]. The lexicon learning is carried out only for words that appear in the training data, using the G2P model learned on the corresponding language. We use the Phonetisaurus G2P system [32] for creating phonetic transcriptions. This CS ASR system provides a word error rate of 24.9% and 23.0% on the development and test sets of the FAME! speech corpus, respectively.

CS Detection System     Development   Test
Baseline [21]               9.7        6.3
Phone posterior             9.2        5.2
Language posterior          8.7        4.8
Table 1: EER (%) provided by different CS detection systems on the development and test data

4.3 CS detection experiments

For the baseline CS detection system, we trained monolingual Frisian and Dutch LMs and interpolated between them with varying weights, effectively varying the prior for the detected language. The most likely hypothesis is obtained for each utterance using each interpolated LM, and its phone-level time alignment is stored in .ctm format. By comparing these alignments with the ground-truth word-level alignments (obtained by applying forced alignment using the recognizer), a duration-based CS detection accuracy metric is calculated. The missed Frisian (Dutch) time is the ratio of the total duration of frames with a Frisian (Dutch) tag in the reference alignment that are aligned to frames without a Frisian (Dutch) tag, to the total duration of frames with a Frisian (Dutch) tag in the reference alignment.
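The duration-based miss computation can be sketched as follows; per-frame labels stand in for the aligned durations, since with a fixed frame shift the duration ratio equals the frame-count ratio. The label strings are hypothetical placeholders.

```python
import numpy as np

def missed_language_time(ref_labels, hyp_labels, lang):
    """Duration-based miss rate for one language at the frame level.

    ref_labels, hyp_labels: equal-length sequences of per-frame language
    tags (e.g. "FRY", "NL", "SIL") from the reference and hypothesized
    alignments. With a constant frame shift, the frame-count ratio equals
    the duration ratio described in the text.
    """
    ref = np.asarray(ref_labels)
    hyp = np.asarray(hyp_labels)
    in_ref = ref == lang                 # frames tagged `lang` in the reference
    if not in_ref.any():
        return 0.0
    missed = in_ref & (hyp != lang)      # reference `lang` frames tagged otherwise
    return missed.sum() / in_ref.sum()
```

For example, if 2 of 8 reference Frisian frames are hypothesized as Dutch, the missed Frisian time is 0.25.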

The same procedure is followed for the approaches using phone and language posteriors, except that interpolated phone LMs with varying weights are used to manipulate the language priors. After extracting the phone posteriors for each utterance using the different phone LMs, a language label is assigned to each frame based on the maximum phone or language posterior. These frame-level language labels are stacked in a vector for each utterance and later converted to a .ctm file which marks the start time and duration of each monolingual segment. The same duration-based metric is used to evaluate the CS detection accuracy.
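Collapsing the stacked frame-level labels into .ctm-style segments can be sketched as below; the 10 ms frame shift is an assumption for this sketch.

```python
def labels_to_segments(labels, frame_shift=0.01):
    """Collapse per-frame language labels into (start_sec, duration_sec,
    lang) segments, i.e. the information stored per line in a .ctm-style
    file. frame_shift is the assumed frame period in seconds."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # close the current segment at the end or at a label change
        if i == len(labels) or labels[i] != labels[start]:
            segments.append(
                (start * frame_shift, (i - start) * frame_shift, labels[start])
            )
            start = i
    return segments
```

Two Frisian frames followed by three Dutch frames thus become one 20 ms Frisian segment and one 30 ms Dutch segment.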

The CS detection accuracy is evaluated by reporting equal error rates (EER) calculated from the detection error tradeoff (DET) graph [33] plotted to visualize the CS detection performance. In previous work, we observed that this duration-based CS detection metric penalizes very short erroneous language switches less than incorrect language tags assigned over longer segments, which gives a better indication of the general language recognition capability of the corresponding ASR system. Therefore, we further analyze the number of hypothesized language switches and the duration distribution of the monolingual segments to gain insight into the CS detection behavior of each system.
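Reading the EER off sampled DET operating points can be sketched generically as below; this is an illustrative linear interpolation between the sampled operating points (here produced by the LM interpolation-weight sweep), not the paper's exact evaluation code.

```python
import numpy as np

def equal_error_rate(miss_rates, fa_rates):
    """Approximate the EER from a DET curve given as paired miss and
    false-alarm rates sampled along an operating-point sweep. Returns the
    rate at the crossing point miss == false alarm, linearly interpolated
    between the two nearest samples."""
    miss = np.asarray(miss_rates, dtype=float)
    fa = np.asarray(fa_rates, dtype=float)
    diff = miss - fa
    for i in range(len(diff) - 1):
        if diff[i] == 0.0:               # exact crossing at a sample
            return float(miss[i])
        if diff[i] * diff[i + 1] < 0:    # sign change between samples
            t = diff[i] / (diff[i] - diff[i + 1])
            return float(miss[i] + t * (miss[i + 1] - miss[i]))
    # no crossing observed: fall back to the closest operating point
    return float(miss[np.argmin(np.abs(diff))])
```

For instance, a sweep sampling (miss, false alarm) pairs (0.20, 0.02), (0.10, 0.10), (0.05, 0.30) yields an EER of 0.10.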

5 Results and Discussion

The EERs and DET curves provided by the three CS detection systems are given in Table 1 and Figure 2, respectively. The baseline CS detection system has an EER of 9.7% on the development and 6.3% on the test set. CS detection based on the language tag of the phone with the maximum phone posterior reduces the EER to 9.2% on the development and 5.2% on the test data. The lowest EERs are given by the system using the language posteriors. The consistent improvements on the development and test sets indicate improved overall CS detection performance over the baseline and the phone posterior-based system.

We further compare the number of language switches hypothesized by each CS detection system with the manually annotated switches in Figure 3. Additionally, histograms of the duration distributions of the monolingual speech segments are shown in Figure 4. These plots reveal that all CS detection systems tend to overestimate the number of language switches in code-switched speech. These false alarms are mainly due to a large number of hypothesized monolingual segments shorter than 2 seconds, as shown in Figure 4. Using language posteriors helps reduce the number of false alarms on both sets.

In general, the proposed language posterior-based CS detection system provides higher overall CS detection accuracy than the baseline technique using the time alignment of the most likely hypothesis. Although all systems still suffer from false alarms (incorrectly hypothesizing very short language switches), the language posterior-based system alleviates this problem, indicating that it is more robust than the other systems. Reducing these false alarms further remains as future work.

Figure 3: Hypothesized language switch counts
Figure 4: Duration distribution of monolingual segments

6 Conclusion

This paper addresses the code-switching (CS) detection problem and introduces a new method that uses frame-level language posteriors produced by a CS ASR system. The language posteriors are obtained by summing the posteriors of same-language phones extracted using the bilingual acoustic model in conjunction with a phone language model. The performance of this CS detection system is compared with a baseline system that uses the time alignment of the most likely hypothesis. The CS detection experiments indicate that the proposed system provides the lowest EERs on the development and test sets of the FAME! corpus. We also demonstrate that using language posteriors for CS detection yields more robust detection with a considerably reduced number of false alarms due to incorrectly hypothesized very short language switches. Future work includes investigating smoothing techniques to address the overestimation problem and exploiting the CS detection results to improve CS ASR performance.

7 Acknowledgements

This research is supported by the National Research Foundation through the AI Singapore Programme, the AI Speech Lab: Automatic Speech Recognition for Public Service Project AISG-100E-2018-006.

References

  • [1] Y. Li and P. Fung, “Code switching language model with translation constraint for mixed language speech recognition,” in Proc. COLING, Dec. 2012, pp. 1671–1680.
  • [2] H. Adel, N. Vu, F. Kraus, T. Schlippe, H. Li, and T. Schultz, “Recurrent neural network language modeling for code switching conversational speech,” in Proc. ICASSP, 2013, pp. 8411–8415.
  • [3] H. Adel, K. Kirchhoff, D. Telaar, N. T. Vu, T. Schlippe, and T. Schultz, “Features for factored language models for code-switching speech,” in Proc. SLTU, May 2014, pp. 32–38.
  • [4] Z. Zeng, H. Xu, T. Y. Chong, E.-S. Chng, and H. Li, “Improving N-gram language modeling for code-switching speech recognition,” in Proc. APSIPA ASC, 2017, pp. 1–6.
  • [5] I. Hamed, M. Elmahdy, and S. Abdennadher, “Building a first language model for code-switch Arabic-English,” Procedia Computer Science, vol. 117, pp. 208–216, 2017.
  • [6] E. van der Westhuizen and T. Niesler, “Synthesising isiZulu-English code-switch bigrams using word embeddings,” in Proc. INTERSPEECH, 2017, pp. 72–76.
  • [7] G. Stemmer, E. Nöth, and H. Niemann, “Acoustic modeling of foreign words in a German speech recognition system,” in Proc. EUROSPEECH, 2001, pp. 2745–2748.
  • [8] D.-C. Lyu, R.-Y. Lyu, Y.-C. Chiang, and C.-N. Hsu, “Speech recognition on code-switching among the Chinese dialects,” in Proc. ICASSP, vol. 1, May 2006, pp. 1105–1108.
  • [9] N. T. Vu, D.-C. Lyu, J. Weiner, D. Telaar, T. Schlippe, F. Blaicher, E.-S. Chng, T. Schultz, and H. Li, “A first speech recognition system for Mandarin-English code-switch conversational speech,” in Proc. ICASSP, March 2012, pp. 4889–4892.
  • [10] T. I. Modipa, M. H. Davel, and F. De Wet, “Implications of Sepedi/English code switching for ASR systems,” in Pattern Recognition Association of South Africa, 2015, pp. 112–117.
  • [11] E. Yılmaz, H. Van den Heuvel, and D. A. Van Leeuwen, “Investigating bilingual deep neural networks for automatic speech recognition of code-switching Frisian speech,” in Proc. SLTU, May 2016, pp. 159–166.
  • [12] J. Weiner, N. T. Vu, D. Telaar, F. Metze, T. Schultz, D.-C. Lyu, E.-S. Chng, and H. Li, “Integration of language identification into a recognition system for spoken conversations containing code-switches,” in Proc. SLTU, May 2012.
  • [13] D.-C. Lyu, E.-S. Chng, and H. Li, “Language diarization for code-switch conversational speech,” in Proc. ICASSP, May 2013, pp. 7314–7318.
  • [14] Y.-L. Yeong and T.-P. Tan, “Language identification of code switching sentences and multilingual sentences of under-resourced languages by using multi structural word information,” in Proc. INTERSPEECH, Sept. 2014, pp. 3052–3055.
  • [15] E. Yılmaz, H. van den Heuvel, and D. van Leeuwen, “Code-switching detection using multilingual DNNs,” in IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 610–616.
  • [16] C. Myers-Scotton, “Codeswitching with English: types of switching, types of communities,” World Englishes, vol. 8, no. 3, pp. 333–346, 1989.
  • [17] M. Volk and S. Clematide, “Detecting code-switching in a multilingual alpine heritage corpus,” in CodeSwitch@EMNLP, 2014.
  • [18] C.-H. Wu, H.-P. Shen, and C.-S. Hsu, “Code-switching event detection by using a latent language space model and the delta-bayesian information criterion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1892–1903, Nov 2015.
  • [19] E. Yılmaz, H. Van den Heuvel, and D. A. Van Leeuwen, “Exploiting untranscribed broadcast data for improved code-switching detection,” in Proc. INTERSPEECH, Aug. 2017, pp. 42–46.
  • [20] D. Amazouz, M. Adda-Decker, and L. Lamel, “Addressing code-switching in French/Algerian Arabic speech,” in Proc. INTERSPEECH, 2017, pp. 62–66.
  • [21] E. Yılmaz, H. van den Heuvel, and D. A. van Leeuwen, “Code-switching detection with data-augmented acoustic and language models,” in Proc. SLTU, 2018, pp. 127–131.
  • [22] E. Yılmaz, M. Andringa, S. Kingma, F. Van der Kuip, H. Van de Velde, F. Kampstra, J. Algra, H. Van den Heuvel, and D. Van Leeuwen, “A longitudinal bilingual Frisian-Dutch radio broadcast database designed for code-switching research,” in Proc. LREC, 2016, pp. 4666–4669.
  • [23] J. Popkema, Frisian Grammar: The Basics.   Afûk, Leeuwarden, 2013.
  • [24] E. Yılmaz, H. Van den Heuvel, and D. A. Van Leeuwen, “Acoustic and textual data augmentation for improved ASR of code-switching speech,” in Proc. INTERSPEECH, Sept. 2018, pp. 1933–1937.
  • [25] N. Oostdijk, “The spoken Dutch corpus: Overview and first evaluation,” in Proc. LREC, 2000, pp. 886–894.
  • [26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. ASRU, Dec. 2011.
  • [27] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. INTERSPEECH, 2016, pp. 2751–2755.
  • [28] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, “Low latency acoustic modeling using temporal convolution and LSTMs,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 373–377, March 2018.
  • [29] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. ASRU, Dec 2013, pp. 55–59.
  • [30] M. Davel and E. Barnard, “Bootstrapping for language resource generation,” in Pattern Recognition Association of South Africa, 2003, pp. 97–100.
  • [31] S. R. Maskey, A. B. Black, and L. M. Tomokiyo, “Bootstrapping phonetic lexicons for new languages,” in Proc. ICLSP, 2004, pp. 69–72.
  • [32] J. R. Novak, N. Minematsu, and K. Hirose, “Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework,” Natural Language Engineering, pp. 1–32, 2015.
  • [33] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET curve in assessment of detection task performance,” in Proc. Eurospeech, Sep. 1997, pp. 1895–1898.