South Africa is a multilingual society with 11 official languages, and since most South Africans are multilingual, code-switching (CS) occurs commonly in everyday conversation. Because CS is part of daily life, it also occurs frequently in radio and TV broadcasts, which makes broadcast archives valuable sources of CS speech data. We have recently compiled a CS speech corpus containing 14.3 hours of language-balanced speech drawn from soap opera broadcasts. Our research focuses on designing acoustic and language models that can operate on this type of multilingual CS speech.
The impact of CS and other kinds of language switches on the performance of speech-to-text systems has recently received research interest, resulting in several robust acoustic modeling [1, 2, 3, 4, 5, 6, 7, 8, 9] and language modeling [10, 11, 12, 13, 14, 15] approaches for CS speech. In previous work, the Radboud team has explored the ASR and code-switching detection performance of various acoustic models applied to CS Frisian-Dutch speech [16, 8, 9]. We further proposed ways of increasing the amount of available training speech data by applying various automatic transcription strategies [17, 18].
Meanwhile, the Stellenbosch team has been developing a CS ASR system for isiZulu-English CS speech with a focus on language modeling [7, 15]. The improvements in this system obtained by using speech data from different CS language pairs are explored in another submission to this conference.
In this work, we describe our joint efforts to build a 5-lingual ASR system that can recognize all five languages present in the South African Soap Opera Corpus, namely English, isiZulu, isiXhosa, Setswana and Sesotho. For this purpose, we develop acoustic and language models that enable language switches between these five languages. Unlike the prior work focusing on bilingual CS scenarios, this ASR system can hypothesize CS word sequences containing all five languages.
We first use monolingual and bilingual text to train a language model with words from all target languages. The only source of CS text is the transcriptions of the training speech data which contain 156k words in total. Then, we train several recently proposed acoustic models on the four different CS pairs present in the corpus and apply these to the combination of all development and test data. We report 5-lingual ASR accuracies and language recognition (LR) performance of the developed ASR system.
This paper is organized as follows. Section 2 introduces the demographics and the linguistic properties of the target South African languages. Section 3 summarizes the South African Soap Opera Corpus that has recently been collected for CS speech research. Section 4 describes the details of the 5-lingual CS ASR system. The experimental setup is described in Section 5 and the ASR and LR performance of the described ASR system is presented in Section 6. Section 7 concludes the paper.
2 Target South African Languages
The linguistic and phonetic properties of the target indigenous languages are given in the following paragraphs. All four languages belong to the Southern Bantu language family. isiZulu and isiXhosa are Nguni languages and linguistically similar, while Sesotho and Setswana belong to the Sotho group and are likewise closely related. All are agglutinative, tonal languages with click consonants, written in the Latin alphabet. The information presented in the coming paragraphs is extracted from Ethnologue (https://www.ethnologue.com/), the UCLA Language Materials Project (http://www.lmp.ucla.edu/) and Census documents (http://www.statssa.gov.za). We refer the reader to these sources (and the references therein) for further details.
The isiZulu language has 11.5M native and 15.5M second-language (L2) speakers, mostly living in South Africa. Like many other indigenous languages in South Africa, it has only relatively recently become a written language. isiZulu phonology is characterized by a simple five-vowel inventory and a highly marked consonantal system with ejectives, implosives and clicks. isiZulu has borrowed many words from other languages, especially Afrikaans and English.
The second language, isiXhosa, has 8M native and 11M L2 speakers, mostly living in South Africa. isiXhosa has 58 consonants, including 18 click consonants, 10 vowels and two tones. It is historically related to the Khoisan languages, i.e. the languages of southern African hunter-gatherer populations, from which it has borrowed many words, as it has later from English and Afrikaans.
Thirdly, Sesotho is spoken by 6M native and 8M L2 speakers in South Africa and Lesotho. It has 9 vowels and 39 consonants, including ejectives, clicks and a uvular trill. Various sound changes involving vowels and consonants are observed, including several kinds of assimilation, elision, vowel merging and devoicing. The fourth and final language, Setswana, is spoken by 5M native and 7.5M L2 speakers in South Africa and Botswana. It has 7 vowels and 29 consonants, 3 of which are clicks. There are two tones, which are not marked orthographically. Despite their high mutual intelligibility, Setswana and Sesotho are generally considered to be two separate languages.
3 South African Soap Opera Corpus
A multilingual corpus containing examples of CS speech has recently been compiled from 626 South African soap opera episodes. The ELAN media annotation tool was used to segment and annotate the data. The spontaneous nature of the speech and the presence of various CS types make this data particularly relevant for designing an ASR system that is expected to operate on CS speech from South Africa.
The corpus is still under development, and the version we used corresponds to the recently introduced language-balanced dataset with 14.3 hours of speech. The data contains examples of CS between South African English, isiZulu, isiXhosa, Setswana and Sesotho. The corresponding code-switch language pairs are referred to as English-isiZulu, English-isiXhosa, English-Setswana, and English-Sesotho. An overview of the statistics for the training (Train), development (Dev) and test (Test) sets for each language pair is given in Table 1. Each data set is described in terms of its total duration as well as the duration of the monolingual (m) and CS (cs) segments.
The soap opera speech is typically fast, spontaneous and often emotional, with a speech rate between 1.22 and 1.83 times higher than that of prompted speech in the same languages. Among the 10 343 code-switched utterances in the corpus, 19 207 intra-sentential language switches are observed. Insertional code-switching with English words is the most frequent type. Intra-word CS, which occurs when English words receive Bantu affixes in an effort to conform to Bantu phonology, is also observed. Note that the test utterances always contain CS and are never monolingual.
4 5-lingual CS ASR System
With the ultimate goal of an ASR system that can operate on all South African languages and correctly process language switches, we build a language model and an acoustic model that cover the word inventories of all five languages in our corpus. We apply ideas that have been shown to be useful in monolingual scenarios to our multilingual CS system, in order to determine the ASR performance that can be obtained using modern acoustic and language models.
For language modeling, both monolingual and CS text resources were used in quantities that varied with availability. CS text is practically non-existent, since CS hardly ever occurs in written resources. As is common, we therefore use the transcriptions of the CS training speech for language model training. Since this CS text constitutes a relatively small part of all available text data, we merge all of it to train a single 5-lingual CS language model (LM). Second, all monolingual text from the African languages is merged to train a 4-lingual LM. These models are then interpolated with an English monolingual LM trained on a much larger corpus. In pilot experiments, this strategy yielded the lowest perplexities on the transcriptions of the development data.
For training the acoustic models, we used only the balanced corpus, without including any other available monolingual corpora. On this small amount of data, we explore the performance of various NN architectures, including conventional fully connected DNNs, time-delay NNs (TDNN) [24, 25] and their combinations with recurrent architectures such as long short-term memory (LSTM) and bidirectional LSTM (BLSTM) layers, using different acoustic features. Since most of the target languages are tonal, we also extract pitch information and append it to the acoustic features.
Together with the ASR performance, we evaluate the LR accuracy of the described ASR system to gain insight into how well it can distinguish between the target languages, especially the most similar pairs (isiZulu-isiXhosa and Setswana-Sesotho). To achieve this, we analyze the confusion between the language tags assigned to each word during recognition.
5 Experimental Setup
| Text resource | # of words |
|---|---|
| All CS text | 156k |
| Sesotho (monolingual) | 0.2M |
5.1 Language Modeling
The available monolingual and CS text, in total number of words, is presented in Table 2. The texts used for monolingual LM training were collected from various sources, including online newspapers, magazines and newsletters (South African English, isiZulu, isiXhosa, Setswana), web text from the Leipzig Corpora Collection (South African English, isiZulu, isiXhosa, Sesotho), parliamentary bulletins (isiXhosa, Setswana), and the Babel corpus transcriptions (isiZulu).
The language models used in these experiments are a 5-lingual 3-gram for recognition and a 5-lingual 5-gram for lattice rescoring, both with interpolated Kneser-Ney smoothing. To obtain the final 3-gram LM we interpolate: (1) a CS 3-gram trained on all CS text, (2) a 4-lingual 3-gram trained on all monolingual text from the 4 African languages, and (3) an English 3-gram. The interpolation weights are learned on the transcriptions of the development data. We observed that assigning a relatively high weight (0.85-0.9) to the CS LM reduces the perplexities considerably. The final 3-gram model has perplexities of 412 and 617 on the development and test transcriptions respectively; for the final 5-gram model, the corresponding figures are 402 and 605.
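The interpolation scheme above can be sketched as follows. This is a minimal illustration rather than the actual LM tooling used for the experiments, and the split of the remaining weight between the African and English components is a hypothetical example; the text above only fixes the CS weight at roughly 0.85-0.9.

```python
import math

# Minimal sketch of linear LM interpolation. p_cs, p_afr and p_eng are
# per-word probabilities from the CS, 4-lingual African and English
# component models. The weights must sum to one; the 0.0625/0.0625 split
# of the non-CS mass is a hypothetical example, not a reported value.
def interpolate(p_cs, p_afr, p_eng, w_cs=0.875, w_afr=0.0625, w_eng=0.0625):
    return w_cs * p_cs + w_afr * p_afr + w_eng * p_eng

def perplexity(word_probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))
```

In practice the weights would be tuned to minimize the perplexity of the interpolated model on the development transcriptions.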
5.2 Acoustic Modeling
The recognition experiments are performed using the Kaldi ASR toolkit. We train a conventional context-dependent Gaussian mixture model-hidden Markov model (GMM-HMM) system with 25k Gaussians, using 39-dimensional mel-frequency cepstral coefficient (MFCC) features including deltas and delta-deltas, to obtain the alignments for training the NN models.
As a reference, DNNs with 6 hidden layers of 2048 sigmoid units each are trained on 40-dimensional log-mel filterbank (FBANK) features with deltas and delta-deltas. DNN training is performed by mini-batch stochastic gradient descent with an initial learning rate of 0.008 and a minibatch size of 256. The time context is 11 frames, obtained by concatenating each frame with its ±5 neighbouring frames. We further apply sequence training using a state-level minimum Bayes risk (sMBR) criterion.
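The ±5-frame context splicing can be illustrated with a short sketch; this is an assumption-level illustration of Kaldi-style frame splicing, not the toolkit's own implementation.

```python
import numpy as np

def splice(feats, context=5):
    """Concatenate each frame with its +/-`context` neighbours, producing a
    window of 2*context + 1 frames (11 frames for context=5). Edge frames
    are handled by repeating the first/last frame of the utterance."""
    pad_lo = np.repeat(feats[:1], context, axis=0)
    pad_hi = np.repeat(feats[-1:], context, axis=0)
    padded = np.concatenate([pad_lo, feats, pad_hi], axis=0)
    # one shifted view of the utterance per position in the window
    windows = [padded[i:i + len(feats)] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)
```

For 40-dimensional FBANK features with deltas and delta-deltas (120 dimensions per frame), an 11-frame context yields 1320 inputs per frame.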
In addition, we train TDNN (3 standard, 6 time-delay layers), TDNN-LSTM (1 standard, 6 time-delay and 3 LSTM layers) and TDNN-BLSTM (1 standard, 2 time-delay and 3 BLSTM layers) acoustic models with the lattice-free maximum mutual information (LF-MMI) criterion, following the standard recipe provided for the Switchboard database in the Kaldi toolkit (ver. 5.2.99). With these models, we use 40-dimensional MFCCs, together with 3-dimensional pitch features where mentioned. The training parameters provided in the recipe are used without any parameter tuning, and 3-fold data augmentation is applied to the training data.
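The 3-fold augmentation creates speed-perturbed copies of each training utterance, conventionally at factors 0.9, 1.0 and 1.1. A crude sketch of the idea follows; real pipelines use sox/Kaldi speed perturbation rather than this linear-interpolation stand-in.

```python
import numpy as np

def speed_perturb(signal, factor):
    """Resample a waveform so it plays `factor` times faster; a crude
    linear-interpolation stand-in for sox-style speed perturbation."""
    n_out = int(round(len(signal) / factor))
    positions = np.arange(n_out) * factor  # sample points in the original
    return np.interp(positions, np.arange(len(signal)), signal)

def augment_3fold(signal, factors=(0.9, 1.0, 1.1)):
    """Return one copy of the utterance per speed factor (3-fold augmentation)."""
    return [speed_perturb(signal, f) for f in factors]
```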
5.3 Pronunciation Lexicon
The pronunciation lexicon contains 23 453 words, of which 5965 are English, 7448 isiZulu, 5975 isiXhosa, 1625 Setswana and 2437 Sesotho. Due to pronunciation variants, the lexicon contains 30 489 entries in total. The phonetic alphabet contains a total of 284 phones, of which 45 are English, 49 isiZulu, 66 isiXhosa, 59 Setswana and 65 Sesotho. The ASR experiments are closed-vocabulary, implying that there are no out-of-vocabulary words in the development and test data.
6 Results and Discussion
We present the results of the ASR experiments in this section. The ASR quality is quantified by the word error rate (WER), calculated with the language tags of the words removed. The language recognition performance is evaluated in the form of a confusion matrix, with an extra row and column to accommodate insertion and deletion errors.
Table 4: Language-specific WER (%) for each CS pair on the development and test sets, with the number of words given in parentheses.

| CS pair | All (Dev) | All (Test) | English (Dev) | English (Test) | African (Dev) | African (Test) |
|---|---|---|---|---|---|---|
| English-isiZulu | 44.2 (1572) | 52.6 (5658) | 33.7 (838) | 38.6 (2459) | 56.1 (734) | 63.3 (3199) |
| English-isiXhosa | 51.9 (2300) | 62.8 (2651) | 39.6 (1153) | 44.9 (1149) | 64.3 (1147) | 76.5 (1502) |
| English-Setswana | 56.7 (3707) | 52.3 (4939) | 38.0 (1170) | 33.9 (1970) | 65.4 (2537) | 64.6 (2969) |
| English-Sesotho | 62.6 (3067) | 59.1 (4054) | 41.5 (843) | 41.9 (1794) | 70.6 (2224) | 72.8 (2260) |
6.1 ASR Results
First, we investigate the best performing NN architecture and then we move to a detailed analysis of the recognition accuracy on each language component. The ASR results obtained on the development and test sets using different acoustic models are presented in Table 3.
A conventional fully connected DNN provides a total WER of 66.5%, while using a TDNN model brings considerable improvements reducing the WER to 60.6%. Adding recurrent layers helps by further reducing the total WER to 58.8%. With a WER of 57.2%, the ASR system using bidirectional recurrent layers together with time-delay layers has the lowest WER among all the aforementioned acoustic models.
We further explore the impact of appending pitch features and of lattice rescoring with a larger n-gram language model. Using pitch features is common practice when building ASR systems for tonal languages; here it provides an absolute improvement of 1.7%, leading to a WER of 56.1%. 5-gram lattice rescoring brings a further marginal improvement, reducing the total WER to 55.6%. In the remaining experiments, we analyze the output of this ASR system to gain better insight into the language-specific ASR and language recognition performance.
The language-specific recognition results for each CS pair and dataset are shown in Table 4. These results reveal a large performance gap between English and the African languages. Since English is spoken in all 4 CS pairs, the amount of English training data is much larger than that for any of the other languages.
A second issue to be addressed is the poor performance on Sesotho. Even though Sesotho has a similar amount of acoustic training data to Setswana, very little text is available in this language for LM training (0.2M words, compared to the 2.8M of Setswana). This results in WERs higher than 70%, while the WER for the English words in the same pair is approximately 42%.
A final observation is that performance on the development and test data is similar for the Sotho languages (Setswana and Sesotho), implying that ASR performance on monolingual segments and segments with CS is comparable for these languages. It is worth remarking that the utterances in the test sets all contain CS, while some utterances in the development data (except for English-isiZulu) are monolingual and are therefore expected to be easier to recognize. Only isiXhosa shows the corresponding gap between the development and test sets, in contrast to the Sotho languages. We exclude English-isiZulu from this comparison, since neither of its sets contains monolingual utterances and the large difference in development and test set size may inflate the variance of its WERs.
6.2 LR Results
Using the language tags assigned to each word by the best-performing ASR system in Table 3, the confusion matrices shown in Table 5 are obtained for the development and test sets. Confusions between the isiZulu-isiXhosa and Setswana-Sesotho language pairs are marked in purple and green respectively, highlighting the cells where higher confusion is expected due to the similarity between the two languages.
Focusing first on isiZulu-isiXhosa, we see that the confusion occurs mostly in a single direction: many more isiXhosa words are identified as isiZulu than vice versa. For the second pair, Setswana-Sesotho, confusions occur in both directions in both the development and test sets, and the language recognition performance is significantly worse than for any other language pair. This can be explained by the greater linguistic similarity between these languages and the larger overlap in their phoneme sets and vocabularies. The scarcer acoustic and text resources further reduce the LR performance of the ASR system on the Sotho languages.
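The language-tag confusion counting used for this analysis can be sketched as below. The word-level alignment of hypothesis to reference is assumed to come from the ASR scoring; the label list and function name are illustrative, not the evaluation code actually used.

```python
from collections import Counter

LANGS = ["English", "isiZulu", "isiXhosa", "Setswana", "Sesotho"]

def confusion_matrix(aligned_tags):
    """Build a language confusion matrix from word-level alignments.

    aligned_tags: iterable of (ref_lang, hyp_lang) pairs produced by the
    ASR alignment; ref_lang is None for an inserted word and hyp_lang is
    None for a deleted one, which fills the extra row and column."""
    labels = LANGS + [None]  # the None row/column holds insertions/deletions
    counts = Counter(aligned_tags)
    return [[counts[(r, h)] for h in labels] for r in labels]
```

Off-diagonal cells in the isiZulu/isiXhosa and Setswana/Sesotho rows then directly quantify the cross-language confusions discussed above.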
7 Conclusion

We present a first 5-lingual CS ASR system designed to recognize CS speech in 5 South African languages. Using a recently compiled soap opera speech corpus, we explore how well modern NN-based acoustic models can deal with language switches given the limited resources available for the target languages. The language recognition performance implicit in the ASR output is also evaluated, with particular attention to the linguistically similar languages. We believe that these first findings are encouraging and provide insight into the challenges of building a unified CS system for multilingual countries such as South Africa.
Acknowledgements

This research is funded by the NWO Project 314-99-119 (Frisian Audio Mining Enterprise). The authors would like to thank the Department of Arts & Culture of the South African government for funding this research, as well as Stellenbosch University for the travel grant that enabled the first author’s visit to Stellenbosch.
-  G. Stemmer, E. Nöth, and H. Niemann, “Acoustic modeling of foreign words in a German speech recognition system,” in Proc. EUROSPEECH, 2001, pp. 2745–2748.
-  D.-C. Lyu, R.-Y. Lyu, Y.-C. Chiang, and C.-N. Hsu, “Speech recognition on code-switching among the Chinese dialects,” in Proc. ICASSP, vol. 1, May 2006, pp. 1105–1108.
-  N. T. Vu, D.-C. Lyu, J. Weiner, D. Telaar, T. Schlippe, F. Blaicher, E.-S. Chng, T. Schultz, and H. Li, “A first speech recognition system for Mandarin-English code-switch conversational speech,” in Proc. ICASSP, March 2012, pp. 4889–4892.
-  T. I. Modipa, M. H. Davel, and F. De Wet, “Implications of Sepedi/English code switching for ASR systems,” in Pattern Recognition Association of South Africa, 2015, pp. 112–117.
-  T. Lyudovyk and V. Pylypenko, “Code-switching speech recognition for closely related languages,” in Proc. SLTU, 2014, pp. 188–193.
-  C. H. Wu, H. P. Shen, and C. S. Hsu, “Code-switching event detection by using a latent language space model and the delta-Bayesian information criterion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1892–1903, Nov 2015.
-  E. van der Westhuizen and T. Niesler, “Automatic speech recognition of English-isiZulu code-switched speech from South African soap operas,” Procedia Computer Science, vol. 81, pp. 121–127, 2016.
-  E. Yılmaz, H. van den Heuvel, and D. A. van Leeuwen, “Investigating bilingual deep neural networks for automatic speech recognition of code-switching Frisian speech,” in Proc. SLTU, May 2016, pp. 159–166.
-  E. Yılmaz, H. van den Heuvel, and D. van Leeuwen, “Code-switching detection using multilingual DNNs,” in IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 610–616.
-  Y. Li and P. Fung, “Code switching language model with translation constraint for mixed language speech recognition,” in Proc. COLING, Dec. 2012, pp. 1671–1680.
-  H. Adel, N. T. Vu, F. Kraus, T. Schlippe, H. Li, and T. Schultz, “Recurrent neural network language modeling for code switching conversational speech,” in Proc. ICASSP, 2013, pp. 8411–8415.
-  H. Adel, K. Kirchhoff, D. Telaar, N. T. Vu, T. Schlippe, and T. Schultz, “Features for factored language models for code-switching speech,” in Proc. SLTU, May 2014, pp. 32–38.
-  Z. Zeng, H. Xu, T. Y. Chong, E.-S. Chng, and H. Li, “Improving N-gram language modeling for code-switching speech recognition,” in Proc. APSIPA ASC, 2017, pp. 1–6.
-  I. Hamed, M. Elmahdy, and S. Abdennadher, “Building a first language model for code-switch Arabic-English,” Procedia Computer Science, vol. 117, pp. 208–216, 2017.
-  E. van der Westhuizen and T. Niesler, “Synthesising isiZulu-English code-switch bigrams using word embeddings,” in Proc. INTERSPEECH, 2017, pp. 72–76.
-  E. Yılmaz, H. Van den Heuvel, and D. A. Van Leeuwen, “Exploiting untranscribed broadcast data for improved code-switching detection,” in Proc. INTERSPEECH, Aug. 2017, pp. 42–46.
-  E. Yılmaz, M. McLaren, H. Van den Heuvel, and D. A. Van Leeuwen, “Language diarization for semi-supervised bilingual acoustic model training,” in Proc. ASRU, Dec. 2017, pp. 91–96.
-  ——, “Semi-supervised acoustic model training for speech with code-switching,” Submitted to Speech Communication, 2018. [Online]. Available: goo.gl/NKiYAF
-  A. Biswas, F. de Wet, E. van der Westhuizen, E. Yılmaz, and T. Niesler, “Multilingual neural network acoustic modelling for ASR of under-resourced English-isiZulu code-switched speech,” in Proc. INTERSPEECH, 2018, Accepted for publication.
-  A. T. Cope, “Zulu phonology, tonology, and tonal grammar,” Ph.D. dissertation, Durban: University of Natal, 1966.
-  D. P. Kunene, “The sound system of Southern Sotho,” Ph.D. dissertation, Cape Town: University of Cape Town, 1961.
-  E. van der Westhuizen and T. Niesler, “A first South African corpus of multilingual code-switched soap opera speech,” in Proc. LREC, 2018, pp. 2854–2859.
-  P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes, “ELAN: a professional framework for multimodality research,” in Proc. LREC, 2006.
-  A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, Mar 1989.
-  V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proc. INTERSPEECH, 2015, pp. 3214–3218.
-  V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, “Low latency acoustic modeling using temporal convolution and LSTMs,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 373–377, March 2018.
-  C. Biemann, G. Heyer, U. Quasthoff, and M. Richter, “The Leipzig Corpora Collection: monolingual corpora of standard size,” in Proc. Corpus Linguistics, 2007.
-  R. Kneser and H. Ney, “Improved backing-off for M-gram language modeling,” in Proc. ICASSP, vol. I, 1995, pp. 181–184.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. ASRU, Dec. 2011.
-  K. Vesely, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Proc. INTERSPEECH, 2013, pp. 2345–2349.
-  D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. INTERSPEECH, 2016, pp. 2751–2755.
-  T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. INTERSPEECH, 2015, pp. 3586–3589.