Hindi-English Code-Switching Speech Corpus

09/24/2018 ∙ by Ganji Sreeram, et al. ∙ IIT Guwahati

Code-switching refers to the use of two languages within a sentence or discourse. It is a global phenomenon among multilingual communities and has emerged as an independent area of research. With the increasing demand for code-switching automatic speech recognition (ASR) systems, the development of a code-switching speech corpus has become highly desirable. However, very limited code-switched resources are available as yet for training such systems. In this work, we present our first efforts in building a code-switching ASR system in the Indian context. For that purpose, we have created a Hindi-English code-switching speech database. The database not only contains speech utterances with code-switching properties but also covers session and speaker variations such as pronunciation, accent, age, gender, etc. This database can be applied in several speech signal processing applications, such as code-switching ASR, language identification, language modeling, speech synthesis, etc. This paper mainly presents an analysis of the statistics of the collected code-switching speech corpus. The performance results for the ASR task on the created database are also reported.







1 Introduction

In multilingual communities, speakers often switch or mix between two or more languages or language varieties during communication in their day-to-day lives. In linguistics, this phenomenon is referred to as code-switching Gumperz_1982_Discourse ; nilep2006code . It poses some interesting research challenges in the speech recognition Lyu_2006_Speech ; Bhuvanagirir_2012_Mixed ; ahmed2012automatic , language identification Lyu_2008_Language and language modelling Cao_2010_Semantics ; Yeh_2010_Integrated ; hamed2017building domains. Over the years, due to urbanization and geographical distribution, people have moved from one place to another for a better livelihood. Hence, communicating in two or more languages helps to interact better with people from different places and cultures. There are many reasons for the occurrence of code-switching. People belonging to bilingual communities say that the main reason for code-switching is the lack of words in the vocabulary of their native language Grosjean_1982_Life . According to myers1993social ; malik1994socio ; milroy1995one ; dey2014hindi , some possible reasons for code-switching are: (i) to qualify the message by emphasizing specific words, (ii) to convey a personalized message, (iii) to maintain confidentiality during verbal communication, (iv) to show expertise, authority, status, etc. Another reason for code-switching is to enrich communication between speakers without any change in the situation. Hence, a native language speaker actively embeds meanings into the conversation by mixing in non-native language words su2001code . Based on the locations of the non-native words, code-switching can be broadly classified into two modes. When the switching happens within the sentence, it is referred to as intra-sentential code-switching, and when it happens predominantly at the sentence boundary, it is referred to as inter-sentential code-switching myers1989codeswitching . The intra-sentential mode of switching is a common phenomenon and has become an identifying characteristic of bilingual communities.

Over the years, the English language has become the most widely spoken language in the world. After India gained independence from British rule, the Indian constitution declared Hindi the primary official language, but English continued to be used as a secondary language owing to its dominance in administration, education, and law malhotra1980hindi . As a result, the urban population started a trend of communicating in English for economic and social purposes. Over the years, substantial code-switching to English while speaking Hindi, as well as many other Indian languages, has become a common feature kumar1986certain ; Smita2009 . Note that a large share of the Indian population are native Hindi speakers, and hence switching between Hindi and English is very common. Also, in the recent past, researchers have reported that the native language of the speaker influences foreign (non-native) language acquisition flege1995second . In India, English is taught in schools from the elementary level across the country, but very few schools are able to impart correct English pronunciation, devoid of native language influences, to their pupils. Recent works bali2014 ; Das_2015_Code have highlighted that the code-switching phenomenon is also observed in chats, comments, and messages posted on social media sites like Facebook, Twitter, WhatsApp, YouTube, etc. Table 1 shows a few example sentences of different modes of code-switching while highlighting the differences in the contextual information carried by the non-native words. In Type-1 intra-sentential code-switching, the non-native language words either occur in sequence or form a phrase, and thus carry some contextual information. In the Type-2 case, by contrast, the non-native language words are embedded into the native language sentences in such a manner that virtually no contextual information can be derived from those words. Also, during code-switching, we observe that the majority of the sentences belong to the Type-2 intra-sentential mode. However, due to the lack of domain-specific resources, the research activity is somewhat limited.
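The distinction between the switching modes can be illustrated with a short sketch. It assumes that Hindi words are written in Devanagari and English words in Latin script, so that the language of a token can be guessed from its Unicode range. The function names and the rule that a run of two or more consecutive English words counts as Type-1 are illustrative assumptions, not definitions taken from this work.

```python
import re

# Assumption: Hindi tokens contain at least one Devanagari character.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
PUNCT = ".,!?;:\"'()[]।"


def tag_tokens(sentence):
    """Return a language tag ('hi' or 'en') for every non-empty token."""
    tokens = [tok.strip(PUNCT) for tok in sentence.split()]
    return ["hi" if DEVANAGARI.search(tok) else "en" for tok in tokens if tok]


def english_runs(tags):
    """Return the lengths of maximal runs of consecutive English tokens."""
    runs, current = [], 0
    for tag in tags:
        if tag == "en":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def classify_sentence(sentence):
    """Label a sentence as monolingual or as a Type-1/Type-2 intra-sentential switch."""
    tags = tag_tokens(sentence)
    if not tags:
        return "empty"
    if len(set(tags)) == 1:
        return "monolingual-" + tags[0]
    # Assumed rule: a phrase-like run of >= 2 English words carries context (Type-1).
    longest = max(english_runs(tags))
    return "intra-type1" if longest >= 2 else "intra-type2"


def is_inter_sentential(sentences):
    """True when each sentence is monolingual but the language changes across sentences."""
    labels = [classify_sentence(s) for s in sentences]
    return all(l.startswith("monolingual") for l in labels) and len(set(labels)) > 1


if __name__ == "__main__":
    print(classify_sentence("मैं कल office जाऊंगा"))
    print(classify_sentence("मुझे machine learning पसंद है"))
    print(is_inter_sentential(["मैं ठीक हूँ", "How are you"]))
```

A script-based tagger of this kind fails on romanized Hindi, so real data would likely need a word-level language identifier instead.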

Monolingual automatic speech recognition (ASR) systems may be capable of recognizing a few words from a foreign language but are unable to handle a significant amount of code-switching in the data. On account of the existence of different variants of English pronunciation and the effects of code-switching, the development of an ASR system for Hindi-English (Hinglish) code-switching speech data is a challenging task. To the best of our knowledge, there is no large-sized Hinglish corpus available for carrying out such research. To address that constraint, we recently created a Hinglish corpus covering the typical sources of variation such as accent, session, channel, age, gender, etc. In this work, we describe the details of that corpus and also present a basic experimental evaluation performed on it.

Table 1: Example Hinglish sentences showing the inter-sentential code-switching and the variants of the intra-sentential code-switching. Type-1 and Type-2 variants of intra-sentential code-switching refer to high and low contextual information being carried by the non-native (English) words, respectively.

The remainder of this paper is organized as follows. In Section 2, we review the code-switching corpora currently reported in the literature. In Section 3, the Hinglish speech and text corpus is described, along with the lexical resources necessary for developing the Hinglish ASR system. Section 4 provides a statistical analysis of the database. The experimental evaluation using the created Hinglish corpus is presented in Section 5. The paper is concluded in Section 6.

2 Literature Review on Code-switching Corpora

In the literature, a few code-switching speech corpora have already been reported, covering different native and non-native language combinations. In the following, we briefly review those code-switching corpora and summarize their salient attributes.

  • The CUMIX Cantonese-English code-switching speech corpus was developed by Joyce Y. C. Chan, et al., at the Chinese University of Hong Kong chan2005development . It contains code-switched speech utterances read by the speakers. The database contains hours of data read by speakers.

  • A small Mandarin-Taiwanese code-switching speech corpus was developed for testing purpose in lyu2006speech by Dau-cheng Lyu and Ren-yuan Lyu. The corpus contains Mandarin-Taiwanese code-switching utterances recorded from speakers.

  • The English-Spanish code-switching speech corpus was compiled by Juan Carlos Franco and Thamar Solorio at the University of Texas franco2007baby . The corpus contains minutes of transcribed spontaneous conversations of speakers.

  • The SEAME is a Mandarin-English code-switching conversational speech corpus developed by Dau-Cheng Lyu and Tien Ping Tan from Nanyang Technological University, Singapore, and Universiti Sains Malaysia lyu2010seame ; vu2012first . The database contains hours of spontaneous Mandarin-English code-switching interview and conversational speech uttered by Singaporean and Malaysian speakers.

  • Han-Ping Shen, et al., developed the CECOS, a Chinese-English code-switching speech corpus at the National Cheng Kung University in Taiwan shen2011cecos . It contains hours of speech data collected from speakers uttering prompted code-switching sentences.

  • A small Hindi-English code-switching speech corpus was collected by Anik Dey and Pascale Fung at Hong Kong University of Science and Technology. This corpus is primarily made up of student interview speech dey2014hindi . It is about minutes of data collected from speakers.

  • A Sepedi-English code-switching speech corpus was created by the South African CSIR modipa2013implications . The database consists of hours of prompted speech, sourced from radio broadcasts and read by Sepedi speakers.

  • Emre Yılmaz, et al., developed FAME!, a Frisian-Dutch code-switching speech corpus of radio broadcast speech, at Radboud University, Nijmegen yilmaz2016longitudinal . The recordings are collected from the archives of Omrop Fryslan, the regional public broadcaster of the province of Fryslan. The database is longitudinal, covering a time span of many years.

  • The Malay-English corpus developed by Basem H. A. Ahmed, et al., consists of hours of Malaysian Malay-English code-switching speech data from Chinese, Malay and Indian speakers ahmed2012automatic .

  • MediaParl is a Swiss accented bilingual database developed by David Imseng, et al. It contains recordings in both French and German as they are spoken in Switzerland. The data was recorded at the Valais Parliament. Valais is a bilingual Swiss canton with many local accents and dialects imseng2012mediaparl .

  • FACST, a French-Arabic speech corpus, consists of recordings of read and conversational code-switching utterances by bilingual adult speakers who tend to code-switch in their daily lives amazouz2018french . It is about hours of data.

  • A South African speech corpus containing English-isiZulu, English-isiXhosa, English-Setswana, and English-Sesotho code-switching speech utterances was created from South African soap operas by Ewald van der Westhuizen and Thomas Niesler. Soap opera speech is typically fast and spontaneous and may express emotion, with a speech rate higher than that of prompted speech in the same languages van2018first .

  • An Egyptian Arabic-English code-switching speech corpus was recently developed by Injy Hamed, et al., by conducting interviews with participants hamed2018collection .

From the literature review, it can be noted that only very small code-switching acoustic and linguistic resources covering the Indian context have been available so far. This motivated us to create moderate-sized Hinglish resources so that current technological advances in acoustic and language modeling can be explored for the Hinglish ASR task.

3 Creation of Hinglish Corpus

This section describes the creation of the Hinglish (code-switching) corpus. First, we describe the context and the means employed for the creation of the Hinglish sentences. Second, we describe the procedure followed by the speakers while recording the speech data corresponding to those sentences. Finally, the creation of the lexical resources is discussed.

3.1 Hinglish text transcripts

For the experimental purpose, the Hinglish code-switching text data has been collected by crawling a few blogging websites having different contexts (https://shoutmehindi.com, https://notesinhinglish.blogspot.in, https://www.techyukti.com, http://www.learncpp.com). The crawled data is normalized into meaningful sentences and further processed to remove extra spaces, special characters, emoticons, etc. The data thus obtained is used for training the language models, creating the lexicon, and as the text transcription for recording the acoustic data. The salient details of the Hinglish code-switching text corpus are summarized in Table 2.


# sentences    # words                # unique words
               Hindi      English     Hindi     English
13,071         179,798    71,143      3,649     4,980
Table 2: The details of the vocabulary size and the word count of the Hinglish code-switching database involved in this study.
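The cleaning and counting steps behind Table 2 can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: it assumes that Hindi words appear in Devanagari and English words in Latin script, and the character whitelist, the function names `normalize` and `corpus_statistics`, and the lower-casing of English words are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: Hindi tokens contain at least one Devanagari character.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
# Assumed whitelist: Devanagari letters/signs/digits (the danda marks at
# U+0964-U+0965 are excluded), ASCII letters and digits, and whitespace.
# Everything else (emoticons, special characters) is replaced by a space.
NOT_ALLOWED = re.compile(r"[^\u0900-\u0963\u0966-\u097FA-Za-z0-9\s]")


def normalize(line):
    """Remove special characters and emoticons, then collapse extra spaces."""
    line = NOT_ALLOWED.sub(" ", line)
    return re.sub(r"\s+", " ", line).strip()


def corpus_statistics(lines):
    """Count sentences, words and unique words per language."""
    counts = {"hindi": Counter(), "english": Counter()}
    n_sentences = 0
    for line in lines:
        sentence = normalize(line)
        if not sentence:
            continue  # skip lines that are empty after cleaning
        n_sentences += 1
        for token in sentence.split():
            if DEVANAGARI.search(token):
                counts["hindi"][token] += 1
            else:
                counts["english"][token.lower()] += 1
    return {
        "sentences": n_sentences,
        "words": {lang: sum(c.values()) for lang, c in counts.items()},
        "unique_words": {lang: len(c) for lang, c in counts.items()},
    }


if __name__ == "__main__":
    sample = ["मुझे यह blog   पसंद है :) !!", "   ", "Blog अच्छा है 😀"]
    print(corpus_statistics(sample))
```

The returned dictionary has the same three groups of figures as the table: sentence count, per-language word counts, and per-language unique word counts.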

3.2 Hinglish speech corpus

Hinglish code-switching acoustic data is recorded over the phone from speakers belonging to different states in India. A consultant was hired to enroll the speakers, who called a toll-free number from their mobile phones. The speakers called from various acoustic environments such as home, office, etc. Each speaker was given unique sentences taken from the processed text data described above. These sentences are partitioned into groups of sentences, and each speaker is requested to record those groups in different sessions in order to capture session variations such as emotion, environment, etc. It is worth highlighting that the duration of the sentences given to each speaker varies. Each speaker took about minutes to complete recording the sentences in each session. On average, to complete recording all the sentences, each speaker took about minutes. The volunteering speakers were compensated for their time and effort.
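The prompt allocation described above can be sketched in a few lines. The speaker identifiers, the pool size, and the per-speaker and per-session counts used here are hypothetical placeholders, and the helper names `assign_prompts` and `partition_into_sessions` are introduced only for illustration.

```python
def assign_prompts(sentences, n_speakers, per_speaker):
    """Give every speaker a disjoint slice of the sentence pool (unique prompts)."""
    if n_speakers * per_speaker > len(sentences):
        raise ValueError("not enough sentences to give every speaker unique prompts")
    return {
        f"spk{idx + 1:03d}": sentences[idx * per_speaker:(idx + 1) * per_speaker]
        for idx in range(n_speakers)
    }


def partition_into_sessions(prompts, group_size):
    """Split one speaker's prompts into consecutive groups, one group per session."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [prompts[i:i + group_size] for i in range(0, len(prompts), group_size)]


if __name__ == "__main__":
    # Hypothetical sizes: 12 sentences, 3 speakers, 4 prompts each, 2 per session.
    pool = [f"sentence {i}" for i in range(12)]
    prompts = assign_prompts(pool, n_speakers=3, per_speaker=4)
    for speaker, sentences in prompts.items():
        print(speaker, partition_into_sessions(sentences, group_size=2))
```

Slicing the pool consecutively keeps the prompts of different speakers disjoint by construction, which is the property the recording protocol relies on.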

The speech data is recorded at kHz sampling frequency and a bit rate of . This set of speech files was manually inspected and pruned. At the end of the data collection phase, the Hinglish code-switching database contained utterances in total, spoken by 71 speakers.

3.3 Development of the lexical resources

To create a lexicon for developing an ASR system for Hinglish data, a unified phone list has been created for Hindi and English words. A unique word list is also extracted from the sentences obtained in Sub-section 3.1. The phone-level transcription of those words has been done manually. The lexicon thus created covers the pronunciation variations of the words.
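As a rough illustration of how such a lexicon can be assembled, the sketch below extracts the unique word list, attaches manually supplied pronunciations, and derives the unified phone list from them. The one-pronunciation-per-line layout (`word phone1 phone2 ...`) follows the common Kaldi-style lexicon convention; the phone symbols and the example pronunciations are hypothetical and are not taken from the actual lexicon.

```python
def unique_words(sentences):
    """Extract the sorted unique word list from whitespace-tokenized sentences."""
    return sorted({word for sentence in sentences for word in sentence.split()})


def build_lexicon(words, pronunciations):
    """Return lexicon lines and the unified phone list.

    `pronunciations` maps each word to a list of manually transcribed
    pronunciation variants, each variant being a list of phone symbols.
    """
    missing = [w for w in words if w not in pronunciations]
    if missing:
        raise KeyError(f"words without a manual transcription: {missing}")
    lines, phones = [], set()
    for word in words:
        for variant in pronunciations[word]:
            phones.update(variant)
            lines.append(f"{word} {' '.join(variant)}")
    return lines, sorted(phones)


if __name__ == "__main__":
    sentences = ["मैं office", "office है"]
    # Hypothetical manual transcriptions with two variants for the English word.
    manual = {
        "office": [["aa", "f", "i", "s"], ["o", "f", "i", "s"]],
        "मैं": [["m", "ai"]],
        "है": [["h", "ai"]],
    }
    lexicon, phone_list = build_lexicon(unique_words(sentences), manual)
    print("\n".join(lexicon))
    print(phone_list)
```

Listing each pronunciation variant on its own line is what lets a single lexicon capture accent-dependent variants of the same word.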

4 Statistical Analysis of the Database

This section provides the statistical analysis of the Hinglish code-switching speech corpus. The following subsection provides information about the speakers in the database. Later, a description of the size and linguistic features of the database is provided.

4.1 Speaker information

In order to collect the Hinglish code-switching speech data, the field data consultant recruited speakers from the Indian Institute of Technology Guwahati (IITG) who are natively from different states of India. A total of 71 speakers are involved in the development of this database. To build a robust ASR system for Hinglish code-switching data, we need a database that covers variations due to geographical distribution, gender, age, etc. With this aim, we have collected a database that covers all such variations. The details of the database are discussed below.

4.1.1 Geographical distribution

Since the speakers residing in IITG are from different states of India and from different geographical locations, diversity in the acoustic data is guaranteed. The geographical distribution of the speakers is shown as a pie-chart in Figure 1. The area-wise distribution of the speakers involved in this study is provided in Table 3.

Figure 1: Geographical distribution of the speakers involved in the collection of the database. It is worth highlighting that the collected database covers speakers from different states of India.

4.1.2 Age information

The Hinglish code-switching speech data has been recorded from speakers of different ages. The age distribution is shown as a bar diagram in Figure 2.

Figure 2: Age distribution of the speakers involved in the collection of the database (vertical axis: population).

4.1.3 Gender information

The Hinglish code-switching speech data is recorded from female and male speakers, resulting in a total of 71 speakers from different states of India. The gender distribution is shown as a pie chart in Figure 3.

Figure 3: Gender distribution of the speakers involved in creation of the Hinglish code-switching database.

East   14  15  29
West   05  02  07
North  16  06  22
South  09  04  13
Total  44  27  71
Table 3: The area- and gender-wise distribution of the speakers employed for the creation of the Hinglish code-switching database. In total there are 71 speakers.
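The figures in Table 3 can be cross-checked, and the regional shares derived, with a few lines of arithmetic. In this sketch the two count columns are deliberately left unlabeled (`count_a`, `count_b`), since the table as given does not state which gender each column refers to; the variable and function names are illustrative only.

```python
# Rows of Table 3: area -> (count_a, count_b, row_total).
TABLE3 = {
    "East": (14, 15, 29),
    "West": (5, 2, 7),
    "North": (16, 6, 22),
    "South": (9, 4, 13),
}
TOTALS = (44, 27, 71)


def check_consistency(table, totals):
    """Verify that each row adds up and that the columns match the Total row."""
    rows_ok = all(a + b == total for a, b, total in table.values())
    column_sums = tuple(sum(row[i] for row in table.values()) for i in range(3))
    return rows_ok and column_sums == totals


def regional_shares(table, grand_total):
    """Percentage of speakers from each area, rounded to one decimal place."""
    return {area: round(100.0 * row[2] / grand_total, 1) for area, row in table.items()}


if __name__ == "__main__":
    print(check_consistency(TABLE3, TOTALS))
    print(regional_shares(TABLE3, TOTALS[2]))
```

The shares show that the East and North areas together account for roughly seven in ten speakers, which is worth keeping in mind when interpreting accent coverage.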

5 Experimental Evaluation and Discussion

The Hinglish code-switching database has been validated by developing an ASR system. For this purpose, the recorded utterances are partitioned into training and testing sets. The GMM- and DNN-based acoustic models are then trained on the training set using the Kaldi toolkit povey2011kaldi . An n-gram language model (LM) is trained over the text data obtained from Sub-section 3.1 after excluding those sentences that are used in testing; only the remaining sentences are used for training the LM. For developing the n-gram LM, we have employed the IRSTLM toolkit federico2008irstlm . The evaluation results in terms of percentage word error rate (%WER) are given in Table 4. The DNN-based acoustic model with the n-gram LM yielded the best score, a WER of 25.40%, when compared to the other models.
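For readers unfamiliar with the metric, the following sketch shows how %WER is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between reference and hypothesis, summed over the test set and divided by the total number of reference words. It is a generic illustration and not the scoring script used for the reported results; the function names are chosen for this example.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Percentage WER over (reference, hypothesis) sentence pairs."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    n_ref_words = sum(len(ref.split()) for ref, _ in pairs)
    if n_ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return 100.0 * errors / n_ref_words


if __name__ == "__main__":
    pairs = [("मैं कल office जाऊंगा", "मैं office जाऊंगा")]
    print(f"{corpus_wer(pairs):.2f}")
```

Because Hindi and English words are scored identically, a single mis-recognized embedded English word costs as much as any other word error.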

5.1 Parameter tuning

The context-dependent GMM acoustic models are trained by tuning the number of senones, while the number of Gaussian mixtures per senone is kept the same in all cases. The DNN-based acoustic models are trained with tanh as the non-linearity function in each of the hidden layers, using mini-batch training over a fixed number of epochs.

Model Features %WER
Mono MFCC 53.51
Tri1 MFCC 33.52
Tri2 MFCC + LDA 32.73
Tri3 MFCC + LDA + SAT 27.20
DNN MFCC + LDA + SAT 25.40
Table 4: Evaluation of the Hinglish code-switching speech corpus in the context of the ASR task. The performance results in terms of percentage word error rate (%WER) are reported.
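To put the numbers in Table 4 in perspective, the relative WER reduction between two systems can be computed as (baseline − improved) / baseline. The short sketch below applies this to the reported figures; the helper name `relative_reduction` is introduced here for illustration.

```python
# %WER values as reported in Table 4.
WER = {"Mono": 53.51, "Tri1": 33.52, "Tri2": 32.73, "Tri3": 27.20, "DNN": 25.40}


def relative_reduction(baseline, improved):
    """Relative WER reduction, in percent, of `improved` over `baseline`."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    systems = list(WER)
    # Step-by-step gains along the sequence of models in the table.
    for previous, current in zip(systems, systems[1:]):
        gain = relative_reduction(WER[previous], WER[current])
        print(f"{previous} -> {current}: {gain:.2f}% relative reduction")
    print(f"Mono -> DNN: {relative_reduction(WER['Mono'], WER['DNN']):.2f}%")
```

Read this way, the largest single gain comes from moving from monophone to context-dependent triphone models, while the DNN adds a further relative reduction of about 6.6% over the Tri3 system.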

6 Conclusion

In this work, the procedure followed to develop a Hinglish code-switching speech database has been presented. It contains utterances spoken by 71 speakers from different parts of India. The database has been validated by developing an ASR system. The collection of the database is still in progress.

7 Acknowledgment

The authors gratefully acknowledge the financial assistance received towards data collection from an ongoing project, grant no. 11(18)/2012-HCC(TDIL), from the Ministry of Electronics and Information Technology, Govt. of India.


  • (1) John J Gumperz, Discourse Strategies, Cambridge University Press, 1982.
  • (2) Chad Nilep, “‘Code switching’ in sociocultural linguistics,” Colorado Research in Linguistics, vol. 19(1), pp. 1–22, 2006.
  • (3) Dau Cheng Lyu, Ren Yuan Lyu, Yuang Chin Chiang, and Chun Nan Hsu, “Speech recognition on code-switching among the Chinese dialects,” in Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2006, vol. 1.
  • (4) Kiran Bhuvanagirir and Sunil Kumar Kopparapu, “Mixed language speech recognition without explicit identification of language,” American Journal of Signal Processing, vol. 2, no. 5, pp. 92–97, 2012.
  • (5) Basem HA Ahmed and Tien-Ping Tan, “Automatic speech recognition of code switching speech using 1-best rescoring,” in Proc. of the International Conference on Asian Language Processing (IALP). IEEE, 2012, pp. 137–140.
  • (6) Dau Cheng Lyu and Ren Yuan Lyu, “Language identification on code-switching utterances using multiple cues,” in Proc. of the Interspeech, an Annual Conference of International Speech Communication Association, 2008.
  • (7) Houwei Cao, PC Ching, Tan Lee, and Yu Ting Yeung, “Semantics-based language modeling for Cantonese-English code-mixing speech recognition,” in Proc. of the 7th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2010, pp. 246–250.
  • (8) Ching Feng Yeh, Chao Yu Huang, Liang Che Sun, Che Liang, and Lin Shan Lee, “An integrated framework for transcribing Mandarin-English code-mixed lectures with improved acoustic and language modeling,” in Proc. of the 7th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2010, pp. 214–219.
  • (9) Injy Hamed, Mohamed Elmahdy, and Slim Abdennadher, “Building a First Language Model for Code-switch Arabic-English,” Procedia Computer Science, vol. 117, pp. 208–216, 2017.
  • (10) François Grosjean, Life with Two Languages: An Introduction to Bilingualism, Harvard University Press, 1982.
  • (11) Carol Myers-Scotton, Social motivations for code-switching: Evidence from Africa, Clarendon Press, 1993.
  • (12) Lalita Malik, Socio-linguistics: A study of code-switching, Anmol Publications PVT. LTD., 1994.
  • (13) Lesley Milroy and Pieter Muysken, One speaker, two languages: Cross-disciplinary perspectives on code-switching, Cambridge University Press, 1995.
  • (14) Anik Dey and Pascale Fung, “A Hindi-English Code-Switching Corpus,” in Proc. of the Language Resources and Evaluation Conference (LREC), 2014, pp. 2410–2413.
  • (15) Hsi-Yao Su, “Code-switching between Mandarin and Taiwanese in three telephone conversations: The negotiation of interpersonal relationships among bilingual speakers in Taiwan,” in Proc. of the Symposium about Language and Society, 2001.
  • (16) Carol Myers-Scotton, “Codeswitching with English: types of switching, types of communities,” World Englishes, vol. 8, no. 3, pp. 333–346, 1989.
  • (17) Sunita Malhotra, “Hindi-English, Code Switching and Language Choice in Urban, Uppermiddle-class Indian Families,” Kansas Working Papers in Linguistics, 1980.
  • (18) Ashok Kumar, “Certain aspects of the form and functions of Hindi-English code-switching,” Anthropological Linguistics, pp. 195–205, 1986.
  • (19) Smita Sinha, “Code Switching and Code Mixing Among Oriya Trilingual Children - A Study,” Academic Journal on Language in India, vol. 9(4), pp. 274, 2009.
  • (20) James E Flege, “Second-language speech learning: Theory, findings, and problems,” Speech perception and linguistic experience, 1995.
  • (21) Kalika Bali, Jatin Sharma, Monojit Choudhury, and Yogarshi Vyas, “I am borrowing ya mixing? An Analysis of English-Hindi Code Mixing in Facebook,” in Proc. of the First Workshop on Computational Approaches to Code Switching, 2014, pp. 116–126.
  • (22) Amitava Das and Björn Gambäck, “Code-mixing in social media text: the last language identification frontier?,” in Proc. of the Traitement Automatique des Langues (TAL), Special Issue on Social Networks and NLP, vol. 54(3), 2015.
  • (23) Chan Joyce Y. C., P. C. Ching, and Tan Lee, “Development of a Cantonese-English code-mixing speech corpus,” in Proc. of the 9th European Conference on Speech Communication and Technology, 2005.
  • (24) Dau-Cheng Lyu, Ren-Yuan Lyu, Yuang-chin Chiang, and Chun-Nan Hsu, “Speech recognition on code-switching among the Chinese dialects,” in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2006, vol. 1.
  • (25) Juan Carlos Franco and Thamar Solorio, “Baby-steps towards building a Spanglish language model,” in Proc. of International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 2007, pp. 75–84.
  • (26) Dau-Cheng Lyu, Tien-Ping Tan, Eng Siong Chng, and Haizhou Li, “SEAME: a Mandarin-English code-switching speech corpus in south-east asia,” in Proc. of Interspeech, an Annual Conference of the International Speech Communication Association, 2010.
  • (27) Ngoc Thang Vu, Dau-Cheng Lyu, Jochen Weiner, Dominic Telaar, Tim Schlippe, Fabian Blaicher, Eng-Siong Chng, Tanja Schultz, and Haizhou Li, “A first speech recognition system for Mandarin-English code-switch conversational speech,” in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 4889–4892.
  • (28) Han-Ping Shen, Chung-Hsien Wu, Yan-Ting Yang, and Chun-Shan Hsu, “Cecos: A Chinese-English code-switching speech database,” in Proc. of International Conference on Speech Database and Assessments (Oriental COCOSDA). IEEE, 2011, pp. 120–123.
  • (29) Thipe I Modipa, Marelie H Davel, and Febe De Wet, “Implications of Sepedi/English code switching for ASR systems,” 2013.
  • (30) Emre Yilmaz, Maaike Andringa, Sigrid Kingma, Jelske Dijkstra, Frits Van der Kuip, Hans Van de Velde, Frederik Kampstra, Jouke Algra, H Heuvel, and David Van Leeuwen, “A longitudinal bilingual Frisian-Dutch radio broadcast database designed for code-switching research,” 2016.
  • (31) David Imseng, Hervé Bourlard, Holger Caesar, Philip N Garner, Gwénolé Lecorvé, and Alexandre Nanchen, “MediaParl: Bilingual mixed language accented speech database,” in Proc. of Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 263–268.
  • (32) Djegdjiga Amazouz, Martine Adda-Decker, and Lori Lamel, “The French-Algerian Code-Switching Triggered audio corpus (FACST),” in Proc. of Language Resources and Evaluation Conference (LREC), 2018.
  • (33) Ewald van der Westhuizen and Thomas Niesler, “A first South African corpus of multilingual code-switched soap opera speech,” in Proc. of Language Resources and Evaluation Conference (LREC), 2018.
  • (34) Injy Hamed, Mohamed Elmahdy, and Slim Abdennadher, “Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus,” in Proc. of Language Resources and Evaluation Conference (LREC), 2018.
  • (35) Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in Workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
  • (36) Marcello Federico, Nicola Bertoldi, and Mauro Cettolo, “IRSTLM: an open source toolkit for handling large scale language models,” in Proc. of the 9th Annual Conference of the International Speech Communication Association (Interspeech), 2008.