Speech recognition has made remarkable progress in the past few years, especially after the advent of deep learning. End-to-end networks have become an attractive choice for multilingual and code-switching ASR, since they combine the acoustic model, pronunciation model, and lexicon into a single network. This is of particular importance for countries such as India, which has 22 official languages and over 1,500 additional minor languages and dialects.
1.1 Multilingual ASR
The use of multilingual ASR models that simultaneously transcribe many languages has attracted considerable research interest [5495646, SCHULTZ200131, 6639348]. Multilingual models not only simplify pipelines in a production setting, but training on a small set of similar languages can also significantly improve recognition. A straightforward approach to handling many languages at once is to identify the language with a language identification system and then choose the appropriate monolingual model for decoding [kumar2005multilingual].
To train a multilingual ASR model, we can take a union over all the languages' characters and jointly train a model on all the languages. Our approach is based on wav2vec 2.0 [baevski2020wav2vec]. We build a multilingual ASR system for six Indic languages: Hindi, Marathi, Odia, Tamil, Telugu, and Gujarati. We train both a multilingual ASR model for all six languages combined and monolingual models for each language. To build a language identification (LID) component, we analyze the transcripts of the multilingual ASR model and find that placing a LID step before the monolingual models gives an improvement of % for the final system.
1.2 Code Switching ASR
Bilingual speakers tend to have the ability to switch linguistically according to situational changes. When a bilingual speaker articulates two languages successively in the same discourse, there is interference between the two modes of speaking. Languages articulated in this manner differ significantly from the same languages as spoken in their separate social systems. This kind of interference at the discourse level is known as code-switching [dey2014hindi].
Code-switching is more common in non-literary settings, so code-switching corpora are very limited [nj2020investigation]. Several challenges arise in this area, including the lack of training data for language modeling [adel2015syntactic], co-articulation effects [vu2012first], and the need for expert linguistic knowledge. We address the code-switching problem for two Indic language pairs: Hindi-English and Bengali-English. We use wav2vec 2.0 for modelling this problem as well.
2 MUCS Dataset
This dataset (https://navana-tech.github.io/MUCS2021/data.html) was part of the MUCS 2021 [Diwan_2021] competition held at Interspeech 2021. In addition to the data, the competition organizers provided baseline models. For the multilingual task, the training set totals 403 hours, with sentence-wise transcriptions; every transcript contains text in a single script. For the code-switching task, the training set totals 136 hours, and every transcript contains text in a pair of scripts.
2.1 Data Preprocessing
2.1.1 Audio Data Preprocessing
For Hindi, Marathi and Odia, we upsample the audio from 8 kHz to 16 kHz. Although this does not add any information, it keeps a standard sample rate across languages, and the ASR algorithm we use expects 16 kHz audio. We also perform loudness normalization on all audio in the training set.
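As a rough illustration, the two audio preprocessing steps might be sketched as follows. This is a minimal numpy sketch, not the actual pipeline: a real system would use a proper resampler (e.g. sox or librosa), and loudness normalization is simplified here to peak normalization.

```python
import numpy as np

def upsample_8k_to_16k(audio: np.ndarray) -> np.ndarray:
    """Linearly interpolate an 8 kHz signal onto a 16 kHz grid.

    Adds no new information; it only standardizes the sample rate.
    """
    n = len(audio)
    old_t = np.arange(n) / 8000.0
    new_t = np.arange(2 * n) / 16000.0
    return np.interp(new_t, old_t, audio)

def peak_normalize(audio: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
    """Scale the waveform so its absolute peak equals target_peak
    (a stand-in for proper loudness normalization)."""
    peak = np.max(np.abs(audio))
    if peak == 0:
        return audio
    return audio * (target_peak / peak)

# Dummy 1-second tone at 8 kHz, upsampled and normalized.
audio_8k = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)
audio_16k = peak_normalize(upsample_8k_to_16k(audio_8k))
```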
2.1.2 Text Data Preprocessing
We clean all the transcripts of any punctuation and special characters that have no spoken form. We also remove symbols that occur rarely in the training data: any symbol with a frequency below 10 was removed from the training transcripts, and hence from the ASR vocabulary as well.
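The frequency-based cleaning step above can be sketched as follows (a minimal sketch; the threshold of 10 matches the text, while the function and variable names are our own):

```python
from collections import Counter

def clean_transcripts(transcripts, min_freq=10):
    """Drop characters occurring fewer than min_freq times in the corpus;
    the surviving characters form the ASR vocabulary."""
    counts = Counter(ch for line in transcripts for ch in line)
    vocab = {ch for ch, c in counts.items() if c >= min_freq}
    cleaned = ["".join(ch for ch in line if ch in vocab) for line in transcripts]
    return cleaned, vocab

# Toy corpus: "a" and "b" each appear 10 times, "z" only once, so "z" is dropped.
cleaned, vocab = clean_transcripts(["ab"] * 10 + ["z"])
```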
2.1.3 Language Merger
After analyzing the vocabulary of unique characters in the different languages, one interesting finding was that Hindi and Marathi differ by just one character. We therefore merge the two languages during the multilingual/monolingual trainings, and denote this merger by HindMara in the sections below. Table 2 shows the final vocabulary sizes.
We use wav2vec 2.0 [baevski2020wav2vec] for modelling both our setups. We built our own experimentation platform (https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation) on top of the fairseq toolkit for the entire modelling process. wav2vec 2.0 has two stages: pretraining and finetuning. In pretraining, the speech input is masked in the latent space, and a contrastive task between transformer predictions and quantized latent speech representations [baevski2020wav2vec] is solved to learn contextualized information. This enables learning powerful representations from speech audio alone. Pretraining is then followed by finetuning on the labelled data. The general methodology remains the same across experiments in both problems.
We experiment with two pretrained models. One is Hindi-4200, trained on 4200 hours of unlabelled Hindi data [clsril] (https://github.com/Open-Speech-EkStep/vakyansh-models#pretrained-asr-models); it uses the base architecture, which has approximately 95 million parameters. The other is XLSR-53 [xlsr], trained on 56k hours of data across 53 languages; it uses the large architecture, which has approximately 317 million parameters.
Pretrained models are fine-tuned by adding a fully connected layer on top of the context network, with the output layer size equal to the vocabulary of the task. Models are optimized using a CTC loss [Graves06connectionisttemporal]. We finetune both pretrained models described in 3.1 and perform monolingual and multilingual trainings.
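To illustrate how a CTC-trained model's frame-level outputs map to text, here is a minimal greedy CTC decoder (collapse consecutive repeats, then drop blanks). This is illustrative only, not the beam-search decoder used in our pipeline; the blank index and toy vocabulary are assumptions.

```python
BLANK = 0  # CTC blank index (assumed to be 0 in this sketch)

def ctc_greedy_decode(frame_ids, id_to_char):
    """Collapse consecutive repeated ids, then remove blank ids."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != BLANK:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

# Toy vocabulary; real vocabularies come from the task's transcripts.
vocab = {1: "क", 2: "ख", 3: " "}
# Frames क क <blank> क ख ख collapse to "ककख": the blank separates
# the repeated क into two distinct output characters.
decoded = ctc_greedy_decode([1, 1, 0, 1, 2, 2], vocab)  # → "ककख"
```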
3.3 Language Model
The output from the speech recognition model is fed to a statistical KenLM [heafield2011kenlm] based language model, where a beam search is performed to find the appropriate words and correct the spellings of some common words. We use IndicCorp to train most of our 5-gram KenLM language models. The corpus for LM training is cleaned of all characters/words that do not appear in the ASR training vocabulary. The words are sorted by frequency and the top 500,000 are picked. Probabilities are then calculated for 1-grams, 2-grams, and so on up to 5-grams.
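The LM corpus preparation (filtering against the ASR vocabulary, then keeping the 500,000 most frequent words) can be sketched as follows; the function name and toy inputs are our own:

```python
from collections import Counter

def build_lm_wordlist(lines, asr_vocab, top_k=500_000):
    """Keep only words whose characters all appear in the ASR vocabulary,
    then retain the top_k most frequent words for LM training."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            if set(word) <= asr_vocab:
                counts[word] += 1
    return [w for w, _ in counts.most_common(top_k)]

# Toy example: "b$" is dropped because "$" is not in the ASR vocabulary.
words = build_lm_wordlist(["a ab ab b$"], {"a", "b"}, top_k=2)
```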
4 Problem Formulation
4.1 Multilingual ASR
4.1.1 Approach M1: Train one common multilingual model
This is the most straightforward approach: say we have languages L1, ..., Ln, each with character set Ci. We train our common multilingual model on the combined character set C1 ∪ ... ∪ Cn, fine-tuning on the union of all training samples. The model is trained by sharing parameters across all languages. We finetune both the Hindi-4200 and XLSR-53 pretrained models. We also train a KenLM-based 5-gram language model (LM) by combining text from all the languages.
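The character-set union driving the shared output layer can be sketched in a couple of lines (the per-language character sets below are toy placeholders; real ones come from the transcripts):

```python
# Hypothetical per-language character sets (toy placeholders).
vocabs = {
    "hi": {"क", "ख", "ग"},   # Devanagari
    "ta": {"அ", "ஆ"},        # Tamil
    "te": {"అ", "ఆ"},        # Telugu
}

# The multilingual model's output layer covers the union of all character sets.
combined = set().union(*vocabs.values())
```

Because the Indic scripts are disjoint, the union grows additively; this non-overlap is also what makes the character-level language identification in Approach M2 possible.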
4.1.2 Approach M2: Language Identification + Monolingual Models
The key idea here is to use the common multilingual model trained in 4.1.1 as a language identification model. If a single model can transcribe different languages, it must implicitly identify the language before decoding the transcript. If we can extract this language information from the model's output, it can be used in the pipeline to select a monolingual model. We propose the following rule-based approach for language identification.
1. Decode the audio with the common multilingual model and get transcripts.
2. Classify every character in the transcript into a language based on which language's vocabulary it appears in. Since the vocabularies are non-overlapping, we get exact counts for each language present in the transcript. Then select the language that appears most often; this is a simple majority-voting classification.
3. Once the audios are labelled with the language predicted in the previous step, we decode them with the monolingual model trained for that language. The idea behind using a monolingual model here is that it will be more accurate than a multilingual one.
There is one caveat: if the language of an audio is identified incorrectly, we will never be able to get the correct transcript for it.
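The majority-voting language identification step can be sketched as follows (a minimal sketch; the toy vocabularies and function name are our own, and we rely on the non-overlapping scripts noted above):

```python
def identify_language(transcript, vocabs):
    """Majority vote: count characters per language vocabulary, pick the max."""
    counts = {lang: 0 for lang in vocabs}
    for ch in transcript:
        for lang, vocab in vocabs.items():
            if ch in vocab:
                counts[lang] += 1
                break  # scripts are non-overlapping, so the first match is the only one
    return max(counts, key=counts.get)

# Toy non-overlapping vocabularies (Devanagari vs. Tamil).
vocabs = {"hi": set("कखग"), "ta": set("அஆஇ")}
lang = identify_language("ककஅ", vocabs)  # 2 Devanagari chars vs. 1 Tamil → "hi"
```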
4.2 Code Switching
4.2.1 Approach C1: One common model
The main idea behind this approach is to include more data for English, as it is the common language between the pairs. The final model can output predictions in either Hindi-English or Bengali-English. We finetune both Hindi-4200 and XLSR-53, followed by a 5-gram combined language model.
4.2.2 Approach C2: Individual models on different pairs
The idea behind this experiment was to find out whether a model trained on a single pair would outperform a model trained on a combination of pairs sharing one common language.
5.1 Multilingual ASR
5.1.1 Approach M1
Results for Approach M1 with the different pretrained models are shown in Table 3. To evaluate the models, we report the Word Error Rate (WER). The takeaway was that the bigger finetuned model based on XLSR-53 performed better on the blind set, so we use XLSR-53 as the common multilingual model for the language identification task. Performance on Marathi appears poor, but this was actually because the Marathi blind set was in a different format.
5.1.2 Approach M2
For Approach M2 to work we need three things: a common multilingual model, a majority-voting language identification classifier, and monolingual models. The common multilingual model, as explained in the previous step, is XLSR-53, and language identification is performed on its output transcripts.
For the monolingual models we train five models, as Hindi and Marathi were merged to form HindMara. We use the Hindi-4200 pretrained model here, as it is better suited to smaller datasets. During this training we also try data augmentation, and we see an improvement in WER for Tamil and HindMara but not for the other languages. Augmentations (volume, pitch, pace, and Gaussian noise) were applied twice to each audio sample to get 3x training data. For the language models, we use IndicCorp to create 5-gram pruned statistical models. The beam width was set to 3000 during decoding to get the best possible output with a lower WER.
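Two of the augmentations (volume scaling and additive Gaussian noise) can be sketched with numpy as below; pitch and pace shifts would need a proper DSP library, and the gain/SNR values here are illustrative, not the ones used in training.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def augment_volume(audio, gain_db):
    """Scale amplitude by a gain expressed in decibels."""
    return audio * (10.0 ** (gain_db / 20.0))

def augment_noise(audio, snr_db):
    """Add white Gaussian noise at a target signal-to-noise ratio."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

audio = np.sin(np.linspace(0, 100, 16000))  # 1 s dummy clip at 16 kHz
# Original plus two augmented copies per clip gives roughly 3x training data.
augmented = [audio, augment_volume(audio, -6.0), augment_noise(audio, 20.0)]
```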
[Table: monolingual model WER, without LM vs. with LM]
The key takeaway is that we can use different configurations for the monolingual models depending on what works well for each language. From the results, we decide not to use an LM with Odia, as it increased the WER. With this final setup we get an average WER of 26.56, an improvement of % over Approach M1. Final results are in Table 5.
5.2 Code Switching ASR
In the code-switching dataset, there were multiple instances of the same English word appearing both in the Latin script and in the native scripts of Hindi and Bengali in the training data. Given this inconsistency in script usage for English words, the ASR predictions of English words could be either in the native script or in the Latin script. To allow both English words and their transliterations in the respective native scripts to be counted as correct in the final word error rate computations, we calculate a transliterated WER (T-WER) metric along with the standard WER metric. T-WER counts an English word in the reference text as correctly predicted if it appears in English or in its transliterated form in the native script.
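The T-WER idea can be sketched as a standard word-level edit distance with a relaxed match predicate. This is a minimal sketch: the transliteration table below is a hypothetical stand-in for a real transliteration lookup, and only the error count (not the full WER normalization) is shown.

```python
def align_errors(ref, hyp, match):
    """Levenshtein distance over word lists with a custom match predicate."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if match(ref[i - 1], hyp[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)]

# Hypothetical transliteration table: English word -> native-script form.
TRANSLIT = {"train": "ट्रेन"}

def twer_match(r, h):
    """A word matches its exact form or its transliterated form."""
    return r == h or TRANSLIT.get(r) == h or TRANSLIT.get(h) == r

ref = "मैं train से गया".split()
hyp = "मैं ट्रेन से गया".split()
plain_errors = align_errors(ref, hyp, lambda a, b: a == b)  # 1 substitution
twer_errors = align_errors(ref, hyp, twer_match)            # 0: transliteration accepted
```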
5.2.1 Approach C1
Results for Approach C1 are in Table 6. Training data was combined from both pairs, and the LM training text corpora were likewise combined to create the final LM. The key takeaway was that the model finetuned on Hindi-4200 again outperformed XLSR-53, due to the presence of Hindi data in pretraining.
5.2.2 Approach C2
Based on the findings from the previous approach, the idea here was to train separately on the Hindi-English and Bengali-English pairs. The main difference between multilingual ASR and code-switched ASR in our setup is the evaluation: for multilingual ASR the language of the blind set is not provided (it is inferred automatically, as the name suggests), whereas for code-switched ASR the language pair is available explicitly.
6 Conclusion and Future Work
For multilingual ASR, our solution beat the baseline provided by the competition organizers, as shown in Table 5, except in two languages, Marathi and Gujarati, where we could not get better results. For Marathi, it was later clarified by the competition organizers that the data collection format was wrong, and it was therefore excluded from the final rankings.
It has usually been the case with multilingual ASR systems that combined training on multiple languages outperforms individual end-to-end systems for each language. Using a model pretrained on a high-resource language together with LID, we show that low-resource languages can benefit greatly from a high-resource language. Recently, code-switching systems have obtained better results when frame-level language identification information was used to condition the final transcription output.
All authors gratefully acknowledge Ekstep Foundation for supporting this project financially and providing infrastructure. A special thanks to Dr. Vivek Raghavan for constant support, guidance and fruitful discussions. We also thank Rishabh Gaur, Ankit Katiyar, Anshul Gautam, Nikita Tiwari, Heera Ballabh, Niresh Kumar R, Sreejith V, Soujyo Sen and Amulya Ahuja for automated data pipelines and infrastructure support for model training and model testing.