
Improving Speech Recognition for Indic Languages using Language Model

by   Ankur Dhuriya, et al.

We study the effect of applying a language model (LM) on the output of Automatic Speech Recognition (ASR) systems for Indic languages. We fine-tune wav2vec 2.0 models for 18 Indic languages and adjust the results with language models trained on text derived from a variety of sources. Our findings demonstrate that the average Character Error Rate (CER) decreases by over 28 % and the average Word Error Rate (WER) decreases by about 36 % after decoding with LM. We show that a large LM may not provide a substantial improvement as compared to a diverse one. We also demonstrate that high quality transcriptions can be obtained on domain-specific data without retraining the ASR model and show results on biomedical domain.



1 Introduction

In this study, we investigate additional methods for improving the output of ASR models. We are specifically interested in how to improve the language model (LM) in order to improve ASR accuracy. Rare words remain a challenge in building high-quality speech recognition systems: words based on names, proper nouns, or localities, often called 'tail' words, are crucial to the meaning of the decoded transcript. They are difficult to handle correctly because they appear infrequently in the audio-text pairs that make up an ASR system's training set.

In traditional ASR systems, separate acoustic and language models are run one after the other. The separate language model allows part of the system to be trained on text-only data, which is often far more plentiful than audio-text pairs and can contain many occurrences of terms that are uncommon in the acoustic data. Because the language model is independent of the acoustic model, its dataset or training approach can be tailored to specific domains, making room for tail words in the LM training text [sak2013language] [huang2010hierarchical].

We demonstrate in our domain-specific transcription tasks that a language model trained on a corpus containing rare phrases absent from the speech data can help a speech recognition system accurately transcribe words that the ASR model never saw during training.

2 Dataset

We believe that expanding and diversifying the corpus for the language model will lead to a more generalized ASR system. We build the language model using IndicCorp.

2.1 IndicCorp

IndicCorp is a large monolingual corpus with more than a billion tokens covering India's major languages. It was created over several months by locating and scraping thousands of web sources, mainly news, periodicals, and books. One of the major advantages of IndicCorp is that it consists of contemporary news text on a variety of topics. We followed the same strategy for all languages, although we believe that low-resource languages without print media or a standard written form require another method of data collection.

2.2 Data Preprocessing

For preprocessing the text, we use the Indic NLP Library [kunchukuttan2020indicnlp], a Python-based library for common text processing and Natural Language Processing for Indic languages.

  1. We normalize the text using the IndicNLP normalizer, so that the whole corpus is in a consistent normalized form.

  2. After normalizing, we split the text into sentences. IndicCorp contains news articles, each article spans multiple lines, and we need the corpus split at line level.

  3. We remove sentences that contain foreign characters, that is, characters not present in the ASR dictionary for the language. This is required because the LM training text must use the same character set as the ASR model training data.

  4. We also remove duplicate lines from the corpus.
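The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Unicode NFC normalization stands in for the IndicNLP normalizer, articles are split on newlines rather than with a language-aware sentence splitter, and the function name `preprocess` and its parameters are hypothetical.

```python
import unicodedata


def preprocess(articles, allowed_chars):
    """Return cleaned, de-duplicated lines from multi-line article strings.

    allowed_chars is the set of characters in the ASR dictionary
    (including the space character).
    """
    seen = set()
    cleaned = []
    for article in articles:
        # Step 2 (simplified): one candidate sentence per line of the article.
        for line in article.splitlines():
            # Step 1 (stand-in): NFC normalization and whitespace collapsing.
            text = " ".join(unicodedata.normalize("NFC", line).split())
            if not text:
                continue
            # Step 3: drop the line if any character is outside the ASR set.
            if any(ch not in allowed_chars for ch in text):
                continue
            # Step 4: keep only the first occurrence of each line.
            if text in seen:
                continue
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

In a real pipeline the normalization and sentence splitting would come from the Indic NLP Library; the filtering and de-duplication logic stays the same.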

2.3 Dataset Statistics

Table 1 gives raw statistics of sentences and tokens for the languages present in IndicCorp. After processing the IndicCorp data with the steps described in Section 2.2, we obtain a clean version of the corpus.

Lang  Raw sentences  Raw tokens  Processed sentences  Processed tokens
bn M M M M
en M B M M
gu M M M M
hi M B M B
kn M M M M
ml M M M M
mr M M M M
or M M M M
pa M M M M
ta M M M M
te M M M M
Table 1: Data description of IndicCorp

To train the language model, we combine this text-only data with the transcripts of the audio-text pairs used in acoustic model training. Although the transcript data is significantly smaller than the text-only data, it is still very useful. Special care was taken to exclude data from the test and validation sets from the LM training text.
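One conservative way to assemble the LM training text while keeping held-out data out of it is sketched below. It assumes that exact-match filtering of held-out transcripts (after whitespace normalization) is sufficient, which the paper does not specify; the function and parameter names are hypothetical.

```python
def build_lm_text(corpus_lines, train_transcripts, heldout_transcripts):
    """Merge text-only lines with training transcripts, skipping held-out text."""
    # Normalize whitespace so trivially different copies still match.
    heldout = {" ".join(t.split()) for t in heldout_transcripts}
    seen = set()
    merged = []
    for line in list(corpus_lines) + list(train_transcripts):
        key = " ".join(line.split())
        # Skip empty lines, held-out sentences, and duplicates.
        if not key or key in heldout or key in seen:
            continue
        seen.add(key)
        merged.append(key)
    return merged
```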

2.4 PubMed

We use abstracts from PubMed articles, which contain M sentences, to develop a domain-specific language model for the biomedical domain. The PubMed data was processed in the same way as IndicCorp and contains K sentences and M tokens after processing.

2.5 Evaluation Dataset: MUCS 2021

We study how different factors in LM construction affect the output of our trained ASR model. We report results on the dev and eval sets of the publicly available MUCS dataset. Table 2 describes the MUCS data.

Lang Split Samples Unique Transcripts
hi train
gu train
mr train
or train
ta train
te train
Table 2: Description of the MUCS dataset. The data contains many duplicate sentences because multiple speakers read the same sentence.

3 Methodology

3.1 ASR Creation

In our previous work [clsril] we investigated the benefit of self-supervised learning and found that the best results were achieved by fine-tuning a multilingual pretrained model rather than a monolingual one. Our model is based on the wav2vec 2.0 architecture [baevski2020wav2vec]. Wav2vec 2.0 is a self-supervised model that learns speech representations from raw audio. It takes a stream of audio as input and outputs per-frame scores over characters; the model is fine-tuned with a CTC loss. In normal decoding, the argmax of the logits is taken at each frame, and repeated characters and CTC blank symbols are then collapsed to produce the final output.
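The argmax decoding described above can be illustrated with a minimal sketch of standard greedy CTC decoding. It assumes the logits are given as plain Python lists (one list of scores per frame) and that index 0 is the CTC blank; both are illustrative choices, not details from the paper.

```python
def ctc_greedy_decode(logits, alphabet, blank=0):
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks.

    logits: list of per-frame score lists.
    alphabet: alphabet[i] is the character for class index i.
    """
    # Pick the highest-scoring class at every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    chars = []
    previous = None
    for index in best:
        # Emit a character only when the class changes and is not the blank.
        if index != previous and index != blank:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)
```

Because each frame is decided independently, this decoder has no notion of which words are plausible, which is the gap the language model is meant to fill.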

To include language context, we can plug in either a transformer-based LM or a statistical language model. For this study we focus only on statistical language models built with KenLM [heafield2011kenlm], which we chose for its speed, efficiency, and simplicity.

3.2 Language Model

A language model assigns a probability to a sentence or other sequence of words and predicts a word from the preceding words. Trained on some text, it defines what the speaker may say: the vocabulary and the probability over possible word sequences.

In our experiments we use n-gram models. The idea behind an n-gram language model is to truncate the word history to the last n - 1 words.
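To make the truncated-history idea concrete, here is a toy n-gram model estimated from raw counts. It is a sketch only: it uses unsmoothed maximum-likelihood estimates, whereas KenLM estimates models with modified Kneser-Ney smoothing, and the function names are hypothetical.

```python
from collections import Counter


def train_ngram(sentences, n):
    """Count n-grams and their (n-1)-word contexts over whitespace tokens."""
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # Pad with start symbols so the first words have a full context.
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def ngram_prob(word, history, ngram_counts, context_counts, n):
    """P(word | history), using only the last n-1 words of the history."""
    if n > 1:
        padded = ["<s>"] * (n - 1) + list(history)
        context = tuple(padded[-(n - 1):])
    else:
        context = ()
    total = context_counts[context]
    return ngram_counts[context + (word,)] / total if total else 0.0
```

With a bigram model (n = 2), a history of any length is reduced to its final word before the lookup, so `["a", "the", "cat"]` and `["cat"]` give the same prediction.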

Efficient language modeling toolkits such as KenLM [heafield2011kenlm] make it possible to build web-scale language models. The code used to create the LMs is publicly available.
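As a rough illustration of how such a model is typically estimated, the sketch below assembles a command line for KenLM's `lmplz` estimator using its `-o`, `--text`, `--arpa`, `--prune`, and `--limit_vocab_file` options. The helper name, file paths, n-gram order, and pruning thresholds are placeholders chosen for the example, not the settings used in this work.

```python
def lmplz_command(text_path, arpa_path, order, prune=None, vocab_path=None):
    """Build the argument list for a KenLM lmplz run (does not execute it)."""
    command = ["lmplz", "-o", str(order), "--text", text_path, "--arpa", arpa_path]
    if prune:
        # One count threshold per n-gram order, e.g. (0, 0, 1).
        command += ["--prune"] + [str(threshold) for threshold in prune]
    if vocab_path:
        # Restrict the model to the words listed in this file.
        command += ["--limit_vocab_file", vocab_path]
    return command


# Placeholder values for illustration only.
example = lmplz_command("corpus.txt", "lm.arpa", order=3, prune=(0, 0, 1),
                        vocab_path="vocab.txt")
```

The resulting list can be passed to `subprocess.run(example, check=True)` on a machine where KenLM is installed.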

3.3 Decoding

Language model decoding is common in ASR systems. It re-weights output probabilities to account for how likely words are to occur in the language. A KenLM language model gives the probabilities of tokens occurring in a sequence and thus represents corpus-level statistics of the language. Language model decoding helps prevent unlikely sequences of words from being selected over more likely ones.

The decoding step determines the most probable sequence of characters given the output probabilities from the acoustic model and the language model. Let $P_A(c \mid x)$ represent the probability of a sequence of characters $c = \{c_1, c_2, c_3, \ldots\}$ obtained from the acoustic model for a given input audio $x$, and let $P_{LM}(c)$ represent the probability of the same sequence obtained from the language model. The objective of the decoding step is to obtain the sequence of characters that maximizes the confidence score $Q(c)$:

$$Q(c) = \log P_A(c \mid x) + \alpha \log P_{LM}(c) + \beta \, \mathrm{wc}(c)$$

Here, $\alpha$ and $\beta$ are weights that balance the influence of the acoustic model, the language model, and the word count $\mathrm{wc}(c)$ of the utterance. The last term is used because shorter sentences inherently have a higher probability in the n-gram language model. The output sequence of characters that maximizes the confidence score $Q(c)$ is determined using a beam search algorithm [hannun2014deep]. Optimal values of $\alpha$, $\beta$, and the beam width are determined using a parametric analysis such that the average WER over a validation set is minimized. Figure 1 depicts our approach.
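The confidence score can be sketched as a small scoring function: the acoustic log-probability, plus a weighted language model log-probability, plus a weighted word count. The weight names `alpha` and `beta`, the candidate tuples, and all numeric values below are assumptions made for illustration.

```python
def confidence_score(log_p_am, log_p_lm, word_count, alpha, beta):
    """Combine acoustic score, LM score, and word count into one score."""
    return log_p_am + alpha * log_p_lm + beta * word_count


def pick_best(candidates, alpha, beta):
    """candidates: list of (text, log_p_am, log_p_lm); return the best text."""
    def score(candidate):
        text, log_p_am, log_p_lm = candidate
        return confidence_score(log_p_am, log_p_lm, len(text.split()), alpha, beta)
    return max(candidates, key=score)[0]


# Made-up scores: the second hypothesis is acoustically slightly better,
# but the language model finds it much less plausible.
hypotheses = [("the cat sat", -4.0, -6.0), ("the cats at", -3.5, -12.0)]
```

With `alpha = 0` the language model is ignored and the acoustically stronger hypothesis wins; with a positive `alpha` the more fluent hypothesis can overtake it, which is the re-ranking effect the decoder relies on.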

Figure 1: Improving ASR output using KenLM language model.

4 Results

We conduct multiple individual tests to determine the impact of various factors such as n-grams, corpus token size, pruning, and vocab width. We use CER/WER as the performance metric in our tests.
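For reference, CER and WER are both edit-distance ratios, computed over characters and words respectively. The sketch below shows the standard definition; it assumes a non-empty reference and counts spaces as characters in CER, which may differ from the exact scoring script used for the reported numbers.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, 1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, 1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_item != hyp_item),   # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```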

For the experiments reported in Tables 3, 4, 5, 6, and 7, inference is done on the MUCS Hindi dev and eval sets. The ASR model is fine-tuned from the CLSRIL pretrained model [clsril] on 4200 hours of Hindi data.

4.1 Should we keep increasing the size of text corpus?

To compare the effect of LM training corpus size, we created three corpora: corpus1 contains the training text only, while corpus2 and corpus3 have M and M lines selected randomly from the IndicCorp Hindi corpus. Using these corpora we build -gram, pruned (||) LMs, decoded with beam width .

Table 3 shows the results of these experiments. Compared with no LM, using corpus1 increases CER/WER, because the evaluation set contains text that is very different from the training set text. corpus2 shows a % decrease in WER and a % decrease in CER on the dev set, and a % reduction in WER and % in CER on the eval set, which suggests that increasing the text corpus size is the right direction. corpus3 improves the metrics further, showing that an LM built from a large, diverse corpus helps improve the ASR output. One important caveat is that by increasing the text corpus we also increase the chance of word overlap between the LM training text and our test set.

Corpus Size Dev set Eval set
corpus1 643 K
corpus2 23.04 M
corpus3 230.8 M 9.03 17.57 9.5 15.04
Table 3: Effect of increase in text corpus for training LM

4.2 Does a higher n-gram model work better?

An n-gram model truncates the word history to the last n - 1 words. To study the effect of different n-gram orders, we built (, , )-gram models on corpus3 data. All models are pruned (||), with vocab width K and beam width .

Table 4 shows a slight decrease in CER/WER as the n-gram order increases.

n-grams Dev set Eval set
9.03 17.57 9.5 15.04
Table 4: Effects of increase in n-gram

4.3 What is the effect of pruning?

Pruning shrinks a language model by dropping less important n-grams, for example by storing only n-grams with counts greater than some threshold. We build a -gram LM with vocab width K using corpus2, decoded with beam width .

Table 5 shows a slight increase in CER/WER when pruning is introduced. This may be because the test set is not large enough. Pruning does, however, reduce the memory footprint of the LM: the LM size is reduced by 82 % from MB without pruning to MB after pruning.
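Count-based pruning can be illustrated with a toy example that keeps only n-grams seen more often than a threshold. This is a simplified sketch of the idea rather than KenLM's implementation, and the function names are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count all n-grams of whitespace tokens, without sentence padding."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune_ngrams(counts, threshold):
    """Keep only n-grams whose count is strictly greater than the threshold."""
    return {ngram: count for ngram, count in counts.items() if count > threshold}
```

On real corpora most distinct n-grams occur only once, so even a threshold of 1 on the higher orders removes a large share of entries, which is where the size reduction comes from.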

Pruning Dev set Eval set
Yes 9.8
No 9.29 18.25 15.29
Table 5: Effects of pruning

4.4 What is the effect of increasing the vocab width?

Vocab width is the size of a word list fixed in advance by choosing the V most frequent words. First, the unique words in the corpus are collected and sorted by their frequency in the corpus. We then take the top V words to form the vocabulary of the language model. We build a -gram LM, pruned (||), using corpus3, decoded with beam width .

Table 6 shows that CER/WER improves as the vocab width increases.
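Selecting the vocabulary by frequency takes only a few lines. The sketch below assumes whitespace tokenization, and the function name `top_vocab` is hypothetical.

```python
from collections import Counter


def top_vocab(sentences, size):
    """Return the `size` most frequent whitespace-separated words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return [word for word, _ in counts.most_common(size)]
```

The resulting word list can be written to a file, one word per line or whitespace-separated, and supplied to the LM toolkit as the allowed vocabulary.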

Vocab Size Dev set Eval set
50 K 9.03 17.57 9.5 15.04
500 K 8.65 17.07 9.32 14.77
Table 6: Effect of vocab width

4.5 What is the effect of increasing the beam width?

The beam search algorithm [hannun2014deep] used in decoding keeps the top-k candidates among the newly extended tokens at each time step, ranked by probability score (k is called the beam width), and then moves to the next time step. The value of k affects the decoding result.

We decode with the same LM built from corpus3 in Table 3, evaluated on the dev set, for different values of k. The results are shown in Table 7. The metrics improve, but inference time becomes a problem: it grows with beam width because more possibilities are explored. On increasing the beam width from to , inference duration increases 5 times.
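The role of the beam width can be seen in a deliberately simplified beam search. This sketch is not a CTC prefix beam search: it ignores blanks and repeat merging, and it uses a toy `bonus` function as a stand-in for a language model score. All names and numbers are illustrative assumptions.

```python
import math


def beam_search(step_log_probs, beam_width, bonus=None):
    """Keep the `beam_width` best partial sequences at every time step.

    step_log_probs: list of dicts mapping token -> log-probability.
    bonus: optional function(sequence) -> extra log-score (toy LM stand-in).
    Returns the best (sequence, score) pair.
    """
    beams = [((), 0.0)]
    for distribution in step_log_probs:
        candidates = []
        for sequence, score in beams:
            for token, log_prob in distribution.items():
                extended = sequence + (token,)
                extra = bonus(extended) if bonus else 0.0
                candidates.append((extended, score + log_prob + extra))
        # Sort by score and keep only the top `beam_width` hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


steps = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"x": math.log(0.5), "y": math.log(0.5)},
]


def toy_bonus(sequence):
    # Pretend the language model strongly prefers the sequence "b y".
    return 2.0 if sequence == ("b", "y") else 0.0
```

With a beam width of 1 the search commits to `a` after the first step and can never reach `("b", "y")`; with a beam width of 2 it keeps `b` alive and finds the higher-scoring sequence. The number of candidates scored per step grows with the beam width, which is why inference slows down.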

Beam Width CER WER
8.42 16.78
Table 7: Effect of beam width

4.6 Results on other languages

Table 8 shows an improvement in CER/WER after using an LM. All the LMs are built by combining the IndicCorp and MUCS training text corpora, as explained in Section 2. Based on our experiments on LM parameters, we settled on one set of parameters and used it to build LMs for all six languages present in MUCS.

The language models in Table 3 are all -gram, pruned (||), with vocab width K, and evaluated on the MUCS dev and eval sets using beam size .

Lang Size no LM with LM
hi B
gu M
mr M
or M
ta M
te M
Table 8: Effect of LM on CER/WER

Table 8 shows a 28.58 % average decrease in CER and a 36.32 % average decrease in WER. The LMs for the different languages are available from our repository.
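An average decrease of this kind can be computed as the mean of per-language relative reductions. The sketch below assumes that definition, which the text does not state explicitly, and the function names are hypothetical.

```python
def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


def mean_relative_reduction(pairs):
    """Average the relative reductions over (before, after) pairs."""
    return sum(relative_reduction(b, a) for b, a in pairs) / len(pairs)
```

Averaging per-language percentages weights every language equally, regardless of how many test utterances each one has.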

4.7 Effect of Domain specific LM

Table 9 shows the results for an LM built from the corpus generated from PubMed articles. Here, the evaluation set contains only samples from the biomedical domain. We observe a drastic decrease in error rate when using the language model trained on domain-specific text.

This is one of our key findings: it shows how one can create a domain-specific ASR system without retraining an ASR model that may have been trained on a general domain. WER and CER decrease by 72 % and 63 % respectively in comparison to no LM. The LM is -gram, pruned (||), with vocab width K and beam width .

Eng LM
Bio-LM 4 13
Table 9: Impact of domain specific language model

In the following examples, the first sentence is obtained without an LM and the second using Bio-LM. Incorrectly recognized words are marked in bold, and words corrected by Bio-LM are marked in bold italics. Examples 1, 2, and 3 show that Bio-LM corrects the ASR output for biomedical domain words. These words are not usually found in speech corpora, so the acoustic model predicts the most probable similarly pronounced word instead. Using a language model that learns from corpus statistics, we re-adjust the sequence of words to get the most probable sequence for the domain.







5 Conclusion & Future work

Increasing the corpus size improves the error rate, as shown in Table 3: there is a large improvement from corpus1 to corpus2, but very little from corpus2 to corpus3. We have also demonstrated that building the language model from contemporary news text is a promising and effective way to improve speech recognition in Indian languages, as shown in Table 8 for six languages. With Bio-LM we have demonstrated concretely the importance of using quality data for the appropriate domain over simply using as much data as possible. We are approaching a level where ASR output starts to be sensible and useful for various purposes, the primary one being to make transcription easier.

When deciding the final parameters for creating an LM, several factors must be considered, such as inference speed, LM size, and vocab size. Further work in this direction may include larger language models for low-resource Indian languages such as Bhojpuri, Rajasthani, Maithili, and others, to examine the importance of corpus size for the effectiveness of language models. In addition, text containing proper nouns could be beneficial for rare words (names, locations) specific to Indian languages.

Finally, the Bio-LM experiment supports our argument that employing ASR for one specific purpose does not require a big corpus; carefully selected domain data can be quite useful.

6 Acknowledgment

All authors gratefully acknowledge Ekstep Foundation for supporting this project financially and providing infrastructure. A special thanks to Dr. Vivek Raghavan for constant support, guidance and fruitful discussions. We also thank Ankit Katiyar, Heera Ballabh, Niresh Kumar R, Sreejith V, Soujyo Sen, Amulya Ahuja and Nikita Tiwari for helping out when needed and infrastructure support for data processing and model training.