1 Introduction
In this study, we investigate methodologies that can enhance the output of ASR models; specifically, we are interested in improving the language model (LM) in order to improve ASR accuracy. Rare words remain a challenge in developing high-quality speech recognition systems: names, proper nouns, and localities, often called 'tail' words, are crucial to the meaning of the decoded transcript, yet they are difficult to handle correctly because they appear infrequently in the audio-text pairs that make up an ASR system's training set.
Traditional ASR systems run separate acoustic and language models one after the other. The separate language model allows part of the system to be trained on text-only data, which is often significantly more plentiful than audio-text pairs and can contain many occurrences of terms that are uncommon in the acoustic data. Because the language model is independent of the rest of the ASR system, its dataset and training approach can be tailored to specific domains, making room for tail words in the LM training text [sak2013language, huang2010hierarchical].
In our domain-specific transcription tasks, we demonstrate that a language model trained on a corpus containing rare phrases that are absent from the speech data can adapt a speech recognition system towards accurately recognizing words that the ASR model never saw during its training.
2 Dataset
We believe that expanding and diversifying the corpus used for the language model will lead to a more generalizable ASR system. We build the language model with the help of IndicCorp (https://indicnlp.ai4bharat.org/corpora).
2.1 IndicCorp
IndicCorp is a large monolingual corpus with billions of tokens covering India's major languages. It was created over several months by locating and scraping thousands of websites, mainly news, periodicals, and books. A major advantage of IndicCorp is that it uses contemporary news text on a variety of topics. We took the same strategy for all languages, though we believe that other low-resource languages without print media or a standard written form will require a different method of data collection.
2.2 Data Preprocessing
For preprocessing the text, we use the Indic NLP Library [kunchukuttan2020indicnlp], a Python-based library for common text processing and Natural Language Processing for Indic languages. The preprocessing steps are listed below; a minimal sketch of the pipeline follows the list.
- We normalize the text using the IndicNLP normalizer, so that all text is in a consistent normalized form.
- After normalizing, we split the text into sentences. IndicCorp contains news articles, each spanning multiple lines, and we require the corpus to be split at the sentence level.
- We remove sentences that contain foreign characters, i.e., characters not present in the ASR dictionary for a language. This is required because the LM training text must use the same character set as the ASR model training text.
- We also remove duplicate lines from the corpus.
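The following is a minimal sketch of this pipeline, assuming the Indic NLP Library is installed; the file name hi_asr_chars.txt and the helper preprocess are illustrative, and the valid character set would in practice come from the ASR model's dictionary.

```python
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize.sentence_tokenize import sentence_split

LANG = "hi"

# Hypothetical file: one ASR-dictionary character per line for this language.
with open("hi_asr_chars.txt", encoding="utf-8") as f:
    valid_chars = set(f.read().split())
valid_chars.add(" ")

normalizer = IndicNormalizerFactory().get_normalizer(LANG)

def preprocess(articles):
    """Normalize, split into sentences, drop foreign characters, deduplicate."""
    seen, clean = set(), []
    for article in articles:
        text = normalizer.normalize(article)           # 1. normalization
        for sent in sentence_split(text, lang=LANG):   # 2. sentence-level split
            sent = sent.strip()
            if not sent or not set(sent) <= valid_chars:
                continue                               # 3. remove foreign characters
            if sent in seen:
                continue                               # 4. remove duplicate lines
            seen.add(sent)
            clean.append(sent)
    return clean
```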
2.3 Dataset Statistics
Table 1 gives raw statistics of sentences and tokens for the languages present in IndicCorp. After processing the IndicCorp data using the steps described in Section 2.2 (Data Preprocessing), we obtain a clean version of the corpus.
Lang | Raw Sentences | Raw Tokens | Processed Sentences | Processed Tokens
---|---|---|---|---
bn | M | M | M | M
en | M | B | M | M
gu | M | M | M | M
hi | M | B | M | B
kn | M | M | M | M
ml | M | M | M | M
mr | M | M | M | M
or | M | M | M | M
pa | M | M | M | M
ta | M | M | M | M
te | M | M | M | M
We combine this text-only data with the text of the audio-text pairs used in acoustic model training to train the language model. Although the audio-text pair data is significantly smaller than the text-only data, it is still very useful. Special care was taken not to include data from the test or validation sets in the LM training text.
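The following is a minimal sketch of this combination step, assuming the cleaned IndicCorp text and the acoustic training transcripts are plain text files with one sentence per line; all file names are illustrative.

```python
def build_lm_corpus(text_only_path, train_transcripts_path, heldout_paths, out_path):
    """Merge text-only data with acoustic training transcripts,
    excluding any sentence that appears in the dev/eval sets."""
    heldout = set()
    for path in heldout_paths:
        with open(path, encoding="utf-8") as f:
            heldout.update(line.strip() for line in f)

    with open(out_path, "w", encoding="utf-8") as out:
        for path in (text_only_path, train_transcripts_path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line and line not in heldout:
                        out.write(line + "\n")

# Illustrative usage:
# build_lm_corpus("indiccorp_hi_clean.txt", "mucs_hi_train.txt",
#                 ["mucs_hi_dev.txt", "mucs_hi_eval.txt"], "lm_corpus_hi.txt")
```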
2.4 PubMed
We use abstracts from PubMed articles (https://www.kaggle.com/cvltmao/pmc-articles), which contain M sentences, to develop a domain-specific language model for the biomedical area. The PubMed data was processed in the same way as above; after processing it contains K sentences and M tokens.
2.5 Evaluation Dataset: MUCS 2021
We study the effect of the different factors involved in generating the LM on the outputs of our trained ASR model. We report results on the dev and eval sets of the publicly available MUCS dataset (https://navana-tech.github.io/MUCS2021). Table 2 describes the MUCS data.
Lang | Split | Samples | Unique Transcripts
---|---|---|---
hi | train | |
 | dev | |
 | eval | |
gu | train | |
 | dev | |
 | eval | |
mr | train | |
 | dev | |
 | eval | |
or | train | |
 | dev | |
 | eval | |
ta | train | |
 | dev | |
 | eval | |
te | train | |
 | dev | |
 | eval | |
3 Methodology
3.1 ASR Creation
In our previous work [clsril] we investigated the benefits of self-supervised learning and found that the best results were achieved by fine-tuning a multilingual pretrained model rather than a monolingual one. Our model is based on the wav2vec 2.0 architecture [baevski2020wav2vec]. Wav2vec 2.0 is a self-supervised model that learns speech representations from raw audio. It takes a stream of audio as input and outputs a stream of characters, trained with a CTC loss. In plain decoding, an argmax over the logits at each time step gives the final output.
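As an illustration of this plain argmax (greedy) decoding, the sketch below runs a fine-tuned wav2vec 2.0 CTC model through the Hugging Face transformers library; the checkpoint path and audio file are placeholders, not the exact models used in this work.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder checkpoint: substitute any fine-tuned wav2vec 2.0 CTC model.
CKPT = "path/to/finetuned-wav2vec2-hi"
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT).eval()

speech, sr = sf.read("sample_16khz.wav")          # 16 kHz mono audio expected
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits               # shape: (batch, time, vocab)

# Greedy decoding: per-frame argmax; the processor collapses repeats and blanks.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```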
To include language context, we can plug in either a transformer-based LM or a statistical language model. In this study we focus only on statistical language models built with KenLM [heafield2011kenlm], which we chose for its speed, efficiency, and simplicity.
3.2 Language Model
A language model offers a way to assign a probability to a sentence or other sequence of words, and to predict a word from the preceding words. By training on text, it defines what the speaker may say, the vocabulary, and the probability distribution over possible word sequences. In our experiments we use n-gram models, whose central idea is to truncate the word history to the last n words. With efficient language-model toolkits such as KenLM [heafield2011kenlm] it is possible to build web-scale language models. Code to create the LM can be obtained from https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation.
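The sketch below shows one plausible way to build such an n-gram LM by calling the KenLM command-line tools from Python; the file names, n-gram order, and pruning thresholds are illustrative, and the paper's own scripts live in the repository linked above.

```python
import subprocess

CORPUS = "lm_corpus_hi.txt"     # cleaned LM training text (illustrative name)
VOCAB = "top_words.txt"         # optional vocabulary list, one word per line
ARPA, BINARY = "lm.arpa", "lm.binary"

# Build a pruned 5-gram LM restricted to the chosen vocabulary.
with open(CORPUS, "rb") as inp, open(ARPA, "wb") as out:
    subprocess.run(
        ["lmplz", "-o", "5",
         "--prune", "0", "0", "0", "0", "1",      # example thresholds: drop 5-gram singletons
         "--limit_vocab_file", VOCAB],
        stdin=inp, stdout=out, check=True)

# Convert the ARPA file to KenLM's compact binary format for fast loading.
subprocess.run(["build_binary", ARPA, BINARY], check=True)
```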
3.3 Decoding
Language model decoding is common in ASR systems; it re-weights output probabilities to account for the greater likelihood of some words occurring in the language. A KenLM language model gives the probabilities of tokens occurring in a sequence and thus represents corpus-level statistics of the language. Language model decoding can prevent unlikely sequences of words from being selected in favor of more likely predictions.
The decoding step determines the most probable sequence of characters given the output probabilities from the acoustic model and the language model. Let P_A(c|x) represent the probability of a sequence of characters c = {c1, c2, c3, ...} obtained from the acoustic model for a given input audio x, and let P_LM(c) represent the probability of that sequence under the language model. The objective of the decoding step is to obtain a sequence of characters that maximizes the confidence score Q(c):
Q(c) = log(P_A(c|x)) + α · log(P_LM(c)) + β · wc(c)    (1)
Here, α and β are weights that balance the influence of the acoustic model, the language model, and the word count wc(c) of the utterance. The last term is used because shorter sentences inherently have a higher probability under the n-gram language model. The output sequence of characters that maximizes the confidence score Q(c) is determined using a beam search algorithm [hannun2014deep]. Optimal values of α, β, and the beam width are determined through a parametric analysis that minimizes the average WER over a validation set. Figure 1 depicts our approach.
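A minimal sketch of such LM-fused beam-search decoding, using the pyctcdecode library as one possible implementation (not the exact decoder used in this work); alpha and beta below play the role of α and β in Eq. 1, and the label list and file paths are placeholders:

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# The label list must match the acoustic model's output vocabulary and ordering.
labels = [""] + list("abcdefghijklmnopqrstuvwxyz' ")   # illustrative character set

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.binary",   # KenLM binary built earlier
    alpha=0.5,                      # language-model weight (α in Eq. 1)
    beta=1.0,                       # word-insertion bonus (β in Eq. 1)
)

# logits: (time, vocab) array of per-frame scores from the acoustic model.
logits = np.load("sample_logits.npy")
print(decoder.decode(logits, beam_width=128))
```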
4 Results
We conduct multiple individual tests to determine the impact of factors such as n-gram order, corpus token size, pruning, and vocab width. We use CER/WER as the performance metrics in our tests.
For our tests we run inference on the MUCS Hindi dev and eval sets; in Tables 3, 4, 5, 6 and 7, inference is done on these sets. The ASR model (https://github.com/Open-Speech-EkStep/vakyansh-models#finetuned-asr-models) is fine-tuned from CLSRIL [clsril] on 4200 hours of Hindi data.
4.1 Should we keep increasing the size of text corpus ?
To compare the effect of the LM training corpus size, we created three corpora: corpus1 is built from the training text only, while corpus2 and corpus3 additionally contain M and M randomly selected lines from the IndicCorp Hindi corpus. Using these corpora we build an n-gram, pruned (||) LM and decode with a fixed beam width.
Table 3 shows the results of these experiments. Moving from no LM to corpus1 increases CER/WER, because the evaluation set text is very different from the training set text. corpus2 gives a % decrease in WER and a % decrease in CER on the dev set, and a % reduction in WER and a % reduction in CER on the eval set, a clue that increasing the text corpus size is the right direction. The metrics improve further with corpus3, showing that an LM built from a large, diverse corpus helps improve the ASR output. One important point is that by increasing the text corpus we also increase the chance of word overlap between the LM training text and our test set.
Corpus | Size | Dev CER | Dev WER | Eval CER | Eval WER
---|---|---|---|---|---
No LM | — | | | |
corpus1 | 643 K | | | |
corpus2 | 23.04 M | | | |
corpus3 | 230.8 M | 9.03 | 17.57 | 9.5 | 15.04
4.2 Does a higher n-gram model work better ?
The idea behind the n-gram model is to truncate the word history to the last n words. To study the effect of different n-gram orders, we built n-gram models of three different orders on the corpus3 data. All models are pruned (||), with a vocab width of K and a fixed beam width.
Table 4 shows a slight decrease in CER/WER as the n-gram order increases.
n-grams | Dev CER | Dev WER | Eval CER | Eval WER
---|---|---|---|---
 | 9.03 | 17.57 | 9.5 | 15.04
4.3 What is the effect of pruning ?
Pruning is used to shrink a language model, for example by storing only n-grams with counts greater than some threshold; the basic idea is to drop less important n-grams. We build an n-gram LM with a vocab width of K using corpus2, decoded with a fixed beam width.
Table 5 shows a slight increase in CER/WER when pruning is introduced. This may be because the test set is not large enough, but pruning helps reduce the memory footprint of the LM: its size shrinks by 82 %, from Mb without pruning to Mb with pruning.
Pruning | Dev CER | Dev WER | Eval CER | Eval WER
---|---|---|---|---
Yes | 9.8 | |||
No | 9.29 | 18.25 | 15.29 |
4.4 What is the effect of increasing the vocab width ?
Vocab width is a word list fixed in advance by choosing the top V words by frequency: the unique words in the corpus are sorted by frequency, and the top V words become part of the language model. We build an n-gram, pruned (||) LM using corpus3, decoded with a fixed beam width.
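A minimal sketch of how such a frequency-based word list can be produced (file names are illustrative; the resulting file can be passed to KenLM's --limit_vocab_file option as in the earlier sketch):

```python
from collections import Counter

TOP_V = 500_000   # vocab width, e.g. 500 K

# Count word frequencies over the LM training corpus.
counts = Counter()
with open("lm_corpus_hi.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# Keep only the top V most frequent words.
with open("top_words.txt", "w", encoding="utf-8") as out:
    for word, _ in counts.most_common(TOP_V):
        out.write(word + "\n")
```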
Table 6 shows that increasing the vocab width, and hence the word list, improves CER/WER.
Vocab Size | Dev CER | Dev WER | Eval CER | Eval WER
---|---|---|---|---
50 K | 9.03 | 17.57 | 9.5 | 15.04 |
500 K | 8.65 | 17.07 | 9.32 | 14.77 |
4.5 What is the effect of increasing the Beam width ?
The beam search algorithm [hannun2014deep] used in decoding keeps the top-k candidate sequences at each step based on their probability scores (where k is called the beam width) and then moves on to the next time step. This k affects the decoding result.
We decode with the same corpus3 LM from Table 3, evaluated on the dev set, for different values of k. The results are shown in Table 7. The metrics improve, but inference time becomes a problem: it grows with the beam width because more possibilities are explored, and going from the smaller to the larger beam width increases the inference duration by a factor of 5. A rough timing sketch follows Table 7.
Beam Width | CER | WER |
---|---|---|
8.42 | 16.78 |
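As a rough illustration of this trade-off, the sketch below reuses the illustrative pyctcdecode decoder from the sketch in Section 3.3 (all labels, paths, and beam widths are placeholders) and times decoding at different beam widths:

```python
import time
import numpy as np
from pyctcdecode import build_ctcdecoder

# The label list must match the acoustic model's output vocabulary and ordering.
labels = [""] + list("abcdefghijklmnopqrstuvwxyz' ")
decoder = build_ctcdecoder(labels, kenlm_model_path="lm.binary", alpha=0.5, beta=1.0)

logits = np.load("sample_logits.npy")   # (time, vocab) frame scores from the ASR model

for beam_width in (32, 128, 512):
    start = time.perf_counter()
    decoder.decode(logits, beam_width=beam_width)
    print(f"beam_width={beam_width:4d}  time={time.perf_counter() - start:.2f}s")
```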
4.6 Results on other languages
Table 8 shows an improvement in the CER/WER metrics after using an LM for all six languages present in MUCS. All LMs are built by combining the IndicCorp and MUCS training text corpora, as explained in Section 2 (Dataset). From our experiments on the parameters for building LMs, we settled on a single set of parameters and used it to build the LM for each of the six languages.
These language models are all n-gram, pruned (||), with a vocab width of K, and inference is run on the MUCS dev and eval sets with a fixed beam size.
Lang | Size | No LM CER | No LM WER | With LM CER | With LM WER
---|---|---|---|---|---
hi | B | ||||
gu | M | ||||
mr | M | ||||
or | M | ||||
ta | M | ||||
te | M |
Table 8 shows a 28.58 % average decrease in CER and a 36.32 % average decrease in WER. LMs for the different languages can be obtained from https://github.com/Open-Speech-EkStep/vakyansh-models.
4.7 Effect of Domain specific LM
Table 9 shows the results of an LM built from a corpus generated from PubMed articles. Here, our evaluation set contains only samples from the biomedical domain. We observe a drastic decrease in error rate when using the language model trained on domain-specific text.
This is one of our key findings, as it shows how a domain-specific ASR system can be created without retraining the ASR model, which may itself be trained on a general domain. Compared to using no LM, WER and CER decrease by 72 % and 63 % respectively. The LM is n-gram, pruned (||), with a vocab width of K and a fixed beam width. A minimal sketch of this LM swap follows Table 9.
LM | CER | WER |
---|---|---|
No LM | ||
Eng LM | ||
Bio-LM | 4 | 13 |
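A minimal sketch of this idea, swapping only the KenLM binary passed to the decoder while the acoustic model and its outputs stay unchanged (pyctcdecode is used as one possible decoder; all labels and paths are placeholders):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# The label list must match the acoustic model's output vocabulary and ordering.
labels = [""] + list("abcdefghijklmnopqrstuvwxyz' ")
logits = np.load("biomed_sample_logits.npy")   # frames from the unchanged acoustic model

# Same acoustic model output, two different language models.
general_decoder = build_ctcdecoder(labels, kenlm_model_path="eng_lm.binary")
biomed_decoder = build_ctcdecoder(labels, kenlm_model_path="pubmed_lm.binary")

print("Eng LM :", general_decoder.decode(logits))
print("Bio-LM :", biomed_decoder.decode(logits))
```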
In the following examples, the first sentence of each pair is obtained without an LM and the second using the Bio-LM. Incorrectly recognized words are marked in bold, and the words corrected by the Bio-LM are marked in bold italics. Examples 1, 2 and 3 show the Bio-LM correcting the ASR output for biomedical-domain words. These words are not usually found in a speech corpus, so the acoustic model predicts the most probable word with a similar pronunciation; using a language model that learns corpus-level statistics, we re-adjust the sequence of words to the most probable sequence for the domain.
1. GASTRIC **EXCOLIATIVE SYTOLOGY**
   GASTRIC ***EXFOLIATIVE CYTOLOGY***
2. **ATISTIC** INDIVIDUALS HAVE A HIGHER LEVEL OF **GALATOPOBIA**
   ***AUTISTIC*** INDIVIDUALS HAVE A HIGHER LEVEL OF ***GELOTOPHOBIA***
3. MIXED **ALCOLING** EARTH AND TRANSITION METAL **PIROBORATES**
   MIXED ***ALKALINE*** EARTH AND TRANSITION METAL ***PYROBORATES***
5 Conclusion & Future work
Increasing the corpus size gives a significant improvement in error rate, as shown in Table 3: from corpus1 to corpus2 there is a large improvement, while from corpus2 to corpus3 the improvement is small. We have also demonstrated that building the language model from contemporary news text is a promising and effective way to improve speech recognition in Indian languages, as shown in Table 8 for six languages. With the Bio-LM we have demonstrated concretely the importance of using quality data from an appropriate domain over simply using as much data as possible. We are approaching a level where ASR output starts to be sensible and useful for various purposes, primary among them making transcription easier.
When deciding the final parameters for an LM, several factors need to be balanced, such as inference speed, LM size, and vocab size. Further work in this direction may include larger language models for low-resource Indian languages like Bhojpuri, Rajasthani, Maithili, and others, to establish the importance of corpus size in the effectiveness of language models. The addition of text containing proper nouns could also be beneficial for rare words (names, locations) specific to Indian languages.
Finally, as the Bio-LM experiment shows, employing ASR for one specific purpose does not necessitate a big corpus; instead, carefully selected domain data can be quite useful.
6 Acknowledgment
All authors gratefully acknowledge the EkStep Foundation for supporting this project financially and providing infrastructure. A special thanks to Dr. Vivek Raghavan for constant support, guidance, and fruitful discussions. We also thank Ankit Katiyar, Heera Ballabh, Niresh Kumar R, Sreejith V, Soujyo Sen, Amulya Ahuja and Nikita Tiwari for helping out when needed and for infrastructure support for data processing and model training.