Log In Sign Up

Is Word Error Rate a good evaluation metric for Speech Recognition in Indic Languages?

by   Priyanshi Shah, et al.

We propose a new method for the calculation of error rates in Automatic Speech Recognition (ASR). This new metric is for languages that contain half characters and where the same character can be written in different forms. We implement our methodology in Hindi which is one of the main languages from Indic context and we think this approach is scalable to other similar languages containing a large character set. We call our metrics Alternate Word Error Rate (AWER) and Alternate Character Error Rate (ACER). We train our ASR models using wav2vec 2.0<cit.> for Indic languages. Additionally we use language models to improve our model performance. Our results show a significant improvement in analyzing the error rates at word and character level and the interpretability of the ASR system is improved upto 3% in AWER and 7% in ACER for Hindi. Our experiments suggest that in languages which have complex pronunciation, there are multiple ways of writing words without changing their meaning. In such cases AWER and ACER will be more useful rather than WER and CER as metrics. Furthermore, we open source a new benchmarking dataset of 21 hours for Hindi with the new metric scripts.


page 1

page 2

page 3

page 4


Improving Speech Recognition for Indic Languages using Language Model

We study the effect of applying a language model (LM) on the output of A...

Assessing ASR Model Quality on Disordered Speech using BERTScore

Word Error Rate (WER) is the primary metric used to assess automatic spe...

Phonetic and Visual Priors for Decipherment of Informal Romanization

Informal romanization is an idiosyncratic process used by humans in info...

On the Inductive Bias of Word-Character-Level Multi-Task Learning for Speech Recognition

End-to-end automatic speech recognition (ASR) commonly transcribes audio...

Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation

We present an approach to reduce the performance disparity between geogr...

3D Rendering Framework for Data Augmentation in Optical Character Recognition

In this paper, we propose a data augmentation framework for Optical Char...

1 Introduction

End to End ASR systems have gained momentum over the past years. Over a decade ago, various experiments sailed through by utilizing deep learning based models like HMM-based, DNN-HMM models, among other techniques

[2] [3] [4]. The increase in computation power with efficient End to End techniques has proven to be widely successful in ASR systems allowing better results with less data.

All these years, the common metric, Word Error Rate(WER) and Character Error Rate(CER) still remain the common standards of reporting and comparing the ASR performances of new systems over the old. A few other evaluation methods like word embeddings [5] [6] [7] were proposed during this time. The purpose of this study is to propose a metric similar to WER and CER, with the capability of handling alternative forms in a language so that the ASR is not penalized and interpreted correctly.

We observe during our model building processes, the ambiguity in scripts like Devanagari which is used in languages like Hindi, Sanskrit etc. There are some consistent words and characters which have the same meaning in a sentence but the way of writing is a bit different. This ambiguity is common even in the real word scenario where people might use either of the multiple forms. This does not necessarily mean that those characters can always be used to form an alternative of that sound/phoneme in written form, but it can be common in a lot of scenarios. Thus, it results in multiple words in a language where such ambiguous cases exist. These words might not be mis-spelled but instead are just an alternative of their original. This results in higher WER and CER rates even if the prediction and interpretation of ASR was correct.

We use a semi supervised learning technique for building our ASR model

[1] [8]. Language Models also play a key role in the E2E ASR systems because of their potential to predict and correct the upcoming words based on the text corpus. The Language Model (LM) used in this paper is a statistical LM built from KenLM[9] toolkit.

2 Alternate WER and CER

Automatic Speech Recognition(ASR) systems have improved a lot over the past decade and mostly using WER111 and CER as their metric for evaluation. Word Error Rate (WER) has shown to be the most commonly accepted metric in ASR systems. WER works on the concept of Edit Distance222 In a sentence, for any number of insertions(i), deletions(d) and substitutions(s) with a total of Nt tokens, the WER(w) can be calculated as:

For languages like English which have one spelling for a word, it is quite ideal to use WER and CER as their metric as it can give the overall correctness by comparing the ground truth and hypothesis. This method is not directly suitable for languages like Hindi, which have alternate spellings for many words. In a sentence of four words, if the prediction of one word is an alternate form of the ground truth, then the WER will report it as 25% error which would be incorrect as the word was spelled differently but in an alternate form with the same meaning.

For calculating the AWER, we first perform substitutions of ambiguous characters in the original transcripts and then calculate all the possible combinations of WER. We then take the minimum of the possible alternate combinations:

3 Dataset

We use the Hindi MUCS 333 dataset for our experiment. We first use the dev set and the eval set of Interspeech, with duration of 5 hours each. The punctuations and extra characters are cleaned before doing any further analysis. We calculate the results on the dev set and find out a few valid pairs which have the highest error rate after ASR model prediction. We observe that a lot of the pairs are an alternate form of the same word.

Split Samples Unique transcripts
Table 1: MUCS Hindi Dataset description

4 Methodology

In our experiment, we first train our Automatic Speech Recognition model on audio-text pair data. After that we use language models to improve the ASR outputs and compare our results of alternate metrics both with and without language models.

4.1 ASR-Model

We pretrain CLSRIL-23 model [8] on 23 Indic languages using wav2vec 2.0 architecture[1] on 10000 hours of audio data. It learns the cross lingual speech representations from raw waveforms. For ASR pretraining, audio chunks with a duration of 25s-30s are used. Pretraining is started with a learning rate of 0.0005 and dropout of 0.1 using Adam optimizer for the first 32,000 steps. After this, the learning rate decreases linearly and the model is trained for almost 300,000 steps.

The learned representations are then finetuned on 4200 hours of Hindi speech and text data pairs. These labelled data chunks have a duration upto 15s. Before finetuning, the dataset is cleaned to remove any punctuation marks or extra characters to ensure that the dictionary has only the language characters. The architecture used is a base architecture over the large one because of its faster training and inference time due to less number of parameters. A fully connected layer is added on top of the pretrained context network with the size of output layer equal to the vocabulary of the task during the finetuning step. The models are optimized using a CTC loss function.

At the time of testing, foreign characters are not removed because we want to test on the original data. This is done for proper evaluation of the dataset to consider the overall ASR model performance.

4.1.1 Language Models

During inference, we calculate our ASR output using either a viterbi decoder or a language model. For language modeling, we use the statistical based model, KenLM[9] using the training text data and IndicCorp[10] corpus. The size of the corpus is 1.03 Billion lines. The raw corpus is cleaned by removing all the punctuation symbols and foreign characters.

The 5-gram language model is decoded with a beam search algorithm [11] on a beam width of 128 to find the most appropriate sequence. A vocabulary size of top 500k words is used based on the sorted frequencies from the raw corpus. To increase the LM weightage, the LM weight is set to 2 and word score value to -1. The LM is also pruned in the || setting to reduce the LM size.

4.2 Experiment

Our experiment is devised to identify the common pairs with alternate spellings. We compare the prediction and the hypothesis of many sentences and identified words that are substituted. This is done using the SCTK tool444 From this, we manually select pairs which are grammatically correct in the written form of Hindi language. After this we convert these sets of key-value pairs into a dictionary. This dictionary is then used to change the testing format by using alternate or the original spellings used for comparison (Figure ).

After that we rearrange the test set according to our format of adding alternate spellings to it. We then calculate the predicted sentence’s AWER and ACER based on the minimum WER value that we get out of the combination of multiple pairs of sentences that we might have had.

Figure 1: Pipeline flow for metric calculation

4.2.1 Character/Word Mapping

The error word statistics are calculated by SCTK tool which helps in getting a mapping of the substituted words. These pairs are termed as confusion pairs as they show the number of times a pair was substituted with another word. On analyzing the confusion words, a lot of character and word pairs are released which have grammatically the same meaning in the written form as well. Figure 2 shows a few examples which have the same meaning but have a different way of writing and are therefore added to the mapping dictionary.

Figure 2: Examples of Key value pair mapping

4.2.2 Creating mapped test set

The new test sets are created by including alternative forms in a special format which allows us to evaluate the minimum error rate out of all the possible combinations. Amongst a possible combination of multiple sentences with alternative forms, the best combination closest to the reference sentence is used and hence the AWER and ACER are calculated by measuring the Levenshtein distances between the hypothesis and reference. Figure 3 shows how the modified test set is created from which the combination of best sentence is taken for AWER calculation.

Figure 3: Examples of original vs modified test set mappings

5 Results

The results in Table 2 show an improvement over the general WER as we have handled alternate spellings in written form for a particular word or a character. A significant advantage of this metric is that it considers the ambiguity in the written form of the language and improves the interpretability of the ASR system.

Dataset Split Decoder WER AWER CER ACER
MUCS Dev Viterbi 21.1 6.89
MUCS Dev KenLM 12.45 4.81
MUCS Eval Viterbi 17.36 5.56
MUCS Eval KenLM 10.91 3.92
Table 2: Results on MUCS dev and eval set. MUCS eval set results show an decrease of upto % in AWER and % in ACER

5.1 Creation of new Benchmark Dataset for Hindi

For wider adoption of this benchmark, we open source Vakyansh Hindi, a new Benchmarking dataset for evaluating ASR models trained for Hindi. The duration of the entire dataset is 21 hours containing male and female voices from multiple speakers. We create a platform where the native speakers were invited to record their voices in a closed setting. The text data was shown on the screen and the speakers had to read out aloud the text shown. We were able to collect 50 hours of data using this approach.

The next step is to clean the data. Any audio with a low Sound to Noise Ratio (SNR)[12] is rejected if its SNR value is less than 20. We also run our Hindi ASR system on the audio and compute WER, we then analyze manually the high WER transcripts for errors speakers might have made during the recording process.

After cleaning, the total number of recordings were more than 16500. Table 3 shows the AWER and ACER results in comparison to WER and CER respectively. We also open source the alternate test set script for AWER and ACER calculation555

Dataset Decoder WER AWER CER ACER
Vakyansh data (21h) Viterbi 25.11 9.4
Vakyansh data (21h) KenLM 20.52 11.01
Table 3: Results on Vakyansh Hindi Benchmarking Dataset

6 Conclusions and Future Work

In this work, we propose a novel method of evaluation system for Automatic Speech Recognition. This method has the capability to consider alternate characters and spellings in a language. Our work shows an advancement in Hindi language, but it is extendible to other similar languages. This will help in removing any kind of wrong decisions while calculating the WER and CER based on the ambiguity that exists in some languages.

Even for Hindi, there might be other possible alternative spellings which can be identified in detail with the help of a domain expert. The researchers may add other key-value pairs as there is no specified set present till date. In future, we plan to use a similar strategy in other Indic languages with similar context.

6.1 Performance on Proper Nouns

We plan to further extend our work by evaluating performances on proper nouns in an ASR output for test datasets. Named Entity Recognition (NER) or Parts of Speech (POS) tagging can be used to get the proper nouns category wise which can then be specifically used to compare WER and AWER. By this method, we can conclude two results. First, how much is the ASR output result affected by the errors in proper nouns? Second, how much is the AWER evaluation for common names with multiple spellings, for example: Swathi and Swati, Zoe and Zoey?

7 Acknowledgements

All authors gratefully acknowledge Ekstep Foundation for sup- porting this project financially and providing the infrastructure. A special thanks to Dr. Vivek Raghavan for constant support, guidance and fruitful discussions. We also thank Nikita Tiwari, Ankit Katiyar, Heera Ballabh, Niresh Kumar R, Sreejith V, Soujyo Sen, Amulya Ahuja, Anshul Gautam and Rajat Singhal for helping out when needed and extending infrastructure support for data processing and model training.