End to End ASR systems have gained momentum over the past years. Over a decade ago, various experiments sailed through by utilizing deep learning based models like HMM-based, DNN-HMM models, among other techniques  . The increase in computation power with efficient End to End techniques has proven to be widely successful in ASR systems allowing better results with less data.
All these years, the common metric, Word Error Rate(WER) and Character Error Rate(CER) still remain the common standards of reporting and comparing the ASR performances of new systems over the old. A few other evaluation methods like word embeddings    were proposed during this time. The purpose of this study is to propose a metric similar to WER and CER, with the capability of handling alternative forms in a language so that the ASR is not penalized and interpreted correctly.
We observe during our model building processes, the ambiguity in scripts like Devanagari which is used in languages like Hindi, Sanskrit etc. There are some consistent words and characters which have the same meaning in a sentence but the way of writing is a bit different. This ambiguity is common even in the real word scenario where people might use either of the multiple forms. This does not necessarily mean that those characters can always be used to form an alternative of that sound/phoneme in written form, but it can be common in a lot of scenarios. Thus, it results in multiple words in a language where such ambiguous cases exist. These words might not be mis-spelled but instead are just an alternative of their original. This results in higher WER and CER rates even if the prediction and interpretation of ASR was correct.
We use a semi supervised learning technique for building our ASR model . Language Models also play a key role in the E2E ASR systems because of their potential to predict and correct the upcoming words based on the text corpus. The Language Model (LM) used in this paper is a statistical LM built from KenLM toolkit.
2 Alternate WER and CER
Automatic Speech Recognition(ASR) systems have improved a lot over the past decade and mostly using WER111https://pypi.org/project/jiwer/ and CER as their metric for evaluation. Word Error Rate (WER) has shown to be the most commonly accepted metric in ASR systems. WER works on the concept of Edit Distance222https://pypi.org/project/editdistance/. In a sentence, for any number of insertions(i), deletions(d) and substitutions(s) with a total of Nt tokens, the WER(w) can be calculated as:
For languages like English which have one spelling for a word, it is quite ideal to use WER and CER as their metric as it can give the overall correctness by comparing the ground truth and hypothesis. This method is not directly suitable for languages like Hindi, which have alternate spellings for many words. In a sentence of four words, if the prediction of one word is an alternate form of the ground truth, then the WER will report it as 25% error which would be incorrect as the word was spelled differently but in an alternate form with the same meaning.
For calculating the AWER, we first perform substitutions of ambiguous characters in the original transcripts and then calculate all the possible combinations of WER. We then take the minimum of the possible alternate combinations:
We use the Hindi MUCS 333https://navana-tech.github.io/MUCS2021/ dataset for our experiment. We first use the dev set and the eval set of Interspeech, with duration of 5 hours each. The punctuations and extra characters are cleaned before doing any further analysis. We calculate the results on the dev set and find out a few valid pairs which have the highest error rate after ASR model prediction. We observe that a lot of the pairs are an alternate form of the same word.
In our experiment, we first train our Automatic Speech Recognition model on audio-text pair data. After that we use language models to improve the ASR outputs and compare our results of alternate metrics both with and without language models.
We pretrain CLSRIL-23 model  on 23 Indic languages using wav2vec 2.0 architecture on 10000 hours of audio data. It learns the cross lingual speech representations from raw waveforms. For ASR pretraining, audio chunks with a duration of 25s-30s are used. Pretraining is started with a learning rate of 0.0005 and dropout of 0.1 using Adam optimizer for the first 32,000 steps. After this, the learning rate decreases linearly and the model is trained for almost 300,000 steps.
The learned representations are then finetuned on 4200 hours of Hindi speech and text data pairs. These labelled data chunks have a duration upto 15s. Before finetuning, the dataset is cleaned to remove any punctuation marks or extra characters to ensure that the dictionary has only the language characters. The architecture used is a base architecture over the large one because of its faster training and inference time due to less number of parameters. A fully connected layer is added on top of the pretrained context network with the size of output layer equal to the vocabulary of the task during the finetuning step. The models are optimized using a CTC loss function.
At the time of testing, foreign characters are not removed because we want to test on the original data. This is done for proper evaluation of the dataset to consider the overall ASR model performance.
4.1.1 Language Models
During inference, we calculate our ASR output using either a viterbi decoder or a language model. For language modeling, we use the statistical based model, KenLM using the training text data and IndicCorp corpus. The size of the corpus is 1.03 Billion lines. The raw corpus is cleaned by removing all the punctuation symbols and foreign characters.
The 5-gram language model is decoded with a beam search algorithm  on a beam width of 128 to find the most appropriate sequence. A vocabulary size of top 500k words is used based on the sorted frequencies from the raw corpus. To increase the LM weightage, the LM weight is set to 2 and word score value to -1. The LM is also pruned in the || setting to reduce the LM size.
Our experiment is devised to identify the common pairs with alternate spellings. We compare the prediction and the hypothesis of many sentences and identified words that are substituted. This is done using the SCTK tool444https://github.com/usnistgov/SCTK. From this, we manually select pairs which are grammatically correct in the written form of Hindi language. After this we convert these sets of key-value pairs into a dictionary. This dictionary is then used to change the testing format by using alternate or the original spellings used for comparison (Figure ).
After that we rearrange the test set according to our format of adding alternate spellings to it. We then calculate the predicted sentence’s AWER and ACER based on the minimum WER value that we get out of the combination of multiple pairs of sentences that we might have had.
4.2.1 Character/Word Mapping
The error word statistics are calculated by SCTK tool which helps in getting a mapping of the substituted words. These pairs are termed as confusion pairs as they show the number of times a pair was substituted with another word. On analyzing the confusion words, a lot of character and word pairs are released which have grammatically the same meaning in the written form as well. Figure 2 shows a few examples which have the same meaning but have a different way of writing and are therefore added to the mapping dictionary.
4.2.2 Creating mapped test set
The new test sets are created by including alternative forms in a special format which allows us to evaluate the minimum error rate out of all the possible combinations. Amongst a possible combination of multiple sentences with alternative forms, the best combination closest to the reference sentence is used and hence the AWER and ACER are calculated by measuring the Levenshtein distances between the hypothesis and reference. Figure 3 shows how the modified test set is created from which the combination of best sentence is taken for AWER calculation.
The results in Table 2 show an improvement over the general WER as we have handled alternate spellings in written form for a particular word or a character. A significant advantage of this metric is that it considers the ambiguity in the written form of the language and improves the interpretability of the ASR system.
5.1 Creation of new Benchmark Dataset for Hindi
For wider adoption of this benchmark, we open source Vakyansh Hindi, a new Benchmarking dataset for evaluating ASR models trained for Hindi. The duration of the entire dataset is 21 hours containing male and female voices from multiple speakers. We create a platform where the native speakers were invited to record their voices in a closed setting. The text data was shown on the screen and the speakers had to read out aloud the text shown. We were able to collect 50 hours of data using this approach.
The next step is to clean the data. Any audio with a low Sound to Noise Ratio (SNR) is rejected if its SNR value is less than 20. We also run our Hindi ASR system on the audio and compute WER, we then analyze manually the high WER transcripts for errors speakers might have made during the recording process.
After cleaning, the total number of recordings were more than 16500. Table 3 shows the AWER and ACER results in comparison to WER and CER respectively. We also open source the alternate test set script for AWER and ACER calculation555https://github.com/Open-Speech-EkStep/vakyansh-alternate-wer.
|Vakyansh data (21h)||Viterbi||25.11||9.4|
|Vakyansh data (21h)||KenLM||20.52||11.01|
6 Conclusions and Future Work
In this work, we propose a novel method of evaluation system for Automatic Speech Recognition. This method has the capability to consider alternate characters and spellings in a language. Our work shows an advancement in Hindi language, but it is extendible to other similar languages. This will help in removing any kind of wrong decisions while calculating the WER and CER based on the ambiguity that exists in some languages.
Even for Hindi, there might be other possible alternative spellings which can be identified in detail with the help of a domain expert. The researchers may add other key-value pairs as there is no specified set present till date. In future, we plan to use a similar strategy in other Indic languages with similar context.
6.1 Performance on Proper Nouns
We plan to further extend our work by evaluating performances on proper nouns in an ASR output for test datasets. Named Entity Recognition (NER) or Parts of Speech (POS) tagging can be used to get the proper nouns category wise which can then be specifically used to compare WER and AWER. By this method, we can conclude two results. First, how much is the ASR output result affected by the errors in proper nouns? Second, how much is the AWER evaluation for common names with multiple spellings, for example: Swathi and Swati, Zoe and Zoey?
All authors gratefully acknowledge Ekstep Foundation for sup- porting this project financially and providing the infrastructure. A special thanks to Dr. Vivek Raghavan for constant support, guidance and fruitful discussions. We also thank Nikita Tiwari, Ankit Katiyar, Heera Ballabh, Niresh Kumar R, Sreejith V, Soujyo Sen, Amulya Ahuja, Anshul Gautam and Rajat Singhal for helping out when needed and extending infrastructure support for data processing and model training.
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,”Advances in Neural Information Processing Systems, vol. 33, pp. 12 449–12 460, 2020.
-  J. Novoa, J. Wuth, J. P. Escudero, J. Fredes, R. Mahu, and N. B. Yoma, “Dnn-hmm based automatic speech recognition for hri scenarios,” in 2018 13th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2018, pp. 150–159.
G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior,
V. Vanhoucke, P. Nguyen, T. N. Sainath et al.
, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,”IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4580–4584.
-  N.-T. Le, C. Servan, B. Lecouteux, and L. Besacier, “Better evaluation of asr in speech translation context using word embeddings,” in Interspeech 2016, 2016.
-  T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G. Avidov, R. Collobert, and G. Synnaeve, “Rethinking evaluation in asr: Are our models robust enough?” arXiv preprint arXiv:2010.11745, 2020.
-  B. Favre, K. Cheung, S. Kazemian, A. Lee, Y. Liu, C. Munteanu, A. Nenkova, D. Ochei, G. Penn, S. Tratz et al., “Automatic human utility evaluation of asr systems: Does wer really predict performance?” in INTERSPEECH, 2013, pp. 3463–3467.
-  A. Gupta, H. S. Chadha, P. Shah, N. Chimmwal, A. Dhuriya, R. Gaur, and V. Raghavan, “Clsril-23: cross lingual speech representations for indic languages,” arXiv preprint arXiv:2107.07402, 2021.
-  K. Heafield, “Kenlm: Faster and smaller language model queries,” in Proceedings of the sixth workshop on statistical machine translation, 2011, pp. 187–197.
-  D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar, “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages,” in Findings of EMNLP, 2020.
-  A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.