WER-BERT: Automatic WER Estimation with BERT in a Balanced Ordinal Classification Paradigm

by Akshay Krishna Sheshadri, et al.

Automatic Speech Recognition (ASR) systems are evaluated using Word Error Rate (WER), which is calculated by counting the errors in the ASR system's transcription relative to the ground truth. This calculation, however, requires manual transcription of the speech signal to obtain the ground truth. Since transcribing audio signals is a costly process, Automatic WER Evaluation (e-WER) methods have been developed to predict the WER of a speech system automatically, relying only on the transcription and the speech signal features. While WER is a continuous variable, previous works have shown that posing e-WER as a classification problem is more effective than regression. However, when converted to a classification setting, these approaches suffer from heavy class imbalance. In this paper, we propose a new balanced paradigm for e-WER in a classification setting. Within this paradigm, we also propose WER-BERT, a BERT-based architecture with speech features for e-WER. Furthermore, we introduce a distance loss function to tackle the ordinal nature of e-WER classification. The proposed approach and paradigm are evaluated on the Librispeech dataset and a commercial (black box) ASR system, Google Cloud's Speech-to-Text API. The results and experiments demonstrate that WER-BERT establishes a new state-of-the-art in automatic WER estimation.
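For readers unfamiliar with the metric that e-WER methods try to predict, the following is a minimal sketch of the conventional WER computation: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. It is not taken from the paper; the function name `wer` and the whitespace tokenization are assumptions made for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words.

    Illustrative sketch only; tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that this computation needs the ground-truth reference, which is exactly the requirement e-WER approaches aim to avoid.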




Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition

As an effective method for intellectual property (IP) protection, model ...

Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System

The performances of automatic speech recognition (ASR) systems are usual...

Unsupervised paradigm for information extraction from transcripts using BERT

Audio call transcripts are one of the valuable sources of information fo...

Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction

Non-intrusive intelligibility prediction is important for its applicatio...

Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency

Word error rate (WER) and character error rate (CER) are standard metric...

Alzheimer Disease Classification through ASR-based Transcriptions: Exploring the Impact of Punctuation and Pauses

Alzheimer's Disease (AD) is the world's leading neurodegenerative diseas...

Black-box Adaptation of ASR for Accented Speech

We introduce the problem of adapting a black-box, cloud-based ASR system...
