
Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric

by   Suyoun Kim, et al.

Measuring automatic speech recognition (ASR) system quality is critical for creating user-satisfying voice-driven applications. Word Error Rate (WER) has traditionally been used to evaluate ASR system quality; however, it sometimes correlates poorly with user perception of transcription quality. This is because WER weights every word equally and does not consider semantic correctness, which has a higher impact on user perception. In this work, we propose evaluating ASR output hypothesis quality with SemDist, which measures semantic correctness using the distance between the semantic vectors of the reference and hypothesis extracted from a pre-trained language model. Our experimental results on 71K and 36K user-annotated ASR outputs show that SemDist achieves higher correlation with user perception than WER. We also show that SemDist has higher correlation with downstream NLU tasks than WER.





1 Introduction

As voice-driven interfaces to devices become mainstream, measuring speech recognition system quality in a way that reflects user perception becomes increasingly important. Word Error Rate (WER) has traditionally been used to measure automatic speech recognition (ASR) system quality. However, it sometimes correlates poorly with user perception of ASR transcription quality. This is because WER weights every word equally and does not consider semantic correctness, which has more impact on user perception. For example, when the reference is “set an alarm for 7 am” and two ASR hypotheses are “set a alarm for 7 am” and “cancel an alarm for 7 am”, the former hypothesis would be preferred by users and downstream tasks. However, WER by itself cannot identify which hypothesis is better, as the error rates are identical.

Over the years, prior work has attempted to address some of WER’s issues by taking word importance weights into account [garofolo19991998] or adopting information retrieval measures of performance [makhoul1999performance, hunt1990figures, mccowan2004use]. All of these metrics are based on surface-form word correctness and are not able to measure semantic-level correctness.

Meanwhile, many prior studies on transformer-based [vaswani2017attention] pre-trained neural language models, such as BERT, RoBERTa, and XLM [Peters:2018, devlin2018bert, liu2019roberta, lample2019cross, du2020general], showed promising results in Natural Language Processing (NLP) and Natural Language Understanding (NLU) tasks. These general-purpose language models are pre-trained on billions of words and have shown the ability to represent textual semantic information in the form of low-dimensional continuous vectors (i.e., embeddings) in textual similarity, question answering, paraphrasing, and sentiment analysis tasks [devlin2018bert, liu2019roberta].

Recently, [reimers2019sentence, zhang2019bertscore] have attempted to use these embeddings generated from pre-trained language models to evaluate NLP/NLU systems semantically. Thus far, research on measuring semantic correctness has focused mostly on NLP/NLU systems. More recently, Semantic Distance (SemDist) [kim2021semantic] was proposed to measure semantic correctness for ASR systems using semantic embeddings, and it showed higher correlation than WER with the downstream NLU tasks of intent recognition and semantic parsing. To the best of our knowledge, there have been no studies on user perception of ASR quality with the SemDist metric.

Figure 1: Comparison of two metrics, WER and SemDist, for evaluating ASR hypotheses

In this work, we first focus on studying user perception of ASR quality using the SemDist metric. We evaluated 71K and 36K user-annotated ASR outputs and show that SemDist achieves higher correlation with user perception than WER. Second, we explore a variety of strategies for computing SemDist. We show that the XLM-based [lample2019cross, du2020general] pairwise-token method [zhang2019bertscore] performs more robustly than the RoBERTa-based mean-pooling method used in previous work [kim2021semantic]. Additionally, we show SemDist results on NLU tasks and their higher correlation than WER. Finally, we build a user perception model and show that SemDist helps to estimate user perception accurately and provides insight into model selection.

2 Semantic Distance (SemDist)

In this work, we measure ASR output hypothesis quality using the SemDist [kim2021semantic] approach with a transformer-based [vaswani2017attention] pre-trained LM [liu2019roberta, lample2019cross]. SemDist is calculated in two steps. First, we forward the reference transcription and corresponding ASR hypothesis to the pre-trained LM and obtain semantic embeddings of the reference (e_ref) and hypothesis (e_hyp). Second, we calculate the distance between these two embeddings using the cosine distance function. Although the raw value of SemDist theoretically has the same range as cosine similarity (-1 to 1), we observe that SemDist has a more limited range in practice. Thus, once we obtain SemDist, we optionally multiply it by a scaling factor to improve readability.

SemDist can be obtained in various ways depending on (1) which pre-trained language model we use and (2) how we extract the embeddings. In our experiments, we compare four SemDist variants: SemDist(RoBERTa-mean-pooling), SemDist(XLM-mean-pooling), SemDist(XLM-[CLS]), and SemDist(XLM-pairwise-token).

SemDist(RoBERTa-mean-pooling): We calculate the cosine distance between e_ref and e_hyp, each generated by mean-pooling over all token embeddings from the RoBERTa model.

SemDist = 1 - cos(e_ref, e_hyp) (1)

SemDist(XLM-mean-pooling): The process is the same as SemDist(RoBERTa-mean-pooling), but we use the XLM-R model [du2020general].

SemDist(XLM-[CLS]): We directly use the embedding of the [CLS] token, a special token added before the first word that is learned to hold information about the entire sentence, as e_ref and e_hyp, and compute the cosine distance between them.

SemDist = 1 - cos(e_ref, e_hyp) (2)

SemDist(XLM-pairwise-token): Inspired by [zhang2019bertscore], we compute pairwise cosine similarity between each token embedding from the reference and each token embedding from the hypothesis. For each reference token, we take the maximum cosine similarity over hypothesis tokens and average over all reference tokens to obtain Recall (R); symmetrically, for each hypothesis token we take the maximum similarity over reference tokens and average to obtain Precision (P). We then use 1 minus the harmonic mean (F1) of P and R as our SemDist.

SemDist = 1 - 2PR / (P + R) (3)
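This pairwise-token variant can be sketched as a BERTScore-style greedy matching over token embeddings. The function name and NumPy setup are illustrative assumptions, not the authors' code:

```python
import numpy as np

def pairwise_semdist(ref_emb: np.ndarray, hyp_emb: np.ndarray) -> float:
    """BERTScore-style SemDist: 1 - F1 of greedy token-level cosine matches.
    ref_emb: (ref_len x dim) token embeddings; hyp_emb: (hyp_len x dim)."""
    # Row-normalize so plain dot products are cosine similarities.
    ref_n = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    hyp_n = hyp_emb / np.linalg.norm(hyp_emb, axis=1, keepdims=True)
    sim = ref_n @ hyp_n.T              # (ref_len x hyp_len) similarity matrix
    recall = sim.max(axis=1).mean()    # best hyp match per reference token
    precision = sim.max(axis=0).mean() # best ref match per hypothesis token
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 - f1
```

Because each token is matched to its best counterpart, the metric is 0 whenever every token in one sentence has an exact embedding match in the other, regardless of token order.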

Figure 1 illustrates the overall procedure to obtain SemDist with simple examples of two hypotheses, A and B, and shows how SemDist and WER differ from each other. Naturally, users or downstream tasks prefer hypothesis A over B, because A has only a minor syntactic error (“an” → “a”) that does not hurt its meaning. As seen in this example, WER cannot separate these two hypotheses (16.7% vs. 16.7%) because it only measures literal word-level correctness. However, SemDist can indicate that hypothesis A is better than B (0.7 vs. 38.0) by measuring semantic correctness.
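For reference, the WER computation that fails to separate the two hypotheses above is the standard word-level Levenshtein distance normalized by reference length. A minimal sketch (our own illustration, not the authors' tooling):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Both “set a alarm for 7 am” and “cancel an alarm for 7 am” differ from the reference by one substitution out of six words, so WER assigns them the same 16.7% score.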

3 User Perception of ASR Quality

We investigated the correlation of SemDist with two user perception tasks: HypRating (Section 3.1) and HypChoice (Section 3.2). The hypotheses are generated from our strong baseline ASR system, an end-to-end sequence transducer (RNN-T) [Graves12transduction] with approximately 83M total parameters. The acoustic encoder is a 20-layer streamable low-latency Emformer model [Shi2021emformer]. The predictor consists of three Long Short-Term Memory (LSTM) layers with 512-dim hidden size, followed by a 1024-dim FC projection. The joiner network contains one Rectified Linear Unit (ReLU) and one FC layer. The ASR system is trained on 50K hours of manually transcribed data and 1.7M hours of English Facebook video data, using alignment-restricted RNN-T loss [Mahadeokar2021AR-RNNT] and trie-based deep biasing [le21_interspeech]. Our in-house annotated evaluation set has 36k manually transcribed utterances collected from crowd-sourced workers or volunteer participants who have agreed to have their voice activity reviewed and analyzed. The evaluation set has two main domains: 21k open-domain dictation and 15k assistant-domain voice commands.

Task # utter Avg. Len. WER SemDist-RM SemDist-XM SemDist-XC SemDist-XT
User Perception task
UserChoice 36k 9.8 0.28 0.31 0.35 0.32 0.39
UserRating 71k 9.8 0.36 0.47 0.52 0.51 0.59
NLU task
IntentAcc 10k 2.4 0.33 0.32 0.38 0.37 0.37
SemanticParsing(EM) 10k 2.4 0.26 0.28 0.31 0.30 0.31
SemanticParsing(EMTree) 10k 2.4 0.29 0.29 0.34 0.34 0.33
Table 1: Correlation between user perception and downstream NLU tasks and various ASR metrics: WER and four SemDist variants: SemDist-RM (RoBERTa-mean-pooling), SemDist-XM (XLM-mean-pooling), SemDist-XC (XLM-[CLS]), and SemDist-XT (XLM-pairwise-token). Pearson correlation coefficients are reported.
Figure 2: Comparison of the distribution of SemDist and WER for each user rating. The box extends from the lower to upper quartile values, with a line at the median.

3.1 Hypothesis Rating (HypRating)

HypRating consists of 73k user ratings of ASR hypotheses. We collected the HypRating user annotation twice on our 36k evaluation set. The annotators are asked to listen to the audio and rate the hypotheses. There are four rating levels: ‘exact match’, ‘useful hyp’, ‘wrong hyp’, and ‘nonsense hyp’. A ‘useful hyp’ is a hypothesis that has errors but still allows the downstream task to succeed. To quantify hypothesis ratings, we assign the integers 0, 1, 2, and 3 to ‘exact match’, ‘useful hyp’, ‘wrong hyp’, and ‘nonsense hyp’, respectively.

3.2 Side-by-Side Hypothesis Choice (HypChoice)

HypChoice consists of 38k user annotations of ASR hypothesis pairs. The annotators are asked to listen to the audio, choose which of two hypotheses A and B is better, and answer one of three options: ‘hyp A’, ‘hyp B’, or ‘equal’ (both are equally good, or equally bad). To quantify the user’s choice, we assign the integers -1, 1, and 0 to ‘hyp A’, ‘hyp B’, and ‘equal’, respectively.

4 Downstream NLU tasks

4.1 Intent Recognition and Semantic Parsing

We next investigated the correlation of SemDist with three NLU tasks: intent recognition (IntentAcc), semantic parsing (EM), and semantic parsing (EMTree) [GuptaSMKL18]. We used 10k assistant-domain ASR hypotheses generated from our strong baseline ASR system and evaluated them with our NLU system. Details of the ASR and NLU systems are in [kim2021semantic]. Note that the evaluation set for the NLU tasks (10k) is smaller than the original assistant-domain evaluation set (15k) because we selected only the utterances whose annotations (i.e., intent and slots) are available. For the intent recognition task, we used 351 intent types. For the semantic parsing tasks, we used the decoupled semantic representation form [AghajanyanMSDHL20], which allows nested intents and slots. EM is the strictest metric: it is 1 only when all the intents and slots in the utterance are predicted correctly. EMTree is similar to EM but allows ASR errors in recognizing slot tokens.

5 Experiments and Results

Gap WER SemDist Ref/Hyp
12011 16.67 0.00 Ref: hey portal play mister blue sky
Hyp: hey portal play mr blue sky.
8742 50.00 0.00 Ref: I smell hot dogs
Hyp: I smell hotdogs.
6390 6.67 0.01 Ref: keep away it is a nightmare thank God we are separated about four thousand kilometres
Hyp: keep away, it is a nightmare. thank god we are separated about four thousand kilometers.
3807 20.00 0.02 Ref: keep time zones in mind for the next zoom call
Hyp: keep timezones in mind for the next Zoom call.
2725 66.67 3.14 Ref: I’m eagerly waiting
Hyp: I am eagerly waiting.
(a) top-5 (rank_WER - rank_SemDist) gap
Gap WER SemDist Ref/Hyp
2486 6.25 302.73 Ref: of course in the first time but one by one I able to handle those complaints
Hyp: of course, in the first time, birth one by one, I able to handle those complaints.
2388 10.00 473.47 Ref: okay then it’s kind of a great weekly for him
Hyp: okay, then it’s kind of a great victory for him.
2345 10.00 426.67 Ref: do you know the most whom he used to admire
Hyp: do you know the most home he used to admire?
2314 7.69 286.03 Ref: sure but uh I think you will have to be patient because usually it’s something like once a year or a little bit more but not much
Hyp: sure but uh I think you will have to be patient because you read something like once a year or a little bit more but not much.
2300 10.00 375.39 Ref: yeah I think I’ve seen that in the news before
Hyp: yeah, I think I’ve seen that in the morning before
(b) top-5 (rank_SemDist - rank_WER) gap
Table 2: Examples of Ref/Hyp pairs with the biggest gap between WER and SemDist.

5.1 Correlation Results

We first evaluated the correlation of SemDist with the user perception tasks, UserChoice and UserRating, and with the downstream NLU tasks, IntentAcc, SemanticParsing(EM), and SemanticParsing(EMTree). For UserChoice, we correlated the difference in SemDist between the two hypotheses (SemDist_A - SemDist_B) and the difference in WER (WER_A - WER_B) with the user's choice. For NLU tasks, we used 1 - IntentAcc, 1 - EM, and 1 - EMTree to consistently generate positive correlations across all tasks. As seen in Table 1, SemDist correlates significantly more strongly than WER with all user perception tasks as well as downstream NLU tasks. We also observed that XLM-based SemDist correlates better than RoBERTa-based SemDist on all tasks, and pairwise-token-based SemDist shows the highest correlation, especially with the user perception tasks. These results indicate that SemDist can be a better indicator of user perception and downstream task performance than WER, and that using a good semantic embedding is important.
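The UserChoice correlation can be sketched as follows: per hypothesis pair, the metric difference (A minus B) is correlated with the -1/0/1 choice label. The function name and toy inputs are our own illustration:

```python
import numpy as np

def choice_correlation(metric_a, metric_b, choices) -> float:
    """Pearson correlation between per-pair metric differences (A - B)
    and user choices encoded as -1 (hyp A), 0 (equal), 1 (hyp B)."""
    delta = np.asarray(metric_a, float) - np.asarray(metric_b, float)
    return float(np.corrcoef(delta, np.asarray(choices, float))[0, 1])
```

A positive correlation means that when hypothesis A scores worse on the metric (larger distance or error), users tend to choose hypothesis B, which is the sign convention used for all correlations in Table 1.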

5.2 How Do We Interpret SemDist Value?

One possible drawback of the SemDist metric is that its value is less intuitive than WER and harder to interpret. To provide a better understanding of how SemDist values should be interpreted, we show the SemDist distribution for each user rating in Figure 2. We observed that SemDist separates the ratings better than WER.

5.3 How Are WER and SemDist Different?

We next investigated how WER and SemDist evaluate the same hypothesis differently. To do so, we first defined the gap between WER and SemDist as the change in their ranking within the entire evaluation set (36k). We assigned a WER ranking (rank_WER) and a SemDist ranking (rank_SemDist) to each utterance by sorting the WER and SemDist values. Table 2 (a) shows the top-5 (rank_WER - rank_SemDist) gap and Table 2 (b) shows the top-5 (rank_SemDist - rank_WER) gap. We observed that SemDist measures ASR errors more robustly by not penalizing errors that do not hurt sentence meaning (e.g., contractions and compound words), as seen in Table 2 (a). We also found that SemDist detects semantically nonsensical word errors that occur only once within a sentence, as seen in Table 2 (b).
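The rank-gap analysis can be sketched with a double argsort, which converts raw metric values into 0-based ranks (a minimal illustration under our own naming, not the authors' analysis code):

```python
import numpy as np

def rank_gap(wer_vals, semdist_vals) -> np.ndarray:
    """Per-utterance rank difference between the WER and SemDist orderings.
    Rank 0 = lowest (best) value; a large positive entry means WER ranks
    the utterance much worse than SemDist does."""
    wer_rank = np.argsort(np.argsort(wer_vals))
    sem_rank = np.argsort(np.argsort(semdist_vals))
    return wer_rank - sem_rank
```

Sorting utterances by this gap in descending and ascending order surfaces the two failure modes shown in Table 2 (a) and (b), respectively.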

5.4 Modeling User Perception

One promising aspect of SemDist is the potential to create models that predict user satisfaction with voice-driven applications. This can be done by training a model on pairs of SemDist and user rating. We created three linear regression models from 71k pairs of user rating and (1) WER only, (2) SemDist only, and (3) both SemDist and WER. Table 3 compares the R^2, MAE, and MSE of the three models. The results show that SemDist alone achieves an R^2 of 0.35 and significantly outperforms WER alone. Thus, SemDist can be a promising method for estimating user satisfaction without requiring data annotation cost.

WER only SemDist only WER+SemDist
R^2 0.12 0.35 0.36
MAE 0.38 0.29 0.29
MSE 0.30 0.23 0.23
Table 3: Comparison of user perception models with three input cases: (1) WER only, (2) SemDist only, and (3) WER+SemDist.
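The linear-regression setup can be sketched with an ordinary least-squares fit and the three reported scores. The helper below is our own NumPy-based illustration (the paper does not specify the implementation):

```python
import numpy as np

def fit_and_score(features: np.ndarray, ratings: np.ndarray):
    """Least-squares linear fit of user ratings from metric features
    (e.g., a SemDist column, a WER column, or both); returns R^2, MAE, MSE."""
    X = np.column_stack([features, np.ones(len(features))])  # append bias term
    coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    pred = X @ coef
    resid = ratings - pred
    r2 = 1.0 - resid.var() / ratings.var()
    mae = float(np.mean(np.abs(resid)))
    mse = float(np.mean(resid ** 2))
    return r2, mae, mse
```

Fitting the same ratings with a WER-only feature column versus a SemDist-only column and comparing the returned R^2 values reproduces the kind of comparison reported in Table 3.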

6 Conclusion

We evaluated 71k and 36k user-annotated ASR outputs and showed that SemDist correlates significantly more strongly with user perception than the traditional metric, WER. The key aspect of SemDist is measuring the semantic correctness of ASR output using semantic embeddings from a general-purpose pre-trained language model. In addition, we explored various strategies for computing SemDist and found that pairwise-token-based SemDist performs best on user perception of ASR quality. We also showed that SemDist correlates more strongly with downstream NLU tasks as well. Moreover, we demonstrated the potential of SemDist to provide insight into model selection by estimating user perception more accurately. Moving forward, we plan to explore how SemDist can be used to train ASR systems.