Detecting Mismatch between Text Script and Voice-over Using Utterance Verification Based on Phoneme Recognition Ranking

03/20/2020 ∙ by Yoonjae Jeong, et al.

The purpose of this study is to detect mismatches between a text script and its voice-over. To this end, we present a novel utterance verification (UV) method that calculates the degree of correspondence between a voice-over and the phoneme sequence of a script. We found that the phoneme recognition probabilities of exaggerated voice-overs decrease compared to ordinary utterances, but their rankings do not change significantly. The proposed method therefore uses the recognition ranking of each phoneme segment corresponding to a phoneme sequence to measure the confidence of a voice-over utterance for its corresponding script. The experimental results show that the proposed UV method outperforms a state-of-the-art approach that uses cross-modal attention to detect mismatches between speech and transcription.




1 Introduction

The popularization of multimedia and computer games has enabled the emergence of a large number of text scripts and their voice-overs. For example, in massively multiplayer online role-playing games (MMORPGs), non-player characters (NPCs) deliver their messages to users through captions and voice-overs. In several cases, game developers outsource the voice-overs of NPCs’ text scripts to voice actors and manually verify whether the outcomes match their corresponding scripts. This type of manual verification is extremely time-consuming and labor-intensive; therefore, there is a growing need to automate the verification process.

Applying an automatic speech recognition (ASR) system to this task is not considered effective, because such systems commonly produce incorrect output for utterances containing out-of-vocabulary (OOV) words. Moreover, voice-overs in the game domain contain more OOV words than ordinary speech.

Utterance verification (UV) [8] is one of the key technologies that can be used to deal with this problem. As a component of ASR systems, UV prevents an ASR system from producing incorrect recognition results by evaluating the confidence between a user’s utterance and its recognized text. The UV technique can be applied to detect a mismatch between a script and its voice-over. The utterance verifier receives a script and its voice-over (text and waveform) as a pair of inputs. Subsequently, using an acoustic model, the verifier calculates the degree of correspondence between them.

Most acoustic models are trained using read- or natural-style speech databases because such databases are relatively easy to construct and several public corpora are available. However, the speech style in the computer game domain is entirely different: voice actors tend to speak with exaggerated and emotional intonation.

We found that the conventional UV algorithms are not suitable for verifying exaggerated or emotional utterances when the acoustic model is trained on a read- or natural-style speech database. The conventional UV algorithm measures confidence based on the gap in phoneme recognition probabilities between an acoustic model and its anti-phoneme model. However, the difference in speech style reduces the recognition probabilities of exaggerated utterances and therefore narrows the probability gap.

To resolve the aforementioned problem, we devise a novel UV algorithm based on phonemic recognition rankings. It is observed that the phoneme recognition rankings do not significantly change regardless of the speech styles. The proposed UV algorithm calculates the average phoneme recognition ranking of each speech segment of a phoneme sequence corresponding to its text script as the confidence measure.

2 Related Work

Utterance verification (UV) formulates the confidence measuring problem as a statistical hypothesis test of the null hypothesis against the alternative hypothesis [15, 16, 18]. The null hypothesis is that, if a speech recognizer recognizes an utterance as a correct word (or phoneme) sequence, each unit word (or phoneme) of the sequence is also correctly recognized in its corresponding speech segment. The alternative hypothesis is that the words (or phonemes) are wrongly recognized and therefore cannot originate from their speech segments.

Studies concerned with the prediction of automatic speech recognition (ASR) errors have focused on building classifiers that discover incorrect outcomes of an ASR system for given utterances [3, 4, 6, 19, 9, 11, 2, 12, 1]. The classifiers use combined features generated from various sources, including the intermediate results of the decoding process, and usually outperform the UV approaches. However, because these studies aim to determine the accuracy of the ASR result for an input speech, they do not address the goal of the present study.

Huang and Hain [7] recently proposed a new method for detecting a mismatch between speech and transcription using a cross-modal attention mechanism. They also tried to present a robust method that can be applied even in the absence of a well-defined lexicon and large corpus. Because their objective is similar to ours, we use [7] as one of the baselines of our experiments.

3 Proposed Method

The procedure followed by the proposed utterance verification (UV) system is described below. (1) The system extracts features from a given voice-over utterance. The features consist of 13-dimensional Mel-frequency cepstral coefficients (MFCCs) with appended delta and delta-delta values. (2) The system generates the possible phoneme sequences from a given text script using a pronunciation dictionary created with a grapheme-to-phoneme (G2P) method. (3) The forced alignment module finds the best phoneme sequence for the script and its voice-over and aligns that phoneme sequence to the speech features. (4) Next, the system runs the likelihood ratio test, which filters out grossly mismatched pairs of script and voice-over. (5) In the final stage, the proposed average phoneme ranking (APR) based UV validates the voice-overs that passed the previous stage. The system measures the correspondence between each script and its voice-over as the average of the phoneme recognition rankings of each phoneme in the aligned phoneme sequence. The higher the average ranking (that is, the numerically smaller the average rank), the greater the confidence that a script and its voice-over match.
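As a rough illustration of the feature layout in step (1), the sketch below appends delta and delta-delta values to 13-dimensional frames, giving 39 values per frame. It assumes the common regression-based delta formula with a window of 2 and repeats edge frames at the boundaries. The paper does not state these details, and `deltas` and `add_deltas` are hypothetical helpers, not the authors' or Kaldi's code.

```python
from typing import List

Frame = List[float]


def deltas(frames: List[Frame], window: int = 2) -> List[Frame]:
    """Regression-based deltas over time.

    Assumption: frames beyond either edge are replaced by the nearest
    edge frame, and the regression window is 2 frames on each side.
    """
    num_frames = len(frames)
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out: List[Frame] = []
    for t in range(num_frames):
        row: Frame = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                nxt = frames[min(t + n, num_frames - 1)][d]
                prv = frames[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def add_deltas(frames: List[Frame]) -> List[Frame]:
    """Concatenate static, delta, and delta-delta values per frame."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Toy input: 10 frames of 13 coefficients that rise linearly over time.
toy_frames = [[float(t)] * 13 for t in range(10)]
features = add_deltas(toy_frames)
print(len(features[0]))  # 39 values per frame
```

For a linear ramp, the delta in the middle of the sequence equals the slope and the delta-delta is zero, which gives a quick sanity check of the formula.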

3.1 Likelihood Ratio Test (LRT) based Utterance Verification

The likelihood ratio test (LRT) has been widely used as the basis for numerous utterance verification (UV) methods. The LRT is a statistical test that determines the goodness of a null model against an alternative model called the anti-model. We adopt [15] as our baseline LRT-based UV method because it is a representative LRT-based UV study and has also been the basis for several subsequent studies.

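Equation (1) represents the log-likelihood ratio (LLR) between a script $S$ and a voice-over $V$. $\lambda_S$ is a model representing the probability that an acoustic model recognizes $V$ as $S$, and $\bar{\lambda}_S$ is the anti-model of $\lambda_S$. $\log P(V \mid \lambda_S)$ and $\log P(V \mid \bar{\lambda}_S)$ represent the log-likelihoods of $\lambda_S$ and $\bar{\lambda}_S$, respectively. If the LLR is higher than the threshold $\tau_{LLR}$, then $V$ is considered to match $S$.

$$\mathrm{LLR}(S, V) = \log P(V \mid \lambda_S) - \log P(V \mid \bar{\lambda}_S) \tag{1}$$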
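The decision rule above can be sketched in a few lines. This is a minimal illustration, assuming the utterance-level score is the mean of per-segment differences between the model and anti-model log-likelihoods; the paper does not specify how segment scores are aggregated, and `log_likelihood_ratio` and `lrt_accepts` are hypothetical names. The threshold of 1.5 is the value the paper reports for read speech.

```python
from typing import Sequence


def log_likelihood_ratio(
    target_ll: Sequence[float], anti_ll: Sequence[float]
) -> float:
    """Mean per-segment gap between model and anti-model log-likelihoods.

    Assumption: averaging over segments is one plausible normalization,
    chosen here only for illustration.
    """
    if len(target_ll) != len(anti_ll) or not target_ll:
        raise ValueError("need two non-empty sequences of equal length")
    gaps = [t - a for t, a in zip(target_ll, anti_ll)]
    return sum(gaps) / len(gaps)


def lrt_accepts(
    target_ll: Sequence[float], anti_ll: Sequence[float], threshold: float
) -> bool:
    """Accept the script/voice-over pair when the LLR exceeds the threshold."""
    return log_likelihood_ratio(target_ll, anti_ll) > threshold


# Toy log-likelihoods for three phoneme segments (made-up numbers).
target = [-10.0, -12.0, -11.0]
anti = [-13.0, -14.0, -12.0]
print(log_likelihood_ratio(target, anti))        # 2.0
print(lrt_accepts(target, anti, threshold=1.5))  # True
```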


3.2 Average Phoneme Ranking (APR) based Utterance Verification

When we apply the LRT-based utterance verification (UV) to the exaggerated voice-overs, significant performance degradation occurs. Figure 1 shows the distribution of the correct script-voiceover pairs (O) and the incorrect ones (X) on the log-likelihood ratio (LLR) and the APR.

Figure 1: Distribution of the script-voiceover pairs on LLR and APR. Panels: (a) Read Style (LLR), (b) Online Game (LLR), (c) Read Style (APR), (d) Online Game (APR). The solid and dashed lines indicate the correct (O) and incorrect (X) pairs, respectively.

For the read speech, which has the same speech style as the training data of the acoustic model, the LRT-based UV clearly separates the correct and incorrect pairs at a threshold of 1.5 (Figure 1(a)). However, when we apply the same threshold to the exaggerated speech style of the online game, a significant number of correct pairs are determined to be incorrect. Moreover, the distributions of the correct and incorrect pairs overlap significantly (Figure 1(b)). This is because the recognition probabilities decrease when the speech style differs from that of the acoustic model's training data. In contrast, the recognition rankings of phonemes do not change significantly, regardless of the speech style, as shown in Figures 1(c) and 1(d). Only a slight drop in ranking occurs for the correct exaggerated voice-overs.

Based on this observation, we measure the correspondence between a script and a voice-over as the average ranking of phoneme recognition, as given in Equation (2). We call this new UV method the average phoneme ranking (APR) based UV. In the equation below, $p_i$ indicates the $i$-th phoneme of the script $S$, and $v_i$ is the speech features corresponding to the $i$-th phoneme. $\mathrm{rank}(p_i \mid v_i)$ is the phoneme recognition rank of $p_i$ for $v_i$, and $N$ is the number of phonemes in the phoneme sequence. If the APR is less than the threshold $\tau_{APR}$, then the voice-over $V$ is considered to correspond to $S$.

$$\mathrm{APR}(S, V) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rank}(p_i \mid v_i) \tag{2}$$
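To make the ranking measure concrete, the following sketch computes an average phoneme rank from per-segment scores. It assumes each aligned segment comes with a score for every phoneme in the inventory (for example, a log-likelihood), that rank 1 is the best-scoring phoneme, and that ties receive the better rank. The data layout and the function names are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, Sequence


def phoneme_rank(scores: Dict[str, float], target: str) -> int:
    """1-based rank of the target phoneme; 1 means the highest score."""
    target_score = scores[target]
    return 1 + sum(1 for s in scores.values() if s > target_score)


def average_phoneme_ranking(
    phonemes: Sequence[str], segment_scores: Sequence[Dict[str, float]]
) -> float:
    """Average rank of each script phoneme within its aligned segment."""
    if len(phonemes) != len(segment_scores) or not phonemes:
        raise ValueError("need one score table per phoneme")
    ranks = [phoneme_rank(sc, p) for p, sc in zip(phonemes, segment_scores)]
    return sum(ranks) / len(ranks)


# Toy example with a three-phoneme script and a tiny inventory per segment.
script_phonemes = ["k", "ae", "t"]
scores_per_segment = [
    {"k": -1.0, "g": -2.0, "t": -3.0},    # "k" ranks 1st
    {"ae": -2.0, "eh": -1.0, "ih": -3.0},  # "ae" ranks 2nd
    {"t": -5.0, "d": -1.0, "k": -2.0},    # "t" ranks 3rd
]
print(average_phoneme_ranking(script_phonemes, scores_per_segment))  # 2.0
```

Because only the ordering of scores matters, uniformly lower scores for an exaggerated utterance leave the ranks unchanged, which is the property the method relies on.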


3.3 Two-stage APR-based Utterance Verification

We identified some rare cases where the phoneme recognition rankings are high, although the overall recognition probabilities are extremely low. To avoid such cases, we devise a two-stage APR-based UV. Equation (3) shows the two-stage verification method that combines the LRT-based UV and the APR-based UV.

$$\mathrm{APR}_{2\text{-stage}}(S, V) = \begin{cases} R_{\max} & \text{if } \mathrm{LLR}(S, V) \le \tau_{LLR} \\ \mathrm{APR}(S, V) & \text{otherwise} \end{cases} \tag{3}$$

If the LLR value of the LRT-based UV is less than or equal to the threshold $\tau_{LLR}$, the lowest phoneme recognition ranking value ($R_{\max}$) is assigned. Otherwise, the average phoneme ranking calculated via the APR-based UV is used to measure the correspondence between script $S$ and voice-over $V$.
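The two-stage rule can be sketched as a small function. This is an illustration under assumptions: the LLR and APR values are taken as already computed, `worst_rank` stands in for the lowest ranking value (assumed here to be the size of the phoneme inventory), and all names are hypothetical.

```python
def two_stage_apr(
    llr: float, apr: float, llr_threshold: float, worst_rank: float
) -> float:
    """Return the worst rank when the LRT stage rejects, else the APR."""
    if llr <= llr_threshold:
        return worst_rank
    return apr


def accepts(score: float, apr_threshold: float) -> bool:
    """A pair is accepted when its (two-stage) APR is below the threshold."""
    return score < apr_threshold


# A pair with good ranks but a very low LLR is rejected by the first stage.
rejected = two_stage_apr(llr=0.2, apr=2.0, llr_threshold=1.5, worst_rank=40.0)
kept = two_stage_apr(llr=2.3, apr=2.0, llr_threshold=1.5, worst_rank=40.0)
print(accepts(rejected, apr_threshold=4.0))  # False
print(accepts(kept, apr_threshold=4.0))      # True
```

Assigning the worst rank rather than returning a separate flag keeps the output on the same scale as the APR, so a single threshold comparison still decides acceptance.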

4 Experiment

4.1 Experimental Setup

4.1.1 Acoustic Model

We build a deep neural network (DNN) based acoustic model and a Gaussian mixture model (GMM) based acoustic model for our experiment. The former is used for the forced alignment process, and the latter is used for scoring the segmented speech in phoneme units. Although the GMM is a rudimentary acoustic model, it performs well enough for verifying general pronunciation. Moreover, in our test environment, the GMM model performs better than the DNN model for phoneme scoring. Using the Kaldi ASR toolkit [14], we train the English and Korean acoustic models on the LibriSpeech ASR corpus [13] and roughly 793,000 read-style Korean utterances, respectively.

4.1.2 Test Sets

Considering the purpose of the experiment, we employ two types of test sets. The first type is used to compare the proposed method with previous studies. The test set of the WSJ-CAM0 corpus [5] is adopted to compare our method with that of Huang and Hain [7], which is the state of the art for detecting a mismatch between speech and transcription. To create the mismatched samples, we randomly delete, insert, or substitute four words in the original test data. The test data for evaluation contains 331 matched samples and 331 mismatched ones.
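The construction of mismatched transcriptions can be sketched as follows. The sketch assumes the four edited words are chosen at random positions and that inserted or substituted words are drawn from a separate word list; the paper does not say whether the edits are contiguous or where replacement words come from, so `vocab` and the function names are illustrative only.

```python
import random
from typing import List


def delete_words(words: List[str], k: int, rng: random.Random) -> List[str]:
    """Remove k words at random positions."""
    drop = set(rng.sample(range(len(words)), k))
    return [w for i, w in enumerate(words) if i not in drop]


def insert_words(
    words: List[str], k: int, vocab: List[str], rng: random.Random
) -> List[str]:
    """Insert k words drawn from vocab at random positions."""
    out = list(words)
    for _ in range(k):
        out.insert(rng.randint(0, len(out)), rng.choice(vocab))
    return out


def substitute_words(
    words: List[str], k: int, vocab: List[str], rng: random.Random
) -> List[str]:
    """Replace k words at random positions with different words from vocab."""
    out = list(words)
    for i in rng.sample(range(len(out)), k):
        out[i] = rng.choice([v for v in vocab if v != out[i]])
    return out


rng = random.Random(0)  # fixed seed so the toy example is repeatable
sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = ["alpha", "beta", "gamma"]
print(" ".join(delete_words(sentence, 4, rng)))
print(" ".join(insert_words(sentence, 4, vocab, rng)))
print(" ".join(substitute_words(sentence, 4, vocab, rng)))
```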

The second type of test set is used to detect a mismatch between a text script and its voice-over. We create three test sets (DICT01, BNS-1, and BNS-2). DICT01 consists of 1,600 read-style utterances excerpted from a Korean speech DB (DICT01) [17], and BNS-1 comprises pairs of voice-over and script randomly selected from an MMORPG of NCSOFT [10]. BNS-2 comprises 483 voice-overs with various tones and sound effects. To create the mismatched pairs, we assign an arbitrary text script to each voice-over, mimicking the real-world scenario. As a result, we obtain balanced test sets of correct and incorrect pairs, as presented in Table 1.

Test Set Description Correct Incorrect
DICT01 Read-style
BNS-1 Exaggerated-style
BNS-2 + various tones & effects
Table 1: Excerpted test sets from a speech database and an MMORPG
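One simple way to build the incorrect pairs described above is to give every voice-over the script of a different item. The sketch below is only an assumption about how "an arbitrary text script" might be assigned; it works on item indices, and `make_mismatched_pairs` is a hypothetical helper. Note that two different items could still share identical text, which a real pipeline would need to check.

```python
import random
from typing import List, Tuple


def make_mismatched_pairs(
    num_items: int, rng: random.Random
) -> List[Tuple[int, int]]:
    """Pair each voice-over index with the script index of another item."""
    if num_items < 2:
        raise ValueError("need at least two items to create a mismatch")
    pairs = []
    for voice_idx in range(num_items):
        # Draw from the other num_items - 1 indices, skipping voice_idx.
        script_idx = rng.randrange(num_items - 1)
        if script_idx >= voice_idx:
            script_idx += 1
        pairs.append((voice_idx, script_idx))
    return pairs


pairs = make_mismatched_pairs(5, random.Random(1))
print(pairs)  # every tuple has two different indices
```

Producing one incorrect pair per voice-over yields as many incorrect pairs as correct ones, which matches the balanced test sets described in the text.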

4.2 Evaluation

4.2.1 Comparison with Previous Work

We first evaluate the proposed APR-based utterance verification (UV) by comparing it against previous work, as listed in Table 2. We chose the cross-modal attention of Huang and Hain [7] and the LRT-based UV [15] as baselines. [7] is the state of the art based on a deep-learning approach and is used for detecting a mismatch between speech and transcription, and [15] is the conventionally representative UV method. We build the test set from the WSJ-CAM0 corpus in the same way as [7].

Method                   Del    Ins    Sub    Avg
Huang & Hain [7]         0.781  0.792  0.558  0.710
LRT [15]                 0.605  0.798  0.670  0.691
Proposed APR             0.730  0.986  0.918  0.878
Proposed two-stage APR   0.731  0.986  0.920  0.879
Table 2: Comparison of 4-word mismatch detection accuracy for deletions (Del), insertions (Ins), and substitutions (Sub) in the WSJ-CAM0 test set

Overall, our proposed APR-based UV increases the average accuracy by approximately 0.169 (23.8%) and 0.188 (27.2%) compared to [7] and [15], respectively. Compared to [7], our method improves accuracy by approximately 0.194 (24.5%) for insertion errors (Ins) and 0.362 (64.9%) for substitution errors (Sub). For deletion errors (Del), accuracy decreases by approximately 0.05, but this decrease is much smaller than the improvements for the Ins and Sub errors.
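The quoted differences can be reproduced from Table 2 with a few lines of arithmetic. The sketch assumes that the figures are computed from the last row of the table (0.731, 0.986, 0.920, 0.879) and that the percentages are relative to the baseline accuracy; `improvement` is a hypothetical helper.

```python
from typing import Tuple


def improvement(new: float, base: float) -> Tuple[float, float]:
    """Absolute gain and gain relative to the baseline, in percent."""
    diff = new - base
    return round(diff, 3), round(100.0 * diff / base, 1)


# Accuracies copied from Table 2.
huang_hain = {"Ins": 0.792, "Sub": 0.558, "Avg": 0.710}
lrt = {"Avg": 0.691}
proposed = {"Ins": 0.986, "Sub": 0.920, "Avg": 0.879}

print(improvement(proposed["Avg"], huang_hain["Avg"]))  # (0.169, 23.8)
print(improvement(proposed["Avg"], lrt["Avg"]))         # (0.188, 27.2)
print(improvement(proposed["Ins"], huang_hain["Ins"]))  # (0.194, 24.5)
print(improvement(proposed["Sub"], huang_hain["Sub"]))  # (0.362, 64.9)
```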

4.2.2 Performances for Detecting Mismatch between Text Script and Voice-over

On the test sets of text scripts and voice-overs, we compare the performance of the proposed APR-based UV with that of the conventional LRT-based UV. APR is an alternative to the log-likelihood ratio (LLR) of the LRT-based UV. For the evaluation, we apply to each method the optimized threshold (i.e., the LLR threshold for the LRT-based UV and the APR threshold for the APR-based UV) that shows the best performance on each test set. Table 3 presents the results.

Test Set   LRT ACC   LRT threshold   APR ACC   APR threshold   ΔACC
DICT01     0.992     1.5             0.998     4.0             +0.006 (0.6%)
BNS-1      0.930     1.2             0.968     5.0             +0.038 (4.1%)
BNS-2      0.901     1.1             0.959     6.0             +0.058 (6.4%)
  • ACC is the accuracy, and ΔACC is the accuracy improvement of the APR-based UV compared to the LRT-based UV.

Table 3: Comparison between the proposed APR-based UV and the conventional LRT-based UV with the optimized thresholds.
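Selecting an optimized threshold amounts to a small search over candidate values. The sketch below is one plausible way to do it, assuming a fixed grid of candidates and accuracy as the criterion; the paper does not describe its search procedure, and the scores, labels, and function names here are made up for illustration. For APR, a pair is accepted when its score is below the threshold; for LLR, when it is above.

```python
from typing import Sequence, Tuple


def accuracy_at(
    scores: Sequence[float],
    labels: Sequence[bool],
    threshold: float,
    accept_below: bool,
) -> float:
    """Fraction of pairs whose accept/reject decision matches the label."""
    correct = 0
    for score, is_match in zip(scores, labels):
        accepted = score < threshold if accept_below else score > threshold
        correct += int(accepted == is_match)
    return correct / len(scores)


def best_threshold(
    scores: Sequence[float],
    labels: Sequence[bool],
    candidates: Sequence[float],
    accept_below: bool,
) -> Tuple[float, float]:
    """Return the first candidate with the highest accuracy, and that accuracy."""
    best = max(
        candidates,
        key=lambda t: accuracy_at(scores, labels, t, accept_below),
    )
    return best, accuracy_at(scores, labels, best, accept_below)


# Toy APR scores: matched pairs have small average ranks.
apr_scores = [1.5, 2.0, 3.0, 7.0, 9.0, 12.0]
is_match = [True, True, True, False, False, False]
print(best_threshold(apr_scores, is_match, [2.0, 4.0, 6.0, 8.0], True))
# (4.0, 1.0)
```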

For read-style utterances (DICT01), the APR-based UV obtains an accuracy gain of approximately 0.006 (0.6%). The improvement on DICT01 is not significant, but the APR-based UV improves considerably on exaggerated voice-overs: for BNS-1 and BNS-2, our method shows improvements of 0.038 (4.1%) and 0.058 (6.4%), respectively. Because the improvement is larger on BNS-2, the proposed APR-based UV appears to be more robust to a variety of tones and sound effects than the LRT-based UV.

4.2.3 Robustness to Threshold

To inspect how robust the threshold values are across datasets, we investigated the performance drops for the exaggerated voice-overs when the LRT and APR thresholds optimized for read-speech utterances are applied. As detailed in Table 4, the declines in the accuracy of the LRT-based UV on BNS-1 and BNS-2 are -0.117 (-14.4%) and -0.228 (-33.8%), respectively. However, those of the APR-based UV are -0.016 (-1.7%) and -0.059 (-6.6%), respectively, which are considerably smaller than those of the LRT-based UV.

Test Set   LRT ACC   LRT ΔACC          APR ACC   APR ΔACC
BNS-1      0.813     -0.117 (-14.4%)   0.952     -0.016 (-1.7%)
BNS-2      0.674     -0.228 (-33.8%)   0.900     -0.059 (-6.6%)
  • ΔACC indicates the reduction in accuracy from the optimized threshold for the test set.

Table 4: Performance degradation in the exaggerated voice-overs, when applying the optimized thresholds of the read-speech utterances.

4.2.4 Effects of Two-stage Approach

We finally investigate the effect of the proposed two-stage APR-based UV. Table 5 presents its improvement in comparison with a pure APR-based UV. Although the improvement is minimal, the two-stage APR-based UV compensates for a few errors of the pure APR-based UV.

Test Set   APR      Two-stage APR   ΔACC
BNS-1      0.9675   0.9677          +0.0002
BNS-2      0.9592   0.9598          +0.0006
Table 5: Performance improvement of the two-stage APR-based UV.

5 Conclusions and Future Work

In this paper, we proposed an APR-based utterance verification (UV) method. Our experimental results showed that the proposed method outperforms both the state of the art and the conventional LRT-based UV when detecting mismatches between speech and transcriptions. Additionally, our method showed only a small amount of performance degradation on exaggerated voice-overs, even when its threshold was optimized for read-style utterances.

Although the proposed method showed encouraging results compared to previous approaches, it still has some limitations. One limitation concerns the handling of deletion errors: the proposed APR-based UV method performed worse than the state of the art on speech and transcript pairs with missing words. The other limitation concerns the handling of laughing-style utterances. Because laughing-style utterances are pronounced differently depending on the situation, transcribing them into proper phoneme sequences is a challenging task. We are currently working on these two issues as future research.


References

  • [1] R. Errattahi, S. Deena, A. E. Hannani, H. Ouahmane, and T. Hain (2018) Improving ASR Error Detection with RNNLM Adaptation. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 190–196.
  • [2] R. Errattahi, A. E. Hannani, H. Ouahmane, and T. Hain (2016) Automatic speech recognition errors detection using supervised learning techniques. In Proceedings of the 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), pp. 1–6.
  • [3] J. Fayolle, F. Moreau, C. Raymond, G. Gravier, and P. Gros (2010) CRF-based Combination of Contextual Features to Improve A Posteriori Word-level Confidence Measures. In Proceedings of Interspeech 2010, pp. 1942–1945.
  • [4] M. Gibson and T. Hain (2012) Application of SVM-based correctness predictions to unsupervised discriminative speaker adaptation. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4341–4344. ISSN 2379-190X.
  • [5] T. Hain and O. Saz (2013) Factored WSJ-CAM0 Speech Corpus. [Online]. Accessed: 2019-10-16.
  • [6] P. Huang, K. Kumar, C. Liu, Y. Gong, and L. Deng (2013) Predicting speech recognition confidence using deep learning with word identity and score features. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7413–7417. ISSN 1520-6149.
  • [7] Q. Huang and T. Hain (2019) Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention. In Proceedings of Interspeech 2019, pp. 584–588.
  • [8] H. Jiang (2005) Confidence measures for speech recognition: A survey. Speech Communication 45 (4), pp. 455–470. ISSN 0167-6393.
  • [9] M. L. Korenevsky, A. B. Smirnov, and V. S. Mendelev (2015) Prediction of Speech Recognition Accuracy for Utterance Classification. In Proceedings of Interspeech 2015, pp. 1275–1279.
  • [10] NCSOFT. Blade & Soul. [Online]. Accessed: 2019-10-16.
  • [11] A. Ogawa and T. Hori (2015) ASR error detection and recognition rate estimation using deep bidirectional recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4370–4374. ISSN 1520-6149.
  • [12] A. Ogawa and T. Hori (2017) Error detection and accuracy estimation in automatic speech recognition using deep bidirectional recurrent neural networks. Speech Communication 89, pp. 70–83. ISSN 0167-6393.
  • [13] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210.
  • [14] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motliček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý (2011) The Kaldi Speech Recognition Toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
  • [15] M. G. Rahim, C. Lee, and B. Juang (1997) Discriminative utterance verification for connected digits recognition. IEEE Transactions on Speech and Audio Processing 5 (3), pp. 266–277. ISSN 1063-6676.
  • [16] R. C. Rose, B. Juang, and C. Lee (1995) A training procedure for verifying string hypotheses in continuous speech recognition. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 281–284. ISSN 1520-6149.
  • [17] SiTEC. Speech Corpora. [Online]. Accessed: 2019-10-16.
  • [18] R. A. Sukkar and C. Lee (1996) Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition. IEEE Transactions on Speech and Audio Processing 4 (6), pp. 420–429. ISSN 1063-6676.
  • [19] Y. Tam, Y. Lei, J. Zheng, and W. Wang (2014) ASR error detection using recurrent neural network language model and complementary ASR. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2312–2316. ISSN 1520-6149.