Utterance-level neural confidence measure for end-to-end children speech recognition

09/16/2021
by   Wei Liu, et al.
The Chinese University of Hong Kong

Confidence measure is a performance index of particular importance for automatic speech recognition (ASR) systems deployed in real-world scenarios. In the present study, utterance-level neural confidence measure (NCM) in end-to-end automatic speech recognition (E2E ASR) is investigated. The E2E system adopts the joint CTC-attention Transformer architecture. The prediction of NCM is formulated as a binary classification task, i.e., accepting or rejecting the input utterance, based on a set of predictor features acquired during the ASR decoding process. The investigation focuses on evaluating and comparing the efficacies of predictor features derived from different internal and external modules of the E2E system. Experiments are carried out on children speech, for which state-of-the-art ASR systems show less than satisfactory performance and a robust confidence measure is particularly useful. It is found that predictor features related to acoustic information of speech play a more important role in estimating the confidence measure than those related to linguistic information, and that n-best score features perform significantly better than single-best ones. It is also shown that the EER and AUC metrics are not appropriate for evaluating the NCM of a mismatched ASR with a significant performance gap.


1 Introduction

Speech-enabled interactive systems, e.g., voice assistants, intelligent recorders, smart loudspeakers, etc., have been widely integrated into our daily life. Automatic Speech Recognition (ASR) is the most important technology underlying these systems. State-of-the-art ASR systems have been steadily improving in terms of recognition accuracy and processing latency. The system design has evolved from Hidden Markov Models (HMM) and hybrid Deep Neural Network and HMM models (DNN-HMM) to attention-based end-to-end (E2E) neural models, which require a substantially larger amount of labelled training data. Despite these continual efforts, the performance of an ASR system inevitably degrades in adverse and unstable acoustic conditions, e.g., high-intensity noise, atypical pronunciation and/or speaking style, and inadequately represented speaker groups. The present research is focused on children's speech.

Apart from the word error rate, confidence measure (CM) is a performance index of particular importance for ASR systems deployed in real-world scenarios. The confidence value for an input speech utterance indicates to what degree the user can trust the ASR result. CM is useful in many downstream tasks of ASR. For example, it can be used to determine whether an (untranscribed) input utterance should be utilized for speaker adaptation [26]. CM is also useful in semi-supervised training [1], active learning [8], spoken dialogue systems [6, 25], and intelligent audio stream allocation in distributed ASR systems [13].

Numerous approaches have been investigated for word-level CM in the context of HMM-based ASR [9]. Most commonly, the CM is predicted by a binary classifier from features extracted during ASR decoding. The predictor features can be obtained from the ASR output lattice, for example, word posterior probability [5], word trellis stability [21], normalized acoustic likelihood and language model score. For the classifier models, linear discriminant functions [24], Gaussian mixture classifiers [3], decision trees [17] and neural networks [29, 10, 14, 11] have been most commonly adopted.

An E2E ASR system is trained to realize sequence-to-sequence mapping, typically via an encoder-decoder network [2, 12, 4]. Softmax probabilities in the auto-regressive decoder are commonly regarded as an intuitive measure of confidence on this mapping [19]. However, the softmax probability was found to be unreliable and may perform poorly due to the overconfident behaviour of E2E models [7, 15]. To alleviate this unreliability, a neural network can be trained independently to predict a softmax temperature value that re-distributes the original output probabilities at each time step of decoding [30]. In [15], a lightweight neural network was used to estimate a neural confidence measure (NCM), which was shown to be more reliable than directly using the softmax probability. In [13], an NCM module was developed to predict utterance-level confidence measures in the context of small-footprint E2E ASR. With predictor features extracted from the encoder, the decoder and the attention blocks of the E2E system, the NCM significantly outperformed the conventional word density confidence measure (WDCM) [20] and the beam-scatter weighted WDCM.

The predictor features play a critical role in the design of robust NCM modules. In the present study, we focus on investigating predictor features that are discriminative for confidence measure and generalize well to other domains. Being discriminative means that the predictor features are effective in differentiating erroneous ASR outputs from correct ones. The "beam scores" feature investigated in [13] is extended, and the efficacies of the acoustic-related and linguistic-related score components are examined separately.

Figure 1: Neural confidence measure prediction in E2E ASR

To our knowledge, this study is the first to explore utterance-level NCM in E2E ASR for children speech. Compared with adult speech, children speech is less studied and has recently attracted much research interest [22, 23]. ASR systems for children speech are more likely to generate erroneous output, making confidence measure an even more relevant issue than for adult speech. We also investigate the robustness of the NCM to varying input speech conditions and its transferability to an out-of-domain E2E adult ASR. In this paper, E2E ASR refers specifically to the Transformer-based joint CTC-attention speech recognition system.

2 Neural Confidence Measure Module

As shown in Figure 1, the NCM module is built on top of a properly trained E2E ASR system. For each input utterance decoded by the ASR system, the NCM module generates a confidence score based on a set of predictor features acquired from the ASR decoding process. The confidence score is compared against a threshold to determine whether the decoded text should be accepted or rejected.

Figure 2: Diagram of E2E speech recognition system

2.1 End-to-end speech recognition system

In this study, the E2E ASR system is based on an encoder-decoder model with the joint CTC-attention learning framework. As shown in Figure 2, the system consists of three components: a shared encoder, an attention decoder and a connectionist temporal classification (CTC) loss layer. An input sequence of acoustic features, denoted as X = (x_1, ..., x_T), is encoded by the shared encoder into a hidden vector sequence H = (h_1, ..., h_T'). The hidden sequence is then processed in parallel by the attention decoder and the CTC loss layer to generate a sequence of output tokens Y = (y_1, ..., y_L). The three components are jointly optimized in training.

To take full advantage of both the CTC and attention mechanisms, a multi-task learning (MTL) based loss function is used [12],

L_MTL = λ L_CTC + (1 − λ) L_ATT        (1)

where L_CTC denotes the CTC loss and L_ATT denotes the attention loss. The value of λ is between 0 and 1. The CTC objective acts as an auxiliary task that helps speed up the alignment process at both training and decoding stages, while the attention decoder relieves the conditional independence assumption made by CTC.
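As a concrete illustration, the MTL objective in Eq. (1) can be sketched in a few lines of PyTorch; the function name and the default λ value below are illustrative choices, not settings from the paper.

import torch

# Sketch of the joint CTC-attention loss in Eq. (1), assuming the two
# branch losses have already been computed as scalar tensors.
def mtl_loss(ctc_loss: torch.Tensor,
             att_loss: torch.Tensor,
             lam: float = 0.3) -> torch.Tensor:
    """L_MTL = lam * L_CTC + (1 - lam) * L_ATT, with 0 <= lam <= 1.
    The default lam is illustrative only."""
    assert 0.0 <= lam <= 1.0
    return lam * ctc_loss + (1.0 - lam) * att_loss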

2.2 Predictor features

Different network structures, such as the feed-forward network (FFN) [15, 30], the recurrent neural network (RNN) [10, 11] and the self-attention Transformer [13], could be applied to realize the NCM module. In this study, a residual FFN with three hidden layers is adopted as the classification model. Our main focus is on selecting better predictor features.

The design of predictor features depends highly on the ASR decoding process. In our E2E ASR system, the beam search algorithm is adopted to perform one-pass decoding [28], in which the decoder computes a score for each partial hypothesis. In practice, external language models, e.g., an n-gram and an RNN language model (rnnlm), are employed via shallow fusion during decoding. The score given to a recognition hypothesis is a weighted combination of four components, namely the CTC score s_ctc, the attention score s_att, the n-gram score s_ngram and the rnnlm score s_rnnlm. Each component can be interpreted as the log probability of forming the decoded token sequence according to a different information source, and the hypothesis with a higher score is more likely to be selected. Here, the term "token" refers to the decoded character at each time step; each hypothesis is made up of a sequence of tokens. To speed up the search, only a limited number of partial hypotheses are retained at each time step, according to the beam width setting. An n-best list is generated at the end of decoding, containing the n complete hypotheses with the highest scores. The best hypothesis is the final output of ASR, i.e.,

Ŷ = argmax_{Y ∈ H} [ λ1 s_ctc(Y) + λ2 s_att(Y) + λ3 s_ngram(Y) + λ4 s_rnnlm(Y) ]        (2)

where H denotes the set of complete hypotheses and λ1, ..., λ4 are the component weights in the range of [0, 1]. Specifically, s_ctc and s_att refer to the negative CTC loss and attention loss respectively.
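A minimal sketch of the hypothesis scoring in Eq. (2) is given below, assuming the four component scores are already length-normalized log probabilities; the weight values are hypothetical and merely normalized for illustration.

import numpy as np

def hypothesis_score(s_ctc, s_att, s_ngram, s_rnnlm,
                     weights=(0.3, 0.4, 0.15, 0.15)):
    """Weighted log-probability combination used to rank hypotheses.
    The weights here are illustrative, normalized to sum to 1."""
    lam = np.asarray(weights)
    assert np.isclose(lam.sum(), 1.0) and np.all((0 <= lam) & (lam <= 1))
    return lam @ np.array([s_ctc, s_att, s_ngram, s_rnnlm])

# The final ASR output is the complete hypothesis with the highest score:
# best = max(hypotheses, key=lambda h: hypothesis_score(*h.scores))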

Intuitively, all four component scores are potentially contributive to estimating the confidence score. Note that length-normalized scores are used, to avoid the influence of varying utterance length. Speech recognition models generally tend to assign an extreme score value to the best hypothesis, which may adversely influence the classification judgment; the likelihoods of the top-n best hypotheses can therefore be utilized to produce a more robust confidence score. For the best hypothesis, apart from the list of utterance-level scores, more information can be drawn from the score distribution over the whole vocabulary at each decoder time step. Due to the very large ASR vocabulary size and the limitation of beam search decoding, only the top-k score logits are kept to compute the softmax probability distribution and the average token entropy. In previous works [13, 30], internal neural features of E2E ASR were studied. Embedding features extracted at the last layers of the encoder and the decoder were shown to be useful for confidence score estimation, given that they contain rich acoustic and linguistic information respectively. A token duration feature was also suggested, based on the observation that tokens with short duration are prone to recognition errors. Table 1 lists the above predictor features, which are investigated in the following experiments.

Predictor feature        Feature form      Notation
CTC score                scalar            s_ctc
attention score          scalar            s_att
ngram score              scalar            s_ngram
rnnlm score              scalar            s_rnnlm
average token duration   scalar            dur
average token entropy    scalar            ent
n-best scores            vector            –
encoder embedding        vector sequence   E_enc
decoder embedding        vector sequence   E_dec
top-k score logits       vector sequence   –
Table 1: Basic predictor features derived from the E2E ASR decoding process for NCM.
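To make the feature set concrete, the sketch below assembles an utterance-level predictor vector from the quantities in Table 1; all field names, shapes and the mean-pooling of the embedding sequences are assumptions made for illustration, not the paper's exact recipe.

import numpy as np

def build_features(hyp, nbest_scores, enc_emb, dec_emb, topk_logits):
    """hyp: dict of per-component scores of the 1-best hypothesis;
    nbest_scores: list of weighted scores of the n-best hypotheses;
    enc_emb / dec_emb: (T, D) numpy embedding sequences, mean-pooled here;
    topk_logits: (T, k) top-k score logits per decoder time step."""
    # softmax over the top-k logits at each time step
    probs = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # average token entropy over all decoder time steps
    avg_entropy = -(probs * np.log(probs + 1e-10)).sum(-1).mean()
    return np.concatenate([
        [hyp["ctc"], hyp["att"], hyp["ngram"], hyp["rnnlm"]],
        [hyp["avg_token_dur"], avg_entropy],
        nbest_scores,
        enc_emb.mean(axis=0),
        dec_emb.mean(axis=0),
    ])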

3 Experimental Setup

3.1 Data sets

The datasets used in the experiments on NCM are from the 2021 SLT Children Speech Recognition Challenge (CSRC) [31]. The CSRC provided both adult and children speech data, each part being divided into training, validation (denoted as dev) and test (denoted as test) sets [18]. Additional evaluation sets released by the CSRC, namely children read speech (eval_child_read) and children conversational speech (eval_child_convers), are also used in this study. The NCM module is trained on the dev part of the children speech in the CSRC dataset. The other parts of the children speech, i.e., test, eval_child_read and eval_child_convers, as well as the adult test speech, i.e., test_adult_1w, are used for NCM evaluation. test_adult_1w is a 10,000-utterance subset of the original adult test set (the suffix "1w" denotes 10,000). Table 2 gives a summary of the above data sets. For each of them, the decoding accuracy produced by the child ASR is given in terms of character error rate (CER) and sentence error rate (SER). The child ASR is an E2E system trained for children speech, as described in the next section.

        Data set             Hours   CER%   SER%
Train   dev                  5       22.0   66.4
Eval    test                 6       20.1   63.9
        eval_child_read      10      9.1    41.1
        eval_child_convers   10      35.3   87.6
        test_adult_1w        10      26.1   84.5
Table 2: Speech data sets used in the experiments on NCM.

3.2 End-to-end ASR system for children speech

Utterance-level NCM is evaluated with an E2E ASR system trained to generate Chinese characters from Mandarin speech [18]. Input features of the ASR system comprise filter-bank features and pitch features. Both the shared encoder and the attention decoder adopt the self-attention Transformer structure [27]. Two versions of the E2E model are involved in the following experiments. The first, referred to here as the adult ASR, is trained only on adult read speech, i.e., the training set of the CSRC adult speech. This system is then fine-tuned with children speech in both read and conversational speaking styles, i.e., the training set of the CSRC children speech, yielding the child ASR.

3.3 Evaluation metric

The prediction of NCM is performed as a binary classification task; thus the Equal Error Rate (EER) and the area under the ROC curve (AUC) are adopted for performance evaluation. Both metrics have been commonly used in previous studies on confidence measures. EER refers to the error rate achieved at the operating threshold where the false acceptance and false rejection rates are equal. AUC measures the average classification performance over the full range of operating thresholds. Perfect performance is attained when the EER equals 0 or the AUC equals 1.
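Both metrics can be computed from the ROC curve; a small sketch using scikit-learn, with labels 1 (accept) and 0 (reject) and NCM outputs as scores, is shown below.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def eer_and_auc(labels: np.ndarray, scores: np.ndarray):
    """Return (EER, operating threshold at the EER point, AUC)."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # where FA rate ~= FR rate
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return eer, thresholds[idx], roc_auc_score(labels, scores)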

3.4 Loss function for NCM module training

The training labels for the NCM module are determined as follows: if the ASR decoding result of an utterance perfectly matches its ground-truth transcription, the label for the NCM classifier is set to 1; otherwise it is set to 0. For a binary classification problem, the binary cross-entropy (BCE) loss is the default choice. Our preliminary experiments showed, however, that to obtain a stable operating threshold (for determining the EER) that stays approximately equal to 0.5 across different datasets, the weighted focal loss [16] is a better choice: it not only handles the class imbalance issue (e.g., an SER of 66.4% on dev means that 66.4% of the utterances carry the label 0), but also pays more attention to hard samples. In the present study, the weighted focal loss is therefore adopted as the loss function for NCM module training.
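A common formulation of the weighted binary focal loss, following Lin et al. [16], is sketched below; the α and γ values are illustrative defaults, not the paper's settings.

import torch

def weighted_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                        alpha: float = 0.25, gamma: float = 2.0):
    """Binary focal loss on raw classifier logits; targets are float 0/1."""
    bce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    p_t = torch.exp(-bce)                  # prob. assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # down-weight easy samples via (1 - p_t)^gamma, re-weight classes via alpha_t
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()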

4 Results and Analysis

4.1 Utterance-level scores

The results of utterance-level NCM prediction with scores from the top-1 hypothesis are shown in Table 3. The performance with the four component scores described in Section 2.2 is compared. The best performance was achieved with the attention score s_att. The two language model scores are clearly inferior to the CTC and attention scores. This suggests that acoustic information is more pertinent to estimating the confidence score, and that linguistic information alone is not sufficient. As shown in Table 3, the weighted sum score s_sum, which takes direct effect in ASR decoding, does not perform well (EER: 0.2852; AUC: 0.7842) in comparison with s_ctc and s_att.

Predictor feature   EER      threshold   AUC
s_ctc               0.2445   0.4848      0.8353
s_att               –        0.4848      –
s_ngram             0.3847   0.5152      0.6560
s_rnnlm             0.4440   0.5152      0.5777
s_sum               0.2852   0.5051      0.7842
Table 3: NCM results with utterance-level scores of the best hypothesis on the test set. s_sum denotes the weighted sum of the above four component scores.

4.2 NCM based on n-best scores

The experimental results with the n-best score features are shown by the EER plots in Figure 3. By incorporating multiple hypotheses into the predictor features, the NCM tends to perform better. Using the 10-best weighted score achieves the best EER of 0.1946, which significantly surpasses the 1-best weighted score (EER: 0.2852). This suggests that the n-best score feature captures a more discriminative pattern that helps the classification. Notably, the n-best weighted score outperforms the single attention score once multiple hypotheses are incorporated. Among the different settings of n, the 10-best score is considered the preferred feature, as it achieves an EER similar to the best setting together with a higher AUC.

Figure 3: Performance of the n-best score features in terms of EER on the test set.
Predictor feature                            EER      AUC
1-best score                                 0.2852   0.7842
1-best score, learned weights                0.2141   0.8570
1-best score, adaptive weights               0.2152   0.8607
10-best score                                0.1946   0.8901
10-best score, adaptive weights              0.1929   0.8917
encoder embedding E_enc                      0.2647   0.8150
decoder embedding E_dec                      –        –
average token duration dur                   0.4499   0.5668
top-k score logits (largest k): entropy      0.3730   0.6810
  + learned temperature τ                    0.2894   0.7802
  + adaptive temperature τ_ada               0.2300   0.8519
top-k score logits (medium k): entropy       0.3097   0.7508
  + learned temperature τ                    0.2803   0.7940
  + adaptive temperature τ_ada               0.2192   0.8616
top-k score logits (smallest k): entropy     0.2777   0.7941
  + learned temperature τ                    0.2681   0.8093
  + adaptive temperature τ_ada               0.2126   0.8662
P                                            0.2342   0.8497
P̄                                            0.2194   0.8660
P̄, adaptive temperature τ_ada                0.2097   0.8751
Table 4: NCM results of the n-best scores, E_enc, E_dec, dur, and the top-k score logits features and their variants, on the test set. "Learned weights" means the component weights are jointly learnable via the network; "adaptive weights" means the weights are adaptively adjusted based on the component scores. In the rows of top-k score logits, τ means the score logits are scaled by a learnable temperature value, while τ_ada means the temperature is adaptively adjusted according to the decoder embedding at each decoder time step via an independent neural network. P denotes the sequence of softmax probability distributions of the score logits and P̄ represents the averaged vector.

4.3 Other predictor features and variants

It is worth noting that the component weights can also be adjusted automatically: they can be treated as learnable parameters jointly optimized in model training, either in a fixed way (jointly learned weights) or an adaptive way (weights adaptively adjusted based on the component scores), as illustrated in the first block of Table 4. A large performance gain is attained once the weights are learned (e.g., the 1-best EER improves from 0.2852 to 0.2141), suggesting that balancing the component scores helps a lot. The lowest EER and the highest AUC are achieved with the adaptively weighted 10-best feature.
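One plausible way to make the component weights jointly learnable is sketched below; the softmax parameterization is our assumption rather than the paper's exact design, and the adaptive variant would instead predict the weights from the component scores with a small network.

import torch
import torch.nn as nn

class LearnableScoreCombiner(nn.Module):
    """Sketch of the learned-weights variant: combine the four component
    scores with weights kept positive and normalized via a softmax."""

    def __init__(self, n_components: int = 4):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_components))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        """scores: (batch, 4) tensor of [s_ctc, s_att, s_ngram, s_rnnlm]."""
        weights = torch.softmax(self.logits, dim=0)
        return scores @ weights          # (batch,) weighted score feature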

In the middle part of Table 4, the encoder and decoder embedding features E_enc and E_dec are compared. The decoder embedding slightly outperforms the encoder one; as a matter of fact, the attention decoder has already attended to the encoder embedding. Both embeddings, however, are prone to overfitting. The average token duration dur is represented by the length ratio of the output token sequence to the input feature sequence. It does not provide any discriminative information for confidence measure.

The lowest part of Table 4 shows the performance of several feature variants related to the top-k score logits at each decoding time step of the best hypothesis. The entropy measures the uncertainty of assigning the maximum index of a softmax probability distribution as the decoding output: the recognition result would be highly uncertain if the distribution is close to uniform. The average entropy of the token distributions is therefore expected to reflect the confidence of an output hypothesis. We experiment with different values of k. The entropy computed with a smaller k, i.e., fewer top-k score logits, consistently outperforms that computed with a larger k; the smallest k used here is exactly equal to the beam width setting. Furthermore, the score logits can be multiplied by a constant temperature value (denoted as τ), determined by jointly optimizing the NCM module, in order to sharpen or smooth the distribution. Inspired by the technique used in [30], we can also train an additional neural network to predict a dynamic softmax temperature that takes different values at different decoder time steps (denoted as τ_ada). Input to this predictive network is the decoder embedding E_dec, which contains both acoustic and linguistic information related to the corresponding time step. It can be observed that τ_ada always performs better than τ across different settings of k, and both temperature scaling approaches consistently outperform the vanilla entropy feature. Among the top-k score logits variants, the adaptively tempered entropy achieves performance comparable to the probability-distribution features, even though the entropy-based feature is only a scalar.
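The entropy and temperature-scaling variants can be sketched as follows; the multiplicative temperature matches the description above, while the tensor shapes and function name are assumptions.

import torch

def avg_token_entropy(topk_logits: torch.Tensor,
                      temperature: torch.Tensor = None) -> torch.Tensor:
    """topk_logits: (T, k) score logits of the best hypothesis;
    temperature: scalar (τ) or (T, 1) per-step tensor (τ_ada) that
    sharpens or smooths the distribution before the softmax."""
    if temperature is not None:
        topk_logits = topk_logits * temperature
    probs = torch.softmax(topk_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-10)).sum(dim=-1)  # (T,)
    return entropy.mean()  # scalar confidence-related feature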

4.4 Fusion of predictor features and robustness test

As shown in the top part of Table 5, the fusion of different predictor features is evaluated on the test set of children speech. Generally speaking, fusing multiple features is beneficial and gives better performance than using them individually. The fusion of the adaptively weighted n-best score and the top-k features achieves the best NCM quality (EER: 0.1811; AUC: 0.9003); these are exactly the two most promising features observed in Table 4.

The robustness of NCM is investigated in three different speech domains, namely children read speech, children conversational speech and adult read speech (Table 2). The results are reported in Table 5. It can be observed that the performance of the NCMs is in general slightly better on the eval_child_convers set than on the test set, yet shows an EER/AUC degradation of about 0.02 on the eval_child_read set. An unexpectedly large performance gain is attained when the NCM is applied to adult speech. As can be seen, the NCM tends to perform better on data sets that have higher sentence error rates (SER). This is probably because a high SER leads to an imbalanced distribution of erroneous and correct transcriptions, and this imbalance yields a low EER since it becomes easier to separate the two classes.

                      Fusion (a)       Fusion (b)       Fusion (c)       Fusion (d)
Evaluation set        EER     AUC      EER     AUC      EER     AUC      EER     AUC
test                  0.1846  0.8980   0.1841  0.8939   0.1811  0.9003   0.1830  0.8949
eval_child_read       0.2083  0.8777   0.2097  0.8753   0.2034  0.8797   0.2091  0.8736
eval_child_convers    0.1788  0.9021   0.1755  0.9005   0.1788  0.9039   0.1765  0.9047
test_adult_1w         0.1641  0.9193   0.1607  0.9174   0.1712  0.9152   0.1748  0.9106
Table 5: NCM performance obtained by fusing different combinations of the predictor features in Table 4; each column corresponds to one fusion.
Evaluation set        decoded by child ASR      decoded by adult ASR
test                  EER 0.1946, AUC 0.8901    –
eval_child_read       EER 0.2272, AUC 0.8602    –
eval_child_convers    EER 0.1867, AUC 0.8862    –
test_adult_1w         EER 0.2153, AUC 0.8654    –
Table 6: Performance comparison of the NCM under two different ASR systems. The NCM is trained to match the child ASR.

4.5 Confidence measure on mismatched ASR

Different ASR systems may exhibit different decoding behaviours, and a confidence measure module is typically designed for a specific ASR system. The transferability of the proposed NCM module, i.e., how well it can be used with another ASR system, is investigated in this section. The adult ASR is taken as a mismatched system against the child ASR, for which the NCM was trained. The performance of the NCM with the 10-best score feature is shown in Table 6. When the evaluation sets are decoded by the mismatched adult ASR, a clear EER improvement is observed on the test set, while the EER degrades on the test_adult_1w set. Decoding children speech utterances with the adult ASR produces more erroneous transcriptions (a higher SER), resulting in a more imbalanced class distribution. In this case, EER/AUC does not seem appropriate for performance comparison.

Motivated by the work in [15], we evaluate the performance of the NCM by plotting the CER of filtered utterances with respect to the confidence threshold. Utterances with confidence scores higher than a specific threshold are selected to form the set of filtered utterances. Since a good NCM should exhibit a strong correlation with the CER, i.e., a higher threshold should result in a set of utterances with lower CER, a monotonically decreasing relation is expected. As shown in Figure 4, most curves do show a monotonically descending trend. Nevertheless, two prominent spikes are noted in the high-confidence region for the children speech in the test and eval_child_convers sets decoded by the adult ASR. The spikes reveal an over-confidence behaviour related to transferability. We suspect this transferability issue is closely related to decoding children conversational speech, which is the speech type common to the test and eval_child_convers sets.

Figure 4: CERs of filtered utterances w.r.t. the confidence threshold for different evaluation sets decoded by the child ASR (suffix C, circle markers) and the adult ASR (suffix A, triangle markers).
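The filtered-CER analysis behind Figure 4 can be reproduced with a few lines; the per-utterance edit distances and reference lengths are assumed to be available, and the threshold grid is an illustrative choice.

import numpy as np

def filtered_cer_curve(scores, edit_dist, ref_len, thresholds=None):
    """Keep utterances whose confidence exceeds each threshold and
    compute the CER of the kept subset; a good NCM should yield a
    (roughly) monotonically decreasing curve."""
    scores = np.asarray(scores)
    edit_dist = np.asarray(edit_dist)
    ref_len = np.asarray(ref_len)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)
    curve = []
    for t in thresholds:
        keep = scores >= t
        if keep.sum() == 0:
            curve.append(np.nan)   # no utterance survives this threshold
        else:
            curve.append(edit_dist[keep].sum() / ref_len[keep].sum())
    return thresholds, np.array(curve)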

5 Conclusions

This research investigated the efficacy of NCMs derived from different predictor features in E2E ASR systems. It is found that properly balanced weights on the CTC score, the attention score and the language model scores play a critical role in the reliability of the confidence measure. Incorporating the n-best hypothesis scores leads to further improvement. In addition, the average token entropy with an adaptive softmax temperature is demonstrated to be effective, and the fusion of these features achieves better performance still. Experimental results also suggest that the EER/AUC metrics are not sufficient to evaluate NCM performance on a mismatched ASR with a large SER difference.

References

  • [1] H. Y. Chan and P. Woodland (2004) Improving broadcast news transcription by lightly supervised discriminative training. In Proc. of ICASSP, Vol. 1, pp. I-737.
  • [2] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. of ICASSP, pp. 4960–4964.
  • [3] B. Chigier (1992) Rejection and keyword spotting algorithms for a directory assistance city name recognition application. In Proc. of ICASSP, Vol. 2, pp. 93–96.
  • [4] L. Dong, S. Xu, and B. Xu (2018) Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition. In Proc. of ICASSP, pp. 5884–5888.
  • [5] G. Evermann and P. C. Woodland (2000) Large vocabulary decoding and confidence estimation using word posterior probabilities. In Proc. of ICASSP, Vol. 3, pp. 1655–1658.
  • [6] T. J. Hazen, S. Seneff, and J. Polifroni (2002) Recognition confidence scoring and its use in speech understanding systems. Computer Speech & Language 16 (1), pp. 49–67.
  • [7] D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
  • [8] J. Huang, R. Child, V. Rao, H. Liu, S. Satheesh, and A. Coates (2016) Active learning for speech recognition: the power of gradients. arXiv preprint arXiv:1612.03226.
  • [9] H. Jiang (2005) Confidence measures for speech recognition: a survey. Speech Communication 45 (4), pp. 455–470.
  • [10] K. Kalgaonkar, C. Liu, Y. Gong, and K. Yao (2015) Estimating confidence scores on ASR results using recurrent neural networks. In Proc. of ICASSP, pp. 4999–5003.
  • [11] A. Kastanos, A. Ragni, and M. J. Gales (2020) Confidence estimation for black box automatic speech recognition systems using lattice recurrent neural networks. In Proc. of ICASSP, pp. 6329–6333.
  • [12] S. Kim, T. Hori, and S. Watanabe (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. of ICASSP, pp. 4835–4839.
  • [13] A. Kumar, S. Singh, D. Gowda, A. Garg, S. Singh, and C. Kim (2020) Utterance confidence measure for end-to-end speech recognition with applications to distributed speech recognition scenarios. In Proc. Interspeech.
  • [14] Q. Li, P. Ness, A. Ragni, and M. J. Gales (2019) Bi-directional lattice recurrent neural networks for confidence estimation. In Proc. of ICASSP, pp. 6755–6759.
  • [15] Q. Li, D. Qiu, Y. Zhang, B. Li, Y. He, P. C. Woodland, L. Cao, and T. Strohman (2020) Confidence estimation for attention-based sequence-to-sequence models for speech recognition. arXiv preprint arXiv:2010.11428.
  • [16] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • [17] C. V. Neti, S. Roukos, and E. Eide (1997) Word-based confidence measures as a guide for stack search in speech recognition. In Proc. of ICASSP, Vol. 2, pp. 883–886.
  • [18] S. Ng, W. Liu, Z. Peng, S. Feng, H. Huang, O. Scharenborg, and T. Lee (2020) The CUHK-TUDelft system for the SLT 2021 children speech recognition challenge. arXiv preprint arXiv:2011.06239.
  • [19] D. S. Park, Y. Zhang, Y. Jia, W. Han, C. Chiu, B. Li, Y. Wu, and Q. V. Le (2020) Improved noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629.
  • [20] B. Rueber (1997) Obtaining confidence measures from sentence probabilities. In Fifth European Conference on Speech Communication and Technology.
  • [21] A. Sanchis, A. Juan, and E. Vidal (2003) Estimating confidence measures for speech recognition verification using a smoothed naive Bayes model. In Iberian Conference on Pattern Recognition and Image Analysis, pp. 910–918.
  • [22] P. G. Shivakumar and P. Georgiou (2020) Transfer learning from adult to children for speech recognition: evaluation, analysis and recommendations. Computer Speech & Language 63, pp. 101077.
  • [23] P. G. Shivakumar and S. Narayanan (2021) End-to-end neural systems for automatic children speech recognition: an empirical study. arXiv preprint arXiv:2102.09918.
  • [24] R. A. Sukkar and C. Lee (1996) Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition. IEEE Transactions on Speech and Audio Processing 4 (6), pp. 420–429.
  • [25] G. Tur, D. Hakkani-Tür, and R. E. Schapire (2005) Combining active and semi-supervised learning for spoken language understanding. Speech Communication 45 (2), pp. 171–186.
  • [26] L. Uebel and P. Woodland (2001) Speaker adaptation using lattice-based MLLR. In ISCA Tutorial and Research Workshop (ITRW) on Adaptation Methods for Speech Recognition.
  • [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762.
  • [28] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253.
  • [29] M. Weintraub, F. Beaufays, Z. Rivlin, Y. Konig, and A. Stolcke (1997) Neural-network based measures of confidence for word recognition. In Proc. of ICASSP, Vol. 2, pp. 887–890.
  • [30] A. Woodward, C. Bonnín, I. Masuda, D. Varas, E. Bou-Balust, and J. C. Riveiro (2020) Confidence measures in encoder-decoder models for speech recognition. In Proc. Interspeech, Shanghai.
  • [31] F. Yu, Z. Yao, X. Wang, K. An, L. Xie, Z. Ou, B. Liu, X. Li, and G. Miao (2021) The SLT 2021 children speech recognition challenge: open datasets, rules and baselines. In Spoken Language Technology Workshop (SLT), pp. 1117–1123.