Speech-enabled interactive systems, e.g., voice assistants, intelligent recorders and smart loudspeakers, have been widely integrated into our daily life. Automatic Speech Recognition (ASR) is the most important technology underlying these systems. State-of-the-art ASR systems have been steadily improving in terms of recognition accuracy and processing latency. The system design has evolved from Hidden Markov Models (HMM) and the hybrid Deep Neural Network-HMM (DNN-HMM) model to attention-based end-to-end (E2E) neural models, which require a substantially larger amount of labelled training data. Despite continual efforts, the performance of an ASR system inevitably degrades in adverse and unstable acoustic conditions, e.g., high-intensity noise, atypical pronunciation and/or speaking style, and inadequately represented speaker groups. The present research is focused on children’s speech.
Apart from the word error rate, the confidence measure (CM) is a performance index of particular importance for ASR systems deployed in real-world scenarios. The value of the confidence measure for an input speech utterance indicates to what extent the user can trust the ASR result. CM is useful in many downstream tasks of ASR. For example, it can be used to determine whether an (untranscribed) input utterance should be utilized for speaker adaptation [26]. CM is also useful in semi-supervised training [8], spoken dialogue systems [6, 25], and intelligent audio stream allocation in distributed ASR systems [13].
Numerous approaches toward word-level CM have been investigated in the context of HMM-based ASR. Most commonly, the CM is predicted with a binary classifier from features extracted during ASR decoding. The predictor features can be obtained from the ASR output lattice, for example, word posterior probability, word trellis stability, normalized acoustic likelihood and language model score. For the classifier models, linear discriminant functions, Gaussian mixture classifiers [17] and neural networks [29, 10, 14, 11] have been most commonly adopted.
In E2E ASR, softmax probabilities in the auto-regressive decoder are commonly regarded as an intuitive measure of confidence in the mapping from speech to text. However, the softmax probability was found to be unreliable and might perform poorly due to the overconfident behaviour of E2E models [7, 15]. To alleviate the problem of unreliability, a neural network can be trained independently to predict a softmax temperature value that re-distributes the original output probabilities at each time step of decoding [30]. A lightweight neural network has also been used to estimate a neural confidence measure (NCM), which was shown to be more reliable than directly using the softmax probability. In [13], an NCM module was developed to predict utterance-level confidence measures in the context of small-footprint E2E ASR. With predictor features extracted from the encoder, the decoder and the attention blocks of the E2E system, the NCM significantly outperformed the conventional word density confidence measure (WDCM) and the beam-scatter weighted WDCM.
The predictor features play a critical role in the design of robust NCM modules. In the present study, we focus on investigating predictor features that are discriminative for confidence measure and able to generalize well to other domains. Here, being discriminative means that the predictor features are effective in differentiating erroneous ASR outputs from correct ones. The “beam scores” feature investigated in previous work is extended. The efficacies of the acoustic-related and linguistic-related score components are examined separately.
To our knowledge, this study is the first to explore utterance-level NCM in E2E ASR for children’s speech. Compared with adult speech, children’s speech is less studied and has recently attracted growing research interest [22, 23]. ASR systems for children’s speech are more likely to generate erroneous output, making confidence measure a more relevant issue than for adult speech. We also investigate the robustness of NCM to varying input speech conditions and its transferability to another, out-of-domain E2E ASR system trained on adult speech. In this paper, E2E ASR refers specifically to the Transformer-based joint CTC-attention speech recognition system.
2 Neural Confidence Measure Module
As shown in Figure 1, the NCM module is built on top of a properly trained E2E ASR system. For each input utterance decoded by the ASR system, the NCM module generates a confidence score based on a set of predictor features acquired from the ASR decoding process. The confidence score is compared against a threshold to determine whether the decoded text should be accepted or rejected.
2.1 End-to-end speech recognition system
In this study, the E2E ASR system is based on an encoder-decoder model with the joint CTC-attention learning framework. As shown in Figure 2, the system consists of three components: a shared encoder, an attention decoder and a connectionist temporal classification (CTC) loss layer. An input sequence of acoustic features is encoded by the shared encoder into a hidden vector sequence. The hidden sequence is then processed in parallel by the attention decoder and the CTC loss layer to generate a sequence of output tokens. The three components are jointly optimized in training.
To take full advantage of both the CTC and attention mechanisms, a multi-task learning (MTL) based loss function is used,
$$\mathcal{L}_{\mathrm{MTL}} = \lambda \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \mathcal{L}_{\mathrm{att}},$$
where $\mathcal{L}_{\mathrm{CTC}}$ denotes the CTC loss and $\mathcal{L}_{\mathrm{att}}$ denotes the attention loss. The value of $\lambda$ is between $0$ and $1$. The CTC objective function acts as an auxiliary task to help speed up the alignment process at both training and decoding stages. The attention decoder relieves the limitation of conditional independence assumed in CTC.
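As a worked illustration of the multi-task objective, the following minimal sketch interpolates the two losses with a single weight. It assumes the standard joint CTC-attention form; the function name, the symbol `lam` and the numeric values are illustrative and not taken from the paper.

```python
def joint_ctc_attention_loss(ctc_loss: float, att_loss: float, lam: float) -> float:
    """Interpolate the CTC and attention losses with a weight lam in [0, 1].

    Assumed form: lam * L_ctc + (1 - lam) * L_att.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * att_loss


# Hypothetical loss values for one mini-batch.
print(joint_ctc_attention_loss(ctc_loss=2.0, att_loss=1.0, lam=0.3))
```

With `lam = 0` only the attention loss is optimized, and with `lam = 1` only the CTC loss; intermediate values trade off the two objectives.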
2.2 Predictor features
Various classifier models, e.g., feed-forward network (FFN) [13, 30], recurrent neural network (RNN) [10, 11] and self-attention Transformer [15], could be applied to realize the NCM module. In this study, a residual FFN with three hidden layers is adopted as the classification model. Our main focus is on selecting better predictor features.
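To make the classifier concrete, here is a minimal sketch of a residual feed-forward network with three hidden layers that maps a predictor feature vector to a confidence score in (0, 1). The hidden size, the ReLU activation, the random initialization and the placement of the skip connections (around the second and third hidden layers) are all assumptions made for illustration; the paper does not specify them here.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class ResidualFFN:
    """Three hidden layers; the last two are wrapped in skip connections."""

    def __init__(self, n_features, hidden=16, seed=0):
        rng = random.Random(seed)
        self.first = init_layer(n_features, hidden, rng)
        self.residual = [init_layer(hidden, hidden, rng) for _ in range(2)]
        self.out = init_layer(hidden, 1, rng)

    def forward(self, features):
        h = relu(linear(features, *self.first))
        for weights, bias in self.residual:
            update = relu(linear(h, weights, bias))
            h = [a + b for a, b in zip(h, update)]  # skip connection
        logit = linear(h, *self.out)[0]
        return 1.0 / (1.0 + math.exp(-logit))  # confidence score in (0, 1)


model = ResidualFFN(n_features=4)
score = model.forward([0.1, -0.2, 0.3, 0.0])  # hypothetical feature vector
accept = score >= 0.5  # hypothetical operating threshold
```

The model above is untrained; in practice its parameters would be learned from labelled decoding results before the score is thresholded.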
The design of predictor features depends highly on the ASR decoding process. In our E2E ASR system, the beam search algorithm is adopted to perform one-pass decoding [28], in which the decoder computes a score for each partial hypothesis. In practice, external language models, e.g., n-gram and RNN language model (rnnlm), are employed by shallow fusion in the decoding. The score given to a recognition hypothesis is the weighted combination of four components, namely the CTC score $S_{\mathrm{ctc}}$, the attention score $S_{\mathrm{att}}$, the n-gram score $S_{\mathrm{ngram}}$ and the rnnlm score $S_{\mathrm{rnnlm}}$. They can be interpreted as the log probabilities of forming the decoded token sequence according to different information sources. A hypothesis with a higher score is more likely to be selected. Here, the term “token” refers to the decoded character at each time step. Each hypothesis is made up of a sequence of tokens. To speed up the search, only a limited number of partial hypotheses are retained at each time step, according to the beam size setting. An $n$-best list is generated at the end of decoding. The list includes the $n$ complete hypotheses with the highest scores. The best hypothesis is the final output of ASR, i.e.,
$$\hat{Y} = \arg\max_{Y \in \Omega} \left\{ \lambda_{\mathrm{ctc}} S_{\mathrm{ctc}}(Y) + \lambda_{\mathrm{att}} S_{\mathrm{att}}(Y) + \lambda_{\mathrm{ngram}} S_{\mathrm{ngram}}(Y) + \lambda_{\mathrm{rnnlm}} S_{\mathrm{rnnlm}}(Y) \right\},$$
where $\Omega$ denotes a set of complete hypotheses, and $\lambda_{\mathrm{ctc}}$, $\lambda_{\mathrm{att}}$, $\lambda_{\mathrm{ngram}}$ and $\lambda_{\mathrm{rnnlm}}$ are the component weights in the range of $[0, 1]$. Specifically, $S_{\mathrm{ctc}}$ and $S_{\mathrm{att}}$ refer to the negative CTC loss and the negative attention loss respectively.
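The selection of the final output from the complete hypotheses can be sketched as follows. This is a minimal illustration of a weighted combination of the four component scores; the dictionary layout, the weight values and the scores are hypothetical and do not come from the paper.

```python
from typing import Dict, List

COMPONENTS = ("ctc", "att", "ngram", "rnnlm")


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the four component log-probability scores."""
    return sum(weights[name] * scores[name] for name in COMPONENTS)


def best_hypothesis(nbest: List[dict], weights: Dict[str, float]) -> dict:
    """Return the complete hypothesis with the highest combined score."""
    return max(nbest, key=lambda hyp: combined_score(hyp["scores"], weights))


# Hypothetical component weights and a two-entry list of complete hypotheses.
weights = {"ctc": 0.3, "att": 0.7, "ngram": 0.3, "rnnlm": 0.7}
nbest = [
    {"text": "A", "scores": {"ctc": -4.0, "att": -3.0, "ngram": -6.0, "rnnlm": -5.0}},
    {"text": "B", "scores": {"ctc": -5.0, "att": -2.0, "ngram": -7.0, "rnnlm": -6.0}},
]
best = best_hypothesis(nbest, weights)
print(best["text"], combined_score(best["scores"], weights))
```

Hypothesis "B" has the better attention score on its own, yet "A" wins once all four components are weighted together, which is why the individual components can carry different information for confidence estimation.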
Intuitively, all four component scores can contribute to estimating the confidence score. Note that length-normalized scores are used to avoid the influence of varying utterance length. Speech recognition models generally tend to assign an extreme score value to the best hypothesis, which may adversely influence the classification judgment. The likelihoods of the top-$n$ best hypotheses can be utilized to produce a more robust confidence score. For the best hypothesis, apart from the list of utterance-level scores, more information can be drawn from the score distribution over the whole vocabulary at each decoder time step. Due to the very large ASR vocabulary size and the limitation of beam search decoding, only the top-$k$ score logits are considered. In [13, 30], internal neural features of E2E ASR were studied. Embedding features extracted at the last layers of the encoder and the decoder were shown to be useful for confidence score estimation, given that they contain rich acoustic and linguistic information respectively. A token duration feature was also suggested, based on the observation that tokens with short duration are prone to recognition errors. Table 1 lists the above predictor features, which are investigated in the following experiments.
|Predictor feature|Feature form|Notation|
|average token duration|scalar||
|average token entropy|scalar||
|$n$-best scores|vector||
|top-$k$ score logits|||
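The $n$-best score feature can be sketched as below. The paper states that normalized scores are used to remove the effect of utterance length; dividing by the number of tokens and padding short lists with the last available value are assumptions made here for illustration, as are the function name and the sample values.

```python
from typing import List, Tuple


def nbest_score_feature(nbest: List[Tuple[float, int]], n: int) -> List[float]:
    """Build a fixed-length vector of length-normalized hypothesis scores.

    Each entry of nbest is (total_log_score, number_of_tokens).
    Normalizing by token count is an assumed normalization scheme.
    """
    if not nbest or n <= 0:
        raise ValueError("need at least one hypothesis and n > 0")
    normalized = sorted(
        (score / max(num_tokens, 1) for score, num_tokens in nbest),
        reverse=True,
    )
    top = normalized[:n]
    # Pad with the lowest kept score when fewer than n hypotheses exist (assumption).
    top += [top[-1]] * (n - len(top))
    return top


# Hypothetical (score, length) pairs from one decoded utterance.
hyps = [(-8.0, 4), (-9.0, 3), (-20.0, 5)]
print(nbest_score_feature(hyps, n=2))
print(nbest_score_feature(hyps, n=4))
```

A fixed-length vector is convenient because the classifier input dimension must not depend on how many hypotheses survive the beam search.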
3 Experimental Setup
3.1 Data sets
The data sets used in the experiments on NCM are from the 2021 SLT Children Speech Recognition Challenge (CSRC) [31]. The CSRC provides both adult and children’s speech data, each part being divided into training, validation (denoted as dev) and test (denoted as test) sets [18]. Additional evaluation sets released by the CSRC, namely the children’s read speech (eval_child_read) and children’s conversational speech (eval_child_convers), are also used in this study. The NCM module is trained on the dev part of the children’s speech in the CSRC data. Other parts of the children’s speech, i.e., test, eval_child_read and eval_child_convers, as well as the adult test speech, i.e., test_adult_1w, are used for NCM evaluation. test_adult_1w is a subset of the original adult test set. Table 2 gives a summary of the above data sets. For each of them, the decoding accuracy of the E2E ASR system trained for children’s speech, which is described in the next section, is given in terms of character error rate (CER) and sentence error rate (SER).
3.2 End-to-end ASR system for children speech
Utterance-level NCM is evaluated with an E2E ASR system that is trained to generate Chinese characters from Mandarin speech [18]. Input features of the ASR system comprise filter-bank features and pitch features. Both the shared encoder and the attention decoder adopt the self-attention Transformer structure [27]. Two versions of the E2E model are involved in the following experiments. The first system is trained only on adult read speech, i.e., the training set of the CSRC adult speech. The second system is obtained by fine-tuning the first one with children’s speech in both read and conversational speaking styles, i.e., the training set of the CSRC children’s speech.
3.3 Evaluation metric
The prediction of NCM is performed as a binary classification task. Thus the Equal Error Rate (EER) and the area under the ROC curve (AUC) are adopted for performance evaluation. These metrics have been commonly used in previous studies on confidence measures. EER refers to the error rate achieved at the operating threshold where the false acceptance rate and the false rejection rate are equal. AUC measures the average classification performance over the full range of operating thresholds. Perfect performance is attained when the EER is equal to 0 or the AUC is equal to 1.
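Both metrics can be computed directly from confidence scores and binary labels. The sketch below is a plain-Python illustration rather than the evaluation code used in the paper: AUC is computed as the fraction of (correct, erroneous) pairs ranked in the right order, and EER is approximated at the candidate threshold where the false acceptance and false rejection rates are closest. Function names and sample data are hypothetical.

```python
from typing import List, Tuple


def split_by_label(scores: List[float], labels: List[int]):
    pos = [s for s, y in zip(scores, labels) if y == 1]  # correct ASR outputs
    neg = [s for s, y in zip(scores, labels) if y == 0]  # erroneous ASR outputs
    if not pos or not neg:
        raise ValueError("both classes must be present")
    return pos, neg


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """Probability that a correct output scores higher than an erroneous one."""
    pos, neg = split_by_label(scores, labels)
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def equal_error_rate(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (EER, threshold); an utterance is accepted when score >= threshold."""
    pos, neg = split_by_label(scores, labels)
    best = None
    for thr in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= thr for s in neg) / len(neg)  # false acceptance rate
        frr = sum(s < thr for s in pos) / len(pos)   # false rejection rate
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Hypothetical confidence scores and labels (1 = correct, 0 = erroneous).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
auc = roc_auc(scores, labels)
```

On finite data the two error rates rarely coincide exactly, so the sketch reports their average at the closest point; interpolating the ROC curve is a common alternative.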
3.4 Loss function for NCM module training
The labels for NCM module training are determined as follows. If the ASR decoding result on an utterance perfectly matches its ground-truth transcription, the label for the NCM classifier is set to 1; otherwise it is set to 0. For a binary classification problem, the binary cross-entropy (BCE) loss is regarded as the default choice. However, our preliminary experiment showed that the weighted focal loss [16] could be a better choice for obtaining an operating threshold (for determining the EER) that is stable across different data sets. The weighted focal loss not only handles the class imbalance issue, since the SER on dev directly determines the proportion of utterances labelled 0, but also pays more attention to hard samples. In the present study, the weighted focal loss is adopted as the loss function for NCM module training.
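The per-utterance weighted focal loss can be written in a few lines. This sketch follows the commonly used form of the focal loss with a class weight `alpha` and a focusing parameter `gamma`; the default values below are the ones popularized in the object detection literature and are assumptions here, since the settings used for NCM training are not stated in this section.

```python
import math


def weighted_focal_loss(p: float, label: int, alpha: float = 0.25,
                        gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Weighted focal loss for one utterance.

    p     : predicted probability that the ASR output is correct (label 1)
    alpha : weight on the positive class; (1 - alpha) weights the negative class
    gamma : focusing parameter; gamma = 0 recovers a class-weighted BCE
    """
    p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
    p_t = p if label == 1 else 1.0 - p         # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A hard positive (low p) is penalized far more than an easy one (high p).
hard = weighted_focal_loss(0.3, label=1)
easy = weighted_focal_loss(0.9, label=1)
print(hard, easy)
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of well-classified utterances, which is how the loss concentrates training on hard samples while `alpha` compensates for class imbalance.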
4 Results and Analysis
4.1 Utterance-level scores
The results of utterance-level NCM prediction with scores from the top-1 hypothesis are shown in Table 3. The performance of the four component scores described in Section 2.2 is compared. The best performance was achieved by using the attention score. The two language model scores are clearly inferior to the CTC and attention scores. This suggests that acoustic information is more pertinent to estimating the confidence score, and that linguistic information alone is not sufficient. As shown in Table 3, the weighted sum score, which takes direct effect in ASR decoding, does not perform well in comparison with the CTC and attention scores.
4.2 NCM based on n-best scores
The experimental results with $n$-best score features are shown by the plots of EER in Figure 3. By incorporating multiple hypotheses into the predictor features, the NCM tends to perform better. Using the $n$-best weighted score achieves the best EER, which significantly surpasses that of the 1-best weighted score. This suggests that the $n$-best feature can capture more discriminative patterns to help classification. In particular, the weighted score outperforms the attention score in the $n$-best cases. Among the $n$-best settings, the preferred feature is the one that achieves an EER similar to that of the best setting together with a higher AUC.
|Predictor feature|EER|AUC|
|top-$k$ score logits|0.3730|0.6810|
|top-$k$ score logits|0.3097|0.7508|
|top-$k$ score logits|0.2777|0.7941|
4.3 Other predictor features and variants
It is worth noting that the component weights can be adjusted automatically. They are treated as learnable parameters that are jointly optimized in model training, either in a fixed way or in an adaptive way, in which each weight is adjusted based on the corresponding component score, as illustrated in the first block of Table 4. A large performance gain in EER can be attained, suggesting that balancing the component scores helps considerably. The lowest EER and the highest AUC are achieved with the 10-best feature.
In the middle part of Table 4, the effects of the encoder and decoder embedding features are compared. The decoder embedding slightly outperforms the encoder one. This is plausible because the attention decoder has already attended to the encoder embedding. Yet both embeddings are prone to overfitting. The average token duration is represented by the length ratio between the input feature sequence and the decoded token sequence. It does not provide any discriminative information for confidence measure.
The lowest part of Table 4 shows the performance of a few feature variants related to the top-$k$ score logits at each time step of decoding for the best hypothesis. The entropy measures the confidence of taking the index with the maximum softmax probability as the decoding output. That is, the recognition result is highly uncertain if the distribution is close to uniform. The average entropy of the token distributions is expected to reflect the confidence of an output hypothesis. We experiment with several values of $k$. The entropy with a smaller value of $k$, i.e., fewer top-$k$ score logits, consistently outperforms that with a larger $k$. Here, $k$ is tied to the beam search setting used in decoding. Furthermore, the score logits can be multiplied by a constant temperature value, which is determined by joint optimization with the NCM module, in order to sharpen or smooth the distribution. Inspired by the technique used in [30], we can train an additional neural network to predict a dynamic softmax temperature that takes different values at different decoder time steps. The input to this predictive network is the decoder embedding, which contains both acoustic and linguistic information related to the corresponding time step. It can be observed that the dynamic temperature always performs better than the constant one across different settings of $k$, and that both temperature scaling approaches consistently outperform the vanilla entropy feature. The temperature-scaled entropy achieves performance comparable to that of the top-$k$ score logits feature, even though the entropy-based feature is only a scalar.
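The entropy feature described above can be sketched as follows. The code computes the softmax entropy of the top-$k$ score logits at each decoding step and averages it over the hypothesis; a single temperature plays the role of the constant scaling, and a per-step list stands in for the output of a temperature prediction network. Following the text, the logits are multiplied by the temperature (many implementations divide instead). All names and sample values are illustrative assumptions.

```python
import math
from typing import List, Sequence, Union


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits multiplied by a temperature (as described in the text)."""
    scaled = [temperature * z for z in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def average_token_entropy(topk_logits: List[Sequence[float]],
                          temperature: Union[float, Sequence[float]] = 1.0) -> float:
    """Mean entropy over decoding steps; temperature is a constant or one value per step."""
    if not topk_logits:
        raise ValueError("need at least one decoding step")
    if isinstance(temperature, (int, float)):
        temps = [float(temperature)] * len(topk_logits)
    else:
        temps = list(temperature)
    if len(temps) != len(topk_logits):
        raise ValueError("one temperature per decoding step is required")
    values = [entropy(softmax(step, t)) for step, t in zip(topk_logits, temps)]
    return sum(values) / len(values)


# Hypothetical top-3 logits for a two-token hypothesis.
steps = [[2.0, 1.0, 0.0], [0.5, 0.4, 0.3]]
plain = average_token_entropy(steps)
constant = average_token_entropy(steps, temperature=2.0)
dynamic = average_token_entropy(steps, temperature=[2.0, 0.5])
```

A uniform distribution over $k$ logits gives the maximum entropy $\ln k$, so a low average entropy indicates that the decoder was decisive at most time steps.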
4.4 Fusion of predictor features and robustness test
As shown in the top part of Table 5, the fusion of different predictor features is evaluated on the test set of children’s speech. Generally speaking, the fusion of multiple features is beneficial and gives better performance than using the features individually. The fusion of the $n$-best score feature and a top-$k$ based feature achieves the best NCM quality. These are exactly the two most promising features observed in Table 4.
The robustness of NCM is investigated in three different speech domains, namely children’s read speech, children’s conversational speech and adult read speech (Table 2). The results are reported in Table 5. In general, the performance of the NCMs is slightly better on the eval_child_convers set than on the test set, but shows some EER/AUC degradation on the eval_child_read set. A large performance gain is attained when the NCM is applied to adult speech, which is unexpected. As can be seen, NCM tends to perform better on data sets that have higher sentence error rates (SER). This is probably because a high SER leads to an imbalanced distribution of erroneous and correct transcriptions. This imbalance lowers the EER since it becomes easier to separate the two classes.
4.5 Confidence measure on mismatched ASR
Different ASR systems may exhibit different decoding behaviours. A confidence measure module is typically designed for a specific ASR system. The transferability of our proposed NCM module, i.e., how well it can be used with another ASR system, is investigated in this section. The system trained only on adult speech is taken as a mismatched ASR system against the system fine-tuned with children’s speech. The performance of the NCM is shown in Table 6. A clear EER improvement is observed on the test set, whereas the EER degrades on the test_adult_1w set. Decoding children’s speech utterances with the adult-only system produces more erroneous transcriptions (higher SER), resulting in a more imbalanced class distribution. In this case, EER/AUC does not seem to be appropriate for performance comparison.
Motivated by the work in [15], we evaluate the performance of NCM by plotting the CER on filtered utterances with respect to the confidence threshold. Utterances with a confidence score higher than a specific threshold are selected to form a set of filtered utterances. Since a good NCM should exhibit a strong correlation with the CER, i.e., a higher threshold should result in a set of utterances with lower CER, a monotonically decreasing relation is expected. As shown in Figure 4, most curves show a monotonically decreasing trend. Nevertheless, two prominent spikes are noted in the region of high confidence for the children’s speech in the test and eval_child_convers data sets decoded by the mismatched adult-only system. The spikes reveal over-confident behaviour related to transferability. We suspect that this transferability issue is closely related to decoding children’s conversational speech, since it is the speech data type common to both the test and eval_child_convers sets.
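The filtered-CER analysis can be reproduced with a short routine. The sketch below assumes that each utterance record carries a confidence score, the number of character errors (edit distance to the reference) and the reference length; these field names and the sample values are hypothetical.

```python
from typing import List, Optional


def filtered_cer(utterances: List[dict], threshold: float) -> Optional[float]:
    """CER over utterances whose confidence exceeds the threshold.

    Returns None when no utterance passes the filter.
    """
    kept = [u for u in utterances if u["confidence"] > threshold]
    ref_chars = sum(u["ref_len"] for u in kept)
    if ref_chars == 0:
        return None
    return sum(u["errors"] for u in kept) / ref_chars


# Hypothetical decoding results for three utterances.
utts = [
    {"confidence": 0.9, "errors": 0, "ref_len": 10},
    {"confidence": 0.6, "errors": 2, "ref_len": 10},
    {"confidence": 0.2, "errors": 5, "ref_len": 10},
]
curve = [(thr, filtered_cer(utts, thr)) for thr in (0.0, 0.5, 0.8)]
```

For a well-behaved confidence measure the CER values along `curve` should not increase as the threshold grows; a rise at high thresholds corresponds to the over-confidence spikes discussed above.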
5 Conclusion
This research is focused on investigating the efficacy of NCMs derived from different predictor features in E2E ASR systems. It is found that properly balanced weights on the CTC score, the attention score and the language model scores play a critical role in the reliability of the confidence measure. Incorporating the $n$-best hypothesis scores can lead to further improvement. In addition, the average token entropy with adaptive softmax temperature is demonstrated to be effective. The fusion of these features can achieve better performance. Experimental results also suggest that the EER/AUC metrics are not sufficient to evaluate the NCM performance on a mismatched ASR system with a large SER difference.
References
- [1] (2004) Improving broadcast news transcription by lightly supervised discriminative training. In Proc. of ICASSP, Vol. 1, pp. I–737.
- [2] (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. of ICASSP, pp. 4960–4964.
- [3] (1992) Rejection and keyword spotting algorithms for a directory assistance city name recognition application. In Proc. of ICASSP, Vol. 2, pp. 93–96.
- [4] (2018) Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition. In Proc. of ICASSP, pp. 5884–5888.
- [5] (2000) Large vocabulary decoding and confidence estimation using word posterior probabilities. In Proc. of ICASSP, Vol. 3, pp. 1655–1658.
- [6] (2002) Recognition confidence scoring and its use in speech understanding systems. Computer Speech & Language 16 (1), pp. 49–67.
- [7] (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
- [8] (2016) Active learning for speech recognition: the power of gradients. arXiv preprint arXiv:1612.03226.
- [9] (2005) Confidence measures for speech recognition: a survey. Speech Communication 45 (4), pp. 455–470.
- [10] (2015) Estimating confidence scores on ASR results using recurrent neural networks. In Proc. of ICASSP, pp. 4999–5003.
- [11] (2020) Confidence estimation for black box automatic speech recognition systems using lattice recurrent neural networks. In Proc. of ICASSP, pp. 6329–6333.
- [12] (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. of ICASSP, pp. 4835–4839.
- [13] (2020) Utterance confidence measure for end-to-end speech recognition with applications to distributed speech recognition scenarios. In Proc. Interspeech.
- [14] (2019) Bi-directional lattice recurrent neural networks for confidence estimation. In Proc. of ICASSP, pp. 6755–6759.
- [15] (2020) Confidence estimation for attention-based sequence-to-sequence models for speech recognition. arXiv preprint arXiv:2010.11428.
- [16] Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- [17] (1997) Word-based confidence measures as a guide for stack search in speech recognition. In Proc. of ICASSP, Vol. 2, pp. 883–886.
- [18] (2020) The CUHK-TUDelft system for the SLT 2021 children speech recognition challenge. arXiv preprint arXiv:2011.06239.
- [19] (2020) Improved noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629.
- [20] (1997) Obtaining confidence measures from sentence probabilities. In Fifth European Conference on Speech Communication and Technology.
- [21] Estimating confidence measures for speech recognition verification using a smoothed naive Bayes model. In Iberian Conference on Pattern Recognition and Image Analysis, pp. 910–918.
- [22] (2020) Transfer learning from adult to children for speech recognition: evaluation, analysis and recommendations. Computer Speech & Language 63, pp. 101077.
- [23] (2021) End-to-end neural systems for automatic children speech recognition: an empirical study. arXiv preprint arXiv:2102.09918.
- [24] (1996) Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition. IEEE Transactions on Speech and Audio Processing 4 (6), pp. 420–429.
- [25] Combining active and semi-supervised learning for spoken language understanding. Speech Communication 45 (2), pp. 171–186.
- [26] (2001) Speaker adaptation using lattice-based MLLR. In ISCA Tutorial and Research Workshop (ITRW) on Adaptation Methods for Speech Recognition.
- [27] (2017) Attention is all you need. arXiv preprint arXiv:1706.03762.
- [28] (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253.
- [29] (1997) Neural-network based measures of confidence for word recognition. In Proc. of ICASSP, Vol. 2, pp. 887–890.
- [30] (2020) Confidence measures in encoder-decoder models for speech recognition. In Proc. Interspeech, Shanghai.
- [31] (2021) The SLT 2021 children speech recognition challenge: open datasets, rules and baselines. In Spoken Language Technology Workshop (SLT), pp. 1117–1123.