Automatic speech recognition (ASR) has improved remarkably with advances in deep learning and increasingly powerful computational resources [1, 2]. However, current ASR systems are still imperfect because of physical conditions beyond their control, such as variability across microphones or background noise. Quality evaluation (QE) is therefore a practical necessity for developing and deploying speech and language technology: it enables users and researchers to judge the overall quality of the output, detect bad cases, refine algorithms, and choose among competing systems in a specific target environment. This paper focuses on the situation where gold reference transcripts are not available, so a reliable word error rate (WER) cannot be computed directly and has to be estimated.
In ASR QE without transcripts, a two-stage framework consisting of feature extraction followed by WER prediction has long been the standard practice. Classical pioneering works mainly rely on hand-crafted features and use them to build linear-regression-based algorithms, including an aggregation method with extremely randomized trees, the SVM-based TranscRater [5], and e-WER [6]. In this work, instead of relying heavily on manual feature engineering, we propose to derive the feature representations from a pre-trained conditional bidirectional language model, speech-BERT, which aims to capture the relationships between the raw fbank features and the utterances by analyzing them holistically. The training data required for speech-BERT is exactly the same as for conventional ASR, without any additional human annotation. Subsequently, for the WER prediction stage, we analyze the empirical distribution of WER for most ASR systems (Fig. 1) and find that WER values tend to concentrate near 0, while the non-zero values approximately follow a Beta or Gaussian distribution between 0 and 1. Therefore, during fine-tuning, we introduce a neural zero-inflated regression layer on top of speech-BERT to fit the target distribution more appropriately.
In summary, this paper makes the following main contributions. i) We propose a bidirectional language model conditioned on speech features, which aims to improve the feature representations for downstream ASR tasks. An additional experiment shows that tying the parameters of speech-BERT and speech-Transformer can accelerate convergence during training. ii) We introduce a neural zero-inflated Beta regression layer that specifically fits the empirical distribution of WER. For the gradient back-propagation of the neural Beta regression layer, we design an efficient pre-computation method. iii) Our experimental results demonstrate that our ASR quality estimation model achieves state-of-the-art performance under a fair comparison.
2 Related Works
The Transformer [7] has been extensively explored in natural language processing [8, 9] and has become popular in speech [10, 11]. The motivation for our speech-BERT comes from the success of BERT [8], which demonstrated the importance of bidirectional pre-training for language representations and reduced the need for many heavily engineered task-specific architectures. We also adopt the masked language modeling loss as our training criterion, computed over the tokens (utterances) of one sentence. The major difference, however, is that speech-BERT is conceptually a conditional language model, designed to capture the subtle correlations between speech features and utterances as well as the syntactic information.
To build the conditional masked language model, which additionally conditions on the speech features corresponding to the sentence, we have to discard the single transformer encoder architecture of BERT, since it cannot easily consume two sequences of different modalities (although this is feasible with XLM [9]). Instead, we modify the speech-Transformer [10] by changing its auto-regressive decoder into a parallel memory encoder, resulting in an encoder-memory encoder architecture in which the speech and text domains are handled separately by two different encoders. Here, the memory refers to the outputs of the speech encoder after it consumes the spectrogram inputs.
Once the feature representations are ready, the quality estimation task typically reduces to either a regression or a classification problem, depending on the type of the predicted values. In machine translation (MT) quality estimation [12], a similar real-valued metric between 0 and 1, the translation error rate (TER), is the prediction target at the sentence level. The transformer-based predictor-estimator framework [13] established a state-of-the-art record in the WMT 2018 QE competition; it restricts the output to the expected interval by applying a sigmoid function before the regression. However, a standard regression model is probably not suitable for fitting the WER of ASR systems. Due to the subjectivity and non-uniqueness of the translation task, it is relatively easy for a significant gap to arise between machine and human translations, so TER is less often exactly zero. As mentioned above, we observe that the distribution of WER is empirically more zero-concentrated than that of TER, which makes a straightforward linear regression easily biased.
This motivates the zero-inflated model [14, 15], which is capable of representing a mixed continuous-discrete distribution. For a random variable $X$ following such a distribution, one typical representation is a weighted mixture of two distributions: a discrete component on a finite set $S$, written with indicator functions $\mathbb{1}_s(x)$ whose value equals 1 if $x$ equals $s$ and 0 otherwise, and a continuous component with density $g(x)$.
In our case $X$ denotes the WER, so $S = \{0\}$ and $g$ covers the non-zero values. In particular, we recommend taking $g$ to be a Beta distribution for ASR-QE. The mixture weight $\pi$ then simply governs whether $X$ takes the value 0 or not, resulting in a mixture of a Bernoulli and a Beta distribution. Additionally, we use one classification neural network to model the Bernoulli variable and one regression neural network to model the continuous variable, yielding a differentiable deep neural architecture that can be fine-tuned together with the parameters of speech-BERT. In this way, ASR-QE modeling is divided into a hierarchical multi-task learning problem, where the first step decides whether the ASR output is perfect, and the second step regresses the WER value only for the imperfect outputs.
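As a minimal sketch of this mixture, the following Python snippet samples WER-like values from a zero-inflated Beta distribution and computes its mean. It assumes the mixture has the form $f(x) = \pi\,\mathbb{1}_0(x) + (1-\pi)\,g(x)$ with $\pi$ the probability of a zero WER, and that the Beta component is parameterized by a mean `mu` and a precision `phi`. The function names and the example values are illustrative and are not taken from the paper.

```python
import random


def sample_zero_inflated_beta(pi_zero, mu, phi, rng):
    """Draw one WER-like value from the assumed mixture.

    With probability pi_zero the value is exactly 0 (perfect transcript);
    otherwise it is drawn from Beta(mu * phi, (1 - mu) * phi).
    """
    if rng.random() < pi_zero:
        return 0.0
    return rng.betavariate(mu * phi, (1.0 - mu) * phi)


def expected_wer(pi_zero, mu):
    """Mean of the mixture: the zero component contributes nothing."""
    return (1.0 - pi_zero) * mu


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_zero_inflated_beta(0.4, 0.2, 10.0, rng) for _ in range(20000)]
    print("fraction of zeros:", sum(d == 0.0 for d in draws) / len(draws))
    print("sample mean:", sum(draws) / len(draws))
    print("mixture mean:", expected_wer(0.4, 0.2))
```

The sample mean should approach `(1 - pi_zero) * mu`, which is also the quantity a point prediction of WER would target under this assumed form.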
The backbone structure of speech-BERT originates from the speech-Transformer [10], with the transformer decoder adapted into a memory encoder (Fig. 2). Two simple modifications achieve this. First, we randomly corrupt 15% of the utterances (tokens) in the transcription at each training step. Analogous to standard BERT, we introduce a new special token “[mask]” and substitute it for the tokens that require masking. Note that in practice the 15% of utterances selected for prediction during pre-training consist of 12% masked, 1.5% substituted, and 1.5% left unchanged.
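The corruption scheme above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' implementation: each token is selected independently with probability 0.15, the 12% / 1.5% / 1.5% split is realized as an 80% / 10% / 10% split among the selected tokens, and "substituted" is taken to mean replacement by a random vocabulary token, as in standard BERT. The names `corrupt_tokens` and `vocab` are hypothetical.

```python
import random

MASK = "[mask]"


def corrupt_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, target_positions) for one training step.

    Every selected position must be predicted during pre-training,
    whether it was masked, substituted, or left unchanged.
    """
    corrupted = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # about 85% of the tokens are left untouched
        targets.append(i)
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK  # 80% of selected, i.e. about 12% overall
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10% of selected, about 1.5% overall
        # remaining 10% of selected (about 1.5% overall): keep the original token
    return corrupted, targets


if __name__ == "__main__":
    rng = random.Random(1)
    sentence = ["jin", "tian", "tian", "qi", "hen", "hao"]
    print(corrupt_tokens(sentence, vocab=sentence, rng=rng, select_prob=0.5))
```

Note that a random substitute can coincide with the original token by chance; the sketch does not try to prevent this.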
Second, we remove the future mask matrix in the self-attention of the decoder, which can be written concisely in a unified formulation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}} + \mathbb{1}_{\mathrm{ST}}\, M\right) V,$$

where the indicator $\mathbb{1}_{\mathrm{ST}}$ equals 1 if the model architecture is speech-Transformer and 0 otherwise. $K$, $Q$, and $V$ are the output keys, queries, and values from the previous layer of the decoder or memory encoder, and $d$ is their dimension. $M$ is a triangular matrix with $M_{ij} = -\infty$ if $i < j$ and 0 otherwise. In the case of the decoder, the self-attention represents the information of all positions in the decoder up to and including the current position. In the case of the memory encoder, it represents the information of all positions except the masked ones. Other details follow the standard Transformer [7]. The advantage of the unified formulation is that it allows us to implement multi-task learning in a weight-tying architecture simply by altering the mask matrix, resulting in a joint loss that combines the masked language modeling loss of speech-BERT with the ASR loss of speech-Transformer.
In this loss, the model parameters are shared across speech-BERT and speech-Transformer. The extra ASR loss also differentiates our model from standard BERT, whose additional loss is designed for the next sentence prediction task. In the multi-task learning experiments, we set the weight between the two losses so that they stay at a consistent scale.
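To make the role of the mask switch concrete, here is a small pure-Python sketch of single-head scaled dot-product self-attention in which a boolean flag plays the part of the indicator. It assumes the standard Transformer scaling by the square root of the feature dimension; the function names and the toy inputs are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V, is_speech_transformer):
    """Single-head attention over lists of vectors.

    When is_speech_transformer is True, position i cannot attend to any
    position j > i (future mask); when False, as in the memory encoder,
    every position attends to every other position.
    """
    d = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        row = []
        for j, k in enumerate(K):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            if is_speech_transformer and j > i:
                score = -math.inf  # masked entries get zero weight after softmax
            row.append(score)
        w = softmax(row)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for flag in (True, False):
        _, w = self_attention(X, X, X, is_speech_transformer=flag)
        print(flag, [[round(v, 3) for v in row] for row in w])
```

Because the two modes differ only in the mask, the same weights can serve both the auto-regressive decoder and the bidirectional memory encoder, which is what makes the weight-tying setup straightforward.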
3.2 Neural Zero-Inflated Regression Layer
Speech-BERT outputs a sequence of feature representations, one for every single utterance (token) in the transcription. Theoretically, we could use the feature representation of a single, arbitrarily selected token for many downstream tasks, like “[CLS]” in standard BERT, since the self-attention mechanism has integrated all syntactic information into every feature, albeit in different ways. Intuitively, however, it is reasonable to use another feature fusion layer to encode the whole sequence of features together. Thus, we use one Bi-LSTM [16] layer to re-encode the features and output a single final encoder state as the feature for the quality estimation task.
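Referring to Eq. (1), a binary classifier on top of this feature first predicts whether the WER is zero or not.
For the subsequent regression model, it is then unnecessary to predict the case of zero WER, thanks to the classifier above. This naturally motivates the choice of Beta regression, because the Beta distribution is not defined at zero. For statistical distributions, the most important statistics are usually the first two moments, i.e., the mean and the variance. In our proposal, we mainly model the mean $\mu$, which is the actual target of our final prediction, and derive the variance of the Beta distribution as

$$\mathrm{Var}(X) = \frac{\mu(1-\mu)}{1+\phi},$$

where $\phi$ is a hyper-parameter that can be interpreted as a precision parameter and can be estimated from the training data. The parameterized density function is expressed as follows.

$$p(x; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, x^{\mu\phi - 1} (1-x)^{(1-\mu)\phi - 1}, \quad 0 < x < 1.$$

The fine-tuning objective is the log-likelihood of the zero-inflated model, with one term for the zero/non-zero classifier and one term for the Beta density of the non-zero WER values. This is a hierarchical loss for two consecutive sub-problems that can nevertheless be optimized simultaneously: the first term requires the whole fine-tuning dataset, while the second term is only fed with the data of inaccurate transcriptions. With this loss function, we use the expected prediction during inference, i.e., the predicted probability of a non-zero WER multiplied by the predicted Beta mean.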
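The two-term loss and the expected prediction can be sketched per utterance as follows. This is a hedged illustration, not the paper's code: it assumes the Beta density in the mean/precision form with shape parameters `mu * phi` and `(1 - mu) * phi`, it uses `p_nonzero` for the classifier's probability of a non-zero WER, and it clips the WER with a small constant `EPS` so that the logarithms stay finite. All three are choices made here for the example.

```python
import math

EPS = 1e-6  # assumed clipping constant; keeps log(x) and log(1 - x) finite


def beta_log_pdf(x, mu, phi):
    """Log density of Beta(mu * phi, (1 - mu) * phi) at x, with x clipped into (0, 1)."""
    x = min(max(x, EPS), 1.0 - EPS)
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))


def zero_inflated_nll(wer, p_nonzero, mu, phi):
    """Negative log-likelihood of one utterance under the zero-inflated Beta model.

    The classifier term is used for every utterance; the Beta term is
    only evaluated when the transcript is imperfect (wer > 0).
    """
    if wer == 0.0:
        return -math.log(1.0 - p_nonzero)
    return -(math.log(p_nonzero) + beta_log_pdf(wer, mu, phi))


def predict_wer(p_nonzero, mu):
    """Expected WER at inference: zero contributes nothing to the mean."""
    return p_nonzero * mu


if __name__ == "__main__":
    print(zero_inflated_nll(0.0, p_nonzero=0.25, mu=0.4, phi=5.0))
    print(zero_inflated_nll(0.3, p_nonzero=0.25, mu=0.4, phi=5.0))
    print(predict_wer(p_nonzero=0.25, mu=0.4))
```

Averaging `zero_inflated_nll` over a batch reproduces the structure described above: perfect transcripts only train the classifier, while imperfect ones train both heads.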
3.3 Gradient Pre-Computation
A crucial issue for the Beta regression layer is that the gradient computation involved is not straightforward: direct auto-differentiation of the training objective has to compute the gradient of a Gamma function composed with the network, i.e., with the computational graph that maps the input features to the predicted mean $\mu$.
Instead, we use a gradient pre-computation trick that makes back-propagation less cumbersome, by introducing an objective that is equivalent to the log-likelihood for the purpose of gradient computation. In this objective, the derivative of the log-likelihood with respect to $\mu$, which involves the digamma function $\psi$, is evaluated in the forward pass under a stop-gradient operation and then multiplied by $\mu$. Most deep learning packages provide the digamma function as a built-in, e.g., tf.digamma in TensorFlow. The equivalence holds in the sense of gradient computation; in other words, the stochastic gradient optimization remains the same, because the gradients of the two objectives with respect to the model parameters are identical after some algebra.
In the new objective, we have circumvented direct gradient back-propagation through the Gamma function: the complicated term merely involves a forward digamma computation with a stop-gradient operation, while the term $\mu$ contributes to back-propagation readily and efficiently because it consists only of common operations in deep neural networks.
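The gradient equivalence can be checked numerically without any deep learning framework. The sketch below assumes the Beta log-likelihood in the mean/precision form and writes the surrogate as a pre-computed coefficient times $\mu$; this is one form consistent with the description above, not necessarily the exact expression used in the paper. The `digamma` helper is a standard recurrence plus asymptotic-series approximation written for this example, and all function names are hypothetical.

```python
import math


def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def beta_log_likelihood(x, mu, phi):
    """Log density of Beta(mu * phi, (1 - mu) * phi) at x in (0, 1)."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))


def precomputed_coefficient(x, mu, phi):
    """d(log-likelihood)/d(mu), computed in the forward pass and then held constant."""
    return phi * (math.log(x) - math.log(1.0 - x)
                  - digamma(mu * phi) + digamma((1.0 - mu) * phi))


def surrogate_objective(x, mu, mu_detached, phi):
    """Surrogate: constant coefficient (stop gradient) times the live mean mu."""
    return precomputed_coefficient(x, mu_detached, phi) * mu


if __name__ == "__main__":
    x, mu, phi, h = 0.2, 0.3, 5.0, 1e-5
    # Central finite difference of the true log-likelihood with respect to mu.
    true_grad = (beta_log_likelihood(x, mu + h, phi)
                 - beta_log_likelihood(x, mu - h, phi)) / (2.0 * h)
    # The surrogate is linear in mu, so its gradient is just the coefficient.
    print(true_grad, precomputed_coefficient(x, mu, phi))
```

In a framework with automatic differentiation, `mu_detached` would correspond to the mean passed through a stop-gradient operation, so only the final multiplication by `mu` is back-propagated.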
To validate the effectiveness of our approach, we evaluate the ASR quality estimation model with two popular measures: Pearson correlation (larger is better) and mean absolute error (MAE, smaller is better).
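For reference, both measures can be computed from paired lists of predicted and true sentence-level WER values as in the short sketch below. The function names and the toy numbers are illustrative only.

```python
import math


def pearson(pred, ref):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(pred)
    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


def mean_absolute_error(pred, ref):
    """Average absolute difference between predicted and true values."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


if __name__ == "__main__":
    predicted = [0.02, 0.10, 0.35, 0.00]
    actual = [0.00, 0.12, 0.30, 0.00]
    print(pearson(predicted, actual), mean_absolute_error(predicted, actual))
```

Pearson correlation rewards getting the ranking and linear trend right, whereas MAE penalizes the absolute size of each error, so the two measures are complementary.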
4.1 Experimental Settings
We conduct our experiments on two types of data. The first is a large Mandarin speech recognition corpus containing 20,000 hours of training data with about 20 million sentences, which is used for speech-BERT pre-training. We evaluate pre-training performance via the prediction accuracy on masked tokens. The second is a small speech recognition quality estimation dataset of 240 hours, none of which appears in the pre-training dataset. The speech recognition system whose quality we want to estimate is an in-house ASR engine based on Kaldi (https://github.com/kaldi-asr/kaldi). The WER computed from the ASR results and the ground-truth transcripts is the target that our model predicts. Correspondingly, we have two test sets for the quality estimation model, one in-domain and one out-of-domain, each containing 3000 sentences. The acoustic features used in our implementation are 80-dimensional log-mel filter-bank (fbank) energies computed on a 25 ms window with a 10 ms shift. We stack the consecutive frames within a context window of 4 to produce 320-dimensional features, for the computational efficiency of the speech encoder. Speech-BERT is trained on 8 Tesla V100 GPUs for about 10 days until convergence. The quality estimation model is fine-tuned on 4 Tesla V100 GPUs for several hours.
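The frame stacking step can be illustrated with a short sketch. It assumes that the context windows do not overlap, so that the sequence length shrinks by a factor of 4, and that trailing frames which do not fill a complete window are dropped; the text above does not state either detail, and the function name is hypothetical.

```python
def stack_frames(frames, window=4):
    """Concatenate each group of `window` consecutive frames into one vector.

    frames: list of equally long feature vectors (e.g. 80-dim fbank frames).
    Returns a list that is `window` times shorter, with `window` times wider vectors.
    """
    usable = len(frames) - len(frames) % window  # drop an incomplete trailing group
    stacked = []
    for start in range(0, usable, window):
        merged = []
        for frame in frames[start:start + window]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


if __name__ == "__main__":
    # Ten dummy 80-dimensional frames; frame t is filled with the value t.
    fbank = [[float(t)] * 80 for t in range(10)]
    stacked = stack_frames(fbank)
    print(len(stacked), len(stacked[0]))
```

With a 10 ms shift, each stacked vector then covers four consecutive frames, which shortens the sequence that the speech encoder's self-attention has to process.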
4.2 Pre-Training Results
| Masked token prediction accuracy | 95.39% | NA | 94.81% |
We train the speech-BERT model with three different loss functions and summarize the results in Table 1. We observe that the jointly trained model achieves performance comparable to the two separately trained tasks. We also visualize the attention between the speech encoder and the text decoder or memory encoder in Fig. 3. The attention weights are averaged over 8 heads, and the overall patterns of joint training and separate training are relatively similar. We prefer to adopt the jointly pre-trained model for our downstream quality estimation task, since we hypothesize that the additional supervision in multi-task learning may incorporate more syntactic information into the hidden representations.
| Zero-Inflated Beta Regression | 0.5786 ± 0.0041 |
| Linear Regression | 0.5486 ± 0.0086 |
| Zero-Inflated Linear Regression | 0.5738 ± 0.0019 |
| Logistic Regression | 0.5501 ± 0.0006 |
| Zero-Inflated Logistic Regression | 0.5726 ± 0.0061 |
4.3 Fine-Tuning Results
For the quality estimation model, we first explore the advantage of the zero-inflated model and Beta regression by varying the prior distribution of WER. The performance of five different last layers is evaluated on the out-of-domain test set and shown in Table 2. Note that: i) We cannot simply apply Beta regression to the last layer without zero-inflation, since zero WER values violate the support of the Beta distribution. ii) Linear regression here does not necessarily imply a pure Gaussian distribution, since the output still has to lie in the interval $[0, 1]$. Thus, the mean of the Gaussian distribution cannot be arbitrarily large; a sigmoid function is applied to it in advance. iii) As described in ii), logistic regression differs from linear regression only in the loss function (cross-entropy vs. mean squared error). iv) The precision parameter $\phi$ of the Beta regression is a hyper-parameter estimated from the training data. We employ maximum likelihood estimation with another common parameterization, $\mathrm{Beta}(\alpha, \beta)$, where $\phi = \alpha + \beta$.
The second experiment, conducted on our in-domain test set, compares our ASR-QE model as an integrated pipeline with QEBrain [13], the state-of-the-art quality estimation model in machine translation. For a fair comparison, we replace the text encoder of QEBrain with exactly the same speech encoder as ours and its last layer with a zero-inflated regression layer. In addition to the previous metrics, we also introduce F1-OK/BAD, which is prevalent in word-level quality estimation for machine translation, to evaluate whether a recognition result is acceptable or not. The overall results are shown in Table 3, where acceptable recognition results are labeled by thresholding the WER. Speech-BERT outperforms QEBrain in all aspects. Furthermore, we analyze the Pearson correlation with respect to different sentence lengths in Fig. 4, using two simple linear regressions to fit the trend of decreasing performance as the sentence length grows. QEBrain performs better when the sentence is shorter, but speech-BERT has stable performance across all length ranges. This finding makes sense because longer sentences are more likely to have a lower or even zero WER, which the zero-inflated Beta regression layer handles better.
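The F1-OK/BAD evaluation can be sketched as follows. The WER threshold that separates acceptable from unacceptable results is not recoverable from the text, so `threshold` is a hypothetical parameter here, and the use of "less than or equal" at the boundary is also an assumption. The sketch reports one F1 score per class, as is common in machine translation quality estimation.

```python
def label_by_wer(wers, threshold):
    """Label each result OK when its WER is at most `threshold` (assumed rule), else BAD."""
    return ["OK" if w <= threshold else "BAD" for w in wers]


def f1_for_class(pred_labels, true_labels, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(1 for p, t in pairs if p == positive and t == positive)
    fp = sum(1 for p, t in pairs if p == positive and t != positive)
    fn = sum(1 for p, t in pairs if p != positive and t == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    true_wer = [0.00, 0.05, 0.30, 0.50]
    predicted_wer = [0.02, 0.20, 0.25, 0.60]
    threshold = 0.1  # hypothetical value, for illustration only
    truth = label_by_wer(true_wer, threshold)
    guess = label_by_wer(predicted_wer, threshold)
    print("F1-OK:", f1_for_class(guess, truth, "OK"))
    print("F1-BAD:", f1_for_class(guess, truth, "BAD"))
```

Reporting both class-wise scores guards against a model that looks good simply by predicting the majority label.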
In this study, we first proposed a deep architecture, speech-BERT, which is seamlessly connected to speech-Transformer. Its key purpose is to pre-train the model on a large-scale ASR dataset, so that the output of the last layer of the whole architecture can be fed directly as downstream features without any manual feature engineering. Meanwhile, we designed a neural zero-inflated Beta regression layer, which closely matches the empirical distribution of WER. The main intuition is to regress a variable defined as a mixture of discrete and continuous distributions. With the proposed gradient pre-computation method, the loss function can still be optimized efficiently. However, a disadvantage of our approach is the heavy model built upon speech-Transformer, even though no auto-regressive decoding is performed during inference. Building the zero-inflated regression module on the Kaldi framework remains future work. Following the recent work [17], we will also attempt to improve the confidence scores at the token level.
- [1] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, 2012.
- [2] Zhehuai Chen, Justin Luitjens, Hainan Xu, Yiming Wang, Daniel Povey, and Sanjeev Khudanpur, “A GPU-based WFST decoder with exact lattice generation,” in Proc. Interspeech 2018, 2018, pp. 2212–2216.
- [3] Ngoc Tien Le, Advanced Quality Measures for Speech Translation, Ph.D. thesis, 2018.
- [4] Matthias Sperber, Graham Neubig, Jan Niehues, Sebastian Stüker, and Alex Waibel, “Lightly supervised quality estimation,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 3103–3113.
- [5] Shahab Jalalvand, Matteo Negri, Marco Turchi, José GC de Souza, Falavigna Daniele, and Mohammed RH Qwaider, “TranscRater: a tool for automatic speech recognition quality estimation,” in Proceedings of ACL-2016 System Demonstrations, 2016, pp. 43–48.
- [6] Ahmed Ali and Steve Renals, “Word error rate estimation for speech recognition: e-WER,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 20–24.
- [7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
- [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
- [9] Guillaume Lample and Alexis Conneau, “Cross-lingual language model pretraining,” in Advances in Neural Information Processing Systems, 2019.
- [10] Linhao Dong, Shuang Xu, and Bo Xu, “Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
- [11] Pei Zhang, Niyu Ge, Boxing Chen, and Kai Fan, “Lattice transformer for speech translation,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- [12] Ondrej Bojar, Christian Federmann, Barry Haddow, Philipp Koehn, Matt Post, and Lucia Specia, “Ten years of WMT evaluation campaigns: Lessons learnt,” Translation Evaluation: From Fragmented Tools and Data Sets to an Integrated Ecosystem, p. 27, 2016.
- [13] Kai Fan, Jiayi Wang, Bo Li, Fengming Zhou, Boxing Chen, and Luo Si, “Bilingual expert can find translation errors,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 6367–6374.
- [14] Raydonal Ospina and Silvia LP Ferrari, “Inflated beta distributions,” Statistical Papers, vol. 51, no. 1, pp. 111, 2010.
- [15] Veronika Ročková and Edward I George, “The spike-and-slab lasso,” Journal of the American Statistical Association, vol. 113, no. 521, pp. 431–444, 2018.
- [16] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 273–278.
- [17] Prakhar Swarup, Roland Maas, Sri Garimella, Sri Harish Mallidi, and Björn Hoffmeister, “Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings,” in Proc. Interspeech 2019, 2019, pp. 2175–2179.