1 Introduction
Automatic Speech Recognition (ASR) has made remarkable progress since the advances of deep learning [1] and powerful computational resources [2]. However, current ASR systems are still imperfect because of objective physical constraints, such as the variability of microphones or background noise. Thus, quality estimation (QE) is a practical desideratum for developing and deploying speech language technology, enabling users and researchers to judge the overall performance of the output, detect bad cases, refine algorithms, and choose among competing systems in a specific target environment. This paper focuses on the situation where golden references are not available for estimating a reliable word error rate (WER).
In the research direction of ASR QE without transcripts, a two-stage framework consisting of feature extraction and WER prediction has been the long-standing paradigm. Classical pioneering works mainly rely on handcrafted features [3] and use them to build regression-based algorithms, including an aggregation method with extremely randomized trees [4], the SVM-based TranscRater [5], and e-WER [6]. In this work, instead of relying heavily on manual labor, we propose to derive the feature representations from a pretrained conditional bidirectional language model, speechBERT, which aims to predict the relationships between the raw fbank features and utterances by analyzing them holistically. The training data required for speechBERT is exactly the same as that for conventional ASR, without any additional human annotation. Subsequently, for the WER prediction stage, we analyze the empirical distribution of WER for most ASR systems (Fig. 1) and find that WER values tend to concentrate near 0, while the non-zero values approximately follow a Beta or Gaussian distribution between 0 and 1. Therefore, during fine-tuning, we introduce a neural zero-inflated regression layer on top of speechBERT, fitting the target distribution more appropriately.
In summary, this paper makes the following main contributions. i) We propose a bidirectional language model conditioned on speech features, which aims to improve the feature representations for ASR downstream tasks. A bonus experiment shows that tying the parameters of speechBERT and speechTransformer can accelerate convergence during training. ii) We introduce a neural zero-inflated Beta regression layer that particularly fits the empirical distribution of WER. For the gradient backpropagation of the neural Beta regression layer, we design an efficient pre-computation method. iii) Our experimental results demonstrate that our ASR quality estimation model can achieve state-of-the-art performance under fair comparison.
2 Related Works
Transformer [7] has been extensively explored in natural language processing [8, 9] and has become popular in speech [10, 11]. The motivation of our speechBERT comes from the success of BERT [8], which demonstrated the importance of bidirectional pretraining for language representations and reduced the need for heavily-engineered task-specific architectures. We also adopt the masked language model loss −log p(x_m | x_∖m) as our training criterion, where x = (x_1, …, x_n) represents all tokens/utterances in one sentence, x_m is the masked subset, and x_∖m is the remaining context. However, the major difference is that speechBERT is conceptually a conditional language model, designed to capture the subtle correlations between speech features and utterances as well as the syntactic information.
In order to build the conditional masked language model p(x_m | x_∖m, s), where s is the speech feature sequence corresponding to x, we have to discard the single transformer encoder architecture of BERT, since it is difficult for one encoder to consume two sequences of different modalities.¹ Instead, we modify speechTransformer [10] by changing its autoregressive decoder to a parallel memory encoder, resulting in an encoder-memory-encoder architecture, where the speech and text domains can be separately controlled by two different encoders. The "memory" refers to the outputs of the speech encoder consuming the spectrogram inputs.
¹ It is doable if using XLM [9].
When the feature representations are ready, the quality estimation task is typically reduced to either a regression or a classification problem, given the type of predicted values. In machine translation (MT) quality estimation [12], a similar real-valued metric, the translation error rate (TER) between 0 and 1, is the target of the model and is predicted at the sentence level. The transformer-based predictor-estimator framework [13] established a state-of-the-art record in the WMT 2018 QE competition; it restricts the output to the expected interval by applying a sigmoid function before the regression. However, a standard regression model is probably not suitable for fitting the WER of ASR systems. Due to the subjectivity and non-uniqueness of the translation task, it is relatively easy to produce a significant gap between machine and human translations. As mentioned above, we observe that the distribution of WER is empirically more zero-concentrated than that of TER, making a straightforward linear regression easily biased.
We propose a neural zero-inflated regression layer, enlightened by statistical inflated distributions [14, 15], which are capable of simulating a mixed continuous-discrete distribution. For a random variable y following such a distribution, one typical representation is given by a weighted mixture of two distributions:
p(y) = Σ_{c∈C} π_c · 1[y = c] + (1 − Σ_{c∈C} π_c) · f(y)    (1)
where C is a finite set and 1[y = c] is an indicator function whose value equals 1 if y equals c. In our case, y denotes WER, so we have C = {0} and a single mixture weight π = π_0. Particularly, we recommend assuming the continuous component f to be a Beta distribution for ASR QE. Therefore, π simply represents the probability of the event that y takes the value 0 or not, resulting in a mixture of a Bernoulli and a Beta distribution. Additionally, we use one classification neural network to simulate the Bernoulli variable and a regression neural network to simulate the continuous variable, thus obtaining a differentiable deep neural architecture that can be fine-tuned together with the parameters of speechBERT. In this way, we divide ASR QE modeling into hierarchical multi-task learning, where the first step is to decide whether the ASR output is perfect or not, and the second step is to regress the WER value only for the imperfect outputs.
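As a toy illustration of this hierarchy, the following sketch combines a Bernoulli "flawless" classifier with a Beta-mean regressor into an expected WER; all function and variable names here are ours, not the paper's.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_wer(logit_zero, logit_mean):
    """Two-step zero-inflated prediction (illustrative names).

    logit_zero: classifier logit for the event WER == 0 (the Bernoulli pi).
    logit_mean: regressor logit for the Beta mean mu of non-zero WER.
    Returns the expected WER, E[y] = (1 - pi) * mu.
    """
    pi = sigmoid(logit_zero)   # probability the transcript is flawless
    mu = sigmoid(logit_mean)   # mean WER given that it is not flawless
    return (1.0 - pi) * mu

# A transcript the classifier is confident is perfect contributes ~0 WER.
print(predict_wer(6.0, 0.0))  # pi ~ 0.9975, mu = 0.5 -> approximately 0.0012
```

The point of the decomposition is that a confident zero-classifier pushes the prediction toward 0 even when the regressor is uninformative.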
3 Methodology
3.1 SpeechBERT
The backbone structure of speechBERT originates from speechTransformer [10] by adapting the transformer decoder to a memory encoder (Fig. 2). To achieve this goal, we need two simple modifications. First, we randomly corrupt 15% of the tokens in the transcription at each training step. We introduce a new special token "[mask]" analogous to standard BERT and substitute it for the tokens that require masking. Notice that in practice, of the 15% of tokens that require prediction during pretraining, 12% are masked, 1.5% are substituted with random tokens, and 1.5% are left unchanged.
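A minimal sketch of this masking scheme; the function name is ours, and the 80%/10%/10% split within the selected 15% is our reading of the 12%/1.5%/1.5% totals above.

```python
import random

MASK = "[mask]"

def mask_tokens(tokens, vocab, rng):
    """BERT-style masking: select ~15% of positions for prediction; of those,
    80% become [mask], 10% a random vocabulary token, 10% stay unchanged
    (i.e., 12% / 1.5% / 1.5% of all tokens)."""
    out = list(tokens)
    targets = {}  # position -> original token, used by the MLM loss
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return out, targets
```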
Second, we remove the future mask matrix in the self-attention of the decoder, which can be concisely written as a unified formulation:
Attention(Q, K, V) = softmax(QK^⊤ / √d_k + 1[ST] · M) V    (2)
where the indicator 1[ST] equals 1 if the model architecture is speechTransformer. The Q, K, V are the output queries, keys, and values from the previous layer of the decoder or memory encoder, and M is a triangular matrix with M_ij = −∞ if i < j and 0 otherwise. In the case of the decoder, the self-attention represents the information of all positions in the decoder up to and including the current position. In the case of the memory encoder, it represents the information of all positions except the masked ones. Other details are similar to the standard transformer [7]. The advantage of the unified formulation is that it allows us to straightforwardly implement multi-task learning in a weight-tying architecture by altering the mask matrix, resulting in the following loss.
L(θ) = L_MLM(θ) + λ · L_ASR(θ)    (3)
where the model parameter θ is shared across speechBERT and speechTransformer. The extra ASR loss also differentiates our model from standard BERT, whose additional loss is designed for the next sentence prediction task. In the multi-task learning experiments, we set λ to keep the two losses at a consistent scale.
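The switchable future mask of Eq. (2) can be sketched in pure Python (single head, no learned projections, toy dimensions; all names are ours). With `causal=True` the code behaves like the speechTransformer decoder; with `causal=False`, like the memory encoder.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    m = max(x for x in row if x != NEG_INF)
    exps = [0.0 if x == NEG_INF else math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def unified_self_attention(q, k, v, causal):
    """Toy version of Eq. (2): scores = q k^T / sqrt(d) + 1[causal] * M,
    where M_ij = -inf for future positions j > i."""
    d = len(q[0])
    n = len(q)
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            s = sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
            if causal and j > i:
                s = NEG_INF  # the future mask of the decoder
            scores.append(s)
        w = softmax(scores)
        out.append([sum(w[j] * v[j][t] for j in range(n)) for t in range(d)])
    return out
```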
3.2 Neural ZeroInflated Regression Layer
SpeechBERT outputs a sequence of feature representations corresponding to every single token in the transcription. Theoretically, we could use the feature representation of an arbitrarily selected token for downstream tasks, like "[CLS]" in standard BERT, since the self-attention mechanism has already integrated all syntactic information into every feature, albeit in different ways. Intuitively, however, it is reasonable to use another feature fusion layer to encode the sequence of features together. Thus, we use one BiLSTM [16] layer to re-encode the features and output the final hidden state as the single feature vector for the quality estimation task.
Referring to Eq. (1), we first define a binary classifier to indicate whether the ASR result is flawless or not, i.e., following the Bernoulli distribution:
π = p(y = 0 | x, s) = σ(w_c^⊤ h + b_c)    (4)
where h is the final BiLSTM state and σ is the sigmoid function.
For the subsequent regression model, it is no longer necessary to predict the case of zero WER due to the existence of the above classifier. This fact naturally advocates the choice of Beta regression, because the Beta distribution is not defined at zero. For statistical distributions, the most important statistics are usually the first two moments, i.e., mean and variance. In our proposal, we mainly model the mean μ, which is the actual target in our final prediction, and derive the variance of the Beta distribution:
μ = σ(w_r^⊤ h + b_r),    Var[y] = μ(1 − μ) / (1 + φ)    (5)
where φ is a hyperparameter that can be interpreted as a precision parameter and can be estimated from the training data. The parameterized density function is expressed as follows:
f(y; μ, φ) = Γ(φ) / (Γ(μφ) Γ((1 − μ)φ)) · y^{μφ − 1} (1 − y)^{(1 − μ)φ − 1}    (6)
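The density of Eq. (6) can be evaluated stably with log-Gamma functions; the following is a minimal numerical sketch (our own helper, not the paper's implementation), with a sanity check that the density integrates to roughly 1 over (0, 1).

```python
import math

def beta_logpdf(y, mu, phi):
    """Log of Eq. (6) in the mean/precision parameterization:
    alpha = mu*phi, beta = (1-mu)*phi, with 0 < y < 1, 0 < mu < 1, phi > 0."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

# Midpoint-rule check: total probability mass over (0, 1) should be ~1.
grid = [(i + 0.5) / 10000 for i in range(10000)]
mass = sum(math.exp(beta_logpdf(y, 0.3, 8.0)) for y in grid) / 10000
```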
Combining Eqs. (4), (5), and (6), the training objective with the neural zero-inflated Beta regression layer is to maximize the log-likelihood of the proposed distribution of WER:
L_QE = Σ_i [ 1[y_i = 0] log π_i + 1[y_i > 0] (log(1 − π_i) + log f(y_i; μ_i, φ)) ]    (7)
This is a hierarchical loss for two consecutive sub-problems, but it can be optimized simultaneously: the first term requires the whole fine-tuning dataset, while the second term is only fed with the data of inaccurate transcriptions. With this loss function, we use the expected prediction during inference, i.e., ŷ = E[y] = (1 − π)μ.
3.3 Gradient Pre-Computation
A crucial issue for the Beta regression layer is that the gradient computation involved in Eq. (6) is not straightforward, since direct auto-differentiation of the training objective is obliged to calculate the gradient of a compound Gamma function Γ(g_θ(x)), where g_θ simply denotes the computational graph or function with the input x and the output μφ or (1 − μ)φ.
Instead, we utilize a gradient pre-computation trick such that the backpropagation becomes less cumbersome, by introducing an objective equivalent to the log-likelihood ℓ(μ) = log f(y; μ, φ):
ℓ(μ) = log Γ(φ) − log Γ(μφ) − log Γ((1 − μ)φ) + (μφ − 1) log y + ((1 − μ)φ − 1) log(1 − y)    (8)
ℓ̃(μ) = −sg[ψ(μφ)] · μφ − sg[ψ((1 − μ)φ)] · (1 − μ)φ + (μφ − 1) log y + ((1 − μ)φ − 1) log(1 − y)    (9)
∂ℓ/∂μ = φ (log y − log(1 − y) − ψ(μφ) + ψ((1 − μ)φ))    (10)
where ψ is the digamma function² and sg[·] denotes the stop-gradient operation. The equivalence is essentially in the sense of gradient computation; in other words, the stochastic gradient optimization will remain exactly the same, because we can derive the following identity with some algebraic calculation:
∇_θ ℓ̃ = ∇_θ ℓ    (11)
In the new objective, we have successfully circumvented direct gradient backpropagation with respect to log Γ(·), since the complicated term sg[ψ(·)] merely involves a forward digamma computation with a stop-gradient operation, while the remaining terms can readily and efficiently contribute to the backpropagation because they consist only of common operations in deep neural networks.
² Most deep learning packages have a built-in function, e.g., tf.digamma in TensorFlow.
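The claimed gradient equivalence can be checked numerically: treating the digamma coefficients as constants (the stop-gradient, frozen at the current point μ₀) reproduces the gradient of the exact log-likelihood at μ₀. The sketch below uses our own names and approximates digamma by a central difference of lgamma, since Python's stdlib has no digamma.

```python
import math

def digamma(x, h=1e-5):
    # Numerical stand-in for tf.digamma: central difference of lgamma.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def loglik(mu, y, phi):
    """Exact Beta log-likelihood, Eq. (8), as a function of the mean mu."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def surrogate(mu, mu0, y, phi):
    """Surrogate objective, Eq. (9): the log-Gamma terms are replaced by
    linear terms whose digamma coefficients are evaluated at mu0 and treated
    as constants (i.e., under a stop-gradient)."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (-digamma(mu0 * phi) * a - digamma((1.0 - mu0) * phi) * b
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def grad(f, mu, eps=1e-6):
    return (f(mu + eps) - f(mu - eps)) / (2.0 * eps)

mu0, y, phi = 0.3, 0.2, 8.0
g_exact = grad(lambda m: loglik(m, y, phi), mu0)
g_surr = grad(lambda m: surrogate(m, mu0, y, phi), mu0)
```

At mu0 the two gradients coincide, which is the content of Eq. (11).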
4 Experiments
In order to validate the effectiveness of our approach, the ASR quality estimation model is evaluated with two popular measures: Pearson correlation (larger is better) and mean absolute error (MAE, smaller is better).
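For concreteness, the two measures can be computed as follows; this is a standard textbook sketch, not the paper's evaluation script.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predictions xs and references ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mae(xs, ys):
    """Mean absolute error between predictions and references."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
```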
4.1 Experimental Settings
We conduct our experiments on two types of data. One is a large Mandarin speech recognition corpus containing 20,000 hours of training data with about 20 million sentences, which is used for speechBERT pretraining. We evaluate the performance of pretraining via the prediction accuracy on masked tokens. The other is a small speech recognition quality estimation dataset of 240 hours, which never appears in the pretraining data. The speech recognition system whose quality we want to estimate is an in-house ASR engine based on Kaldi.³ The WER computed from the ASR results and the ground-truth transcripts is the target our model will predict. Correspondingly, we have two test sets for the quality estimation model, in-domain and out-of-domain, each including 3,000 sentences. The acoustic features used in our implemented model are 80-dimensional log-mel filterbank (fbank) energies computed on a 25 ms window with a 10 ms shift. We stack the consecutive frames within a context window of 4 to produce 320-dimensional features for the computational efficiency of the speech encoder. The speechBERT is trained on 8 Tesla V100 GPUs for about 10 days until convergence. The quality estimation model is fine-tuned on 4 Tesla V100 GPUs for several hours.
³ https://github.com/kaldi-asr/kaldi
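The frame stacking described above can be sketched as follows; the function name is ours, and we assume a stride equal to the context window (which also downsamples the sequence by 4x), a detail the text does not state explicitly.

```python
def stack_frames(frames, context=4):
    """Stack every `context` consecutive fbank frames into one vector
    (80-dim x 4 -> 320-dim), assuming stride == context."""
    stacked = []
    for i in range(0, len(frames) - context + 1, context):
        vec = []
        for frame in frames[i:i + context]:
            vec.extend(frame)  # concatenate the 80-dim frames
        stacked.append(vec)
    return stacked
```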
4.2 PreTraining Results
Table 1: Pretraining results with different loss functions.

Loss                      | L_MLM only | L_ASR only | L_MLM + λ·L_ASR
Masked Token Predict Acc  | 95.39%     | NA         | 94.81%
WER (beam=5)              | NA         | 9.23       | 10.14
We train the speechBERT model with three different loss functions and summarize the results in Table 1. Basically, we observe that the jointly trained model achieves performance comparable to the two separately trained tasks. We also visualize the attention between the speech encoder and the text decoder or memory encoder in Fig. 3. The attention weights are averaged over 8 heads, and the overall patterns of joint training and separate training are relatively similar. We adopt the simultaneously pretrained model for our downstream quality estimation task, since we hypothesize that the additional supervision in multi-task learning may incorporate more syntactic information into the hidden representations.
Table 2: Performance of different last layers on the out-of-domain test set.

Last Layer                        | Pearson
Zero-Inflated Beta Regression     | 0.5786 ± 0.0041
Linear Regression                 | 0.5486 ± 0.0086
Zero-Inflated Linear Regression   | 0.5738 ± 0.0019
Logistic Regression               | 0.5501 ± 0.0006
Zero-Inflated Logistic Regression | 0.5726 ± 0.0061
4.3 Fine-Tuning Results
For the quality estimation model, we first explore the advantages of the zero-inflated model and Beta regression by varying the prior distribution of WER. The performance of five different last layers, evaluated on the out-of-domain test set, is shown in Table 2. Notice that: i) we cannot simply apply Beta regression to the last layer without zero-inflation, since a zero WER would violate the support of the Beta distribution; ii) linear regression does not necessarily imply a pure Gaussian distribution, since the output still has to conform to the interval [0, 1], so the mean of the Gaussian cannot be arbitrarily large but has a sigmoid function applied in advance; iii) as described in ii), logistic regression differs from linear regression only in the loss function (cross-entropy vs. mean squared error); iv) the precision parameter φ of Beta regression is a hyperparameter estimated from the training data. We employ maximum likelihood estimation with another common parameterization (α, β), where α = μφ and β = (1 − μ)φ.
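Besides maximum likelihood, a simple moment-based estimate of the precision follows by inverting the variance formula of Eq. (5), Var[y] = μ(1 − μ)/(1 + φ); the sketch below uses our own names and is a stand-in for, not the paper's, estimator.

```python
def estimate_precision(wers):
    """Moment-based estimate of the Beta precision phi from non-zero WERs,
    by inverting Var[y] = mu*(1-mu)/(1+phi)."""
    ys = [y for y in wers if 0.0 < y < 1.0]  # Beta support excludes 0 and 1
    n = len(ys)
    mu = sum(ys) / n
    var = sum((y - mu) ** 2 for y in ys) / n
    return mu * (1.0 - mu) / var - 1.0
```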
The second experiment, conducted on our in-domain test set, compares our ASR QE model as an integrated pipeline with the state-of-the-art machine translation quality estimation model QE Brain [13]. Notice that for a fair comparison, we modify the text encoder of QE Brain to exactly the same speech encoder as ours and its last layer to a zero-inflated regression one. In addition to the previous metrics, we also introduce F1-OK/BAD to evaluate whether a recognition result is acceptable or not, which is prevalent in word-level quality estimation of machine translation. The overall results are shown in Table 3, where we label the acceptable recognition results by thresholding WER. SpeechBERT outperforms QE Brain in all aspects. Furthermore, we analyze the Pearson correlation with respect to different sentence lengths in Fig. 4, using two linear regressions to fit the trend of decreasing performance as sentence length grows. QE Brain performs better when the sentence is shorter, but speechBERT maintains stable performance across all length ranges. This finding makes sense because longer sentences are likely to have lower or even zero WER, which is better handled by the zero-inflated Beta regression layer.
Table 3: Comparison with QE Brain on the in-domain test set.

Method     | MAE   | Pearson | F1-OK/BAD
QE Brain   | 0.073 | 0.7829  | 0.4956
speechBERT | 0.056 | 0.8187  | 0.5372
5 Conclusions
In this study, we first proposed a deep architecture, speechBERT, which is seamlessly connected to speechTransformer. The key purpose is to pretrain the model on a large-scale ASR dataset so that the last layer of the whole architecture can directly serve as downstream features without any manual labor. Meanwhile, we designed a neural zero-inflated Beta regression layer, which practically coheres with the empirical distribution of WER. The main intuition is to regress a variable defined as a mixture of discrete and continuous distributions. With the elaborated gradient pre-computation method, the loss function can still be optimized efficiently. However, we also notice that the disadvantage of our approach is the heavy model built upon speechTransformer, even though no autoregressive decoding is performed during inference. Building the zero-inflated regression module within the Kaldi framework remains future work. Following recent work [17], we will also attempt to improve confidence scores at the token level.
References
 [1] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, 2012.
 [2] Zhehuai Chen, Justin Luitjens, Hainan Xu, Yiming Wang, Daniel Povey, and Sanjeev Khudanpur, “A GPU-based WFST decoder with exact lattice generation,” in Proc. Interspeech 2018, 2018, pp. 2212–2216.
 [3] Ngoc Tien Le, Advanced Quality Measures for Speech Translation, Ph.D. thesis, 2018.
 [4] Matthias Sperber, Graham Neubig, Jan Niehues, Sebastian Stüker, and Alex Waibel, “Lightly supervised quality estimation,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 3103–3113.
 [5] Shahab Jalalvand, Matteo Negri, Marco Turchi, José G.C. de Souza, Daniele Falavigna, and Mohammed R.H. Qwaider, “TranscRater: a tool for automatic speech recognition quality estimation,” in Proceedings of ACL-2016 System Demonstrations, 2016, pp. 43–48.
 [6] Ahmed Ali and Steve Renals, “Word error rate estimation for speech recognition: e-WER,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 20–24.
 [7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
 [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
 [9] Guillaume Lample and Alexis Conneau, “Cross-lingual language model pretraining,” in Advances in Neural Information Processing Systems, 2019.
 [10] Linhao Dong, Shuang Xu, and Bo Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
 [11] Pei Zhang, Niyu Ge, Boxing Chen, and Kai Fan, “Lattice transformer for speech translation,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
 [12] Ondrej Bojar, Christian Federmann, Barry Haddow, Philipp Koehn, Matt Post, and Lucia Specia, “Ten years of wmt evaluation campaigns: Lessons learnt,” Translation Evaluation: From Fragmented Tools and Data Sets to an Integrated Ecosystem, p. 27, 2016.

 [13] Kai Fan, Jiayi Wang, Bo Li, Fengming Zhou, Boxing Chen, and Luo Si, “Bilingual expert can find translation errors,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 6367–6374.
 [14] Raydonal Ospina and Silvia LP Ferrari, “Inflated beta distributions,” Statistical Papers, vol. 51, no. 1, pp. 111, 2010.
 [15] Veronika Ročková and Edward I George, “The spike-and-slab lasso,” Journal of the American Statistical Association, vol. 113, no. 521, pp. 431–444, 2018.
 [16] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 273–278.
 [17] Prakhar Swarup, Roland Maas, Sri Garimella, Sri Harish Mallidi, and Björn Hoffmeister, “Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings,” in Proc. Interspeech 2019, 2019, pp. 2175–2179.