1 Introduction
The most widely used metric for measuring the performance of automatic speech recognition (ASR) systems is the word error rate (WER), which is derived from the Levenshtein distance computed at the word level:
(1) $\mathrm{WER} = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} N_i}$
where $N_i$ is the number of words in the $i$th sentence (i.e. the reference transcript of the audio) of the evaluation dataset, and $E_i$ represents the sum of insertion, deletion, and substitution errors computed from the dynamic string alignment of the recognized word sequence with the reference word sequence.
A practical question commonly raised in ASR evaluations is how reliable an observed improvement of ASR system B over ASR system A is. For example, if we obtain an absolute 0.2% WER reduction, how can we tell whether this improvement is real and not due to random chance?
This is where statistical hypothesis testing comes into play. The use of statistical testing in ASR evaluations has been explored for a long time [gillick1989some, pallett1993benchmark, strik2000comparing, bisani2004bootstrap]. In particular, the work of [bisani2004bootstrap] presents a bootstrap method for significance analysis which makes no distributional approximations and whose results are immediately interpretable in terms of WER.
To be more specific, suppose we have a sequence of independent and identically distributed (i.i.d.) random variables $X_1, X_2, \ldots, X_n$, and we are interested in estimating the variance of some statistic $T_n = T(X_1, \ldots, X_n)$. The bootstrap method [efron1994introduction] resamples data from the empirical distribution of $X_1, \ldots, X_n$ and then recalculates the statistic on each of these "bootstrap" samples. The variance of $T_n$ can then be estimated from the sample variance of these computed statistics.

For the ASR system comparison problem raised previously, the authors in [bisani2004bootstrap] proposed using the bootstrap approach above to resample (with replacement) the utterances in the evaluation dataset for each replicate and then estimate the probability that the absolute WER difference of ASR system B versus system A
(2) $\Delta\mathrm{WER} = \mathrm{WER}_B - \mathrm{WER}_A = \frac{\sum_{i=1}^{n} E_i^{B} - \sum_{i=1}^{n} E_i^{A}}{\sum_{i=1}^{n} N_i}$
is positive, where ASR systems A and B have word error counts $E_i^{A}$ and $E_i^{B}$ on the $i$th sentence, respectively. Note that the difference in the number of errors of the two systems is calculated on identical bootstrap samples.
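As an illustration, the utterance-level resampling described above can be sketched in a few lines of Python; the function name and the per-utterance count arrays are our own notation, not from [bisani2004bootstrap]:

```python
import numpy as np

def bootstrap_wer_diff(n_words, errors_a, errors_b, n_boot=10000, seed=0):
    """Ordinary (utterance-level) bootstrap of the absolute WER difference.

    n_words, errors_a, errors_b: per-utterance word counts and error counts
    of ASR systems A and B, aligned on the same utterances.
    Returns the bootstrap samples of delta-WER = WER_B - WER_A.
    """
    rng = np.random.default_rng(seed)
    n = len(n_words)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        # resample utterance indices with replacement; the same indices are
        # used for both systems, keeping the comparison paired
        idx = rng.integers(0, n, size=n)
        total_words = n_words[idx].sum()
        deltas[b] = (errors_b[idx].sum() - errors_a[idx].sum()) / total_words
    return deltas
```

The fraction of bootstrap replicates with a negative difference, `(deltas < 0).mean()`, then estimates the probability that system B improves over system A.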
One of the key issues confronting bootstrap resampling approximations is how to deal with dependent data. This is particularly the case for speech data, since speech recognition errors can be highly correlated across different utterances if they are 1) from the same speaker, 2) similar in spoken content (e.g. in the same domain or topic), or both. When the dependence structure across the data is nontrivial, the true sampling distribution of the statistic depends on the joint distribution of the observations, and thus the bootstrap subsamples should preserve such dependence structure as well. However, the reshuffled resamples obtained from the ordinary bootstrap method break this dependence and thus lead to wrong variance estimates of the statistic.

The work in [hall1985resampling, carlstein1986use] developed the fixed blockwise bootstrap approach for dependent time-series data and showed that the variance estimator is consistent under mild conditions. In particular, they propose dividing the time series into nonoverlapping blocks of equal length and resampling these blocks instead of the original data points.
In this paper, we present the blockwise bootstrap method for statistical testing of ASR performance evaluation. To address the question that we raised previously, we focus on computing the confidence interval (CI) of the absolute WER difference between two ASR systems.
The rest of the paper is organized as follows. Section 2 introduces the use of the blockwise bootstrap on the ASR evaluation problem. Section 3 shows that the blockwise bootstrap variance estimator of the absolute WER difference of two ASR systems is consistent under mild conditions. Sections 4 and 5 demonstrate the validity of the blockwise bootstrap method on simulated synthetic data and real-world speech data. We conclude in Section 6.
2 Methodology
In this section, we describe the blockwise bootstrap method used to compute the confidence interval of the absolute WER difference in formula (2).
For the $i$th utterance, where $i = 1, \ldots, n$, we have the evaluation results of the two ASR systems A and B as follows:
(3) $\big( N_i, \; E_i^{A}, \; E_i^{B} \big), \quad i = 1, \ldots, n$
where $N_i$ is the number of words, and $E_i^{A}$ and $E_i^{B}$ represent the numbers of word errors of ASR systems A and B, respectively. The statistic of interest is the absolute WER difference of system B versus system A, as given in formula (2).
Suppose the data above can be partitioned into $K$ nonoverlapping blocks
(4) $\mathcal{B}_1, \mathcal{B}_2, \ldots, \mathcal{B}_K$
where $\bigcup_{k=1}^{K} \mathcal{B}_k = \{1, \ldots, n\}$, and $\mathcal{B}_j \cap \mathcal{B}_k = \emptyset$ for any $j \neq k$. The blockwise bootstrap method works as follows.
For each bootstrap replicate $b = 1, \ldots, M$, we randomly sample (with replacement) $K$ elements from the set $\{\mathcal{B}_1, \ldots, \mathcal{B}_K\}$ to generate a bootstrap sample
(5) $\mathcal{B}_1^{*b}, \mathcal{B}_2^{*b}, \ldots, \mathcal{B}_K^{*b}$
Here each $\mathcal{B}_k^{*b} \in \{\mathcal{B}_1, \ldots, \mathcal{B}_K\}$. Then for this bootstrap replicate, the statistic is computed as
(6) $\Delta\mathrm{WER}^{*b} = \frac{\sum_{k=1}^{K} \sum_{i \in \mathcal{B}_k^{*b}} \big( E_i^{B} - E_i^{A} \big)}{\sum_{k=1}^{K} \sum_{i \in \mathcal{B}_k^{*b}} N_i}$
Once we have all $\Delta\mathrm{WER}^{*b}$ for $b = 1, \ldots, M$, the 95% confidence interval for the absolute WER difference can be determined by the empirical percentiles at 2.5% and 97.5% of the $M$ bootstrap samples: $\big[ \Delta\mathrm{WER}^{*}_{(2.5\%)}, \; \Delta\mathrm{WER}^{*}_{(97.5\%)} \big]$.
Alternatively, the uncertainty of $\Delta\mathrm{WER}$ can be quantified by its standard error, which can be approximated by the sample standard deviation of the $M$ bootstrap samples

(7) $\widehat{\mathrm{se}} = \sqrt{ \frac{1}{M - 1} \sum_{b=1}^{M} \big( \Delta\mathrm{WER}^{*b} - \overline{\Delta\mathrm{WER}^{*}} \big)^2 }$

where $\overline{\Delta\mathrm{WER}^{*}} = \frac{1}{M} \sum_{b=1}^{M} \Delta\mathrm{WER}^{*b}$. Then based on the Gaussian approximation, the 95% confidence interval for the absolute WER difference can be obtained as $\Delta\mathrm{WER} \pm 1.96 \, \widehat{\mathrm{se}}$.
Note that in this paper we mainly focus on the confidence intervals for the absolute WER difference. Similarly, the blockwise bootstrap method can also be utilized to compute the confidence intervals for the WER itself as well as the relative WER difference between two ASR systems.
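As an illustration, the blockwise procedure of this section can be sketched in Python as follows; the function name and the input format (one block label per utterance) are our own assumptions:

```python
import numpy as np

def blockwise_bootstrap_ci(n_words, errors_a, errors_b, block_ids,
                           n_boot=10000, seed=0):
    """Blockwise bootstrap 95% CI for delta-WER = WER_B - WER_A.

    block_ids: array assigning each utterance to a block (e.g. a speaker
    or conversation id); whole blocks are resampled with replacement.
    Returns (percentile_ci, gaussian_ci).
    """
    rng = np.random.default_rng(seed)
    blocks = [np.flatnonzero(block_ids == k) for k in np.unique(block_ids)]
    K = len(blocks)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        # sample K blocks with replacement and pool their utterances
        idx = np.concatenate([blocks[j] for j in rng.integers(0, K, size=K)])
        deltas[b] = (errors_b[idx].sum() - errors_a[idx].sum()) / n_words[idx].sum()
    # empirical-percentile CI
    pct_ci = (np.percentile(deltas, 2.5), np.percentile(deltas, 97.5))
    # Gaussian-approximation CI around the observed delta-WER
    delta_hat = (errors_b.sum() - errors_a.sum()) / n_words.sum()
    se = deltas.std(ddof=1)
    gauss_ci = (delta_hat - 1.96 * se, delta_hat + 1.96 * se)
    return pct_ci, gauss_ci
```

Block labels can be, for example, speaker or conversation identifiers; choosing them is the modeling step, and the resampling itself is only a few lines.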
3 Statistical Properties
In this section we develop the statistical theory showing that the blockwise bootstrap variance estimator of the absolute WER difference is consistent under mild conditions.
For simplicity, we assume all utterances in the evaluation dataset have the same number of words, that is, $N_i = N$ for all $i$. Let's denote $D_i = (E_i^{B} - E_i^{A}) / N$. Then the statistic of interest can be written as
(8) $Z_n = \frac{1}{n} \sum_{i=1}^{n} D_i$
where the subscript $n$ in $Z_n$ indicates the number of samples (i.e. utterances) corresponding to the quantity.
By dividing the $n$ utterances into nonoverlapping blocks, suppose each block has the same number of utterances, denoted as $m$. Let $K = n / m$ be the number of blocks.
Without loss of generality, assume that the blocks are consecutive, so that the $k$th block consists of the sequence $D_{i_k}, D_{i_k + 1}, \ldots, D_{i_k + m - 1}$, where $i_k = (k-1) m + 1$. We further let
(9) $Z_m^{(i)} = \frac{1}{m} \sum_{t=i}^{i+m-1} D_t$
where the subscript $m$ in $Z_m^{(i)}$ indicates the block size and the superscript $(i)$ indicates the index of the starting variable in the block. Note that $Z_n = Z_n^{(1)}$.
Consider the blockwise variance estimator (i.e. standardized sample variance of blockwise estimators of the statistic)
(10) $\hat{\sigma}_n^2 = \frac{m}{K} \sum_{k=1}^{K} \big( Z_m^{(i_k)} - Z_n \big)^2$
The following theorem establishes its consistency results.
Theorem 1.
Assume the asymptotic variance of $Z_n$ is

(11) $\sigma^2 = \lim_{n \to \infty} n \, \mathrm{Var}(Z_n)$

and $\mathbb{E}[D_i] = \mu$ for any $i$. Let $m$ be s.t. $m \to \infty$ and $m/n \to 0$ as $n \to \infty$. If $\mathbb{E}[\zeta_k^2]$ is uniformly bounded, where $\zeta_k = m \big( Z_m^{(i_k)} - \mu \big)^2$, and for any $j \neq k$ the blockwise variables $\zeta_j$ and $\zeta_k$ are uncorrelated, then

(12) $\mathbb{E}\big| \hat{\sigma}_n^2 - \sigma^2 \big| \to 0 \quad \text{as } n \to \infty.$
Proof.
Let's denote $\zeta_k = m \big( Z_m^{(i_k)} - \mu \big)^2$ for $k = 1, \ldots, K$. Then the variance estimator can be written as

(13) $\hat{\sigma}_n^2 = \frac{1}{K} \sum_{k=1}^{K} \zeta_k - m \big( Z_n - \mu \big)^2$

where $i_k = (k-1) m + 1$. Note that $\frac{1}{K} \sum_{k=1}^{K} Z_m^{(i_k)} = Z_n$.

We will first show that the first term on the right-hand side of (13) converges to $\sigma^2$ in $L_2$. Notice that

(14) $\mathbb{E}\Big[ \frac{1}{K} \sum_{k=1}^{K} \zeta_k \Big] = \frac{1}{K} \sum_{k=1}^{K} m \, \mathrm{Var}\big( Z_m^{(i_k)} \big) \to \sigma^2$

as $n \to \infty$ and (thus) $m \to \infty$. Then it suffices to show $\mathrm{Var}\big( \frac{1}{K} \sum_{k=1}^{K} \zeta_k \big) \to 0$. This is true since

(15) $\mathrm{Var}\Big( \frac{1}{K} \sum_{k=1}^{K} \zeta_k \Big) \le \frac{1}{K^2} \sum_{k=1}^{K} \mathbb{E}\big[ \zeta_k^2 \big] \le \frac{C}{K} \to 0$

for some constant $C$ when $m$ and $K$ are sufficiently large. Here the first less-than-or-equal sign follows from the assumption that for any two blocks $j \neq k$, $\zeta_j$ and $\zeta_k$ are uncorrelated, and the second less-than-or-equal sign follows from the assumption that $\mathbb{E}[\zeta_k^2]$ is uniformly bounded.

Now we only need to show that the second term on the right-hand side of (13) converges to 0 in $L_1$, or equivalently, $\mathbb{E}\big[ m ( Z_n - \mu )^2 \big]$ converges to 0.

Note that $\mathbb{E}\big[ m ( Z_n - \mu )^2 \big] = m \, \mathrm{Var}(Z_n) = \frac{m}{n} \cdot n \, \mathrm{Var}(Z_n)$ and $\frac{m}{n} \cdot n \, \mathrm{Var}(Z_n) \to 0$ as $n \to \infty$ since $n \, \mathrm{Var}(Z_n)$ is bounded and $m/n \to 0$. ∎
Based on Theorem 1, we require both the number of blocks $K$ and the number of utterances $m$ in each block to go to infinity as the number of utterances $n$ grows to infinity. This is typically the case for speech data if the blocks are partitioned by different speakers or topics. On the other hand, the assumption of uncorrelatedness among different blocks seems strong, so we relax this assumption in the corollary below.
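As a quick numerical sanity check of this consistency result (ours, not part of the formal argument), one can simulate i.i.d. data, for which the asymptotic variance $\sigma^2$ equals the variance of a single observation, and compare it against the blockwise estimator in (10):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100_000, 100          # number of utterances and block size
K = n // m                   # number of blocks
sigma2 = 0.25                # Var(D_i) for i.i.d. D_i ~ N(0, 0.25)

D = rng.normal(0.0, np.sqrt(sigma2), size=n)
Z_n = D.mean()
# blockwise estimators Z_m^{(i_k)} over consecutive blocks of length m
Z_blocks = D.reshape(K, m).mean(axis=1)
sigma2_hat = (m / K) * np.sum((Z_blocks - Z_n) ** 2)

print(sigma2_hat)  # close to sigma2 = 0.25 for large n and m
```

For dependent data the estimator remains consistent under the conditions of Theorem 1, whereas the naive i.i.d. variance formula does not.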
Corollary 1.1.
Theorem 1 still holds if the assumption of uncorrelated blockwise variables is relaxed as follows: for any $\epsilon > 0$, if $m$ is large enough, for any $j$ and $k$ with $j \neq k$, assume

(16) $\big| \mathrm{Cov}\big( m ( Z_m^{(i_j)} - \mu )^2, \; m ( Z_m^{(i_k)} - \mu )^2 \big) \big| \le \epsilon.$
Proof.
It suffices to show $\mathrm{Var}\big( \frac{1}{K} \sum_{k=1}^{K} \zeta_k \big) \to 0$ under the relaxed assumption, where $\zeta_k = m \big( Z_m^{(i_k)} - \mu \big)^2$ as in the proof of Theorem 1.

Consider, for any $K$,

(17) $\mathrm{Var}\Big( \frac{1}{K} \sum_{k=1}^{K} \zeta_k \Big)$

(18) $= \frac{1}{K^2} \sum_{k=1}^{K} \mathrm{Var}(\zeta_k) + \frac{1}{K^2} \sum_{j \neq k} \mathrm{Cov}(\zeta_j, \zeta_k)$

(19) $\le \frac{C}{K} + \frac{1}{K^2} \sum_{j \neq k} \big| \mathrm{Cov}(\zeta_j, \zeta_k) \big|.$

Then under the assumption that $\big| \mathrm{Cov}(\zeta_j, \zeta_k) \big| \le \epsilon$ if $m$ is sufficiently large for any $j$ and $k$ with $j \neq k$, we have

(20) $\mathrm{Var}\Big( \frac{1}{K} \sum_{k=1}^{K} \zeta_k \Big) \le \frac{C}{K} + \frac{K(K-1)}{K^2} \, \epsilon$

(21) $\le \frac{C}{K} + \epsilon$

which converges to 0 as $K \to \infty$ and $\epsilon \to 0$. ∎
4 Simulation Studies
In this section, we conduct simulation experiments to show that the blockwise bootstrap approach is capable of generating valid confidence intervals for the absolute WER difference between two ASR systems and is superior to the ordinary bootstrap when the utterances in the evaluation dataset are correlated with each other.
4.1 Setup
In the simulation experiments, we generate synthetic data, i.e. counts of ASR errors, to measure the performance of the ordinary bootstrap and blockwise bootstrap methods. We assume the total number of utterances in the evaluation set is $n$ and that each utterance has the same number of words $N$. For the ASR systems A and B in the comparison, suppose the "ground-truth" WERs are given by $p_A$ and $p_B$, respectively. Thus the absolute WER difference between them is $p_B - p_A$.
Under the scenario where the numbers of errors from different utterances are independent of each other, we generate the number of errors for each utterance from the binomial distribution $\mathrm{Binomial}(N, p)$, where $p = p_A$ or $p_B$ depending on which ASR system was used. On the other hand, when the ASR errors are correlated across different utterances, we need to make an additional assumption on the correlation structure while keeping the marginal distribution of the error count on each utterance binomial.

Here we assume the numbers of errors across different utterances are block-correlated, that is, for any two utterances, their ASR errors are correlated if they belong to the same block while the errors are independent if they belong to different blocks. Without loss of generality, suppose the blocks are consecutive and the size of each block (i.e. the number of utterances in each block) is denoted as $m$. We follow the steps below to generate ASR errors for each block and each ASR system:

Generate a sample $(G_1, \ldots, G_m)$ from the multivariate Gaussian distribution $\mathcal{N}(0, \Sigma)$, where $\Sigma$ is an $m$ by $m$ covariance matrix with $\Sigma_{st} = 1$ if $s = t$ and $\Sigma_{st} = \rho$ if $s \neq t$. Here $\Sigma_{st}$ is the $(s, t)$th element of $\Sigma$;

Turn $(G_1, \ldots, G_m)$ into correlated uniforms $(U_1, \ldots, U_m)$, where $U_s = \Phi(G_s)$ for $s = 1, \ldots, m$ and $\Phi$ is the Gaussian cumulative distribution function (CDF);

Generate correlated binomial samples by inverting the binomial CDF: $E_s = F^{-1}(U_s)$ for $s = 1, \ldots, m$, where $F^{-1}$ is the inverse of the $\mathrm{Binomial}(N, p)$ CDF.
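As an illustration, the three steps above (a Gaussian copula construction) can be implemented with numpy and the Python standard library; the function name and parameter names are ours:

```python
import numpy as np
from math import erf, comb

def correlated_binomial_block(m, N, p, rho, rng):
    """Generate m block-correlated Binomial(N, p) error counts via a
    Gaussian copula: equicorrelated Gaussians -> uniforms -> binomials."""
    # Step 1: multivariate Gaussian with unit variances and correlation rho
    cov = np.full((m, m), rho) + (1.0 - rho) * np.eye(m)
    g = rng.multivariate_normal(np.zeros(m), cov)
    # Step 2: map to correlated uniforms through the Gaussian CDF
    u = np.array([0.5 * (1.0 + erf(x / np.sqrt(2.0))) for x in g])
    # Step 3: invert the Binomial(N, p) CDF (generalized inverse)
    cdf = np.cumsum([comb(N, k) * p**k * (1 - p)**(N - k) for k in range(N + 1)])
    return np.minimum(np.searchsorted(cdf, u), N)

rng = np.random.default_rng(0)
errors = correlated_binomial_block(m=10, N=20, p=0.1, rho=0.4, rng=rng)
```

Since each $U_s$ is marginally uniform, each error count is marginally $\mathrm{Binomial}(N, p)$, while the shared Gaussian correlation $\rho$ induces the within-block dependence.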
The ASR errors for different blocks and different systems are generated independently. For both the ordinary bootstrap and blockwise bootstrap methods in the comparison, we use the same resampling size $M$. The block size $m$ and correlation parameter $\rho$ are varied in our experiments.
4.2 Results
Strictly speaking, a 95% confidence interval means that if we were able to obtain 100 different datasets from the same distribution as the original data and compute a 95% confidence interval based on each of these datasets, then approximately 95 of these 100 confidence intervals would contain the true value of the statistic.
In our experiments, for each setup of block size and correlation parameter, we replicate the simulation 1000 times. Ideally, then, approximately 950 of these confidence intervals would contain the true absolute WER difference.
As seen from Table 1, the blockwise bootstrap method always gives valid confidence intervals: the percentage of confidence intervals that contain the true absolute WER difference is very close to 95%, regardless of the block size ($m$ up to 30) and correlation parameter ($\rho$ up to 0.4). As the block size or correlation increases, the width of the confidence intervals becomes larger. On the other hand, the ordinary bootstrap method fails to generate correct confidence intervals when the data are dependent ($\rho > 0$), as its percentage of confidence intervals that contain the truth is much lower than 95%.
Figure 1 plots the confidence intervals computed on the first 50 of the 1000 simulated datasets, where we can see that the confidence intervals from the blockwise bootstrap method are wider and capture the true absolute WER difference more often.
Block Size (m)  Correlation (ρ)  Width of CI (Bootstrap)  Percentage that CI Contains the Truth (Bootstrap)  Width of CI (Blockwise Bootstrap)  Percentage that CI Contains the Truth (Blockwise Bootstrap)
0.0030  94.1%  0.0030  94.7%  
0.0030  92.7%  0.0033  95.2%  
0.0030  90.1%  0.0035  94.3%  
0.0030  86.2%  0.0040  94.9%  
0.0030  76.9%  0.0048  94.0%  
0.0030  94.1%  0.0030  94.7%  
0.0030  78.1%  0.0046  95.2%  
0.0030  69.2%  0.0058  94.9%  
0.0030  54.4%  0.0077  94.7%  
0.0030  41.2%  0.0105  95.9% 
5 Real Data Analysis
In this section we apply the blockwise bootstrap method to two real-world speech datasets and demonstrate how it helps compute the confidence intervals of the absolute WER difference between two ASR systems.
We consider the following two ASR evaluation datasets:

Conversational Speech dataset. This speech conversation dataset was collected through crowdsourcing from a data supplier for speech ASR, and the data was properly anonymized. It consists of 235 conversations on more than 20 topics that are common in daily life, including family, travel, food, etc.;

Augmented Multi-Party Interaction (AMI) Meeting dataset. The AMI meeting corpus [aran2010multimodal] includes scenario meetings (with roles assigned to each participant) and non-scenario meetings (where participants were free to choose topics). For scenario meetings, each session is divided into 4 one-hour meetings. Each meeting has 4 participants.
In this analysis, we only use the "dev" and "eval" splits of the datasets above.
We use in-house developed conversational ASR systems in this investigation: a baseline model (denoted as ASR system A) and an improved model (denoted as ASR system B). We are interested in computing a 95% confidence interval of the absolute WER reduction between the two systems. If the upper bound of the confidence interval is still negative, we can tell that the improvement is real and not due to random chance.
To apply the blockwise bootstrap method, we need to define the correlated block structures among the utterances in the evaluation data. For the Conversation dataset, it is natural to treat each conversation as a single separate block since the same topic is discussed within a conversation. For the AMI Meeting data, we treat the utterances from each speaker in each (either scenario or non-scenario) meeting as a block. By doing so, for any two utterances, we assume their ASR errors are correlated if they belong to the same block while the errors have very weak correlations if they belong to different blocks.
Table 2 shows details of the two evaluation datasets in terms of number of utterances, number of total words, and number of estimated blocks with correlated utterances.
We apply both the ordinary bootstrap and blockwise bootstrap methods to the two evaluation datasets. Results are shown in Table 3. We observe that the confidence intervals computed from the blockwise bootstrap are much wider than the ones generated from the ordinary bootstrap: 1.5 times wider on the Conversation data and 2 times wider on the AMI Meeting data. Since the upper bounds of both confidence intervals are still negative, we can conclude that the improvement of ASR system B over system A is statistically significant.
Also, we can see that the confidence intervals computed from the empirical percentiles at 2.5% and 97.5% of the bootstrap samples are almost the same as the ones computed from the Gaussian approximation.
Figure 2 displays the histograms of the absolute WER differences computed from the bootstrap samples, where we can see again that the data distribution from the blockwise bootstrap method is more spread out.
Feature  Conversation  AMI Meeting
Number of Utterances  
Number of Words  
Number of Correlated Blocks 
Method  Metric  Conversation  AMI Meeting
Bootstrap  
Blockwise Bootstrap  
6 Conclusion
In this paper we present the blockwise bootstrap approach: by dividing the evaluation utterances into nonoverlapping blocks, this method resamples these blocks instead of the original data points. We show that the resulting variance estimator of the absolute WER difference of two ASR systems is consistent under mild conditions. We also demonstrate the validity of the blockwise bootstrap method on both synthetic and real-world speech data.
Future work might include inferring the correlated block structures from data, for example, by estimating a sparse correlation matrix across evaluation utterances based on embeddings of the speakers and text sentences.