Statistical Testing on ASR Performance via Blockwise Bootstrap

12/19/2019
by   Zhe Liu, et al.

A common question raised in automatic speech recognition (ASR) evaluations is how reliable an observed word error rate (WER) improvement between two ASR systems is; statistical hypothesis testing and confidence intervals can be utilized to tell whether this improvement is real or only due to random chance. The bootstrap resampling method, which is intuitive and easy to use, has been popular for such significance analysis. However, this method fails in dealing with dependent data, which is prevalent in the speech world: for example, ASR performance on utterances from the same speaker can be correlated. In this paper we present a blockwise bootstrap approach: by dividing evaluation utterances into nonoverlapping blocks, this method resamples these blocks instead of the original data. We show that the resulting variance estimator of the absolute WER difference between two ASR systems is consistent under mild conditions. We also demonstrate the validity of the blockwise bootstrap method on both synthetic and real-world speech data.



1 Introduction

The most widely used metric for measuring the performance of an automatic speech recognition (ASR) system is the word error rate (WER), which is derived from the Levenshtein distance at the word level:

(1)  $\mathrm{WER} = \frac{\sum_{i=1}^{N} e_i}{\sum_{i=1}^{N} n_i}$

where $n_i$ is the number of words in the $i$-th sentence (i.e. the reference text of the audio) of the evaluation dataset of $N$ utterances, and $e_i$ represents the sum of insertion, deletion, and substitution errors computed from the dynamic string alignment of the recognized word sequence with the reference word sequence.
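As a concrete illustration, formula (1) can be computed from a word-level Levenshtein alignment. The sketch below is our own illustration (function names are not from the paper):

```python
# Illustrative sketch: word-level Levenshtein distance and corpus-level WER.

def word_errors(ref, hyp):
    """Minimum insertions + deletions + substitutions turning hyp into ref."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)]

def corpus_wer(refs, hyps):
    """Formula (1): pooled word errors over pooled reference word count."""
    errors = sum(word_errors(r, h) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words
```

For example, `corpus_wer(["a b c", "d e"], ["a x c", "d e"])` pools 1 error over 5 reference words, i.e. a WER of 0.2.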

A practical question commonly raised in ASR evaluations is how reliable an observed improvement of ASR system B over ASR system A is. For example, if we obtain an absolute 0.2% WER reduction, how can we tell whether this improvement is real and not due to random chance?

This is where statistical hypothesis testing comes into play. The use of statistical testing in ASR evaluations was explored long ago [gillick1989some, pallett1993benchmark, strik2000comparing, bisani2004bootstrap]. In particular, the work of [bisani2004bootstrap] presents a bootstrap method for significance analysis which makes no distributional approximations and whose results are immediately interpretable in terms of WER.

To be more specific, suppose we have a sequence of independent and identically distributed (i.i.d.) random variables

$X_1, \ldots, X_N$

and we are interested in estimating the variance of some statistic $T_N = T(X_1, \ldots, X_N)$. The bootstrap method [efron1994introduction] resamples data from the empirical distribution of $X_1, \ldots, X_N$, and then recalculates the statistic on each of these "bootstrap" samples. The variance of $T_N$ can then be estimated from the sample variance of these computed statistics.

For the ASR systems comparison problem that we raised previously, the authors of [bisani2004bootstrap] proposed using the bootstrap approach above to resample (with replacement) the utterances in the evaluation dataset for each replicate, and then estimate the probability that the absolute WER difference of ASR system B versus system A,

(2)  $\Delta\mathrm{WER} = \frac{\sum_{i=1}^{N} (e_i^{B} - e_i^{A})}{\sum_{i=1}^{N} n_i}$

is positive, where ASR systems A and B have word error counts $e_i^{A}$ and $e_i^{B}$ on the $i$-th sentence, respectively. Note that the difference in the number of errors of the two systems is calculated on identical bootstrap samples.
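The utterance-level resampling described above can be sketched as follows. This is a minimal illustration under assumed input conventions (parallel lists of word counts and per-utterance error counts), not the implementation of [bisani2004bootstrap]:

```python
import random

def paired_bootstrap(n_words, err_a, err_b, n_boot=1000, seed=0):
    """Resample utterances with replacement; on each replicate compute the
    absolute WER difference of system B vs. system A (formula (2)) on the
    identical bootstrap sample for both systems."""
    rng = random.Random(seed)
    n = len(n_words)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(n_words[i] for i in idx)
        diff = sum(err_b[i] - err_a[i] for i in idx)
        deltas.append(diff / words)
    # estimated probability that the WER difference is positive (B worse than A)
    p_positive = sum(d > 0 for d in deltas) / n_boot
    return deltas, p_positive
```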

One of the key issues confronting bootstrap resampling approximations is how to deal with dependent data. This is particularly the case for speech data, since speech recognition errors can be highly correlated across different utterances if they are 1) from the same speaker, 2) similar in the spoken sentence (e.g. in the same domain or topic), or both. When the dependence structure across $X_1, \ldots, X_N$ is nontrivial, the true sampling distribution of $T_N$ depends on the joint distribution of $X_1, \ldots, X_N$, and thus the bootstrap subsamples should preserve this dependence structure as well. However, the reshuffled resamples obtained from the ordinary bootstrap method break such dependence and thus lead to wrong variance estimates of the statistic.

The work in [hall1985resampling, carlstein1986use] developed the fixed blockwise bootstrap approach for dependent time-series data and showed that its variance estimator is consistent under mild conditions. In particular, they proposed dividing the time series into nonoverlapping blocks of equal length and resampling these blocks instead of the original data points.

In this paper, we present the blockwise bootstrap method for statistical testing of ASR performance evaluation. To address the question that we raised previously, we focus on computing the confidence interval (CI) of the absolute WER difference between two ASR systems.

The rest of the paper is organized as follows. Section 2 introduces the use of the blockwise bootstrap for the ASR evaluation problem. Section 3 shows that the blockwise bootstrap variance estimator of the absolute WER difference between two ASR systems is consistent under mild conditions. Sections 4 and 5 demonstrate the validity of the blockwise bootstrap method on simulated synthetic data and real-world speech data. We conclude in Section 6.

2 Methodology

In this section, we describe the blockwise bootstrap method for computing the confidence interval of the absolute WER difference as given in formula (2).

For the $i$-th utterance, where $i = 1, \ldots, N$, we have the evaluation results of the two ASR systems A and B as the triplet

(3)  $(n_i, \; e_i^{A}, \; e_i^{B})$

where $n_i$ is the number of words, and $e_i^{A}$ and $e_i^{B}$ represent the numbers of word errors from ASR systems A and B, respectively. The statistic that we are interested in is the absolute WER difference of system B versus system A, as given in formula (2).

Suppose the data above can be partitioned into $K$ nonoverlapping blocks

(4)  $\mathcal{B}_1, \ldots, \mathcal{B}_K$

where $\bigcup_{k=1}^{K} \mathcal{B}_k = \{1, \ldots, N\}$, and $\mathcal{B}_j \cap \mathcal{B}_k = \emptyset$ for any $j \neq k$. The blockwise bootstrap method works as follows.

For any $b = 1, \ldots, B$, where $B$ is the number of bootstrap replicates, we randomly sample (with replacement) $K$ elements from the set $\{\mathcal{B}_1, \ldots, \mathcal{B}_K\}$ to generate a bootstrap sample

(5)  $\mathcal{B}_1^{*(b)}, \ldots, \mathcal{B}_K^{*(b)}$

Here any $\mathcal{B}_k^{*(b)} \in \{\mathcal{B}_1, \ldots, \mathcal{B}_K\}$. Then for this bootstrap replicate sample, the statistic is computed as

(6)  $\Delta\mathrm{WER}^{*(b)} = \frac{\sum_{k=1}^{K} \sum_{i \in \mathcal{B}_k^{*(b)}} (e_i^{B} - e_i^{A})}{\sum_{k=1}^{K} \sum_{i \in \mathcal{B}_k^{*(b)}} n_i}$

Once we have all of $\Delta\mathrm{WER}^{*(1)}, \ldots, \Delta\mathrm{WER}^{*(B)}$, the 95% confidence interval for the absolute WER difference can be determined by the empirical percentiles at 2.5% and 97.5% of the bootstrap samples:

$\big[\Delta\mathrm{WER}^{*}_{(2.5\%)}, \; \Delta\mathrm{WER}^{*}_{(97.5\%)}\big]$

Alternatively, the uncertainty of $\Delta\mathrm{WER}$ can be quantified by its standard error, which can be approximated by the sample standard deviation of the $B$ bootstrap samples

(7)  $\widehat{se} = \sqrt{\frac{1}{B - 1} \sum_{b=1}^{B} \big(\Delta\mathrm{WER}^{*(b)} - \overline{\Delta\mathrm{WER}^{*}}\big)^2}$

where $\overline{\Delta\mathrm{WER}^{*}} = \frac{1}{B} \sum_{b=1}^{B} \Delta\mathrm{WER}^{*(b)}$. Then, based on a Gaussian approximation, the 95% confidence interval for the absolute WER difference can be obtained by

$\Delta\mathrm{WER} \pm 1.96 \cdot \widehat{se}$

Note that in this paper we mainly focus on confidence intervals for the absolute WER difference. The blockwise bootstrap method can similarly be utilized to compute confidence intervals for the WER itself, as well as for the relative WER difference between two ASR systems.
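The block-resampling procedure and both confidence intervals described in this section can be sketched as follows. This is a minimal illustration (our own code, not the authors'), with `blocks` assumed to hold per-utterance triplets $(n_i, e_i^A, e_i^B)$ grouped into blocks:

```python
import math
import random

def blockwise_bootstrap_ci(blocks, n_boot=1000, seed=0):
    """blocks: list of blocks, each a list of (n_words, err_a, err_b) tuples.
    Resamples whole blocks with replacement and computes the absolute WER
    difference on each replicate; returns the percentile 95% CI and the
    Gaussian-approximation 95% CI."""
    rng = random.Random(seed)
    K = len(blocks)
    deltas = []
    for _ in range(n_boot):
        # draw K blocks with replacement, keeping utterances within a block together
        sample = [blocks[rng.randrange(K)] for _ in range(K)]
        words = sum(n for blk in sample for (n, _, _) in blk)
        diff = sum(eb - ea for blk in sample for (_, ea, eb) in blk)
        deltas.append(diff / words)
    deltas.sort()
    # empirical 2.5% and 97.5% percentiles
    pct_ci = (deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1])
    # Gaussian approximation: mean +/- 1.96 * sample standard deviation
    mean = sum(deltas) / n_boot
    se = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (n_boot - 1))
    gauss_ci = (mean - 1.96 * se, mean + 1.96 * se)
    return pct_ci, gauss_ci
```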

3 Statistical Properties

In this section we work out the statistical theory showing that the blockwise bootstrap variance estimator of $\Delta\mathrm{WER}$ is consistent under mild conditions.

For simplicity, we assume all utterances in the evaluation dataset have the same number of words, that is, $n_i = m$ for all $i$. Let us denote $Z_i = (e_i^{B} - e_i^{A}) / m$. Then we have the statistic of interest written as

(8)  $T_N = \frac{1}{N} \sum_{i=1}^{N} Z_i$

where the subscript $N$ in $T_N$ indicates the number of samples (i.e. utterances) corresponding to the quantity.

By dividing the utterances into nonoverlapping blocks, suppose each block has the same number of utterances, denoted as $l$. Let $K = N / l$ be the number of blocks.

Without loss of generality, assume that the blocks are consecutive, and thus the $k$-th block consists of the sequence $Z_{(k-1)l+1}, \ldots, Z_{kl}$, where $k = 1, \ldots, K$. We further let

(9)  $T_l^{(j)} = \frac{1}{l} \sum_{i=j}^{j+l-1} Z_i$

where the subscript $l$ in $T_l^{(j)}$ indicates the block size and the superscript $j$ indicates the index of the starting variable in the block. Note that $T_N = \frac{1}{K} \sum_{k=1}^{K} T_l^{((k-1)l+1)}$.

Consider the blockwise variance estimator (i.e. the standardized sample variance of the blockwise estimators of the statistic)

(10)  $\hat{\sigma}_N^2 = \frac{l}{K} \sum_{k=1}^{K} \big(T_l^{((k-1)l+1)} - T_N\big)^2$

The following theorem establishes its $L_1$-consistency.

Theorem 1.

Assume the asymptotic variance of $\sqrt{N} \, T_N$ is

(11)  $\sigma^2 = \lim_{N \to \infty} N \cdot \mathrm{Var}(T_N) < \infty$

and $\mathbb{E}[Z_i] = 0$ for any $i$ (otherwise replace $Z_i$ by $Z_i - \mathbb{E}[Z_i]$). Let $l = l_N$ be such that $l \to \infty$ and $l / N \to 0$ as $N \to \infty$. If $l^2 \, \mathbb{E}\big[(T_l^{(j)})^4\big]$ is uniformly bounded, and for any $j \neq k$ the squared blockwise statistics $\big(T_l^{((j-1)l+1)}\big)^2$ and $\big(T_l^{((k-1)l+1)}\big)^2$ are uncorrelated, then

(12)  $\hat{\sigma}_N^2 \to \sigma^2 \text{ in } L_1.$
Proof.

Let us denote $Y_k = T_l^{((k-1)l+1)}$ for $k = 1, \ldots, K$. Then the variance estimator can be written as

(13)  $\hat{\sigma}_N^2 = \frac{l}{K} \sum_{k=1}^{K} Y_k^2 - l \, T_N^2$

where $T_N = \frac{1}{K} \sum_{k=1}^{K} Y_k$. Note that $\mathbb{E}[Y_k] = 0$.

We will first show that the first term on the right hand side of (13) converges to $\sigma^2$ in $L_1$. Notice that

(14)  $\mathbb{E}\Big[\frac{l}{K} \sum_{k=1}^{K} Y_k^2\Big] = \frac{1}{K} \sum_{k=1}^{K} l \, \mathrm{Var}(Y_k) \to \sigma^2$

as $N \to \infty$ and (thus) $l \to \infty$. Then it suffices to show $\mathrm{Var}\big(\frac{l}{K} \sum_{k=1}^{K} Y_k^2\big) \to 0$. This is true since

(15)  $\mathrm{Var}\Big(\frac{l}{K} \sum_{k=1}^{K} Y_k^2\Big) = \frac{l^2}{K^2} \sum_{k=1}^{K} \mathrm{Var}(Y_k^2) \le \frac{l^2}{K^2} \sum_{k=1}^{K} \mathbb{E}[Y_k^4] \le \frac{C}{K} \to 0$

when $N$ and $K$ are sufficiently large. Here the first equality follows from the assumption that for any two blocks $j \neq k$, $Y_j^2$ and $Y_k^2$ are uncorrelated, and the last inequality follows from the assumption that $l^2 \, \mathbb{E}[Y_k^4]$ is uniformly bounded by some constant $C$.

Now we only need to show that the second term on the right hand side of (13), $l \, T_N^2$, converges to 0 in $L_1$, or equivalently, that $\mathbb{E}[l \, T_N^2]$ converges to 0.

Note that $\mathbb{E}[l \, T_N^2] = \frac{l}{N} \cdot N \, \mathrm{Var}(T_N)$ and $\frac{l}{N} \cdot N \, \mathrm{Var}(T_N) \to 0$ as $N \to \infty$, since $N \, \mathrm{Var}(T_N)$ is bounded and $l / N \to 0$. ∎

Based on Theorem 1, we require both the number of blocks $K$ and the number of utterances $l$ in each block to go to infinity as the total number of utterances $N$ grows. This is typically the case for speech data if the blocks are partitioned by different speakers or topics. On the other hand, the assumption that different blocks are uncorrelated seems strong; we relax this assumption in the corollary below.
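For intuition, the blockwise variance estimator (10) is straightforward to compute. The sketch below is our own illustrative code under the notation above: it returns $\hat{\sigma}_N^2$, which estimates $\sigma^2 = \lim_{N \to \infty} N \cdot \mathrm{Var}(T_N)$:

```python
def blockwise_var(z, block_len):
    """Estimator (10): (l/K) * sum over blocks of (block mean - overall mean)^2,
    where z is the list of per-utterance statistics Z_i and block_len is l."""
    K = len(z) // block_len  # number of complete blocks
    n_used = K * block_len
    overall = sum(z[:n_used]) / n_used  # T_N over the complete blocks
    s = 0.0
    for k in range(K):
        blk = z[k * block_len : (k + 1) * block_len]
        block_mean = sum(blk) / block_len  # T_l^{((k-1)l+1)}
        s += (block_mean - overall) ** 2
    return block_len * s / K
```

On data where whole blocks move together (strong within-block correlation), this estimator is larger than the naive i.i.d. sample variance, reflecting the inflated variance of $T_N$.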

Corollary 1.1.

Theorem 1 still holds if the assumption of uncorrelated squared blockwise statistics is relaxed as follows: for any $\varepsilon > 0$, if $N$ is large enough, then for any $j \neq k$ with $j, k \in \{1, \ldots, K\}$, assume

(16)  $\big|\mathrm{Cov}(Y_j^2, \, Y_k^2)\big| \le \frac{\varepsilon}{l^2}$

where $Y_k = T_l^{((k-1)l+1)}$.
Proof.

It suffices to show $\mathrm{Var}\big(\frac{l}{K} \sum_{k=1}^{K} Y_k^2\big) \to 0$ under the relaxed assumption.

Consider, for any $j \neq k$,

(17)  $\mathrm{Var}\Big(\frac{l}{K} \sum_{k=1}^{K} Y_k^2\Big)$
(18)  $= \frac{l^2}{K^2} \sum_{k=1}^{K} \mathrm{Var}(Y_k^2) + \frac{l^2}{K^2} \sum_{j \neq k} \mathrm{Cov}(Y_j^2, Y_k^2)$
(19)  $\le \frac{C}{K} + \frac{l^2}{K^2} \sum_{j \neq k} \big|\mathrm{Cov}(Y_j^2, Y_k^2)\big|$

Then under the assumption that $|\mathrm{Cov}(Y_j^2, Y_k^2)| \le \varepsilon / l^2$ for any $j \neq k$ if $N$ is sufficiently large, we have

(20)  $\mathrm{Var}\Big(\frac{l}{K} \sum_{k=1}^{K} Y_k^2\Big) \le \frac{C}{K} + \frac{l^2}{K^2} \cdot K(K - 1) \cdot \frac{\varepsilon}{l^2}$
(21)  $\le \frac{C}{K} + \varepsilon$

which, since $\varepsilon > 0$ is arbitrary, converges to 0 as $N \to \infty$. ∎

4 Simulation Studies

In this section, we conduct simulation experiments to show that the blockwise bootstrap approach is capable of generating valid confidence intervals for the absolute WER difference between two ASR systems, and that it is superior to the ordinary bootstrap when the utterances in the evaluation dataset are correlated with each other.

4.1 Setup

In the simulation experiments, we generate synthetic data, i.e. counts of ASR errors, to measure the performance of the ordinary bootstrap and blockwise bootstrap methods. We assume the evaluation set contains $N$ utterances in total and that each utterance has the same number of words $m$. For the ASR systems A and B in the comparison, suppose the "ground-truth" WERs are given by $p_A$ and $p_B$, respectively; thus the true absolute WER difference between them is $\Delta\mathrm{WER} = p_B - p_A$.

Under the scenario where the numbers of errors from different utterances are independent of each other, we generate the number of errors for each utterance from the binomial distribution $\mathrm{Binomial}(m, p)$, where $p = p_A$ or $p_B$ depending on which ASR system is used. On the other hand, when the ASR errors are correlated across different utterances, we need an additional assumption on the correlation structure while keeping the marginal distribution of the error count on each utterance binomial.

Here we assume the numbers of errors across different utterances are block-correlated; that is, for any two utterances, their ASR errors are correlated if they belong to the same block and independent if they belong to different blocks. Without loss of generality, suppose the blocks are consecutive and the size of each block (i.e. the number of utterances in each block) is denoted as $b$. We follow the steps below to generate ASR errors for each block and each ASR system:

  1. Generate a sample $(X_1, \ldots, X_b)$ from the multivariate Gaussian distribution $N(0, \Sigma)$, where $\Sigma$ is a $b$-by-$b$ covariance matrix with $\Sigma_{jk} = 1$ if $j = k$ and $\Sigma_{jk} = \rho$ if $j \neq k$. Here $\Sigma_{jk}$ is the $(j, k)$-th element of $\Sigma$;

  2. Turn $(X_1, \ldots, X_b)$ into correlated uniforms $(U_1, \ldots, U_b)$, where $U_j = \Phi(X_j)$ for $j = 1, \ldots, b$ and $\Phi$ is the standard Gaussian cumulative distribution function (CDF);

  3. Generate correlated binomial samples by inverting the binomial CDF: $e_j = F^{-1}(U_j; m, p)$ for $j = 1, \ldots, b$, where $F^{-1}$ is the inverse of the $\mathrm{Binomial}(m, p)$ CDF.
The ASR errors for different blocks and different systems are generated independently. For both the ordinary bootstrap and blockwise bootstrap methods in the comparison, we use the same resampling size. The block size $b$ and correlation parameter $\rho$ are varied in our experiment.

Figure 1: Visualisation of confidence intervals computed on the first 50 simulated datasets, from both the ordinary bootstrap and blockwise bootstrap methods.

4.2 Results

Strictly speaking, a 95% confidence interval means that if we had 100 different datasets drawn from the same distribution as the original data and computed a 95% confidence interval from each of them, then approximately 95 of these 100 confidence intervals would contain the true value of the statistic.

In our experiments, for each setup of block size and correlation parameter, we replicate the simulation 1000 times. Ideally, then, approximately 950 of the resulting confidence intervals would contain the true absolute WER difference $\Delta\mathrm{WER}$.
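The coverage computation described above can be sketched generically (an illustrative helper of our own, not from the paper): given a CI procedure and a data generator, count how often the interval contains the truth.

```python
def coverage(ci_fn, data_gen, truth, n_rep=1000):
    """Empirical coverage: the fraction of replications whose confidence
    interval (lo, hi), computed by ci_fn on freshly generated data,
    contains the true value of the statistic."""
    hits = 0
    for rep in range(n_rep):
        lo, hi = ci_fn(data_gen(rep))
        hits += (lo <= truth <= hi)
    return hits / n_rep
```

A well-calibrated 95% interval procedure should return a value close to 0.95 here.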

As seen from Table 1, the blockwise bootstrap method always gives valid confidence intervals: the percentage of confidence intervals that contain the true $\Delta\mathrm{WER}$ is very close to 95%, regardless of the block size $b$ or the correlation parameter $\rho$ (0, 0.05, 0.1, 0.2 or 0.4). As the block size or correlation increases, the confidence intervals become wider. On the other hand, the ordinary bootstrap method fails to generate correct confidence intervals when the data are dependent ($\rho > 0$), as its percentage of confidence intervals containing the true $\Delta\mathrm{WER}$ is much lower than 95%.

Figure 1 plots the confidence intervals computed on the first 50 of the 1000 simulated datasets, where we can see that the confidence intervals from the blockwise bootstrap method are wider and capture more of the true values of $\Delta\mathrm{WER}$.

Correlation ($\rho$)   Bootstrap: CI Width / Coverage   Blockwise Bootstrap: CI Width / Coverage

Block size $b_1$:
0.00   0.0030 / 94.1%   0.0030 / 94.7%
0.05   0.0030 / 92.7%   0.0033 / 95.2%
0.10   0.0030 / 90.1%   0.0035 / 94.3%
0.20   0.0030 / 86.2%   0.0040 / 94.9%
0.40   0.0030 / 76.9%   0.0048 / 94.0%

Block size $b_2$:
0.00   0.0030 / 94.1%   0.0030 / 94.7%
0.05   0.0030 / 78.1%   0.0046 / 95.2%
0.10   0.0030 / 69.2%   0.0058 / 94.9%
0.20   0.0030 / 54.4%   0.0077 / 94.7%
0.40   0.0030 / 41.2%   0.0105 / 95.9%

Table 1: Comparison results of the ordinary bootstrap and blockwise bootstrap methods on simulated data with two block sizes ($b_1 < b_2$) and various correlation parameters ($\rho$). "Coverage" is the percentage of confidence intervals that contain the true $\Delta\mathrm{WER}$.

5 Real Data Analysis

In this section we apply the blockwise bootstrap method to two real-world speech datasets and demonstrate how it helps compute the confidence intervals of the absolute WER difference between two ASR systems.

We consider the following two ASR evaluation datasets:

  • Conversational Speech dataset. This speech conversation dataset was collected through crowd-sourcing from a data supplier for speech ASR, and the data was properly anonymized. It consists of 235 conversations covering more than 20 topics that are common in daily life, including family, travel, food, etc.;

  • Augmented Multi-Party Interaction (AMI) Meeting dataset. The AMI meeting corpus [aran2010multimodal] includes scenario meetings (with roles assigned for each participant) and non-scenario meetings (where participants were free to choose topics). For scenario meetings, each session is divided into 4 one-hour meetings. Each meeting has 4 participants.

In this analysis, we only use the "dev" and "eval" splits of the datasets above.

We use an in-house conversational ASR system in this investigation: a baseline model (denoted as ASR system A) and an improved model (denoted as ASR system B). We are interested in computing a 95% confidence interval of the absolute WER reduction of system B versus system A. If the upper bound of the confidence interval is negative, we can conclude that the improvement is real and not due to random chance.

To apply the blockwise bootstrap method, we need to define the correlated block structures among the utterances in the evaluation data. For the Conversation dataset, it is natural to treat each conversation as a single separate block, since the same topics are discussed within a conversation. For the AMI Meeting data, we treat the utterances from each speaker in each (scenario or non-scenario) meeting as a block. By doing so, we assume that for any two utterances, their ASR errors are correlated if they belong to the same block and only weakly correlated if they belong to different blocks.

Table 2 shows details of the two evaluation datasets in terms of number of utterances, number of total words, and number of estimated blocks with correlated utterances.

We apply both the ordinary bootstrap and blockwise bootstrap methods to the two evaluation datasets. Results are shown in Table 3. Again, we observe that the confidence intervals computed from the blockwise bootstrap are much wider than those generated from the ordinary bootstrap: about 1.5 times wider on the Conversation data and 2 times wider on the AMI Meeting data. Since the upper bounds of both confidence intervals are negative, we can conclude that the improvement of ASR system B over system A is statistically significant.

Also, the confidence intervals computed from the empirical 2.5% and 97.5% percentiles of the bootstrap samples are almost the same as those computed from the Gaussian approximation.

Figure 2 displays the histograms of the absolute WER differences computed from the bootstrap samples, where we can see again that the data distribution from the blockwise bootstrap method is more spread out.

Evaluation Dataset
Feature Conversation AMI Meeting
Number of Utterances
Number of Words
Number of Correlated Blocks
Table 2: Summary of the Conversation and AMI Meeting datasets.
Evaluation Dataset
Method Metric Conversation AMI Meeting
Bootstrap
Blockwise Bootstrap
Table 3: Results of bootstrap and blockwise bootstrap methods on real-world Conversation and AMI meeting datasets.
(a) Conversation (b) AMI Meeting
Figure 2: Histograms of the absolute WER differences computed from the bootstrap samples of both bootstrap and blockwise bootstrap methods.

6 Conclusion

In this paper we presented the blockwise bootstrap approach: by dividing the evaluation utterances into nonoverlapping blocks, this method resamples these blocks instead of the original data. We showed that the resulting variance estimator of the absolute WER difference between two ASR systems is consistent under mild conditions, and we demonstrated the validity of the blockwise bootstrap method on both synthetic and real-world speech data.

Future work includes inferring the correlated block structures from data, for example, by estimating a sparse correlation matrix across evaluation utterances based on the embeddings of speakers and text sentences.

References