
Reducing Bias in Production Speech Models

by Eric Battenberg, et al.

Replacing hand-engineered pipelines with end-to-end deep learning systems has enabled strong results in applications like speech and object recognition. However, the causality and latency constraints of production systems put end-to-end speech models back into the underfitting regime and expose biases in the model that we show cannot be overcome by "scaling up", i.e., training bigger models on more data. In this work we systematically identify and address sources of bias, reducing error rates by up to 20% for deployment. We achieve this by utilizing improved neural architectures for streaming inference, solving optimization issues, and employing strategies that increase audio and label modelling versatility.



1 Introduction

Deep learning has helped speech systems attain very strong results on speech recognition tasks for multiple languages [Xiong et al., 2016, Amodei et al., 2015]. One could therefore say that the automatic speech recognition (ASR) task may be considered ‘solved’ for any domain where there is enough training data. However, production requirements such as supporting streaming inference bring in constraints that dramatically degrade the performance of such models, typically because models trained under these constraints are in the underfitting regime and can no longer fit the training data as well. Underfitting is the first symptom of a model with high bias. In this work, we aim to build a deployable model architecture with low bias because 1) it allows us to serve the very best speech models and 2) it lets us identify better architectures whose generalization performance improves by adding more data and parameters.

Typically, bias is induced by the assumptions made in hand engineered features or workflows, by using surrogate loss functions (or assumptions they make) that are different from the final metric, or maybe even implicit in the layers used in the model. Sometimes, optimization issues may also prevent the model from fitting the training data as well – this effect is difficult to distinguish from underfitting, and we also look at approaches to resolve optimization issues.

Sources of Bias in Production Speech Models

End-to-end models like [Amodei et al., 2015] typically tend to have lower bias because they have fewer hand engineered features, so we start from a similar model as the baseline. The model used in [Amodei et al., 2015] is a recurrent neural network with two 2D-convolutional input layers, followed by multiple bidirectional recurrent layers and one fully connected layer before a softmax layer. The network is trained end-to-end using the Connectionist Temporal Classification (CTC) loss function [Graves et al., 2006] to directly predict sequences of characters from log spectrograms of the audio. The following implicit assumptions contribute to the bias of the model.

  1. Input modeling: Typically, incoming audio is processed using energy normalization, spectrogram featurization, log compression, and finally, feature-wise mean and variance normalization. Figure 5 shows, however, that log spectrograms can have a high dynamic range across frequency bands (Figure 5(a)) or have some bands missing (Figure 5(c)). We investigate how the PCEN layer [Wang et al., 2016a] can parametrize and learn improved versions of these transformations, which simplifies the task of subsequent 2D convolutional layers.

  2. Architectures for streaming inference: English ASR models greatly benefit from using information from a few time frames into the future [Xiong et al., 2016, Sercu and Goel, 2016, Peddinti et al., 2015]. In the baseline model, this is enabled by using bidirectional layers, which are impossible to deploy in a streaming fashion, because the backward looking recurrences can be computed only after the entire input is available. Making the recurrences forward-only immediately removes this constraint and makes these models deployable, but also makes the assumption that no future context is useful. We show the effectiveness of Latency Constrained Bidirectional RNNs [Zhang et al., 2016] in controlling the latency while still being able to include future context.

  3. Target modeling: CTC models that output characters assume conditional independence between predicted characters given the input features. While this approximation makes maximum likelihood training tractable, it induces a bias on English ASR models and imposes a ceiling on performance. While CTC can easily model commonly co-occurring n-grams together, it is impossible to give roughly equal probability to many possible spellings when transcribing unseen words, because the probability mass has to be distributed between multiple time steps, while assuming conditional independence. We show how GramCTC [Liu et al., 2017] finds the label space where this conditional independence is easier to manage.

  4. Optimization issues: Additionally, the CTC loss is notoriously unstable  [Sak et al., 2015], despite making sequence labeling tractable, since it is forcing the model to align the input and output sequences, as well as recognize output labels. Making the optimization stable can help learn a better model with the same number of parameters. We show two effective ways of using alignment information to improve the rate of convergence of these models.

The rest of the paper is organized as follows: Section 2 introduces related work that addresses each of the issues outlined above. Sections 3, 4, 5, and 6 investigate solutions for addressing the corresponding issues, and study trade-offs in their application. In Section 7, we present experiments where we show the impact of each component independently, as well as the combination of all of them, and discuss the results.

2 Related Work

The most direct way to remove all bias in the input-modeling is probably learning a sufficiently expressive model directly from raw waveforms as in [Sainath et al., 2015, Zhu et al., 2016] by parameterizing and learning these transformations. These works suggest that non trivial improvement in accuracy purely from modeling the raw waveform is hard to obtain without a significant increase in the compute and memory requirements. [Wang et al., 2016a] introduced a trainable per-channel energy normalization layer (PCEN) that parametrizes power normalization as well as the compression step, which is typically handled by a static log transform.

Lookahead convolutions have been proposed for streaming inference [Wang et al., 2016b]. Latency constrained Bidirectional recurrent layers (LC-BRNN) and Context sensitive chunks (CSC) have been proposed in [Chen and Huo, 2016] for tractable sequence model training but not explored for streaming inference. Time delay neural networks [Peddinti et al., 2015] and Convolutional networks are also options for controlling the amount of future context.

Alternatives have been proposed to relax the label independence assumption of the CTC loss: attention models [Bahdanau et al., 2015, Chan et al., 2016], global normalization [Collobert et al., 2016], segmental RNNs [Lu et al., 2016], and more end-to-end losses like lattice free MMI (Maximum Mutual Information) [Povey et al., 2016] are all promising approaches to address this problem.

CTC model training has been shown to be made more stable by feeding shorter examples first, like SortaGrad [Amodei et al., 2015] and by warm-starting CTC training from a model pre-trained by Cross-Entropy (CE) loss (using alignment information) [Sak et al., 2015]. SortaGrad additionally helps to converge to a better training error.
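The curriculum idea behind SortaGrad can be sketched in a few lines. This is a minimal illustration, assuming that only the first epoch is sorted by utterance length and that later epochs are shuffled; the helper name `epoch_order` and the toy durations are hypothetical and not taken from [Amodei et al., 2015].

```python
import random


def epoch_order(lengths, epoch, rng):
    """Return utterance indices in the order they are visited in this epoch.

    lengths: per-utterance durations (any comparable unit).
    Epoch 0 goes from shortest to longest; later epochs are shuffled.
    """
    order = list(range(len(lengths)))
    if epoch == 0:
        # Curriculum epoch: short (easier to align) utterances first.
        order.sort(key=lambda i: lengths[i])
    else:
        rng.shuffle(order)
    return order


durations = [7.2, 1.5, 3.9, 12.0]  # toy durations in seconds (made up)
first_epoch = epoch_order(durations, 0, random.Random(0))
later_epoch = epoch_order(durations, 1, random.Random(0))
print(first_epoch)
print(later_epoch)
```

In a real pipeline the ordering would typically be applied to minibatches rather than single utterances; the sketch ignores batching.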

Figure 5: Each row shows spectrograms of the same audio segment post-processed with two different methods. The horizontal axis is time (10ms / step) and the vertical axis is frequency bins. The left column is generated by applying log. The right is with PCEN with 0.015 and 0.08 smoothing coefficients. In addition, we also want our models to be robust to pipeline effects, like missing bands (bottom row).

3 Input modeling

ASR systems often have a vital front-end that involves power normalization, (mel) spectrogram calculation followed by log compression, mean and variance normalization apart from other operations. In this section, we show that we can better model a wide variety of speech input by replacing this workflow with a trainable frontend.

While spectrograms strike an excellent balance between compute and representational quality, they have a high dynamic range (Figure 5) and are susceptible to channel effects such as room impulse response, Lombard effects and background noises. To alleviate the first issue, they are typically log compressed, and then mean and variance normalized. However, this only moderately helps with all the variations that can arise in the real world as described before, and we expect the network to learn to be robust to these effects by exposing it to such data. By relieving the network of the task of speech and channel normalization, it can devote more of its capacity to the actual speech recognition task. For this, we replaced the traditional log compression and power normalization steps with a trainable per-channel energy normalization (PCEN) front-end [Wang et al., 2016a], which performs

\[ y(t,f) = \left( \frac{x(t,f)}{\left(\epsilon + M(t,f)\right)^{\alpha}} + \delta \right)^{r} - \delta^{r} \]

where $x$ is the input spectrogram, $M$ is the causal energy estimate of the input, and $\delta$, $\alpha$, $r$, $\epsilon$ are tunable per-channel parameters. The motivation for this is two-fold. It first normalizes the audio using the automatic gain controller (AGC), $x / M^{\alpha}$, and further compresses its dynamic range using $(\cdot + \delta)^{r} - \delta^{r}$. The latter is designed to approximate an optimized spectral subtraction curve [Porter and Boll, 1984], which helps to improve robustness to background noises. Figure 5 shows that PCEN effectively normalizes various speaker and channel effects.
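As a rough illustration of the front-end described above, the sketch below applies PCEN to a single frequency channel. It assumes the commonly published form y = (x / (eps + M)^alpha + delta)^r - delta^r with a first-order causal smoother M(t) = (1 - s) M(t-1) + s x(t). The function name `pcen`, the initialisation of the smoother, and the default values of `alpha`, `delta`, `r` and `eps` are illustrative assumptions (in the trainable layer these are learned per channel); only the smoothing coefficient 0.015 appears in the Figure 5 caption.

```python
def pcen(frames, s=0.015, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to one frequency channel.

    frames: non-negative spectrogram energies x(t, f) for a fixed channel f.
    Returns a list of compressed, gain-normalized values of the same length.
    """
    out = []
    m = frames[0]  # assumption: start the smoother at the first frame
    for x in frames:
        m = (1.0 - s) * m + s * x           # causal energy estimate M(t, f)
        agc = x / (eps + m) ** alpha        # automatic gain control
        out.append((agc + delta) ** r - delta ** r)  # dynamic range compression
    return out


quiet = pcen([1.0] * 50)
loud = pcen([100.0] * 50)
# A 100x change in input level barely changes the PCEN output.
print(quiet[-1], loud[-1])
```

With `alpha` close to 1 the gain control cancels most of the overall level, which is why the quiet and loud signals map to nearly the same value, whereas a log transform would separate them by a constant offset.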

Figure 8: The normalization bias of log compression is clearly shown on the inhomogeneous real-world dataset that is distorted with various channel and acoustical effects. Panels: (a) homogeneous speech dataset; (b) real-world speech dataset.

PCEN was originally motivated to improve keyword spotting systems, but our experiments show that it helps with general ASR tasks, yielding a noticeable improvement in error rates over the baseline (Table 3). Our training data set, which was curated in-house, consists of speech data collected in multiple realistic settings. The PCEN front-end gave the most improvement on our far-field validation portion, where there was an absolute WER reduction of 2 points. To demonstrate that this was indeed reducing bias, we tried this on WSJ, a much smaller and homogeneous dataset. We observed no improvement on the holdout validation set, as shown in Figure 8(a), as the read speech is extremely uniform and the standard front-end suffices.

4 Latency Controlled Recurrent layers

Consider a typical use-case for ASR systems under deployment. Audio is typically sent over the network in packets of short duration. Under these streaming conditions, it is imperative to improve accuracy and reduce the latency perceived by end-users. It has been observed that users tend to be most sensitive to the time between when they stop speaking and when the last spoken word is presented to them. As a proxy for perceived latency, we measure last-packet-latency, defined as the time taken to return the transcription to the user after the last audio packet arrives at the server. [footnote 1]

Footnote 1: Real-time factor (RTF) has also been commonly used to measure the speed of an ASR system, but it is in most cases only loosely correlated with latency. While an RTF below 1 is necessary for a streaming system, it is far from sufficient. As one example, RTF does not consider the non-uniformity in processing time caused by (stacked) convolutions in neural networks.

Figure 9: Contexts of different structures. (a) Bidirectional RNN. (b) Chunked RNN. (c) Chunked RNN with overlapping. (d) LC-BGRU layer. (e) Lookahead Convolution. Solid arrows represent forward recurrences, and dashed arrows represent backward recurrences. States are reset to zero at the start of each arrow. Solid lines represent convolution windows.

To tackle the bias induced by using purely forward only recurrences in deployed models, we examine several structures, including look-ahead convolutions [Wang et al., 2016b] (LA-Conv) and latency-controlled bidirectional RNNs (in our case, LC-BGRU as our recurrent layers employ GRU [D. Bahdanau and Bengio, 2014] cells) [Chen and Huo, 2016, Zhang et al., 2016], which are illustrated in Figure 9.

An LA-Conv layer learns a linear weighted combination (convolution) of activations in the future ($[t+1, t+C]$) to compute activations for each neuron at time $t$, with a context size $C$, as shown in Figure 9 (e). The LA-Conv is placed above all recurrent layers.
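A lookahead convolution can be sketched for a single neuron as follows. This is an illustrative sketch, assuming the window covers offsets t+1 through t+C and that frames beyond the end of the utterance contribute zero; the function name `lookahead_conv` and the toy weights are hypothetical.

```python
def lookahead_conv(h, w):
    """Combine future activations of one neuron with learned weights.

    h: activations over time for a single neuron (list of floats).
    w: C weights, where w[j - 1] multiplies the activation at offset +j.
    Frames past the end of the sequence are treated as zero.
    """
    num_frames, context = len(h), len(w)
    out = []
    for t in range(num_frames):
        acc = 0.0
        for j in range(1, context + 1):
            if t + j < num_frames:
                acc += w[j - 1] * h[t + j]
        out.append(acc)
    return out


print(lookahead_conv([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))
```

Because each output depends on at most C future frames, the layer adds a fixed, known delay, which is what makes it usable in a streaming setting.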

In a LC-BGRU layer, an utterance is uniformly divided into several overlapping chunks, each of which can be treated as an independent utterance and computed with bidirectional recurrences. More formally, let $N$ be the length of an utterance $X$, and let $x(t)$ represent the $t$-th frame of $X$. $X$ is divided into overlapping chunks that are each of a fixed context size $c_W$. In our experiments, the forward recurrences process $X$ sequentially as $x(1),…,x(N)$. Backward recurrences start processing the first chunk $x(1),…,x(c_W)$, then move ahead by the chunk/step-size $c_S$ to independently process $x(c_S+1),…,x(c_W+c_S)$, and so on. In relation to the first chunk $x(1),…,x(c_W)$, we refer to $x(c_S+1),…,x(c_W)$ as the lookahead. Hidden-states of the backward recurrences are reset between each chunk, and consequently the backward hidden states $h_b(1),…,h_b(c_S)$ produced from each chunk are used in calculating the final output of the LC-BGRU layer. Figure 9 illustrates this operation and compares it with other methods which are proposed for similar purposes. The forward-looking and backward-looking units in this LC-BGRU layer receive the same affine transformations of inputs. We found that this helps reduce computation and save parameters, without affecting the accuracy adversely. The outputs are then concatenated across features at each timestep before being fed into the next layer.
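The chunking used by the backward recurrences can be illustrated with a small index calculation. The sketch below assumes a fixed context size and step size, with the lookahead equal to their difference; it uses 0-based, end-exclusive indices, and the helper name `backward_chunks` is hypothetical.

```python
def backward_chunks(num_frames, context, step):
    """List the chunks processed by the backward recurrence.

    Returns (start, end, keep_end) triples with 0-based, end-exclusive indices:
    the backward pass runs over frames [start, end) with a fresh hidden state,
    and only the outputs for frames [start, keep_end) are kept. The frames in
    [keep_end, end) act as lookahead.
    """
    if not 0 < step <= context:
        raise ValueError("step must be positive and at most the context size")
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + context, num_frames)
        keep_end = min(start + step, num_frames)
        chunks.append((start, end, keep_end))
        start += step
    return chunks


# Context of 30 timesteps and step of 10 timesteps, i.e. a lookahead of 20.
for chunk in backward_chunks(50, 30, 10):
    print(chunk)
```

The kept ranges tile the utterance without gaps, so every frame receives exactly one backward state, computed with at most `context - step` frames of future context.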

Figure 12: Panels (a) and (b) plot CER and serving latency, respectively, on a dev-set as a function of lookahead and the number of LC-BGRU layers. The models are trained on a sampled subset (10%) of the complete training data. The green and blue baselines are the performance of models where all 3 GRU layers are bidirectional and forward-only, respectively.

4.1 Accuracy and Serving Latency

We compare the Character Error Rate (CER) and last-packet-latency of using LA-Conv and LC-BGRU, along with those of forward GRU and bidirectional GRU for reference. Context size is fixed at 30 time steps for both LA-Conv and LC-BGRU, and the lookahead ranges from 5 to 25 timesteps in increments of 5 for LC-BGRU. For latency experiments, we fix the packet size and send packets from the client at a fixed interval. We send 10 simultaneous streams to simulate a system under moderate load. As shown in Figure 12(a), while LA-Conv closes almost half of the gap between forward GRU and bidirectional GRU, a model with three LC-BGRUs with a lookahead of 25 each (yellow line) performs as well as the bidirectional GRU (green line). The accuracy improves, but the serving latency increases exponentially as we stack LC-BGRU layers, because this increases the effective context much like in convolutional layers. Taking both accuracy and serving latency into consideration, our final models use 1 LC-BGRU layer, with a lookahead of 20 timesteps (400ms) and step-size of 10 timesteps (200ms). [footnote 2]

Footnote 2: 1 timestep corresponds to 10ms of the raw-input spectrogram, and striding in the convolution layers then makes that 20ms.

4.2 Loading BGRU as LC-BGRU

Figure 13: CER of running a BGRU in an LC-BGRU way with different context sizes (C) and lookahead timesteps.

Since bidirectional GRUs (BGRU) can be considered an extreme case of LC-BGRUs with infinite context (as long as the utterance length), it is interesting to know whether we could load a trained bidirectional GRU model as an LC-BGRU, so that we don’t have to train LC-BGRUs from scratch. However, we found that loading a model with 3 stacked bidirectional GRUs as stacked LC-BGRUs resulted in significant degradation in performance compared to both the bidirectional baseline and a model trained with stacked LC-BGRUs across a large set of chunk sizes and lookaheads.

We can improve the performance of the model if we instead chop up the input at each layer to a fixed size $C$, such that it is smaller than the effective context. We run an LC-BGRU layer on an input of length $C$, then stride the input by $S$, discard the last $(C - S)$ outputs, and re-run the layer over the strided input. Between each iteration the forward recurrent states are copied over, but the backward recurrent states are reset each time. The effect of using various $C$ and $S$ is shown in Figure 13. This approach is much more successful: with suitable choices of $C$ and $S$, we are able to obtain nearly identical error rates to the bidirectional GRU. With this selection of $C$ and $S$, the network does twice as much computation as would otherwise be needed, and it also has latencies that are unacceptable for streaming applications. However, it does have the advantage of running bidirectional recurrent layers over arbitrarily long utterances in a production environment at close to no loss in accuracy.

5 Loss function

Figure 14: Illustration of the states and the forward-backward transitions for the label ‘CAT’. Here we let the model’s output be over the set $G$, the set of all uni-grams and bi-grams of the English alphabet. The set of all valid states for the label $l$ = ‘CAT’ is listed on the left. The states and transitions that are common to both CTC and GramCTC are in black, and those that are unique to GramCTC are in orange.

The conditional independence assumption made by CTC forces the model to learn unimodal distributions over predicted label sequences. GramCTC [Liu et al., 2017] attempts to find a transformation of the output space where the conditional independence assumption made by CTC is less harmful. Specifically, GramCTC attempts to predict word-pieces, whereas traditional CTC based end-to-end models aim to predict characters.

GramCTC learns to align and decompose target sequences into word-pieces, or n-grams. N-grams allow us to address the peculiarities of English spelling and pronunciation, where word-pieces have a consistent pronunciation, but characters don’t. For example, when the model is unsure how to spell a sound, it can choose to distribute probability mass roughly equally between all valid spellings of the sound, and let the language model decide the most appropriate way to spell the word. This is often the safest solution, since language models are typically trained on significantly larger datasets and see even the rarest words. GramCTC is a drop-in replacement for the CTC loss function, with the only requirement being a pre-specified set of n-grams $G$. In our experiments, we include all uni-grams and high-frequency bi-grams and tri-grams, which compose a set of 1200 n-grams.
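To make the notion of decomposing a target into grams concrete, the sketch below enumerates every way a label can be split into pieces drawn from a gram set. It is only an illustration of the search space that GramCTC marginalizes over, not the loss itself; the function name `decompositions` and the toy gram set (the uni-grams and bi-grams that occur in ‘CAT’) are assumptions.

```python
def decompositions(label, grams, max_n):
    """Return all ways to split `label` into consecutive pieces from `grams`.

    label: target string, e.g. "CAT".
    grams: set of allowed n-grams.
    max_n: length of the longest gram to consider.
    """
    if not label:
        return [[]]
    result = []
    for n in range(1, min(max_n, len(label)) + 1):
        head = label[:n]
        if head in grams:
            for tail in decompositions(label[n:], grams, max_n):
                result.append([head] + tail)
    return result


toy_grams = {"C", "A", "T", "CA", "AT"}  # uni-grams and bi-grams of 'CAT'
for pieces in decompositions("CAT", toy_grams, max_n=2):
    print(pieces)
```

With plain CTC only the first, all-character decomposition exists; the additional decompositions are what allow the model to emit a multi-character gram in a single time step.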

5.1 Forward-backward Process of GramCTC

The training process of GramCTC is very similar to CTC. The main difference is that multiple consecutive characters may form a valid gram. Thus, the total number of states in the forward-backward process is much larger, as well as the transition between these states.

Figure 14 partially illustrates the dynamic programming process for the target sequence ‘CAT’. Here we suppose the gram set $G$ contains all possible uni-grams and bi-grams. Thus, for each character in ‘CAT’, there are three possible states associated with it: 1) the current character, 2) the bi-gram ending in the current character, and 3) the blank after the current character. There is also one blank at the beginning. In total we have 10 states.

Loss      Train           Train Holdout   Dev
          CER     WER     CER     WER     CER     WER
CTC       4.38    12.41   4.60    12.89   11.64   28.68
GramCTC   4.33    10.42   4.66    11.37   12.03   27.10
Table 1: Comparison of CTC and GramCTC.
          WER                 Epoch Time (hours)
Loss      Stride 2  Stride 4  Stride 2  Stride 4
GramCTC   21.46     18.27     18.3      9.6
Table 2: Performance and training efficiency of GramCTC with different model strides.
Figure 19: (a) Cross correlation between alignments estimated by three reference models: forward, LC-BGRU and bidirectional. Alignments by a forward (or LC-BGRU) model are 5 (or 4) steps later than those by a bidirectional model. (b) Applying alignments from a bidirectional model to the pre-training of a forward model, the amount of delay has little impact on the performance at convergence. (c) Warm-starting a LC-BGRU model from pre-training using different alignments, all of which achieve smaller training loss than no pre-training. (d) CER on dev set for the models trained in (c): they are all smaller than the case of no pre-training. Note also that joint training is on-par with pre-training.

5.2 GramCTC vs CTC

GramCTC effectively reduces the learning burden of the ASR network in two ways: it decomposes sentences into pronunciation-meaningful n-grams, and it effectively reduces the number of output time steps. Both aspects simplify the rules the network needs to learn, thus reducing the required network capacity of the ASR task. Table 1 compares the performance of CTC and GramCTC using the same network. There are some interesting distinctions. First, the CERs of GramCTC are similar to or even worse than those of CTC; however, the WERs of GramCTC are always significantly better than those of CTC. This is probably because GramCTC predicts in chunks of characters and the characters in the same chunk are dependent, thus more robust. Secondly, we also observe that the performance on the dev set is relatively worse than that on the train holdout. Our dev dataset is not drawn from the same distribution as the training data, which shows the potential for GramCTC to overfit even a large dataset.

Table 2 compares the training efficiency and the performance of the trained model with GramCTC at two time resolutions, stride 2 and stride 4. By striding over the input at a faster rate in the early layers, we effectively reduce the time steps of later layers, and cut the training time in half. From stride 2 to stride 4, the performance also improves considerably, probably because larger n-grams align with larger segments of the utterance, and thus need lower time resolution.

6 Optimization Tricks

Removing optimization issues has been a reliable way of improving performance in deep neural networks [Ioffe and Szegedy, 2015, He et al., 2016]. Several optimization tricks have been proposed especially for training recurrent networks. We tried using LayerNorm [Ba et al., 2016], Recurrent batch norm [Cooijmans et al., 2016] and NormProp [Arpit et al., 2016] without much success. Additionally, we take special care to optimize layers properly, and also employ SortaGrad [Amodei et al., 2015].

[Sak et al., 2015] suggests that CTC training could be suffering from optimization issues and could be made more stable by providing alignment information during training. In this section we study how alignment information can be used effectively.

6.1 Pre-training vs Joint-training

Using alignment information for training CTC models appears counter-intuitive since CTC marginalizes over all alignments during training. However, the CTC loss is hard to optimize because it simultaneously estimates network parameters and alignments. To simplify the problem, one may propose an Expectation-Maximization (EM) like approach, where the E-step computes the expected log-likelihood by marginalizing over the posterior of alignments, and the M-step refines the model parameters by maximizing the expected log-likelihood. However, it is infeasible to compute the posterior for all the alignments, and we approximate it by taking only the most probable alignment. One step of EM can be considered as the pre-training approach of using alignment information: we start training a model with the most likely alignment (which simplifies to training with a Cross-Entropy (CE) loss) for a few epochs, followed by training with the CTC loss.

Another way of using the alignment information is to train a single model simultaneously using a weighted combination of the CTC loss and the CE loss.
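The difference between the two losses, and one way of mixing them, can be sketched on a toy example. The code below computes the CTC negative log-likelihood with the standard forward recursion, the CE negative log-likelihood of a single frame-level alignment, and a convex combination of the two. The mixing weight `ce_weight` and all function names are assumptions, since the text does not state how the losses are weighted; the recursion works in probability space for readability and would underflow on realistic sequence lengths.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one utterance via the forward recursion.

    probs[t][k]: softmax probability of symbol k at time t.
    target: list of label indices without blanks.
    """
    ext = [blank]
    for k in target:
        ext += [k, blank]  # interleave blanks: _ l1 _ l2 _ ...
    num_states = len(ext)
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood)


def ce_neg_log_likelihood(probs, alignment):
    """Cross-entropy loss against one fixed frame-level alignment."""
    return -sum(math.log(probs[t][k]) for t, k in enumerate(alignment))


def joint_loss(probs, target, alignment, ce_weight=0.5):
    """Weighted combination of the CTC and CE losses (weight is an assumption)."""
    ctc = ctc_neg_log_likelihood(probs, target, blank=0)
    ce = ce_neg_log_likelihood(probs, alignment)
    return (1.0 - ce_weight) * ctc + ce_weight * ce


uniform = [[0.5, 0.5], [0.5, 0.5]]  # two frames, symbols: 0 = blank, 1 = 'a'
print(ctc_neg_log_likelihood(uniform, [1]))
print(ce_neg_log_likelihood(uniform, [1, 0]))
print(joint_loss(uniform, [1], [1, 0]))
```

In the toy case three alignments ("a_", "_a", "aa") collapse to the target, so CTC sums their probabilities while CE scores only the one that was supplied; this is why the CE term is never smaller than the CTC term for a valid alignment.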

Figure 19(c) shows the training curves of the same model architecture with pure CTC training, pre-training and joint training with alignment information from different source models. In the case of pre-training, we stop providing alignment information at the 6th epoch, corresponding to the shift in the training curve. Note that the final training losses of both pre-trained and joint-trained models are all lower than that of the pure CTC trained model, showing the effectiveness of this optimization trick. Additionally, joint-training and pre-training are on par in terms of training, so we prefer joint-training to avoid multi-phase training. The corresponding CER on the dev set is presented in Figure 19(d).

6.2 Source of alignments

It is important for us to understand how accurate the alignment information needs to be, since different models have differing alignments according to the architecture and training methods.

We estimate alignments from three “reference” models (models with forward-only GRU, LC-BGRU and bidirectional GRU layers, all trained with several epochs of CTC minimization), and present the cross correlation between the alignments produced by these models in Figure 19(a). The location of the peak indicates the amount of delay between two alignments. It is evident that alignments by a forward (and LC-BGRU) model are 5 (4) time-steps later than those by a bidirectional model, an observation that is consistent with [Schuster and Paliwal, 1997]. Therefore, it seems important to pre-train a model with properly adjusted alignments, e.g., alignments from a bidirectional model are supposed to be delayed by 5 steps to be used in the pre-training of a forward model. However, we found that for models trained on large datasets, this delay has little impact on the final result (Figure 19(b)). To push this series of experiments to the extreme, we tried pre-training a model with random alignments. Random alignments do not work, but we found that the most likely alignment as predicted by any CTC model was sufficient to achieve improved optimization.
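Estimating the delay between two alignments from the peak of their cross correlation can be sketched as follows. The representation of an alignment as a per-frame 0/1 indicator of a non-blank emission is an assumption made for this illustration, as are the function name `best_lag` and the toy sequences.

```python
def best_lag(reference, other, max_lag):
    """Return the lag d that maximizes sum_t reference[t] * other[t + d].

    A positive result means `other` emits its labels d frames later
    than `reference`.
    """
    def score(lag):
        return sum(
            reference[t] * other[t + lag]
            for t in range(len(reference))
            if 0 <= t + lag < len(other)
        )

    return max(range(-max_lag, max_lag + 1), key=score)


num_frames = 20
bidirectional = [0] * num_frames
forward_only = [0] * num_frames
for t in (3, 10):
    bidirectional[t] = 1      # toy emission times of the reference model
    forward_only[t + 5] = 1   # same emissions, delayed by 5 frames

print(best_lag(bidirectional, forward_only, max_lag=10))
```

Shifting the reference alignment by the estimated lag is then one way to produce the "properly adjusted" alignments discussed above.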

7 Experiments

7.1 Setup

In all experiments, the dataset is 10,000 hours of labeled speech from a wide variety of sources. The dataset is expanded by noise augmentation: in every epoch, a subset of the utterances is randomly selected and background noise is added. For robustness to reverberant noise encountered in far-field recognition, we adopt room impulse response (RIR) augmentation as in [Ko et al., 2017], in which case we randomly sample a subset of the data and convolve each instance with a random RIR signal. [footnote 3]

Footnote 3: We collect RIRs by emitting a signal from a speaker and capturing the signal, as well as the reverberations from the room, using a linear array of 8 microphones. The speaker is placed in a variety of configurations, ranging from 1 to 3 meters distance and 60 to 120 degrees inclination with respect to the array, for 20 different rooms.
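RIR augmentation amounts to convolving a waveform with a recorded impulse response. The sketch below shows this with a direct time-domain convolution on short toy signals; the function names, the sampling probability `prob`, and the truncation of the output to the original length are assumptions for illustration and are not taken from [Ko et al., 2017].

```python
import random


def convolve(signal, rir):
    """Full linear convolution of a waveform with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += x * h
    return out


def augment_with_rir(utterance, rirs, rng, prob):
    """With probability `prob`, reverberate the utterance with a random RIR.

    The result is truncated to the original length so that the transcript
    alignment is not stretched (a simplification).
    """
    if rng.random() < prob:
        rir = rng.choice(rirs)
        return convolve(utterance, rir)[: len(utterance)]
    return list(utterance)


toy_rirs = [[1.0, 0.5], [1.0, 0.0, 0.25]]  # made-up impulse responses
print(augment_with_rir([1.0, 2.0, 3.0], toy_rirs, random.Random(0), prob=0.5))
```

A production pipeline would normally use an FFT-based convolution, since real impulse responses span thousands of samples.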

The model specification and training procedure are the same as in [Amodei et al., 2015]. The baseline model is a deep recurrent neural network with two 2D convolutional input layers, followed by 3 forward Gated Recurrent layers [D. Bahdanau and Bengio, 2014], 2560 cells each, a look-ahead convolution layer and one fully connected layer before a softmax layer. The network is trained end-to-end to predict characters using the CTC loss. The configurations of the 2D convolution layers (filters, filter dimensions, channels, stride) are (32, 41x11, 1, 2x2) and (32, 21x11, 32, 2x1). Striding in both time and frequency domains helps us reduce computation in the convolution layers. In the convolution and fully-connected layers, we apply batch-normalization before applying nonlinearities (ReLU). We use sequence-wise batch-normalization in the recurrent layers [Amodei et al., 2015], effectively acting on the affine transformations of the inputs fed into them. Figure 20 shows the baseline model on the left.

For the baseline model, log spectrogram features are extracted in 161 bins with a hop size of 10ms and window size of 20ms, and are normalized so that each input feature has zero mean and unit variance. The optimization method we use is stochastic gradient descent with Nesterov momentum. Hyperparameters (batch size, learning rate, momentum) are kept the same across different experiments.
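The feature normalization step can be sketched as a per-feature standardization. Whether the statistics are computed per utterance or over the whole training set is not stated in the text; the sketch below assumes per-utterance statistics, and the function name `normalize_features` and the small `eps` guard are hypothetical.

```python
def normalize_features(frames, eps=1e-8):
    """Standardize each feature dimension to zero mean and unit variance.

    frames: list of T frames, each a list of F feature values
            (e.g. F = 161 log-spectrogram bins).
    """
    num_frames = len(frames)
    num_features = len(frames[0])
    means = [sum(fr[f] for fr in frames) / num_frames for f in range(num_features)]
    stds = [
        (sum((fr[f] - means[f]) ** 2 for fr in frames) / num_frames) ** 0.5
        for f in range(num_features)
    ]
    return [
        [(fr[f] - means[f]) / (stds[f] + eps) for f in range(num_features)]
        for fr in frames
    ]


toy = [[1.0, 10.0], [3.0, 30.0]]  # two frames, two features
print(normalize_features(toy))
```

The `eps` term only protects against division by zero for features that are constant over the utterance.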

Figure 20: Comparison between the baseline (left) and the proposed (right) model architectures.

Table 3 shows the result of the proposed solutions in earlier sections. We report the results on a sample of the train set as well as a development set. The error rates on the train set are useful to identify over-fitting scenarios, especially since the development set is significantly different from our training distribution as their sources are different.

In Table 3, both character and word error rates (CER/WER) are produced using a greedy max decoding of the output softmax matrix, i.e., taking the most likely symbol at each time step and then removing blanks and repeated characters. However, when a language model is adopted as in the “Dev LM” results, we use a beam search over the combined CTC and LM scores.
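Greedy max decoding as described above can be sketched as follows: take the most likely symbol per frame, collapse consecutive repeats, then drop blanks. The function name `greedy_decode`, the toy alphabet, and the choice of index 0 for the blank are assumptions.

```python
def greedy_decode(probs, alphabet, blank=0):
    """Greedy (best-path) CTC decoding.

    probs[t][k]: softmax output for symbol k at time step t.
    alphabet[k]: the character for index k (alphabet[blank] is unused).
    """
    best_path = [max(range(len(row)), key=row.__getitem__) for row in probs]
    decoded = []
    previous = None
    for k in best_path:
        # Emit a symbol only when it differs from the previous frame
        # (collapsing repeats) and is not the blank.
        if k != previous and k != blank:
            decoded.append(alphabet[k])
        previous = k
    return "".join(decoded)


def one_hot_like(k, size=4):
    """Toy softmax row that puts most of its mass on index k."""
    row = [0.1] * size
    row[k] = 0.7
    return row


toy_alphabet = ["_", "c", "a", "t"]
frames = [one_hot_like(k) for k in (1, 1, 0, 2, 3, 3)]
print(greedy_decode(frames, toy_alphabet))
```

A blank between two identical symbols keeps them separate, which is how a doubled letter survives the collapsing step.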

                                                            Train           Dev                             Dev LM
Model                                                       CER     WER     CER     % Rel   WER     % Rel   WER     % Rel
Baseline                                                    4.38    12.41   11.64   0.00%   28.68   0.00%   18.95   0.00%
Baseline + PCEN                                             3.79    10.90   11.16   4.20%   27.85   2.90%   18.12   4.40%
Baseline + LC-BGRU                                          3.49    10.33   11.06   5.00%   27.03   5.80%   17.47   7.80%
Baseline + GramCTC                                          4.33    10.42   12.03   -3.30%  27.10   5.50%   19.26   -1.70%
Baseline + CE pre-training                                  3.31    9.50    10.84   6.90%   26.39   8.00%   17.89   5.60%
Baseline + CE joint-training                                3.25    9.58    10.75   7.70%   26.64   7.10%   17.93   5.40%
Baseline + Farfield augmentation                            5.25    14.59   11.49   1.30%   29.64   -3.30%  18.47   2.50%
Baseline + CE + CTC + GramCTC joint training (Mix-1)        2.97    7.31    10.91   6.30%   24.48   14.60%  17.71   6.50%
Baseline + PCEN + LC-BGRU + Farfield augmentation (Mix-2)   5.51    14.10   9.74    16.40%  24.82   13.40%  15.47   18.40%
Mix-2 + CE joint training (Mix-3)                           3.57    10.50   9.38    19.40%  23.77   17.10%  15.75   16.90%
Bidirectional target                                        2.58    7.47    9.37    19.60%  23.03   19.70%  15.96   15.80%
Table 3: Results for both single improvements and incremental improvements to the models. Except when using a language model (Dev LM), reported numbers are computed using greedy max decoding as described in 7.1. Best results using deployable models are bolded.
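The "% Rel" columns appear to report the relative reduction in error rate with respect to the baseline. Under that assumption, the quantity can be computed as below; the function name `relative_improvement` is hypothetical, and the printed values may differ slightly from the table, which seems to be rounded.

```python
def relative_improvement(baseline, value):
    """Relative error reduction in percent; negative means a regression."""
    return 100.0 * (baseline - value) / baseline


# Dev WER of the baseline (28.68) against Baseline + PCEN (27.85).
print(round(relative_improvement(28.68, 27.85), 2))
# Dev CER of the baseline (11.64) against Baseline + GramCTC (12.03).
print(round(relative_improvement(11.64, 12.03), 2))
```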

7.2 Results of Individual Changes

In the first half of Table 3, we show the impact of each of the changes applied individually. All of the techniques proposed help fit the training data better, measured by CER on the train set. Several observations stand out.

  1. Replacing CTC loss with GramCTC loss achieves a lower WER, while CERs are similar on the train set. This indicates that the loss promotes the model to learn the spelling of words, but completely mis-predicts words when they are not known. This effect results in diminished improvements when the language model is applied.

  2. Applying farfield augmentation to the same-sized model results in a worse training error, as expected. It shows only a marginal improvement on the dev set, even though our dev set has a heavy representation of farfield audio.

  3. The single biggest improvement on the dev set comes from the addition of the LC-BGRU, which closes the gap to bidirectional models by 50%.

  4. Joint (and pre) training with alignment information improves CER on the train set by 25%, highlighting optimization issues in training CTC models from scratch. However, these models get less of an improvement from language model decoding, indicating their softmax outputs could be overconfident, therefore less amenable to correction by the language model. This phenomenon is observed in all models employing CE training as well as our Bidirectional target model (the model that provides the targets used for CE training).
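The "closes the gap" figure in observation 3 can be checked against Table 3. The sketch below assumes that the 50% refers to Dev LM WER, which the text does not state explicitly; `gap_closed` is a hypothetical helper name.

```python
def gap_closed(baseline, model, target):
    """Fraction of the baseline-to-target error gap recovered by `model`.

    All arguments are error rates, so lower is better.
    """
    return (baseline - model) / (baseline - target)

# Dev LM WER values copied from Table 3.
baseline_wer = 18.95
lc_bgru_wer = 17.47
bidirectional_wer = 15.96

closed = gap_closed(baseline_wer, lc_bgru_wer, bidirectional_wer)
print(f"{100 * closed:.1f}% of the gap closed")
```

Under this reading, the LC-BGRU recovers roughly half of the Dev LM WER gap between the baseline and the bidirectional target.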

7.3 Results of Incremental Changes

While we designed the solutions to address distinct issues in the model, we should not expect every individual improvement to be beneficial when used in combination. As an example, we see in the section on optimization that models with bidirectional layers gain very little by using alignment information: clearly, bidirectional layers by themselves address a part of the difficulty in optimizing CTC models. Therefore, addressing the absence of bidirectional layers will also address some of the optimization difficulties, and the gains from the two fixes may not stack up.

We see in the second half of Table 3 that improvements indeed do not stack up. There are 3 interesting models to discuss.

  1. The model mix of joint training with 3 increasingly difficult losses (CE, CTC, and GramCTC, Mix-1) achieves the best results on the train set, far surpassing the other model mixes and nearly matching the performance of models with bidirectional layers on the train set. This model has the smallest gain on the dev set amongst all the mix-models, which puts it in the overfitting regime. We know that there exists a model that can generalize better than this one while achieving the same error rates on the train set: the bidirectional baseline. Additionally, this model receives only a weak improvement from the language model, which agrees with what we observed with GramCTC and CE training in Section 7.2.

  2. The model mix of PCEN, LC-BGRU and IR augmentation (Mix-2) performs worse on the train set. Additional data augmentation with impulse responses makes the training data harder to fit, as we have seen earlier, and PCEN and LC-BGRU are not sufficient to address this difficulty. However, the model does attain better generalization, performs better on the dev set, and actually surpasses our bidirectional target when using a language model.

  3. Mix-3 adds CE joint training which helps to address optimization issues and leads to lower error rates on both the train and dev sets. However, the improvement in dev WER disappears when using a language model, again highlighting the language model integration issues when using CE training.

Finally, in Table 4, we compare Mix-3 against the baseline model, and against its equivalent with twice as many parameters in every layer, on various categories of speech data. Clearly, Mix-3 is significantly better for “farfield” and “Names” speech data, two notably difficult categories for ASR. ASR tasks run into a generalization issue for “Names” categories because they often require words that are not present in the acoustic training data. Similarly, farfield audio is hard to obtain, and the models are forced to generalize beyond the training data, in this case by making use of augmentation. At the same time, the serving latency of Mix-3 is only slightly higher than that of the baseline model, and remains acceptable for deployment.

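| Devsets | Baseline | 2× Baseline* | Mix-3 |
|---|---|---|---|
| Clean casual speech | 5.90 | 5.00 | 5.80 |
| Farfield | 35.05 | 30.60 | 26.49 |
| Names | 19.73 | 19.30 | 17.40 |
| Overall | 18.46 | 17.46 | 15.74 |
| Serving latency (ms) | 112 | 25933 | 153 |
| Training time (hours per epoch) | 17 | 29 | 25 |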
Table 4: A deeper look into the dev set, measuring WER with language model decoding of each model on different slices of the dev set. Serving latency (milliseconds) is the percentile latency on the last packet as described in Sec. 4. Training time is in hours per epoch, with the same data and infrastructure. *This model has twice the number of parameters as the Baseline, and suffers from prohibitively large serving latency.
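To make the per-slice comparison concrete, the relative WER reduction of Mix-3 over the baseline can be computed directly from Table 4. This is a small illustrative sketch; the dictionary layout and the helper name `relative_reduction` are assumptions.

```python
# WER (%) with language model decoding, copied from Table 4.
table4 = {
    "Clean casual speech": {"baseline": 5.90, "mix3": 5.80},
    "Farfield": {"baseline": 35.05, "mix3": 26.49},
    "Names": {"baseline": 19.73, "mix3": 17.40},
    "Overall": {"baseline": 18.46, "mix3": 15.74},
}

def relative_reduction(baseline, model):
    """Relative error reduction in percent (positive means `model` is better)."""
    return 100.0 * (baseline - model) / baseline

reductions = {
    name: relative_reduction(row["baseline"], row["mix3"])
    for name, row in table4.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative WER reduction")
```

The reductions come out at roughly 24% on Farfield and 12% on Names, versus under 2% on clean casual speech, which is consistent with the gains being concentrated in the harder categories.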

8 Conclusion

In this work, we identify multiple sources of bias in end-to-end speech systems, which tend to encourage very large neural network structures and thus make deployment impractical. We propose multiple methods to address these issues, which enable us to build a model that performs significantly better on our target dev set while still being suitable for streaming inference.

While the addition of cross entropy alignment training and the GramCTC loss allows models to fit the training and validation data better with respect to the WER of a greedy max decoding, these models see much less of a benefit from language model integration. Using an LC-BGRU layer in place of lookahead convolutions conveys benefits across the board, as does the use of a PCEN layer at the front end. Finally, generalization to unseen data is improved by the addition of farfield augmentation.

9 Acknowledgements

We are indebted to the Baidu Speech Technology Group for all the IR convolutions, and the Quality Assurance team for helping us identify and understand useful metrics (like first, last and 98% packet latency). We are also grateful to the systems team at SVAIL, for their help in developing the training platform and infrastructure.