Automatic Speech Recognition (ASR) enables fast and accurate transcription of voice commands and dictation on edge devices; examples of on-device ASR applications include dictation for the Google keyboard [schalkwyk_2019, he2019streaming] and voice commands for Apple Siri [vincent_2021] and Amazon Alexa [liu2021exploiting]. Past work developed streaming End-to-End (E2E) all-neural ASR models that run compactly on edge devices [he2019streaming, kim2019attention, shangguan2019optimizing, kim2020review]. These neural-network-based E2E ASR models, however, are prone to overfitting the training data and suffer large performance degradation on noisy or unseen test data, e.g., [narayanan2019recognizing].
Researchers have proposed many techniques to regularize the training of neural-network-based models, which have been shown to significantly improve the generalization of ASR models. Notable examples include layer normalization [Ba2016Layer], early stopping and dropout [zhou2017improved, gal2016theoretically], audio reverberation and background noise simulation [kim2017generation], spectrum data augmentation [Park2019, Park2020SpecAug], and speed perturbation of the audio inputs [ko2015audio, cui2015data].
In this work, we analyze a simple yet effective noisy training strategy for regularizing E2E streaming ASR model training. By introducing random noise to the parameter space during training, we force the ASR models to be smoother at convergence and thereby achieve better generalization [an1996effects]. We select a strongly regularized state-of-the-art E2E ASR model, an Emformer-based recurrent neural network transducer (RNN-T) model [shi2021emformer], and demonstrate how the proposed method helps reduce the model's word error rate (WER) on noisy test data.
As E2E streaming ASR models are often constrained by the hardware resources when deployed on-device, pruning is widely adopted to simultaneously reduce the model computation and parameters. Hence, besides applying the proposed noisy training on the dense RNN-T model (73 MB), we also evaluate its effectiveness on several sparse RNN-T models, whose sizes range from 45 MB to 20 MB. We observe consistent WER reduction across all model sizes. In fact, we find noisy training more effective in improving downsized or sparse RNN-T models compared to the baseline models. The overall contribution of the paper can be summarized as follows:
For the first time, noisy training is applied to Emformer-based RNN-T models and systematically studied within an on-device E2E RNN-T framework.
Noisy training is studied for both dense and sparse models across a wide range of model sizes. While consistent improvement is observed, we demonstrate 12% and 14% WER reduction on the LibriSpeech test-other and test-clean datasets, respectively, for the 90% sparse model. The experiments and ablation studies shed light on regularizing the training of small, on-device ASR models.
2 Streaming E2E ASR
In this work, we focus on streaming E2E ASR models, also known as incremental recognizers: as users speak, partial results surface in real time before the ASR finalizes its transcription. Streaming ASR models appear faster and more natural to users [aist2007incremental] and thus are popular for deployment on edge devices. They can also reduce the latency of downstream models in a pipeline, such as a real-time translation pipeline or a natural language understanding pipeline, because partial results can be fed downstream before the ASR finalizes its hypotheses [ShangguanKHMB20].
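Given a collection of data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with acoustic signals $x_i$ and corresponding text labels $y_i$, ASR models are often trained by maximizing the log-likelihood of the alignment sequence,
$$\max_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \log p_{\theta}(y \mid x) \right], \tag{1}$$
where $p_{\theta}(y \mid x)$ denotes the predictions of an ASR model and $\theta$ denotes the model parameters. We use the RNN-transducer loss to compute $\log p_{\theta}(y \mid x)$ [graves2012sequence]. In this work, we also adopt the alignment-restricted RNN-T loss to speed up model training on GPUs [mahadeokar2021alignment].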
The optimization in equation (1) is prone to overfitting. Empirically, we observe that the WER on the training data is much lower than that on the test set, even in the presence of other strong regularization techniques. Similar behavior is also reported by [2020GhodsiStateless] in the context of low-resource training data. We observe that the gaps in RNN-T loss and WER between the training set and the test set are larger for compressed or sparsity-pruned E2E ASR models than for large, dense models.
Figure 1: (a) parameter noise injection; (b) model overview.
2.1 Noisy training
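We investigate a simple method that effectively alleviates overfitting when training ASR models. Our method works by introducing random noise to the parameter space during training,
$$\max_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D},\, \epsilon \sim \pi} \left[ \log p_{\theta + \epsilon}(y \mid x) \right] - \lambda \|\theta\|^{2}. \tag{2}$$
Here $\pi$ is a noise distribution. As shown theoretically in [an1996effects], this noisy training strategy effectively avoids sharp local minima at convergence. Compared with standard MLE training in equation (1), optimizing a noise-perturbed objective yields smoother neural networks as solutions, which generalize better. We refer the reader to [an1996effects] for a more thorough theoretical treatment.
We set $\pi = \mathcal{N}(0, \sigma^{2} I)$ for simplicity. The squared term on $\theta$ is introduced to prevent $\theta$ from growing arbitrarily large and thereby canceling out the regularization effect of the noise. We keep its coefficient $\lambda$ at a default value throughout the paper.
Note that the exact computation of equation (2) is intractable. In practice, we approximate equation (2) with a single Monte Carlo sample $\epsilon$ from $\pi$ and a random mini-batch $\mathcal{B}$ from $\mathcal{D}$. More specifically, at each training step, we compute the gradient of $\theta$ as follows:
$$\nabla_{\theta} \left[ \frac{1}{|\mathcal{B}|} \sum_{(x, y) \in \mathcal{B}} \log p_{\theta + \epsilon}(y \mid x) - \lambda \|\theta\|^{2} \right], \quad \epsilon \sim \pi. \tag{3}$$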
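As a minimal sketch of the single-sample approximation described above, the toy example below fits a one-parameter linear model with weight noise drawn at every step. It assumes a squared-error loss in place of the RNN-T loss and isotropic Gaussian noise; the names `noisy_sgd`, `sigma` and `weight_decay`, and all numeric values, are hypothetical and are not the settings used in this work.

```python
import random


def noisy_sgd(data, steps=300, lr=0.1, sigma=0.01, weight_decay=1e-4,
              batch_size=2, seed=0):
    """Fit y ~ w * x with Gaussian weight noise injected at every step."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)   # random mini-batch
        eps = rng.gauss(0.0, sigma)            # one Monte Carlo noise sample
        w_noisy = w + eps                      # perturbed parameter
        # Gradient of the mean squared error, evaluated at the perturbed weight.
        grad = sum(2.0 * (w_noisy * x - y) * x for x, y in batch) / batch_size
        # Squared penalty acts on the clean (unperturbed) weight.
        grad += 2.0 * weight_decay * w
        w -= lr * grad                         # the update is applied to the clean weight
    return w


# Noise-free toy data generated by the true weight 2.0 (illustrative only).
data = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
w_hat = noisy_sgd(data)
print(round(w_hat, 2))
```

The noise only enters the forward and backward pass; the stored parameter is never overwritten by its perturbed copy, so a fresh perturbation is drawn at each step.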
2.2 Connections to Variational inference
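Past works point out that Bayesian inference with neural networks improves model generalization, accuracy and calibration [DBLP:journals/corr/abs-2002-08791, DBLP:journals/corr/abs-2007-06823]. For ASR, exact Bayesian inference requires sampling from a Bayesian NN posterior, which is computationally intractable. Variational inference provides a computationally efficient tool that enables fast posterior approximation through optimization [David2017VI]. Similar to Graves’ formulation in [graves2011practical], our noisy training can be viewed as finding a Gaussian proposal distribution over the real-valued weight parameters $w$, $q_{\theta}(w)$, to approximate the posterior distribution of the weights given the dataset $\mathcal{D}$, $p(w \mid \mathcal{D})$.
In variational inference, finding the best approximation of $p(w \mid \mathcal{D})$ can then be formulated as minimizing the Kullback-Leibler (KL) divergence between $q_{\theta}(w)$ and $p(w \mid \mathcal{D})$ as follows:
$$\min_{\theta} \; \mathrm{KL}\left( q_{\theta}(w) \,\|\, p(w \mid \mathcal{D}) \right). \tag{4}$$
Furthermore, minimizing equation (4) is equivalent to maximizing the following evidence lower bound (ELBO) [blei2017variational]:
$$\max_{\theta} \; \mathbb{E}_{w \sim q_{\theta}} \left[ \log p(\mathcal{D} \mid w) \right] - \mathrm{KL}\left( q_{\theta}(w) \,\|\, p(w) \right), \tag{5}$$
where $\mathbb{E}_{w \sim q_{\theta}} \left[ \log p(\mathcal{D} \mid w) \right]$ is the expected log-likelihood. Assume a simple Gaussian prior, e.g., $p(w) = \mathcal{N}(0, \sigma_0^{2} I)$, and assume that $q_{\theta}$ is a fully factorized Gaussian, e.g., $q_{\theta}(w) = \mathcal{N}(\theta, \sigma^{2} I)$ with $\sigma$ as a constant. In this way, the KL term can be integrated analytically,
$$\mathrm{KL}\left( q_{\theta}(w) \,\|\, p(w) \right) = \frac{\|\theta\|^{2}}{2 \sigma_0^{2}} + \mathrm{const}, \tag{6}$$
where the constant does not depend on $\theta$. Writing $w = \theta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^{2} I)$, the ELBO therefore takes the form of a noise-perturbed log-likelihood minus a squared penalty on $\theta$, which is the form of the noisy training objective in equation (2).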
2.3 Implementation Details
Note that our method requires specifying the noise distribution in equation (2), which controls the exploration in the parameter space. Intuitively, a large noise variance might yield an optimization objective that is hard to converge; on the other hand, a small noise variance might not sufficiently regularize the training, thereby yielding only marginal benefits. In this work, we find it most effective to set the noise variance adaptively according to the magnitude of the weights during training.
Specifically, consider a linear layer with weights $W \in \mathbb{R}^{d_{in} \times d_{out}}$, where $d_{in}$ represents the input dimension and $d_{out}$ denotes the output dimension. For each column $w_i$ of the weight matrix $W$, we perturb $w_i$ as follows,
During training, we treat the weight-dependent noise magnitudes as constants and stop the gradients from back-propagating through them.
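The per-column perturbation can be sketched as follows. This is only an illustration: it assumes that the noise standard deviation of a column is the noise scale times the mean absolute value of that column, which is one plausible reading of "magnitude of the weights" and not necessarily the statistic used in this work. The function name `perturb_columns` is hypothetical.

```python
import random


def perturb_columns(weight, noise_scale, rng):
    """Return a noisy copy of `weight`, a list of rows with shape d_in x d_out."""
    d_in, d_out = len(weight), len(weight[0])
    noisy = [row[:] for row in weight]
    for j in range(d_out):
        # Assumed magnitude statistic: mean absolute value of column j.
        magnitude = sum(abs(weight[i][j]) for i in range(d_in)) / d_in
        # The standard deviation is a plain number here; in an autodiff
        # framework it would be detached so that no gradient flows through it.
        std = noise_scale * magnitude
        for i in range(d_in):
            noisy[i][j] += rng.gauss(0.0, std)
    return noisy


rng = random.Random(0)
W = [[0.0, 1.0, -10.0],
     [0.0, -1.0, 10.0]]
W_noisy = perturb_columns(W, noise_scale=0.01, rng=rng)
```

With this scaling, a column of large weights receives proportionally larger noise, while an all-zero column is left untouched.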
Our experiments are designed and reported in three parts: in Section 3.1, we apply our method to improve the training of RNN-T models with different depths and widths in the Emformer-based encoder network; in Section 3.2, we show the effectiveness of noisy training on sparsity-pruned ASR models; and finally, in Section 3.3, we provide extensive ablation studies on how the noise scale and the location of noise injection affect the model’s performance.
Dataset and data augmentation
We use the LibriSpeech 960h corpus for experiments [panayotov2015librispeech]. To enforce strong regularization from data augmentation, we perturb the input audio speed with ratios 0.9, 1.0 and 1.1 using the techniques in [ko2015audio]. We then extract 80-dimensional log-Mel features using a sliding window of 25 ms with a step of 10 ms over the input audio. To regularize the model implicitly, we further apply spectrum data augmentation [Park2019] with frequency mask parameter F = 27 and T = 10 time masks with maximum time-mask ratio p = 0.05.
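The masking step can be illustrated with a small sketch that uses the parameters quoted above (F = 27, 10 time masks, p = 0.05). It assumes a single frequency mask and a mask value of zero, neither of which is stated in the text, and the function name `spec_augment` is hypothetical.

```python
import random


def spec_augment(features, freq_mask_param=27, num_time_masks=10,
                 max_time_ratio=0.05, rng=None):
    """Mask one frequency band and several time spans of a frames x bins matrix."""
    rng = rng or random.Random()
    num_frames, num_bins = len(features), len(features[0])
    out = [row[:] for row in features]
    # Frequency mask: width f in [0, F], start chosen so that the band fits.
    f = rng.randint(0, min(freq_mask_param, num_bins))
    f0 = rng.randint(0, num_bins - f)
    for row in out:
        for k in range(f0, f0 + f):
            row[k] = 0.0
    # Time masks: each width is at most max_time_ratio * num_frames.
    max_width = int(max_time_ratio * num_frames)
    for _ in range(num_time_masks):
        t = rng.randint(0, max_width)
        t0 = rng.randint(0, num_frames - t)
        for n in range(t0, t0 + t):
            out[n] = [0.0] * num_bins
    return out


# About 2 seconds of audio at a 10 ms step: roughly 200 frames of 80 log-Mel bins.
feats = [[1.0] * 80 for _ in range(200)]
masked = spec_augment(feats, rng=random.Random(0))
```

Because each time mask is capped at 5% of the utterance length, ten masks can hide at most half of the frames, and usually far fewer since the widths are drawn uniformly and may overlap.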
We use the recurrent neural network transducer (RNN-T) framework to represent E2E ASR models in this work. An RNN-T model typically contains three components: an encoder network, a prediction network and a joiner network. The encoder network converts frame-wise acoustic input into a high-level vector representation; the prediction network acts as a language model that converts previously predicted non-blank tokens into a high-level representation; the joiner combines the encoder and prediction network outputs and applies a softmax to predict the next token, including a blank token. We use simplified Emformer cells [shi2021emformer] to build the encoder of our RNN-T models, and Long Short-Term Memory (LSTM) cells to build the predictor. We provide a more detailed description of our model architecture in Section 4. We refer readers to [graves2012sequence, he2019streaming] for a more detailed explanation of RNN-T. Additionally, we show the RNN-T structure used in this work in Figure 1(b).
We explicitly regularize all RNN-T models in this work by adding auxiliary modules that attach auxiliary losses to the intermediate RNN-T encoder layers, as suggested in [liu2021improving].
We leverage state-of-the-art regularization techniques to build a strong baseline training pipeline. To summarize, we implement popular regularization techniques, including speed perturbation, spectrum data augmentation, layer normalization for all weight matrices in the RNN-T, and residual connections within the Emformer cells (see Section 4). We apply dropout [srivastava2014dropout] with ratio 0.1 to the Emformer weight matrices and dropout with ratio 0.3 to all other linear and LSTM layers to reduce overfitting.
On top of the above-mentioned regularization techniques, we apply noisy training to further boost performance. We report the WERs on the LibriSpeech test-clean and test-other datasets. All WER results in this work are scored with the NIST Sclite tool without GLM grammar replacement [NISTtool].
Noisy training settings
We add adaptive noise to the training following equation (8). We set the noise scale to be throughout the paper unless otherwise specified. We found this single setting performs well across all experimental setups.
3.1 Improving Emformers
In this part, we apply our noisy training to improve a number of Emformers with various model sizes.
Specifically, in Table 1, Emformer-nL denotes n layers of Emformer used in the RNN-T transcription network. Emformer-20L(0.5x) denotes removing 1/2 of the hidden units in each of the 20 layers of Emformer cells in the RNN-T encoder network, which reduces the number of encoder parameters by 75%. All models are trained for 120 epochs with a batch size of 1024.
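The 75% figure follows from weight matrices scaling quadratically with layer width. The sketch below is a rough count under stated assumptions: each layer holds four attention projections and a two-matrix feed-forward network, biases and layer normalization are ignored, and the feed-forward width is halved together with the hidden size. The sizes 512 and 2048 are the ones given in Section 4; the function name is hypothetical.

```python
def emformer_layer_params(hidden, ffn):
    """Approximate weight count of one layer (biases and layer norm ignored)."""
    attention = 4 * hidden * hidden      # query, key, value and output projections
    feed_forward = 2 * hidden * ffn      # two feed-forward matrices
    return attention + feed_forward


full = emformer_layer_params(hidden=512, ffn=2048)
half = emformer_layer_params(hidden=256, ffn=1024)
reduction = 1.0 - half / full
print(f"{reduction:.0%}")  # prints 75%
```

Halving every width divides each matrix by four, so the same ratio holds regardless of the number of layers.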
We find that our method achieves better WER than the baseline models without noisy training. The smaller the Emformer model, the more effective noisy training is in improving model performance.
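Table 1: WER on LibriSpeech for Emformer models of different sizes; relative change against the corresponding baseline in parentheses.
|Model||Test-other||Test-clean|
|+ Noisy training||12.0 (-8%)||4.6 (-12%)|
|+ Noisy training||10.8 (-6%)||4.0 (-11%)|
|+ Noisy training||9.7 (-5%)||3.8 (-5%)|
|+ Noisy training||9.5 (-4%)||3.5 (-8%)|
|+ Noisy training||10.8 (-8%)||4.3 (-8%)|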
3.2 Improving Pruning aware training
Sparsity pruning introduces block-patterns of zeros inside the weight matrices of the neural networks. It allows the E2E ASR models to run compactly with low latencies on sparsity-friendly hardware. Sparse models have also been shown to outperform similar-sized compressed models without sparsity on speech and language modeling tasks [shangguan2019optimizing, Zhu2018Prune, pang2018compression]. In this section, we apply our noisy training method to the training of sparse Emformer-based ASR models. We use weight magnitude based pruning, and sparsify only the Emformer-based encoder network in an RNN-T, which occupies of the size of the entire RNN-T model.
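Block-wise magnitude pruning can be sketched as below. The sketch assumes that an 8x1 block covers 8 consecutive rows within one column and that a block is scored by the sum of the absolute values of its entries; both are assumptions made for illustration, and `prune_blocks` is a hypothetical name.

```python
def prune_blocks(weight, sparsity, block_rows=8):
    """Zero the lowest-magnitude (block_rows x 1) blocks of a list-of-rows matrix."""
    rows, cols = len(weight), len(weight[0])
    assert rows % block_rows == 0, "row count must be a multiple of the block height"
    # Score each block by the sum of absolute values (assumed criterion).
    scores = []
    for b in range(rows // block_rows):
        for j in range(cols):
            score = sum(abs(weight[b * block_rows + r][j]) for r in range(block_rows))
            scores.append((score, b, j))
    scores.sort()
    num_pruned = int(sparsity * len(scores))
    pruned = [row[:] for row in weight]
    for _, b, j in scores[:num_pruned]:
        for r in range(block_rows):
            pruned[b * block_rows + r][j] = 0.0
    return pruned


# 16 x 4 toy matrix whose entries grow with the column index.
W = [[float(j + 1) for j in range(4)] for _ in range(16)]
W_sparse = prune_blocks(W, sparsity=0.5)
```

In this toy matrix the two lowest-magnitude columns are zeroed block by block, which leaves contiguous runs of zeros that block-sparse kernels can skip.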
Each model in Table 2 is trained from scratch with pruning-aware training. We use a sparsity block pattern of 8x1 and the cubic pruning schedule described in [Zhu2018Prune]:
$$p_t = p \left( 1 - \left( 1 - \frac{t - t_0}{n \Delta t} \right)^{3} \right),$$
where $t \in \{t_0, t_0 + \Delta t, \ldots, t_0 + n \Delta t\}$. Here $p_t$ denotes the pruning ratio at training step $t$ and $p$ is the target sparsity ratio. We start pruning at step $t_0$ and gradually increase the pruning ratio during training. Meanwhile, $\Delta t$ represents the pruning frequency and $n$ denotes the number of pruning steps. We set both $\Delta t$ and $n$ to be 256 for all sparse models.
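A minimal sketch of the cubic schedule of [Zhu2018Prune], assuming the sparsity starts at zero and is held at the target once all pruning updates are done. The start step of 0, the 90% target and the frequency and update counts in the usage lines are illustrative values, and the function name is hypothetical.

```python
def cubic_pruning_ratio(step, target, start_step, frequency, num_updates):
    """Sparsity to apply at `step`, rising from 0 to `target` along a cubic curve."""
    if step < start_step:
        return 0.0
    # The ratio only changes once every `frequency` training steps.
    updates_done = min((step - start_step) // frequency, num_updates)
    progress = updates_done / num_updates
    return target * (1.0 - (1.0 - progress) ** 3)


schedule = [cubic_pruning_ratio(t, target=0.9, start_step=0,
                                frequency=256, num_updates=256)
            for t in (0, 256 * 64, 256 * 128, 256 * 256, 256 * 300)]
```

The cubic shape prunes aggressively early, when many weights are redundant, and slows down as the target is approached: halfway through the schedule the ratio has already reached 87.5% of its final value.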
Meanwhile, because sparsified models are more difficult to optimize, for each pruning-aware training setting with a given pruning ratio, we linearly scale down its corresponding dropout ratio on the Emformer encoder from to .
As shown in Table 2, noisy training leads to consistent WER reduction in all the settings evaluated. In particular, the improvements become increasingly significant as the pruning sparsity increases. Specifically, for Emformers with 90% sparsity, our method achieves 12% and 14% WER reduction on the test-other and test-clean datasets, respectively, compared with the corresponding baseline model; for the Emformer with 50% sparsity, our results are comparable with the results from the standard baseline without noisy training.
Additionally, compared with Emformer-10L in Table 1, our Emformer with 50% sparsity achieves better WER while maintaining a similar model size. Our results confirm the effectiveness of network pruning for ASR model compression.
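Table 2: WER on LibriSpeech for sparsity-pruned Emformer models; relative change against the corresponding baseline in parentheses.
|Model||Test-other||Test-clean|
|+ Noisy training||10.1 (-6%)||3.7 (-8%)|
|+ Noisy training||10.4 (-7%)||4.1 (-9%)|
|+ Noisy training||12.9 (-12%)||5.1 (-14%)|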
Figure 2: panels showing training loss and validation loss against training epoch, and singular values against the index of singular values.
3.3 Ablation Studies
We provide empirical results to show how the noise can be tuned, where noise should be added, and how overfitting can be further reduced in the baseline Emformer-20L model which already has larger dropout ratios and strong data augmentation.
Magnitude of noise
By sweeping the hyper-parameter of noise scales, we show in Table 3 that noisy training reduces model overfitting and results in better WER performances while the results are relatively insensitive to the magnitude of noise.
Location of noise
By adding noise to each component of the RNN-T model separately, we see in Table 4 that noisy training is most effective when added to all parts of the RNN-T models; see an illustration of our model architecture in Figure 1 (b).
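Table 3: WER of the Emformer-20L model trained with different noise scales.
|Model (noise scale)||Test-other||Test-clean|
|+ Noisy training (0.005)||9.7||3.6|
|+ Noisy training (0.01)||9.5||3.5|
|+ Noisy training (0.05)||9.6||3.6|
|+ Noisy training (0.1)||9.8||3.6|
Table 4: WER when noise is added to different components of the RNN-T model.
|Where to +noise||Test-other||Test-clean|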
Logit space noise injection
Inspired by past work on noise injection into the pre-softmax output logits [wang2019improving], in addition to the parameter-space noise injection, we further explore the idea of adding noise to the output logits of the RNN-T model.
Table 5 shows that noise added to the logits does not improve the model further. We hypothesize that logit-space noise injection and weight noise injection achieve similar regularization effects in the RNN-T model.
Table 5: Adding Gaussian noise to the logit space cannot further improve the WER. “Logit noise” denotes the standard deviation of the Gaussian noise distributions used.
Noisy Training complements other regularization
First, we explore whether increasing the strength of dropout or data augmentation could lead to the same improvements in the model’s WER and reduce model overfitting. In Table 6, we show results of models when we increase the dropout on the Emformer from 0.1 to 0.2 and 0.3. We also show results of the baseline model configuration with stronger data augmentation, obtained by increasing the time-mask probability from 0.2 to 0.3. These changes lead to higher training loss and slightly lower validation loss, but higher WERs.
Noisy training complements these existing regularization techniques and is able to force the model to generalize better. We plot the training loss and validation loss during training for the baseline model, the models in Table 6, and the noisy-trained model in Figure 2. Compared with the baseline model, the noisy-trained model converges more slowly at the beginning of training, but its training loss continues to decrease at a faster rate after 60 epochs. With noisy training, the validation loss is significantly lower.
To illustrate the regularization power of our noisy training, we follow the work in [gao2019representation] and study the representation power of the learned Emformer-based RNN-T. Specifically, we visualize the singular values of the weight matrix from the last linear layer of the joiner network, which has size . When trained with noisy training, the singular values are distributed more uniformly, an indication that the noisy-trained embedding vectors fill a higher-dimensional subspace.
4 Additional details on Emformer-based RNN-T
We build the RNN-T model encoder network with a varying number of layers of Emformer cells [shi2021emformer], sandwiched between 2 linear projection layers. Emformer, an efficient variant of the transformer architecture [waswani2017attention], is one of the state-of-the-art architectures for E2E streaming speech recognizers. Each Emformer cell has 8 attention heads, 512 hidden units and an attention feed-forward network of 2048 units. We do not use the “memory bank” because in each step, the Emformer processes a long enough segment (160 ms) of input. We thus simplify the Emformer with the following formulation (using similar mathematical notation as in [shi2021emformer]):
where an input segment sequence is with i denoting the segment index and n denoting the layer index. refers to the right context segment; in this work we use only 1 right context segment. refers to the key, query and value originally specified in transformer cells [waswani2017attention]. are the weight matrices associated with the computation for and respectively. is the attention operation and refers to a feed-forward network. Figure 3 shows a visualization of these operations in the Emformer layer.
We build the predictor network of the RNN-T model with 3 layers of LSTMs with 512 hidden units, which are sandwiched between a linear embedding layer and a linear output projection layer. Our LSTM layers are the same as those shown in [shangguan2019optimizing]. A layer normalization is added to stabilize the hidden dynamics of the cell [Ba2016Layer]. The joiner network of the RNN-T contains one linear layer of size 1024 and a softmax that predicts the probability distribution over 4097 tokens: 4096 pre-trained sentence pieces [kudo-richardson-2018-sentencepiece] and a blank token.
5 Prior Works
Gaussian weight noise injection into neural network training is not a new technique. Noisy training for recurrent neural networks was discussed as early as 1996 in [jim1996analysis]. The paper concluded that noisy training in feed-forward RNNs resulted in faster convergence and better generalization.
In 2011, Graves framed Gaussian weight noise injection during the training of neural networks in the context of variational inference [graves2011practical]; Graves et al. [graves2013speech] further applied Gaussian weight noise empirically to the training of a speech model. There are two main differences between [graves2013speech] and this work. First, their noise was added once per training sequence to a Long Short-Term Memory-based acoustic model with phoneme targets and CTC loss. In our work, however, we apply noisy training in every forward step of the model, to a transformer-based E2E ASR model with RNN-T loss. Second, their best models were first trained without noisy training to the best log-likelihood on the dev set, and then fine-tuned with noisy training. In our work, we train the models from scratch with noisy training, which simplifies the training process.
Similarly, Shan et al. and Toshniwal et al. both applied Gaussian weight noise to the training of the attention-based, non-streaming Listen-Attend-and-Spell (LAS) E2E model [shan2018attention, toshniwal2018multilingual]. They mentioned noisy training but did not discuss in depth how much noisy training contributes relative to other regularizers, or how it affects models of different sizes, as we do.
In this work, we analyze the impact of noisy training on a streaming, on-device E2E ASR model trained with the RNN-T loss and existing strong regularization techniques. We systematically ablate noisy training with respect to the location and the magnitude of the noise added. To support the deployment of compressed models on device, this work specifically studies the impact of noisy training on transformer-based RNN-T models that are compressed or sparsity-pruned. We show that noisy training brings disproportionately larger performance gains on smaller models by reducing model overfitting, gains that could not be achieved simply by increasing the strength of other regularizers such as dropout or data augmentation.