1 Introduction
Automatic Speech Recognition (ASR) enables fast and accurate transcription of voice commands and dictation on edge devices; examples of on-device ASR applications include dictation for Google keyboard [schalkwyk_2019, he2019streaming], voice commands for Apple Siri [vincent_2021], and Amazon Alexa [liu2021exploiting]. Past work developed streaming End-to-End (E2E) all-neural ASR models that run compactly on edge devices [he2019streaming, kim2019attention, shangguan2019optimizing, kim2020review]. These neural-network-based E2E ASR models, however, are prone to overfitting the training data, and suffer large performance degradation on noisy or unseen testing data [narayanan2019recognizing].
Researchers have proposed many techniques to regularize the training of neural-network-based models, which have been shown to significantly improve the generalization of ASR models. Notable examples include layer normalization [Ba2016Layer], early stopping and dropout [zhou2017improved, gal2016theoretically], audio reverberation and background noise simulation [kim2017generation], spectrum data augmentation [Park2019, Park2020SpecAug], and speed perturbation of the audio inputs [ko2015audio, cui2015data].
In this work, we analyze a simple yet effective noisy training strategy that regularizes E2E streaming ASR model training. By introducing random noise to the parameter space during training, we force the ASR models to converge to smoother solutions that generalize better [an1996effects]. We select a strongly regularized state-of-the-art E2E ASR model, an Emformer-based recurrent neural network transducer (RNNT) [shi2021emformer], and demonstrate how the proposed method helps reduce the model word error rate (WER) on noisy testing data.
As E2E streaming ASR models are often constrained by hardware resources when deployed on-device, pruning is widely adopted to simultaneously reduce model computation and parameters. Hence, besides applying the proposed noisy training to the dense RNNT model (73 MB), we also evaluate its effectiveness on several sparse RNNT models, whose sizes range from 45 MB to 20 MB. We observe consistent WER reduction across all model sizes. In fact, we find noisy training more effective in improving downsized or sparse RNNT models than the baseline models. The overall contributions of the paper can be summarized as follows:

For the first time, noisy training is applied to Emformer-based RNNT models and systematically studied within an on-device E2E RNNT framework.

Noisy training is studied for both dense and sparse models across a wide range of model sizes. While consistent improvement is observed across the board, we demonstrate 12% and 14% WER reduction on the LibriSpeech test-other and test-clean datasets for the 90% sparse model. The experiments and ablation study shed insights into the regularization of small, on-device ASR models during training.
2 Streaming E2E ASR
In this work, we focus on streaming E2E ASR models, also known as incremental recognizers: as users speak, partial results already start to surface in real-time, before the ASR finalizes its transcription. Streaming ASR models appear faster and more natural to users [aist2007incremental] and are thus popular for deployment on edge devices. They can also reduce the latency of downstream models in a pipeline, such as a real-time translation pipeline or a natural language understanding pipeline, because partial results can be fed downstream before the ASR finalizes its hypotheses [ShangguanKHMB20].
Given a collection of data $\mathcal{D}$ of acoustic signals $x$ and their corresponding text labels $y$, ASR models are often trained by maximizing the log-likelihood of the alignment sequence,

$$\max_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\log p(y \mid x; \theta)\big], \qquad (1)$$

where $p(y \mid x; \theta)$ denotes the prediction of the ASR model and $\theta$ denotes the model parameters. We use the RNN-transducer loss to compute $\log p(y \mid x; \theta)$ [graves2012sequence]. In this work, we also adopt the alignment-restriction RNNT loss to speed up model training on GPUs [mahadeokar2021alignment].
The optimization in equation (1) is prone to overfitting. Empirically, we observe that the WER on the training data is much smaller than that on the testing set, even in the presence of other strong regularization techniques. Similar behavior is also reported by [2020GhodsiStateless] in the context of low-resource training data. We observe that the RNNT loss and WER gaps between the training and testing sets are more pronounced for compressed or sparsity-pruned E2E ASR models than for large, dense models.
Figure 1: (a) parameter noise injection; (b) model overview.
2.1 Noisy training
We investigate a simple method that effectively alleviates overfitting when training ASR models. Our method works by introducing random noise to the parameter space during training,

$$\min_{\theta} \; \ell(\theta) := -\,\mathbb{E}_{\epsilon \sim \pi}\,\mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\log p(y \mid x; \theta + \epsilon)\big] + \lambda \lVert \theta \rVert_2^2. \qquad (2)$$

Here $\pi$ is a noise distribution. As shown theoretically in [an1996effects], this noisy training strategy effectively avoids sharp local minima at convergence. Compared with standard MLE training in equation (1), optimizing a noise-perturbed objective yields smoother neural networks as solutions, which generalize better. We refer the reader to [an1996effects] for more theoretical treatments.
We set $\pi$ to a zero-centered Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$ with variance $\sigma^2$ for simplicity. The squared term on $\theta$ is introduced to prevent $\theta$ from growing arbitrarily large, which would otherwise cancel out the regularization effect of the noise. We keep $\lambda$ at a single default value throughout the paper.
Note that the exact computation of equation (2) is intractable. In practice, we approximate equation (2) with a single Monte Carlo sample $\epsilon \sim \pi$ and a random minibatch $B \subset \mathcal{D}$. More specifically, at each training step, we compute the gradient of $\ell$ as follows:

$$\nabla_\theta \ell(\theta) \approx \nabla_\theta \left[ -\frac{1}{|B|} \sum_{(x, y) \in B} \log p(y \mid x; \theta + \epsilon) + \lambda \lVert \theta \rVert_2^2 \right], \qquad \epsilon \sim \pi. \qquad (3)$$
We summarize our noisy training pipeline in Algorithm 1. See Figure 1 (a) for an illustration.
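To make the per-step update concrete, here is a minimal numpy sketch of one training step following equation (3): sample a single Gaussian noise vector, evaluate the minibatch gradient at the perturbed parameters, and add the gradient of the L2 penalty. The toy quadratic log-likelihood, the step size, and the scales are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def noisy_training_step(theta, grad_loglik, rng, sigma=0.01, lam=1e-4, lr=0.1):
    """One gradient-descent step on the noise-perturbed objective (eq. 3)."""
    # Single Monte Carlo sample from the noise distribution pi = N(0, sigma^2 I)
    eps = rng.normal(0.0, sigma, size=theta.shape)
    # Gradient of eq. (3): negative minibatch log-likelihood gradient evaluated
    # at the perturbed parameters, plus the gradient of the L2 penalty
    g = -grad_loglik(theta + eps) + 2.0 * lam * theta
    return theta - lr * g

# Toy check: log-likelihood -0.5 * ||theta - target||^2 has its maximizer
# near target, so repeated noisy steps should converge close to it.
target = np.array([1.0, -2.0])
grad_loglik = lambda th: -(th - target)   # gradient of the toy log-likelihood
rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(500):
    theta = noisy_training_step(theta, grad_loglik, rng)
```

In a real pipeline the same pattern applies per weight tensor, with `grad_loglik` replaced by backpropagation through the RNNT loss.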
2.2 Connections to Variational inference
Past works point out that Bayesian inference with neural networks improves model generalization, accuracy and calibration [DBLP:journals/corr/abs200208791, DBLP:journals/corr/abs200706823]. For ASR, exact Bayesian inference requires sampling from a Bayesian neural network posterior, which is computationally intractable. Variational inference provides a computationally efficient tool that enables fast posterior approximation through optimization [David2017VI]. Similar to Graves' formulation in [graves2011practical], our noisy training can be viewed as finding a Gaussian proposal distribution $q(\theta) = \mathcal{N}(\theta; \mu, \sigma^2 I)$ over the real-valued weight parameters $\theta$ to approximate the posterior distribution of the weights given the dataset $\mathcal{D}$, $p(\theta \mid \mathcal{D})$.
In variational inference, finding the best approximation of $p(\theta \mid \mathcal{D})$ can then be formulated as minimizing the Kullback-Leibler (KL) divergence between $q(\theta)$ and $p(\theta \mid \mathcal{D})$ as follows:

$$\min_{q} \; \mathrm{KL}\big(q(\theta) \,\big\|\, p(\theta \mid \mathcal{D})\big). \qquad (4)$$

Furthermore, minimizing equation (4) is equivalent to maximizing the following evidence lower bound (ELBO) [blei2017variational]:

$$\mathrm{ELBO}(q) = \mathbb{E}_{q(\theta)}\big[\log p(\mathcal{D} \mid \theta)\big] - \mathrm{KL}\big(q(\theta) \,\big\|\, p(\theta)\big), \qquad (5)$$

where the first term is the expected log-likelihood. Assume a simple Gaussian prior, e.g., $p(\theta) = \mathcal{N}(0, \sigma_0^2 I)$, and assume a fully factorized Gaussian, e.g., $q(\theta) = \mathcal{N}(\mu, \sigma^2 I)$ with $\sigma$ a constant. In this way, the KL term can be integrated analytically,

$$\mathrm{KL}\big(q(\theta) \,\big\|\, p(\theta)\big) = \frac{\lVert \mu \rVert_2^2}{2\sigma_0^2} + \mathrm{const}. \qquad (6)$$

Substituting equation (6) into equation (5), we have

$$\mathrm{ELBO}(q) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[\log p(\mathcal{D} \mid \mu + \epsilon)\big] - \frac{\lVert \mu \rVert_2^2}{2\sigma_0^2} + \mathrm{const}. \qquad (7)$$

It is easy to see that maximizing the ELBO over $\mu$ as in equation (7) gives the same optimization objective as minimizing equation (2), with $\lambda = 1/(2\sigma_0^2)$.
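The closed-form KL term of equation (6) is easy to sanity-check numerically. The sketch below uses the standard univariate Gaussian KL formula and confirms that, with $\sigma$ held constant, the only part that varies with the mean is the quadratic term $\mu^2 / (2\sigma_0^2)$; the specific values of $\sigma$ and $\sigma_0$ are arbitrary.

```python
import math

def kl_gauss(mu, sigma, sigma0):
    """KL( N(mu, sigma^2) || N(0, sigma0^2) ) for univariate Gaussians."""
    return math.log(sigma0 / sigma) + (sigma**2 + mu**2) / (2 * sigma0**2) - 0.5

# With sigma held constant, the mu-dependent part of the KL is exactly
# mu^2 / (2 * sigma0^2), matching equation (6) up to an additive constant.
sigma, sigma0 = 0.01, 1.0
gap = kl_gauss(2.5, sigma, sigma0) - kl_gauss(0.0, sigma, sigma0)
```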
2.3 Implementation Details
Note that our method requires specifying a noise distribution $\pi$ (see equation (2)) to control the exploration in the parameter space. Intuitively, a large noise scale $\sigma$ might yield an optimization objective that is hard to converge; on the other hand, a small noise might not sufficiently regularize the training, therefore yielding only marginal benefits. In this work, we find it most effective to set $\sigma$ adaptively according to the magnitude of the weights during training.
Specifically, consider a linear layer with weights $W \in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{out}}}$, where $d_{\mathrm{in}}$ represents the input dimension and $d_{\mathrm{out}}$ denotes the output dimension. For each column $w_j$ of the weight matrix $W$, we perturb $w_j$ as follows,

$$\tilde{w}_j = w_j + \epsilon_j, \qquad \epsilon_j \sim \mathcal{N}\big(0, \, (\sigma \, \bar{w}_j)^2 I\big), \qquad \bar{w}_j = \frac{1}{d_{\mathrm{in}}} \sum_{i=1}^{d_{\mathrm{in}}} \lvert W_{ij} \rvert. \qquad (8)$$

During training, we treat the per-column scales $\bar{w}_j$ as constants and stop the gradients from backpropagating through them.
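A numpy sketch of the adaptive, per-column perturbation of equation (8). The per-column scale here is the mean absolute weight magnitude, which is one plausible reading of "adaptive to the magnitude of the weights"; the exact statistic is an assumption. In an actual training loop the scale would be computed under stop-gradient.

```python
import numpy as np

def adaptive_column_noise(W, sigma, rng):
    """Perturb each column of W with Gaussian noise whose standard deviation
    is sigma times that column's mean absolute magnitude (a sketch of eq. 8).
    The per-column scale is a constant here, i.e. no gradient flows through it."""
    col_mag = np.abs(W).mean(axis=0, keepdims=True)   # shape (1, d_out)
    eps = rng.normal(size=W.shape) * sigma * col_mag  # column-wise scaled noise
    return W + eps

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 1024))                      # d_in x d_out
W_noisy = adaptive_column_noise(W, sigma=0.01, rng=rng)
```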
3 Experiments
Our experiments are designed and reported in three parts: in Section 3.1 we apply our method to improve the training of RNNT models with different depths and widths of the Emformer-based encoder network; in Section 3.2, we show the effectiveness of noisy training on sparsity-pruned ASR models; and finally, in Section 3.3, we provide extensive ablation studies on the impact of the noise scale and the location of noise injection on model performance.
Dataset and data augmentation
We use the LibriSpeech 960h corpus for experiments [panayotov2015librispeech]. To enforce strong regularization from data augmentation, we perturb the input audio speed with ratios 0.9, 1.0 and 1.1 using techniques in [ko2015audio]. We then extract 80-dimensional log-Mel features using a sliding window of 25 ms with a step of 10 ms over the input audio. To further regularize the model implicitly, we apply spectrum data augmentation [Park2019] with frequency-mask parameter F=27 and T=10 time masks with maximum time-mask ratio p=0.05.
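For illustration, here is a minimal SpecAugment-style masking sketch over a log-Mel feature matrix, loosely following the F, T and p parameters quoted above. The mask fill value (zero), the use of a single frequency mask, and the sampling details are assumptions; this is not the exact policy of [Park2019].

```python
import numpy as np

def spec_augment(feats, F=27, T=10, num_time_masks=10, p=0.05, rng=None):
    """SpecAugment-style masking sketch on an (n_frames, n_mels) matrix:
    one frequency mask of width up to F, plus several time masks whose
    widths are capped at T frames and at p * n_frames."""
    rng = rng or np.random.default_rng()
    x = feats.copy()
    n_frames, n_mels = x.shape
    # One frequency mask of width f in [1, F]
    f = int(rng.integers(1, F + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    x[:, f0:f0 + f] = 0.0
    # Several time masks, each at most min(T, p * n_frames) frames wide
    max_t = max(1, int(p * n_frames))
    for _ in range(num_time_masks):
        t = int(rng.integers(0, min(T, max_t) + 1))
        t0 = int(rng.integers(0, n_frames - t + 1))
        x[t0:t0 + t, :] = 0.0
    return x

feats = np.random.default_rng(1).normal(size=(300, 80))   # 3 s of 10 ms frames
masked = spec_augment(feats, rng=np.random.default_rng(2))
```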
Model architecture
We use the recurrent neural network transducer (RNNT) framework to represent E2E ASR models in this work. An RNNT model typically contains three components: an encoder network, a prediction network and a joiner network. The encoder network converts frame-wise acoustic input into a high-level vector representation; the prediction network acts as a language model that converts previously predicted non-blank tokens into a high-level representation; the joiner combines the encoder and prediction network outputs and applies a softmax to predict the next token, including a blank token. We use simplified Emformer cells [shi2021emformer] to build the encoder of our RNNT models, and Long Short-Term Memory (LSTM) cells to build the predictor. We provide a more detailed description of our model architecture in Section 4, and refer readers to [graves2012sequence, he2019streaming] for a fuller explanation of RNNT. Additionally, we show the RNNT structure used in this work in Figure 1(b). We explicitly regularize all RNNT models in this work by adding extra modules that introduce auxiliary losses at intermediate RNNT encoder layers, as suggested in [liu2021improving].
Regularization settings
We leverage state-of-the-art regularization techniques to build a strong baseline training pipeline. To summarize, we implement popular regularization techniques including speed perturbation, spectrum data augmentation, layer normalization for all weight matrices in the RNNT, and residual connections within the Emformer cells (see Section 4). We apply dropout [srivastava2014dropout] with ratio 0.1 to the Emformer weight matrices, and dropout with ratio 0.3 to all other linear and LSTM layers to reduce overfitting. On top of the above-mentioned regularization techniques, we apply noisy training to further boost performance. We report WERs on the LibriSpeech test-clean and test-other datasets. All WER results in this work are scored with the NIST Sclite tool without GLM grammar replacement [NISTtool].
Noisy training settings
We add adaptive noise during training following equation (8). We set the noise scale to $\sigma = 0.01$ throughout the paper unless otherwise specified (see Table 3). We found this single setting performs well across all experimental setups.
3.1 Improving Emformers
In this part, we apply our noisy training to improve a number of Emformers with various model sizes.
Settings
Specifically, in Table 1, we denote Emformer-nL as an RNNT whose transcription network uses n layers of Emformer. We also denote Emformer-20L (0.5x) as the model obtained by removing half of the hidden units in each of the 20 Emformer layers of the RNNT encoder network; since the parameter count of each layer scales roughly quadratically with its width, this reduces the total number of encoder parameters by about 75%. All models are trained for 120 epochs with a batch size of 1024.
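The 75% figure follows from the roughly quadratic dependence of layer size on width. A back-of-the-envelope count (attention projections plus FFN only; biases, norms, predictor and joiner ignored, and the FFN width assumed to be halved alongside the hidden units) illustrates this. It deliberately will not match the full model sizes in Table 1.

```python
# Parameter count of the per-layer weight matrices scales quadratically
# with width, so halving every dimension removes ~75% of the parameters.
def encoder_params(n_layers, d_model, d_ffn):
    attn = 4 * d_model * d_model   # W_q, W_k, W_v and an output projection
    ffn = 2 * d_model * d_ffn      # two FFN projections
    return n_layers * (attn + ffn)

full = encoder_params(20, 512, 2048)   # Emformer-20L-like shapes (Section 4)
half = encoder_params(20, 256, 1024)   # 0.5x width everywhere
reduction = 1 - half / full            # fraction of encoder parameters removed
```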
Results
We find that our method achieves better WER than the baseline models trained without noise. The smaller the Emformer model, the more effective noisy training is in improving model performance.
Method | #Params (M) | test-other | test-clean
Emformer-7L | 35.7 | 13.0 | 5.2
+ Noisy training | 35.7 | 12.0 (8%) | 4.6 (12%)
Emformer-10L | 45.4 | 11.5 | 4.5
+ Noisy training | 45.4 | 10.8 (6%) | 4.0 (11%)
Emformer-14L | 57.8 | 10.2 | 4.0
+ Noisy training | 57.8 | 9.7 (5%) | 3.8 (5%)
Emformer-20L | 76.7 | 9.9 | 3.8
+ Noisy training | 76.7 | 9.5 (4%) | 3.5 (8%)
Emformer-20L (0.5x) | 29.1 | 11.7 | 4.7
+ Noisy training | 29.1 | 10.8 (8%) | 4.3 (8%)
3.2 Improving Pruning-aware Training
Sparsity pruning introduces block patterns of zeros inside the weight matrices of neural networks. It allows E2E ASR models to run compactly with low latency on sparsity-friendly hardware. Sparse models have also been shown to outperform similar-sized compressed models without sparsity on speech and language modeling tasks [shangguan2019optimizing, Zhu2018Prune, pang2018compression]. In this section, we apply our noisy training method to the training of sparse Emformer-based ASR models. We use weight-magnitude-based pruning, and sparsify only the Emformer-based encoder network in the RNNT, which accounts for the majority of the size of the entire RNNT model.
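As a concrete sketch of what 8x1 block-pattern magnitude pruning does, the following numpy function scores each 8x1 block of a weight matrix by its L2 norm and zeros the lowest-scoring blocks until a target sparsity is met. The scoring statistic and tie-breaking are assumptions; pruning-aware training would apply such a mask gradually following a schedule rather than in one shot.

```python
import numpy as np

def block_magnitude_mask(W, sparsity, block=(8, 1)):
    """Zero out the lowest-magnitude (8 x 1) blocks of W until the target
    sparsity is reached; blocks are scored by their L2 norm."""
    bh, bw = block
    m, n = W.shape
    assert m % bh == 0 and n % bw == 0
    blocks = W.reshape(m // bh, bh, n // bw, bw)
    scores = np.sqrt((blocks ** 2).sum(axis=(1, 3)))   # one score per block
    k = int(sparsity * scores.size)                    # number of blocks to prune
    thresh = np.sort(scores.ravel())[k - 1] if k > 0 else -np.inf
    keep = (scores > thresh)[:, None, :, None]         # broadcast over block dims
    return (blocks * keep).reshape(m, n)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
W_sparse = block_magnitude_mask(W, sparsity=0.5)
```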
Settings
Each model in Table 2 is trained from scratch with pruning-aware training. We use a sparsity block pattern of 8x1 and the cubic pruning schedule described in [Zhu2018Prune]:

$$s_t = s_f \left( 1 - \left( 1 - \frac{t - t_0}{n \Delta t} \right)^{3} \right), \qquad t \in \{ t_0, \, t_0 + \Delta t, \, \ldots, \, t_0 + n \Delta t \}, \qquad (9)$$

where $s_t$ denotes the pruning ratio at training step $t$ and $s_f$ is the target sparsity ratio. We start pruning at step $t_0$ and gradually increase the pruning ratio during training. Meanwhile, $\Delta t$ represents the pruning frequency and $n$ denotes the number of pruning steps. We set both $\Delta t$ and $n$ to 256 for all sparse models.
Meanwhile, sparsified models are more difficult to optimize. For each pruning-aware training setting, we therefore linearly scale down the corresponding dropout ratio on the Emformer encoder with the target pruning ratio $s_f$.
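The cubic schedule of equation (9) can be written as a small helper; `t0` is left as a parameter since its value is elided in the text, and the defaults Δt = n = 256 follow the settings above.

```python
def pruning_ratio(t, target_sparsity, t0=0, dt=256, n=256):
    """Cubic sparsity schedule of equation (9): ramps the pruning ratio
    from 0 at step t0 up to target_sparsity after n pruning steps spaced
    dt apart, then holds it constant."""
    if t < t0:
        return 0.0
    frac = min((t - t0) / (n * dt), 1.0)   # progress through the ramp
    return target_sparsity * (1.0 - (1.0 - frac) ** 3)
```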
Results
As shown in Table 2, our noisy training leads to consistent WER reduction in all the settings evaluated. In particular, the improvements become increasingly significant as we increase the pruning sparsity. Specifically, for Emformers with 90% sparsity, our method achieves 12% and 14% WER reduction on the test-other and test-clean datasets, respectively, compared with the corresponding baseline model; for the Emformer with 50% sparsity, our results are comparable with those of the standard baseline without noisy training.
Additionally, compared with Emformer10L in Table 1, our Emformer with 50% sparsity achieves better WER while maintaining a similar model size. Our results confirm the effectiveness of network pruning for ASR model compression.
Method | #Params (M) | test-other | test-clean
Sparse 50% | 45.1 | 10.7 | 4.0
+ Noisy training | 45.1 | 10.1 (6%) | 3.7 (8%)
Sparse 70% | 32.4 | 11.2 | 4.5
+ Noisy training | 32.4 | 10.4 (7%) | 4.1 (9%)
Sparse 90% | 19.6 | 14.7 | 5.9
+ Noisy training | 19.6 | 12.9 (12%) | 5.1 (14%)
Figure 2: (a) training loss and (b) validation loss against training epochs; (c) log singular values of the joiner weight matrix against singular-value index.
3.3 Ablation Studies
We provide empirical results to show how the noise can be tuned, where the noise should be added, and how overfitting can be further reduced in the baseline Emformer-20L model, which already uses large dropout ratios and strong data augmentation.
Magnitude of noise
By sweeping the noise-scale hyperparameter, we show in Table 3 that noisy training reduces model overfitting and yields better WERs, while the results are relatively insensitive to the magnitude of the noise.
Location of noise
By adding noise to each component of the RNNT model separately, we see in Table 4 that noisy training is most effective when noise is added to all parts of the RNNT model; see an illustration of our model architecture in Figure 1(b).
Method | test-other | test-clean
Baseline | 9.9 | 3.8
+ Noisy training (0.005) | 9.7 | 3.6
+ Noisy training (0.01) | 9.5 | 3.5
+ Noisy training (0.05) | 9.6 | 3.6
+ Noisy training (0.1) | 9.8 | 3.6
Where to add noise | test-other | test-clean
Baseline: none | 9.9 | 3.8
Baseline: all | 9.5 | 3.5
Encoder only | 9.7 | 3.6
Predictor only | 9.8 | 3.6
Joiner only | 9.7 | 3.6
Logit space noise injection
Inspired by past work on noise injection into the pre-softmax output logits [wang2019improving], in addition to parameter-space noise injection, we further explore adding noise to the output logits of the RNNT model. Table 5 shows that noise added to the logits does not further improve the model. We hypothesize that logit-space noise injection and weight noise injection achieve similar regularization effects in the RNNT model.
Model | Logit noise | test-other | test-clean
Dense | 0.01 | 9.5 | 3.6
Dense | 0.05 | 9.6 | 3.5
Adding Gaussian noise to the logit space cannot further improve the WER. "Logit noise" denotes the standard deviation of the Gaussian noise distribution used.
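For reference, logit-space injection amounts to a one-line change at the joiner output. A hedged numpy sketch follows; the batch shape is illustrative, and in practice the noise would be applied only during training, not at inference.

```python
import numpy as np

def noisy_logits(logits, std, rng):
    """Add zero-mean Gaussian noise to pre-softmax logits (training only)."""
    return logits + rng.normal(0.0, std, size=logits.shape)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4097))          # a toy batch of joiner outputs
probs = softmax(noisy_logits(logits, std=0.05, rng=rng))
```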
Noisy training complements other regularization
First, we explore whether increasing the strength of dropout or data augmentation could deliver the same WER improvements and reduce model overfitting. In Table 6, we show results of models in which we increase the dropout on the Emformer from 0.1 to 0.2 and 0.3. We also show results for the baseline model configuration with stronger data augmentation, obtained by increasing the time-mask probability from 0.2 to 0.3. These changes lead to higher training loss and slightly lower validation loss, but higher WERs.
Method | test-other | test-clean
Dropout 0.2 | 9.9 | 3.9
Dropout 0.3 | 10.6 | 4.1
Stronger SpecAug | 9.9 | 3.8
Noisy training complements these existing regularization techniques and is able to force the model to generalize better. We plot the training and validation losses during training for the baseline model, the models in Table 6, and the noisy-trained model in Figure 2. Compared with the baseline model, the noisy-trained model converges more slowly at the beginning of training, but its training loss continues to decrease at a faster rate after 60 epochs. With noisy training, the validation loss is significantly lowered.
To illustrate the regularization power of our noisy training, we follow the work in [gao2019representation] and study the representation power of the learned Emformer-based RNNT. Specifically, we visualize the weight matrix of the last linear layer of the joiner network, which has size 1024 x 4097. When trained with noisy training, the singular values are distributed more uniformly, an indication that the noisy-trained embedding vectors fill a higher-dimensional subspace.
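The singular-value diagnostic can be reproduced with a few lines of numpy: compute the SVD of the joiner weight matrix and inspect how quickly the log spectrum decays. The toy matrices below stand in for actual trained weights and are purely illustrative of the flat-versus-decaying contrast.

```python
import numpy as np

def log_singular_values(W):
    """Log singular-value spectrum of a weight matrix; a flatter spectrum
    indicates the rows/columns span a higher-dimensional subspace."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.log(s)

rng = np.random.default_rng(0)
flat = rng.normal(size=(128, 64))                 # near-isotropic: flat spectrum
decayed = flat * np.geomspace(1.0, 1e-3, 64)      # shrink later columns: fast decay
spread_flat = np.ptp(log_singular_values(flat))       # spread of the log spectrum
spread_decayed = np.ptp(log_singular_values(decayed))
```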
4 Additional details on Emformer-based RNNT
We build the RNNT model encoder network with various numbers of Emformer layers [shi2021emformer], sandwiched between two linear projection layers. Emformer, an efficient variant of the transformer architecture [waswani2017attention], is one of the state-of-the-art architectures for E2E streaming speech recognizers. Each Emformer cell has 8 attention heads, 512 hidden units and an attention feed-forward network of 2048 units. We do not use the "memory bank" because at each step the Emformer processes a long enough segment (160 ms) of input. We thus simplify the Emformer with the following formulation (using mathematical notation similar to [shi2021emformer]):
$$\hat{X}_i^n = \mathrm{LayerNorm}\big(X_i^n\big) \qquad (10)$$
$$Q_i^n = W_q \hat{X}_i^n \qquad (11)$$
$$K_i^n = W_k \hat{X}_i^n \qquad (12)$$
$$V_i^n = W_v \hat{X}_i^n \qquad (13)$$
$$Z_i^n = \mathrm{Attn}\big(Q_i^n, K_i^n, V_i^n\big) \qquad (14)$$
$$\bar{Z}_i^n = Z_i^n + X_i^n \qquad (15)$$
$$H_i^n = \mathrm{FFN}\big(\mathrm{LayerNorm}(\bar{Z}_i^n)\big) \qquad (16)$$
$$X_i^{n+1} = \mathrm{LayerNorm}\big(H_i^n + \bar{Z}_i^n\big) \qquad (17)$$

where $X_i^n = [C_i^n, R_i^n]$ is an input segment sequence, with $i$ denoting the segment index and $n$ denoting the layer index. $R_i^n$ refers to the right-context segment; in this work we use only one right-context segment. $Q_i^n$, $K_i^n$ and $V_i^n$ refer to the query, key and value originally specified in transformer cells [waswani2017attention]. $W_q$, $W_k$ and $W_v$ are the weight matrices associated with the computation of $Q_i^n$, $K_i^n$ and $V_i^n$, respectively. $\mathrm{Attn}$ is the attention operation and $\mathrm{FFN}$ refers to a feed-forward network. Figure 3 shows a visualization of these operations in the Emformer layer.
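A single-head numpy sketch of the simplified layer laid out in equations (10)-(17): pre-LayerNorm attention with a residual connection, then a LayerNorm + FFN block with a second residual. Head splitting, biases, dropout and the chunked attention mask are omitted; this is an illustrative reading of the formulation, not Emformer's reference implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # scaled dot-product
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ V

def simplified_emformer_layer(X, Wq, Wk, Wv, W1, W2):
    """One simplified (memory-bank-free) Emformer layer, single head.
    X: (segment_len + right_context_len, d_model), center frames and the
    right-context frames already concatenated along the time axis."""
    Xh = layer_norm(X)                               # eq (10)
    Q, K, V = Xh @ Wq, Xh @ Wk, Xh @ Wv              # eqs (11)-(13)
    Z = attention(Q, K, V)                           # eq (14)
    Z_bar = Z + X                                    # eq (15): residual
    H = np.maximum(layer_norm(Z_bar) @ W1, 0.0) @ W2 # eq (16): FFN with ReLU
    return layer_norm(H + Z_bar)                     # eq (17): second residual

rng = np.random.default_rng(0)
d_model, d_ffn = 512, 2048
X = rng.normal(size=(20, d_model))                   # e.g. 16 center + 4 right-context frames
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(3))
W1 = rng.normal(size=(d_model, d_ffn)) * 0.02
W2 = rng.normal(size=(d_ffn, d_model)) * 0.02
Y = simplified_emformer_layer(X, Wq, Wk, Wv, W1, W2)
```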
We build the predictor network of the RNNT model with 3 layers of LSTMs with 512 hidden units, sandwiched between a linear embedding layer and a linear output projection layer. Our LSTM layers are the same as those in [shangguan2019optimizing]. Layer normalization is added to stabilize the hidden dynamics of the cells [Ba2016Layer]. The joiner network of the RNNT contains one linear layer of size 1024, and a softmax that predicts the probability distribution over 4097 tokens: 4096 pre-trained sentence pieces [kudorichardson2018sentencepiece] and a blank token.
5 Prior Work
Gaussian weight noise injection into neural network training is not a new technique. Noisy training for recurrent neural networks was discussed as early as 1996 in [jim1996analysis], which concluded that noisy training of RNNs resulted in faster convergence and better generalization.
Fast-forward to 2011: Graves framed Gaussian weight noise injection during neural network training in the context of variational inference [graves2011practical]; Graves et al. [graves2013speech] further applied Gaussian weight noise empirically to a speech model training process. There are two main differences between [graves2013speech] and this work. First, in that work the noise was added once per training sequence to a Long Short-Term Memory-based acoustic model with phoneme targets and CTC loss; in our work, we apply noise at every forward step of the model, to a transformer-based E2E ASR model with RNNT loss. Second, their best models were first trained without noisy training to the best log-likelihood over the dev set before being fine-tuned further with noisy training; in our work, we train the models from scratch with noisy training, simplifying the training process.
Similarly, Shan et al. and Toshniwal et al. both applied Gaussian weight noise to the training of attention-based, non-streaming Listen-Attend-and-Spell (LAS) E2E models [shan2018attention, toshniwal2018multilingual]. They mentioned noisy training but did not discuss in depth its contribution relative to other regularizers, or its impact on models of different sizes, as we do.
6 Conclusion
In this work, we analyze the impact of noisy training on a streaming, on-device E2E ASR model trained with the RNNT loss and existing strong regularization techniques. We systematically study the ablation of noisy training with respect to the location and magnitude of the added noise. To support on-device deployment of compressed models, this work specifically studies the impact of noisy training on transformer-based RNNT models that are compressed or sparsity-pruned. We show that noisy training brings disproportionately more performance gain on smaller models by reducing model overfitting, which cannot be achieved simply by increasing the strength of other regularizers such as dropout or data augmentation.