1 Introduction
With the aid of recent advances in neural networks, end-to-end deep learning systems for automatic speech recognition (ASR) have gained popularity and achieved extraordinary performance on a variety of benchmarks
[1, 2, 3, 4]. End-to-end ASR models typically consist of Recurrent Neural Networks (RNNs) with Sequence-to-Sequence (Seq2Seq) architectures and attention mechanisms
[5], RNN transducers [6], or transformer networks
[3]. These systems learn a direct mapping from an audio signal sequence to a sequence of text transcriptions. However, the input audio sequence often contains nuisance factors that are irrelevant to the recognition task, and the trained model can incorrectly learn to associate some of these factors with the target variables, which leads to overfitting. For example, besides linguistic content, speech data contains nuisance information about speaker identities, background noise, etc., which can hurt recognition performance if the distributions of these attributes are mismatched between training and testing.

A common method for combating the vulnerability of deep neural networks to nuisance factors is the incorporation of invariance induction during model training. For example, invariant deep models have achieved considerable success in computer vision
[7, 8, 9] and speech recognition [10, 11, 12, 13]. Serdyuk et al. [10] obtain noise-invariant representations by employing noise-condition annotations and the gradient reversal layer [14] for acoustic modeling. Similarly, Meng et al. [11] utilize speaker information to train a speaker-invariant model for senone prediction. Hsu et al. [12] extract domain-invariant features using a factorized hierarchical variational autoencoder. Liang et al.
[13] force their end-to-end ASR model to learn similar representations for clean input instances and their synthetically generated noisy counterparts.

While these methods work well at handling discrepancies between training and testing datasets for ASR systems, they require domain knowledge [12], supplementary nuisance information during training (e.g., speaker identities [11], recording environments [10], etc.), or pairwise data [13]. However, these requirements are difficult and expensive to fulfill in the real world, e.g., it is hard to enumerate all possible nuisance factors and collect the corresponding annotations.
In this work, we propose a new training scheme, namely NIESR, which adopts the unsupervised adversarial invariance learning framework (UAI) [7] for end-to-end speech recognition. Without incorporating supervised information of nuisances for the input signal features, the proposed method is capable of separating the underlying elements of speech data into two series of latent embeddings – one containing all the information that is essential for ASR, and the other containing information that is irrelevant to the recognition task (e.g., accents, background noises, etc.). Experimental results show that the proposed training method boosts end-to-end ASR performance on the WSJ0, CHiME3, and TIMIT datasets. We also show the effectiveness of combining NIESR with data augmentation.
2 Methodology
In this section, we present the proposed NIESR model for nuisance-invariant end-to-end speech recognition, where the invariance is achieved by adopting the UAI framework [7]. We begin by describing the base Seq2Seq ASR model. Subsequently, we introduce the UAI framework for unsupervised adversarial invariance induction. Finally, we present the complete design of the proposed NIESR model.
2.1 Base Sequence-to-Sequence Model
We are interested in learning a mapping from a sequence of acoustic spectral features $x = (x_1, \ldots, x_T)$ to a series of textual characters $y = (y_1, \ldots, y_S)$, given a dataset of such pairs, following the formulation of Chan et al. [5]. We employ a Seq2Seq model for this task, which estimates the probability of each character output $y_s$ by conditioning over the previous characters $y_{<s}$ and the input sequence $x$. Thus, the conditional probability of the entire output is:

$$P(y \mid x) = \prod_{s} P(y_s \mid y_{<s}, x) \quad (1)$$
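As a quick numerical illustration of Equation 1 (with hypothetical per-step probabilities, not values from any trained model), the sequence probability is the product of the per-step conditionals, which becomes a sum in log space:

```python
import numpy as np

# Hypothetical per-step conditionals P(y_s | y_<s, x) for a
# 4-character output (values invented for illustration).
step_probs = np.array([0.9, 0.7, 0.8, 0.6])

# Equation (1): the sequence probability is the product of the
# per-step conditionals; in log space this becomes a sum.
seq_prob = np.prod(step_probs)
log_seq_prob = np.sum(np.log(step_probs))
```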
A Seq2Seq model is composed of two modules: an encoder Enc and a decoder Dec. Enc transforms the input features $x$ into a high-level representation $h = (h_1, \ldots, h_L)$, i.e., $h = \mathrm{Enc}(x)$, and Dec infers the output sequence $y$ from $h$. We model Enc as a stack of Bidirectional Long Short-Term Memory (BLSTM) layers with interspersed projected-subsampling layers [15]. The subsampling layer projects a pair of consecutive input frames to a single lower-dimensional frame. We model Dec as an attention-based LSTM transducer [16], which employs $h$ to produce the output character sequence. At every time-step $s$, Dec generates a probability distribution of $y_s$ over characters, which is a function of a transducer state $d_s$ and an attention context $c_s$. We denote this function as CharDist, which is implemented as a single-layer perceptron with softmax activation:

$$P(y_s \mid y_{<s}, x) = \mathrm{CharDist}(d_s, c_s) \quad (2)$$

$$d_s = \mathrm{LSTM}(d_{s-1}, y_{s-1}, c_{s-1}) \quad (3)$$
In order to calculate the attention context $c_s$, we employ the hybrid location-aware content-based attention mechanism proposed by [17]. Specifically, the attention energy $e_{s,t}$ for frame $t$ at time-step $s$ takes the previous attention alignment $\alpha_{s-1}$ into account through the convolution operation:

$$f_s = F * \alpha_{s-1}, \qquad e_{s,t} = w^{\top} \tanh(W d_{s-1} + V h_t + U f_{s,t} + b) \quad (4)$$

where $w$, $W$, $V$, $U$, $F$, and $b$ are learned parameters and $*$ depicts the convolution operation. The attention alignment $\alpha_{s,t}$ and the attention context $c_s$ are then calculated as:

$$\alpha_{s,t} = \frac{\exp(e_{s,t})}{\sum_{t'} \exp(e_{s,t'})}, \qquad c_s = \sum_{t} \alpha_{s,t} h_t \quad (5)$$

The base model is trained by minimizing the cross-entropy loss:

$$\mathcal{L}_{pred} = -\sum_{s} \log P(y_s \mid y_{<s}, x) \quad (6)$$
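A minimal NumPy sketch of one location-aware attention step in the spirit of Equations 4 and 5; all dimensions and the randomly initialized parameters are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, d_s, d_att = 6, 8, 8, 10   # frames, encoder/decoder dims, attention dim
C, K = 4, 3                        # conv channels, conv kernel size

h = rng.normal(size=(T, d_h))      # encoder states h_t
d_prev = rng.normal(size=d_s)      # previous decoder state d_{s-1}
alpha_prev = np.full(T, 1.0 / T)   # previous alignment alpha_{s-1}

# Learned parameters (random stand-ins here)
W = rng.normal(size=(d_att, d_s))
V = rng.normal(size=(d_att, d_h))
U = rng.normal(size=(d_att, C))
F = rng.normal(size=(C, K))        # 1-D conv filters over the alignment
b = np.zeros(d_att)
w = rng.normal(size=d_att)

# Convolve the previous alignment (the location-aware term in Eq. 4)
pad = K // 2
alpha_pad = np.pad(alpha_prev, pad)
f = np.array([[F[c] @ alpha_pad[t:t + K] for c in range(C)] for t in range(T)])

# Attention energies, softmax alignment, and context vector (Eq. 5)
e = np.array([w @ np.tanh(W @ d_prev + V @ h[t] + U @ f[t] + b) for t in range(T)])
alpha = np.exp(e - e.max())
alpha /= alpha.sum()
context = alpha @ h
```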
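To make the cross-entropy objective of Equation 6 concrete, here is a toy computation over a hypothetical 4-character alphabet (all probabilities are invented for illustration):

```python
import numpy as np

# Hypothetical predicted distributions over a 4-character alphabet
# for a 3-step target sequence (values invented).
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.6, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
target = np.array([0, 1, 3])        # ground-truth character indices

# Equation (6): negative log-likelihood of the target characters
loss = -np.sum(np.log(probs[np.arange(len(target)), target]))
```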
2.2 Unsupervised Adversarial Invariance Induction
Deep neural networks (DNNs) often learn incorrect associations between nuisance factors in the raw data and the final target, leading to poor generalization [7]. In the case of ASR, the network can link accents, speaker-specific information, or background noise with the transcriptions, resulting in overfitting. In order to cope with this issue, we adopt the unsupervised adversarial invariance (UAI) [7] framework for learning invariant representations that eliminate factors irrelevant to the recognition task without requiring any knowledge of nuisance factors.
The working principle of UAI is to learn a split representation of data as two embeddings $e_1$ and $e_2$, where $e_1$ contains information relevant to the prediction task (here ASR) and $e_2$ holds all other information about the input data. The underlying mechanism for learning such a split representation is to induce competition between the main prediction task and an auxiliary task of data reconstruction. In order to achieve this, the framework uses $e_1$ for the prediction task and a noisy version of $e_1$ along with $e_2$ for reconstruction. In addition, a disentanglement constraint enforces that $e_1$ and $e_2$ contain independent information. The prediction task tries to pull relevant factors into $e_1$, while the reconstruction task drives $e_2$ to store all the information about the input data because the noisy $e_1$ is unreliable. However, the disentanglement constraint forces the two embeddings to not contain overlapping information, thus leading to competition. At convergence, this results in a nuisance-free $e_1$ that contains only those factors that are essential for the prediction task.
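The role of the dropout noise in this competition can be sketched as follows (a toy NumPy illustration with made-up embedding sizes; the real model applies dropout to a whole sequence of embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy split embeddings for a single frame (sizes are made up).
e1 = rng.normal(size=16)   # task-relevant factors
e2 = rng.normal(size=16)   # everything else (nuisances)

def dropout(v, rate=0.4):
    """Inverted dropout: zero out units, rescale the survivors."""
    mask = rng.random(v.shape) >= rate
    return np.where(mask, v / (1.0 - rate), 0.0)

# Dropout makes e1 an unreliable source for reconstruction, so the
# reconstructor is pushed to recover the input from e2 instead,
# which drags nuisance information into e2.
e1_tilde = dropout(e1)
recon_input = np.concatenate([e1_tilde, e2])  # fed to the reconstructor
```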
2.3 NIESR Model Design and Optimization
The NIESR model comprises five types of modules: (1) encoders Enc_1 and Enc_2 that map input data $x$ to the encodings $e_1$ and $e_2$, respectively, (2) a decoder Dec that infers the target $y$ from $e_1$, (3) a dropout layer $\psi$ that converts $e_1$ into its noisy version $\tilde{e}_1$, (4) a reconstructor R that reconstructs the input data from $[\tilde{e}_1 ; e_2]$, and (5) two adversarial disentanglers D_1 and D_2 that try to infer each embedding ($e_1$ or $e_2$) from the other. Figure 1 shows the complete NIESR model.
The encoder Enc_1 and decoder Dec follow the base model design as described in Section 2.1, i.e., an attention-based Seq2Seq model for the speech recognition task. Enc_2 is designed to have exactly the same structure as Enc_1. The dropout layer is introduced to make $\tilde{e}_1$ an unreliable source of information for reconstruction, which influences the reconstruction task to extract all information about $x$ into $e_2$ [7]. R is modeled as a stack of BLSTM layers interspersed with novel upsampling layers, which perform decompression by splitting the information in each time-frame into two frames. This is the inverse of the subsampling layers [15] used in Enc_1 and Enc_2. The upsampling operation is formulated as:
$$v_t = W_{up} h_t \quad (7)$$

$$[u_{2t-1} ; u_{2t}] = v_t \quad (8)$$

where $[\cdot\,;\cdot]$ represents concatenation, $u$ is the output, and $W_{up}$ is a learned projection matrix.
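A NumPy sketch of this upsampling under the projection-then-split reading of Equations 7 and 8 (dimensions and the random projection matrix are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_in, d_out = 5, 12, 8          # input frames, input dim, per-frame output dim

h = rng.normal(size=(T, d_in))     # frames entering the upsampling layer
W_up = rng.normal(size=(2 * d_out, d_in))  # learned projection (random here)

# Each frame is projected to twice the output width and split into
# two consecutive frames, doubling the time resolution (the inverse
# of the subsampling used in the encoders).
proj = h @ W_up.T                  # shape (T, 2 * d_out)
up = proj.reshape(2 * T, d_out)    # frame t -> two consecutive output frames
```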
The adversarial disentanglers D_1 and D_2 model the UAI disentanglement constraint discussed in Section 2.2, following previous works [7, 8, 9]. D_1 tries to predict $e_2$ from $e_1$ and D_2 tries to do the inverse. This is directly opposite to the desired independence between $e_1$ and $e_2$. Thus, training D_1 and D_2 adversarially against the rest of the model helps achieve the independence goal. Unlike previous works [7, 8, 9], the encodings $e_1$ and $e_2$ in this work are vector-sequences instead of single vectors: $e_1 = (e_{1,1}, \ldots, e_{1,L})$ and $e_2 = (e_{2,1}, \ldots, e_{2,L})$. Naïve instantiations of the disentanglers would perform frame-specific predictions of $e_{2,t}$ from $e_{1,t}$ and vice versa. However, each pair of $e_{1,t}$ and $e_{2,t}$ generated at time-step $t$ contains information not only from frame $t$ but also from other frames across the time-span. This is because Enc_1 and Enc_2 are modeled as RNNs. Therefore, a better method to perform disentanglement for sequential representations is to use the whole series of $e_1$ or $e_2$ to estimate every element of the other. Hence, we model D_1 and D_2 as BLSTMs.

The proposed NIESR model is optimized by adopting the UAI training strategy [7, 9], i.e., playing an adversarial game where we treat Enc_1, Enc_2, Dec, and R as one player M_1, and D_1 and D_2 as the other player M_2. The model is trained using a scheduled update scheme where we freeze the weights of one player model while we update the weights of the other. The training objective comprises three tasks: (1) predicting transcriptions from the input signal, (2) reconstruction of the input, and (3) adversarial prediction of each of $e_1$ and $e_2$ from the other. The objective of the first task is written as Equation 6. The goal for the reconstruction task is to minimize the mean squared error (MSE) between $x$ and the reconstructed $\hat{x}$:
$$\mathcal{L}_{rec} = \lVert x - R([\psi(e_1) ; e_2]) \rVert_2^2 \quad (9)$$

where $\psi$ denotes the dropout operation. The training objective for the disentanglers is to minimize the MSE between the embeddings predicted by the disentanglers and the embeddings generated by the encoders. That of the encoders, however, is to generate $e_1$ and $e_2$ that are not predictive of each other. Hence, in the scheduled update scheme, the targets $z_1$ and $z_2$ for the disentanglers are different when updating the player model M_1 versus M_2, following [9]. The loss can be written as:
$$\mathcal{L}_{D_1} = \lVert z_2 - D_1(e_1) \rVert_2^2 \quad (10)$$

$$\mathcal{L}_{D_2} = \lVert z_1 - D_2(e_2) \rVert_2^2 \quad (11)$$

where $z_1$ and $z_2$ are set as $e_1$ and $e_2$, respectively, when updating M_2, but are set to random vectors when updating M_1.
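The target-switching scheme can be sketched as follows (toy NumPy arrays stand in for the disentangler outputs and embeddings; no actual networks are involved):

```python
import numpy as np

rng = np.random.default_rng(3)
L, d = 4, 8                          # toy sequence length and embedding dim
e1 = rng.normal(size=(L, d))         # task-relevant embedding sequence
e2 = rng.normal(size=(L, d))         # nuisance embedding sequence

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Stand-ins for the disentangler outputs D1(e1) and D2(e2)
pred_e2_from_e1 = rng.normal(size=(L, d))
pred_e1_from_e2 = rng.normal(size=(L, d))

# Updating M2 (the disentanglers): targets are the true opposite embeddings
loss_M2 = mse(pred_e2_from_e1, e2) + mse(pred_e1_from_e2, e1)

# Updating M1 (encoders/decoder/reconstructor): targets are random vectors,
# which rewards embeddings that are mutually unpredictable
z1, z2 = rng.normal(size=(L, d)), rng.normal(size=(L, d))
loss_M1 = mse(pred_e2_from_e1, z2) + mse(pred_e1_from_e2, z1)
```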
Overall, the model is trained through backpropagation by optimizing the objective described in Equation 12, where the loss-weights $\alpha$, $\beta$, and $\gamma$ are hyperparameters, decided by the performance on the development set.

$$\mathcal{L} = \alpha \mathcal{L}_{pred} + \beta \mathcal{L}_{rec} + \gamma \left( \mathcal{L}_{D_1} + \mathcal{L}_{D_2} \right) \quad (12)$$
Inference with NIESR involves a forward pass of data through Enc_1 followed by Dec. Hence, the usage and computational cost of NIESR at inference is the same as that of the base model.
3 Experiments
The effectiveness of NIESR is quantified through the performance improvement achieved by adopting the invariant learning framework. We provide experimental results on speech recognition on three benchmark datasets: the Wall Street Journal Corpus (WSJ0) [18], CHiME3 [19], and TIMIT [20]. We additionally provide results on the combined WSJ0+CHiME3 dataset.
3.1 Datasets
WSJ0: This dataset is a collection of readings of the Wall Street Journal. It contains 7,138 utterances in the training set, 410 in the development set, and 330 in the test set. We use 40-dimensional log Mel filterbank features as the model input, and normalize the transcriptions to capitalized character sequences.
CHiME3: The CHiME3 dataset contains: (1) WSJ0 sentences spoken in challenging noisy environments (real data) and (2) WSJ0 readings mixed with four different background noises (simulated data). The real speech data was recorded in five noisy environments using a six-channel tablet-based microphone array. Training data consists of 1,999 real noisy utterances from four speakers, and 7,138 simulated noisy utterances from 83 speakers in the WSJ0 training set. In total, there are 3,280 utterances in the development set, and 2,640 utterances in the test set, containing both real and simulated data. The speakers in the training, development, and test sets are mutually different. In our experiments, we follow [11] in using far-field speech from the fifth microphone channel for all sets. We adopt the same input-output setting for CHiME3 as for WSJ0.
TIMIT: This corpus contains a total of 6,300 sentences, with 10 sentences spoken by each of 630 speakers drawn from 8 different dialect groups. Among them, utterances from 168 different speakers are held out as the test set. We further select sentences from 4 speakers of each dialect group, i.e., 32 speakers in total, from the remaining data to form the development set. Thus, all speakers in the training, development, and test sets are different. Models were trained on 80-dimensional log Mel filterbank features, and capitalized character sequences were treated as targets.
3.2 Experiment Setup
We train the base model without invariance induction, i.e., the model consisting of Enc and Dec (Section 2.1), as a baseline. We feed the whole sequence of spectral features to Enc and get the predicted character sequence from Dec. We use a stack of two BLSTMs with a subsampling layer (as described in Section 2.1) in between for Enc. Dec is implemented as a single-layer LSTM combined with the attention modules introduced in Section 2.1. All models were trained with early stopping with a patience of 30 epochs, and the best model is selected based on the performance on the development set. Other model and training hyperparameters are listed in Table 1.

Table 1: Hyperparameters of the base model.

Item  Setting
Enc and Dec LSTM Dimension  200
Subsampling Projected Dimension  200
Attention Dimension  200
Attention Convolution Channels  10
Attention Convolution Kernel Size  100
Optimizer  Adam
Learning Rate  5e-4
Table 2: Additional hyperparameters for NIESR.

Item  Setting
LSTM Dimension  300
Upsampling Projected Dimension  200
$e_1$, $e_2$ Dimension  200
Dropout Layer Rate  0.4
Optimizer  Adam
Learning Rate for M_1  5e-4
Learning Rate for M_2  1e-3
$\alpha$, $\beta$, $\gamma$ for WSJ0  100, 10, 1
$\alpha$, $\beta$, $\gamma$ for CHiME3  100, 1, 0.5
$\alpha$, $\beta$, $\gamma$ for TIMIT  100, 50, 1
We augment the base model with Enc_2, $\psi$, R, D_1, and D_2, while treating the base encoder and decoder as Enc_1 and Dec, to form the NIESR model. Enc_2 has the same hyperparameter setting and structure as Enc_1. R is modeled as a cascade of a BLSTM layer, an upsampling layer, and another BLSTM layer. D_1 and D_2 are implemented as BLSTMs followed by two fully-connected layers. We update the player models M_1 and M_2 with a scheduled frequency ratio in our experiments. Hyperparameters for Enc_1 and Dec are the same as for the base model. Additional hyperparameters for NIESR are summarized in Table 2.
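The scheduled update scheme can be sketched as a simple alternation between the two players (a hypothetical `Player` interface and a 1:1 ratio are assumed here purely for illustration; the actual update ratio is a tuned hyperparameter):

```python
class Player:
    """Toy stand-in for a player model (hypothetical interface)."""
    def __init__(self, name):
        self.name = name
        self.updates = 0

    def step(self, batch):
        # A real player would compute its losses on `batch` and
        # backpropagate while the other player's weights stay frozen.
        self.updates += 1

def train_epoch(batches, M1, M2, ratio=(1, 1)):
    """Scheduled updates: within each window of n1 + n2 batches,
    the first n1 update M1 and the remaining n2 update M2."""
    n1, n2 = ratio
    for i, batch in enumerate(batches):
        (M1 if i % (n1 + n2) < n1 else M2).step(batch)

M1, M2 = Player("M1"), Player("M2")
train_epoch(range(10), M1, M2)
```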
We further provide results of a stronger baseline model that utilizes labeled nuisances (speakers for WSJ0; speakers and noise environment condition for CHiME3; speakers and dialect groups for TIMIT) with the gradient reversal layer (GRL) [14] to learn invariant representations. Specifically, the model consists of Enc, Dec, and a classifier with a GRL between the embedding learned from Enc and the classifier, following the standard setup in [14]. The target for the classifier is to predict the nuisance label from the embedding, while the direction of the training gradient to Enc is flipped. We denote this model as SpkInv for speaker-invariance, EnvInv for environment-invariance in CHiME3, and DialInv for dialect-invariance in TIMIT.

Table 3: CER (%) on the WSJ0, CHiME3, and TIMIT test sets; numbers in parentheses are relative improvements (%) over the base model.

Model  WSJ0  CHiME3  TIMIT
Base  12.95  44.61  28.76
SpkInv  12.31 (4.94)  43.93 (1.52)  28.45 (1.08)
EnvInv  –  42.61 (4.48)  –
DialInv  –  –  28.29 (1.63)
NIESR  12.24 (5.48)  41.86 (6.16)  26.86 (6.61)
3.3 ASR Performance on Benchmark Datasets
Table 3 summarizes the results of end-to-end ASR on the WSJ0, CHiME3, and TIMIT datasets. NIESR achieves 5.48%, 6.16%, and 6.61% relative improvements over the base model on WSJ0, CHiME3, and TIMIT, respectively, and demonstrates the best CER among all methods.
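These relative improvements can be verified directly from the CER numbers in Table 3:

```python
# CER of the base model vs. NIESR (from Table 3) and the reported
# relative improvements (%).
results = [("WSJ0", 12.95, 12.24, 5.48),
           ("CHiME3", 44.61, 41.86, 6.16),
           ("TIMIT", 28.76, 26.86, 6.61)]

for name, base, niesr, reported in results:
    rel = 100.0 * (base - niesr) / base
    assert abs(rel - reported) < 0.01, name
```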
3.4 Invariance to Nuisance Factors
In order to examine whether a latent embedding is invariant to a nuisance factor $z$, we calculate the accuracy of predicting the factor from the encoding. Specifically, this is calculated by training classification networks (a BLSTM followed by two fully-connected layers) to predict $z$ from the generated embeddings. Table 4 presents the results of this experiment, showing that the embedding $e_1$ of the NIESR model, which is used for ASR, contains less nuisance information than the encodings of the base, SpkInv, and EnvInv models. In contrast, the embedding $e_2$ of NIESR contains most of the nuisance information, showing that nuisance factors migrate to this embedding, as expected.
Table 4: Accuracy (%) of predicting nuisance factors $z$ from the learned embeddings.

Dataset  Predict from  Accuracy ($z$: Speaker)  Accuracy ($z$: Env)
WSJ0  $h$ in Base Model  67.91  –
WSJ0  $h$ in SpkInv  65.60  –
WSJ0  $e_1$ in NIESR  63.35  –
WSJ0  $e_2$ in NIESR  97.92  –
CHiME3  $h$ in Base Model  38.52  69.24
CHiME3  $h$ in SpkInv  37.91  69.11
CHiME3  $h$ in EnvInv  38.84  66.44
CHiME3  $e_1$ in NIESR  35.87  63.45
CHiME3  $e_2$ in NIESR  92.28  97.05
3.5 Additional Robustness through Data Augmentation
Training with additional data that reflects multiple variations of nuisance factors helps models generalize better. In this experiment, we treat the CHiME3 dataset, which contains WSJ0 recordings with four different types of noise, as a noisy augmentation of WSJ0. We train the base model and NIESR on the augmented dataset, i.e., WSJ0+CHiME3, and test on the original CHiME3 and WSJ0 test sets separately. Table 5 summarizes the results of this experiment, showing that training with data augmentation provides improvements on both the CHiME3 and WSJ0 test sets compared to the results in Table 3. It is important to note that the NIESR model trained on the augmented dataset achieves a 14.44% relative improvement on WSJ0 over the base model trained on the same data. This is because data augmentation provides additional information about potential nuisance factors to the NIESR model and, consequently, helps it ignore these factors for the ASR task, even though pairwise data is not provided to the model as in [13]. Hence, the results show that the NIESR model can be easily combined with data augmentation to further enhance the robustness and nuisance-invariance of the learned features.
Table 5: CER (%) when training on WSJ0+CHiME3; numbers in parentheses are relative improvements (%) over the base model.

Model  WSJ0  CHiME3
Base  9.35  41.55
SpkInv  8.62 (7.81)  40.77 (1.88)
EnvInv  9.17 (1.93)  40.27 (3.08)
NIESR  8.00 (14.44)  38.35 (7.70)
4 Conclusion
We presented NIESR, an end-to-end speech recognition model that adopts the unsupervised adversarial invariance framework to achieve invariance to nuisances without requiring any knowledge of potential nuisance factors. The model works by learning a split representation of data through competition between the recognition task and an auxiliary data reconstruction task. Results of the experimental evaluation demonstrate that the proposed model achieves significant boosts in ASR performance.
5 Acknowledgements
This material is based on research sponsored by DARPA under agreement number FA8750-18-2-0014. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.
References
 [1] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, "A comparison of sequence-to-sequence models for speech recognition." 2017.
 [2] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., "State-of-the-art speech recognition with sequence-to-sequence models," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
 [3] S. Zhou, L. Dong, S. Xu, and B. Xu, "Syllable-based sequence-to-sequence speech recognition with the transformer in Mandarin Chinese," arXiv preprint arXiv:1804.10752, 2018.
 [4] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio, "An online sequence-to-sequence model using partial conditioning," in Advances in Neural Information Processing Systems, 2016, pp. 5067–5075.
 [5] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
 [6] K. Rao, H. Sak, and R. Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer," in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199.
 [7] A. Jaiswal, R. Y. Wu, W. AbdAlmageed, and P. Natarajan, “Unsupervised Adversarial Invariance,” in Advances in Neural Information Processing Systems, 2018, pp. 5097–5107.
 [8] A. Jaiswal, S. Xia, I. Masi, and W. AbdAlmageed, “RoPAD: Robust Presentation Attack Detection through Unsupervised Adversarial Invariance,” in 12th IAPR International Conference on Biometrics (ICB), 2019.
 [9] A. Jaiswal, Y. Wu, W. AbdAlmageed, and P. Natarajan, “Unified adversarial invariance,” 2019.
 [10] D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and Y. Bengio, “Invariant representations for noisy speech recognition,” arXiv preprint arXiv:1612.01928, 2016.
 [11] Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gang, and B.-H. Juang, "Speaker-invariant training via adversarial learning," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5969–5973.

 [12] W.-N. Hsu and J. Glass, "Extracting domain invariant features by unsupervised learning for robust automatic speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5614–5618.
 [13] D. Liang, Z. Huang, and Z. C. Lipton, "Learning noise-invariant representations for robust speech recognition," arXiv preprint arXiv:1807.06610, 2018.
 [14] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
 [15] Y. Zhang, W. Chan, and N. Jaitly, “Very deep convolutional networks for endtoend speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4845–4849.
 [16] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
 [17] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
 [18] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357–362.
 [19] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 504–511.
 [20] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," 1993.