With the aid of recent advances in neural networks, end-to-end deep learning systems for automatic speech recognition (ASR) have gained popularity and achieved extraordinary performance on a variety of benchmarks [1, 2, 3, 4]. End-to-end ASR models typically consist of recurrent neural networks (RNNs) with sequence-to-sequence (Seq2Seq) architectures and attention mechanisms, RNN transducers, or Transformer networks. These systems learn a direct mapping from a sequence of audio features to a sequence of text transcriptions. However, the input audio often contains nuisance factors that are irrelevant to the recognition task, and the trained model can incorrectly learn to associate some of these factors with the target variables, which leads to overfitting. For example, besides linguistic content, speech data carries nuisance information such as speaker identity and background noise, which can hurt recognition performance if the distributions of these attributes are mismatched between training and testing.
A common method for combating the vulnerability of deep neural networks to nuisance factors is the incorporation of invariance induction during model training. For example, invariant deep models have achieved considerable success in computer vision [7, 8, 9] and speech recognition [10, 11, 12, 13]. Serdyuk et al. obtain noise-invariant representations by employing noise-condition annotations and the gradient reversal layer for acoustic modeling. Similarly, Meng et al. utilize speaker information to train a speaker-invariant model for senone prediction. Hsu et al. extract domain-invariant features using a factorized hierarchical variational autoencoder. Liang et al. force their end-to-end ASR model to learn similar representations for clean input instances and their synthetically generated noisy counterparts.
While these methods work well at handling discrepancies between training and testing datasets for ASR systems, they require domain knowledge, supplementary nuisance annotations during training (e.g., speaker identities, recording environments, etc.), or pairwise data. These requirements are difficult and expensive to fulfill in the real world; e.g., it is hard to enumerate all possible nuisance factors and collect corresponding annotations.
In this work, we propose a new training scheme, NIESR, which adopts the unsupervised adversarial invariance (UAI) learning framework for end-to-end speech recognition. Without incorporating any supervised nuisance information about the input signal features, the proposed method separates the underlying factors of speech data into two sequences of latent embeddings: one containing all the information essential for ASR, and the other containing information irrelevant to the recognition task (e.g., accents, background noise, etc.). Experimental results show that the proposed training method boosts end-to-end ASR performance on the WSJ0, CHiME3, and TIMIT datasets. We also show the effectiveness of combining NIESR with data augmentation.
In this section, we present the proposed NIESR model for nuisance-invariant end-to-end speech recognition, where invariance is achieved by adopting the UAI framework. We begin by describing the base Seq2Seq ASR model. Subsequently, we introduce the UAI framework for unsupervised adversarial invariance induction. Finally, we present the complete design of the proposed NIESR model.
2.1 Base Sequence-to-sequence Model
We are interested in learning a mapping from a sequence of acoustic feature frames x to a sequence of textual characters y, given a dataset of paired utterances and transcriptions, following the formulation of Chan et al. by conditioning each output character on the previous characters and the input sequence. Thus, the conditional probability of the entire output is P(y | x) = ∏_i P(y_i | y_{<i}, x).
A Seq2Seq model is composed of two modules: an encoder and a decoder. The encoder transforms the input features into a high-level representation, and the decoder infers the output sequence from that representation. We model the encoder as a stack of Bidirectional Long Short-Term Memory (BLSTM) layers with interspersed projected-subsampling layers. Each subsampling layer projects a pair of consecutive input frames to a single lower-dimensional frame. We model the decoder as an attention-based LSTM transducer, which employs the encoder representation to produce the output character sequence. At every time step, the decoder generates a probability distribution over characters as a function of the transducer state and an attention context. We denote this function as CharDist, implemented as a single-layer perceptron with softmax activation over the character vocabulary.
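The projected-subsampling step described above can be sketched as follows. This is a minimal NumPy illustration; the matrix `W` and the shapes are hypothetical placeholders, not the paper's actual parameters:

```python
import numpy as np

def subsample(frames, W):
    """Projected subsampling: concatenate each pair of consecutive
    frames and project the pair to a single lower-dimensional frame.
    frames: (T, d) with T even; W: (2*d, d_out); returns (T//2, d_out)."""
    T, d = frames.shape
    pairs = frames.reshape(T // 2, 2 * d)  # stack consecutive frame pairs
    return pairs @ W

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 40))   # 8 frames of 40-dim filterbank features
W = rng.normal(size=(80, 20))  # learned projection (random here)
y = subsample(x, W)            # halves the time axis: shape (4, 20)
```

Each subsampling layer thus halves the sequence length, which shortens the representation the attention mechanism must attend over.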
In order to calculate the attention context, we employ the hybrid location-aware, content-based attention mechanism proposed by Chorowski et al. Specifically, the attention energy for each encoder frame at the current time-step takes the previous attention alignment into account through a convolution operation: location features are computed by convolving the previous alignment with a set of learned filters, and the energy for a frame is a learned linear projection of a tanh applied to the sum of the (linearly transformed) transducer state, encoder frame, and location features, plus a bias; all of these projections are learned parameters. The attention alignment is then obtained by a softmax over the energies, and the attention context is the alignment-weighted sum of the encoder frames.
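A compact NumPy sketch of one step of this hybrid attention may help make the computation concrete. All dimensions and parameter values here are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: U_frames encoder frames of dim d_h, decoder state
# dim d_s, attention dim d_a, C conv channels, kernel size K.
U_frames, d_h, d_s, d_a, C, K = 6, 8, 8, 10, 4, 5

h = rng.normal(size=(U_frames, d_h))        # encoder representation
s = rng.normal(size=d_s)                    # current transducer state
a_prev = np.full(U_frames, 1.0 / U_frames)  # previous attention alignment

# Learned parameters (randomly initialized for the sketch)
W = rng.normal(size=(d_a, d_s))
V = rng.normal(size=(d_a, d_h))
Uc = rng.normal(size=(d_a, C))
b = np.zeros(d_a)
w = rng.normal(size=d_a)
F = rng.normal(size=(C, K))                 # location-feature filters

# Location features: convolve the previous alignment with each filter
f = np.stack([np.convolve(a_prev, F[c], mode="same") for c in range(C)],
             axis=1)                        # (U_frames, C)

# Hybrid content + location energies, softmax alignment, and context
e = np.array([w @ np.tanh(W @ s + V @ h[u] + Uc @ f[u] + b)
              for u in range(U_frames)])
a = np.exp(e - e.max()); a /= a.sum()       # attention alignment
c = a @ h                                   # attention context, shape (d_h,)
```

The convolution over the previous alignment is what makes the mechanism "location-aware": it lets the model bias the next alignment toward positions near the previous one.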
The base model is trained by minimizing the cross-entropy loss between the predicted character distributions and the ground-truth transcriptions.
2.2 Unsupervised Adversarial Invariance Induction
Deep neural networks (DNNs) often learn incorrect associations between nuisance factors in the raw data and the final target, leading to poor generalization. In the case of ASR, the network can link accents, speaker-specific information, or background noise with the transcriptions, resulting in overfitting. In order to cope with this issue, we adopt the unsupervised adversarial invariance (UAI) framework for learning invariant representations that eliminate factors irrelevant to the recognition task, without requiring any knowledge of the nuisance factors.
The working principle of UAI is to learn a split representation of data as two embeddings, where the first contains information relevant to the prediction task (here ASR) and the second holds all other information about the input. The underlying mechanism for learning such a split representation is to induce competition between the main prediction task and an auxiliary task of data reconstruction. To achieve this, the framework uses the first embedding for the prediction task, and a noisy version of the first embedding along with the second embedding for reconstruction. In addition, a disentanglement constraint enforces that the two embeddings contain independent information. The prediction task tries to pull task-relevant factors into the first embedding, while the reconstruction task drives the second embedding to store all the information about the input, because the noisy copy of the first is unreliable. However, the disentanglement constraint forces the two embeddings to not contain overlapping information, thus leading to competition. At convergence, this results in a nuisance-free first embedding that contains only those factors that are essential for the prediction task.
2.3 NIESR Model Design and Optimization
The NIESR model comprises five types of modules: (1) two encoders that map the input data to the task-relevant encoding and the residual (nuisance) encoding, respectively, (2) a decoder that infers the target transcription from the task-relevant encoding, (3) a dropout layer that converts the task-relevant encoding into a noisy version of itself, (4) a reconstructor that reconstructs the input data from the noisy task-relevant encoding and the residual encoding, and (5) two adversarial disentanglers that each try to infer one encoding from the other. Figure 1 shows the complete NIESR model.
The encoder and decoder follow the base model design described in Section 2.1, i.e., an attention-based Seq2Seq model for the speech recognition task. The second encoder is designed to have exactly the same structure as the first. The dropout layer is introduced to make the task-relevant encoding an unreliable source of information for reconstruction, which pushes the reconstruction task to extract all information about the input into the residual encoding. The reconstructor is modeled as a stack of BLSTM layers interspersed with novel upsampling layers, which perform decompression by splitting the information in each time-frame into two frames. This is the inverse of the subsampling layers used in the two encoders: each frame is multiplied by a learned projection matrix, and the result is split into a pair of consecutive output frames whose concatenation equals the projected frame.
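The upsampling operation can be sketched in NumPy as follows; the matrix `U` and the shapes are hypothetical, chosen only to illustrate the inverse relationship with projected subsampling:

```python
import numpy as np

def upsample(frames, U):
    """Upsampling: project each frame to twice the target width, then
    split the result into two consecutive output frames -- the inverse
    of projected subsampling.
    frames: (T, d); U: (d, 2*d_out); returns (2*T, d_out)."""
    proj = frames @ U
    T, two_d = proj.shape
    return proj.reshape(2 * T, two_d // 2)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 20))   # 4 compressed frames
U = rng.normal(size=(20, 80))  # learned projection (random here)
x_hat = upsample(z, U)         # doubles the time axis: shape (8, 40)
```

Each upsampling layer thus doubles the sequence length, restoring the time resolution that the encoder's subsampling layers removed, so the reconstructor's output can match the input feature sequence.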
The adversarial disentanglers model the UAI disentanglement constraint discussed in Section 2.2, following previous works [7, 8, 9]. One disentangler tries to predict the residual encoding from the task-relevant encoding, and the other tries to do the inverse. This is directly opposite to the desired independence between the two encodings; thus, training the disentanglers adversarially against the rest of the model helps achieve the independence goal. Unlike previous works [7, 8, 9], the encodings in this work are vector-sequences instead of single vectors. Naïve instantiations of the disentanglers would perform frame-specific predictions of one encoding from the other. However, the pair of encoding frames generated at any time-step contains information not only from the corresponding input frame but also from other frames across the time-span, because the encoders are modeled as RNNs. Therefore, a better method of disentanglement for sequential representations is to use the whole sequence of one encoding to estimate every element of the other. Hence, we model both disentanglers as BLSTMs.
The proposed NIESR model is optimized by adopting the UAI training strategy [7, 9], i.e., playing a game between two players: the two encoders, the decoder, and the reconstructor form one player, while the two disentanglers form the other. The model is trained with a scheduled update scheme in which we freeze the weights of one player while updating the weights of the other. The training objective comprises three tasks: (1) predicting transcriptions from the input signal, (2) reconstruction of the input, and (3) adversarial prediction of each encoding from the other. The objective of the first task is the cross-entropy loss of Equation 6. The goal of the reconstruction task is to minimize the mean squared error (MSE) between the input and its reconstruction from the residual encoding together with the dropout-corrupted task-relevant encoding.
The training objective of the disentanglers is to minimize the MSE between the encodings predicted by the disentanglers and the encodings generated by the encoders. In contrast, the objective of the encoders is to generate encodings that are not predictive of each other. Hence, in the scheduled update scheme, the disentanglers' regression targets differ depending on which player is being updated: they are set to the actual encodings produced by the encoders when updating the disentangler player, but to random vectors when updating the encoder player.
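The target-switching rule of this adversarial game can be sketched as follows. The names `z1` and `z2` for the task-relevant and residual encodings are our own notation for the sketch, not symbols from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def disentangler_targets(z1, z2, updating="disentanglers"):
    """Return the regression targets for the two disentanglers:
    one predicts z2 from z1, the other predicts z1 from z2.
    When the encoder player is being updated, the targets are random
    vectors, so minimizing the MSE pushes the encoders toward
    embeddings that are mutually unpredictable."""
    if updating == "disentanglers":
        return z2, z1
    return rng.normal(size=z2.shape), rng.normal(size=z1.shape)

z1 = rng.normal(size=(6, 16))  # sequence of task-relevant frames
z2 = rng.normal(size=(6, 16))  # sequence of residual frames
t1, t2 = disentangler_targets(z1, z2, updating="encoders")
```

Using random targets when the encoder player is updated means the encoders are rewarded for making the disentanglers' best predictions no better than noise, which is exactly the independence constraint.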
Overall, the model is trained through backpropagation by optimizing the combined objective described in Equation 12, where the loss-weights are hyperparameters selected based on performance on the development set.
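As a sketch, the combined objective is a weighted sum of the three losses. The weight values below reuse the WSJ0 setting from the hyperparameter table, but the assignment of individual weights to individual losses is our assumption for illustration:

```python
def niesr_objective(l_asr, l_recon, l_dis,
                    w_asr=100.0, w_recon=10.0, w_dis=1.0):
    """Weighted sum of the cross-entropy ASR loss, the reconstruction
    MSE, and the disentanglement MSE (weights tuned on the dev set;
    the weight-to-loss mapping here is illustrative)."""
    return w_asr * l_asr + w_recon * l_recon + w_dis * l_dis

total = niesr_objective(0.5, 0.2, 0.1)  # example loss values -> 52.1
```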
Inference with NIESR involves a forward pass of the data through the task-relevant encoder followed by the decoder. Hence, the usage and computational cost of NIESR at inference time are the same as those of the base model.
The effectiveness of NIESR is quantified through the performance improvement achieved by adopting the invariant learning framework. We provide experimental results for speech recognition on three benchmark datasets: the Wall Street Journal corpus (WSJ0), CHiME3, and TIMIT. We additionally provide results on the combined WSJ0+CHiME3 dataset.
WSJ0: This dataset is a collection of readings of the Wall Street Journal. It contains 7,138 utterances in the training set, 410 in the development set, and 330 in the test set. We use 40-dimensional log Mel filterbank features as the model input, and normalize the transcriptions to capitalized character sequences.
CHiME3: The CHiME3 dataset contains (1) WSJ0 sentences spoken in challenging noisy environments (real data) and (2) WSJ0 readings mixed with four different background noises (simulated data). The real speech data was recorded in five noisy environments using a six-channel tablet-based microphone array. The training data consists of 1,999 real noisy utterances from four speakers and 7,138 simulated noisy utterances from the 83 speakers in the WSJ0 training set. In total, there are 3,280 utterances in the development set and 2,640 utterances in the test set, containing both real and simulated data. The speakers in the training, development, and test sets are mutually disjoint. In our experiments, we follow the standard practice of using far-field speech from the fifth microphone channel for all sets. We adopt the same input-output setting for CHiME3 as for WSJ0.
TIMIT: This corpus contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers drawn from 8 different dialect groups. Utterances from 168 of these speakers are held out as the test set. We further select sentences from 4 speakers of each dialect group, i.e., 32 speakers in total, from the remaining data to form the development set. Thus, all speakers in the training, development, and test sets are different. Models were trained on 80-dimensional log Mel filterbank features, with capitalized character sequences as targets.
3.2 Experiment Setup
We train the base model without invariance induction, i.e., the encoder-decoder model of Section 2.1, as a baseline. We feed the whole sequence of spectral features to the encoder and obtain the predicted character sequence from the decoder. The encoder is a stack of two BLSTMs with a subsampling layer (as described in Section 2.1) in between. The decoder is implemented as a single-layer LSTM combined with the attention modules introduced in Section 2.1. All models were trained with early stopping with a patience of 30 epochs, and the best model was selected based on performance on the development set. Other model and training hyperparameters are listed in Table 1.
| Hyperparameter | Value |
| --- | --- |
| Encoder and decoder LSTM dimension | 200 |
| Subsampling projected dimension | 200 |
| Attention convolution channels | 10 |
| Attention convolution kernel size | 100 |
| Upsampling projected dimension | 200 |
| Dropout layer rate | 0.4 |
| Learning rate (one player) | 5e-4 |
| Learning rate (other player) | 1e-3 |
| Loss weights for WSJ0 | 100, 10, 1 |
| Loss weights for CHiME3 | 100, 1, 0.5 |
| Loss weights for TIMIT | 100, 50, 1 |
We form the NIESR model by augmenting the base model with the second encoder, the dropout layer, the reconstructor, and the two disentanglers, treating the base encoder as the task-relevant encoder. The second encoder has the same hyperparameter settings and structure as the first. The reconstructor is modeled as a cascade of a BLSTM layer, an upsampling layer, and another BLSTM layer. The disentanglers are implemented as BLSTMs followed by two fully-connected layers. We update the two player models at a fixed frequency ratio in our experiments. Hyperparameters for the encoder and decoder are the same as for the base model; additional hyperparameters for NIESR are summarized in Table 2.
We further provide results for a stronger baseline model that utilizes labeled nuisances (speakers for WSJ0; speakers and noise-environment condition for CHiME3; speakers and dialect groups for TIMIT) with the gradient reversal layer (GRL) to learn invariant representations. Specifically, the model consists of the base encoder, the base decoder, and a classifier, with a GRL between the learned embedding and the classifier, following the standard setup. The classifier is trained to predict the nuisance label from the embedding, while the GRL flips the direction of the gradient flowing back to the encoder. We denote this model as Spk-Inv for speaker-invariance, Env-Inv for environment-invariance in CHiME3, and Dial-Inv for dialect-invariance in TIMIT.
CER (%) with relative improvement over the base model in parentheses:

| Model | WSJ0 | CHiME3 | TIMIT |
| --- | --- | --- | --- |
| Spk-Inv | 12.31 (4.94) | 43.93 (1.52) | 28.45 (1.08) |
| NIESR | 12.24 (5.48) | 41.86 (6.16) | 26.86 (6.61) |
3.3 ASR Performance on Benchmark Datasets
Table 3 summarizes the end-to-end ASR results on the WSJ0, CHiME3, and TIMIT datasets. NIESR achieves 5.48%, 6.16%, and 6.61% relative improvements over the base model on WSJ0, CHiME3, and TIMIT, respectively, and demonstrates the best CER among all methods.
3.4 Invariance to Nuisance Factors
In order to examine whether a latent embedding is invariant to a nuisance factor, we calculate the accuracy of predicting the factor from the encoding. Specifically, we train classification networks (a BLSTM followed by two fully-connected layers) to predict the nuisance label from the generated embeddings. Table 4 presents the results of this experiment, showing that the task-relevant embedding of the NIESR model, which is used for ASR, contains less nuisance information than the encodings of the base, Spk-Inv, and Env-Inv models. In contrast, the residual embedding of NIESR contains most of the nuisance information, showing that nuisance factors migrate to this embedding, as expected.
| Dataset | Embedding | Speaker acc. (%) | Environment acc. (%) |
| --- | --- | --- | --- |
| WSJ0 | Base model encoding | 67.91 | – |
| CHiME3 | Base model encoding | 38.52 | 69.24 |
3.5 Additional Robustness through Data Augmentation
Training with additional data that reflects multiple variations of nuisance factors helps models generalize better. In this experiment, we treat the CHiME3 dataset, which contains WSJ0 recordings with four different types of noise, as a noisy augmentation of WSJ0. We train the base model and NIESR on the augmented dataset, i.e., WSJ0+CHiME3, and test on the original CHiME3 and WSJ0 test sets separately. Table 5 summarizes the results of this experiment, showing that training with data augmentation improves performance on both the CHiME3 and WSJ0 test sets compared to the results in Table 3. It is important to note that the NIESR model trained on the augmented dataset achieves a 14.44% relative improvement on WSJ0 over the base model trained on the same data. This is because data augmentation provides additional information about potential nuisance factors to the NIESR model and, consequently, helps it ignore these factors for the ASR task, even though pairwise clean-noisy data is never explicitly provided to the model. Hence, the results show that NIESR can easily be combined with data augmentation to further enhance the robustness and nuisance-invariance of the learned features.
| Model | WSJ0 | CHiME3 |
| --- | --- | --- |
| Spk-Inv | 8.62 (7.81) | 40.77 (1.88) |
| Env-Inv | 9.17 (1.93) | 40.27 (3.08) |
| NIESR | 8.00 (14.44) | 38.35 (7.7) |
We presented NIESR, an end-to-end speech recognition model that adopts the unsupervised adversarial invariance framework for invariance to nuisances without requiring any knowledge of potential nuisance factors. The model works by learning a split representation of data through competition between the recognition and an auxiliary data reconstruction task. Results of experimental evaluation demonstrate that the proposed model achieves significant boosts in performance on ASR.
This material is based on research sponsored by DARPA under agreement number FA8750-18-2-0014. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.
-  R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, “A comparison of sequence-to-sequence models for speech recognition,” in Interspeech, 2017.
-  C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
-  S. Zhou, L. Dong, S. Xu, and B. Xu, “Syllable-based sequence-to-sequence speech recognition with the transformer in Mandarin Chinese,” arXiv preprint arXiv:1804.10752, 2018.
-  N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio, “An online sequence-to-sequence model using partial conditioning,” in Advances in Neural Information Processing Systems, 2016, pp. 5067–5075.
-  W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
-  K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199.
-  A. Jaiswal, R. Y. Wu, W. Abd-Almageed, and P. Natarajan, “Unsupervised Adversarial Invariance,” in Advances in Neural Information Processing Systems, 2018, pp. 5097–5107.
-  A. Jaiswal, S. Xia, I. Masi, and W. AbdAlmageed, “RoPAD: Robust Presentation Attack Detection through Unsupervised Adversarial Invariance,” in 12th IAPR International Conference on Biometrics (ICB), 2019.
-  A. Jaiswal, Y. Wu, W. AbdAlmageed, and P. Natarajan, “Unified adversarial invariance,” 2019.
-  D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and Y. Bengio, “Invariant representations for noisy speech recognition,” arXiv preprint arXiv:1612.01928, 2016.
-  Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gang, and B.-H. Juang, “Speaker-invariant training via adversarial learning,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5969–5973.
-  W.-N. Hsu and J. Glass, “Extracting domain invariant features by unsupervised learning for robust automatic speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5614–5618.
-  D. Liang, Z. Huang, and Z. C. Lipton, “Learning noise-invariant representations for robust speech recognition,” arXiv preprint arXiv:1807.06610, 2018.
-  Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
-  Y. Zhang, W. Chan, and N. Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4845–4849.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
-  D. B. Paul and J. M. Baker, “The design for the wall street journal-based csr corpus,” in Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357–362.
-  J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 504–511.
-  J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM,” 1993.