NIESR: Nuisance Invariant End-to-end Speech Recognition

07/07/2019 · by I-Hung Hsu et al., USC Information Sciences Institute

Deep neural network models for speech recognition have achieved great success recently, but they can learn incorrect associations between the target and nuisance factors of speech (e.g., speaker identities, background noise, etc.), which can lead to overfitting. While several methods have been proposed to tackle this problem, existing methods incorporate additional information about nuisance factors during training to develop invariant models. However, enumerating all possible nuisance factors in speech data and collecting their annotations is difficult and expensive. We present a robust training scheme for end-to-end speech recognition that adopts an unsupervised adversarial invariance induction framework to separate out the factors essential for speech recognition from nuisances, without using any supplementary labels besides the transcriptions. Experiments show that the speech recognition model trained with the proposed training scheme achieves relative improvements of 5.48%, 6.16%, and 6.61% in character error rate on the WSJ0, CHiME3, and TIMIT datasets, respectively. Additionally, the proposed method achieves a relative improvement of 14.44% on WSJ0 when trained on the combined WSJ0+CHiME3 dataset.




1 Introduction

With the aid of recent advances in neural networks, end-to-end deep learning systems for automatic speech recognition (ASR) have gained popularity and achieved extraordinary performance on a variety of benchmarks [1, 2, 3, 4]. End-to-end ASR models typically consist of Recurrent Neural Networks (RNNs) with Sequence-to-Sequence (Seq2Seq) architectures and attention mechanisms [5], RNN transducers [6], or transformer networks [3]. These systems learn a direct mapping from an audio signal sequence to a sequence of text transcriptions. However, the input audio sequence often contains nuisance factors that are irrelevant to the recognition task, and the trained model can incorrectly learn to associate some of these factors with target variables, which leads to overfitting. For example, besides linguistic content, speech data contains nuisance information about speaker identities, background noise, etc., which can hurt recognition performance if the distributions of these attributes are mismatched between training and testing.

A common method for combating the vulnerability of deep neural networks to nuisance factors is the incorporation of invariance induction during model training. For example, invariant deep models have achieved considerable success in computer vision [7, 8, 9] and speech recognition [10, 11, 12, 13]. Serdyuk et al. [10] obtain noise-invariant representations by employing noise-condition annotations and the gradient reversal layer [14] for acoustic modeling. Similarly, Meng et al. [11] utilize speaker information to train a speaker-invariant model for senone prediction. Hsu et al. [12] extract domain-invariant features using a factorized hierarchical variational autoencoder. Liang et al. [13] force their end-to-end ASR model to learn similar representations for clean input instances and their synthetically generated noisy counterparts.

While these methods work well at handling discrepancies between training and testing datasets for ASR systems, they require domain knowledge [12], supplementary nuisance information during training (e.g., speaker identities [11], recording environments [10], etc.), or pairwise data [13]. However, these requirements are difficult and expensive to fulfill in the real world; e.g., it is hard to enumerate all possible nuisance factors and collect corresponding annotations.

In this work, we propose a new training scheme, namely NIESR, which adopts the unsupervised adversarial invariance learning framework (UAI) [7] for end-to-end speech recognition. Without incorporating supervised information about nuisances in the input signal features, the proposed method is capable of separating the underlying elements of speech data into two series of latent embeddings: one containing all the information that is essential for ASR, and the other containing information that is irrelevant to the recognition task (e.g., accents, background noise, etc.). Experimental results show that the proposed training method boosts end-to-end ASR performance on the WSJ0, CHiME3, and TIMIT datasets. We also show the effectiveness of combining NIESR with data augmentation.

2 Methodology

In this section, we present the proposed NIESR model for nuisance-invariant end-to-end speech recognition, where the invariance is achieved by adopting the UAI framework [7]. We begin by describing the base Seq2Seq ASR model. Subsequently, we introduce the UAI framework for unsupervised adversarial invariance induction. Finally, we present the complete design of the proposed NIESR model.

2.1 Base Sequence-to-sequence Model

We are interested in learning a mapping from a sequence of acoustic spectral features x = (x_1, ..., x_T) to a series of textual characters y = (y_1, ..., y_L), given a dataset D = {(x, y)}, following the formulation of Chan et al. [5]. We employ a Seq2Seq model for this task, which estimates the probability of each character output y_i by conditioning on the previous characters y_<i and the input sequence x. Thus, the conditional probability of the entire output is:

P(y | x) = ∏_i P(y_i | x, y_<i)
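The chain-rule factorization above can be illustrated with a toy numerical example (the vocabulary size and per-step distributions below are hypothetical, not outputs of the model):

```python
import numpy as np

# Toy vocabulary of 4 characters; step_probs[i] stands in for the model's
# (hypothetical) distribution P(y_i | x, y_<i) at decoding step i.
step_probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.8, 0.0],
])
target = [0, 1, 2]  # character indices of the reference transcription

# P(y | x) is the product of the per-step conditionals (chain rule).
seq_prob = np.prod([step_probs[i, c] for i, c in enumerate(target)])
print(round(seq_prob, 3))  # 0.7 * 0.6 * 0.8 = 0.336
```

In practice the log of this product is used, which turns into the cross-entropy training loss described below.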
A Seq2Seq model is composed of two modules: an encoder Enc and a decoder Dec. Enc transforms the input features x into a high-level representation h = (h_1, ..., h_U), i.e., h = Enc(x), and Dec infers the output sequence y from h. We model Enc as a stack of Bidirectional Long Short-Term Memory (BLSTM) layers with interspersed projected-subsampling layers [15]. The subsampling layer projects each pair of consecutive input frames to a single lower-dimensional frame. We model Dec as an attention-based LSTM transducer [16], which employs h to produce the output character sequence. At every time step i, Dec generates a probability distribution over characters, which is a function of a transducer state s_i and an attention context c_i. We denote this function as CharDist, which is implemented as a single-layer perceptron with softmax activation:

P(y_i | x, y_<i) = CharDist(s_i, c_i) = softmax(W [s_i ; c_i] + b)
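The projected-subsampling operation used in the encoder can be sketched as follows; the dimensions and the (random) projection matrix are illustrative, not the paper's trained configuration:

```python
import numpy as np

def subsample(frames, proj):
    """Concatenate each pair of consecutive frames and project the
    2*d-dimensional result down to a single lower-dimensional frame."""
    T, d = frames.shape
    T = T - (T % 2)                            # drop a trailing odd frame
    pairs = frames[:T].reshape(T // 2, 2 * d)  # (T/2, 2d)
    return pairs @ proj                        # (T/2, d_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 40))       # 8 frames of 40-dim filterbank features
W = rng.normal(size=(80, 20))      # learned projection (random stand-in here)
h = subsample(x, W)
print(h.shape)  # (4, 20): half the frames, each lower-dimensional
```

Stacking such layers between BLSTMs halves the time resolution at each stage, which shortens the sequence the attention mechanism must scan.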
In order to calculate the attention context c_i, we employ the hybrid location-aware content-based attention mechanism proposed by [17]. Specifically, the attention energy e_{i,u} for frame u at time-step i takes the previous attention alignment α_{i-1} into account through a convolution operation:

f_i = F ∗ α_{i-1}
e_{i,u} = w^T tanh(W s_i + V h_u + U f_{i,u} + b)

where w, W, V, U, F, and b are learned parameters and ∗ depicts the convolution operation. The attention alignment α_i and the attention context c_i are then calculated as:

α_{i,u} = exp(e_{i,u}) / Σ_{u'} exp(e_{i,u'}),    c_i = Σ_u α_{i,u} h_u
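A single step of this hybrid attention can be sketched in plain numpy as below (random parameters and small dimensions, purely to show the shapes and data flow, not the trained model):

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def location_aware_attention(s, h, prev_align, params):
    """One step of hybrid content+location attention: the previous
    alignment is convolved with filters F and added into the energy."""
    w, W, V, U, F, b = params
    # Convolve previous alignment with each of the K location filters.
    f = np.stack([np.convolve(prev_align, F[k], mode="same")
                  for k in range(F.shape[0])], axis=1)      # (T, K)
    energies = np.array([
        w @ np.tanh(W @ s + V @ h[u] + U @ f[u] + b)
        for u in range(h.shape[0])
    ])
    align = softmax(energies)   # attention alignment over encoder frames
    context = align @ h         # attention context vector
    return align, context

rng = np.random.default_rng(1)
T, dh, ds, da, K = 6, 16, 12, 10, 4
h = rng.normal(size=(T, dh))    # encoder outputs
s = rng.normal(size=ds)         # decoder (transducer) state
prev = np.full(T, 1.0 / T)      # uniform previous alignment
params = (rng.normal(size=da),
          rng.normal(size=(da, ds)),
          rng.normal(size=(da, dh)),
          rng.normal(size=(da, K)),
          rng.normal(size=(K, 3)),  # K convolution filters of width 3
          rng.normal(size=da))
align, context = location_aware_attention(s, h, prev, params)
print(align.shape, context.shape)  # (6,) (16,)
```

Feeding the previous alignment back through the convolution is what gives the mechanism its "location awareness": it biases the model toward monotonic movement along the input frames.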
The base model is trained by minimizing the cross-entropy loss:

L_pred = − Σ_i log P(y_i | x, y_<i)
2.2 Unsupervised Adversarial Invariance Induction

Deep neural networks (DNNs) often learn incorrect associations between nuisance factors in the raw data and the final target, leading to poor generalization [7]. In the case of ASR, the network can link accents, speaker-specific information, or background noise with the transcriptions, resulting in overfitting. In order to cope with this issue, we adopt the unsupervised adversarial invariance (UAI) [7] framework for learning invariant representations that eliminate factors irrelevant to the recognition task without requiring any knowledge of nuisance factors.

The working principle of UAI is to learn a split representation of data as two embeddings, e1 and e2, where e1 contains information relevant to the prediction task (here ASR) and e2 holds all other information about the input data. The underlying mechanism for learning such a split representation is to induce competition between the main prediction task and an auxiliary task of data reconstruction. In order to achieve this, the framework uses e1 for the prediction task and a noisy version of e1 along with e2 for reconstruction. In addition, a disentanglement constraint enforces that e1 and e2 contain independent information. The prediction task tries to pull relevant factors into e1, while the reconstruction task drives e2 to store all the information about the input data because the noisy e1 is unreliable. However, the disentanglement constraint forces the two embeddings to not contain overlapping information, thus leading to competition. At convergence, this results in a nuisance-free e1 that contains only those factors that are essential for the prediction task.

2.3 NIESR Model Design and Optimization

Figure 1: NIESR: The two encoders Enc1 and Enc2 are BLSTM-based feature extractors that encode the input sequence x into representations e1 and e2. The two encodings are disentangled by adversarially training the two disentanglers, D1 and D2, which aim to predict one embedding from the other. Dec is an attention-based decoder that generates the target characters y from e1. Recon is a BLSTM-based reconstructor that decodes e2 and the noisy e1' back to the input sequence x.

The NIESR model comprises five types of modules: (1) encoders Enc1 and Enc2 that map the input data x to the encodings e1 and e2, respectively, (2) a decoder Dec that infers the target y from e1, (3) a dropout layer ψ that converts e1 into its noisy version e1', (4) a reconstructor Recon that reconstructs the input data x from [e1', e2], and (5) two adversarial disentanglers D1 and D2 that try to infer each embedding (e1 or e2) from the other. Figure 1 shows the complete NIESR model.

The encoder Enc1 and decoder Dec follow the base model design described in Section 2.1, i.e., an attention-based Seq2Seq model for the speech recognition task. Enc2 is designed to have exactly the same structure as Enc1. The dropout layer is introduced to make e1' an unreliable source of information for reconstruction, which influences the reconstruction task to extract all information about x into e2 [7]. Recon is modeled as a stack of BLSTM layers interspersed with novel upsampling layers, which perform decompression by splitting the information in each time-frame into two frames. This is the inverse of the subsampling layers [15] used in Enc1 and Enc2. The upsampling operation is formulated as:

[o_{2t-1} ; o_{2t}] = W_u q_t

where [· ; ·] represents concatenation, o is the output sequence, q_t is the input frame at time-step t, and W_u is a learned projection matrix.
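A minimal numpy sketch of this upsampling operation follows (dimensions and the random projection matrix are illustrative only):

```python
import numpy as np

def upsample(frames, proj):
    """Inverse of subsampling: project each d-dim frame up to 2*d_out
    dimensions, then split the result into two consecutive frames."""
    T, d = frames.shape
    expanded = frames @ proj               # (T, 2*d_out)
    d_out = expanded.shape[1] // 2
    return expanded.reshape(2 * T, d_out)  # (2T, d_out)

rng = np.random.default_rng(2)
q = rng.normal(size=(4, 200))              # 4 compressed frames
W_u = rng.normal(size=(200, 80))           # learned projection (random here)
o = upsample(q, W_u)
print(o.shape)  # (8, 40): twice the frames, each 40-dim
```

Each projected row splits into output frames o_{2t-1} and o_{2t}, so stacking these layers restores the time resolution that the encoder's subsampling layers removed.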

The adversarial disentanglers D1 and D2 model the UAI disentanglement constraint discussed in Section 2.2, following previous works [7, 8, 9]. D1 tries to predict e2 from e1 and D2 tries to do the inverse. This is directly opposite to the desired independence between e1 and e2. Thus, training D1 and D2 adversarially against the rest of the model helps achieve the independence goal. Unlike previous works [7, 8, 9], the encodings e1 and e2 in this work are vector-sequences instead of single vectors: e1 = (e1_1, ..., e1_U) and e2 = (e2_1, ..., e2_U). Naïve instantiations of the disentanglers would perform frame-specific predictions of e2_t from e1_t and vice versa. However, each pair of e1_t and e2_t generated at time-step t contains information not only from frame t but also from other frames across the time-span. This is because Enc1 and Enc2 are modeled as RNNs. Therefore, a better method to perform disentanglement for sequential representations is to use the whole series of e1 or e2 to estimate every element of the other. Hence, we model D1 and D2 as BLSTMs.

The proposed NIESR model is optimized by adopting the UAI training strategy [7, 9], i.e., playing a game where we treat Enc1, Enc2, Dec, and Recon as one player P1, and D1 and D2 as the other player P2. The model is trained using a scheduled update scheme where we freeze the weights of one player model when we update the weights of the other. The training objective comprises three tasks: (1) predicting transcriptions from the input signal, (2) reconstruction of the input, and (3) adversarial prediction of each of e1 and e2 from the other. The objective of the first task is the cross-entropy loss of Section 2.1. The goal of the reconstruction task is to minimize the mean squared error (MSE) between x and the reconstruction:

L_recon = || x − Recon([ψ(e1), e2]) ||²

where ψ denotes dropout. The training objective for the disentanglers is to minimize the MSE between the embeddings predicted by the disentanglers and the embeddings generated by the encoders. The objective of the encoders, however, is to generate e1 and e2 that are not predictive of each other. Hence, in the scheduled update scheme, the targets t1 and t2 for the disentanglers differ depending on whether player model P1 or P2 is being updated, following [9]. The loss can be written as:

L_dis = || t2 − D1(e1) ||² + || t1 − D2(e2) ||²

where t1 and t2 are set as e1 and e2, respectively, when updating P2, but are set to random vectors when updating P1.
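The scheduled adversarial update can be sketched as a simple alternating loop. The update ratio and the embedding shapes below are illustrative assumptions (the paper's exact schedule settings are not reproduced here); the point is the player alternation and the target switching:

```python
import numpy as np

def training_schedule(num_steps, p2_per_p1=5):
    """Yield which player to update at each step: the main model P1
    (encoders, decoder, reconstructor) or the disentanglers P2.
    The other player's weights are frozen during the update."""
    for step in range(num_steps):
        yield "P2" if step % (p2_per_p1 + 1) else "P1"

def disentangler_targets(player, e1, e2, rng):
    """Target switching from the loss above: real embeddings when
    updating the disentanglers (P2); random vectors when updating P1,
    so the encoders learn embeddings that cannot predict each other."""
    if player == "P2":
        return e1, e2
    return rng.normal(size=e1.shape), rng.normal(size=e2.shape)

rng = np.random.default_rng(3)
e1 = rng.normal(size=(6, 200))   # toy sequence embeddings
e2 = rng.normal(size=(6, 200))
counts = {"P1": 0, "P2": 0}
for player in training_schedule(12):
    t1, t2 = disentangler_targets(player, e1, e2, rng)
    counts[player] += 1           # a real optimizer step would go here
print(counts)
```

With the illustrative 1:5 ratio, 12 steps give 2 updates of P1 and 10 of P2; in a real implementation each branch would run a gradient step on the corresponding frozen/unfrozen parameter groups.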

Overall, the model is trained through backpropagation by optimizing the combined objective:

L = α L_pred + β L_recon + γ L_dis

where the loss-weights α, β, and γ are hyperparameters, decided by the performance on the development set.
Inference with NIESR involves a forward pass of data through Enc1 followed by Dec. Hence, the usage and computational cost of NIESR at inference are the same as for the base model.

3 Experiments

The effectiveness of NIESR is quantified through the performance improvement achieved by adopting the invariant learning framework. We provide experimental results on speech recognition on three benchmark datasets: the Wall Street Journal Corpus (WSJ0) [18], CHiME3 [19], and TIMIT [20]. We additionally provide results on the combined WSJ0+CHiME3 dataset.

3.1 Datasets

WSJ0: This dataset is a collection of readings of the Wall Street Journal. It contains 7,138 utterances in the training set, 410 in the development set, and 330 in the test set. We use 40-dimensional log Mel filterbank features as the model input, and normalize the transcriptions to capitalized character sequences.

CHiME3: The CHiME3 dataset contains: (1) WSJ0 sentences spoken in challenging noisy environments (real data) and (2) WSJ0 readings mixed with four different types of background noise (simulated data). The real speech data was recorded in five noisy environments using a six-channel tablet-based microphone array. The training data consists of 1,999 real noisy utterances from four speakers and 7,138 simulated noisy utterances from the 83 speakers in the WSJ0 training set. In total, there are 3,280 utterances in the development set and 2,640 utterances in the test set, containing both real and simulated data. The speakers in the training, development, and test sets are mutually different. In our experiments, we follow [11] in using far-field speech from the fifth microphone channel for all sets. We adopt the same input-output setting for CHiME3 as for WSJ0.

TIMIT: This corpus contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers drawn from 8 different dialect groups. Among them, utterances from 168 speakers are held out as the test set. We further select sentences from 4 speakers of each dialect group, i.e., 32 speakers in total, from the remaining data to form the development set. Thus, all speakers in the training, development, and test sets are different. Models were trained on 80-dimensional log Mel filterbank features, and capitalized character sequences were treated as targets.

3.2 Experiment Setup

We train the base model without invariance induction, i.e., the model consisting of Enc and Dec (Section 2.1), as a baseline. We feed the whole sequence of spectral features to Enc and get the predicted character sequence from Dec. We use a stack of two BLSTMs with a subsampling layer (as described in Section 2.1) in between for Enc. Dec is implemented as a single-layer LSTM combined with the attention modules introduced in Section 2.1. All models were trained with early stopping with 30 epochs of patience, and the best model is selected based on performance on the development set. Other model and training hyperparameters are listed in Table 1.


Item | Setting
Enc and Dec LSTM Dimension | 200
Subsampling Projected Dimension | 200
Attention Dimension | 200
Attention Convolution Channel | 10
Attention Convolution Kernel Size | 100
Optimizer | Adam
Learning Rate | 5e-4
Table 1: Hyperparameters for the base model.
Item | Setting
Recon LSTM Dimension | 300
Upsampling Projected Dimension | 200
D1, D2 Dimension | 200
Dropout layer rate | 0.4
Optimizer | Adam
Learning Rate for P1 | 5e-4
Learning Rate for P2 | 1e-3
α, β, γ for WSJ0 | 100, 10, 1
α, β, γ for CHiME3 | 100, 1, 0.5
α, β, γ for TIMIT | 100, 50, 1
Table 2: Hyperparameters for the NIESR model.

We augment the base model with Enc2, Recon, D1, and D2, while treating the base encoder Enc as Enc1, to form the NIESR model. Enc2 has the same hyperparameter setting and structure as Enc1. Recon is modeled as a cascade of a BLSTM layer, an upsampling layer, and another BLSTM layer. D1 and D2 are implemented as BLSTMs followed by two fully-connected layers. We update the player models P1 and P2 in a fixed frequency ratio in our experiments. Hyperparameters for Enc1 and Dec are the same as for the base model. Additional hyperparameters for NIESR are summarized in Table 2.

We further provide results for a stronger baseline model that utilizes labeled nuisances (speakers for WSJ0; speakers and noise environment condition for CHiME3; speakers and dialect groups for TIMIT) with the gradient reversal layer (GRL) [14] to learn invariant representations. Specifically, the model consists of Enc, Dec, and a classifier, with a GRL between the embedding learned from Enc and the classifier, following the standard setup in [14]. The target of the classifier is to predict the nuisance label from the embedding, while the direction of the training gradient to Enc is flipped. We denote this model as Spk-Inv for speaker-invariance, Env-Inv for environment-invariance in CHiME3, and Dial-Inv for dialect-invariance in TIMIT.

Model | WSJ0 | CHiME3 | TIMIT
Base | 12.95 | 44.61 | 28.76
Spk-Inv | 12.31 (4.94) | 43.93 (1.52) | 28.45 (1.08)
Env-Inv | – | 42.61 (4.48) | –
Dial-Inv | – | – | 28.29 (1.63)
NIESR | 12.24 (5.48) | 41.86 (6.16) | 26.86 (6.61)
Table 3: Speech recognition performance as CER (%). Values in parentheses show relative improvement (%) over Base model.

3.3 ASR Performance on Benchmark Datasets

Table 3 summarizes the results of end-to-end ASR on the WSJ0, CHiME3, and TIMIT datasets. Results show that NIESR achieves 5.48%, 6.16%, and 6.61% relative improvements over the base model on WSJ0, CHiME3, and TIMIT, respectively, and demonstrates the best CER among all methods.

3.4 Invariance to Nuisance Factors

In order to examine whether a latent embedding is invariant to a nuisance factor z, we calculate the accuracy of predicting z from the encoding. Specifically, this is calculated by training classification networks (a BLSTM followed by two fully-connected layers) to predict z from the generated embeddings. Table 4 presents the results of this experiment, showing that the e1 embedding of the NIESR model, which is used for ASR, contains less nuisance information than the encodings of the base, Spk-Inv, and Env-Inv models. In contrast, the e2 embedding of NIESR contains most of the nuisance information, showing that nuisance factors migrate to this embedding, as expected.

Dataset | Predict from | Accuracy (z: Speaker) | Accuracy (z: Env)
WSJ0 | Encoding in Base Model | 67.91 | –
WSJ0 | Encoding in Spk-Inv | 65.60 | –
WSJ0 | e1 in NIESR | 63.35 | –
WSJ0 | e2 in NIESR | 97.92 | –
CHiME3 | Encoding in Base Model | 38.52 | 69.24
CHiME3 | Encoding in Spk-Inv | 37.91 | 69.11
CHiME3 | Encoding in Env-Inv | 38.84 | 66.44
CHiME3 | e1 in NIESR | 35.87 | 63.45
CHiME3 | e2 in NIESR | 92.28 | 97.05
Table 4: Results of predicting nuisance factor from learned representations as accuracy. Env stands for environment.
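The probing protocol above can be sketched with a minimal stand-in classifier. This uses a nearest-centroid probe on synthetic embeddings instead of the paper's BLSTM classifier and real encodings, purely to illustrate the evaluation logic:

```python
import numpy as np

def probe_accuracy(train_emb, train_z, test_emb, test_z):
    """Fit one centroid per nuisance label on training embeddings and
    report how often test embeddings fall nearest their label's centroid.
    High accuracy means the embedding still encodes the nuisance factor."""
    labels = np.unique(train_z)
    centroids = np.stack([train_emb[train_z == c].mean(axis=0) for c in labels])
    dists = ((test_emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    preds = labels[dists.argmin(axis=1)]
    return (preds == test_z).mean()

rng = np.random.default_rng(4)
# Synthetic "nuisance-heavy" embedding: the nuisance label shifts the
# mean, so the probe should recover it well above chance.
z = rng.integers(0, 2, size=200)
emb = rng.normal(size=(200, 16)) + 3.0 * z[:, None]
acc = probe_accuracy(emb[:100], z[:100], emb[100:], z[100:])
print(acc > 0.9)  # True: this embedding clearly leaks the nuisance label
```

Applying such a probe to e1 versus e2 is exactly the comparison reported in Table 4: a well-disentangled e1 should score near chance, while e2 should score high.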

3.5 Additional Robustness through Data Augmentation

Training with additional data that reflects multiple variations of nuisance factors helps models generalize better. In this experiment, we treat the CHiME3 dataset, which contains WSJ0 recordings with four different types of noise, as a noisy augmentation of WSJ0. We train the base model and NIESR on the augmented dataset, i.e., WSJ0+CHiME3, and test on the original CHiME3 and WSJ0 test sets separately. Table 5 summarizes the results of this experiment, showing that training with data augmentation provides improvements on both the CHiME3 and WSJ0 datasets compared to the results in Table 3. It is important to note that the NIESR model trained on the augmented dataset achieves a 14.44% relative improvement on WSJ0 compared to the base model trained on the same data. This is because data augmentation provides additional information about potential nuisance factors to the NIESR model and, consequently, helps it ignore these factors for the ASR task, even though pairwise data is not provided to the model as in [13]. Hence, the results show that the NIESR model can be easily combined with data augmentation to further enhance the robustness and nuisance-invariance of the learned features.

Model | WSJ0 | CHiME3
Base | 9.35 | 41.55
Spk-Inv | 8.62 (7.81) | 40.77 (1.88)
Env-Inv | 9.17 (1.93) | 40.27 (3.08)
NIESR | 8.00 (14.44) | 38.35 (7.70)
Table 5: Test results of models trained on the WSJ0+CHiME3 augmented dataset as CER (%). Values in parentheses show the relative improvement (%) over Base model.

4 Conclusion

We presented NIESR, an end-to-end speech recognition model that adopts the unsupervised adversarial invariance framework for invariance to nuisances without requiring any knowledge of potential nuisance factors. The model works by learning a split representation of data through competition between the recognition and an auxiliary data reconstruction task. Results of experimental evaluation demonstrate that the proposed model achieves significant boosts in performance on ASR.

5 Acknowledgements

This material is based on research sponsored by DARPA under agreement number FA8750-18-2-0014. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.