Invariant Representations for Noisy Speech Recognition

by Dmitriy Serdyuk et al.
Université de Montréal

Modern automatic speech recognition (ASR) systems need to be robust under acoustic variability arising from environmental, speaker, channel, and recording conditions. Ensuring such robustness is a challenge in modern neural-network-based ASR systems, especially when not all types of variability are seen during training. We attempt to address this problem by encouraging the neural-network acoustic model to learn invariant feature representations. We draw on recent research on image generation using Generative Adversarial Networks and on domain adaptation work extending adversarial gradient-based training. A recent work from Ganin et al. proposes adversarial training for image domain adaptation: an intermediate representation from the main target classification network is used to deteriorate the performance of a domain classifier implemented as a separate neural network. Our work focuses on investigating neural architectures which produce representations invariant to noise conditions for ASR. We evaluate the proposed architecture on the Aurora-4 task, a popular benchmark for noise-robust ASR, and show that our method generalizes better than standard multi-condition training, especially when only a few noise categories are seen during training.





1 Introduction

One of the most challenging aspects of automatic speech recognition (ASR) is the mismatch between the training and testing acoustic conditions. During testing, a system may encounter new recording conditions, microphone types, speakers, accents and types of background noises. Furthermore, even if the test scenarios are seen during training, there can be significant variability in their statistics. Thus, it is important to develop ASR systems that are invariant to unseen acoustic conditions. Several model- and feature-based adaptation methods, such as Maximum Likelihood Linear Regression (MLLR), feature-based MLLR and iVectors (Saon et al., 2013), have been proposed to handle speaker variability; Noise Adaptive Training (NAT; Kalinli et al., 2010) and Vector Taylor Series (VTS; Un et al., 1998) handle environment variability. With the increasing success of Deep Neural Network (DNN) acoustic models for ASR (Hinton et al., 2012; Seide et al., 2011; Sainath et al., 2011), end-to-end systems have been proposed (Miao et al., 2015; Sainath et al., 2015) that model the acoustic conditions within a single network. This allows us to take advantage of the network's ability to learn highly non-linear feature transformations, with greater flexibility in constructing training objectives that promote learning of noise-invariant representations. The main idea of this work is to force the acoustic model to learn a representation invariant to noise conditions, instead of explicitly using noise-robust acoustic features (Section 3). This type of noise-invariant training requires noise-condition labels during training only. It is related to the idea of generative adversarial networks (GAN) and the gradient reverse method, proposed in Goodfellow et al. (2014) and Ganin & Lempitsky (2014) respectively (Section 2). We present results on the Aurora-4 speech recognition task in Section 4 and summarize our findings in Section 5.

2 Related Work

Generative Adversarial Networks consist of two networks: a generator and a discriminator. The generator network takes randomly generated feature vectors as input and is asked to produce a sample, e.g. an image, similar to the images in the training set. The discriminator network can receive either a generated image from the generator or an image from the training set. Its task is to distinguish between the "fake" generated image and the "real" image taken from the dataset. Thus, the discriminator is just a classifier network with a sigmoid output layer and can be trained with gradient backpropagation. This gradient can be propagated further to the generator network.

The two networks in the GAN setup are competing with each other: the generator is trying to deceive the discriminator network, while the discriminator tries its best to recognize whether there was a deception, similar to adversarial game-theoretic settings. Formally, the objective function of GAN training is

min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))].

The maximization over the discriminator forms a usual cross-entropy objective; its gradients are computed with respect to the parameters of D. The parameters of G are updated using the gradients propagated through the second term. The minimization over G pushes the generator to produce examples which the discriminator classifies as training ones.
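As a small numeric illustration (function and variable names are ours, not from the paper), the value of this minimax objective can be evaluated for one mini-batch of discriminator outputs:

```python
import math

def gan_value(d_real, d_fake):
    """Value of the GAN minimax objective for one mini-batch.

    d_real -- discriminator outputs D(x) on real training samples, in (0, 1)
    d_fake -- discriminator outputs D(G(z)) on generated samples, in (0, 1)
    The discriminator ascends this value; the generator descends it via
    the second (fake) term only.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator at chance level (outputs 0.5 everywhere) yields
# log(1/2) + log(1/2) = -2 log 2, the generator's optimum; a confident
# discriminator pushes the value up toward 0.
chance = gan_value([0.5, 0.5], [0.5, 0.5])
confident = gan_value([0.9, 0.95], [0.05, 0.1])
```

A successful generator drives the discriminator back toward the chance-level value, which is exactly the saddle point of the objective.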

Several practical guidelines were proposed for optimizing GANs in Radford et al.  (2015) and further explored in Salimans et al.  (2016).

Prior work by Ganin & Lempitsky (2014) proposed a method of training a network which can be adapted to new domains. The training data consists of images labeled with the classes of interest and separate domain (image background) labels. The network has a "Y"-like structure: the image is fed to a first network, which produces a hidden representation. This representation is then input to two separate networks: a domain classifier network (D) and a target classifier network (R). The goal of training is to learn a hidden representation that is invariant to the domain labels while performing well on the target classification task, so that domain information does not interfere with the target classifier at test time. Similar to the GAN objective, which forces the generated distribution to be close to the data distribution, the gradient reverse method makes the domain distributions similar to each other.

The network is trained with three goals: the hidden representation should be helpful for the target classifier, harmful for the domain classifier, and the domain classifier should have a good classification accuracy. More formally, the authors define the loss function as

L(θ_e, θ_r, θ_d) = L_r(ŷ, y; θ_e, θ_r) − α L_d(ẑ, z; θ_e) + β L_d(ẑ, z; θ_d),    (1)

where y is the ground-truth class, z is the domain label, the corresponding hat variables are the network predictions, and θ_e, θ_r and θ_d are the subsets of parameters for the encoder, recognizer and domain classifier networks respectively. The hyper-parameters α and β denote the relative influence of the loss-function terms.
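As a concrete sketch of how the three terms of this loss could be computed for a single training example (a hypothetical implementation; the function and variable names are ours, not from the paper):

```python
import math

def cross_entropy(probs, label):
    """Standard classification loss: -log p(correct label)."""
    return -math.log(probs[label])

def three_term_loss(class_probs, y, domain_probs, z, alpha=1.0, beta=1.0):
    """The three terms of the domain-adversarial loss for one example.

    In practice each term's gradient is routed to a different parameter
    subset: the recognizer term to the encoder and recognizer, the
    (sign-reversed) adversarial term to the encoder only, and the domain
    term to the domain classifier only.
    """
    l_d = cross_entropy(domain_probs, z)
    return {
        "recognizer": cross_entropy(class_probs, y),  # helpful for the target task
        "adversarial": -alpha * l_d,                  # harmful for the domain classifier
        "domain": beta * l_d,                         # trains the domain classifier
    }
```

The sign flip on the adversarial term is what the gradient reverse layer implements: during backpropagation the encoder receives the negated gradient of the domain classifier's cross-entropy.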

The ability of representations produced by a deep neural network to perform internal noise reduction is discussed in Yu et al. (2013), and that work sets the baseline for experiments on the Aurora-4 dataset. Recently, Shinohara (2016) trained a multilayer sigmoidal network in an adversarial fashion on an in-house transcription task corrupted by noise.

3 Invariant Representations for Speech Recognition

Most ASR systems are DNN-HMM hybrid systems. The context dependent (CD) HMM states (acoustic model) are the class labels of interest. The recording conditions, speaker identity, or gender represent the domains in GANs. The task is to make the hidden layer representations of the HMM state classifier network invariant with respect to these domains. We hypothesize that this adversarial method of training helps the HMM state classifier to generalize better to unseen domain conditions and requires only a small additional amount of supervision, i.e. the domain labels.

Figure 1(a) depicts the model, which is the same as the model for the gradient reverse method. It is a feed-forward neural network trained to predict the CD HMM state, with a branch that predicts the domain (noise condition). This branch is discarded in the testing phase. In our experiments we used the noise condition as the domain label, merging all noise types into one label and clean as the other label. Our training loss function is Eq. 1, with the adversarial (second) term set to a cross-entropy against the flipped domain labels for stability during training. This term maximizes the probability of an incorrect domain classification, in contrast to the gradient reverse method, where the probability of the correct classification is minimized. The remaining terms are regular cross-entropies, minimized with respect to the recognizer and domain-classifier parameters respectively. For simplicity, we use only a single hyper-parameter: the weight of the third term.
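The stability argument mirrors the non-saturating generator loss in GAN training. A small numeric sketch (binary domains; function names are ours, not from the paper) contrasts the encoder's loss under the two schemes:

```python
import math

def reversal_encoder_loss(p_correct):
    """Gradient-reverse scheme: the encoder maximizes the domain
    classifier's cross-entropy, i.e. minimizes +log p(correct domain).
    This loss flattens toward 0 as the classifier grows confident."""
    return math.log(p_correct)

def flipped_label_encoder_loss(p_correct):
    """Flipped-label scheme (binary domains): the encoder minimizes the
    cross-entropy against the wrong label, -log p(incorrect domain).
    This loss grows without bound as the classifier grows confident,
    keeping a strong training signal for the encoder."""
    return -math.log(1.0 - p_correct)
```

When the domain classifier confidently wins (p_correct → 1), the flipped-label loss and its gradient blow up while the reversed loss saturates — the same reasoning behind maximizing log D(G(z)) instead of minimizing log(1 − D(G(z))) in GAN training.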

(a) The model consists of three neural networks. The encoder produces the intermediate representation, which is used by the recognizer and by the domain discriminator. The hidden representation is trained to improve recognition and to minimize the domain discriminator's accuracy. The domain discriminator is a classifier trained to maximize its accuracy on the noise-type classification task.
(b) Top: average performance of the baseline multi-condition model and the invariance model as the number of noise conditions used for training varies. Bottom: average performance on seen versus unseen noise conditions. Testing was performed on all wv1 conditions (Sennheiser microphone).
Figure 1: Model structure for invariant training and ASR results.

4 Experiments

We experimentally evaluate our approach on the well-benchmarked Aurora-4 (Parihar & Picone, 2002) noisy speech recognition task. Aurora-4 is based on the Wall Street Journal corpus (WSJ0). It contains noises of six categories, which were added to the clean data. Every clean and noisy utterance is additionally filtered to simulate different microphone frequency characteristics. The training data contains 4400 clean utterances and 446 utterances for each noise condition, i.e. a total of 2676 noisy utterances. The test set consists of clean data, data corrupted by the 6 noise types, and data recorded with a different microphone for both the clean and noisy cases.

For both clean and noisy data, we extract 40-dimensional Mel-filterbank features with their deltas and delta-deltas, spliced over ±5 frames, resulting in 1320 input features that are subsequently mean- and variance-normalized. The baseline acoustic model is a 6-layer DNN with 2048 rectified linear units at every layer. It is trained using momentum-accelerated stochastic gradient descent for 15 epochs with new-bob annealing (as in Morgan & Bourlard, 1995; Sainath et al., 2011).
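The input dimensionality above follows directly from the splicing arithmetic; a one-line check (the function name is ours):

```python
def spliced_input_dim(n_mel=40, n_streams=3, context=5):
    """Input size of the acoustic model: static + delta + delta-delta
    filterbanks (n_streams = 3), spliced over a window of +/- context
    frames around the current frame."""
    window = 2 * context + 1       # 11-frame context window
    return n_mel * n_streams * window

# 40 mel bins x 3 streams x 11 frames = 1320 input features
```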

In order to evaluate the impact of our method on generalization to unseen noises, we performed 6 experiments with different sets of seen noises. The networks are trained on clean data, with each noise condition added one-by-one in the following order: airport, babble, car, restaurant, street, and train. The last training group includes all noises and therefore matches the standard multi-condition training setup. For every training group, we trained the baseline and the invariance model, in which we branch out at a hidden layer to a binary classifier predicting clean versus noisy data. Due to the imbalance between the amounts of clean and noisy utterances, we had to oversample noisy frames to ensure that every mini-batch contained an equal number of clean and noisy speech frames.
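The frame-balancing step could be sketched as follows (a hypothetical implementation; the frame representation and sampler design are our own assumptions, not details from the paper):

```python
import random

def balanced_minibatch(clean_frames, noisy_frames, batch_size, seed=0):
    """Draw a mini-batch with equal numbers of clean and noisy frames,
    oversampling the smaller pool by sampling with replacement.
    Returns (frame, domain_label) pairs with 0 = clean, 1 = noisy."""
    rng = random.Random(seed)
    half = batch_size // 2
    batch = [(f, 0) for f in rng.choices(clean_frames, k=half)]
    batch += [(f, 1) for f in rng.choices(noisy_frames, k=half)]
    rng.shuffle(batch)  # mix domains within the batch
    return batch
```

Sampling with replacement lets the minority (noisy) pool fill its half of every batch even when it is much smaller than the clean pool.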

Table 1 summarizes the results. Figure 1(b) visualizes the word error rate for the baseline multi-condition training and invariance training as the number of seen noise types varies. We conclude that the best performance gain is achieved when a small number of noise types is available during training. It can be seen that invariance training generalizes better to unseen noise types than multi-condition training.

We note that our experiments did not use layer-wise pre-training, commonly used for small datasets. The baseline WERs reported are very close to the state-of-the-art. Our preliminary experiments on a pre-trained network (better overall WER) when using all noise types (last row of Table 1) for training show the same trend as the non-pretrained networks.

Noise    Average          A                B                C                D
         Inv     BL       Inv     BL       Inv     BL       Inv     BL       Inv     BL
1        16.36   18.14    6.54    7.57     12.71   14.09    11.45   13.10    22.47   24.80
2        15.56   17.39    5.90    6.58     11.69   13.28    11.12   13.51    21.79   23.96
3        14.24   14.67    5.45    5.08     10.76   12.44     9.75    9.84    19.93   19.30
4        13.61   13.84    5.08    5.29      9.73    9.97     9.49    9.56    19.49   19.90
5        13.41   13.02    5.12    5.34      9.52    9.42     9.55    8.67    19.33   18.65
6        12.62   12.60    4.80    4.61      9.04    8.86     8.76    8.59    18.16   18.21
6*       11.85   11.99    4.52    4.76      8.76    8.76     7.79    8.57    16.84   16.99
Table 1: Average word error rate (WER%) on the Aurora-4 dataset over all test conditions, including seen and unseen noises and the unseen microphone (A: clean data; B: noisy data; C: clean data, different microphone; D: noisy data, different microphone; Inv: invariance training; BL: baseline). The first column is the number of noise conditions used for training. The last row (6*) is a preliminary experiment with layer-wise pre-training, close to the state-of-the-art model, and a corresponding invariance training starting from the pretrained model.

5 Discussion

This paper presents the application of generative adversarial networks and invariance training to noise-robust speech recognition. We show that invariance training helps the ASR system generalize better to unseen noise conditions and improves the word error rate when a small number of noise types are seen during training. Our experiments show that, in contrast to the image recognition task, in speech recognition the domain adaptation network suffers from underfitting; therefore, the gradient of the adversarial term in Eq. 1 is unreliable and noisy. Future research includes enhancements to the domain adaptation network, as well as exploring alternative network architectures and invariance-promoting loss functions.


Acknowledgments

We would like to thank Yaroslav Ganin and David Warde-Farley for insightful discussions, and the developers of Theano (Theano Development Team, 2016), Blocks, and Fuel (van Merriënboer et al., 2015) for great toolkits.