Investigation of Densely Connected Convolutional Networks with Domain Adversarial Learning for Noise Robust Speech Recognition

by Chia Yu Li, et al.
University of Stuttgart

We investigate densely connected convolutional networks (DenseNets) and their extension with domain adversarial training for noise robust speech recognition. DenseNets are very deep, compact convolutional neural networks which have demonstrated significant improvements over state-of-the-art results in computer vision. Our experimental results reveal that DenseNets are more robust against noise than other neural network based models such as deep feed forward neural networks and convolutional neural networks. Moreover, domain adversarial learning can further improve the robustness of DenseNets against both known and unknown noise conditions.






1 Introduction

In recent years, automatic speech recognition (ASR) performance has been significantly improved through the use of neural networks with deep structures [deep-neural-networks-for-acoustic-modeling-in-speech, conversational-speech-transcription-using-context-dependent-deep-neural-networks-2, context-dependent-pre-trained-deep-neural-networks-for-large-vocabulary-speech-recognition]. Various neural network architectures have been developed to improve ASR performance. They are variations of TDNNs [waibel1990phoneme], CNNs [abdel2012applying], RNNs [graves2013speech], and their combinations [convolutional-networks-for-images-sppech-and-time-series]. Among them, very deep CNNs demonstrate impressive performance [very-deep-convolutional-neural-networks-for-robust-speech-recognition, deep-convolutional-neural-networks-layer-wise-context-expansion-attention, xiong2018microsoft], especially in noisy conditions [very-deep-convolutional-neural-networks-for-robust-speech-recognition].

Recently in the computer vision research community, densely connected convolutional networks (DenseNets) have obtained significant improvements over the state-of-the-art networks on four highly competitive object recognition benchmark tasks [DenseNet]. The idea is to introduce shorter connections between layers close to the input and those close to the output, which alleviates the vanishing-gradient problem. Furthermore, DenseNets require fewer parameters than traditional CNNs of the same depth [DenseNet]. In [DenseNetASR], we showed that DenseNets can be used for acoustic modeling, achieving impressive performance.

In this paper, we explore the noise robustness of DenseNets and their extension with domain adversarial learning. This method was originally proposed by Ganin et al. [ganin2016domain] for unsupervised domain adaptation in natural language processing and was then applied to deep feed forward neural networks (DNNs) for noise robust speech recognition [Adversarial-Multi-task-Learning-of-Deep-Neural-Networks-for-Robust-Speech-Recognition, DomainAD]. However, to the best of our knowledge, domain adversarial learning has never been examined with a complex network like DenseNets before. Our experimental results on noisy data demonstrate that DenseNets can effectively improve the noise robustness of the system, outperforming other neural network based models. Using domain adversarial learning can further improve their robustness against both known and unknown noise conditions.

2 Methods

2.1 DenseNets Acoustic Models

In this subsection, we first describe DenseNets and review how DenseNets are used for acoustic modeling [DenseNetASR]. The key idea of DenseNets is the introduction of shorter connections between layers close to the input and those close to the output, which alleviates the vanishing-gradient problem. For acoustic modeling, DenseNets take unadapted 40-dimensional log Mel filterbank features as input and predict the context dependent HMM states (senones) [DenseNetASR].

Given an input $x_0$ and a CNN with $L$ layers, where each layer $\ell$ is equipped with a nonlinear transformation $H_\ell(\cdot)$, which is the composition of three consecutive operations: batch normalization, followed by a ReLU and a $3\times3$ convolution, DenseNets introduce direct connections from any layer to all subsequent layers. The output of layer $\ell$ is:

$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]) \qquad (1)$$

where $[x_0, x_1, \ldots, x_{\ell-1}]$ refers to the concatenation of the feature maps yielded in all the previous layers. Fig. 1 illustrates the dense connectivity structure, in which each layer takes all preceding feature-maps as input. This structure is called a dense block.
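The dense connectivity can be sketched at the level of channel bookkeeping. The following is an illustrative stand-in, not the paper's implementation: random projections with a ReLU replace the learned BN-ReLU-convolution transformation $H_\ell$, and the shapes are chosen only for the example.

```python
import numpy as np

def dense_block(x0, num_layers, k, rng):
    """Each layer consumes the concatenation of all preceding feature-maps
    and emits k new feature-maps (k = growth rate)."""
    features = [x0]                                  # x0: (channels, H, W)
    for _ in range(num_layers):
        inputs = np.concatenate(features, axis=0)    # [x0, x1, ..., x_{l-1}]
        # stand-in for H_l: a random projection to k channels plus a ReLU
        w = rng.standard_normal((k, inputs.shape[0]))
        x_l = np.maximum(np.tensordot(w, inputs, axes=1), 0.0)
        features.append(x_l)
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
out = dense_block(np.ones((16, 8, 8)), num_layers=3, k=12, rng=rng)
print(out.shape)   # (16 + 3*12, 8, 8) = (52, 8, 8)
```

Note how the channel count grows linearly with depth: every layer appends exactly $k$ feature-maps to the running concatenation.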

Figure 1: A 3-layer dense block, in which each layer takes all preceding feature-maps as input

DenseNets consist of multiple dense blocks, connected in series and separated by transition layers. Each transition layer consists of a $1\times1$ convolution layer and a $2\times2$ average pooling layer. Fig. 2 illustrates how these dense blocks and transition layers are composed in DenseNets. Note that pooling is only performed outside of dense blocks.

Figure 2: A DenseNet architecture with three dense blocks connected via transition layers

Furthermore, DenseNets reduce the number of feature-maps with the $1\times1$ convolution layer in the transition layer to improve model compactness. For example, if a dense block produces $m$ feature-maps, the transition layer generates $\lfloor \theta m \rfloor$ output feature-maps, where $\theta$ is the compression factor with range $0 < \theta \le 1$. The growth rate $k$ of DenseNets is the number of channels in their convolution layers. By equation (1), the $\ell$-th layer within a dense block has $k_0 + k \times (\ell - 1)$ input feature-maps, where $k_0$ is the number of input channels and $k$ (the model's growth rate) is the number of channels for subsequent convolution layers. DenseNets have better performance when $k$ is a small integer, e.g. $k = 12$ [DenseNet].
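As a quick sanity check of this feature-map arithmetic, the following snippet works through one block; the block input channel count $k_0=16$ is an assumed value for the example, while $k=12$, 14 layers per block and $\theta=0.5$ match the configuration used later in this paper.

```python
import math

# Illustrative feature-map bookkeeping; k0 is an assumption for the example.
k0, k, num_layers, theta = 16, 12, 14, 0.5

# the l-th layer inside a dense block sees k0 + k*(l-1) input feature-maps
inputs_per_layer = [k0 + k * (l - 1) for l in range(1, num_layers + 1)]

# the block emits m feature-maps; the transition keeps floor(theta * m)
m = k0 + k * num_layers
transition_out = math.floor(theta * m)
print(inputs_per_layer[0], m, transition_out)   # 16 184 92
```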

2.2 Domain Adversarial Learning

In this subsection, we introduce the extension of DenseNets with domain adversarial learning [ganin2016domain]. In this context, noisy conditions act as domain information.

The overall architecture of domain adversarial learning for DenseNets is shown in Fig. 3. It consists of three sub-networks: the sub-network ($y$) is for senone classification, the sub-network ($z$) is for domain classification and the shared-network ($x$) is the shared part of the two tasks. The shared-network ($x$) can be seen as a feature extractor that converts an input vector to its latent representation. Each output sub-network acts as a classifier that calculates posterior probabilities of classes given this latent representation [Adversarial-Multi-task-Learning-of-Deep-Neural-Networks-for-Robust-Speech-Recognition, DomainAD]. In domain adversarial learning, the representation is learned adversarially to the domain classification and cooperatively with the senone classification, so that domain-dependent information is removed from the representation passed to the senone classifier.

Figure 3: An architecture of DenseNets with domain adversarial learning which consists of three sub-networks: feature extractor sub-network (x), senone classification sub-network (y) and domain classification sub-network (z)


Let $\theta_x$, $\theta_y$ and $\theta_z$ denote the parameters of the shared-network ($x$), sub-network ($y$) and sub-network ($z$), respectively. The cross-entropy loss functions for the senone classifier and the domain classifier are defined as

$$\mathcal{L}_y = -\sum_{i} \log P(y_i \mid x_i; \theta_x, \theta_y), \qquad \mathcal{L}_z = -\sum_{i} \log P(z_i \mid x_i; \theta_x, \theta_z)$$

where $y_i$ and $z_i$ are the senone and domain labels of input $x_i$. The parameters are updated as

$$\theta_x \leftarrow \theta_x - \mu \left( \frac{\partial \mathcal{L}_y}{\partial \theta_x} - \lambda \frac{\partial \mathcal{L}_z}{\partial \theta_x} \right), \qquad \theta_y \leftarrow \theta_y - \mu \frac{\partial \mathcal{L}_y}{\partial \theta_y}, \qquad \theta_z \leftarrow \theta_z - \mu \frac{\partial \mathcal{L}_z}{\partial \theta_z}$$

where $\mu$ is the learning rate and $\lambda$ is the gradient reversal coefficient, a positive scalar parameter that adjusts the strength of the regularization.

The first layer of the sub-network ($z$) is the gradient reversal layer (GRL) as proposed in [Adversarial-Multi-task-Learning-of-Deep-Neural-Networks-for-Robust-Speech-Recognition]. In the forward propagation phase, the GRL just passes the input to the output as follows:

$$\mathbf{o} = \mathbf{i}$$

where $\mathbf{i}$ and $\mathbf{o}$ represent the input and output vectors of the layer, respectively. In the backward propagation phase, the GRL reverses the gradient by multiplying it with $-\lambda$ as follows:

$$\frac{\partial \mathcal{L}_z}{\partial \mathbf{i}} = -\lambda \frac{\partial \mathcal{L}_z}{\partial \mathbf{o}}$$
Hence the shared-network ($x$) is trained adversarially to the sub-network ($z$) for the domain classification. When the training is finished, the output of the network formed by ($x$, $y$) for the senone classification is used for decoding.
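The GRL's two-sided behavior can be sketched as a minimal stand-alone class, assuming a simple manual forward/backward interface; deep learning frameworks implement this as a custom autograd operation instead.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; reversed, scaled gradient in the
    backward pass (a sketch, not a framework-ready layer)."""
    def __init__(self, lam):
        self.lam = lam                    # gradient reversal coefficient lambda

    def forward(self, i):
        return i                          # o = i

    def backward(self, grad_out):
        return -self.lam * grad_out       # dL/di = -lambda * dL/do

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
print(grl.forward(x))                     # [ 1. -2.  3.]
print(grl.backward(np.array([0.2, 0.4, -0.6])))   # [-0.1 -0.2  0.3]
```

Everything upstream of the GRL (the shared-network) therefore receives a gradient that pushes it to *hurt* the domain classifier, which is exactly the adversarial update for $\theta_x$ above.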

3 Setup

Two experiments are conducted in this paper. The goal of the first experiment is to explore the robustness of DenseNets at different levels of noise. We compare DenseNets with the baseline models (DNNs, CNNs and TDNNs) on noise corrupted RM. The second experiment examines the effectiveness of domain adversarial learning, in which we compare the performance of DenseNets, DenseNets-AD and the best baseline model, TDNNs, on noise corrupted RM and Aurora4.

3.1 Resources

3.1.1 Data

The noise corrupted RM is made by artificially adding different types of noise at different values of SNR (Signal-to-Noise Ratio, defined as the ratio of the power of a signal to the power of noise) to RM [RM], using the large-scale open-source acoustic simulator developed in [C2]. It contains 1,993 noise conditions: 1,500 are used for training and 493 for testing. We created three noise corrupted data sets with different SNR ranges: Data-1 (SNR from 0 to 4), Data-2 (SNR from 0 to 8) and Data-3 (SNR from 0 to 12). Figure 4 shows the data distribution of Data-1, Data-2 and Data-3. For example, 19.9% of the utterances in Data-1 have randomly chosen noise added at SNR=0, 20.3% at SNR=1, 19.6% at SNR=2, 21.2% at SNR=3 and 19% at SNR=4. Two noise corrupted test sets are used in this paper. In the "known-noise test set" (KNN), the noise is randomly picked from the 1,500 training noises and added to the utterance at the same range of SNR used in the training set. In contrast, in the "unknown-noise test set" (UKN), the noise is selected from the 493 testing noises and added to the utterance.

Figure 4: The data composition of Data-1, Data-2 and Data-3 from left to right.
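The noise-addition step can be sketched as follows. This is a simplified stand-in for the acoustic simulator of [C2]; the function name and signal lengths are our own for the example.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db` (in dB),
    then mix it into `speech`."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR_dB = 10*log10(p_speech / (scale^2 * p_noise))  =>  solve for scale
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # one second at 16 kHz, stand-in signal
noise = rng.standard_normal(16000)
noisy = add_noise_at_snr(speech, noise, snr_db=4)
```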

The Aurora 4 task, a medium vocabulary speech recognition task, is based on the Wall Street Journal (WSJ0) dataset [AURORA4]. It contains 16 kHz speech data in the presence of six additive noises (car, crowd of people, restaurant, street, airport and train station) with linear convolutional channel distortions. The multi-condition training set with 7,138 utterances from 83 speakers includes a combination of clean utterances and utterances corrupted by one of the six noises at 10-20 dB SNR. 3,569 utterances are from the primary Sennheiser microphone and 3,569 utterances are from the secondary microphone. The test data is made using the same types of noise and microphones, and can be classified into five test conditions: clean (A), noisy (B), clean with channel distortion (C), noisy with channel distortion (D), and the average over all of them (Average).

3.1.2 Baseline systems

The baseline models in our experiments are DNNs, CNNs and TDNNs. DNNs take the 40-dimensional log Mel filterbank features as input and contain six hidden layers with sigmoid activation functions and one fully-connected output layer with a softmax activation. Each hidden layer has 1024 units. CNNs are composed of two convolution layers with max-pooling, and four affine layers with sigmoid activation functions. Each affine layer contains 1024 units. The CNNs are trained using 40-dimensional log Mel filterbank features with the first and second time derivatives. Both DNNs and CNNs use the same context window of five and a batch size of 256. The best TDNNs in Kaldi additionally use iVectors for speaker adaptation. They contain five weight layers with different context specifications (subsampling). Furthermore, the TDNN recipe applies a data augmentation technique that performs speed perturbation of the training data in order to emulate vocal tract length and speaking rate perturbations. All the ASR systems are built with the Kaldi speech recognition toolkit. The acoustic models except TDNNs are implemented with PDNN (a Python deep learning toolkit), Theano [theano], Lasagne [lasagne] and the DenseNets source code [DenseNet].
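As a shape-level illustration of the DNN baseline's forward pass, the sketch below uses a 440-dimensional spliced input (40 log Mel features over an 11-frame window) and a 2,000-senone output layer; both of these sizes are assumptions for the example, not values from the paper, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# 440-dim input and 2000 senones are illustrative assumptions
dims = [440] + [1024] * 6 + [2000]
params = [(rng.standard_normal((d_in, d_out)) * 0.01, np.zeros(d_out))
          for d_in, d_out in zip(dims[:-1], dims[1:])]

def dnn_forward(x):
    h = x
    for w, b in params[:-1]:
        h = sigmoid(h @ w + b)          # six sigmoid hidden layers of 1024 units
    w, b = params[-1]
    return softmax(h @ w + b)           # softmax over senone posteriors

probs = dnn_forward(rng.standard_normal((3, 440)))
print(probs.shape)   # (3, 2000)
```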

3.1.3 Hyperparameters for DenseNets and DenseNets-AD

The architecture of the DenseNets in this paper is the same as the best model in our previous work [DenseNetASR]: the first layer is a convolution layer which is followed by 4 dense blocks. Each block contains 14 convolutional layers. Each dense block except the last one is followed by a transition layer which consists of a $1\times1$ convolution and $2\times2$ average pooling. The depth is 65, the growth rate is $k=12$ and the compression factor is $\theta=0.5$. The convolutional layers within the dense blocks use kernel size $3\times3$. The architecture of DenseNets-AD is shown in Figure 3, except that the shared part is only the first convolutional layer and the following layers are trained for the senone (phoneme) recognition task. The gradient reversal coefficient $\lambda$ is 0.5. The input features for both models are 40-dimensional log Mel filterbank features with the first and second time derivatives.

Figure 5: The WERs of the baseline models and DenseNets on noisy test sets at different SNR ranges; DNNs (KNN) denotes the WER of DNNs on the known-noise test set (KNN) and DNNs (UKN) the WER of DNNs on the unknown-noise test set (UKN).

4 Results and Discussion

Figure 5 shows the WERs of all the baseline models and DenseNets on the noise corrupted RM test set at different ranges of SNR. Note that all the models are examined on the noise corrupted test set at the same range of SNR as used in training. For example, a model trained on the noise corrupted training set at the SNR range from 0 to 4 is tested on the noise corrupted test set at the same SNR range. The experimental results in Figure 5 show that the WERs of DNNs and CNNs increase when the SNR decreases. However, TDNNs and DenseNets are relatively stable when the SNR is reduced. Overall, DenseNets outperform the baseline models on all the test sets.

SNR range  System        KNN    UKN
0-12       TDNNs          6.93   8.59
           DenseNets      6.30   7.64
           DenseNets-AD   6.11   6.97
0-8        TDNNs          9.16   9.18
           DenseNets      8.02   8.20
           DenseNets-AD   7.84   7.95
0-4        TDNNs         10.07  12.21
           DenseNets      9.33  10.48
           DenseNets-AD   8.64   9.68
Table 1: The WERs of TDNNs, DenseNets and DenseNets-AD on the KNN (known-noise) and UKN (unknown-noise) test sets.
System A B C D Average
TDNNs 3.47 7.44 10.14 21.91 13.57
DenseNets 3.57 7.29 7.12 16.56 11.53
DenseNets-AD 3.58 6.58 6.76 16.42 10.21
Table 2: WERs of TDNNs, DenseNets and DenseNets-AD on Aurora4 test set with A, B, C, D conditions. (A: clean and Sennheiser mic, B: Sennheiser mic and noise added, C: clean and 2nd mic, D: 2nd mic and noise added)

Table 1 and Table 2 show the comparison between TDNNs, DenseNets and DenseNets-AD on the noise corrupted RM and Aurora4. As expected, the WERs of all three models increase when the SNR decreases. However, DenseNets-AD achieves the best performance on both the KNN and UKN test sets at all three SNR ranges. One reason is that TDNNs and DenseNets recognize noise as speech when the noise becomes severe. Table 3 shows one example where TDNNs and DenseNets fail to distinguish noise from speech while DenseNets-AD is unaffected.

Table 3: The references and the ASR outputs from TDNNs, DenseNets and DenseNets-AD for an utterance with dumpster truck noise added at SNR=1

5 Conclusions

This paper investigates noise robustness of DenseNets and their extension with domain adversarial learning. Our experimental results demonstrate that DenseNets are more robust against noise than other types of neural networks. Furthermore, we show that applying domain adversarial learning improves the performance of DenseNets and model generalization.