In recent years, automatic speech recognition (ASR) performance has been significantly improved through the use of neural networks with deep structures [deep-neural-networks-for-acoustic-modeling-in-speech, conversational-speech-transcription-using-context-dependent-deep-neural-networks-2, context-dependent-pre-trained-deep-neural-networks-for-large-vocabulary-speech-recognition]. Various neural network architectures have been developed to improve ASR performance, mostly variations of TDNNs [waibel1990phoneme], CNNs [abdel2012applying], RNNs [graves2013speech], and their combinations [convolutional-networks-for-images-sppech-and-time-series]. Among them, very deep CNNs demonstrate impressive performance [very-deep-convolutional-neural-networks-for-robust-speech-recognition, deep-convolutional-neural-networks-layer-wise-context-expansion-attention, xiong2018microsoft], especially in noisy conditions [very-deep-convolutional-neural-networks-for-robust-speech-recognition].
Recently, in the computer vision research community, densely connected convolutional networks (DenseNets) have obtained significant improvements over state-of-the-art networks on four highly competitive object recognition benchmark tasks [DenseNet]. The idea is to introduce shorter connections between layers close to the input and those close to the output, which alleviates the vanishing-gradient problem. Furthermore, DenseNets require fewer parameters than traditional CNNs of the same depth [DenseNet]. In [DenseNetASR], we showed that DenseNets can be used for acoustic modeling, achieving impressive performance.
In this paper, we explore the noise robustness of DenseNets and their extension with domain adversarial learning. This method was originally proposed by Ganin et al. [ganin2016domain] for unsupervised domain adaptation in natural language processing and was later applied to deep feed-forward neural networks (DNNs) for noise-robust speech recognition [Adversarial-Multi-task-Learning-of-Deep-Neural-Networks-for-Robust-Speech-Recognition, DomainAD]. However, to the best of our knowledge, domain adversarial learning has never before been examined with a complex architecture like DenseNets. Our experimental results on noisy data demonstrate that DenseNets effectively improve the noise robustness of the system, outperforming other neural network based models. Domain adversarial learning further improves their robustness against both known and unknown noise conditions.
2.1 DenseNets Acoustic Models
In this subsection, we first describe DenseNets and then review how they are used for acoustic modeling [DenseNetASR]. The key idea of DenseNets is the introduction of shorter connections between layers close to the input and those close to the output, which alleviates the vanishing-gradient problem. For acoustic modeling, DenseNets take unadapted 40-dimensional log Mel filterbank features as input and predict context-dependent HMM states (senones) [DenseNetASR].
Given an input $x_0$ and a CNN with $L$ layers, where each layer $\ell$ is equipped with a non-linear transformation $H_\ell(\cdot)$ (e.g., a convolution), DenseNets introduce direct connections from any layer to all subsequent layers. The output of layer $\ell$ is:
$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$$
where $[x_0, x_1, \ldots, x_{\ell-1}]$ refers to the concatenation of the feature-maps yielded in all the previous layers. Fig. 1 illustrates the dense connectivity structure, in which each layer takes all preceding feature-maps as input. This structure is called a dense block.
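As an illustration, the dense connectivity above can be sketched in plain NumPy, treating each layer as a function that maps the concatenation of all preceding feature-maps to $k$ new channels. The toy layers here are random projections standing in for trained convolutions; the shapes and names are ours, not from the paper.

```python
import numpy as np

def dense_block(x0, layers, axis=0):
    """Dense connectivity: each layer receives the concatenation of the
    block input and the outputs of all preceding layers."""
    outputs = [x0]
    for H in layers:
        # x_l = H_l([x_0, x_1, ..., x_{l-1}])
        x_l = H(np.concatenate(outputs, axis=axis))
        outputs.append(x_l)
    return np.concatenate(outputs, axis=axis)

# Toy non-linear transformations: each maps its input down to k channels.
k = 2
def make_layer(rng):
    def H(x):
        W = rng.standard_normal((k, x.shape[0]))  # random projection
        return np.maximum(W @ x, 0.0)             # ReLU non-linearity
    return H

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 10))           # 4 input channels, 10 frames
block = [make_layer(rng) for _ in range(3)]
y = dense_block(x0, block)
print(y.shape)  # (10, 10): 4 input channels + 3 layers * k=2 new channels
```

Note how the channel count grows linearly with depth: each layer contributes exactly $k$ feature-maps, which is what the growth rate controls.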
DenseNets consist of multiple dense blocks, connected in series and separated by transition layers. Each transition layer consists of a $1 \times 1$ convolution layer and a $2 \times 2$ average pooling layer. Fig. 2 illustrates how these dense blocks and transition layers are composed in DenseNets. Note that pooling is only performed outside of dense blocks.
Furthermore, DenseNets reduce the number of feature-maps with the $1 \times 1$ convolution layer in the transition layer to improve model compactness. For example, if a dense block produces $m$ feature-maps, the transition layer generates $\lfloor \theta m \rfloor$ output feature-maps, where $\theta$ is the compression factor with range $0 < \theta \le 1$. The growth rate $k$ of DenseNets is the number of channels produced by each of their convolution layers. By the equation above, the $\ell$-th layer within a dense block has $k_0 + k(\ell - 1)$ input feature-maps, where $k_0$ is the number of input channels and $k$ (the model's growth rate) is the number of channels of the subsequent convolution layers. DenseNets have good performance even when $k$ is a small integer, e.g. $k = 12$ [DenseNet].
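The feature-map bookkeeping above can be checked with a few lines of Python. The 24-channel block input ($k_0 = 24$) is a hypothetical value chosen for illustration; the growth rate and compression factor match the settings used later in the paper.

```python
import math

def layer_input_channels(k0, k, layer_index):
    """Input feature-maps to the l-th layer (1-indexed) of a dense block
    with k0 input channels and growth rate k: k0 + k*(l - 1)."""
    return k0 + k * (layer_index - 1)

def transition_output_channels(m, theta):
    """Feature-maps produced by a transition layer compressing m maps
    with compression factor theta (0 < theta <= 1): floor(theta * m)."""
    return math.floor(theta * m)

# Growth rate k = 12 and compression 0.5, as in the paper; k0 = 24 is a
# hypothetical block input size.
k0, k, theta, depth = 24, 12, 0.5, 14
last_layer_in = layer_input_channels(k0, k, depth)   # 24 + 12*13 = 180
block_out = k0 + k * depth                           # 24 + 12*14 = 192
print(last_layer_in, block_out, transition_output_channels(block_out, theta))
# 180 192 96
```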
2.2 Domain Adversarial Learning
In this subsection, we introduce the extension of DenseNets with domain adversarial learning [ganin2016domain]. In this context, noisy conditions act as domain information.
The overall architecture of domain adversarial learning with DenseNets is shown in Fig. 3. It consists of three sub-networks: the sub-network (y) performs senone classification, the sub-network (z) performs domain classification, and the shared network (x) is the part shared by the two tasks. The shared network (x) can be seen as a feature extractor that converts an input vector to a latent representation. Each output sub-network acts as a classifier that calculates the posterior probabilities of its classes given this latent representation [Adversarial-Multi-task-Learning-of-Deep-Neural-Networks-for-Robust-Speech-Recognition, DomainAD]. In domain adversarial learning, the representation is learned adversarially with respect to the domain classification and cooperatively with the senone classification, so that domain-dependent information is removed from the representation fed to the senone classifier.
Let $\theta_x$, $\theta_y$ and $\theta_z$ denote the parameters of the shared network (x), the sub-network (y) and the sub-network (z), respectively. The cross-entropy loss functions for the senone classifier and the domain classifier are defined as
$$L_y(\theta_x, \theta_y) = -\sum_i \log P(y_i \mid x_i; \theta_x, \theta_y), \qquad L_z(\theta_x, \theta_z) = -\sum_i \log P(z_i \mid x_i; \theta_x, \theta_z)$$
where $y_i$ and $z_i$ denote the senone label and the domain label of the input frame $x_i$.
The parameters are updated as
$$\theta_x \leftarrow \theta_x - \mu \Big( \frac{\partial L_y}{\partial \theta_x} - \lambda \frac{\partial L_z}{\partial \theta_x} \Big), \qquad \theta_y \leftarrow \theta_y - \mu \frac{\partial L_y}{\partial \theta_y}, \qquad \theta_z \leftarrow \theta_z - \mu \frac{\partial L_z}{\partial \theta_z}$$
where $\mu$ is the learning rate and $\lambda$ is the gradient reversal coefficient, a positive scalar parameter that adjusts the strength of the regularization.
The first layer of the sub-network (z) is the gradient reversal layer (GRL), as proposed in [Adversarial-Multi-task-Learning-of-Deep-Neural-Networks-for-Robust-Speech-Recognition]. In the forward propagation phase, the GRL simply passes the input to the output:
$$R(x) = x$$
where $x$ and $R(x)$ represent the input and output vectors of the layer, respectively. In the backward propagation phase, the GRL reverses the gradient by multiplying it with $-\lambda$:
$$\frac{\partial R}{\partial x} = -\lambda I$$
where $I$ is the identity matrix.
Hence the shared network (x) is trained adversarially to the sub-network (z) on the domain classification task. When training is finished, the output of the network formed by (x) and (y), i.e. the senone classifier, is used for decoding.
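As a minimal sketch of the GRL behavior outside any autodiff framework (real implementations register this as a custom layer whose backward pass is overridden), the forward pass is the identity and the backward pass scales the incoming gradient by $-\lambda$:

```python
import numpy as np

class GradientReversal:
    """Toy gradient reversal layer: identity in the forward pass,
    gradient scaled by -lambda in the backward pass."""
    def __init__(self, lam):
        self.lam = lam

    def forward(self, x):
        return x                        # R(x) = x

    def backward(self, grad_out):
        return -self.lam * grad_out     # dR/dx = -lambda * I

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)
g = grl.backward(np.ones_like(x))
print(y)  # [ 1. -2.  3.]  -- input passes through unchanged
print(g)  # [-0.5 -0.5 -0.5]  -- gradient reversed and scaled by lambda
```

During training, the layers below the GRL therefore receive a gradient that pushes them to make the domain classification harder, which is exactly the adversarial update on $\theta_x$.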
Two experiments are conducted in this paper. The first explores the robustness of DenseNets at different levels of noise: we compare DenseNets against the baseline models (DNNs, CNNs and TDNNs) on noise-corrupted RM. The second examines the effectiveness of domain adversarial learning: we compare the performance of DenseNets, DenseNets-AD and the best baseline model (TDNNs) on noise-corrupted RM and Aurora4.
The noise-corrupted RM data is made by artificially adding different types of noise at different SNR values (the Signal-to-Noise Ratio is defined as the ratio of the power of a signal to the power of noise, $\mathrm{SNR} = 10 \log_{10}(P_{\text{signal}} / P_{\text{noise}})$) to RM [RM], using the large-scale open-source acoustic simulator developed in [C2]. It contains 1,993 noise conditions: 1,500 are used for training and 493 for testing. We created three noise-corrupted data sets with different SNR ranges: Data-1 (SNR from 0 to 4), Data-2 (SNR from 0 to 8) and Data-3 (SNR from 0 to 12). Figure 4 shows the data distribution of Data-1, Data-2 and Data-3. For example, 19.9% of the utterances in Data-1 have a randomly chosen noise added at SNR=0, 20.3% at SNR=1, 19.6% at SNR=2, 21.2% at SNR=3 and 19.0% at SNR=4. Two noise-corrupted test sets are used in this paper. In the "known-noise test set" (KNN), the noise is randomly picked from the 1,500 training noises and added to each utterance at the same SNR range used in the training set. In contrast, in the "unknown-noise test set" (UKN), the noise is selected from the 493 testing noises and added to each utterance.
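For illustration, a common way to corrupt an utterance at a target SNR is to scale the noise so that the signal-to-noise power ratio matches the requested value. The sketch below with NumPy uses our own function name; the actual simulator of [C2] handles this internally and also covers reverberation and channel effects.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the mixture has the requested SNR (in dB),
    then add it to `speech`. SNR = 10*log10(P_speech / P_noise)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Noise power required for the target SNR, and the matching gain.
    target_p_noise = p_speech / (10 ** (snr_db / 10.0))
    gain = np.sqrt(target_p_noise / p_noise)
    return speech + gain * noise

rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s toy tone
noise = rng.standard_normal(16000)                           # toy noise
noisy = add_noise_at_snr(speech, noise, snr_db=0)
# At SNR = 0 dB, the scaled noise power equals the speech power.
```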
The Aurora 4 task, a medium-vocabulary speech recognition task, is based on the Wall Street Journal (WSJ0) dataset [AURORA4]. It contains 16 kHz speech data in the presence of six additive noises (car, crowd of people, restaurant, street, airport and train station) with linear convolutional channel distortions. The multi-condition training set, with 7,138 utterances from 83 speakers, includes a combination of clean utterances and utterances corrupted by one of the six noises at 10-20 dB SNR. 3,569 utterances are recorded with the primary Sennheiser microphone and 3,569 with a secondary microphone. The test data is made using the same types of noise and microphones and can be classified into five test conditions: clean, noisy, clean with channel distortion, noisy with channel distortion, and all of them combined, referred to as A, B, C, D and Average, respectively.
3.1.2 Baseline systems
The baseline models in our experiments are DNNs, CNNs and TDNNs. The DNNs take the 40-dimensional log Mel filterbank features as input and contain six hidden layers with sigmoid activation functions and one fully-connected output layer with a softmax activation; each hidden layer has 1,024 units. The CNNs are composed of two convolution layers with max-pooling and four affine layers with sigmoid activation functions; each affine layer contains 1,024 units, and the CNNs are trained on 40-dimensional log Mel filterbank features with first and second time derivatives. Both DNNs and CNNs use the same context window of five and batch size of 256. The best TDNNs in Kaldi additionally use iVectors for speaker adaptation. They contain five weight layers with different context specifications (subsampling). Furthermore, the TDNN recipe applies a data augmentation technique, speed perturbation of the training data, to emulate vocal tract length and speaking rate perturbations. All ASR systems are built with the Kaldi speech recognition toolkit [kaldi]. The acoustic models, except the TDNNs, are implemented with PDNN (a Python deep learning toolkit) [kaldi-pdnn], Theano [theano], Lasagne [lasagne] and the DenseNets source code [DenseNet].
3.1.3 Hyperparameters for DenseNets and DenseNets-AD
The architecture of the DenseNets in this paper is the same as the best model of our previous work [DenseNetASR]: the first layer is a convolution layer, followed by 4 dense blocks, each containing 14 convolutional layers. Each dense block except the last one is followed by a transition layer, which consists of a $1 \times 1$ convolution and $2 \times 2$ average pooling. The depth is 65, the growth rate is 12 and the compression ratio is 0.5. All convolutional layers inside the dense blocks use a kernel size of $3 \times 3$. The architecture of DenseNets-AD is shown in Figure 3, except that the shared part is only the first convolutional layer and all following layers are trained for the senone (phoneme) classification task. The gradient reversal coefficient $\lambda$ is 0.5. The input features for both models are 40-dimensional log Mel filterbank features with first and second time derivatives.
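The channel growth implied by this configuration can be traced end to end with a small loop. The 24 output channels of the initial convolution are a hypothetical value for illustration; the block count, layers per block, growth rate and compression match the configuration above.

```python
import math

# Trace feature-map counts through a DenseNet with 4 dense blocks of
# 14 layers each, growth rate k = 12 and compression 0.5.
k, theta = 12, 0.5
num_blocks, layers_per_block = 4, 14
channels = 24  # hypothetical output channels of the initial convolution

for b in range(num_blocks):
    channels += k * layers_per_block      # each layer adds k feature-maps
    if b < num_blocks - 1:                # no transition after the last block
        channels = math.floor(theta * channels)

print(channels)  # 24 -> 192 -> 96 -> 264 -> 132 -> 300 -> 150 -> 318
```

The compression step roughly halves the channel count between blocks, which is what keeps the parameter count low despite the dense concatenation.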
4 Results and Discussion
Figure 5 shows the WERs of all the baseline models and of DenseNets on the noise-corrupted RM test set at different SNR ranges. Note that each model is evaluated on the noise-corrupted test set at the same SNR range it was trained on; for example, a model trained on the training set corrupted at SNRs from 0 to 4 is tested on the test set corrupted at the same SNR range. The experimental results in Figure 5 show that the WERs of DNNs and CNNs increase as the SNR decreases, whereas TDNNs and DenseNets remain relatively stable. Overall, DenseNets outperform the baseline models on all test sets.
Table 1 and Table 2 show the comparison between TDNNs, DenseNets and DenseNets-AD on the noise-corrupted RM and Aurora4. As expected, the WERs of all three models increase as the SNR decreases. However, DenseNets-AD achieves the best performance on both the KNN and UKN test sets at all three SNR ranges. One reason is that TDNNs and DenseNets tend to recognize noise as speech when the noise becomes severe. Table 3 shows an example in which TDNNs and DenseNets fail to distinguish noise from speech, while DenseNets-AD is unaffected.
Reference:    LIST FULL LOCATION DATA FOR TRACK FFF088
TDNNs:        LIST FULL LOCATION DATA FOR TRACK FFF088 TO EIGHT
DenseNets:    LIST FULL LOCATION DATA FOR TRACK FFF088 IN THE EIGHT
DenseNets-AD: LIST FULL LOCATION DATA FOR TRACK FFF088
This paper investigates noise robustness of DenseNets and their extension with domain adversarial learning. Our experimental results demonstrate that DenseNets are more robust against noise than other types of neural networks. Furthermore, we show that applying domain adversarial learning improves the performance of DenseNets and model generalization.