Spectral mapping using residual neural network and mimic loss
Spectral mapping uses a deep neural network (DNN) to map directly from noisy speech to clean speech. Our previous study found that the performance of spectral mapping improves greatly when using helpful cues from an acoustic model trained on clean speech. The mapper network learns to mimic the input favored by the spectral classifier and cleans the features accordingly. In this study, we explore two innovations. First, we replace the DNN-based spectral mapper with a residual network that is more attuned to the goal of predicting clean speech. Second, we examine how integrating long-term context in the mimic criterion (via wide-residual biLSTM networks) affects the performance of spectral mapping compared to DNNs. Our goal is to derive a model that can be used as a preprocessor for any recognition system; the features derived from our model are passed through the standard Kaldi ASR pipeline and achieve a WER of 9.3% using only feature adaptation.
Applying deep learning to the task of Automatic Speech Recognition (ASR) has shown great progress recently in clean environments. However, these ASR systems still suffer from performance degradation in the presence of acoustic interference, such as additive noise and room reverberation.
One strategy to address this problem is to use a deep learning front-end for denoising the features, which are then fed to the ASR system. Some of these models attempt to estimate an ideal ratio mask (IRM) which is multiplied with the spectral features to remove noise from the speech signal. Others utilize spectral mapping in the signal domain [3, 4] or in the feature domain [5, 6] to translate directly from noisy to clean speech without additional constraints.
When these pre-processing models were introduced, they could be easily decoupled from the rest of the ASR pipeline. This was useful, because these models provided a general-purpose speech denoising module that could be applied to any noisy data. Over time, impressive gains in performance were observed with the addition of noise-robust features and joint training of the spectral mapper and acoustic model. However, the front-end and back-end models in these approaches each depend on the presence of the other, i.e. one would not be able to re-use the mapper for another task or dataset without re-training it. Moreover, adding robust features increases the difficulty of feature creation and increases the number of parameters in the speech recognition pipeline.
Our previous work introduced a form of knowledge transfer we dubbed mimic loss. Unlike student-teacher learning or knowledge distillation [9, 10, 11], which transfer knowledge from a cumbersome model to a small model, the mimic approach transfers knowledge from a higher-level model (in this case, an acoustic model) to a lower-level model (a noisy-to-clean transformation). This can be seen in context in Figure 1. In this work, we improve our results using the mimic loss framework in two ways:
First, we propose a residual network for spectral mapping. A residual network model is a natural fit for the task of speech denoising, because like the model, the task involves computing a residual, i.e. the noise contained in the features. We find that a residual network architecture by itself works well for the task of speech enhancement, surpassing the performance of other front-end-only systems.
Second, we use a more sophisticated architecture for senone classification, since this is the backbone of mimic loss. This provides a more informative error signal to the spectral mapper. To achieve this goal, we choose Wide Residual BiLSTM Networks (WRBN) as the architecture for our senone classifier, which combines the effective feature extraction of residual networks and the long-term context modeling of recurrent networks [14, 15].
During evaluation, a forward pass through the residual spectral mapper generates denoised features, which are then fed to an off-the-shelf Kaldi recipe. These features achieve a much lower WER on their own as compared to DNN spectral mappers trained without mimic loss [5, 6]. With the addition of the stronger feedback from a senone classifier, we achieve results beating the state-of-the-art system, which includes both additional noise-robust features and joint training of the front-end denoiser with the acoustic model back-end.
... and recurrent neural network language models [17, 18]. Another approach is to use a more sophisticated acoustic model, such as Convolutional Neural Networks (CNNs) [19, 20], Recurrent Neural Networks (RNNs), and Residual Memory Networks (RMNs) that use residual connections with DNNs.
In terms of front-end models, DNNs are the most common approach, though RNNs have been used for speech enhancement as well. There have also been a few studies that used CNNs for front-end speech denoising [24, 25, 26]. In the last of these, the authors used a single "bypass" connection from the encoder to the decoder, but none of the models described here can be said to use residual connections. In addition, none of these authors evaluated the output of their model for the task of ASR.
Residual networks have seen success in computer vision [12, 27] and speech recognition [28, 13]. These networks add shortcut connections to a neural network that pass the output of some layers to higher layers. The shortcut connections allow the network to compute a modification of the input, called the residual, rather than having to re-compute the important parts of the input at every layer. This model seems a natural fit for the task of spectral mapping, which seeks to reproduce the input with the noise removed. We use an architecture similar to Wide ResNet with a small change: convolutional (channel-wise) dropout rather than conventional dropout. Architectural details are in Section 3.2.
Senone classification in speech recognition systems has improved due to recurrent neural networks. The horizontal connections in LSTMs work well in modeling the temporal nature of speech. On the other hand, convolutional neural networks are good at extracting useful patterns from spectral features. DNNs further complement the performance of these models by warping the speech manifold so that it resembles the senone feature space. The CNN-LSTM-DNN combination (CLDNN), used along with HMMs, has seen good results [29, 30]. Recently, wide residual networks have been adapted for noise-robust speech recognition in the CHiME-4 setting and used with LSTMs and DNNs. This network, called WRBN, has been reported to be a strong acoustic model.
Mimic loss, proposed in our previous work, is a kind of knowledge transfer that uses an acoustic model trained on clean speech to teach the speech enhancement model how to produce more realistic speech. Key to this idea is that the denoised speech should make a senone classifier behave as if it were operating on clean speech. In contrast to joint training, the mimic loss does not tie the speech enhancement model to the particular acoustic model used; the enhancement module can be decoupled and used as a pure pre-processing unit with another recognizer. More details can be found in Section 4.
In our previous work, we have shown that a DNN spectral mapper, which takes a noisy spectrogram as input to predict clean filterbank features for ASR, yields good results on the CHiME-2 noisy and reverberant dataset. Specifically, we first divide the input time-domain signals into 25 ms windows with a 10 ms shift, and then apply the short-time Fourier transform (STFT) with a Hamming window to compute log spectral magnitudes in each time frame. For a 16 kHz signal, each window contains 400 samples, and we use a 512-point Fourier transform to compute the magnitudes, forming a 257-dimensional log magnitude vector.
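The feature extraction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' pipeline: the function name, the absence of edge padding, and the small constant added before the logarithm are assumptions made here.

```python
import numpy as np

def log_magnitude_spectrogram(signal, sample_rate=16000, win_ms=25, hop_ms=10, n_fft=512):
    """Log spectral magnitudes from 25 ms Hamming windows with a 10 ms shift."""
    win_len = sample_rate * win_ms // 1000   # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz
    window = np.hamming(win_len)
    # Assumption: only complete windows are kept (no padding at the edges).
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([
        signal[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # rfft zero-pads each 400-sample frame to 512 points: 512 / 2 + 1 = 257 bins.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    # Assumption: a small floor avoids log(0) on silent frames.
    return np.log(np.abs(spectrum) + 1e-8)

rng = np.random.default_rng(0)
one_second = rng.standard_normal(16000)   # stand-in for one second of 16 kHz audio
features = log_magnitude_spectrogram(one_second)
print(features.shape)  # (98, 257)
```

The 257 columns correspond to the non-redundant half of the 512-point transform, which is where the 257-dimensional vector comes from.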
Many speech recognition systems extend the input features using delta and double-delta coefficients. These features are a simple arithmetic function of the surrounding frames. CNNs naturally learn filters of a similar nature to the delta function, and can easily learn to approximate these features if necessary. We find that the model works better without these redundant features. We use 5 frames of stacked context in both the past and the future (11 frames in total) for both DNNs and ResNets. Hence, the input feature dimension decreases to 2827 (257 × 11), compared to 8481 when delta features are included (257 × 11 × 3).
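As a sketch of the context stacking, the snippet below concatenates each frame with its 5 past and 5 future neighbours. The function name and the choice to repeat the first and last frames at the utterance boundaries are assumptions for illustration; the text does not say how edges are handled.

```python
import numpy as np

def stack_context(features, left=5, right=5):
    """Concatenate each frame with `left` past and `right` future frames."""
    n_frames, _ = features.shape
    # Assumption: boundary frames are repeated so every frame has full context.
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    windows = [padded[offset : offset + n_frames] for offset in range(left + right + 1)]
    return np.concatenate(windows, axis=1)

frames = np.arange(20 * 257, dtype=np.float64).reshape(20, 257)
stacked = stack_context(frames)
print(stacked.shape)  # (20, 2827)
```

The centre block of each stacked vector (columns 5 × 257 to 6 × 257) is the original frame; adding delta and double-delta features would triple the width to 8481.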
We use a baseline model for comparison, a DNN that is also a front-end-only system. Though this architecture is quite a bit simpler than the residual network architecture, similar architectures are commonly used in speech enhancement research [5, 7].
Unlike the proposed model, we add delta and double-delta features to the input for the baseline model, since a DNN cannot learn the delta function as easily. These features have been shown to dramatically improve ASR performance, and they improve spectral mapping performance as well.
Our baseline model is a 2-layer DNN with 2048 ReLU neurons in each layer, with an output layer of 257 neurons. We use batch norm and dropout to regularize the network. The batch norm uses the moving mean and variance at training time as well as test time. This is the same architecture used in prior work.
A residual network adds shortcut connections to neural network architectures, typically CNNs, in a way that causes the network to learn a modification of the original input, rather than being forced to reconstruct the important information at each layer. This usually takes the form of blocks of several neural network layers with the output of the first layer added to the output of the last layer, so that the interior layers can compute the residual.
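The block structure described above can be illustrated with a small NumPy sketch. It uses dense layers instead of convolutions to stay short, and the layer sizes and weight names are hypothetical; only the wiring (the first layer's output added to the last layer's output) follows the description.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w_first, w_inner, w_last):
    """First-layer output is added to last-layer output; inner layers model the residual."""
    shortcut = relu(x @ w_first)        # output of the first layer
    hidden = relu(shortcut @ w_inner)   # interior layer
    residual = hidden @ w_last          # output of the last layer
    return shortcut + residual

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w_first = rng.standard_normal((16, 32)) * 0.1
w_inner = rng.standard_normal((32, 32)) * 0.1
w_last = np.zeros((32, 32))             # a block that has learned "no correction"

out = residual_block(x, w_first, w_inner, w_last)
# With a zero residual, the block passes the shortcut through unchanged.
print(np.allclose(out, relu(x @ w_first)))  # True
```

The zero-weight case shows why the interior layers only need to learn a correction: doing nothing already reproduces the shortcut path.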
Adding these connections has several advantages: the training time is decreased, the networks can grow deeper, and the model tends to behave more like an ensemble of smaller models. In addition to all of those, however, we expect this model to be particularly good for the task of speech denoising, since the architecture matches the task at hand: reconstructing the input signal with the residual noise removed.
In previous work using CNNs for speech enhancement, it has been noted that performance sometimes degrades with the addition of max pooling between convolution layers. We also observe this phenomenon, and instead of doing max pooling, we use an additional CNN layer with stride to learn a down-sampling function. This layer has the additional effect of increasing the number of filters, so the output can be directly added to the output of the last layer of the block as a residual connection.
Inspired by Wide ResNet, we use dropout instead of batch normalization, though we use convolutional (channel-wise) dropout rather than conventional dropout in order to better preserve the local structure within each filter. This results in a small improvement in WER of around 0.2 percent. The authors also suggested that a shallower but wider network may work better than a very deep network; we use a network that is only 14 layers deep, and the layers are of comparable width to Wide ResNet. Adding either filters or layers did not improve performance.
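To make the distinction concrete, here is a minimal NumPy sketch of channel-wise dropout: a whole feature map is either kept or zeroed, so the structure inside each surviving map is untouched. The function name, the drop probability, the array layout, and the inverted-dropout rescaling are assumptions for illustration.

```python
import numpy as np

def channel_dropout(feature_maps, drop_prob, rng):
    """Drop entire channels of a (channels, time, frequency) array."""
    keep_prob = 1.0 - drop_prob
    # One keep/drop decision per channel, not per element.
    keep = rng.random(feature_maps.shape[0]) < keep_prob
    # Assumption: survivors are rescaled so the expected activation is unchanged.
    mask = keep[:, None, None] / keep_prob
    return feature_maps * mask

rng = np.random.default_rng(0)
maps = np.ones((128, 11, 257))           # hypothetical block output
dropped = channel_dropout(maps, drop_prob=0.3, rng=rng)
per_channel = dropped.reshape(128, -1)
# Every channel is constant: either all zeros or all 1 / 0.7.
print(all(np.all(row == row[0]) for row in per_channel))  # True
```

Conventional dropout would instead draw an independent mask value for every element, punching holes inside each feature map.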
The full architecture of this model can be seen in Figure 2. The first part of the model uses four convolutional blocks: two blocks of 128 filters, and two of 256 filters. After the convolutional blocks, we append two fully-connected layers of 2048 neurons and an output layer. The whole network uses ReLU neurons.
We found that there was some sensitivity to the training procedure for the residual network, so we report our procedure here. We use the Adam optimizer with a learning rate that decays exponentially, at a rate of 0.95 per fixed interval of steps. Following the training procedure for ResNets in the field of computer vision, we experimented with learning rate drops, going to one-tenth of the initial learning rate after convergence. We found that this resulted in sizable improvements in the fidelity loss (see Table 1 in Section 6), but no improvements in the final WER.
Training a model to faithfully reproduce the input via fidelity loss does not teach a model exactly what parts of the signal are important to focus on reproducing correctly. A lower learning rate allows the model to make more precise adjustments to its parameters, reproducing small details in the spectrogram more faithfully. However, the fact that these details don’t help for the task of speech recognition indicates that they are mostly irrelevant for speech comprehension.
In order to train with mimic loss, the two component models must first be pre-trained. We pre-train the spectral mapper to compute the function from the noisy spectral components (one per frequency bin) at each time slice, augmented with a five-frame context window, to the corresponding clean spectral slice. The criterion used for this pre-training is called the fidelity loss.
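One way to write such a criterion, assuming a mean squared error over the K = 257 frequency bins, is given below. The symbols f (the spectral mapper), \tilde{x}_t (the noisy slice at time t together with its context window), and y_t (the clean slice at time t) are notation chosen here for illustration rather than taken from the text.

```latex
L_f(\tilde{x}_t, y_t) = \frac{1}{K} \sum_{k=1}^{K} \bigl( y_{t,k} - f(\tilde{x}_t)_k \bigr)^2
```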
While the residual network spectral mapper trained with only fidelity loss results in performance better than previous front-end-only systems, we add mimic loss for an additional gain in performance. This is done by training a senone classifier to learn a function from clean speech input to a set of senones, and then freezing the weights of the model. The spectral mapper is then trained to mimic the behavior of clean speech by backpropagating the L2 loss between the acoustic model's outputs for the clean input and its outputs for the denoised input. The loss is computed at the output layer, before softmax is applied.
In early experiments, we found that using only mimic loss did not allow the model to converge, since it cares only about behavior and not the actual shape of the features. We therefore use a linear combination of fidelity loss and mimic loss:
$$L = L_f + \alpha L_m$$
where $L_f$ is the fidelity loss, $L_m$ is the mimic loss, and $\alpha$ is a hyper-parameter controlling the ratio of fidelity and mimic losses. For our experiments, we use one value of $\alpha$ when the mimic model is a DNN and another when the mimic model is a WRBN. These values were chosen to ensure that the magnitudes of the fidelity loss and mimic loss were roughly equal. Higher or lower values of $\alpha$ do not usually produce better results. The entire process for training with mimic loss can be seen in Figure 3.
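The combined objective can be sketched as follows. This is an illustration under stated assumptions, not the training code: both losses are taken to be mean squared errors, the frozen senone classifier is replaced by a hypothetical single linear layer producing pre-softmax outputs, and the weighting value is a placeholder rather than a value from the experiments.

```python
import numpy as np

def fidelity_loss(denoised, clean):
    """Mean squared error between denoised and clean features."""
    return np.mean((clean - denoised) ** 2)

def mimic_loss(denoised, clean, classifier):
    """Mean squared error between the classifier's pre-softmax outputs."""
    return np.mean((classifier(clean) - classifier(denoised)) ** 2)

def joint_loss(denoised, clean, classifier, weight):
    return fidelity_loss(denoised, clean) + weight * mimic_loss(denoised, clean, classifier)

rng = np.random.default_rng(0)
# Hypothetical stand-in for the frozen classifier: 257 features -> 1999 senone outputs.
frozen_weights = rng.standard_normal((257, 1999)) * 0.01

def classifier(features):
    return features @ frozen_weights

clean = rng.standard_normal((8, 257))
denoised = clean + 0.1 * rng.standard_normal((8, 257))   # imperfect mapper output
weight = 0.5                                             # placeholder, not from the paper

loss = joint_loss(denoised, clean, classifier, weight)
print(loss > fidelity_loss(denoised, clean))  # True
```

In actual training, the gradient of this scalar would flow through the frozen classifier into the spectral mapper only; an automatic differentiation framework would handle that step.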
In order to provide additional feedback to the spectral mapper, we train another model as a teaching model. This second model is trained for the task of senone classification, with clean speech as input. This model will ideally learn what parts of a speech signal are important for recognition and be able to help a spectral mapper model to learn to reproduce these important speech structures faithfully.
The loss used to train the senone classifier is a typical acoustic model criterion: the cross-entropy loss between the outputs of the classifier and the senone labels.
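As a point of reference for this criterion, the sketch below computes the softmax cross-entropy for a batch of frames, assuming the 1999 senone classes targeted in this work. The function name and the example labels are hypothetical.

```python
import numpy as np

def senone_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer senone labels."""
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

n_senones = 1999
uniform_logits = np.zeros((4, n_senones))   # a classifier that knows nothing
labels = np.array([0, 10, 500, 1998])
chance_loss = senone_cross_entropy(uniform_logits, labels)
print(round(chance_loss, 2))  # 7.6, i.e. log(1999)
```

An uninformed classifier scores about 7.6 on this scale, which gives some context for reading cross-entropy values reported for trained senone classifiers.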
We experiment with two different senone classifier models that are separate from the one used in the off-the-shelf Kaldi recipe used for recognition. This separation exists both in terms of the architecture and the particular parameter values that are used, which lends some support to our claim that our front-end model is not tied to any particular acoustic model. For both models, we target 1999 senone classes.
For our first model, we use a 6-layer 1024-node DNN with batch norm and leaky ReLU neurons (with a leak factor of 0.3), the same model used in prior work. Our second model is a WRBN model that has recently been shown to perform well on the CHiME-4 challenge. This allows us to add a sequential component to the training of the residual network via the senone classifier.
The WRBN model combines a wide residual network with a bi-directional LSTM model. The wide residual network consists of 3 residual blocks of 6 convolutional layers each, with 80, 160, and 320 channels. The first layer in each of the second and third blocks uses a stride of 2×2 to downsample, with a 1×1 convolutional layer bypass connection. Following these blocks is a linear layer.
The LSTM part of the model is a 2-layer network with 512 nodes per layer in each direction. After the first layer, the two directions are added together before being passed to the second layer, after which the two directions are concatenated. The last two layers in the network are linear. The entire network uses ELU activations, batch norm, and dropout.
Both classifier networks are trained using the Adam optimizer, with the learning rate set separately for the WRBN and for the DNN. We use 257-dimensional mean-normalized spectrogram features as input to the networks. Delta and delta-delta coefficients are not used. The DNN senone classifier uses a window of 5 context frames in the past and future, while the WRBN is trained on a per-utterance basis with full backpropagation through time.
The WRBN model achieved a cross-entropy loss of 1.1 on the clean speech development set, which is almost half the cross-entropy loss of the DNN model, which was 2.1, so we expect it to be able to provide much more helpful feedback to the spectral mapper model.
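Table 1: Best fidelity loss of each enhancement model on the development set.
| Enhancement Model | Fidelity loss |
| DNN spectral mapper | 0.52 |
| DNN spectral mapper with DNN mimic | 0.51 |
| DNN spectral mapper with WRBN mimic | 0.51 |
| Residual network mapper | 0.47 |
| Residual network mapper with learning rate drop | 0.44 |
| Residual network mapper with DNN mimic | 0.48 |
| Residual network mapper with WRBN mimic | 0.49 |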
We evaluate the quality of the denoised features produced with our residual network spectral mapper by training an off-the-shelf Kaldi recipe for Track 2 of the CHiME-2 challenge.
CHiME-2 is a medium-vocabulary task for word recognition under reverberant and noisy environments without speaker movements. In this task, three types of data are provided based on the Wall Street Journal (WSJ0) 5K vocabulary read speech corpus: clean, reverberant and reverberant+noisy. The clean utterances are extracted from the WSJ0 database. The reverberant utterances are created by convolving the clean speech with binaural room impulse responses (BRIR) corresponding to a frontal position in a family living room. Real-world non-stationary background noise recorded in the same room is mixed with the reverberant utterances to form the reverberant+noisy set. The noise excerpts are selected, without scaling, such that the signal-to-noise ratio (SNR) takes one of six values: -6, -3, 0, 3, 6 or 9 dB. The multi-condition training, development and test sets of the reverberant+noisy set contain 7138, 2454 and 1980 utterances respectively, which are the same utterances as in the clean set but with reverberation and noise at the 6 different SNR conditions.
In order to determine the effectiveness of our front-end system, we train an off-the-shelf Kaldi recipe for CHiME-2 on the denoised features. The DNN-HMM hybrid system is trained using the clean WSJ0-5k alignments generated using the method stated above. The DNN acoustic model has 7 hidden layers, with 2048 sigmoid neurons in each layer and a softmax output layer. The splicing context size for the filterbank features is fixed at 11 frames (5 frames of past and 5 frames of future context), with a minibatch size of 1024. After that, we train the DNN with state-level minimum Bayes risk (sMBR) sequence training. We regenerate the lattices after the first iteration and train for 4 more iterations. We use the CMU pronunciation dictionary and the official 5k closed-vocabulary trigram language model in our experiments.
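Table 2: WER of each enhancement model, obtained with an off-the-shelf Kaldi recipe.
| Enhancement Model | WER (%) |
| DNN spectral mapper | 16.0 |
| DNN spectral mapper with DNN mimic | 14.4 |
| DNN spectral mapper with WRBN mimic | 14.0 |
| Residual network mapper | 10.8 |
| Residual network mapper with DNN mimic | 10.5 |
| Residual network mapper with WRBN mimic | 9.3 |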
We report the best fidelity loss of all models on the development set in Table 1. Fidelity loss is a measure of how well a model can exactly reproduce the clean speech signal, not taking into account whether the denoised signal is speech-like or not. In terms of fidelity loss, our residual networks gain about 10% over the baseline models. With the learning rate drop that is common in vision tasks, residual networks gain an additional 5%. However, this improvement in fidelity loss did not translate to any gain in WER. The last entries in the table show that the residual network performs slightly worse in terms of fidelity loss when mimic is added, which is to be expected given that the objective is split between fidelity loss and mimic loss.
In addition to our fidelity loss results, we present robust speech recognition results, generated by presenting our denoised spectral features to an off-the-shelf Kaldi recipe. The results are shown in Table 2. One point of note is that the features generated by the DNN spectral mapper without mimic loss perform only a little better than the original noisy features, likely due to introduced distortions.
It is also interesting to note that the WER gain for the residual network is much more significant than the fidelity loss alone would suggest, reaching around 30% relative improvement. This improvement holds whether the model is trained with or without mimic loss. Finally, we note that using a more sophisticated WRBN mimic leads to a large improvement in the performance of the residual network spectral mapper, but only a small gain for the DNN mapper. We speculate that the modeling power of the DNN may be limited, since it has only two layers.
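The relative improvements can be checked directly from the WER figures in Table 2. The short script below is only a worked calculation; the dictionary layout and function name are chosen here for illustration.

```python
# WER (%) from Table 2, keyed by mimic setting: (DNN mapper, residual mapper).
wer = {
    "no mimic": (16.0, 10.8),
    "DNN mimic": (14.4, 10.5),
    "WRBN mimic": (14.0, 9.3),
}

def relative_improvement(baseline, improved):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

gains = {name: relative_improvement(dnn, resnet) for name, (dnn, resnet) in wer.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.1f}%")
# no mimic: 32.5%
# DNN mimic: 27.1%
# WRBN mimic: 33.6%
```

All three settings fall between roughly 27% and 34%, consistent with the figure of around 30% quoted above.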
Finally, we compare our best-performing model with other studies on the CHiME-2 test set that use only feature engineering and generation (i.e., systems relying on more sophisticated language models are not included). Even without mimic loss, our model performs much better than all other systems that use no additional noise-robust features or joint training of the front-end speech enhancer and acoustic model. With the addition of mimic loss, our model also performs 10% better than the state-of-the-art, which uses both of these.
We have enhanced the performance of the mimic loss framework with a ResNet-style architecture for spectral mapping and a more sophisticated senone classifier. The resulting system improves on the DNN baseline by almost 30% and achieves the best acoustic-only adaptation result without using additional noise-robust features or joint training of a speech enhancement module and ASR system.
One route to achieving improved WER may be to do mimic loss at a higher level, such as the word level rather than the senone level. Since other work has found that joint training all the way up to the word level has helped performance, we expect that this would help our denoiser.
For some tasks, targeting an ideal ratio mask which is then multiplied with the original signal has achieved higher performance than spectral mapping. We plan to apply mimic loss to the technique of spectral masking; if successful, we could extend our work to the CHiME-3 and CHiME-4 challenges where mask generation during the beamforming stage has achieved the state-of-the-art.
Our code is publicly available at https://github.com/OSU-slatelab/residual_mimic_net.
This work was supported by the National Science Foundation under Grant IIS-1409431. We also thank the Ohio Supercomputer Center (OSC) for providing us with computational resources. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 GPU used for this research.
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
“Integration of speech enhancement and recognition using long-short term memory recurrent neural network,” in Proc. Interspeech, 2015.