An Exploration of Mimic Architectures for Residual Network Based Spectral Mapping

September 25, 2018 · Peter Plantinga et al.

Spectral mapping uses a deep neural network (DNN) to map directly from noisy speech to clean speech. Our previous study found that the performance of spectral mapping improves greatly when the mapper receives helpful cues from an acoustic model trained on clean speech: the mapper network learns to mimic the input favored by the senone classifier and cleans the features accordingly. In this study, we explore two new innovations. First, we replace the DNN-based spectral mapper with a residual network that is better attuned to the goal of predicting clean speech. Second, we examine how integrating long-term context into the mimic criterion (via wide residual BiLSTM networks) affects spectral mapping performance compared to DNNs. Our goal is a model that can serve as a preprocessor for any recognition system; the features derived from our model, passed through the standard Kaldi ASR pipeline, achieve a WER of 9.3% using only feature adaptation.

1 Introduction

Applying deep learning to the task of Automatic Speech Recognition (ASR) has shown great progress recently in clean environments. However, these ASR systems still suffer from performance degradation in the presence of acoustic interference, such as additive noise and room reverberation.

One strategy to address this problem is to use a deep learning front-end for denoising the features, which are then fed to the ASR system. Some of these models estimate an ideal ratio mask (IRM) that is multiplied with the spectral features to remove noise from the speech signal [2]. Others utilize spectral mapping in the signal domain [3, 4] or in the feature domain [5, 6] to translate directly from noisy to clean speech without additional constraints.

When these pre-processing models were introduced, they could be easily decoupled from the rest of the ASR pipeline. This was useful because it provided a general-purpose speech denoising module that could be applied to any noisy data. Over time, impressive performance gains came from adding noise-robust features and jointly training the spectral mapper and acoustic model [7]. However, the front-end and back-end models in these approaches each depend on the presence of the other, i.e., the mapper cannot be re-used for another task or dataset without re-training. Moreover, adding robust features complicates feature creation and increases the number of parameters in the speech recognition pipeline.

Our previous work [1] introduced a form of knowledge transfer we dubbed mimic loss. Unlike student-teacher learning [8] or knowledge distillation [9, 10, 11] which transfer knowledge from a cumbersome model to a small model, the mimic approach transfers knowledge from a higher-level model (in this case, an acoustic model) to a lower-level model (a noisy to clean transformation). This can be seen in context in Figure 1. In this work, we improve our results using the mimic loss framework in two ways:

First, we propose a residual network [12] for spectral mapping. A residual network is a natural fit for speech denoising because, like the model, the task involves computing a residual, in this case the noise contained in the features. We find that a residual network architecture by itself works well for speech enhancement, surpassing the performance of other front-end-only systems.

Second, we use a more sophisticated architecture for senone classification, since this is the backbone of mimic loss; it provides a more informative error signal to the spectral mapper. To this end, we choose the Wide Residual BiLSTM Network (WRBN) [13] as our senone classifier, which combines the effective feature extraction of residual networks [12] with the long-term context modeling of recurrent networks [14, 15].

During evaluation, a forward pass through the residual spectral mapper generates denoised features, which are then fed to an off-the-shelf Kaldi recipe [16]. These features on their own achieve a much lower WER than those from DNN spectral mappers trained without mimic loss [5, 6]. With the addition of stronger feedback from a senone classifier, we achieve results beating the state-of-the-art system, which uses both additional noise-robust features and joint training of the front-end denoiser with the acoustic model back-end.

2 Prior Work

For the task of robust ASR, attention has been paid to strategies such as adding noise-robust features to acoustic models [7], using augmented training data [17], and applying recurrent neural network language models [17, 18]. Another approach is to use a more sophisticated acoustic model, such as Convolutional Neural Networks (CNNs) [19, 20], Recurrent Neural Networks (RNNs) [21], or Residual Memory Networks (RMNs) [22], which use residual connections with DNNs.

In terms of front-end models, DNNs are the most common approach [6], though RNNs have been used for speech enhancement as well, as in [23]. There have also been a few studies that used CNNs for front-end speech denoising [24, 25, 26]. In the last of these, the authors used a single "bypass" connection from the encoder to the decoder, but none of these models can be said to use residual connections. In addition, none of these authors evaluated the output of their models on the task of ASR.

Residual networks have seen success in computer vision [12, 27] and speech recognition [28, 13]. These networks add shortcut connections to a neural network that pass the output of some layers to higher layers. The shortcut connections allow the network to compute a modification of the input, called the residual, rather than having to re-compute the important parts of the input at every layer. This model seems a natural fit for the task of spectral mapping, which seeks to reproduce the input with the noise removed. We use an architecture similar to Wide ResNet with one small change: convolutional (channel-wise) dropout rather than conventional dropout. Architectural details are in Section 3.2.

Senone classification in speech recognition systems has improved due to recurrent neural networks: the horizontal connections in LSTMs model the temporal nature of speech well. Convolutional neural networks, on the other hand, are good at extracting useful patterns from spectral features, and DNNs further complement these models by warping the speech manifold so that it resembles the senone feature space. The CNN-LSTM-DNN combination (CLDNN), used with HMMs, has seen good results [29, 30]. Recently, wide residual networks have been adapted for noise-robust speech recognition in the CHiME-4 setting and combined with LSTMs and DNNs; this network, called WRBN, is reported to be a strong acoustic model [13].

Mimic loss, proposed in [1], is a kind of knowledge transfer that uses an acoustic model trained on clean speech to teach the speech enhancement model how to produce more realistic speech; key to this idea is that the denoised speech should make a senone classifier behave as if it were operating on clean speech. In contrast to joint training, mimic loss does not tie the speech enhancement model to the particular acoustic model used; the enhancement module can be decoupled and used as a pure pre-processing unit with another recognizer. More details can be found in Section 4.

Figure 1: System pipeline for spectral mapping with mimic loss. Step 1 (denoising): the raw waveform is converted to a noisy spectrogram via STFT, the spectral mapper produces a cleaned spectrogram, and mimic loss is computed from the acoustic model's posteriorgram. Step 2 (ASR): the cleaned spectrogram is passed through acoustic modeling and decoding to produce recognized words. Bold text indicates training a model.

Figure 2: Our residual network architecture consists of four blocks (two with 128 filters, two with 256 filters) and two fully-connected layers. Each block starts with a convolutional layer for down-sampling and increasing the number of filters. The output of this layer is used twice, once as input to the two convolutional layers that compute the residual, and again as the original signal that is modified by adding the computed residual.

3 Spectral Mapping

Spectral mapping improves the performance of the speech recognizer by learning a mapping from noisy spectral patterns to clean ones. In our previous work [5, 6], we showed that a DNN spectral mapper, which takes a noisy spectrogram as input and predicts clean filterbank features for ASR, yields good results on the CHiME-2 noisy and reverberant dataset. Specifically, we first divide the input time-domain signals into 25 ms windows with a 10 ms shift, and then apply a short-time Fourier transform (STFT) with a Hamming window to compute log spectral magnitudes in each time frame. For a 16 kHz signal, each window contains 400 samples, and we use a 512-point Fourier transform to compute the magnitudes, forming a 257-dimensional log magnitude vector.
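As an illustration, the following is a minimal NumPy sketch of this feature extraction (25 ms Hamming windows, 10 ms shift, 512-point FFT, 257 log-magnitude bins); the small epsilon floor on the magnitude is our own addition for numerical stability and not specified in the text.

```python
import numpy as np

def log_magnitude_spectrogram(signal, sr=16000, win_ms=25, hop_ms=10, n_fft=512, eps=1e-8):
    """Compute 257-dim log-magnitude STFT features: 25 ms Hamming windows, 10 ms shift."""
    win_len = int(sr * win_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)   # 160 samples at 16 kHz
    window = np.hamming(win_len)

    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([signal[i * hop_len:i * hop_len + win_len] * window
                       for i in range(n_frames)])

    # Zero-pad each 400-sample frame to 512 points; rfft keeps the 257 positive-frequency bins.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.log(np.abs(spectrum) + eps)   # shape: (n_frames, 257)
```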

Many speech recognition systems extend the input features with delta and double-delta coefficients. These features are a simple arithmetic function of the surrounding frames; CNNs naturally learn filters similar in nature to the delta function and can easily approximate these features if necessary. We find that the model works better without these redundant features. We use 5 frames of stacked context (both past and future) for both the DNN and ResNet mappers. Hence, the input feature dimension decreases to 2827 (257 × 11) compared to 8481 (257 × 3 × 11) when delta features are included.
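The context stacking can likewise be sketched in a few lines; the edge padding at utterance boundaries is an assumption, since the text does not state how boundary frames are handled.

```python
import numpy as np

def stack_context(features, context=5):
    """Stack 5 past and 5 future frames around each frame: (T, 257) -> (T, 11 * 257 = 2827)."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + len(features)] for i in range(2 * context + 1)], axis=1)
```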

3.1 Baseline Model

We use a baseline model for comparison, a DNN that is also a front-end-only system. Though this architecture is quite a bit simpler than the residual network architecture, similar architectures are commonly used in speech enhancement research [5, 7].

Unlike the proposed model, we add delta and double-delta features to the input for the baseline model, since a DNN cannot learn the delta function as easily. These features have been shown to dramatically improve ASR performance, and they improve spectral mapping performance as well.

Our baseline model is a 2-layer DNN with 2048 ReLU neurons in each layer and an output layer of 257 neurons. We use batch norm and dropout to regularize the network; the batch norm uses the moving mean and variance at training time as well as test time. This is the same architecture used in [1].
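A sketch of this baseline in PyTorch might look as follows; the dropout rate is an assumption, and the detail of using moving batch-norm statistics at training time is not reproduced here.

```python
import torch.nn as nn

class BaselineDNNMapper(nn.Module):
    """2-layer DNN mapper: 8481-dim input (log magnitudes plus deltas, 11-frame window)
    -> 257-dim clean log-magnitude frame, regularized with batch norm and dropout."""
    def __init__(self, in_dim=8481, hidden=2048, out_dim=257, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)
```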

3.2 ResNet Architecture

A residual network adds shortcut connections to neural network architectures, typically CNNs, in a way that causes the network to learn a modification of the original input, rather than being forced to reconstruct the important information at each layer. This usually takes the form of blocks of several neural network layers with the output of the first layer added to the output of the last layer, so that the interior layers can compute the residual.

Adding these connections has several advantages: the training time is decreased, the networks can grow deeper, and the model tends to behave more like an ensemble of smaller models [31]. In addition to all of those, however, we expect this model to be particularly good for the task of speech denoising, since the architecture matches the task at hand: reconstructing the input signal with the residual noisy signal removed.

Figure 3: When using mimic loss, the enhancement system is trained in two stages. In the pretraining stage, the senone classifier is trained on clean speech to predict senone labels with the cross-entropy criterion, and the spectral mapper is pretrained to map from noisy speech to clean speech using the MSE criterion (fidelity loss). In the mimic loss training stage, the pretrained spectral mapper is trained further using both the fidelity loss and the mimic loss, the loss between the two sets of classifier outputs when fed parallel clean and denoised utterances. The gray models in the figure have frozen weights.

In previous work using CNNs for speech enhancement, it has been noted that performance sometimes degrades with the addition of max pooling between convolution layers [25]. We also observe this phenomenon, and instead of doing max pooling, we use an additional CNN layer with stride to learn a down-sampling function. This layer has the additional effect of increasing the number of filters, so its output can be directly added to the output of the last layer of the block as a residual connection.

Inspired by Wide ResNet [27], we use dropout instead of batch normalization, though we use convolutional (channel-wise) dropout rather than conventional dropout in order to better preserve the local structure within each filter. This results in a small WER gain of around 0.2 percent. The authors of [27] also suggested that a shallower but wider network may work better than a very deep network; our network is only 14 layers deep, with layers of comparable width to Wide ResNet. Neither adding filters nor adding layers improved performance.

The full architecture of this model can be seen in Figure 2. The first part of the model uses four convolutional blocks: two blocks of 128 filters, and two of 256 filters. After the convolutional blocks, we append two fully-connected layers of 2048 neurons and an output layer. The whole network uses ReLU neurons.
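The following PyTorch sketch captures the structure of Figure 2; the kernel sizes, strides, and dropout rate are assumptions on our part, since the exact values appear only in the figure.

```python
import torch.nn as nn

class DownsampleResBlock(nn.Module):
    """One block from Figure 2: a strided conv that downsamples and widens the input,
    followed by two convs that compute a residual, which is added back to the
    downsampled signal. Kernel size, stride, and dropout rate are assumed values."""
    def __init__(self, in_ch, out_ch, stride=2, p_drop=0.3):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.residual = nn.Sequential(
            nn.ReLU(), nn.Dropout2d(p_drop),   # channel-wise (convolutional) dropout
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(), nn.Dropout2d(p_drop),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        x = self.down(x)
        return x + self.residual(x)            # shortcut connection

class ResNetMapper(nn.Module):
    """Four blocks (128, 128, 256, 256 filters), two 2048-unit FC layers, 257-dim output."""
    def __init__(self, hidden=2048, out_dim=257):
        super().__init__()
        self.blocks = nn.Sequential(
            DownsampleResBlock(1, 128),
            DownsampleResBlock(128, 128),
            DownsampleResBlock(128, 256),
            DownsampleResBlock(256, 256),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):                      # x: (batch, 1, 11 context frames, 257 freq bins)
        return self.head(self.blocks(x))
```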

3.3 Training the Residual Network

We found that there was some sensitivity to the training procedure for the residual network, so we report our procedure here. We use the Adam optimizer [32] with an exponential learning rate decay of 0.95 applied at regular step intervals. Following the training procedure for ResNets in the field of computer vision, we experimented with learning rate drops, going to one-tenth of the initial learning rate after convergence. We found that this resulted in sizable improvements in the fidelity loss (see Table 1 in Section 6), but no improvements in the final WER.

Training a model to faithfully reproduce the input via fidelity loss does not teach a model exactly what parts of the signal are important to focus on reproducing correctly. A lower learning rate allows the model to make more precise adjustments to its parameters, reproducing small details in the spectrogram more faithfully. However, the fact that these details don’t help for the task of speech recognition indicates that they are mostly irrelevant for speech comprehension.

4 Mimic Loss Training

In order to train with mimic loss, the two component models must first be pre-trained. We pre-train the spectral mapper to compute a function $g(\cdot)$ from a noisy spectral component $x_{f,t}$ for frequency $f$ at time slice $t$, augmented with a five-frame window on each side (designated $\hat{x}_t$), to the clean spectral slice $y_{f,t}$. The corresponding objective is the fidelity loss, written as follows:

$$\mathcal{L}_{\text{fid}} = \sum_{f,t} \left( g(\hat{x}_t)_f - y_{f,t} \right)^2 \qquad (1)$$

While the residual network spectral mapper trained with only fidelity loss already outperforms previous front-end-only systems, we add mimic loss for an additional gain in performance. This is done by training a senone classifier $h(\cdot)$ to map clean speech input to a set of senones, and then freezing the weights of this model. The spectral mapper is trained to mimic the behavior of clean speech by backpropagating the L2 loss between the classifier's responses to clean and denoised input. This loss is computed at the classifier's output layer, before the softmax is applied:

$$\mathcal{L}_{\text{mim}} = \sum_{t} \left\| h(y_t) - h(g(\hat{x}_t)) \right\|_2^2 \qquad (2)$$

In early experiments, we found that using only mimic loss did not allow the model to converge, since it cares only about the classifier's behavior and not the actual shape of the features. We therefore use a linear combination of fidelity loss and mimic loss:

$$\mathcal{L} = \mathcal{L}_{\text{fid}} + \alpha \, \mathcal{L}_{\text{mim}} \qquad (3)$$

where $\alpha$ is a hyper-parameter controlling the ratio of fidelity and mimic losses. We set $\alpha$ separately for the DNN and WRBN mimic models, choosing values so that the magnitudes of the fidelity loss and mimic loss are roughly equal; substantially higher or lower values of $\alpha$ do not produce better results. The entire mimic loss training process can be seen in Figure 3.
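A simplified PyTorch sketch of one training step under this combined objective is given below; it glosses over the context-window handling for the classifier input, and the default value of `alpha` is only a placeholder (the paper sets it so the two loss magnitudes match).

```python
import torch
import torch.nn.functional as F

def mimic_training_step(mapper, frozen_classifier, noisy, clean, alpha=1.0, optimizer=None):
    """One step of joint fidelity + mimic training (Eqs. 1-3).
    `frozen_classifier` should be in eval mode with requires_grad=False on all parameters;
    `alpha` is a placeholder value balancing the two loss terms."""
    denoised = mapper(noisy)                          # cleaned spectral frames

    # Fidelity loss (Eq. 1): MSE between denoised output and parallel clean frames.
    fidelity = F.mse_loss(denoised, clean)

    # Mimic loss (Eq. 2): L2 distance between the classifier's pre-softmax outputs
    # for clean vs. denoised input. No gradient is needed for the clean-speech targets.
    with torch.no_grad():
        target_logits = frozen_classifier(clean)
    mimic = F.mse_loss(frozen_classifier(denoised), target_logits)

    loss = fidelity + alpha * mimic                   # Eq. (3): linear combination
    if optimizer is not None:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return float(loss)
```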

4.1 Senone Classification

In order to provide additional feedback to the spectral mapper, we train another model as a teaching model. This second model is trained for the task of senone classification, with clean speech as input. This model will ideally learn what parts of a speech signal are important for recognition and be able to help a spectral mapper model to learn to reproduce these important speech structures faithfully.

The loss used to train the senone classifier is the typical acoustic model criterion: the cross-entropy loss between the classifier outputs $h(y_t)$ and the senone labels $s_t$, where $h(\cdot)$ is the function computed by the classifier.

4.2 Senone Classifier Models

We experiment with two different senone classifier models that are separate from the one used in the off-the-shelf Kaldi recipe used for recognition. This separation exists both in terms of the architecture and particular parameter values that are used, which gives some evidence to our claim that our front-end model is not tied to any particular acoustic model. For both models, we target 1999 senone classes.

For our first model, we use a 6-layer 1024-node DNN with batch norm and leaky ReLU neurons (with a leak factor of 0.3), the same model used in [1]. Our second model is a WRBN model that has recently been shown to perform well on the CHiME-4 challenge [13]. This allows us to add a sequential component to the training of the residual network via the senone classifier.
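A sketch of this first classifier (6 layers of 1024 units, batch norm, leaky ReLU with slope 0.3, 1999 senone outputs, and the 2827-dimensional windowed input described in Section 4.2) might look like:

```python
import torch.nn as nn

def build_dnn_senone_classifier(in_dim=2827, hidden=1024, n_senones=1999, n_layers=6):
    """6-layer, 1024-unit DNN senone classifier with batch norm and leaky ReLU (slope 0.3).
    The 2827-dim input is a mean-normalized spectrogram frame with 5 context frames per side."""
    layers, prev = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(prev, hidden), nn.BatchNorm1d(hidden), nn.LeakyReLU(0.3)]
        prev = hidden
    layers.append(nn.Linear(prev, n_senones))   # pre-softmax senone logits
    return nn.Sequential(*layers)
```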

The WRBN model combines a wide residual network with a bi-directional LSTM model. The wide residual network consists of 3 residual blocks of 6 convolutional layers each, with 80, 160, and 320 channels. The first layer in the second and third blocks uses a stride of 2 × 2 to downsample, with a 1 × 1 convolutional layer as the bypass connection. Following these blocks is a linear layer.

The LSTM part of the model is a 2-layer network with 512 nodes per layer in each direction. After the first layer, the two directions are added together before being passed to the second layer, after which the two directions are concatenated. The last two layers in the network are linear. The entire network uses ELU activations [33], batch norm, and dropout.
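The recurrent portion can be sketched as follows; the sizes of the final linear layers and the dropout rate are assumptions, and the wide-residual front end that feeds this module is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WRBNRecurrentHead(nn.Module):
    """BiLSTM portion of WRBN: two bidirectional layers with 512 units per direction.
    The two directions are summed after the first layer and concatenated after the
    second, followed by two linear layers; sizes of the linear part and the dropout
    rate are assumed values."""
    def __init__(self, in_dim, hidden=512, n_senones=1999, p_drop=0.3):
        super().__init__()
        self.hidden = hidden
        self.lstm1 = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.linear1 = nn.Linear(2 * hidden, hidden)
        self.linear2 = nn.Linear(hidden, n_senones)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                                  # x: (batch, time, in_dim)
        h, _ = self.lstm1(x)                               # (batch, time, 2 * hidden)
        h = h[..., :self.hidden] + h[..., self.hidden:]    # sum the two directions
        h, _ = self.lstm2(self.drop(h))                    # keep concatenated directions
        h = F.elu(self.linear1(self.drop(h)))
        return self.linear2(h)                             # per-frame senone logits
```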

Both classifier networks are trained using the Adam optimizer [32], with separate learning rates for the WRBN and the DNN. We use 257-dimensional mean-normalized spectrogram features as input to the networks; delta and delta-delta coefficients are not used. The DNN senone classifier uses a window of 5 context frames in the past and future, while the WRBN is trained per utterance with full backpropagation through time.

The WRBN model achieved a cross-entropy loss of 1.1 on the clean speech development set, almost half that of the DNN model (2.1), so we expect it to provide much more helpful feedback to the spectral mapper.

5 Experiments

Enhancement Model            Fidelity loss
DNN spectral mapper          0.52
  with DNN mimic             0.51
  with WRBN mimic            0.51
Residual network mapper      0.47
  with learn rate drop       0.44
  with DNN mimic             0.48
  with WRBN mimic            0.49

Table 1: Fidelity loss on the development set for our baseline model and the residual network, both with and without mimic loss training.

We evaluate the quality of the denoised features produced with our residual network spectral mapper by training an off-the-shelf Kaldi recipe for Track 2 of the CHiME-2 challenge [34].

5.1 Task and data description

CHiME-2 is a medium-vocabulary task for word recognition under reverberant and noisy environments without speaker movements. In this task, three types of data are provided based on the Wall Street Journal (WSJ0) 5K vocabulary read speech corpus: clean, reverberant and reverberant+noisy. The clean utterances are extracted from the WSJ0 database. The reverberant utterances are created by convolving the clean speech with binaural room impulse responses (BRIR) corresponding to a frontal position in a family living room. Real-world non-stationary noise background recorded in the same room is mixed with the reverberant utterances to form the reverberant+noisy set. The noise excerpts are selected such that the signal-to-noise ratio (SNR) ranges over -6, -3, 0, 3, 6 and 9 dB without scaling. The multi-condition training, development and test sets of the reverberant+noisy set contain 7138, 2454 and 1980 utterances respectively, which are the same utterances as the clean set but with reverberation and noise at 6 different SNR conditions.

5.2 Description of the Kaldi recipe

In order to determine the effectiveness of our front-end system, we train an off-the-shelf Kaldi recipe for CHiME-2 on the denoised features. The DNN-HMM hybrid system is trained using the clean WSJ0-5k alignments generated as described above. The DNN acoustic model has 7 hidden layers, with 2048 sigmoid neurons in each layer and a softmax output layer. The splicing context for the filterbank features is fixed at 11 frames (5 frames of past and 5 frames of future context), with a minibatch size of 1024. After that, we train the DNN with state-level minimum Bayes risk (sMBR) sequence training, regenerating the lattices after the first iteration and training for 4 more iterations. We use the CMU pronunciation dictionary and the official 5k closed-vocabulary trigram language model in our experiments.

6 Results

Enhancement Model            WER
No enhancement               17.3
DNN spectral mapper          16.0
  with DNN mimic             14.4
  with WRBN mimic            14.0
Residual network mapper      10.8
  with DNN mimic             10.5
  with WRBN mimic            9.3

Table 2: Word error rates after generating denoised features and feeding them to the off-the-shelf Kaldi recipe for training. The first line for each model gives the WER for training with fidelity loss only; the indented lines use the joint fidelity-mimic loss.

We report the best fidelity loss of all models on the development set in Table 1. Fidelity loss records how well a model can exactly reproduce the clean speech signal, without taking into account whether the denoised signal is speech-like. In terms of fidelity loss, our residual networks gain about 10% over the baseline models. With the learning rate drop that is common in vision tasks, residual networks gain an additional 5%; however, this improvement in fidelity loss did not translate into any gain in WER. The last entries in the table show that the residual network performs slightly worse in terms of fidelity loss when mimic loss is added, which is to be expected given that the objective is split between fidelity loss and mimic loss.

In addition to our fidelity loss results, we present robust speech recognition results, generated by presenting our denoised spectral features to an off-the-shelf Kaldi recipe. The results are shown in Table 2. One point of note is that the features generated by the DNN spectral mapper without mimic loss only perform a little better than the original noisy features, likely due to introduced distortions [35].

It is also interesting to note that the WER gain for the residual network is much more significant than the fidelity loss alone would suggest, reaching around 30% relative improvement. This improvement holds whether the model is trained with or without mimic loss. Finally, we note that using a more sophisticated WRBN mimic leads to a large improvement in the performance of the residual network spectral mapper, but only a small gain for the DNN mapper. We speculate that the modeling power of the DNN may be limited, since it has only two layers.

Finally, we compare our best-performing model with other studies on the CHiME-2 test set that use only feature engineering and generation (systems using, e.g., more sophisticated language models are excluded). Even without mimic loss, our model performs much better than all other systems that use no additional noise-robust features or joint training of the front-end speech enhancer and acoustic model. With the addition of mimic loss, our model also performs 10% better than the state-of-the-art system, which uses both of these.

Study                    Additional NR features   Joint ASR training   WER
Chen et al. [21]         -                        ✓                    16.0
Narayanan-Wang [2]       ✓                        ✓                    15.4
Bagchi et al. [1]        -                        -                    14.7
Weninger et al. [23]     -                        ✓                    13.8
Wang et al. [7]          ✓                        ✓                    10.6
Residual network         -                        -                    10.8
ResNet + mimic loss      -                        -                    9.3

Table 3: Performance comparison with other studies on the CHiME-2 test set. "Additional NR features" indicates that noise-robust features are added. "Joint ASR training" indicates that the final ASR system and enhancement model are jointly tuned. Our previous system [1] was a DNN trained with joint fidelity-mimic loss.

7 Conclusions

We have enhanced the performance of the mimic loss framework with the help of a ResNet-style architecture for spectral mapping and a more sophisticated senone classifier, achieving an almost 30% relative improvement over the DNN baseline and the best result from acoustic feature adaptation alone, without using additional noise-robust features or joint training of a speech enhancement module and ASR system.

One route to achieving improved WER may be to do mimic loss at a higher level, such as the word level rather than the senone level. Since other work has found that joint training all the way up to the word level has helped performance, we expect that this would help our denoiser.

For some tasks, targeting an ideal ratio mask, which is then multiplied with the original signal, has achieved higher performance than spectral mapping. We plan to apply mimic loss to the technique of spectral masking; if successful, we could extend our work to the CHiME-3 and CHiME-4 challenges, where mask generation during the beamforming stage has achieved state-of-the-art results.

Our code is publicly available at https://github.com/OSU-slatelab/residual_mimic_net.

8 Acknowledgements

This work was supported by the National Science Foundation under Grant IIS-1409431. We also thank the Ohio Supercomputer Center (OSC) [36] for providing us with computational resources. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 GPU used for this research.

References

  • [1] Deblin Bagchi, Peter Plantinga, Adam Stiff, and Eric Fosler-Lussier, “Spectral feature mapping with mimic loss for robust speech recognition,” in Audio, Speech, and Signal Processing (ICASSP), International Conference on, 2018.
  • [2] Arun Narayanan and DeLiang Wang, “Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 92–101, 2015.
  • [3] Kun Han, Yuxuan Wang, and DeLiang Wang, “Learning spectral mapping for speech dereverberation,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4628–4632.
  • [4] Kun Han, Yuxuan Wang, DeLiang Wang, William S Woods, Ivo Merks, and Tao Zhang, “Learning spectral mapping for speech dereverberation and denoising,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982–992, 2015.
  • [5] Kun Han, Yanzhang He, Deblin Bagchi, Eric Fosler-Lussier, and DeLiang Wang, “Deep neural network based spectral feature mapping for robust speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [6] Deblin Bagchi, Michael I Mandel, Zhongqiu Wang, Yanzhang He, Andrew Plummer, and Eric Fosler-Lussier, “Combining spectral feature mapping and multi-channel model-based source separation for noise-robust automatic speech recognition,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 496–503.
  • [7] Zhong-Qiu Wang and DeLiang Wang, “A joint training framework for robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 796–806, 2016.
  • [8] Jimmy Ba and Rich Caruana, “Do deep nets really need to be deep?,” in Advances in neural information processing systems, 2014, pp. 2654–2662.
  • [9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [10] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik, “Unifying distillation and privileged information,” arXiv preprint arXiv:1511.03643, 2015.
  • [11] Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong, “Learning small-size DNN with output-distribution-based criteria,” in Fifteenth annual conference of the international speech communication association, 2014.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [13] Jahn Heymann, Lukas Drude, and Reinhold Haeb-Umbach, “Wide residual BLSTM network with discriminative speaker adaptation for robust speech recognition,” in Proceedings of the 4th International Workshop on Speech Processing in Everyday Environments (CHiME’16), 2016, pp. 12–17.
  • [14] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.
  • [15] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.
  • [16] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, “The kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Dec. 2011, IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRW-USB.
  • [17] Jun Du, Yan-Hui Tu, Lei Sun, Feng Ma, Hai-Kun Wang, Jia Pan, Cong Liu, Jing-Dong Chen, and Chin-Hui Lee, “The USTC-iFlytek system for CHiME-4 challenge,” Proc. CHiME, pp. 36–38, 2016.
  • [18] Takuya Yoshioka, Nobutaka Ito, Marc Delcroix, Atsunori Ogawa, Keisuke Kinoshita, Masakiyo Fujimoto, Chengzhu Yu, Wojciech J Fabian, Miquel Espi, Takuya Higuchi, et al., “The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 436–443.
  • [19] Yanmin Qian and Philip C Woodland, “Very deep convolutional neural networks for robust speech recognition,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 481–488.
  • [20] Yu Zhang, William Chan, and Navdeep Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4845–4849.
  • [21] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Integration of speech enhancement and recognition using long-short term memory recurrent neural network,” in Proc. Interspeech, 2015.
  • [22] Murali Karthick Baskar, Martin Karafiát, Lukáš Burget, Karel Veselỳ, František Grézl, and Jan Černockỳ, “Residual memory networks: Feed-forward approach to learn long-term temporal dependencies,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4810–4814.
  • [23] Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R Hershey, and Björn Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91–99.
  • [24] Like Hui, Meng Cai, Cong Guo, Liang He, Wei-Qiang Zhang, and Jia Liu, “Convolutional maxout neural networks for speech separation,” in Signal Processing and Information Technology (ISSPIT), 2015 IEEE International Symposium on. IEEE, 2015, pp. 24–27.
  • [25] Szu-Wei Fu, Yu Tsao, and Xugang Lu, “SNR-aware convolutional neural network modeling for speech enhancement.,” in Proc. Interspeech, 2016, pp. 3768–3772.
  • [26] Se Rim Park and Jin Won Lee, “A fully convolutional neural network for speech enhancement,” Proc. Interspeech 2017, pp. 1993–1997, 2017.
  • [27] Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” in Proceedings of the British Machine Vision Conference (BMVC), Edwin R. Hancock Richard C. Wilson and William A. P. Smith, Eds. September 2016, pp. 87.1–87.12, BMVA Press.
  • [28] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “The Microsoft 2016 conversational speech recognition system,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5255–5259.
  • [29] Tara N Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4580–4584.
  • [30] Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [31] Andreas Veit, Michael J Wilber, and Serge Belongie, “Residual networks behave like ensembles of relatively shallow networks,” in Advances in Neural Information Processing Systems, 2016, pp. 550–558.
  • [32] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [33] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
  • [34] Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni, “The second ‘CHiME’ speech separation and recognition challenge: Datasets, tasks and baselines,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 126–130.
  • [35] Arun Narayanan and DeLiang Wang, “Investigation of speech separation as a front-end for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 826–835, 2014.
  • [36] Ohio Supercomputer Center, “Ohio supercomputer center,” http://osc.edu/ark:/19495/f5s1ph73, 1987.