1 Introduction
Speech enhancement [1, 2] is one of the cornerstones of building robust automatic speech recognition (ASR) and communication systems. The problem is of special importance nowadays, as modern systems are often built using data-driven approaches based on large-scale deep neural networks [3, 4]. In this scenario, the mismatch between the clean data used to train the system and the noisy data encountered at deployment often degrades recognition accuracy in practice; speech enhancement algorithms act as a preprocessing module that reduces the noise in speech signals before they are fed into these systems.
Speech enhancement is a classic problem that has attracted substantial research effort over several decades. By making assumptions about the nature of the underlying noise, statistics-based approaches, including the spectral subtraction method [5], the minimum mean-square error log-spectral method [6], etc., can often obtain analytic solutions for noise suppression. However, because these assumptions are often unrealistic, most statistics-based approaches fail to build estimators that approximate complex real-world scenarios well. As a result, additional noisy artifacts are usually introduced into the recovered signals
[7].

Related Work. Due to the availability of high-quality, large-scale data and rapidly growing computational resources, data-driven approaches using regression-based deep neural networks have attracted much interest and demonstrated substantial performance improvements over traditional statistics-based methods [8, 9, 10, 11, 12]. The general idea of using deep neural networks, or more specifically MLPs, for noise reduction is not new [13, 14], and dates back at least to [15]. In these works, MLPs are applied as general nonlinear function approximators to learn the mapping from a noisy utterance to its clean version. A multivariate regression-based objective is then optimized using numerical methods to fit the model parameters. To capture the temporal nature of speech signals, previous works also introduced recurrent neural networks (RNNs) [16], which remove the need for an explicit choice of context window in MLPs.
Contributions. We propose an end-to-end model based on convolutional and recurrent neural networks for speech enhancement, which we term EHNet. EHNet is purely data-driven and does not make any assumptions about the underlying noise. It consists of three components: a convolutional component exploits the local patterns in the spectrogram in both the frequency and temporal domains; a bidirectional recurrent component then models the dynamic correlations between consecutive frames; and a final fully-connected layer predicts the clean spectrogram. Compared with existing models such as MLPs and RNNs, due to the sparse nature of convolutional kernels, EHNet is much more data-efficient and computationally tractable. Furthermore, the bidirectional recurrent component allows EHNet to model the dynamic correlations between consecutive frames adaptively, achieving better generalization on both seen and unseen noise. Empirically, we evaluate the effectiveness of EHNet and compare it with state-of-the-art methods on a synthetic dataset, showing that EHNet achieves the best performance among all competitors on all 5 metrics. Specifically, our model yields up to a 0.6 improvement in the PESQ measure [17] on seen noise and a 0.64 improvement on unseen noise.
2 Models and Learning
In this section we introduce the proposed model, EHNet, in detail and discuss its design principles as well as its inductive bias toward solving the enhancement problem. At a high level, we view the enhancement problem as a multivariate regression problem, where the nonlinear regression function is parametrized by the network in Fig. 1. Alternatively, the whole network can be interpreted as a complex filter for noise reduction in the frequency domain.
2.1 Problem Formulation
Formally, let $\mathbf{X} \in \mathbb{R}^{F \times T}$ be the noisy spectrogram and $\mathbf{Y} \in \mathbb{R}^{F \times T}$ be its corresponding clean version, where $F$ is the dimension of each frame, i.e., the number of frequency bins in the spectrogram, and $T$ is the length of the spectrogram. Given a training set $\mathcal{D} = \{(\mathbf{X}_i, \mathbf{Y}_i)\}_{i=1}^{n}$ of pairs of noisy and clean spectrograms, the problem of speech enhancement can be formalized as finding a mapping $f_\theta$ that maps a noisy utterance to a clean one, where $f_\theta$ is parametrized by $\theta$. We then solve the following optimization problem to find the best model parameter $\theta^*$:

$$\theta^* = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \| f_\theta(\mathbf{X}_i) - \mathbf{Y}_i \|_F^2 \qquad (1)$$

Under this setting, the key is to find a parametric family for the denoising function $f_\theta$ such that it is both rich and data-efficient.
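As a rough illustration of the objective in Eq. (1), the loss can be sketched in plain Python; the function and variable names (`mse_objective`, `denoise`) are illustrative, not from the paper:

```python
# Illustrative sketch of Eq. (1): given a parametric denoiser f_theta,
# minimize the mean-squared error between predicted and clean spectrograms
# averaged over the training set.

def mse_objective(denoise, training_set):
    """Average squared Frobenius error over (noisy, clean) spectrogram pairs."""
    total = 0.0
    for noisy, clean in training_set:
        pred = denoise(noisy)  # predicted clean spectrogram, same shape as input
        total += sum((p - c) ** 2
                     for p_row, c_row in zip(pred, clean)
                     for p, c in zip(p_row, c_row))
    return total / len(training_set)

# Identity "denoiser" on a toy 2x3 spectrogram: the loss equals the
# energy of the residual noise (here a single bin differing by 1.0).
noisy = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
clean = [[1.0, 2.0, 3.0], [4.0, 5.0, 5.0]]
print(mse_objective(lambda x: x, [(noisy, clean)]))  # -> 1.0
```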
2.2 Convolutional Component
One choice for the denoising function is vanilla multilayer perceptrons, which have been extensively explored in the past few years [8, 9, 10, 11]. However, despite being universal function approximators [18], the fully-connected network structure of MLPs usually cannot exploit the rich patterns present in spectrograms. For example, as we can see in Fig. 1, signals in the spectrogram tend to be continuous along the time dimension, and they also have similar values in adjacent frequency bins. This key observation motivates us to apply convolutional neural networks to efficiently and cheaply extract local patterns from the input spectrogram.
Let $\mathbf{K} \in \mathbb{R}^{h \times w}$ be a convolutional kernel of size $h \times w$ (frequency $\times$ time). We define a feature map $\mathbf{F}$ to be the convolution of the spectrogram $\mathbf{X}$ with kernel $\mathbf{K}$, followed by an element-wise nonlinear mapping $\sigma$: $\mathbf{F} = \sigma(\mathbf{X} * \mathbf{K})$. Throughout the paper, we choose $\sigma$ to be the rectified linear function (ReLU), as it has been extensively verified to be effective in alleviating the notorious vanishing-gradient problem in practice [19]. Each such convolutional kernel produces a 2D feature map, and we apply $k$ separate convolutional kernels to the input spectrogram, leading to a collection of $k$ 2D feature maps $\mathbf{F}_1, \ldots, \mathbf{F}_k$. It is worth pointing out that without padding and with unit stride, the size of each feature map is $(F - h + 1) \times (T - w + 1)$, where $F$ is the number of frequency bins and $T$ the number of frames. However, in order to recover the original speech signal, we need to ensure that the final prediction of the model has exactly the same length in the time dimension as the input spectrogram. To this end, we choose $w$ to be an odd integer and apply a zero-padding of size $(w - 1)/2$ at both ends of the time axis of $\mathbf{X}$ before the convolution is applied. This guarantees that each feature map has $T$ time steps, matching that of $\mathbf{X}$. On the other hand, because of the local similarity of the spectrogram in adjacent frequency bins, when convolving $\mathbf{X}$ with the kernel $\mathbf{K}$, we propose to use a stride of size $s > 1$ along the frequency dimension. As we will see in Sec. 3, such a design greatly reduces the number of parameters and the computation needed in the following recurrent component, without losing any prediction accuracy.
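The feature-map sizes above follow standard convolution arithmetic, which can be sketched as follows (the concrete kernel width, bin count, and stride values are illustrative assumptions, not the paper's configuration):

```python
def conv_output_size(n, kernel, pad, stride):
    """Standard convolution arithmetic: floor((n + 2*pad - kernel) / stride) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

# Time axis: with an odd kernel width w, zero-padding (w - 1) // 2 on both
# sides, and unit stride, the output length equals the input length.
w = 11  # illustrative odd kernel width (the context window)
print(conv_output_size(100, w, (w - 1) // 2, 1))  # -> 100

# Frequency axis: a stride > 1 shrinks the feature map, reducing the
# input size (and hence the parameters) of the recurrent component.
print(conv_output_size(256, 32, 0, 16))  # -> 15
```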
Remark. We conclude this section by emphasizing that the application of convolution kernels is particularly well suited for speech enhancement in the frequency domain: each kernel can be understood as a nonlinear filter that detects a specific kind of local pattern present in the noisy spectrograms, and the width of the kernel has a natural interpretation as the length of the context window. On the computational side, since a convolution layer can also be understood as a special case of a fully-connected layer with shared and sparse connection weights, the introduction of convolutions can greatly reduce the computation needed by an MLP with the same expressive power.
2.3 Bidirectional Recurrent Component
To automatically model the dynamic correlations between adjacent frames in the noisy spectrogram, we introduce bidirectional recurrent neural networks (BRNNs) that have recurrent connections in both directions. The output of the convolutional component is a collection of $k$ feature maps $\mathbf{F}_1, \ldots, \mathbf{F}_k$. Before feeding those feature maps into a BRNN, we need to first transform them into a single 2D feature map by vertically concatenating them along the feature dimension: $\mathbf{Z} = [\mathbf{F}_1; \ldots; \mathbf{F}_k]$. The stacked 2D feature map $\mathbf{Z}$ contains all the information from the convolutional feature maps.
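The vertical concatenation can be sketched as follows (a toy illustration with our own names; each feature map is stored as a list of rows):

```python
def stack_feature_maps(maps):
    """Vertically concatenate k feature maps along the feature (row)
    dimension; every map must share the same number of time steps."""
    stacked = []
    for fmap in maps:
        stacked.extend(fmap)
    return stacked

# Two toy 2x3 feature maps -> one 4x3 map; the time dimension (3 steps)
# is unchanged, so the recurrent component still sees one column per frame.
f1 = [[1, 2, 3], [4, 5, 6]]
f2 = [[7, 8, 9], [0, 1, 2]]
z = stack_feature_maps([f1, f2])
print(len(z), len(z[0]))  # -> 4 3
```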
In EHNet, we use deep bidirectional long short-term memory (LSTM) [20] as our recurrent component due to its ability to model long-term interactions. At each time step $t$, given the input $\mathbf{z}_t$, each unidirectional LSTM cell computes a hidden representation $\mathbf{h}_t$ using its internal gates:

$$\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{z}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i) \qquad (2)$$
$$\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{z}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f) \qquad (3)$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{z}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o) \qquad (4)$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_c \mathbf{z}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c) \qquad (5)$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \qquad (6)$$

where $\sigma(\cdot)$ is the sigmoid function, $\odot$ denotes the element-wise product, and $\mathbf{i}_t$, $\mathbf{o}_t$ and $\mathbf{f}_t$ are the input gate, the output gate and the forget gate, respectively. The hidden representation of the bidirectional LSTM is then a concatenation of the forward and backward representations: $\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t]$. To build deep bidirectional LSTMs, we stack additional LSTM layers on top of each other.

2.4 Fully-connected Component and Optimization
Let $\mathbf{h}_t$ be the output of the bidirectional LSTM layer at frame $t$. To obtain the estimated clean spectrogram, we apply a linear regression with truncation to ensure the prediction lies in the nonnegative orthant. Formally, for each frame $t$, we have:

$$\hat{\mathbf{y}}_t = \max(\mathbf{0},\; \mathbf{W} \mathbf{h}_t + \mathbf{b}) \qquad (7)$$
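The truncated linear regression of Eq. (7) can be sketched per frame as follows (toy dimensions; `predict_frame` is an illustrative name, not from the paper):

```python
def predict_frame(h, W, b):
    """Linear regression with truncation (Eq. (7)): each output frequency
    bin is max(0, w . h + b_j), keeping the predicted magnitude non-negative."""
    return [max(0.0, sum(wi * hi for wi, hi in zip(row, h)) + bj)
            for row, bj in zip(W, b)]

# Toy example: a 2-dim hidden state mapped to 2 frequency bins; the
# negative pre-activation in the first bin is truncated to zero.
h = [1.0, -2.0]
W = [[1.0, 1.0], [0.5, -0.5]]
b = [0.0, 0.0]
print(predict_frame(h, W, b))  # -> [0.0, 1.5]
```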
As discussed in Sec. 2.1, the last step is to compute the mean-squared error between the predicted spectrogram $\hat{\mathbf{Y}}$ and the clean one $\mathbf{Y}$, and optimize all the model parameters simultaneously. Specifically, we use AdaDelta [21] with a scheduled learning rate [22] to ensure convergence to a stationary solution.
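For reference, the AdaDelta update rule [21] can be sketched for a single scalar parameter as follows; this is the textbook rule with the default hyperparameters suggested in [21], not the paper's training code:

```python
import math

class AdaDelta:
    """Minimal scalar AdaDelta: per-parameter step sizes derived from
    running averages of squared gradients and squared updates, with no
    manually tuned base learning rate. A sketch, not the paper's trainer."""

    def __init__(self, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.eg2 = 0.0   # running average of squared gradients
        self.edx2 = 0.0  # running average of squared updates

    def step(self, grad):
        self.eg2 = self.rho * self.eg2 + (1 - self.rho) * grad * grad
        dx = -math.sqrt(self.edx2 + self.eps) / math.sqrt(self.eg2 + self.eps) * grad
        self.edx2 = self.rho * self.edx2 + (1 - self.rho) * dx * dx
        return dx

# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
opt, x = AdaDelta(), 1.0
for _ in range(500):
    x += opt.step(2 * x)
print(abs(x) < 1.0)  # the iterate has moved toward the minimum at 0
```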
3 Experiments
To demonstrate the effectiveness of EHNet on speech enhancement, we created a synthetic dataset, which consists of 7,500, 1,500 and 1,500 recordings (clean/noisy speech pairs) for training, validation and testing, respectively. Each recording is synthesized by convolving a randomly selected clean speech file with one of 48 available room impulse responses and adding a randomly selected noise file. The clean speech corpus consists of 150 files containing ten utterances each, with male, female, and children's voices. The noise dataset consists of 377 recordings representing 25 different types of noise. The room impulse responses were measured at distances between 1 and 3 meters. A secondary noise dataset of 32 files, with noises that do not appear in the training set, is denoted UnseenNoise and used to generate another test set of 1,500 files. The randomly generated speech and noise levels provide a signal-to-noise ratio between 0 and 30 dB. All files are sampled at 16 kHz and stored at 24-bit resolution.
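The random-SNR mixing described above amounts to scaling the noise before adding it to the speech. A hedged sketch of the idea (not the actual synthesis pipeline; `scale_noise_to_snr` is our own name):

```python
import math
import random

def scale_noise_to_snr(speech, noise, snr_db):
    """Return `noise` scaled so that mixing it with `speech` yields the
    requested speech-to-noise power ratio (in dB)."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [n * gain for n in noise]

random.seed(0)
speech = [math.sin(0.01 * t) for t in range(16000)]       # 1 s at 16 kHz
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]
scaled = scale_noise_to_snr(speech, noise, snr_db=10.0)
mixture = [s + n for s, n in zip(speech, scaled)]

# Sanity check: the measured SNR of the mixture matches the 10 dB target.
p_s = sum(s * s for s in speech) / len(speech)
p_n = sum(n * n for n in scaled) / len(scaled)
snr = 10.0 * math.log10(p_s / p_n)
print(round(snr, 6))  # -> 10.0
```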
Table 1: Experimental results on the test sets with seen and unseen noise (SNR in dB, WER in %).

                 |             Seen Noise              |            Unseen Noise
Model            |  SNR    LSD    MSE      WER   PESQ  |  SNR    LSD    MSE      WER   PESQ
Noisy Speech     | 15.18  23.07  0.04399  15.40  2.26  | 14.78  23.76  0.04786  18.40  2.09
MS               | 18.82  22.24  0.03985  14.77  2.40  | 19.73  22.82  0.04201  15.54  2.26
DNN-Symm         | 44.51  19.89  0.03436  55.38  2.20  | 40.47  21.07  0.03741  54.77  2.16
DNN-Causal       | 40.70  20.09  0.03485  54.92  2.17  | 38.70  21.38  0.03718  54.13  2.13
RNN-Ng           | 41.08  17.49  0.03533  44.93  2.19  | 44.60  18.81  0.03665  52.05  2.06
EHNet            | 49.79  15.17  0.03399  14.64  2.86  | 39.70  17.06  0.04712  16.71  2.73
Clean Speech     | 57.31   1.01  0.00000   2.19  4.48  | 58.35   1.15  0.00000   1.83  4.48
3.1 Dataset and Setup
As a preprocessing step, we first use the STFT to extract the spectrogram from each utterance. The spectrogram has 256 frequency bins and $T$ frames. To thoroughly measure enhancement quality, we use the following 5 metrics to evaluate the different models: signal-to-noise ratio (SNR, dB), log-spectral distortion (LSD), mean-squared error in the time domain (MSE), word error rate (WER, %), and the PESQ measure. To measure WER, we use the DNN-based speech recognizer described in [23]. The recognizer is kept fixed (not fine-tuned) during the experiment. We compare EHNet with the following state-of-the-art methods:

- MS. Microsoft's internal speech enhancement system used in production, which uses a combination of statistics-based enhancement rules.

- DNN-Symm [9]. DNN-Symm contains 3 hidden layers, each with 2048 hidden units. It uses a symmetric context window of size 11.

- DNN-Causal [11]. Similar to DNN-Symm, DNN-Causal contains 3 hidden layers of size 2048, but uses a causal context window of size 7 instead of a symmetric one.

- RNN-Ng [16]. RNN-Ng is a recurrent neural network with 3 hidden layers of size 500. The input at each time step covers the frames in a context window of length 3.
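As a rough illustration of the SNR and LSD metrics used in the evaluation, a sketch with common textbook definitions follows; the paper's exact variants may differ, and the function names are ours:

```python
import math

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB, treating (estimate - reference) as noise."""
    p_sig = sum(r * r for r in reference)
    p_err = sum((e - r) ** 2 for e, r in zip(estimate, reference))
    return 10.0 * math.log10(p_sig / p_err)

def lsd(ref_spec, est_spec, floor=1e-10):
    """Log-spectral distortion in dB: RMS difference of log-magnitude
    spectra per frame, averaged over frames (one common variant)."""
    total = 0.0
    for r_frame, e_frame in zip(ref_spec, est_spec):
        sq = [(20.0 * math.log10(max(r, floor)) -
               20.0 * math.log10(max(e, floor))) ** 2
              for r, e in zip(r_frame, e_frame)]
        total += math.sqrt(sum(sq) / len(sq))
    return total / len(ref_spec)

# A uniform 0.1 amplitude error on unit-amplitude samples gives 20 dB SNR.
print(round(snr_db([1.0, -1.0, 1.0, -1.0], [1.1, -0.9, 1.1, -0.9]), 1))  # -> 20.0
print(lsd([[1.0, 2.0]], [[1.0, 2.0]]))  # -> 0.0
```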
The architecture of EHNet is as follows: the convolutional component contains 256 kernels of size , with stride along the frequency and the time dimensions, respectively. We use two layers of bidirectional LSTMs following the convolutional component, each with 1024 hidden units. To train EHNet, we fix the number of epochs to 200, with a scheduled learning rate adjusted every 60 epochs. For all methods, we use the validation set for early stopping and save the best model on the validation set for evaluation on the test set. EHNet does not overfit, as both weight decay and dropout hurt the final performance. We also experimented with deeper EHNet variants with more layers of bidirectional LSTMs, but this did not significantly improve the final performance. We also observe in our experiments that reducing the stride of the convolution in the frequency dimension does not significantly boost the performance of EHNet, but incurs substantial additional computation.

3.2 Results and Analysis
Experimental results are shown in Table 1. On the test set with seen noise, EHNet consistently outperforms all competitors by a large margin. Specifically, EHNet improves perceptual quality (the PESQ measure) by 0.6 without hurting recognition accuracy. This is surprising, as we treat the underlying ASR system as a black box and do not fine-tune it during the experiment. By comparison, while all the other methods boost the SNR, they often decrease recognition accuracy. More surprisingly, EHNet also generalizes well to unseen noise, achieving an even larger boost (0.64) in perceptual quality while at the same time improving recognition accuracy.
To better understand these results, we conduct a case study by visualizing the denoised spectrograms from the different models. As shown in Fig. 2, MS is the most conservative algorithm of all: by not removing much noise, it also keeps most of the real signal in the speech. On the other hand, although the DNN-based approaches do a good job of removing background noise, they also tend to remove real speech signal from the spectrogram. This explains why the DNN-based approaches degrade the recognition accuracies in Table 1. The RNN does a better job than the DNNs, but still fails to keep the real signal in the low-frequency bins. By comparison, EHNet finds a good trade-off between removing background noise and preserving the real speech signal: it is better than the DNN/RNN at preserving high/low-frequency bins and superior to MS at removing background noise. It is also easy to see that EHNet produces the denoised spectrogram closest to the ground-truth clean spectrogram.
4 Conclusion
We propose EHNet, which combines convolutional and recurrent neural networks for speech enhancement. The inductive bias of EHNet makes it well-suited to speech enhancement: the convolution kernels efficiently detect local patterns in spectrograms, and the bidirectional recurrent connections automatically model the dynamic correlations between adjacent frames. Due to the sparse nature of convolutions, EHNet requires less computation than both MLPs and RNNs. Experimental results show that EHNet consistently outperforms all competitors on all 5 metrics, and is also able to generalize to unseen noise, confirming its effectiveness for speech enhancement.
References
 [1] Ivan Jelev Tashev, Sound capture and processing: practical approaches, John Wiley & Sons, 2009.
 [2] Philipos C Loizou, Speech enhancement: theory and practice, CRC press, 2013.
 [3] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
 [4] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
 [5] Steven Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
 [6] Yariv Ephraim and David Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.
 [7] Amir Hussain, Mohamed Chetouani, Stefano Squartini, Alessandro Bastari, and Francesco Piazza, “Nonlinear speech enhancement: An overview,” in Progress in nonlinear speech processing, pp. 217–248. Springer, 2007.

 [8] Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori, “Speech enhancement based on deep denoising autoencoder,” in Interspeech, 2013, pp. 436–440.
 [9] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
 [10] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 7–19, 2015.
 [11] Seyedmahdad Mirsamadi and Ivan Tashev, “Causal speech enhancement combining data-driven learning and suppression rule estimation,” in INTERSPEECH, 2016, pp. 2870–2874.
 [12] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florêncio, and Mark Hasegawa-Johnson, “Speech enhancement using Bayesian WaveNet,” Proc. Interspeech 2017, pp. 2013–2017, 2017.
 [13] Shinichi Tamura, “An analysis of a noise reduction neural network,” in Acoustics, Speech, and Signal Processing, 1989. ICASSP89., 1989 International Conference on. IEEE, 1989, pp. 2001–2004.
 [14] Fei Xie and Dirk Van Compernolle, “A family of MLP based nonlinear spectral estimators for noise reduction,” in Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on. IEEE, 1994, vol. 2, pp. II–53.
 [15] Shin’ichi Tamura and Alex Waibel, “Noise reduction using connectionist models,” in Acoustics, Speech, and Signal Processing, 1988. ICASSP88., 1988 International Conference on. IEEE, 1988, pp. 553–556.
 [16] Andrew L Maas, Quoc V Le, Tyler M O’Neil, Oriol Vinyals, Patrick Nguyen, and Andrew Y Ng, “Recurrent neural networks for noise reduction in robust asr,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
 [17] ITU-T Recommendation, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Rec. ITU-T P.862, 2001.
 [18] Kurt Hornik, Maxwell Stinchcombe, and Halbert White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
 [19] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, 2013, vol. 30.
 [20] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [21] Matthew D Zeiler, “ADADELTA: An adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
 [22] Li Deng, Geoffrey Hinton, and Brian Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8599–8603.
 [23] Frank Seide, Gang Li, and Dong Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.