A fully recurrent feature extraction for single channel speech enhancement

06/09/2020 ∙ by Muhammed PV Shifas, et al. ∙ University of Crete

Convolutional neural network (CNN) modules are widely used to build high-end speech enhancement neural models. However, the feature extraction power of vanilla CNN modules is limited by the dimensionality constraint of the convolutional kernels, so they fail to adequately model the noise context information at the feature extraction stage. To this end, by adding a recurrency factor into the feature-extracting CNN layers, we introduce a robust context-aware feature extraction strategy for single-channel speech enhancement. Being robust in capturing the local statistics of noise attributes in the extracted features, the suggested model is highly effective at differentiating speech cues, even in very noisy conditions. When evaluated against enhancement models using vanilla CNN modules in unseen noise conditions, the suggested model with recurrency in the feature extraction layers produced a segmental SNR (SSNR) gain of up to 1.5 dB, while the parameters to be optimized are reduced by 25%.




I Introduction

Speech enhancement is a general term that refers to the suppression of noise artifacts in speech recorded under inferior acoustic conditions. With the increased use of communication devices in noisy outdoor environments, the need for robust enhancement strategies is of paramount importance. By parametrically modeling the noise distribution, mainly with first- and second-order statistics, classical speech enhancement techniques based on conventional signal estimation theory have been universal in practice [2, 4]. Though they are robust against the class of noises whose spectral distribution can be modeled entirely by second-order statistics, their performance under more structurally distributed noises has not been satisfactory [5].

Having proven efficient at modeling complex noise patterns, neural networks have attracted considerable attention for the enhancement task [24, 14]. This is primarily owing to the absence of explicit assumptions on the noise statistics, which enables the model to learn the principal noise patterns that are pivotal for discriminating the noise attributes. Although a simple feed-forward multilayer perceptron (MLP) [1] network can model data non-linearity reasonably well, its modeling is inherently bound to the global patterns of the input [14]. Besides, the parameter complexity of MLP models increases linearly with the input and hidden space dimensions [12]. By capturing the local patterns in the noisy input with fixed-size kernels, the convolutional neural network (CNN) achieves robust enhancement in complex adversities [11, 17], while reducing the number of network parameters to be optimized. Later, in [23, 20, 3], different recurrent neural modules [9] were integrated into CNN-based speech enhancement models as a supportive layer to ensure the temporal flow of the predicted samples. Though relatively complex, waveform-domain models built of dilated CNN modules have recently been gaining popularity, showing promising quality enhancement [19, 15].

In the existing enhancement models, CNN layers, either causal or dilated, with a specific kernel size are used as the front-end feature extraction module. Although the performance of the vanilla CNN module is excellent on high-resolution data, a recent study in computer vision has revealed its vulnerability to adversarial attacks as the input quality degrades [6]. Unlike human vision, which is robust in detecting target patterns even under very weak signal conditions, computer vision with CNNs underperforms as the input degrades. Addressing this limitation of the vanilla CNN, T. S. Hartmann [8] added recurrent connections into the CNN module, which improved the performance of an object classification model; the resulting module was called the gruCNN module.

Exploring the prospect of the gruCNN module in the speech domain, we introduce a new feature extraction strategy for speech enhancement models, in which the features are extracted recurrently over time by capturing the temporal flow of speech. Through the inclusion of recurrency in the feature extraction layers, the proposed enhancement model (gruCNN_FC-SE) learns to extract features that are maximally relevant at every temporal context. In contrast to CNN-based enhancement models, the suggested model obtains refined features in the layers of the network, while at the same time considerably reducing the parameter complexity. When trained and evaluated on a multi-speaker data set under different unseen noise conditions, the suggested gruCNN_FC-SE model has shown promising results over the traditional networks. Speech intelligibility, measured on the segmental SNR scale, is improved by up to 1.5 dB across different SNR levels. Simultaneously, the parameter load is reduced by 25% relative to the conventional model.

The rest of this paper is organized as follows. Section II discusses in detail the suggested feature extraction strategy and the gruCNN_FC-SE enhancement model that uses it. The model's evaluation procedure is described in Section III. Section IV presents the results and a discussion of the observations. The paper is concluded in Section V.

II Recurrent Feature Extraction Technique

The problem of speech enhancement is framed in the manually extracted feature (spectral) domain of speech, owing to the larger computational complexity of temporal models. Since speech is highly regressive in nature, each sample depends statistically on the preceding ones. Let $x_k$ be the slice of the $k$-th frequency bin values over time from the noisy spectrum $X$, such that $x_k = [x_k^1, x_k^2, \dots, x_k^T]$, where $T$ is the total number of frames considered. Then, the probability of $x_k$ can be expressed as

$$p(x_k) = p(x_k^1, x_k^2, \dots, x_k^T) \qquad (1)$$

$$= p(x_k^1)\, p(x_k^2 \mid x_k^1) \cdots p(x_k^T \mid x_k^1, \dots, x_k^{T-1}) \qquad (2)$$

$$= \prod_{t=1}^{T} p(x_k^t \mid x_k^1, \dots, x_k^{t-1}) \qquad (3)$$

Though this modelling does not account for the inter-bin dependency that might arise within a frame as $k$ varies from 1 to $K$ (the final bin), it is still a strong model of the speech auto-regression. Further, since the output at any time instant is independent of future instants, the model is bound to be causal.

As such, preserving this statistical structure is essential when designing speech enhancement models that ensure the auto-regressive nature of the predictions. Moreover, the performance of speech enhancement models depends very much on how accurately this dependency is modeled.

In conventional speech enhancement neural models [25][16][3], the temporal recurrency of speech was modelled by fully connected recurrent neural modules (FC-RNN), like LSTM, GRU or SRU, employed towards the end of the model architecture, independent of the front-end feature-extracting CNN layers. This two-stage modelling does not account for the recurrency factor at the feature extraction stage, leading to a lack of qualitative features at the front-end layers. When recurrency is modelled only in the back-end FC-RNN module, the bin-wise recurrency factor described in Eq. (1) – (3) cannot be attended to, due to the inherently fully connected structure of the module.

To this end, a new feature extraction strategy adopting the local recurrency of speech is suggested, in which the feature extraction layers are carefully designed to model the local recursion over time, with kernels of specific size that keep track of the local statistics of previous frame patterns to be integrated into the current feature estimation. At a given frame index, the new feature extraction layer (gruCNN) takes the input, which is the noisy speech spectrum at the first layer, along with the feature state of the previous frame, and processes them through the nonlinear transformations in Eq. (4) – (7) to obtain the feature representation of the current frame. The feature map thereby encodes the information from the current frame statistics along with the past context.


In Eq. (4) – (7), the two operators indicate convolution and element-wise matrix multiplication, respectively. The capitalized variables highlight the fact that they are matrices at every frame instant, spanning the frequency and channel axes. While training in this setting, the network learns the optimal kernels that maximize the local bin recurrency, which ensures the best features at the network layers. It is worth noting that, unlike fully connected RNNs that use memory cells to store long-term contextual information, gruCNN does not use any, which in turn reduces the parameter complexity.
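To make the recurrence concrete, the following NumPy sketch runs a convolutional GRU-style cell frame by frame. It is an illustration under assumptions, not the authors' implementation: the gate layout follows the standard GRU with matrix products replaced by convolutions, which may differ from the exact form of Eq. (4) – (7); the kernels here span only the frequency axis with width 3; and the names `conv_freq`, `gru_cnn_step`, the parameter keys and the small channel count are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv_freq(x, w):
    """'Same' 1-D cross-correlation along the frequency axis.
    x: [F, C_in], w: [k, C_in, C_out] with odd k -> [F, C_out]."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], w.shape[2]))
    for i in range(k):
        out += xp[i:i + x.shape[0]] @ w[i]
    return out

def gru_cnn_step(x_t, h_prev, p):
    """One recurrent feature-extraction step (assumed standard GRU gating)."""
    z = sigmoid(conv_freq(x_t, p["Wz"]) + conv_freq(h_prev, p["Uz"]))  # update gate
    r = sigmoid(conv_freq(x_t, p["Wr"]) + conv_freq(h_prev, p["Ur"]))  # reset gate
    h_cand = np.tanh(conv_freq(x_t, p["Wh"]) + conv_freq(r * h_prev, p["Uh"]))
    return (1.0 - z) * h_prev + z * h_cand  # blend past state and candidate

rng = np.random.default_rng(0)
F, C_in, C, T = 161, 1, 8, 5  # bins, input channels, feature channels, frames
params = {}
for name in ("Wz", "Wr", "Wh"):  # kernels applied to the current input frame
    params[name] = 0.1 * rng.standard_normal((3, C_in, C))
for name in ("Uz", "Ur", "Uh"):  # kernels applied to the previous feature state
    params[name] = 0.1 * rng.standard_normal((3, C, C))

X = rng.standard_normal((T, F, C_in))  # stand-in for a noisy spectrum
h = np.zeros((F, C))                   # zero-valued initial state
states = []
for t in range(T):
    h = gru_cnn_step(X[t], h, params)
    states.append(h)
features = np.stack(states)            # [T, F, C]
```

Because the state is carried only through `h`, the cell needs no separate memory cell, which is the property the text credits for the lower parameter count.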

Fig. 1: The recurrent feature extraction of the gruCNN_FC-SE model. The recurrent feature state was initialized to a zero-valued tensor at each layer.

By layering a set of gruCNN modules one after another, the gruCNN_FC-SE network has the final structure shown in Fig. 1. By looking into the features already extracted at the previous frame instant, the model distills the features that are most temporally relevant in the current context. At the end of the model architecture, a time-distributed fully connected layer regresses the recurrently extracted features into the enhanced spectral bins. These predictions are combined with the noisy phase information to reconstruct the enhanced speech samples.
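The final reconstruction step can be sketched as follows. This is a minimal illustration assuming log-power spectral predictions, a Hann analysis window and overlap-add synthesis normalized by the window sum; the helper names `stft` and `istft` are hypothetical, and the noisy log-power spectrum stands in for the network output.

```python
import numpy as np

FRAME, HOP = 320, 160  # 20 ms frames with a 10 ms hop at 16 kHz

def stft(x, window):
    n = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * window for i in range(n)])
    return np.fft.rfft(frames, axis=1)  # [frames, 161]

def istft(spec, window, eps=1e-8):
    """Overlap-add synthesis, normalized by the summed analysis window."""
    frames = np.fft.irfft(spec, n=FRAME, axis=1)
    length = HOP * (len(frames) - 1) + FRAME
    out = np.zeros(length)
    wsum = np.zeros(length)
    for i, f in enumerate(frames):
        out[i * HOP:i * HOP + FRAME] += f
        wsum[i * HOP:i * HOP + FRAME] += window
    return out / np.maximum(wsum, eps)

window = np.hanning(FRAME)  # window choice is an assumption
noisy = np.random.default_rng(0).standard_normal(16000)
noisy_spec = stft(noisy, window)

# Stand-in for the model prediction: here simply the noisy log-power spectrum.
predicted_lps = np.log(np.abs(noisy_spec) ** 2 + 1e-12)

magnitude = np.sqrt(np.exp(predicted_lps))                     # back to magnitude
enhanced_spec = magnitude * np.exp(1j * np.angle(noisy_spec))  # reuse noisy phase
enhanced = istft(enhanced_spec, window)
```

With a real model, `predicted_lps` would be the enhanced spectrum, while the phase term stays the same.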

III Evaluation Procedure

As the primary focus is on evaluating the efficacy of the suggested recurrent feature extraction strategy over the conventional CNN architecture, the compared models should have the same parameter settings. For this purpose, a model without any recurrent connection in the feature-extracting CNN layers is considered (CNN_FC-SE). Since it does not incorporate any form of temporal recurrency in its modelling, its architecture is similar to Fig. 1, but without the recurrent connections. Secondly, to quantify the benefits of modelling recurrency precisely at the feature extraction stage, a model having the front-end CNN layers followed by a standard fully connected LSTM module [7] (CNN_LSTM-SE) was implemented. Similar architectures have been reported for speech enhancement in [25][16] with minor variations.

All the models considered have six convolutional layers (recurrent or conventional) at the front, followed by a final fully connected (recurrent or conventional) layer. The convolutional kernels of each layer are set to a size of [3 × 3], looking into the immediate past and future frame activities while extracting the current frame features. Although these numbers could be tuned, this setting was found suitable for easily disentangling the performance gains of the different models. Each layer of the models has a channel depth of 256 with Parametric ReLU (PReLU) activation. Further details about the individual layers are given in TABLE I, for an input tensor of shape [1, 161, 128, 1].

| Layer | CNN_FC-SE | CNN_LSTM-SE | gruCNN_FC-SE | Output shape |
|---|---|---|---|---|
| 1 | [3 × 3] CNN | [3 × 3] CNN | [3 × 3] gruCNN | [1, 161, 128, 256] |
| 2 | [3 × 3] CNN | [3 × 3] CNN | [3 × 3] gruCNN | [1, 161, 128, 256] |
| 3 | Maxpool | Maxpool | Maxpool | [1, 81, 128, 256] |
| 4 | [3 × 3] CNN | [3 × 3] CNN | [3 × 3] gruCNN | [1, 81, 128, 256] |
| 5 | [3 × 3] CNN | [3 × 3] CNN | [3 × 3] gruCNN | [1, 81, 128, 256] |
| 6 | Maxpool | Maxpool | Maxpool | [1, 41, 128, 256] |
| 7 | [3 × 3] CNN | [3 × 3] CNN | [3 × 3] gruCNN | [1, 41, 128, 256] |
| 8 | [3 × 3] CNN | [3 × 3] CNN | [3 × 3] gruCNN | [1, 41, 128, 256] |
| 9 | FC | FC_LSTM | FC | [1, 161, 128, 1] |

TABLE I: Layer-wise descriptions of the different models
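The frequency dimension in TABLE I shrinks from 161 to 81 to 41 while the 128 frames are preserved. This is consistent with a stride-2 max-pooling along the frequency axis with 'same' padding, which is an assumption here since the pooling sizes are not stated. A quick check (the helper `pooled_len` is hypothetical):

```python
import math

def pooled_len(n, stride=2):
    """Output length of a strided pooling with 'same' padding."""
    return math.ceil(n / stride)

freq_bins = [161]
for _ in range(2):  # two max-pooling layers (rows 3 and 6 of TABLE I)
    freq_bins.append(pooled_len(freq_bins[-1]))
print(freq_bins)  # [161, 81, 41]
```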

Data Set (Training and Testing): The speech set is a selection of ten British English speakers, both male and female, from the Voice Bank speech corpus [22], each of whom has around 400 clean utterances. Data from eight speakers were used for training, and the remaining two speakers (one male and one female) were reserved for performance testing. The noisy mixtures were created manually. The noises are from the NOISEX data set [21], which contains 20 different types of common environmental noises. Fourteen of these were used for training, and the remaining six were used as the unseen noises under which the models are tested. For the training set mixtures, each speech sample was masked by a random training set noise at a random SNR point from [0, 5, 10, 15, 20] dB. A similar process was repeated for the test set, but with the unseen noises at the unseen SNR points of [2.5, 12.5, 22.5] dB.
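The mixing step can be sketched as follows. This illustrates how a mixture at a target SNR is commonly built (the noise is scaled so that the speech-to-noise power ratio matches the target); it is an assumption about the procedure rather than the authors' code, and the function name `mix_at_snr` and the synthetic signals are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech/noise power ratio equals snr_db, then add."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # stand-in for a clean utterance
noise = rng.standard_normal(16000)                           # stand-in for a noise recording
train_snrs = [0, 5, 10, 15, 20]                              # dB, training SNR points
snr_db = train_snrs[rng.integers(len(train_snrs))]           # random SNR per utterance
noisy = mix_at_snr(speech, noise, snr_db)

# SNR actually realized by the mixture, for verification.
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
```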

Before being fed into the models, the signals, sampled at 16 kHz, were split into 20 ms frames with 10 ms overlap. The frames were transformed by the short-time Fourier transform (STFT) with 320 STFT points. Finally, the log-power spectra were used as the input and output features of the models [18].
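As a rough sketch of this front end: at 16 kHz, a 20 ms frame is 320 samples and a 10 ms hop is 160 samples, so a 320-point STFT yields 161 non-redundant bins per frame. The Hann window, the small constant `eps` and the function name `log_power_spectrum` are assumptions for illustration.

```python
import numpy as np

FS = 16000
FRAME = FS * 20 // 1000  # 320 samples (20 ms)
HOP = FS * 10 // 1000    # 160 samples (10 ms)
N_FFT = 320

def log_power_spectrum(signal, eps=1e-10):
    """Return (log-power spectrum, phase), each of shape [frames, N_FFT // 2 + 1]."""
    n_frames = 1 + (len(signal) - FRAME) // HOP
    window = np.hanning(FRAME)  # window choice is an assumption
    frames = np.stack([signal[i * HOP:i * HOP + FRAME] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=N_FFT, axis=1)
    return np.log(np.abs(spec) ** 2 + eps), np.angle(spec)

signal = np.random.default_rng(0).standard_normal(FS)  # one-second stand-in signal
lps, phase = log_power_spectrum(signal)
print(lps.shape)  # (99, 161)
```

The phase is returned alongside the features because the noisy phase is needed again when the enhanced waveform is reconstructed.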

Model Training: All the compared models are trained in an end-to-end mode, where the losses are computed directly between the magnitudes of the predicted ($\hat{Y}$) and the target ($Y$) STFT components. For each noisy-clean training set pair $(X, Y)$, the model parameters are optimised by minimising the mean square error (MSE) objective function

$$\mathcal{L}(\hat{Y}, Y) = \frac{1}{KT} \sum_{t=1}^{T} \sum_{k=1}^{K} \left( \hat{Y}_k^t - Y_k^t \right)^2 \qquad (8)$$

where $K$ denotes the total number of output STFT bins, that is 161, and $T$ is the number of time frames recurrently generated in the training process, which has been set to $T = 128$. The value of $T$ for testing varies with the input signal duration, as the recurrency is modelled over the temporal axis. The loss was minimized by the Adam optimizer with an exponentially decaying learning rate, using an initial learning rate of 0.001, decay steps = 20,000 and decay rate = 0.99.
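The objective and the learning-rate schedule can be illustrated with a short sketch. It assumes the MSE is averaged over all K × T spectral points and that the decay is continuous, i.e. lr = 0.001 × 0.99^(step / 20000); a staircase variant is equally possible. The function names are hypothetical.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error over K frequency bins and T frames."""
    K, T = target.shape
    return np.sum((pred - target) ** 2) / (K * T)

def decayed_lr(step, base_lr=0.001, decay_steps=20000, decay_rate=0.99):
    """Exponential decay: base_lr * decay_rate ** (step / decay_steps)."""
    return base_lr * decay_rate ** (step / decay_steps)

K, T = 161, 128                     # bins and training frames from the text
rng = np.random.default_rng(0)
target = rng.standard_normal((K, T))
pred = target + 0.1                 # constant error of 0.1 at every point
loss = mse_loss(pred, target)       # approximately 0.01
lr_start = decayed_lr(0)            # 0.001
lr_later = decayed_lr(20000)        # approximately 0.00099
```

With these settings the learning rate drops by only 1% every 20,000 steps, so the schedule is very gentle.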

For the objective evaluation of the processed samples, the perceptual evaluation of speech quality (PESQ) metric [10], measuring quality, and the short-time objective intelligibility (STOI) [13], measuring intelligibility, are considered. The composite quality of the model's predictions (COVL) has also been measured [13], which reports a compound count of the noise reduction and speech restoration. In addition, the SNR intelligibility gain through model processing is measured by the segmental SNR (SSNR) score [13]. Subjectively, the quality of the enhanced samples was measured by the mean opinion score (MOS). In total, 20 participants (non-native English speakers) listened to the samples and assigned individual perceptual scores based on the noise artifacts present, on a scale of 1 to 5 (1: very annoying artifacts, 5: no artifacts at all).
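For readers unfamiliar with SSNR, a minimal sketch is given below. It assumes the common time-domain definition: the SNR is computed per frame, clamped to a fixed range, and averaged. The frame length, hop and the clamping limits of -10 dB and 35 dB are frequently used values and are assumptions here, not settings confirmed by the text.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame=320, hop=160, lo=-10.0, hi=35.0, eps=1e-10):
    """Average of frame-wise SNRs (dB), each clamped to [lo, hi]."""
    n = 1 + (len(clean) - frame) // hop
    scores = []
    for i in range(n):
        s = clean[i * hop:i * hop + frame]
        e = s - enhanced[i * hop:i * hop + frame]  # residual error in this frame
        snr = 10 * np.log10((np.sum(s ** 2) + eps) / (np.sum(e ** 2) + eps))
        scores.append(np.clip(snr, lo, hi))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # stand-in clean signal
mild = clean + 0.1 * rng.standard_normal(16000)             # lightly corrupted
heavy = clean + 0.5 * rng.standard_normal(16000)            # heavily corrupted

ssnr_identical = segmental_snr(clean, clean)  # saturates at the upper clamp
ssnr_mild = segmental_snr(clean, mild)
ssnr_heavy = segmental_snr(clean, heavy)
```

Clamping keeps silent or perfectly reconstructed frames from dominating the average.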

Fig. 2: Model enhancement under noises of different spectral distribution

IV Results and Discussion

The mean objective scores over 220 test samples at each noise condition are displayed in TABLE II. Along with the processed outputs, the scores of the unprocessed noisy speech are included to better convey the relative gain. Compared to the CNN_FC-SE architecture, which does not incorporate any form of the recurrency described in Eq. (1) – (3), the suggested gruCNN_FC-SE model, with recurrency modelled in the feature extraction layers, distinctly outperforms it on all the metrics. This gain is almost consistent across the noise conditions. With the inclusion of global recurrency, the performance of CNN_LSTM-SE also improves over CNN_FC-SE. This broadly conveys the benefits that can be achieved through temporally inclusive modeling.

| Noise level | Metric | Noisy | CNN_FC-SE | CNN_LSTM-SE | gruCNN_FC-SE |
|---|---|---|---|---|---|
| 2.5 dB | PESQ | 1.20 | 1.41 | 1.51 | 1.57 |
| | STOI | 0.68 | 0.71 | 0.72 | 0.74 |
| | COVL | 1.58 | 1.96 | 2.15 | 2.22 |
| | SSNR | -3.63 | 2.39 | 3.20 | 3.94 |
| 12.5 dB | PESQ | 1.49 | 1.87 | 2.01 | 2.08 |
| | STOI | 0.77 | 0.78 | 0.79 | 0.80 |
| | COVL | 2.11 | 2.59 | 2.74 | 2.83 |
| | SSNR | 3.24 | 7.61 | 7.85 | 8.96 |
| 22.5 dB | PESQ | 2.27 | 2.47 | 2.58 | 2.66 |
| | STOI | 0.85 | 0.83 | 0.84 | 0.85 |
| | COVL | 3.05 | 3.20 | 3.30 | 3.41 |
| | SSNR | 12.26 | 11.21 | 11.14 | 12.83 |

TABLE II: Objective measures enumerating the performance

When comparing the two recurrent models, the gruCNN_FC-SE model, which accounts for the bin-wise recurrency factor, shows better enhancement than CNN_LSTM-SE, which only models the global recurrency patterns. Even at the higher SNR point of 22.5 dB, where the noise attributes are mild, the gruCNN_FC-SE model elicited noticeable enhancement, showing an SNR intelligibility gain of up to 1.5 dB over the other models. This is evidently attributable to the new feature extraction strategy of the network.

Regarding the consistency of the model's predictions across noise types, the model enhancement in two noise conditions is plotted in Fig. 2. The first row (type–1) is a construction noise, and the second row (type–2) is a street noise. Since the type–1 noise has its spectral energy distributed equally throughout the lower range of the frequency band (0 - 3 kHz), including regions where the speech activities are negligible, it is straightforward for a neural network to get a correct estimate of the noise activities. In the type–2 noise, by contrast, the noise attributes are highly localized in the very low band (0 - 0.5 kHz) of the spectrum (marked by the box), where the speech activities are noticeable. Unless the model looks into the local statistics of the spectrum, such noise could easily be misclassified as a speech event. This happened in the case of CNN_FC-SE and CNN_LSTM-SE, whereas gruCNN_FC-SE was successful in disentangling the noise activities through the exploration of the local patterns.

The subjective scores of the different models are displayed in TABLE III. In line with the objective scores, the suggested gruCNN_FC-SE model is ranked closest to the clean speech, with a score of 3.16 on the 5-point scale, while there was no statistically significant difference between the scores of the other two methods.

Pragmatically, the performance gain of a neural model could be attributed to the additional parameters introduced into the modeling. To examine this, the parameter counts of the different models are tabulated in TABLE IV. Though CNN_FC-SE has the lowest number of parameters among the models, its performance is much weaker than that of the other two models. On the other hand, the suggested gruCNN_FC-SE produces far better enhancement with only 75% of the parameters of CNN_LSTM-SE. This reduction in complexity comes from replacing the fully connected LSTM layer with the fixed kernels of gruCNN to model the temporal flow. All of this indicates the potential for implementation in computationally constrained applications, like hearing aids. A TensorFlow implementation and enhanced samples from the model are provided at https://www.csd.uoc.gr/~shifaspv/IEEE_Letter-demo and https://github.com/shifaspv/gruCNN-speech-enhancement-tensorflow.

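| Metric | Noisy | CNN_FC-SE | CNN_LSTM-SE | gruCNN_FC-SE | Clean |
|---|---|---|---|---|---|
| MOS | 2.01 ± 0.97 | 2.75 ± 0.92 | 2.77 ± 0.89 | 3.16 ± 0.92 | 4.86 ± 0.42 |

TABLE III: Mean opinion score (MOS) with standard error

| | CNN_FC-SE | CNN_LSTM-SE | gruCNN_FC-SE |
|---|---|---|---|
| Parameters | 11.13M | 36.10M | 27.22M |

TABLE IV: The parameter counts in millions (M)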
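The quoted 25% reduction can be checked directly from the parameter counts. The mapping of the three counts to model names follows the discussion in the text (CNN_FC-SE lowest, CNN_LSTM-SE highest) and is otherwise an assumption.

```python
params_m = {"CNN_FC-SE": 11.13, "CNN_LSTM-SE": 36.10, "gruCNN_FC-SE": 27.22}

ratio = params_m["gruCNN_FC-SE"] / params_m["CNN_LSTM-SE"]
reduction_pct = 100 * (1 - ratio)
print(f"{ratio:.3f} {reduction_pct:.1f}%")  # 0.754 24.6%
```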

V Conclusion

In this letter, we presented the concept of recurrent feature extraction, which is beneficial for single-channel speech enhancement. In contrast to the traditional CNN-based feature extraction approach, the suggested feature extraction module with recurrent connections in the convolution layers has proven efficient, especially in conditions where the noise activities are localized in nature. Subjective and objective evaluations have confirmed the benefits of the recurrent feature extraction technique. At the same time, the parameter complexity of the modelling is reduced by 25%. On this ground, there is clear reason to believe that the same approach could be extended to advanced enhancement models, like WaveNet and SEGAN.


  • [1] C. M. Bishop et al. (1995) Neural networks for pattern recognition. Oxford university press. Cited by: §I.
  • [2] S. Boll (1979) Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on acoustics, speech, and signal processing 27 (2), pp. 113–120. Cited by: §I.
  • [3] X. Cui, Z. Chen, and F. Yin (2020) Speech enhancement based on simple recurrent unit network. Applied Acoustics 157, pp. 107019. Cited by: §I, §II.
  • [4] Y. Ephraim (1992) Statistical-model-based speech enhancement systems. Proceedings of the IEEE 80 (10), pp. 1526–1555. Cited by: §I.
  • [5] N. W. Evans, J. S. Mason, W. Liu, and B. Fauve (2006) An assessment on the fundamental limitations of spectral subtraction. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Vol. 1, pp. I–I. Cited by: §I.
  • [6] R. Geirhos, D. H. Janssen, H. H. Schütt, J. Rauber, M. Bethge, and F. A. Wichmann (2017) Comparing deep neural networks against humans: object recognition when the signal gets weaker. arXiv preprint arXiv:1706.06969. Cited by: §I.
  • [7] F. A. Gers, J. Schmidhuber, and F. Cummins (1999) Learning to forget: continual prediction with lstm. Cited by: §III.
  • [8] T. S. Hartmann (2018) Seeing in the dark with recurrent convolutional neural networks. arXiv preprint arXiv:1811.08537. Cited by: §I.
  • [9] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §I.
  • [10] Y. Hu and P. C. Loizou (2007) Evaluation of objective quality measures for speech enhancement. IEEE Transactions on audio, speech, and language processing 16 (1), pp. 229–238. Cited by: §III.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I.
  • [12] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §I.
  • [13] P. C. Loizou (2013) Speech enhancement: theory and practice. CRC press. Cited by: §III.
  • [14] X. Lu, Y. Tsao, S. Matsuda, and C. Hori (2013) Speech enhancement based on deep denoising autoencoder. In Interspeech, pp. 436–440. Cited by: §I.
  • [15] P. Muhammed Shifas, N. Adiga, V. Tsiaras, and Y. Stylianou (2019) A non-causal fftnet architecture for speech enhancement. Proc. Interspeech 2019, pp. 1826–1830. Cited by: §I.
  • [16] G. Naithani, T. Barker, G. Parascandolo, L. Bramsløw, N. H. Pontoppidan, T. Virtanen, et al. (2017) Low latency sound source separation using convolutional recurrent neural networks. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 71–75. Cited by: §II, §III.
  • [17] S. R. Park and J. Lee (2016) A fully convolutional neural network for speech enhancement. arXiv preprint arXiv:1609.07132. Cited by: §I.
  • [18] M. Portnoff (1980) Time-frequency representation of digital signals and systems based on short-time fourier analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (1), pp. 55–69. Cited by: §III.
  • [19] D. Rethage, J. Pons, and X. Serra (2018) A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5069–5073. Cited by: §I.
  • [20] K. Tan and D. Wang (2018) A convolutional recurrent neural network for real-time speech enhancement.. In Interspeech, pp. 3229–3233. Cited by: §I.
  • [21] A. Varga and H. J. Steeneken (1993) Assessment for automatic speech recognition: ii. noisex-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech communication 12 (3), pp. 247–251. Cited by: §III.
  • [22] C. Veaux, J. Yamagishi, and S. King (2013) The voice bank corpus: design, collection and data analysis of a large regional accent speech database. In 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), pp. 1–4. Cited by: §III.
  • [23] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller (2015) Speech enhancement with lstm recurrent neural networks and its application to noise-robust asr. In International Conference on Latent Variable Analysis and Signal Separation, pp. 91–99. Cited by: §I.
  • [24] Y. Xu, J. Du, L. Dai, and C. Lee (2014) An experimental study on speech enhancement based on deep neural networks. IEEE Signal processing letters 21 (1), pp. 65–68. Cited by: §I.
  • [25] H. Zhao, S. Zarar, I. Tashev, and C. Lee (2018) Convolutional-recurrent neural networks for speech enhancement. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2401–2405. Cited by: §II, §III.