gruCNN-SE: A fully recurrent feature extraction
Convolutional neural network (CNN) modules are widely used to build high-end speech enhancement neural models. However, the feature extraction power of vanilla CNN modules is limited by the dimensionality constraint of the convolutional kernels, and they therefore fail to adequately model the noise context information at the feature extraction stage. To this end, by adding recurrency into the feature extracting CNN layers, we introduce a robust context-aware feature extraction strategy for single-channel speech enhancement. Because it robustly captures the local statistics of noise attributes in the extracted features, the suggested model is highly effective at differentiating speech cues, even in very noisy conditions. When evaluated against enhancement models using vanilla CNN modules in unseen noise conditions, the suggested model with recurrency in the feature extraction layers produced a segmental SNR (SSNR) gain of up to 1.5 dB, while the parameters to be optimized are reduced by 25%.
Speech enhancement is a general term that refers to the manipulation of noise artifacts in speech recorded under inferior acoustic conditions. With the increased use of communication devices in noisy outdoor environments, the need for robust enhancement strategies is of paramount importance. By parametrically modeling the noise distribution, mainly with first and second-order statistics, classical speech enhancement techniques based on conventional signal estimation theory have been universal in practice [2, 4]. Though they are robust against the class of noises whose spectral distribution can be modeled entirely by second-order statistics, their performance under more structurally distributed noises has not been satisfactory.
The strength of neural enhancement models primarily owes to their non-explicit assumption on noise statistics, which enables the model to learn the principal noise patterns that are pivotal for discriminating the noise attributes. Although a simple feed-forward multilayer perceptron (MLP) network can model data non-linearity reasonably well, the extent of this is inherently bound to the global patterns of the input. Besides, the parameter complexity of MLP models increases linearly with the input and hidden space dimensions. By capturing the local patterns in the noisy input with fixed-size kernels, the convolutional neural network (CNN) provides robust enhancement in complex adversities [11, 17], while reducing the network parameters to be optimized. Later, in [23, 20, 3], different recurrent neural modules were integrated into CNN based speech enhancement models as a supportive layer, to ensure the temporal flow of the predicted samples. Though relatively complex, waveform domain models built from dilated CNN modules have recently been gaining popularity, showing promising quality enhancement [19, 15].
In the existing enhancement models, CNN layers, either causal or dilated, with a specific kernel size are used as the front-end feature extraction module. Although the performance of the vanilla CNN module is excellent on high-resolution data, a recent study in computer vision has revealed its vulnerability to adversarial attacks as the input quality degrades. Unlike human vision, which is robust in detecting target patterns even under very poor signal conditions, computer vision with CNNs underperforms as the input degrades. Addressing this limitation of the vanilla CNN, T. S. Hartmann added recurrent connections into the CNN module, which improved the performance of an object classification model; the resulting module was called the gruCNN module.
Exploring the prospect of the gruCNN neural module in the speech domain, we introduce a new feature extraction strategy for speech enhancement models, where the features are extracted recurrently over time by capturing the temporal flow of speech. Through the inclusion of recurrency in the feature extraction layers, the proposed enhancement model (gruCNN_FC-SE) learns to extract the features that are maximally relevant in every temporal context. In contrast to CNN based enhancement models, the suggested model obtains refined features in the layers of the network, while at the same time reducing the parameter complexity considerably. When trained and evaluated on a multi-speaker data set, under different unseen noise conditions, the suggested gruCNN_FC-SE model has shown promising results over the traditional networks. The speech intelligibility is improved, on the segmental SNR scale, by up to 1.5 dB across different SNR levels. Simultaneously, the parameter load is reduced by 25% relative to the conventional model.
The rest of this paper is organized as follows. Section II discusses in detail the suggested feature extraction strategy and the gruCNN_FC-SE enhancement model that uses it. The model's evaluation procedure is described in Section III. Section IV presents the results and a discussion of the observations. The paper is concluded in Section V.
The problem of speech enhancement is framed in the manually extracted feature (spectral) domain of speech, because of the larger computational complexity of temporal models. Since speech is highly auto-regressive in nature, each sample depends statistically on the samples preceding it. Let $x_k$ be the slice of the $k$-th frequency bin values over time, from the noisy spectrum $X$, such that $x_k = [x_k^1, x_k^2, \dots, x_k^T]$, where $T$ is the total number of frames considered. Then, the probability of $x_k$ can be expressed as

$$p(x_k) = p(x_k^1, \dots, x_k^T) = \prod_{t=1}^{T} p\left(x_k^t \mid x_k^1, \dots, x_k^{t-1}\right).$$

Though this modelling does not account for the inter-bin dependency that might arise within a frame as $k$ varies from 1 to $K$ (the final bin), it is still a strong model of the speech auto-regression. Further, since the output at any time instant is independent of future instants, the model is bound to be causal.
As such, preserving this statistical structure is essential when designing speech enhancement models that ensure the auto-regressive nature of the predictions. Moreover, the performance of speech enhancement models depends very much on how accurately this dependency is modeled.
In conventional speech enhancement neural models, the temporal recurrency of speech was modelled by fully connected recurrent neural modules (FC-RNN), such as LSTM, GRU or SRU, employed towards the end of the model architecture, independently of the front-end feature extracting CNN layers. This two-stage modelling does not account for the recurrency at the feature extraction stage, which leads to a lack of high-quality features at the front-end layers. When recurrency is modelled only at the back-end FC-RNN module, no attention can be given to the bin-wise recurrency described in Eq. (1) – (3), due to the inherently fully connected structure of the module.
To this end, a new feature extraction strategy that adopts the local recurrency of speech is suggested. The feature extraction layers are carefully designed to model the local recursion over time, with kernels of specific size that keep track of the local statistics of previous frame patterns and integrate them into the current feature estimation. At a given frame index $t$, the new feature extraction layer (gruCNN) takes the input $X_t$, which is the noisy speech spectrum at the first layer, along with the feature state of the previous frame ($H_{t-1}$), and processes them through the nonlinear transformations in Eq. (4) – (7) to obtain the feature representation of the current frame ($H_t$). The feature map thereby encodes the current frame statistics along with the past context.
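For orientation, the textbook convolutional GRU update is sketched below. The symbols are assumptions made for this sketch: $X_t$ is the layer input, $H_{t-1}$ and $H_t$ are the previous and current feature maps, $Z_t$ and $R_t$ are the update and reset gates, and $W_\ast$, $U_\ast$ are the kernels. The exact gating used in Eq. (4) – (7) may differ from this standard form.

```latex
\begin{align*}
Z_t &= \sigma\left(W_z * X_t + U_z * H_{t-1}\right) \\
R_t &= \sigma\left(W_r * X_t + U_r * H_{t-1}\right) \\
\tilde{H}_t &= \tanh\left(W_h * X_t + U_h * (R_t \odot H_{t-1})\right) \\
H_t &= (1 - Z_t) \odot H_{t-1} + Z_t \odot \tilde{H}_t
\end{align*}
```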
In Eq. (4) – (7), the operators $*$ and $\odot$ indicate convolution and element-wise matrix multiplication, respectively. The capitalized variables highlight the fact that they are matrices of dimension $F \times C$ at every frame instant, where $F$ and $C$ are the dimensions of the frequency and channel axes, respectively. While training in this setting, the network learns the kernels that maximize the local bin recurrency, which ensures the best features at the network layers. It is worth noting that, unlike fully connected RNNs, which use memory cells to store long-term contextual information, gruCNN uses none, which in turn reduces the parameter complexity.
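The recurrent feature extraction step can be illustrated with a minimal sketch in plain Python. It assumes the standard GRU gating, a single channel, and a 1-D convolution along the frequency axis; the six kernel names (`wz`, `uz`, `wr`, `ur`, `wh`, `uh`) and the toy values are hypothetical and are not taken from the released implementation.

```python
import math

def conv1d_same(x, kernel):
    """Correlate a 1-D sequence with an odd-length kernel, zero-padded to keep its length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_conv_step(x_t, h_prev, k):
    """One recurrent feature-extraction step over the frequency bins of a single frame."""
    # Update and reset gates see the current frame and the previous feature state.
    z = [sigmoid(a + b) for a, b in zip(conv1d_same(x_t, k["wz"]), conv1d_same(h_prev, k["uz"]))]
    r = [sigmoid(a + b) for a, b in zip(conv1d_same(x_t, k["wr"]), conv1d_same(h_prev, k["ur"]))]
    # Candidate features use the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(conv1d_same(x_t, k["wh"]), conv1d_same(gated, k["uh"]))]
    # No separate memory cell: the new state is a gated blend of old state and candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def run_over_frames(frames, kernels):
    """Apply the step frame by frame, starting from an all-zero feature state."""
    h = [0.0] * len(frames[0])
    states = []
    for x_t in frames:
        h = gru_conv_step(x_t, h, kernels)
        states.append(h)
    return states

# Hypothetical toy kernels (size 3) and a 4-frame, 5-bin input.
KERNELS = {name: [0.1, 0.2, -0.1] for name in ("wz", "uz", "wr", "ur", "wh", "uh")}
FRAMES = [[0.5, -0.2, 0.1, 0.0, 0.3], [0.1, 0.4, -0.3, 0.2, 0.0],
          [0.0, 0.0, 0.6, -0.1, 0.2], [0.3, 0.1, 0.0, 0.5, -0.4]]
STATES = run_over_frames(FRAMES, KERNELS)
```

Because each state depends only on the current and earlier frames, changing a later frame leaves the earlier feature maps untouched, which is the causal behaviour described in Section II.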
By stacking a set of gruCNN modules one after another, the gruCNN_FC-SE network has the final structure shown in Fig. 1. By looking into the features already extracted at the previous frame instant, the model distills the features that are most temporally relevant in the current context. At the end of the model architecture, a time-distributed fully connected layer regresses the recurrently extracted features onto the enhanced spectral bins. These predictions are combined with the noisy phase information to reconstruct the enhanced speech samples.
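The last step, pairing the predicted spectral bins with the noisy phase, can be sketched as follows. This is an illustration under stated assumptions: the predictions are taken to be natural-log power values, and the function name `recombine` is hypothetical. The inverse STFT with overlap-add that would follow is not shown.

```python
import cmath
import math

def recombine(enhanced_log_power, noisy_stft):
    """Attach the phase of each noisy STFT bin to the matching enhanced magnitude."""
    out = []
    for log_power, noisy_bin in zip(enhanced_log_power, noisy_stft):
        magnitude = math.sqrt(math.exp(log_power))  # assumes natural-log power
        out.append(cmath.rect(magnitude, cmath.phase(noisy_bin)))
    return out

# One noisy bin with magnitude 5; the (hypothetical) prediction has unit power.
ENHANCED = recombine([0.0], [3 + 4j])
```

The enhanced bin keeps the direction of the noisy bin in the complex plane and only its length changes.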
As the primary focus is on evaluating the efficacy of the suggested recurrent feature extraction strategy over the conventional CNN architecture, the compared models should share the same parameter settings. For this purpose, a model without any recurrent connection in the feature extracting CNN layers is considered (CNN_FC-SE). Since it does not incorporate any form of temporal recurrency in its modelling, its architecture is similar to Fig. 1, but without the recurrent connections. Secondly, to quantify the benefit of modelling recurrency precisely at the feature extraction stage, a model with the front-end CNN layers followed by a standard fully connected LSTM module (CNN_LSTM-SE) was implemented. Similar architectures have been reported for speech enhancement with minor variations.
All the models considered have six convolutional layers (recurrent / causal) at the front, followed by a final fully connected (recurrent / causal) layer. The convolutional kernels of each layer are set to size [3 × 3], looking into the immediate past and future frame activities while extracting the current frame features. Although one could tune these numbers, this setting was found suitable for easily disentangling the performance gain of the different models. Each layer of the models has a channel depth of 256 with Parametric ReLU (PReLU) activation. Further details about the individual layers are given in TABLE I, for an input tensor of shape [1, 161, 128, 1].
| Layer | CNN_FC-SE | CNN_LSTM-SE | gruCNN_FC-SE | Output shape |
|---|---|---|---|---|
| 1 | CNN | CNN | gruCNN | [1, 161, 128, 256] |
| 2 | CNN | CNN | gruCNN | [1, 161, 128, 256] |
| 3 | Maxpool | Maxpool | Maxpool | [1, 81, 128, 256] |
| 4 | CNN | CNN | gruCNN | [1, 81, 128, 256] |
| 5 | CNN | CNN | gruCNN | [1, 81, 128, 256] |
| 6 | Maxpool | Maxpool | Maxpool | [1, 41, 128, 256] |
| 7 | CNN | CNN | gruCNN | [1, 41, 128, 256] |
| 8 | CNN | CNN | gruCNN | [1, 41, 128, 256] |
| 9 | FC | FC_LSTM | FC | [1, 161, 128, 1] |
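The frequency-axis sizes in the table (161, 81, 41) are what one would get from max-pooling with stride 2 and "same" padding applied along the frequency axis only, while the 128 frames are left untouched. That pooling configuration is an assumption inferred from the shapes, not a stated setting; the short sketch below only checks the arithmetic.

```python
import math

LAYER_PLAN = ("conv", "conv", "pool", "conv", "conv", "pool", "conv", "conv")

def pooled_bins(n_bins, stride=2):
    """Bins left after max-pooling with 'same' padding (assumed configuration)."""
    return math.ceil(n_bins / stride)

def frequency_sizes(n_bins=161, plan=LAYER_PLAN):
    """Frequency-axis size after each of the eight front-end layers."""
    sizes = []
    for layer in plan:
        if layer == "pool":
            n_bins = pooled_bins(n_bins)
        sizes.append(n_bins)  # convolutions with 'same' padding keep the size
    return sizes

SIZES = frequency_sizes()
```

The final fully connected layer then maps the 41 × 256 features of each frame back to the 161 output bins.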
Data Set (Training and Testing): The speech set is a selection of ten British English speakers, both male and female, from the Voice Bank speech corpus, each of whom has around 400 clean utterances. Data from eight speakers were used for training, and the remaining two (one male and one female) were reserved for performance testing. The noisy mixtures were created manually. The noises are from the NOISEX data set, which contains 20 different types of common environmental noises. Fourteen of these were used for training, and the remaining six were used as the unseen noises under which the models are tested. For the training set mixtures, each speech sample was mixed with a random training set noise at a random SNR point from [0, 5, 10, 15, 20] dB. A similar process was repeated for the test set, but with the unseen noises at the unseen SNR points of [2.5, 12.5, 22.5] dB.
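Creating a mixture at a chosen SNR amounts to scaling the noise so that the speech-to-noise power ratio hits the target. The sketch below shows one common way to do this; the helper names and the synthetic signals are hypothetical, and the noise is assumed to be at least as long as the speech.

```python
import math
import random

TRAIN_SNRS_DB = [0, 5, 10, 15, 20]
TEST_SNRS_DB = [2.5, 12.5, 22.5]

def power(x):
    """Mean power of a signal given as a list of samples."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise to the requested SNR and add it to the speech."""
    noise = noise[:len(speech)]
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, mixture):
    """SNR of a mixture, measured against the known clean speech."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(power(speech) / power(residual))

# Synthetic stand-ins for a clean utterance and a noise recording.
speech = [math.sin(0.05 * i) for i in range(1600)]
noise = [0.4 * math.cos(1.3 * i) for i in range(2000)]
snr_db = random.choice(TRAIN_SNRS_DB)
mixture = mix_at_snr(speech, noise, snr_db)
```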
Before being fed into the models, the signals, sampled at 16 kHz, were split into 20 ms frames with 10 ms overlap. The frames are transformed by the short-time Fourier transform (STFT) with 320 STFT points. Finally, the log-power spectra were used as the input / output features of the models.
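The numbers above fix the frame length, hop size and number of frequency bins. The following sketch works through that arithmetic; it assumes no padding at the signal edges, and the small constant added inside the logarithm is an assumption for numerical safety.

```python
import math

SAMPLE_RATE = 16000
FRAME_MS = 20
OVERLAP_MS = 10
N_FFT = 320

FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000            # samples per frame
HOP = FRAME_LEN - SAMPLE_RATE * OVERLAP_MS // 1000    # samples between frame starts
N_BINS = N_FFT // 2 + 1                               # one-sided spectrum of a real signal

def num_frames(n_samples):
    """Number of full frames in a signal, without edge padding (assumed)."""
    if n_samples < FRAME_LEN:
        return 0
    return 1 + (n_samples - FRAME_LEN) // HOP

def log_power(stft_bin, eps=1e-12):
    """Log-power feature of one complex STFT bin."""
    return math.log(abs(stft_bin) ** 2 + eps)
```

With these settings a training segment of 128 frames spans 127 × 160 + 320 = 20,640 samples, or about 1.29 s of audio.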
Model Training: All the compared models are trained in an end-to-end mode, where the loss is computed directly between the magnitudes of the predicted ($\hat{Y}$) and the target ($Y$) STFT components. For each noisy-clean training set pair $(X, Y)$, the model parameters are optimised by minimising the mean square error (MSE) objective function

$$\mathcal{L}(Y, \hat{Y}) = \frac{1}{KT} \sum_{t=1}^{T} \sum_{k=1}^{K} \left( |Y_k^t| - |\hat{Y}_k^t| \right)^2$$

where $K$ denotes the total number of output STFT bins, that is 161, and $T$ is the number of time frames recurrently generated in the training process, which has been set to $T = 128$. The value of $T$ at test time varies with the input signal duration, as the recurrency is modeled over the temporal axis. The loss was minimized by the Adam optimizer with an exponentially decaying learning rate, using an initial learning rate of 0.001, decay steps = 20,000 and decay rate = 0.99.
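The objective and the learning rate schedule can be written down in a few lines. This is a sketch under assumptions: the loss is averaged over all K × T entries, and the schedule follows the usual form base × rate^(step / decay_steps); whether the released code uses the continuous or the staircase variant is not stated, so both are shown.

```python
def mse_loss(predicted, target):
    """Mean squared error over a K x T grid given as nested lists."""
    total = 0.0
    count = 0
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count

def decayed_learning_rate(step, base=0.001, decay_steps=20000,
                          decay_rate=0.99, staircase=False):
    """Exponentially decayed learning rate at a given training step."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return base * decay_rate ** exponent
```

With these values the rate drops by only 1% every 20,000 steps, so it stays close to 0.001 for a long stretch of training.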
For the objective evaluation of the processed samples, the perceptual evaluation of speech quality (PESQ) metric, which measures quality, and the short-time objective intelligibility (STOI) metric, which measures intelligibility, are considered. The composite quality of the model's predictions (COVL) has also been measured; it reports a compound score of the noise reduction and speech restoration. In addition, the SNR intelligibility gain obtained through model processing is measured by the segmental SNR (SSNR) score. Subjectively, the quality of the enhanced samples was measured by the mean opinion score (MOS). In total, 20 participants (non-native English speakers) listened to the samples and assigned an individual perceptual score based on the noise artifacts present, on a scale of 1-5 (1: very annoying artifacts, 5: no artifacts at all).
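Segmental SNR averages a frame-wise SNR over the utterance. The sketch below follows a common formulation in which each frame score is clamped to the range [-10, 35] dB before averaging; the clamping limits, the frame settings and the function name are assumptions and may differ from the evaluation code used for the reported scores.

```python
import math

def segmental_snr(clean, enhanced, frame_len=320, hop=160,
                  low_db=-10.0, high_db=35.0, eps=1e-10):
    """Average of clamped frame-wise SNRs between a clean and an enhanced signal."""
    if len(clean) < frame_len:
        raise ValueError("signal is shorter than one frame")
    scores = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal_energy = sum(x * x for x in c)
        error_energy = sum((x - y) ** 2 for x, y in zip(c, e))
        snr = 10 * math.log10((signal_energy + eps) / (error_energy + eps))
        scores.append(min(max(snr, low_db), high_db))  # limit the influence of outlier frames
    return sum(scores) / len(scores)
```

The SSNR gain of a model is then the score of its output minus the score of the unprocessed noisy input, both measured against the same clean reference.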
The mean objective scores on 220 test samples at each noise condition are displayed in TABLE II. Along with the processed outputs, the scores of the unprocessed noisy speech are included to better show the relative gain. Compared to the CNN_FC-SE architecture, which does not incorporate any form of the recurrency described in Eq. (1) – (3), the suggested gruCNN_FC-SE model, with recurrency modelled in the feature extraction layers, performs distinctly better on all the metrics. This gain is almost consistent across the noise conditions. With the inclusion of global recurrency, the performance of CNN_LSTM-SE improves over CNN_FC-SE. This broadly conveys the benefits that can be achieved through temporally inclusive modeling.
When comparing the two recurrent models, the gruCNN_FC-SE model, which accounts for the bin-wise recurrency, shows better enhancement than CNN_LSTM-SE, which only models the global recurrency patterns. Even at the higher SNR point of 22.5 dB, where the noise attributes are mild, the gruCNN_FC-SE model produced noticeable enhancement, showing an SNR intelligibility gain of up to 1.5 dB over the other models. This is attributable to the new feature extraction strategy of the network.
Regarding the consistency of the model's predictions across noise types, the model enhancement in two noise conditions is plotted in Fig. 2. The first row (type–1) is a construction noise, and the second row (type–2) is a street noise. Since the type–1 noise has its spectral energy distributed equally throughout the lower range of the frequency band (0 - 3 kHz), including regions where the speech activity is negligible, it is straightforward for a neural network to get a correct estimate of the noise activity. In the type–2 noise, by contrast, the noise attributes are highly localized in the very low band (0 - 0.5 kHz) of the spectrum, marked by the box, where the speech activity is noticeable. Unless the model looks into the local statistics of the spectrum, this noise could easily be misclassified as a speech event. This is what happened in the case of CNN_FC-SE and CNN_LSTM-SE, whereas gruCNN_FC-SE succeeded in disentangling the noise activity through the exploration of the local patterns.
The subjective scores of the different models are displayed in TABLE III. In line with the objective scores, the suggested gruCNN_FC-SE model is ranked closest to the clean speech with a score of 3.16 on the 5-point scale, while there was no statistically significant difference between the scores of the other two methods.
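A mean opinion score with its standard error can be computed from the individual ratings as sketched below. The ratings in the example are made up purely to show the calculation and are not the listening test data.

```python
import math
import statistics

def mos_with_standard_error(ratings):
    """Mean opinion score and the standard error of that mean."""
    mean = statistics.mean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, standard_error

# Hypothetical ratings on the 1-5 scale from five listeners.
EXAMPLE_MOS, EXAMPLE_SE = mos_with_standard_error([3, 4, 3, 2, 4])
```

Two systems whose means differ by much less than their standard errors, as reported for the two baseline models, cannot be reliably ranked from the listening test alone.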
Pragmatically, the performance gain of a neural model could be attributed to the additional parameters introduced into the modeling. To examine this, the parameter counts of the different models are tabulated in TABLE IV. Though CNN_FC-SE has the lowest parameter count among the models, its performance is much weaker than that of the other two models. On the other hand, the suggested gruCNN_FC-SE produces far better enhancement with only 75% of the parameters of CNN_LSTM-SE. This reduction in complexity comes from replacing the fully connected LSTM layer with the fixed kernels of gruCNN to model the temporal flow. All of this indicates the potential for implementation in computationally constrained applications, such as hearing aids. A TensorFlow implementation and enhanced samples from the model are provided at https://www.csd.uoc.gr/~shifaspv/IEEE_Letter-demo and https://github.com/shifaspv/gruCNN-speech-enhancement-tensorflow.
Mean opinion score (MOS) with standard error
In this letter, we presented the concept of recurrent feature extraction, which is beneficial for single-channel speech enhancement. In contrast to the traditional CNN based feature extraction approach, the suggested feature extraction module with recurrent connections in the convolution layers has proven effective, especially in conditions where the noise activity is localized in nature. Subjective and objective evaluations have confirmed the benefits of the recurrent feature extraction technique. At the same time, the parameter complexity of the modelling is reduced by 25%. On this ground, there is clear reason to believe that the same approach could be extended to advanced enhancement models, like WaveNet and SEGAN.
Neural networks for pattern recognition. Oxford University Press.
Speech enhancement based on deep denoising autoencoder. In Interspeech, pp. 436–440.
Low latency sound source separation using convolutional recurrent neural networks. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 71–75.
Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication 12 (3), pp. 247–251.