Speech Enhancement Generative Adversarial Network in TensorFlow
Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm its effectiveness. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.
Speech enhancement tries to improve the intelligibility and quality of speech contaminated by additive noise. Its main applications are related to improving the quality of mobile communications in noisy environments. However, we also find important applications related to hearing aids and cochlear implants, where enhancing the signal before amplification can significantly reduce discomfort and increase intelligibility. Speech enhancement has also been successfully applied as a preprocessing stage in speech recognition and speaker identification systems [3, 4, 5].
Classic speech enhancement methods are spectral subtraction, Wiener filtering, statistical model-based methods, and subspace algorithms [9, 10]. Neural networks have also been applied to speech enhancement since the 80s [11, 12]. Recently, the denoising auto-encoder architecture has been widely adopted. However, recurrent neural networks (RNNs) are also used. For instance, the recurrent denoising auto-encoder has shown significant performance exploiting the temporal context information in embedded signals. Most recent approaches apply long short-term memory networks to the denoising task [4, 14]. In other works, noise features are estimated and included in the input features of deep neural networks. The use of dropout, post-filtering, and perceptually motivated metrics has been shown to be effective.
Most of the current systems are based on the short-time Fourier analysis/synthesis framework. They only modify the spectrum magnitude, as it is often claimed that short-time phase is not important for speech enhancement. However, further studies show that significant improvements of speech quality are possible, especially when a clean phase spectrum is known. In 1988, Tamura et al. proposed a deep network that worked directly on the raw audio waveform, but they used feed-forward layers that worked frame-by-frame (60 samples) on a speaker-dependent and isolated-word database.
A recent breakthrough in the deep learning generative modeling field is generative adversarial networks (GANs). GANs have achieved a good level of success in the computer vision field to generate realistic images and generalize well to pixel-wise, complex (high-dimensional) distributions [20, 21, 22]. To the best of our knowledge, GANs have not yet been applied to any speech generation or enhancement task, so this is the first approach to use the adversarial framework to generate speech signals.
The main advantages of the proposed speech enhancement GAN (SEGAN) are:
It provides a quick enhancement process. No causality is required and, hence, there is no recursive operation like in RNNs.
It works end-to-end, with the raw audio. Therefore, no hand-crafted features are extracted and, with that, no explicit assumptions about the raw data are made.
It learns from different speakers and noise types, and incorporates them together into the same shared parametrization. This makes the system simple and generalizable in those dimensions.
GANs are generative models that learn to map samples z from some prior distribution to samples x from another distribution, which is the one of the training examples (e.g., images, audio, etc.). The component within the GAN structure that performs the mapping is called the generator (G), and its main task is to learn an effective mapping that can imitate the real data distribution to generate novel samples related to those of the training set. Importantly, G does so not by memorizing input-output pairs, but by mapping the data distribution characteristics to the manifold defined in our prior.
The way in which G learns to do the mapping is by means of adversarial training, where we have another component, called the discriminator (D). D is typically a binary classifier, and its inputs are either real samples, coming from the dataset that G is imitating, or fake samples, made up by G. The adversarial characteristic comes from the fact that D has to classify the samples coming from the training data as real, whereas the samples coming from G have to be classified as fake. This leads to G trying to fool D, and the way to do so is that G adapts its parameters such that D classifies G’s output as real. During back-propagation, D gets better at finding realistic features in its input and, in turn, G corrects its parameters to move towards the real data manifold described by the training data (Fig. 1). This adversarial learning process is formulated as a minimax game between G and D, with the objective
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{1}$$
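We can also work with a conditioned version of GANs, where we have some extra information in G and D to perform mapping and classification. In that case, we may add some extra input $x_c$, with which we change the objective function to
$$\min_G \max_D V(D, G) = \mathbb{E}_{x, x_c \sim p_{\text{data}}(x, x_c)}[\log D(x, x_c)] + \mathbb{E}_{z \sim p_z(z),\, x_c \sim p_{\text{data}}(x_c)}[\log(1 - D(G(z, x_c), x_c))]. \tag{2}$$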
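There have been recent improvements in the GAN methodology to stabilize training and increase the quality of the samples generated by G. For instance, the classic approach suffered from vanishing gradients due to the sigmoid cross-entropy loss used for training. To solve this, the least-squares GAN (LSGAN) approach replaces the cross-entropy loss with the least-squares function with binary coding (1 for real, 0 for fake). With this, the formulation in Eq. 2 changes to
$$\min_D V_{\text{LSGAN}}(D) = \frac{1}{2}\,\mathbb{E}_{x, x_c \sim p_{\text{data}}(x, x_c)}[(D(x, x_c) - 1)^2] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z),\, x_c \sim p_{\text{data}}(x_c)}[D(G(z, x_c), x_c)^2], \tag{3}$$
$$\min_G V_{\text{LSGAN}}(G) = \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z),\, x_c \sim p_{\text{data}}(x_c)}[(D(G(z, x_c), x_c) - 1)^2]. \tag{4}$$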
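The least-squares objective with binary coding (1 for real, 0 for fake) can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' TensorFlow code: the function names `lsgan_d_loss` and `lsgan_g_loss` are hypothetical, and the inputs are assumed to be raw (linear) discriminator outputs for a batch of real and generated samples.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Discriminator loss: push D towards 1 on real samples and 0 on fake ones."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Generator loss: push D's output on generated samples towards 1 (real)."""
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

A perfect discriminator (1 on real, 0 on fake) has zero loss under this sketch, while the generator loss is zero only when D scores its outputs as real. Unlike the sigmoid cross-entropy, the quadratic penalty keeps a non-zero gradient for samples that D classifies confidently.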
The enhancement problem is defined so that we have an input noisy signal $\tilde{x}$ and we want to clean it to obtain the enhanced signal $\hat{x}$. We propose to do so with a speech enhancement GAN (SEGAN). In our case, the G network performs the enhancement. Its inputs are the noisy speech signal $\tilde{x}$ together with the latent representation z, and its output is the enhanced version $\hat{x}$. We design G to be fully convolutional, so that there are no dense layers at all. This forces the network to focus on temporally-close correlations in the input signal and throughout the whole layering process. Furthermore, it reduces the number of training parameters and hence training time.
The G network is structured similarly to an auto-encoder (Fig. 2). In the encoding stage, the input signal is compressed through a number of strided convolutional layers followed by parametric rectified linear units (PReLUs) [23], getting a convolution result out of every N steps of the filter. We choose strided convolutions as they were shown to be more stable for GAN training than other pooling approaches. Decimation is done until we get a condensed representation, called the thought vector c, which gets concatenated with the latent vector z. The encoding process is reversed in the decoding stage by means of fractional-strided transposed convolutions (sometimes called deconvolutions), followed again by PReLUs.
The G network also features skip connections, connecting each encoding layer to its homologous decoding layer, and bypassing the compression performed in the middle of the model (Fig. 2). This is done because the input and output of the model share the same underlying structure, which is that of natural speech. Therefore, many low-level details needed to reconstruct the speech waveform properly could be lost if we force all information to flow through the compression bottleneck. Skip connections directly pass the fine-grained information of the waveform to the decoding stage (e.g., phase, alignment). In addition, they offer a better training behavior, as the gradients can flow deeper through the whole structure.
An important feature of G is its end-to-end structure, so that it processes raw speech sampled at 16 kHz, getting rid of any intermediate transformations to extract acoustic features (in contrast to many common pipelines). In this type of model, we have to be careful with typical regression losses like mean absolute error or mean squared error, as noted in the raw speech generative model WaveNet. These losses work under strong assumptions on how our output distribution is shaped and, therefore, impose important modeling limitations (like not allowing multi-modal distributions and biasing the predictions towards an average of all the possible predictions). Our solution to overcome these limitations is to use the generative adversarial setting. This way, D is in charge of transmitting information to G of what is real and what is fake, such that G can slightly correct its output waveform towards the realistic distribution, getting rid of the noisy signals as those are signaled to be fake. In this sense, D can be understood as learning some sort of loss for G’s output to look real.
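In preliminary experiments, we found it convenient to add a secondary component to the loss of G in order to minimize the distance between its generations and the clean examples. To measure such distance, we chose the $L_1$ norm, as it has been proven to be effective in the image manipulation domain [20, 26]. This way, we let the adversarial component add more fine-grained and realistic results. The magnitude of the $L_1$ norm is controlled by a new hyper-parameter $\lambda$. Therefore, the G loss, which we choose to be the one of LSGAN (Eq. 4), becomes
$$\min_G V_{\text{LSGAN}}(G) = \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{\text{data}}(\tilde{x})}[(D(G(z, \tilde{x}), \tilde{x}) - 1)^2] + \lambda\,\|G(z, \tilde{x}) - x\|_1, \tag{5}$$
where $\tilde{x}$ is the noisy input and $x$ is the clean reference.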
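The combined generator loss described above (a least-squares adversarial term plus a weighted distance to the clean signal) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function name `segan_g_loss` is hypothetical, the distance is taken as the mean absolute error (averaging rather than summing only rescales the weight), and the default weight of 100 is the value reported in the experimental setup.

```python
import numpy as np

def segan_g_loss(d_fake, enhanced, clean, lam=100.0):
    """Adversarial least-squares term plus a weighted absolute-error term.

    d_fake   : discriminator outputs for the enhanced (generated) signals
    enhanced : generator output waveform(s)
    clean    : clean reference waveform(s), same shape as `enhanced`
    lam      : weight of the absolute-error term (assumed default: 100)
    """
    d_fake = np.asarray(d_fake, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    clean = np.asarray(clean, dtype=float)
    adversarial = 0.5 * np.mean((d_fake - 1.0) ** 2)
    distance = np.mean(np.abs(enhanced - clean))  # assumption: mean, not sum
    return adversarial + lam * distance
```

With this form, the distance term pulls the output towards the clean waveform, while the adversarial term is left to correct the finer details that an average-seeking regression loss tends to smooth out.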
To evaluate the effectiveness of the SEGAN, we resort to the data set by Valentini et al. We choose it because it is open and available (http://dx.doi.org/10.7488/ds/1356), and because the amount and type of data fits our purposes for this work: generalizing on many types of noise for many different speakers. The data set is a selection of 30 speakers from the Voice Bank corpus: 28 are included in the train set and 2 in the test set.
To make the noisy training set, a total of 40 different conditions are considered: 10 types of noise (2 artificial and 8 from the Demand database) with 4 signal-to-noise ratios (SNRs) each (15, 10, 5, and 0 dB). There are around 10 different sentences in each condition per training speaker. To make the test set, a total of 20 different conditions are considered: 5 types of noise (all from the Demand database) with 4 SNRs each (17.5, 12.5, 7.5, and 2.5 dB). There are around 20 different sentences in each condition per test speaker. Importantly, the test set is totally unseen by (and different from) the training set, using different speakers and conditions.
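The condition counts above follow from crossing noise types with SNR levels. The short sketch below enumerates them; the noise names are placeholders invented for illustration, since the text does not list the individual noise types.

```python
from itertools import product

# Placeholder noise labels (hypothetical): only the counts come from the text.
train_noises = [f"train_noise_{i}" for i in range(10)]  # 2 artificial + 8 Demand
test_noises = [f"test_noise_{i}" for i in range(5)]     # all from Demand

train_snrs_db = [15, 10, 5, 0]
test_snrs_db = [17.5, 12.5, 7.5, 2.5]

# Each condition is a (noise type, SNR in dB) pair.
train_conditions = list(product(train_noises, train_snrs_db))
test_conditions = list(product(test_noises, test_snrs_db))

print(len(train_conditions), len(test_conditions))
```

Note that the two SNR grids do not share any value, which is one of the ways in which the test conditions differ from the training ones.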
Regarding the weight $\lambda$ of our $L_1$ regularization, after some experimentation, we set it to 100 for the whole training. We initially set it to 1, but we observed that the $L_1$ term of the G loss was two orders of magnitude under the adversarial one, so the $L_1$ had no practical effect on the learning. Once we set it to 100, we saw a minimization behavior in the $L_1$ and an equilibrium behavior in the adversarial one. As the $L_1$ got lower, the quality of the output samples increased, which we hypothesize helped G be more effective in terms of realistic generation.
Regarding the architecture, G is composed of 22 one-dimensional strided convolutional layers of filter width 31 and strides of N = 2. The amount of filters per layer increases so that the depth gets larger as the width (duration of signal in time) gets narrower. The resulting dimensions per layer, expressed as samples × feature maps, are 16384 × 1, 8192 × 16, 4096 × 32, 2048 × 32, 1024 × 64, 512 × 64, 256 × 128, 128 × 128, 64 × 256, 32 × 256, 16 × 512, and 8 × 1024. There, we sample the noise samples z from our prior 8 × 1024-dimensional normal distribution. As mentioned, the decoder stage of G mirrors the encoder, with the same filter widths and the same amount of filters per layer. However, skip connections and the addition of the latent vector double the number of feature maps in every layer.
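The layer dimensions listed above can be reproduced with simple bookkeeping. The sketch below is only shape arithmetic, not a network: the function names are hypothetical, a stride of 2 is assumed because every layer halves the signal length, and the decoder shapes reflect one reading of the text, namely that each decoder layer receives its input concatenated with either the latent vector z (at the bottleneck) or the matching skip connection.

```python
def encoder_shapes(input_len=16384, stride=2,
                   num_filters=(16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024)):
    """Return (samples, feature maps) at the input and after each encoder layer."""
    shapes = [(input_len, 1)]
    length = input_len
    for maps in num_filters:
        length //= stride  # each strided convolution halves the time axis
        shapes.append((length, maps))
    return shapes

def decoder_input_shapes(enc_shapes):
    """Return the (samples, feature maps) fed to each decoder layer.

    The feature maps are doubled by concatenation: with z at the bottleneck,
    and with the homologous encoder output (skip connection) elsewhere.
    """
    return [(length, 2 * maps) for length, maps in reversed(enc_shapes[1:])]

enc = encoder_shapes()
dec_in = decoder_input_shapes(enc)
for shape in enc:
    print("encoder:", shape)
for shape in dec_in:
    print("decoder input:", shape)
```

Under these assumptions there are 11 encoder and 11 decoder layers, which matches the 22 layers stated for G.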
The network D follows the same one-dimensional convolutional structure as G’s encoder stage, and it fits the conventional topology of a convolutional classification network. The differences are that (1) it gets two input channels of 16384 samples, (2) it uses virtual batch-norm before LeakyReLU non-linearities, and (3) in the last activation layer there is a one-dimensional convolution layer with one filter of width one that does not downsample the hidden activations (1 × 1 convolution). The latter (3) reduces the amount of parameters required for the final classification neuron, which is fully connected to all hidden activations with a linear behavior. This means that we reduce the amount of required parameters in that fully-connected component from 8 × 1024 = 8192 to 8, and the way in which the 1024 channels are merged is learnable in the parameters of the convolution.
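The parameter saving in the last stage of D is plain arithmetic, sketched below. Bias terms are ignored for simplicity, which is an assumption of this illustration, and the variable names are hypothetical.

```python
time_steps, channels = 8, 1024  # shape of D's last hidden activations

# Without the width-one convolution: one neuron connected to every activation.
dense_weights_without_conv = time_steps * channels

# With it: a single filter of width one merges the channels at each time step,
# so the final neuron only sees one value per time step.
conv1x1_weights = channels * 1 * 1
dense_weights_with_conv = time_steps * 1

print(dense_weights_without_conv, conv1x1_weights, dense_weights_with_conv)
```

The channel merging is still learned, but its weights are shared across the 8 time steps instead of being duplicated for each of them.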
To evaluate the quality of the enhanced speech, we compute the following objective measures (the higher the better). All metrics compare the enhanced signal with the clean reference of the 824 test set files. They have been computed using the implementation available at the publisher website (https://www.crcpress.com/downloads/K14513/K14513_CD_Files.zip).
PESQ: Perceptual evaluation of speech quality, using the wide-band version recommended in ITU-T P.862.2 (from –0.5 to 4.5).
CSIG: Mean opinion score (MOS) prediction of the signal distortion attending only to the speech signal (from 1 to 5).
CBAK: MOS prediction of the intrusiveness of background noise (from 1 to 5).
COVL: MOS prediction of the overall effect (from 1 to 5).
SSNR: Segmental SNR [35, p. 41] (from 0 to ∞).
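Of the measures above, the segmental SNR is the simplest to write down: the SNR is computed per frame and then averaged over frames. The following NumPy sketch is not the implementation used for the reported results; the function name, the frame length of 512 samples, the use of non-overlapping frames, and the optional clamping range are all assumptions made for illustration. Clamping each frame's value to a fixed dB range is a common practice, offered here as an option rather than as the paper's setting.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, clip_db=None, eps=1e-10):
    """Average per-frame SNR (in dB) between a clean and an enhanced signal.

    frame_len : samples per non-overlapping frame (assumed value)
    clip_db   : optional (low, high) range to clamp each frame's SNR
    eps       : small constant that avoids division by zero and log(0)
    """
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = min(len(clean), len(enhanced)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    used = n_frames * frame_len
    c = clean[:used].reshape(n_frames, frame_len)
    e = enhanced[:used].reshape(n_frames, frame_len)
    signal_energy = np.sum(c ** 2, axis=1)
    error_energy = np.sum((c - e) ** 2, axis=1)
    snr_db = 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
    if clip_db is not None:
        snr_db = np.clip(snr_db, clip_db[0], clip_db[1])
    return float(np.mean(snr_db))
```

For example, an enhanced signal equal to 0.9 times the clean one has an error energy 100 times smaller than the signal energy in every frame, which gives 20 dB.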
Table 1 shows the results of these metrics. To have a comparative reference, it also shows the results of these metrics when applied directly to the noisy signals and to signals filtered using the Wiener method based on a priori SNR estimation. It can be observed that SEGAN gets slightly worse PESQ. However, in all the other metrics, which better correlate with speech/noise distortion, SEGAN outperforms the Wiener method. It produces less speech distortion (CSIG) and removes noise more effectively (CBAK and SSNR). Therefore, it achieves a better tradeoff between the two factors (COVL).
A perceptual test has also been carried out to compare SEGAN with the noisy signal and the Wiener baseline. For that, 20 sentences were selected from the test set. As the database does not indicate the amount and type of noise for each file, the selection was done by listening to some of the provided noisy files, trying to balance different noise types. Most of the files have low SNR, but a few with high SNR were also included.
A total of 16 listeners were presented with the 20 sentences in a randomized order. For each sentence, the following three versions were presented, also in random order: noisy signal, Wiener-enhanced signal, and SEGAN-enhanced signal. For each signal, the listener rated the overall quality, using a scale from 1 to 5. In the description of the 5 categories, they were instructed to pay attention to both the signal distortion and the noise intrusiveness (e.g., 5=excellent: very natural speech with no degradation and not noticeable noise). Listeners could listen to each signal as many times as they wanted, and were asked to pay attention to the comparative rate of the three signals.
In Table 2, it can be observed that SEGAN is preferred over both the noisy signal and the Wiener baseline. However, as there is a large variation in the SNR of the noisy signal, the MOS range is very large, and the difference between Wiener and SEGAN is not significant. Nevertheless, as the listeners compared all the systems at the same time, it is possible to compute the comparative MOS (CMOS) by subtracting the MOS of the two systems being compared. Fig. 4 depicts this relative comparison. We can see that the signals generated by SEGAN are preferred. More specifically, SEGAN is preferred over the original (noisy) signal in 67% of the cases, while the noisy signal is preferred in 8% of the cases (no preference in 25% of the cases). With respect to the Wiener system, SEGAN is preferred in 53% of cases and Wiener is preferred in 23% of the cases (no preference in 24% of the cases).
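The comparative scores described above come from subtracting paired ratings of two systems. The sketch below shows one way to do this with NumPy; the function name is hypothetical and the ratings in the usage lines are made-up numbers for illustration, not data from the listening test.

```python
import numpy as np

def cmos_preferences(scores_a, scores_b):
    """Compare paired 1-to-5 ratings of two systems given by the same listeners.

    Returns the mean score difference (CMOS) and the fraction of ratings
    in which A is preferred, B is preferred, or neither is.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diff = a - b
    return {
        "cmos": float(diff.mean()),
        "prefer_a": float(np.mean(diff > 0)),
        "prefer_b": float(np.mean(diff < 0)),
        "no_preference": float(np.mean(diff == 0)),
    }

# Made-up ratings of four sentence presentations (illustration only).
result = cmos_preferences([4, 3, 5, 2], [3, 3, 2, 4])
print(result)
```

Because each difference is computed within the same listener and sentence, the large MOS spread caused by the varying SNR of the noisy files cancels out of the comparison.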
In this work, an end-to-end speech enhancement method has been implemented within the generative adversarial framework. The model works as an encoder-decoder fully-convolutional structure, which makes it fast to operate for denoising waveform chunks. The results show that not only is the method viable, but it can also represent an effective alternative to current approaches. Possible future work involves the exploration of better convolutional structures and the inclusion of perceptual weightings in the adversarial training, so that we reduce possible high frequency artifacts that might be introduced by the current model. Further experiments need to be done to compare SEGAN with other competitive approaches.
This work was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).
X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Proc. of INTERSPEECH, 2013, pp. 436–440.
K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. of the IEEE Int. Conf. on Computer Vision (ICCV), 2015, pp. 1026–1034.
Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
T. Tieleman and G. Hinton, “Lecture 6.5-RMSprop: divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning 4, 2, 2012.