Improved Speech Enhancement with the Wave-U-Net

11/27/2018 · Craig Macartney, et al. · City, University of London

We study the use of the Wave-U-Net architecture for speech enhancement, a model introduced by Stoller et al. for the separation of music vocals and accompaniment. This end-to-end learning method for audio source separation operates directly in the time domain, permitting the integrated modelling of phase information and allowing large temporal contexts to be taken into account. Our experiments show that the proposed method improves several metrics, namely PESQ, CSIG, CBAK, COVL and SSNR, over the state of the art with respect to the speech enhancement task on the Voice Bank corpus (VCTK) dataset. We find that a reduced number of hidden layers is sufficient for speech enhancement in comparison to the original system designed for singing voice separation in music. We see this initial result as an encouraging signal to further explore speech enhancement in the time domain, both as an end in itself and as a pre-processing step for speech recognition systems.


1 Introduction

Audio source separation refers to the problem of extracting one or more target sources while suppressing interfering sources and noise (Vincent et al., 2018). Two related tasks are speech enhancement and singing voice separation, both of which involve extracting the human voice as a target source. The former aims to improve speech intelligibility and quality when the speech is obscured by additive noise (Loizou, 2013; Pascual et al., 2017; Vincent et al., 2018), whilst the latter focuses on separating music vocals from accompaniment (Stoller et al., 2018).

Most audio source separation methods operate not directly in the time domain, but on time-frequency representations as input and output (front-end). Since 2017, the U-Net architecture on magnitude spectrograms has achieved new state-of-the-art results in audio source separation for music (Jansson et al., 2017) and speech dereverberation (Ernst et al., 2018). Neural network architectures operating directly in the time domain have also recently been proposed for speech enhancement (Pascual et al., 2017; Rethage et al., 2018). These approaches have been combined in the Wave-U-Net (Stoller et al., 2018) and applied to singing voice separation. In this paper, we apply the Wave-U-Net to speech enhancement and show that it produces results that are better than the current state of the art.

The remainder of this paper is structured as follows. In section 2, we briefly review related work from the literature. In section 3, we briefly introduce the Wave-U-Net architecture and its application to speech. Section 4 presents the experiments we conducted and their results, including a comparison to other methods. Section 5 concludes this article with a final summary and perspectives for future work.

2 Related work

Audio source separation has seen great improvement in recent years through deep learning models (Huang et al., 2015; Nugraha et al., 2016). These methods, as well as more traditional ones, mostly operate in the time-frequency domain, ranging from deep recurrent architectures predicting soft masks (Huang et al., 2015) to convolutional encoder-decoder architectures like that of (Chandna et al., 2017). Recently, the U-Net architecture on magnitude spectrograms has achieved new state-of-the-art results in audio source separation for music (Jansson et al., 2017) and speech dereverberation (Ernst et al., 2018).

Also recently, models operating directly in the time domain have been developed. Wavenet (van den Oord et al., 2016) inspired further work in this direction, including (Pascual et al., 2017; Rethage et al., 2018). The SEGAN architecture (Pascual et al., 2017) was developed for the purpose of speech enhancement and denoising. It employs a time-domain neural network with an encoder and decoder pathway that successively halves and doubles the resolution of the feature maps in each layer, respectively, with skip connections between encoder and decoder layers. It achieved state-of-the-art results on the Voice Bank (VCTK) dataset (Valentini-Botinhao, 2017).

The Wavenet for Speech Denoising (Rethage et al., 2018), another architecture that operates directly in the time domain, takes its inspiration from (van den Oord et al., 2016). It has a non-causal conditional input and a parallel output of samples for each prediction, and is based on the repeated application of dilated convolutions with exponentially increasing dilation factors to take context information into account.
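The core mechanism, stacking convolutions whose dilation factors grow exponentially so that the receptive field grows exponentially with depth, can be illustrated with the following minimal sketch. It only demonstrates the dilation idea; the actual model of (Rethage et al., 2018) additionally uses gated residual units, conditioning on the noisy input and a parallel multi-sample output, none of which are shown here, and all names below are our own.

```python
import torch
import torch.nn as nn

class DilatedStack(nn.Module):
    """A plain stack of non-causal 1D convolutions with dilations 1, 2, 4, ..., 2**(num_layers-1)."""
    def __init__(self, channels=32, kernel_size=3, num_layers=10):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=2 ** i,
                      padding=(kernel_size - 1) * (2 ** i) // 2)  # "same" padding, hence non-causal
            for i in range(num_layers)])

    def forward(self, x):
        for conv in self.layers:
            x = torch.relu(conv(x))
        return x

# With kernel_size=3 and 10 layers, each output sample depends on
# 1 + sum_i (3 - 1) * 2**i = 1 + 2 * (2**10 - 1) = 2047 input samples.
```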

3 Wave-U-Net for Speech Enhancement

Figure 1: The Wave-U-Net architecture following (Stoller et al., 2018).

The Wave-U-Net architecture of (Stoller et al., 2018) combines elements of both of the above-mentioned time-domain architectures with the U-Net. The overall architecture is a one-dimensional U-Net with downsampling and upsampling blocks.

As in the spectrogram-based U-Net architectures (e.g. (Jansson et al., 2017)), the Wave-U-Net uses a series of downsampling and upsampling blocks to make its predictions; the time resolution is halved in each downsampling block and restored by the corresponding upsampling block.
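The following is a minimal sketch of one downsampling block and one upsampling block in PyTorch. It is an illustration of the structure just described, not the authors' implementation: the decimation (discarding every other sample) and the linear-interpolation upsampling are simplifications, and the class and parameter names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlock(nn.Module):
    """1D convolution (filter size 15) followed by decimation by a factor of 2."""
    def __init__(self, in_ch, out_ch, kernel_size=15):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        x = F.leaky_relu(self.conv(x))
        return x, x[:, :, ::2]        # (skip connection, features at half the time resolution)

class UpBlock(nn.Module):
    """Upsample to the skip connection's resolution, concatenate, then convolve (filter size 5)."""
    def __init__(self, in_ch, skip_ch, out_ch, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(in_ch + skip_ch, out_ch, kernel_size, padding=kernel_size // 2)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-1], mode="linear", align_corners=False)
        x = torch.cat([x, skip], dim=1)          # skip connection from the encoder side
        return F.leaky_relu(self.conv(x))
```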

In applying the Wave-U-Net architecture to speech enhancement, our objective is to separate a mixture waveform $m \in [-1, 1]^{L \times C}$ into $K$ source waveforms $S^1, \ldots, S^K$ with $S^k \in [-1, 1]^{L \times C}$ for all $k \in \{1, \ldots, K\}$, where $C$ is the number of audio channels and $L$ the number of audio samples. In our case of monaural speech enhancement, $K = 2$ (the speech and the noise) and $C = 1$.

In doing so, for the $K$ sources to be estimated, a 1D convolution of filter size 1 with $K \cdot C$ filters, zero-padded before convolving, is applied to convert the stack of features at each audio sample into a source prediction for each sample. This is followed by a tanh nonlinearity to obtain a source signal estimate with values in the interval $(-1, 1)$. All convolutions except the final ones are followed by LeakyReLU non-linearities.
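Continuing the sketch above, the blocks can be assembled into a model whose output stage mirrors this description: a 1x1 convolution with $K \cdot C$ filters followed by a tanh, with LeakyReLU after all other convolutions. Again, this is only an illustrative sketch under our own naming and simplified resampling, not the authors' code; the concatenation of the input mixture before the output layer used in (Stoller et al., 2018) is omitted for brevity.

```python
class WaveUNetSketch(nn.Module):
    def __init__(self, num_layers=12, extra_filters=16, in_ch=1, num_sources=2):
        super().__init__()
        # Encoder layer i outputs extra_filters * i feature channels.
        chans = [in_ch] + [extra_filters * (i + 1) for i in range(num_layers)]
        self.down = nn.ModuleList([DownBlock(chans[i], chans[i + 1]) for i in range(num_layers)])
        self.bottleneck = nn.Conv1d(chans[-1], chans[-1], 15, padding=7)
        ups, c_x = [], chans[-1]
        for i in reversed(range(num_layers)):
            ups.append(UpBlock(c_x, chans[i + 1], chans[i + 1]))
            c_x = chans[i + 1]
        self.up = nn.ModuleList(ups)
        # Final 1x1 convolution with K * C filters: one source prediction per sample.
        self.out = nn.Conv1d(c_x, num_sources * in_ch, kernel_size=1)

    def forward(self, mix):                       # mix: (batch, in_ch, samples)
        skips, x = [], mix
        for down in self.down:
            skip, x = down(x)
            skips.append(skip)
        x = F.leaky_relu(self.bottleneck(x))
        for up, skip in zip(self.up, reversed(skips)):
            x = up(x, skip)
        return torch.tanh(self.out(x))            # source estimates in (-1, 1)

# Roughly one second of audio at 16 kHz, padded to a power of two for clean decimation:
# estimates = WaveUNetSketch()(torch.randn(1, 1, 16384))   # shape (1, 2, 16384)
```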

4 Experiments

4.1 Datasets

We use the same publicly available VCTK-based dataset (Valentini-Botinhao, 2017) as the SEGAN (Pascual et al., 2017), which encourages comparisons with future speech enhancement methods.

The dataset includes clean and noisy audio data at 48kHz sampling frequency. However, like the SEGAN, we downsample to 16kHz for training and testing. The clean data are recordings of sentences, sourced from various text passages, uttered by 30 English speakers, male and female, with various accents – 28 intended for training and 2 reserved for testing (Valentini-Botinhao et al., 2016b). The noisy data were generated by mixing the clean data with various noise datasets, as per the instructions provided in (Pascual et al., 2017; Valentini-Botinhao, 2017; Valentini-Botinhao et al., 2016a).

With respect to the training set, 40 different noise conditions are considered (Pascual et al., 2017; Valentini-Botinhao et al., 2016b). These are composed of 10 types of noise (2 of which are artificially generated and 8 sourced from the DEMAND database (Thiemann et al., 2013)), each mixed with clean speech at one of 4 signal-to-noise ratios (SNR) (15, 10, 5, and 0 dB). The artificially generated noise and the full dataset are available at http://data.cstr.ed.ac.uk/cvbotinh/SE/data/ (Valentini-Botinhao, 2017). In total, this yields 11,572 training samples, with approximately 10 different sentences in each condition per training speaker.

Testing conditions are mismatched from those of the training. The speakers, noise types and SNRs are all different. The separate test set with 2 speakers, unseen during training, consists of a total of 20 different noise conditions: 5 types of noise sourced from the DEMAND database at one of 4 SNRs each (17.5, 12.5, 7.5, and 2.5 dB) (Valentini-Botinhao, 2017; Valentini-Botinhao et al., 2016a). This yields 824 test items, with approximately 20 different sentences in each condition per test speaker (Valentini-Botinhao, 2017; Valentini-Botinhao et al., 2016a).
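For illustration, the following sketch shows how a noisy mixture at a prescribed SNR can be constructed from a clean utterance and a noise recording. The exact mixing scripts and scaling conventions of (Valentini-Botinhao et al., 2016a) may differ; this only captures the SNR arithmetic, and the function name and epsilon constant are ours.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it to `clean`."""
    noise = noise[: len(clean)]                              # crop the noise to the utterance length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12                # avoid division by zero
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    return clean + noise * np.sqrt(target_noise_power / noise_power)

# One of the training SNR conditions described above:
# noisy = mix_at_snr(clean, noise, snr_db=5.0)
```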

4.2 Experimental setup

As per (Stoller et al., 2018), our baseline model trains on randomly-sampled audio excerpts, using the ADAM optimization algorithm with a learning rate of 0.0001, decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and a batch size of 16. We specify an initial network depth of 12 layers, as in (Stoller et al., 2018), although this is varied across experiments, as described in the Results section below. We also specify 16 extra filters per layer, with downsampling block filters of size 15 and upsampling block filters of size 5, as in (Stoller et al., 2018). We train for 2,000 iterations with the mean squared error (MSE) over all source output samples in a batch as the loss, and apply early stopping if there is no improvement on the validation set for 20 epochs. We use a fixed validation set of 10 randomly selected tracks. The best model is then fine-tuned with the batch size doubled and the learning rate lowered to 0.00001, again until 20 epochs have passed without improved validation loss.
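This two-stage schedule can be condensed into the following sketch (Adam with the decay rates above, MSE over all source output samples, early stopping after 20 epochs without validation improvement, then fine-tuning with a doubled batch size and a lowered learning rate). The dataset object, the validation callback and the model are placeholders, not the authors' pipeline.

```python
import torch
from torch.utils.data import DataLoader

def train_stage(model, train_set, validation_loss, lr, batch_size, patience=20):
    """Train until `patience` epochs pass without improvement of the validation loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    mse = torch.nn.MSELoss()
    best, stale = float("inf"), 0
    while stale < patience:
        for noisy, sources in DataLoader(train_set, batch_size=batch_size, shuffle=True):
            opt.zero_grad()
            loss = mse(model(noisy), sources)     # MSE over all source output samples in the batch
            loss.backward()
            opt.step()
        val = validation_loss(model)              # evaluated on the fixed 10-track validation set
        best, stale = (val, 0) if val < best else (best, stale + 1)
    return model

# Stage 1: batch size 16, learning rate 1e-4; Stage 2 (fine-tuning): batch size 32, learning rate 1e-5.
# model = train_stage(model, train_set, validation_loss, lr=1e-4, batch_size=16)
# model = train_stage(model, train_set, validation_loss, lr=1e-5, batch_size=32)
```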

4.3 Results

Metric Noisy Wiener SEGAN Wave-U-Net
PESQ  1.97     2.22     2.16           2.40
CSIG  3.35     3.23     3.48           3.52
CBAK  2.44     2.68     2.94           3.24
COVL  2.63     2.67     2.80           2.96
SSNR  1.68     5.07     7.73           9.97
Table 1: Objective evaluation - comparing the mean results of the untreated noisy signal, the Wiener-, SEGAN- and Wave-U-Net-enhanced signals. Higher scores are better for all metrics.
Metric 12-layer 11-layer 10-layer 9-layer 8-layer
PESQ     2.40     2.38     2.41     2.41     2.39
CSIG     3.49     3.47     3.43     3.54     3.51
CBAK     3.23     3.22     3.24     3.23     3.18
COVL     2.95     2.92     2.92     2.97     2.95
SSNR     9.79     9.95     9.98     9.87     9.30
Table 2: Objective evaluation - mean results, comparing variations of the Wave-U-Net model with different numbers of layers, without fine-tuning applied.

To evaluate and compare the quality of the enhanced speech yielded by the Wave-U-Net, we mirror the objective measures provided in (Pascual et al., 2017). Each measure compares the enhanced signal with the clean reference of each of the 824 test set items, and was calculated using the implementation provided by (Loizou, 2013), available at https://ecs.utdallas.edu/loizou/speech/software.htm. The first metric is the Perceptual Evaluation of Speech Quality (PESQ), more specifically the wide-band version recommended in ITU-T P.862.2 (from -0.5 to 4.5) (Loizou, 2013; Pascual et al., 2017). Secondly, composite measures are computed that aim to computationally approximate the Mean Opinion Score (MOS) that would be produced by human perceptual trials (Rethage et al., 2018). These are: CSIG, a prediction of the signal distortion attending only to the speech signal (Hu et al., 2008) (from 1 to 5); CBAK, a prediction of the intrusiveness of the background noise (Hu et al., 2008) (from 1 to 5); and COVL, a prediction of the overall effect (Hu et al., 2008) (from 1 to 5). Last is the Segmental Signal-to-Noise Ratio (SSNR) (Quackenbush et al., 1988) (from 0 to ∞).
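Two of these measures can be reproduced in Python as sketched below. Wide-band PESQ is available through the third-party `pesq` package (`pip install pesq`); the composite CSIG, CBAK and COVL scores come from the MATLAB implementation of (Loizou, 2013) linked above and are not re-implemented here. The segmental SNR below follows the common frame-wise definition, but frame lengths and clipping conventions vary between implementations, so treat it as an approximation.

```python
import numpy as np
from pesq import pesq   # pip install pesq

def wideband_pesq(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000) -> float:
    """ITU-T P.862.2 wide-band PESQ of the enhanced signal against the clean reference."""
    return pesq(fs, clean, enhanced, "wb")

def segmental_snr(clean: np.ndarray, enhanced: np.ndarray,
                  frame_len: int = 512, eps: float = 1e-10) -> float:
    """Mean per-frame SNR (in dB) between the clean reference and the enhanced signal."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        err = ref - enhanced[start:start + frame_len]
        snrs.append(10.0 * np.log10(np.sum(ref ** 2) / (np.sum(err ** 2) + eps) + eps))
    return float(np.mean(snrs))
```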

Table 1 shows the results of these metrics for different speech enhancement architectures compared against the overall best-performing Wave-U-Net model, the 10-layer variant with fine-tuning applied. As a comparative reference, it also shows the results of these metrics when applied directly to the noisy signals, to signals filtered using the Wiener method based on a priori SNR estimation, and to the SEGAN-enhanced signals, as provided in (Pascual et al., 2017). The results indicate that the Wave-U-Net is the most effective of these models for speech enhancement.

Table 2 shows the performance differences between variations of the Wave-U-Net with different numbers of layers. No fine-tuning was performed to obtain the results shown here, which explains the difference for the 10-layer Wave-U-Net between Tables 1 and 2. The results suggest that fine-tuning does not make a meaningful difference, except on the CSIG measurement, and that performance peaks around the 10- and 9-layer models, which are smaller than the best-performing equivalent model for music vocal source separation in (Stoller et al., 2018). This is likely due to the size of the receptive field, whose optimal size is probably smaller for speech than for music.
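A back-of-the-envelope calculation supports this intuition: considering only the encoder path (downsampling filters of size 15, decimation by 2 per layer), and ignoring the decoder and the context handling of (Stoller et al., 2018), the temporal context per output sample shrinks rapidly as layers are removed. The figures below are rough estimates under these simplifying assumptions, not values reported by the authors.

```python
def encoder_receptive_field(num_layers: int, filter_size: int = 15) -> int:
    """Rough receptive field of the encoder path, in samples."""
    receptive, step = 1, 1
    for _ in range(num_layers):
        receptive += (filter_size - 1) * step   # each convolution widens the context
        step *= 2                               # decimation doubles the spacing between samples
    return receptive

for layers in (8, 9, 10, 12):
    n = encoder_receptive_field(layers)
    print(f"{layers} layers: ~{n} samples, i.e. about {n / 16000:.2f} s at 16 kHz")
```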

5 Conclusions

5.1 Summary

The Wave-U-Net combines the advantages of several of the most recent successful architectures for music and speech source separation and our results show that it is particularly effective at speech enhancement. The results improve over the state of the art by a good margin even without significant adaptation or parameter tuning. This indicates that there is great potential for this approach in speech enhancement.

5.2 Future work

Compared to the SEGAN architecture, it is possible that the advantage stems from the Wave-U-Net's upsampling, which avoids aliasing artifacts; this should be investigated further. The results also indicate that there is room for increasing effectiveness and efficiency by further adapting the model size and other parameters, e.g. filter sizes, to the task, and by expanding to multi-channel audio and multi-source separation.

References