1 Introduction
Audio source separation refers to the problem of extracting one or more target sources while suppressing interfering sources and noise (Vincent et al., 2018). Two related tasks are speech enhancement and singing voice separation, both of which involve extracting the human voice as a target source. The former aims to improve speech intelligibility and quality when the speech is obscured by additive noise (Loizou, 2013; Pascual et al., 2017; Vincent et al., 2018), whilst the latter focuses on separating music vocals from accompaniment (Stoller et al., 2018).
Most audio source separation methods do not operate directly in the time domain, but use time-frequency representations as input and output (front-end). Since 2017, the U-Net architecture applied to magnitude spectrograms has achieved new state-of-the-art results in audio source separation for music (Jansson et al., 2017) and speech dereverberation (Ernst et al., 2018). Neural network architectures operating in the time domain have also recently been proposed for speech enhancement (Pascual et al., 2017; Rethage et al., 2017). These approaches were combined in the Wave-U-Net (Stoller et al., 2018) and applied to singing voice separation. In this paper, we apply the Wave-U-Net to speech enhancement and show that it outperforms the current state of the art.
The remainder of this paper is structured as follows. In section 2, we briefly review related work from the literature. In section 3, we briefly introduce the Wave-U-Net architecture and its application to speech. Section 4 presents the experiments we conducted and their results, including comparisons to other methods. Section 5 concludes this article with a final summary and perspectives for future work.
2 Related work
Source separation of audio has seen great improvement in recent years through deep learning models (Huang et al., 2015; Nugraha et al., 2016). These methods, as well as more traditional ones, mostly operate in the time-frequency domain, ranging from deep recurrent architectures predicting soft masks (Huang et al., 2014) to convolutional encoder-decoder architectures like that of Chandna et al. (2017). Recently, the U-Net architecture applied to magnitude spectrograms has achieved new state-of-the-art results in audio source separation for music (Jansson et al., 2017) and speech dereverberation (Ernst et al., 2018).
Models operating in the time domain have also been developed recently. The development of Wavenet (van den Oord et al., 2016) inspired further time-domain work, including (Pascual et al., 2017; Rethage et al., 2017). The SEGAN architecture (Pascual et al., 2017) was developed for speech enhancement and denoising. It employs a time-domain neural network with an encoder and decoder pathway that successively halves and doubles the resolution of the feature maps in each layer, respectively, with skip connections between encoder and decoder layers. It achieved state-of-the-art results on the Voice Bank (VCTK) dataset (Valentini-Botinhao, 2017).
The Wavenet for Speech Denoising (Rethage et al., 2017), another architecture operating directly in the time domain, takes its inspiration from (van den Oord et al., 2016). It has a non-causal conditional input and a parallel output of samples for each prediction, and is based on the repeated application of dilated convolutions with exponentially increasing dilation factors to incorporate context information, as sketched below.
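To illustrate the mechanism, the following minimal PyTorch sketch stacks 1D convolutions whose dilation doubles at every layer; it is our own illustration, not the authors' implementation, and the channel count, kernel size, and plain ReLU are placeholder choices (the residual paths and gated activations of the actual Wavenet are omitted):

```python
import torch
import torch.nn as nn

class DilatedConvStack(nn.Module):
    """Stack of 1D convolutions with exponentially increasing dilation.

    Non-causal "same" padding, so each output sample sees context on
    both sides, as in the denoising Wavenet.
    """
    def __init__(self, channels=32, kernel_size=3, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            dilation = 2 ** i  # 1, 2, 4, ...: the context roughly doubles per layer
            self.layers.append(nn.Conv1d(
                channels, channels, kernel_size,
                dilation=dilation,
                padding=dilation * (kernel_size - 1) // 2))

    def forward(self, x):
        for conv in self.layers:
            x = torch.relu(conv(x))
        return x
```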
3 Wave-U-Net for Speech Enhancement
The Wave-U-Net architecture of Stoller et al. (2018) combines elements of both of the above-mentioned time-domain architectures with the U-Net. The overall architecture is a one-dimensional U-Net with downsampling and upsampling blocks. As in the spectrogram-based U-Net architectures (e.g. Jansson et al., 2017), the Wave-U-Net makes its predictions through a series of downsampling and upsampling blocks, with the time resolution halved at each level of the network.
In applying the Wave-U-Net architecture to speech enhancement, our objective is to separate a mixture waveform $M \in [-1, 1]^{L \times C}$ into $K$ source waveforms $S^1, \ldots, S^K$ with $S^k \in [-1, 1]^{L \times C}$ for all $k \in \{1, \ldots, K\}$, where $C$ is the number of audio channels and $L$ the number of audio samples. In our case of monaural speech enhancement, we have $K = 2$ (speech and noise) and $C = 1$.
For the $K$ sources to be estimated, a 1D convolution of filter size 1 with $K \cdot C$ filters, zero-padded before convolving, is applied to convert the stack of features at each audio sample into a source prediction for each sample. This is followed by a tanh nonlinearity to obtain source signal estimates with values in the interval $(-1, 1)$. All convolutions except the final ones are followed by LeakyReLU nonlinearities. A minimal sketch of this structure is given below.
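The following is an illustrative PyTorch sketch of this structure, not the reference implementation of (Stoller et al., 2018); the decimation, linear-interpolation upsampling, skip connections, LeakyReLU, and the final size-1 convolution with tanh follow the description above, while the class name and the reduced layer and filter defaults are our own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveUNetSketch(nn.Module):
    """Minimal monaural Wave-U-Net sketch: C = 1 input channel, K = 2 sources."""
    def __init__(self, num_layers=4, extra_filters=16, K=2, C=1):
        super().__init__()
        self.down, self.up = nn.ModuleList(), nn.ModuleList()
        ch = C
        for i in range(num_layers):
            out = extra_filters * (i + 1)             # grow features with depth
            self.down.append(nn.Conv1d(ch, out, 15, padding=7))
            ch = out
        self.bottleneck = nn.Conv1d(ch, ch, 15, padding=7)
        for i in reversed(range(num_layers)):
            out = extra_filters * (i + 1)
            # input: upsampled features concatenated with the skip connection
            self.up.append(nn.Conv1d(ch + out, out, 5, padding=2))
            ch = out
        # size-1 convolution mapping features to K * C source channels
        self.out = nn.Conv1d(ch, K * C, 1)

    def forward(self, x):
        skips = []
        for conv in self.down:
            x = F.leaky_relu(conv(x))
            skips.append(x)
            x = x[:, :, ::2]                          # halve the time resolution
        x = F.leaky_relu(self.bottleneck(x))
        for conv, skip in zip(self.up, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-1],
                              mode='linear', align_corners=False)
            x = F.leaky_relu(conv(torch.cat([x, skip], dim=1)))
        return torch.tanh(self.out(x))                # source estimates in (-1, 1)
```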
4 Experiments
4.1 Datasets
The dataset (Valentini-Botinhao, 2017) includes clean and noisy audio data at a 48 kHz sampling frequency. However, like the SEGAN, we downsample to 16 kHz for training and testing. The clean data are recordings of sentences, sourced from various text passages, uttered by 30 English speakers, male and female, with various accents: 28 intended for training and 2 reserved for testing (Valentini-Botinhao et al., 2016b). The noisy data were generated by mixing the clean data with various noise datasets, as per the instructions provided in (Pascual et al., 2017; Valentini-Botinhao, 2017; Valentini-Botinhao et al., 2016a). A sketch of the resampling step is shown below.
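As an illustration of the 48 kHz to 16 kHz preprocessing (a sketch assuming the soundfile and scipy packages; the file name is hypothetical):

```python
import soundfile as sf
from scipy.signal import resample_poly

audio, sr = sf.read('p226_001.wav')               # hypothetical 48 kHz source file
assert sr == 48000
audio_16k = resample_poly(audio, up=1, down=3)    # 48 kHz -> 16 kHz (factor 3)
sf.write('p226_001_16k.wav', audio_16k, 16000)
```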
With respect to the training set, 40 different noise conditions are considered (Pascual et al., 2017; Valentini-Botinhao et al., 2016b). These are composed of 10 types of noise (2 of which are artificially generated and 8 sourced from the DEMAND database (Thiemann et al., 2013)), each mixed with clean speech at one of 4 signal-to-noise ratios (SNR): 15, 10, 5, and 0 dB. In total, this yields 11,572 training items, with approximately 10 different sentences in each condition per training speaker. The data are available at http://data.cstr.ed.ac.uk/cvbotinh/SE/data/ (Valentini-Botinhao, 2017). The standard SNR mixing procedure is sketched after this paragraph.
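The dataset authors describe the exact mixing procedure in the references above; the following sketch merely illustrates the standard way a noise signal is scaled to reach a target SNR before being added to clean speech:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that mixing with `clean` yields the target SNR."""
    noise = noise[:len(clean)]                  # trim noise to the utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(p_clean / p_noise_scaled); solve for the scale factor
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000) * 0.1        # stand-ins for real recordings
noise = rng.standard_normal(16000) * 0.1
noisy = mix_at_snr(clean, noise, snr_db=5)      # one of the four training SNRs
```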
Testing conditions are mismatched from those of training: the speakers, noise types and SNRs are all different. The separate test set, with 2 speakers unseen during training, consists of a total of 20 different noise conditions: 5 types of noise sourced from the DEMAND database, each at one of 4 SNRs (17.5, 12.5, 7.5, and 2.5 dB) (Valentini-Botinhao, 2017; Valentini-Botinhao et al., 2016a). This yields 824 test items, with approximately 20 different sentences in each condition per test speaker (Valentini-Botinhao, 2017; Valentini-Botinhao et al., 2016a).
4.2 Experimental setup
As per (Stoller et al., 2018), our baseline model trains on randomly sampled audio excerpts, using the Adam optimization algorithm with a learning rate of 0.0001, decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and a batch size of 16. We initially specify 12 network layers, as in (Stoller et al., 2018), although this is varied across experiments, as described in the Results section below. We also specify 16 extra filters per layer, with downsampling block filters of size 15 and upsampling block filters of size 5, as in (Stoller et al., 2018). We train for 2,000 iterations with the mean squared error (MSE) over all source output samples in a batch as the loss, and apply early stopping if there is no improvement on the validation set for 20 epochs. We use a fixed validation set of 10 randomly selected tracks. The best model is then fine-tuned with the batch size doubled and the learning rate lowered to 0.00001, again until 20 epochs have passed without improvement in validation loss. This two-stage schedule is sketched below.
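A condensed sketch of the schedule, under stated assumptions: the loader and val_loss_fn names are hypothetical, and the model reuses the WaveUNetSketch class from the earlier sketch rather than the reference implementation:

```python
import torch

model = WaveUNetSketch()                        # sketch model from Section 3
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
loss_fn = torch.nn.MSELoss()                    # MSE over all source output samples

def train(model, opt, loader, val_loss_fn, patience=20):
    best, bad_epochs = float('inf'), 0
    while bad_epochs < patience:                # early stopping on validation loss
        for noisy, sources in loader:           # (B, 1, T) inputs, (B, 2, T) targets
            opt.zero_grad()
            loss = loss_fn(model(noisy), sources)
            loss.backward()
            opt.step()
        val = val_loss_fn(model)
        if val < best:
            best, bad_epochs = val, 0
        else:
            bad_epochs += 1
    return best

# Fine-tuning stage: lower the learning rate to 1e-5 and double the
# batch size (in the DataLoader), then train again with the same patience.
opt = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999))
```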
To evaluate and compare the quality of the enhanced speech yielded by the Wave-U-Net, we mirror the objective measures provided in (Pascual et al., 2017). Each measure compares the enhanced signal with the clean reference of each of the 824 test set items, and was calculated using the implementation provided by Loizou (2013), available at https://ecs.utdallas.edu/loizou/speech/software.htm. The first metric is the Perceptual Evaluation of Speech Quality (PESQ), specifically the wide-band version recommended in ITU-T P.862.2 (from -0.5 to 4.5) (Loizou, 2013; Pascual et al., 2017). Second, we compute composite measures that computationally approximate the Mean Opinion Score (MOS) that would be produced by human perceptual trials (Rethage et al., 2017). These are: CSIG, a prediction of the signal distortion attending only to the speech signal (from 1 to 5); CBAK, a prediction of the intrusiveness of background noise (from 1 to 5); and COVL, a prediction of the overall effect (from 1 to 5) (Hu et al., 2008). The last metric is the Segmental Signal-to-Noise Ratio (SSNR) (Quackenbush et al., 1988) (from 0 to $\infty$). A sketch of how comparable measures can be computed with open-source tools follows.
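As an aside, comparable numbers can be obtained with open-source tools; this is a sketch, not the Loizou implementation used for our tables. It assumes the pesq PyPI package (whose wide-band mode follows ITU-T P.862.2) and writes out a simple segmental SNR with the conventional per-frame clamping:

```python
import numpy as np
from pesq import pesq   # pip install pesq; 'wb' mode follows ITU-T P.862.2

def segmental_snr(ref, deg, frame_len=256, floor=-10.0, ceil=35.0):
    """Mean per-frame SNR in dB, clamped to the conventional [-10, 35] range."""
    snrs = []
    for i in range(0, len(ref) - frame_len, frame_len):
        r, d = ref[i:i + frame_len], deg[i:i + frame_len]
        err = np.sum((r - d) ** 2) + 1e-10
        snrs.append(10 * np.log10(np.sum(r ** 2) / err + 1e-10))
    return float(np.mean(np.clip(snrs, floor, ceil)))

# Usage (hypothetical 16 kHz arrays):
# score = pesq(16000, clean_16k, enhanced_16k, 'wb')
# ssnr = segmental_snr(clean_16k, enhanced_16k)
```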
4.3 Results
Table 1 shows the results of these metrics for different speech enhancement architectures compared against the overall best-performing Wave-U-Net model, the 10-layer model with fine-tuning applied. As a comparative reference, it also shows the results of these metrics when applied directly to the noisy signals, to signals filtered using the Wiener method based on a priori SNR estimation, and to the SEGAN-enhanced signals, as provided in (Pascual et al., 2017). The results indicate that the Wave-U-Net is the most effective of these models for speech enhancement.
Table 2 shows the performance differences between variants of the Wave-U-Net with different numbers of layers. No fine-tuning was performed to obtain the results shown here, which explains the difference between the 10-layer Wave-U-Net results in Tables 1 and 2. The results suggest that fine-tuning does not make a meaningful difference except on the CSIG measure, and that performance peaks around the 9- and 10-layer models, which are smaller than the best-performing equivalent model for music vocal source separation in (Stoller et al., 2018). This is likely due to the size of the receptive field, whose optimal size for speech is probably smaller than for music; a back-of-the-envelope calculation is sketched below.
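To make this concrete, the following sketch estimates the receptive field of the downsampling path alone, under the assumptions of a size-15 filter and decimation by 2 per block (the bottleneck and upsampling path, which enlarge it further, are ignored):

```python
def downsampling_receptive_field(num_layers, kernel_size=15):
    """Receptive field (in input samples) of the downsampling path alone."""
    r, jump = 1, 1
    for _ in range(num_layers):
        r += (kernel_size - 1) * jump   # each convolution widens the field
        jump *= 2                       # decimation doubles the sample stride
    return r

for layers in (9, 10, 12):
    r = downsampling_receptive_field(layers)
    print(f"{layers} layers: ~{r} samples (~{r / 16000:.2f} s at 16 kHz)")
```

Under these assumptions, 9 layers already cover roughly 0.45 s of 16 kHz audio, while 12 layers cover about 3.6 s, so the shallower models still see a context that is plausibly sufficient for speech.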
5 Conclusions and future work
5.1 Conclusions
The Wave-U-Net combines the advantages of several of the most recent successful architectures for music and speech source separation, and our results show that it is particularly effective at speech enhancement. The results improve over the state of the art by a good margin, even without significant adaptation or parameter tuning, which indicates the great potential of this approach for speech enhancement.
5.2 Future work
In comparison to the SEGAN architecture, it is possible that the advantage stems from the upsampling that avoids aliasing, which should be investigated further. The results also indicate that there is room for improving effectiveness and efficiency by further adapting the model size and other parameters, e.g. filter sizes, to the task, and by extending the approach to multi-channel audio and multi-source separation.
References
Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez. Monoaural audio source separation using deep convolutional neural networks. In 13th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2017), 2017. URL http://mtg.upf.edu/node/3680.
Ori Ernst, Shlomo E. Chazan, Sharon Gannot, and Jacob Goldberger. Speech dereverberation using fully convolutional networks. CoRR, abs/1803.08243, 2018. URL http://arxiv.org/abs/1803.08243.
Yi Hu and Philipos C. Loizou. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):229-238, 2008. doi: 10.1109/TASL.2007.911054. URL http://www.utdallas.edu/loizou/speech/noizeus/.
Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis. Singing-voice separation from monaural recordings using deep recurrent neural networks. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), 2014. URL http://www.isle.illinois.edu/sst/pubs/2014/huang-ismir2014.pdf.
Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12):2136-2147, 2015.
Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), pages 745-751, 2017. URL https://ismir2017.smcnus.org/wp-content/uploads/2017/10/171_Paper.pdf.
Philipos C. Loizou. Speech Enhancement: Theory and Practice. CRC Press, Inc., Boca Raton, FL, USA, 2nd edition, 2013. ISBN 9781466504219.
Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent. Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(9):1652-1664, 2016.
Santiago Pascual, Antonio Bonafonte, and Joan Serrà. SEGAN: speech enhancement generative adversarial network. In Proc. Interspeech 2017, pages 3642-3646, 2017. doi: 10.21437/Interspeech.2017-1428. URL http://dx.doi.org/10.21437/Interspeech.2017-1428.
Schuyler R. Quackenbush, Thomas P. Barnwell, and Mark A. Clements. Objective Measures of Speech Quality. Prentice Hall, 1988. ISBN 0136290566.
Dario Rethage, Jordi Pons, and Xavier Serra. A Wavenet for speech denoising. arXiv preprint arXiv:1706.07162, 2017. URL https://arxiv.org/pdf/1706.07162.pdf.
Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185, 2018. URL https://arxiv.org/abs/1806.03185.
Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): a database of multichannel environmental noise recordings. In Proceedings of Meetings on Acoustics (ICA 2013), 2013. URL https://hal.inria.fr/hal-00796707.
Cassia Valentini-Botinhao. Noisy speech database for training speech enhancement algorithms and TTS models, 2016 [sound]. University of Edinburgh, School of Informatics, Centre for Speech Technology Research (CSTR), 2017. URL http://dx.doi.org/10.7488/ds/2117.
Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In 9th ISCA Speech Synthesis Workshop, pages 146-152, 2016a. doi: 10.21437/SSW.2016-24. URL http://dx.doi.org/10.21437/SSW.2016-24.
Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi. Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks. In Proc. Interspeech 2016, 2016b. doi: 10.21437/Interspeech.2016-159. URL http://dx.doi.org/10.21437/Interspeech.2016-159.
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: a generative model for raw audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, page 125, 2016. URL http://www.isca-speech.org/archive/SSW_2016/abstracts/ssw9_DS-4_van_den_Oord.html.
Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot, editors. Audio Source Separation and Speech Enhancement. John Wiley & Sons Ltd, Chichester, UK, 2018. ISBN 9781119279860. doi: 10.1002/9781119279860. URL http://doi.wiley.com/10.1002/9781119279860.