Improving singing voice separation with the Wave-U-Net using Minimum Hyperspherical Energy

10/22/2019 ∙ by Joaquin Perez-Lapillo, et al.

In recent years, deep learning has surpassed traditional approaches to the problem of singing voice separation. The Wave-U-Net is a recent deep network architecture that operates directly in the time domain. The standard Wave-U-Net is trained with data augmentation and early stopping to prevent overfitting. Minimum hyperspherical energy (MHE) regularization has recently been shown to improve generalization in image classification problems by encouraging a diversified filter configuration. In this work, we apply MHE regularization to the 1D filters of the Wave-U-Net. We evaluated this approach for separating the vocal part from mixed music audio recordings on the MUSDB18 dataset. We found that adding MHE regularization to the loss function consistently improves singing voice separation, as measured by the signal-to-distortion ratio (SDR) on test recordings, leading to the current best time-domain system for singing voice extraction.




1 Introduction

Audio source separation is a core research area in audio signal processing with applications in speech, music and environmental audio. The main goal is to extract one or more target sources while suppressing other sources and noise [19]. One case is the separation of a singing voice, commonly presenting the main melody, from the musical accompaniment. Separating vocals from accompaniment has several applications, such as editing and remixing music, automatic transcription, generating karaoke tracks and music information retrieval [9].

We address this problem by applying Minimum Hyperspherical Energy (MHE) regularization to the Wave-U-Net model. In particular our contributions are:

2 Related Work

Approaches to source separation can be broadly divided into two groups: traditional and deep neural networks [13]. The first group contains techniques such as non-negative matrix factorization, Bayesian methods, and the analysis of repeating structures [14, 10, 11]. However, several deep neural network architectures have surpassed those models and achieve state-of-the-art performance today. [5] introduced the U-Net architecture for singing voice extraction, using spectrograms. The Wave-U-Net, proposed by [15], processes the audio data in the time domain, and is trained as a deep end-to-end model with state-of-the-art performance.

Many strategies for regularization have been developed, with early stopping, data augmentation, weight decay and dropout being the most commonly used [2]. Neural network based singing voice separation systems have used some of these methods in the past. [5] uses dropout. In [15], data augmentation is applied by varying the balance of the sources. Both methods also use early stopping.

In deep learning in general, overfitting has also been addressed by compressing the network [3], modifying the network architecture [4], and alleviating redundancy through diversification [20]. This last approach enforces diversity between neurons via regularization. Minimum Hyperspherical Energy, proposed by [7], also falls into this category. Inspired by a problem in physics [17], the MHE term aims to encourage variety between filters. MHE has increased performance in image classification, but the method has not yet been applied to audio.

3 Method

3.1 Architecture

Our work is based on the best Wave-U-Net configuration found by [15] for singing voice separation, which is a deep U-Net composed of 24 hidden convolutional layers, 12 in each path, with a filter size of 15 in the downsampling path and 5 in the upsampling path. The first layer computes 24 feature maps, and each subsequent layer doubles that number. Each layer in the upsampling path is also connected to its corresponding layer in the downsampling path with skip connections. The output layer generates an estimated waveform in the range [-1,1] for the vocals, while the accompaniment is obtained as the difference between the original mixture and the estimated singing voice, as in [15]. The original loss function is mean squared error (MSE) per sample.
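As a minimal sketch of the output stage described above, the following shows how the accompaniment could be derived from the mixture and the vocal estimate, and how an MSE per sample could be computed. The function names and the toy waveforms are illustrative assumptions, not taken from the authors' implementation.

```python
def mse(estimate, target):
    """Mean squared error per sample between two equal-length waveforms."""
    if len(estimate) != len(target):
        raise ValueError("waveforms must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(estimate)


def accompaniment_from_vocals(mixture, vocals_estimate):
    """Accompaniment estimate as the difference: mixture minus estimated vocals."""
    return [m - v for m, v in zip(mixture, vocals_estimate)]


# Toy waveforms (hypothetical values in the range [-1, 1]).
mixture = [0.5, -0.2, 0.1, 0.0]
vocals_true = [0.3, -0.1, 0.0, 0.0]
vocals_est = [0.25, -0.1, 0.05, 0.0]

accomp_est = accompaniment_from_vocals(mixture, vocals_est)
vocal_loss = mse(vocals_est, vocals_true)
print(accomp_est, vocal_loss)
```

Because the accompaniment is defined as a difference, the two estimates always sum back to the input mixture.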

3.2 Adding MHE to Wave-U-Net training

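The MHE regularised loss function for the Wave-U-Net can be defined as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda \sum_{j=1}^{L} \frac{1}{N_j (N_j - 1)} \{E_s\}_j \qquad (1)$$

where $L$ is the number of hidden layers and $N_j$ is the number of filters in the $j$-th layer. $\{E_s\}_j$ represents the hyperspherical energy of the $j$-th layer, given the parameter $s$, and it is calculated as:

$$\{E_s\}_j = \sum_{i=1}^{N_j} \sum_{k=1,\, k \neq i}^{N_j} f_s\left(\lVert \hat{w}_i - \hat{w}_k \rVert\right) \qquad (2)$$

where $\lVert \cdot \rVert$ is the Euclidean norm and $\hat{w}_i = w_i / \lVert w_i \rVert$ is the weight vector of neuron $i$ projected onto a unit hypersphere. The dimensionality of the hypersphere is given by the number of incoming weights per neuron. For $f_s$, we use $f_s(z) = z^{-s}$ for $s > 0$, and $f_s(z) = \log(z^{-1})$ otherwise [7].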
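The following is a minimal pure-Python sketch of the full-space, Euclidean hyperspherical energy of one layer, assuming each filter has been flattened into a single weight vector. The function names and the small clamp `eps` (which avoids a division by zero for identical filters) are assumptions made for illustration and are not part of the source.

```python
import math


def normalize(w):
    """Project a weight vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in w]


def f_s(z, s):
    """Energy kernel: z**(-s) for s > 0, log(1/z) otherwise."""
    return z ** (-s) if s > 0 else math.log(1.0 / z)


def hyperspherical_energy(filters, s=0, eps=1e-8):
    """Sum of f_s over all ordered pairs of distinct, normalized filters."""
    unit = [normalize(w) for w in filters]
    energy = 0.0
    for i, wi in enumerate(unit):
        for k, wk in enumerate(unit):
            if i == k:
                continue
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(wi, wk)))
            energy += f_s(max(dist, eps), s)  # eps clamp is an assumption
    return energy


orthogonal = [[3.0, 0.0], [0.0, 5.0]]      # well spread filters
antipodal = [[1.0, 0.0], [-1.0, 0.0]]      # maximally spread filters
near_parallel = [[1.0, 0.0], [1.0, 0.01]]  # redundant filters

e_orth = hyperspherical_energy(orthogonal, s=0)
e_anti = hyperspherical_energy(antipodal, s=0)
e_near = hyperspherical_energy(near_parallel, s=0)
print(e_anti, e_orth, e_near)
```

Redundant (nearly parallel) filters produce a high energy, while well separated filters produce a low one, which is why minimizing this term encourages filter diversity.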

With regard to $\lambda$, i.e. the weighting constant of the regulariser, [7] recommends using a constant that depends on the number of hidden layers, with the aim of reducing the weighting of MHE for very deep architectures. We use here $\lambda = 1/L$.
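As a hedged sketch of how the weighted regulariser might enter the training objective, the function below adds the per-layer energies, each divided by $N_j (N_j - 1)$, to an MSE value and weights the sum by a constant that defaults to one over the number of layers. The function name, its arguments and the toy numbers are assumptions for illustration only.

```python
def mhe_regularized_loss(mse_loss, layer_energies, filters_per_layer, weight=None):
    """MSE plus weighted sum of per-layer energies normalized by N_j * (N_j - 1).

    If no weight is given, 1 / (number of layers) is used.
    """
    num_layers = len(layer_energies)
    if num_layers != len(filters_per_layer):
        raise ValueError("one filter count is needed per layer")
    if weight is None:
        weight = 1.0 / num_layers
    penalty = sum(
        energy / (n * (n - 1))
        for energy, n in zip(layer_energies, filters_per_layer)
    )
    return mse_loss + weight * penalty


# Toy example with two layers (hypothetical energies and filter counts).
total = mhe_regularized_loss(0.01, [6.0, 15.0], [3, 6])
print(total)

# For a network with 24 hidden layers the default weight would be 1/24.
default_weight_24 = 1.0 / 24
print(default_weight_24)
```

The normalization by $N_j (N_j - 1)$ keeps layers with many filters from dominating the penalty, since the number of ordered filter pairs grows quadratically with the layer width.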

There are two possible configurations of MHE, full or half space. The half-space variation creates a virtual neuron for every existing neuron with inverted signs of the weights. The second term of equation 1 is then applied to $2N_j$ neurons in each hidden layer $j$. There are two alternative distance functions, Euclidean and angular. When dealing with angular distances, $\lVert \hat{w}_i - \hat{w}_k \rVert$ is replaced with $\arccos(\hat{w}_i^{\top} \hat{w}_k)$. The parameter $s$ controls the scaling of MHE. Using the convention of [7] we have a total of twelve possible configurations as shown in Table 1.

Parameter   Values
model       [MHE, half_MHE]
s value     [0, 1, 2, a0, a1, a2]
Table 1: MHE configurations. Values of s with the prefix a use angular distance. Each model can be used with each s value [7].
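The half-space and angular variants described above, together with the twelve configurations of Table 1, can be sketched as follows. The helper names and the naming scheme `model_value` (e.g. `half_MHE_a1`) are assumptions for illustration; only names such as MHE_0 appear in the text.

```python
import itertools
import math


def add_virtual_neurons(unit_filters):
    """Half-space MHE: append a sign-inverted copy of every filter."""
    return unit_filters + [[-x for x in w] for w in unit_filters]


def angular_distance(u, v):
    """Angle between two unit vectors; the dot product is clamped for safety."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def parse_config(name):
    """Split a configuration name into its three settings (hypothetical scheme)."""
    model, value = name.rsplit("_", 1)
    return {
        "half_space": model == "half_MHE",
        "angular": value.startswith("a"),
        "s": int(value.lstrip("a")),
    }


models = ["MHE", "half_MHE"]
values = ["0", "1", "2", "a0", "a1", "a2"]
configs = [f"{m}_{v}" for m, v in itertools.product(models, values)]

print(len(configs), configs)
print(parse_config("half_MHE_a1"))
print(add_virtual_neurons([[1.0, 0.0]]))
print(angular_distance([1.0, 0.0], [0.0, 1.0]))
```

With virtual neurons added, a layer with $N_j$ filters contributes $2N_j$ vectors to the energy, so pairs of filters that point in opposite directions are no longer treated as fully diverse.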

3.3 Dataset

We use the MUSDB18 dataset [12]. It is composed of 150 full-length musical tracks. The data is divided into 100 songs for development and 50 for testing, corresponding to 6.5 and 3.5 hours, respectively. The development set is further divided into training and validation sets by randomly selecting 25 tracks for the latter.
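A minimal sketch of such a split, assuming the 100 development tracks are identified by placeholder names. The track names, the function name and the fixed seed are hypothetical; the source does not state how the random selection was seeded.

```python
import random


def split_development_set(tracks, num_validation=25, seed=0):
    """Randomly hold out `num_validation` tracks; the rest is used for training."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    validation = set(rng.sample(tracks, num_validation))
    train = [t for t in tracks if t not in validation]
    return train, sorted(validation)


# Placeholder identifiers standing in for the 100 development tracks.
dev_tracks = [f"track_{i:03d}" for i in range(100)]
train_tracks, val_tracks = split_development_set(dev_tracks)
print(len(train_tracks), len(val_tracks))
```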

For the task of singing voice separation, the drums, bass, and other sources are mixed to represent the accompaniment.
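The construction of the accompaniment target can be sketched as a sample-wise sum of the three non-vocal stems. The function name and the toy stems below are assumptions for illustration.

```python
def mix_accompaniment(drums, bass, other):
    """Sum the drums, bass and other stems sample by sample."""
    if not (len(drums) == len(bass) == len(other)):
        raise ValueError("stems must have the same length")
    return [d + b + o for d, b, o in zip(drums, bass, other)]


# Toy stems (hypothetical sample values).
drums = [0.1, 0.0, -0.1]
bass = [0.05, 0.05, 0.05]
other = [0.0, -0.2, 0.1]
vocals = [0.3, 0.1, 0.0]

accompaniment = mix_accompaniment(drums, bass, other)
mixture = [v + a for v, a in zip(vocals, accompaniment)]
print(accompaniment, mixture)
```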

For the final experiment, we added the CCMixter dataset [8] featuring 50 more songs for training, as in [15]. In this setup, the last model is trained with a total of 150 songs. The test set remains unchanged.

3.4 Experimental setup

We explored the influence of the MHE configuration and hyperparameters with an initial grid search. For the grid search, all twelve MHE configurations described in Table 1 and a baseline model were implemented using the same network parametrization in order to compare their performance. Our implementation is available as open source.

As in [15], the models were trained using the Adam optimizer [6] with decay rates of $\beta_1 = 0.9$ and $\beta_2 = 0.999$, a batch size of 16 and a learning rate of 0.0001. Data augmentation is applied by attenuating the vocals by a random factor in the range [0.7,1.0]. The best model selected in the first round of training is further fine-tuned, doubling the batch size and reducing the learning rate to 0.00001. The model with the lowest validation loss is finally tested against the unseen data.
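The augmentation step can be sketched as below. It assumes that the attenuated vocals are also used as the training target and that the mixture is recomputed from the scaled vocals; both are assumptions, as are the function name and the toy signals.

```python
import random


def augment_vocals(vocals, accompaniment, rng, low=0.7, high=1.0):
    """Scale the vocals by a random gain in [low, high] and remix."""
    gain = rng.uniform(low, high)
    scaled_vocals = [gain * v for v in vocals]
    mixture = [v + a for v, a in zip(scaled_vocals, accompaniment)]
    return mixture, scaled_vocals, gain


rng = random.Random(0)  # seeded only to make the sketch repeatable
toy_vocals = [0.2, -0.4, 0.1]
toy_accomp = [0.1, 0.1, -0.3]
aug_mixture, aug_vocals, gain = augment_vocals(toy_vocals, toy_accomp, rng)
print(gain, aug_mixture)
```

Varying the vocal level in this way changes the balance between the sources, so the network sees mixtures with different vocal-to-accompaniment ratios during training.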

In order to reduce the computational cost for the exploratory experiments, an epoch was defined as 1,000 iterations, instead of the original 2,000, and the early stopping criterion was reduced from 20 to 10 epochs without improvement of validation loss. Additionally, the tracks were mixed down to mono.
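The early stopping criterion can be sketched as a simple patience counter over per-epoch validation losses. The function name and the toy loss sequence are assumptions; in the exploratory setup each epoch would correspond to 1,000 training iterations.

```python
def early_stopping_epoch(validation_losses, patience=10):
    """Return (best_epoch, last_epoch) for a given sequence of validation losses.

    Training stops once `patience` consecutive epochs pass without a new
    lowest validation loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(validation_losses) - 1


# Toy run: improvement for three epochs, then a plateau (hypothetical values).
losses = [1.0, 0.8, 0.7] + [0.75] * 15

short_patience = early_stopping_epoch(losses, patience=10)
long_patience = early_stopping_epoch(losses, patience=20)
print(short_patience, long_patience)
```

A larger patience lets training continue through long plateaus, at the price of more epochs before the run terminates.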

A further experiment was performed exploring the regularization constant and an increased early stopping criterion. We finally evaluated the singing voice separation performance of the best MHE configuration with the original settings using the extended dataset and the parametrisation as in [15].

3.5 Evaluation

The models are evaluated using the signal-to-distortion ratio (SDR), as proposed by [18]. The audio tracks are partitioned into several non-overlapping segments of length one second to calculate SDR for each individual segment. The resulting SDRs are then averaged over the songs, and over the whole dataset.
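The segment-wise evaluation can be sketched as follows. Note the simplification: the SDR of [18] is computed with the BSS Eval decomposition, whereas this sketch uses a plain energy ratio between the reference and the estimation error. The function names, the `eps` term and the toy sample rate of 4 samples per second are assumptions for illustration.

```python
import math


def simple_sdr(reference, estimate, eps=1e-10):
    """Energy ratio in dB between the reference and the estimation error."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (error + eps))


def segment_sdrs(reference, estimate, sample_rate):
    """SDR for each non-overlapping one-second segment of a track."""
    num_segments = len(reference) // sample_rate
    return [
        simple_sdr(
            reference[i * sample_rate:(i + 1) * sample_rate],
            estimate[i * sample_rate:(i + 1) * sample_rate],
        )
        for i in range(num_segments)
    ]


def mean(values):
    return sum(values) / len(values)


# Toy track of two "seconds" at 4 samples per second (hypothetical values).
reference = [1.0, -1.0, 1.0, -1.0, 0.5, -0.5, 0.5, -0.5]
estimate = [0.9, -0.9, 0.9, -0.9, 0.25, -0.25, 0.25, -0.25]

per_segment = segment_sdrs(reference, estimate, sample_rate=4)
song_sdr = mean(per_segment)        # average over the segments of one song
dataset_sdr = mean([song_sdr])      # average over songs (only one song here)
print(per_segment, song_sdr, dataset_sdr)
```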

For near-silent parts in one source the SDR can reach very low values, which can affect the global mean statistic. To deal with this issue, median statistics are also provided as in [15], along with standard deviation and median absolute deviation (MAD) for vocals and accompaniment.
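A small sketch of these summary statistics, showing how a single near-silent segment with a very low SDR pulls the mean down while the median and MAD stay stable. The SDR values are invented for illustration, and the use of the population standard deviation is an assumption.

```python
import statistics


def summarize(values):
    """Median, median absolute deviation, mean and standard deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return {
        "median": med,
        "mad": mad,
        "mean": statistics.fmean(values),
        "sd": statistics.pstdev(values),  # population SD is an assumption
    }


# Four ordinary segments and one near-silent outlier (hypothetical SDRs in dB).
segment_values = [4.0, 3.5, 5.0, 4.5, -40.0]
summary = summarize(segment_values)
print(summary)
```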

4 Results

4.1 Hyperparameter exploration

Figure 1: Test set SDR results (in dB) for singing voice.
Figure 2: Test set SDR results (in dB) for accompaniment.

Figures 1 and 2 show statistics for the SDR obtained by each model over the MUSDB18 test set. The baseline model obtained the highest median SDR with 3.66 dB for the vocals estimation, followed closely by MHE_0 (full-space MHE, Euclidean distance, $s = 0$) with 3.63 dB. However, MHE_0 obtained the highest mean SDR with -0.31 dB, compared to -0.38 dB for the baseline. For the accompaniment, the MHE_0 model showed the highest median and mean SDR, albeit by a small margin.

               Model             s value                Distance
               mhe    half_mhe   0      1      2        euclidean  angular
Vocals   Med   3.56   3.56       3.58   3.57   3.52     3.58       3.54
         MAD   2.85   2.90       2.89   2.87   2.86     2.87       2.87
         Mean  -0.51  -0.58      -0.42  -0.57  -0.64    -0.51      -0.58
         SD    13.81  14.02      13.78  13.99  13.97    13.89      13.94
Accomp.  Med   7.34   7.34       7.37   7.34   7.31     7.35       7.33
         MAD   2.11   2.11       2.11   2.11   2.10     2.10       2.11
         Mean  7.46   7.49       7.49   7.48   7.44     7.49       7.45
         SD    3.88   3.81       3.87   3.82   3.84     3.83       3.87
Table 2: Average test set SDR results (in dB) aggregated by MHE hyperparameters.

Table 2 compares different MHE configurations, aggregating the results of Figures 1 and 2 by groups of parameters. The aggregated results indicate that the best s value is 0 and the preferred distance is Euclidean, although the margins are small. When comparing MHE versus half_MHE models, the former obtains a better mean SDR on vocals, and the latter on the accompaniment estimation.

4.2 MHE loss curves

Figure 3: MHE loss dynamics over training process. Longer curves represent longer training processes.

Figure 3 shows the development of the MHE loss during training. While full MHE models rapidly reduce the MHE loss in the first epochs of the training process, Euclidean half_MHE models tend to form steps. The behaviour seems to increase in frequency when higher s values are in use.

The MHE loss tends to be relatively stable, with only small changes from one epoch to the next. This is probably due to the large number of parameters being considered and the value chosen for the learning rate.

The following experiments focus on comparing the baseline model with the best performing MHE model, MHE_0, in different scenarios.

4.3 Early stopping and regularization strength

Table 3 shows the SDR results for MHE_0 with a 20-epoch early stopping criterion and different values of the regularization constant $\lambda$. This doubling of the early stopping criterion led to an 88% increase in the number of epochs for the MHE_0 model. In this setting, the MHE_0 model with $\lambda = 1/L$ obtained a higher median and mean SDR than the baseline model (leftmost column) for both vocals and accompaniment.

The increased early stopping criterion leads to longer training times for the MHE_0 model with $\lambda = 1/L$.

So far, the regularization constant was set to $\lambda = 1/L$ following the recommendation in [7]. Table 3 also shows the SDR results for $\lambda = 1/(2L)$ and $\lambda = 1$. Both increasing and decreasing the regularization constant have a negative effect on SDR.

Value of $\lambda$   Basel.  1/(2L)  1/L     1
Voc.  Med            3.65    3.50    3.69    3.64
      MAD            3.04    2.82    2.98    2.96
      Mean           -0.56   -0.39   0.01    -0.10
      SD             14.23   13.53   13.31   13.30
Acc.  Med            7.37    7.32    7.44    7.42
      MAD            2.10    2.11    2.14    2.14
      Mean           7.53    7.44    7.56    7.54
      SD             3.74    3.92    3.89    3.89
Epochs               120     76      167     120
Table 3: Test set SDR (in dB) for 20 epochs early stopping.

4.4 Extended training data

Using the same conditions reported by [15] for their best vocals separator system, called M4, we re-implemented this configuration and trained it in parallel with an MHE_0 model. The results in Table 4 show that MHE_0 achieves a higher median and mean SDR than both our M4 implementation (left column) and the originally reported M4 results. This shows that MHE_0 is currently the best time-domain singing voice separation model.

Model        M4(reimp.)  MHE_0   M4(orig.)
Voc.  Med    4.44        4.69    4.46
      MAD    3.15        3.24    3.21
      Mean   0.17        0.75    0.65
      SD     14.38       13.91   13.67
Acc.  Med    10.54       10.88   10.69
      MAD    3.01        3.13    3.15
      Mean   11.71       12.10   11.85
      SD     6.48        6.77    7.03
Epochs       90          122     -
Table 4: Test set SDR (in dB) for the extended training set.
Figure 4: Spectrograms of vocals with silent period.
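As a worked check of the margins reported in Table 4, the following sketch recomputes the differences in median and mean SDR between MHE_0 and the two M4 variants. The values are copied from Table 4; the dictionary layout and names are assumptions for illustration.

```python
# Median and mean SDR (dB) from Table 4, per model.
table4 = {
    "M4(reimp.)": {"voc_med": 4.44, "voc_mean": 0.17, "acc_med": 10.54, "acc_mean": 11.71},
    "MHE_0":      {"voc_med": 4.69, "voc_mean": 0.75, "acc_med": 10.88, "acc_mean": 12.10},
    "M4(orig.)":  {"voc_med": 4.46, "voc_mean": 0.65, "acc_med": 10.69, "acc_mean": 11.85},
}


def margin(reference_model):
    """Improvement of MHE_0 over `reference_model`, rounded to 0.01 dB."""
    return {
        key: round(table4["MHE_0"][key] - table4[reference_model][key], 2)
        for key in table4["MHE_0"]
    }


gain_over_reimp = margin("M4(reimp.)")
gain_over_orig = margin("M4(orig.)")
print(gain_over_reimp)
print(gain_over_orig)
```

All eight differences are positive, with the median vocal SDR improving by about a quarter of a decibel over both M4 variants.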

5 Discussion

The results confirm that the MHE_0 model, i.e. full-space MHE with Euclidean distance and $s = 0$, outperforms the alternative MHE versions, similar to the results in [7]. Training MHE models needs more epochs than training the baseline and benefits from an increased early stopping criterion. Overall, including MHE regularization in the Wave-U-Net loss function improves singing voice separation when sufficient time is given for the training.

The vocals source in particular contains some silent periods. These can produce very low SDR values and clearly audible artifacts. MHE helps to reduce artifacts in these periods, as can be seen in Figure 4.

The results achieved here do not yet match the best singing voice extraction method based on time-frequency representations [16]. Closing the performance gap of around 1.5 dB is a worthwhile research challenge because of the potential of time-domain systems for low-latency processing.

6 Conclusion

This study explores the use of a novel regularization method, minimum hyperspherical energy (MHE), for improving singing voice separation with the Wave-U-Net. It is, to the authors' knowledge, the first time that this technique has been applied to an audio-related problem.

Our results suggest that MHE regularization, combined with the appropriate early stopping criterion, is worth including in the loss function of deep learning separator systems such as the Wave-U-Net, as it leads to a new state of the art in our experiments.

For future work we intend to address other applications, such as speech enhancement and separation, as well as other loss formulations. We are also interested in designing low latency systems based on this approach and aim to reduce the computational cost of the Wave-U-Net.


References

  • [1] Y. Bengio and Y. LeCun (Eds.) (2015) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • [2] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press.
  • [3] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149 [cs].
  • [4] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 [cs].
  • [5] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde (2017) Singing voice separation with deep U-Net convolutional networks. Paper presented at the 18th International Society for Music Information Retrieval Conference, 23-27 Oct 2017, Suzhou, China.
  • [6] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In [1].
  • [7] W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song (2018) Learning towards minimum hyperspherical energy. arXiv:1805.09298 [cs, stat].
  • [8] A. Liutkus, D. Fitzgerald, and Z. Rafii (2015) Scalable audio separation with light kernel additive modelling. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 76–80. ISBN 978-1-4673-6997-8.
  • [9] A. A. Nugraha, A. Liutkus, and E. Vincent (2016) Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (9), pp. 1652–1664. ISSN 2329-9290, 2329-9304.
  • [10] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval (2007) Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech and Language Processing 15 (5), pp. 1564–1578. ISSN 1558-7916.
  • [11] Z. Rafii and B. Pardo (2013) REpeating pattern extraction technique (REPET): a simple method for music/voice separation. IEEE Transactions on Audio, Speech, and Language Processing 21 (1), pp. 73–84. ISSN 1558-7916, 1558-7924.
  • [12] Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, and R. Bittner (2017-12) The MUSDB18 corpus for music separation.
  • [13] Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo (2018) An overview of lead and accompaniment separation in music. arXiv:1804.08300 [cs, eess].
  • [14] P. Smaragdis, C. Fevotte, G. J. Mysore, N. Mohammadiha, and M. Hoffman (2014) Static and dynamic source separation using nonnegative factorizations: a unified view. IEEE Signal Processing Magazine 31 (3), pp. 66–75. ISSN 1053-5888.
  • [15] D. Stoller, S. Ewert, and S. Dixon (2018) Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv:1806.03185 [cs, eess, stat].
  • [16] F. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji (2019) Open-Unmix - a reference implementation for music source separation. Journal of Open Source Software 4 (41). ISSN 2475-9066.
  • [17] J.J. Thomson (1904) XXIV. On the structure of the atom: an investigation of the stability and periods of oscillation of a number of corpuscles arranged at equal intervals around the circumference of a circle; with application of the results to the theory of atomic structure. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 7 (39), pp. 237–265. ISSN 1941-5982.
  • [18] E. Vincent, R. Gribonval, and C. Fevotte (2006) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing 14 (4), pp. 1462–1469. ISSN 1558-7916.
  • [19] E. Vincent, T. Virtanen, and S. Gannot (2018) Audio source separation and speech enhancement. John Wiley & Sons. ISBN 978-1-119-27991-4, 978-1-119-27988-4.
  • [20] P. Xie, J. Zhu, and E. P. Xing (2017) Diversity-promoting Bayesian learning of latent variable models. arXiv:1711.08770 [cs, stat].