Audio source separation is a core research area in audio signal processing with applications in speech, music and environmental audio. The main goal is to extract one or more target sources while suppressing other sources and noise. One case is the separation of a singing voice, which commonly carries the main melody, from the musical accompaniment. Separating vocals from accompaniment has several applications, such as editing and remixing music, automatic transcription, generating karaoke tracks and music information retrieval.
We address this problem by applying Minimum Hyperspherical Energy (MHE) regularization to the Wave-U-Net model. In particular, our contributions are:
- An extensive evaluation of MHE configurations and hyperparameters for singing voice separation.
- A new state of the art for time-domain singing voice extraction on MUSDB18.
2 Related Work
Approaches to source separation can be broadly divided into two groups: traditional methods and deep neural networks. The first group contains techniques such as non-negative matrix factorization, Bayesian methods, and the analysis of repeating structures [14, 10, 11]. However, several deep neural network architectures have surpassed those models and achieve state-of-the-art performance today. The U-Net architecture was introduced for singing voice extraction in [5], using spectrograms. The Wave-U-Net, proposed in [15], processes the audio data in the time domain and is trained as a deep end-to-end model with state-of-the-art performance.
Many strategies for regularization have been developed, with early stopping, data augmentation, weight decay and dropout being the most commonly used. Neural network based singing voice separation systems have used some of these methods in the past. The spectrogram-based U-Net [5] uses dropout. In the Wave-U-Net [15], data augmentation is applied by varying the balance of the sources. Both methods also use early stopping.
Another family of approaches enforces diversity between neurons via regularization. Minimum Hyperspherical Energy, proposed in [7], also falls into this category. Inspired by a problem in physics [17], the MHE term aims to encourage variety between filters. MHE has increased performance in image classification, but the method has not yet been applied to audio.
Our work is based on the best Wave-U-Net configuration found in [15] for singing voice separation, which is a deep U-Net composed of 24 hidden convolutional layers, 12 in each path, with a filter size of 15 in the downsampling path and 5 in the upsampling path. The first layer computes 24 feature maps, and each subsequent layer doubles that number. Each layer in the upsampling path is also connected to its corresponding layer in the downsampling path with skip connections. The output layer generates an estimated waveform in the range [-1, 1] for the vocals, while the accompaniment is obtained as the difference between the original mixture and the estimated singing voice. The original loss function is mean squared error (MSE) per sample.
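The output stage and loss described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `accompaniment_from_vocals` and `mse_per_sample` are hypothetical, waveforms are represented as lists of floats, and the loss is computed on the vocals only for the sake of the example.

```python
def accompaniment_from_vocals(mixture, vocals_estimate):
    """Accompaniment estimate: mixture minus estimated vocals, sample by sample."""
    if len(mixture) != len(vocals_estimate):
        raise ValueError("mixture and vocals estimate must have the same length")
    return [m - v for m, v in zip(mixture, vocals_estimate)]


def mse_per_sample(target, estimate):
    """Mean squared error averaged over all samples of a waveform."""
    if len(target) != len(estimate):
        raise ValueError("target and estimate must have the same length")
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


# Toy four-sample signals (values are made up for illustration).
mixture = [0.5, -0.2, 0.1, 0.0]
true_vocals = [0.3, -0.1, 0.0, 0.0]
vocals_estimate = [0.25, -0.1, 0.05, 0.0]

accompaniment = accompaniment_from_vocals(mixture, vocals_estimate)
loss = mse_per_sample(true_vocals, vocals_estimate)
```

Because the accompaniment is derived by subtraction, the two estimates always sum to the input mixture.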
3.2 Adding MHE to Wave-U-Net training
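The MHE regularised loss function for the Wave-U-Net can be defined as:
$$\mathcal{L} = \mathcal{L}_{MSE} + \lambda_h \sum_{j=1}^{L} \frac{1}{N_j (N_j - 1)} \{E_s\}_j \tag{1}$$
where $L$ is the number of hidden layers and $N_j$ is the number of filters in the $j$-th layer. $\{E_s\}_j$ represents the hyperspherical energy of the $j$-th layer, given the parameter $s$, and it is calculated as:
$$\{E_s\}_j = \sum_{i=1}^{N_j} \sum_{k=1, k \neq i}^{N_j} f_s\left(\lVert \hat{w}_i - \hat{w}_k \rVert\right)$$
where $\lVert \cdot \rVert$ is the Euclidean norm and $\hat{w}_i = w_i / \lVert w_i \rVert$ is the weight vector of neuron $i$ projected onto a unit hypersphere. The dimensionality of the hypersphere is given by the number of incoming weights per neuron. For $f_s$, we use $f_s(z) = z^{-s}$ for $s > 0$, and $f_s(z) = \log(z^{-1})$ otherwise.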
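As an illustration of the energy of a single layer, the following sketch computes the full-space, Euclidean variant in plain Python. It assumes that each neuron is given as a flat list of its incoming weights; the function names and the small `eps` guard against coincident neurons are assumptions of this sketch, not part of the definition above.

```python
import math


def normalise(w):
    """Project a weight vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm == 0.0:
        raise ValueError("a zero weight vector cannot be projected")
    return [x / norm for x in w]


def hyperspherical_energy(weights, s=0, eps=1e-8):
    """Sum of f_s(distance) over all ordered pairs of distinct unit-norm neurons."""
    unit = [normalise(w) for w in weights]
    energy = 0.0
    for i, wi in enumerate(unit):
        for k, wk in enumerate(unit):
            if i == k:
                continue
            dist = max(math.dist(wi, wk), eps)  # Euclidean distance, Python 3.8+
            # f_s(z) = z^(-s) for s > 0, and log(1/z) otherwise
            energy += dist ** (-s) if s > 0 else math.log(1.0 / dist)
    return energy


orthogonal = hyperspherical_energy([[1.0, 0.0], [0.0, 1.0]], s=2)
antipodal = hyperspherical_energy([[1.0, 0.0], [-1.0, 0.0]], s=2)
rescaled = hyperspherical_energy([[3.0, 0.0], [0.0, 5.0]], s=2)
```

Two antipodal neurons are further apart than two orthogonal ones and therefore have lower energy. Rescaling the weights leaves the energy unchanged, since only the directions of the weight vectors enter the computation.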
With regard to $\lambda_h$, i.e. the weighting constant of the regulariser, [7] recommends using a constant that depends on the number of hidden layers, with the aim of reducing the weighting of MHE for very deep architectures. We follow this recommendation here.
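The way the per-layer energies enter the training loss can be sketched as follows. The function `mhe_regularised_loss` is hypothetical and takes precomputed layer energies as input. Setting `lambda_h` to one over the number of layers is used here only as an example of a constant that shrinks with depth; it is an assumption of this sketch, not a value taken from the text.

```python
def mhe_regularised_loss(mse, layer_energies, layer_sizes, lambda_h):
    """MSE plus lambda_h times the per-layer energies, each divided by N (N - 1)."""
    if len(layer_energies) != len(layer_sizes):
        raise ValueError("one energy and one filter count are needed per layer")
    penalty = 0.0
    for energy, n_filters in zip(layer_energies, layer_sizes):
        if n_filters < 2:
            raise ValueError("a layer needs at least two filters to form a pair")
        # N (N - 1) is the number of ordered neuron pairs in the layer
        penalty += energy / (n_filters * (n_filters - 1))
    return mse + lambda_h * penalty


# Toy values for a two-layer network (made up for illustration).
layer_sizes = [3, 4]
layer_energies = [6.0, 12.0]
lambda_h = 1.0 / len(layer_sizes)  # assumed example of a depth-dependent constant
total_loss = mhe_regularised_loss(0.01, layer_energies, layer_sizes, lambda_h)
```

Dividing each energy by the number of neuron pairs keeps wide layers from dominating the penalty.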
There are two possible configurations of MHE, full space or half space. The half-space variation creates a virtual neuron for every existing neuron with inverted signs of the weights. The second term of equation 1 is then applied to twice the number of neurons in each hidden layer. There are two alternative distance functions, Euclidean and angular. When dealing with angular distances, the Euclidean distance between two projected weight vectors is replaced with the arc cosine of their inner product, $\arccos(\hat{w}_i^\top \hat{w}_k)$. The parameter $s$ controls the scaling of MHE. Using the convention of [7], we have a total of twelve possible configurations, as shown in Table 1.
|value||[0, 1, 2, a0, a1, a2]|
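The twelve configurations can be enumerated with a short sketch. It assumes that a leading `a` in a label such as `a1` marks the angular distance with the given `s`, and that the two spaces are labelled `full` and `half`; these labels and all function names are hypothetical.

```python
import math


def parse_config(label):
    """Split a label such as 'a1' into (distance, s)."""
    if label.startswith("a"):
        return "angular", int(label[1:])
    return "euclidean", int(label)


def add_virtual_neurons(unit_weights):
    """Half-space variant: append a sign-inverted copy of every neuron."""
    return unit_weights + [[-x for x in w] for w in unit_weights]


def pair_distance(wi, wk, distance):
    """Distance between two unit vectors, Euclidean or angular."""
    if distance == "angular":
        dot = sum(a * b for a, b in zip(wi, wk))
        return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding errors
    return math.dist(wi, wk)


# Two spaces times six labels give twelve configurations.
S_LABELS = ["0", "1", "2", "a0", "a1", "a2"]
CONFIGS = [(space, *parse_config(label))
           for space in ("full", "half")
           for label in S_LABELS]
```

For unit vectors the angular distance lies between 0 and pi, while the Euclidean distance lies between 0 and 2, so the two choices weight nearby neurons differently for the same `s`.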
We use the MUSDB18 dataset (https://sigsep.github.io/datasets/musdb.html). It is composed of 150 full-length musical tracks. The data is divided into 100 songs for development and 50 for testing, corresponding to 6.5 and 3.5 hours, respectively. The development set is further divided into training and validation sets by randomly selecting 25 tracks for the latter.
For the task of singing voice separation, the drums, bass, and other sources are mixed to represent the accompaniment.
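The data preparation described above can be sketched in plain Python. The track names are placeholders, the fixed random seed is an assumption made for reproducibility, and the function names are hypothetical; the stems are represented as lists of floats.

```python
import random


def split_development_set(track_names, n_validation=25, seed=0):
    """Randomly hold out n_validation tracks; the rest is used for training."""
    rng = random.Random(seed)  # fixed seed is an assumption of this sketch
    shuffled = list(track_names)
    rng.shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]


def mix_accompaniment(drums, bass, other):
    """Accompaniment as the sample-wise sum of the three non-vocal stems."""
    return [d + b + o for d, b, o in zip(drums, bass, other)]


development_tracks = [f"track_{i:03d}" for i in range(100)]  # placeholder names
train_tracks, validation_tracks = split_development_set(development_tracks)
accompaniment = mix_accompaniment([0.1, 0.0], [0.2, -0.1], [0.3, 0.1])
```

With 100 development tracks this yields 75 tracks for training and 25 for validation.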
3.4 Experimental setup
We explored the influence of the MHE configuration and hyperparameters with an initial grid search. For the grid search, all twelve MHE configurations described in Table 1 and a baseline model were implemented using the same network parametrization in order to compare their performance. Our implementation is available as open source (https://github.com/jperezlapillo/Hyper-Wave-U-Net).
As in [15], the models were trained using the Adam optimizer [6] with the same decay rates and learning rate, and a batch size of 16. Data augmentation is applied by attenuating the vocals by a random factor in the range [0.7, 1.0]. The best model selected in the first round of training is further fine-tuned, doubling the batch size and reducing the learning rate. The model with the lowest validation loss is finally tested against the unseen data.
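The data augmentation step can be sketched as follows. Rebuilding the input mixture from the attenuated vocals and the unchanged accompaniment is an assumption of this sketch, and the function name `augment_mixture` is hypothetical.

```python
import random


def augment_mixture(vocals, accompaniment, rng, low=0.7, high=1.0):
    """Attenuate the vocals by a random factor in [low, high] and remix."""
    gain = rng.uniform(low, high)
    scaled_vocals = [gain * v for v in vocals]
    # Assumed remixing step: the network input is the sum of both sources
    mixture = [v + a for v, a in zip(scaled_vocals, accompaniment)]
    return mixture, scaled_vocals, gain


rng = random.Random(0)  # seeded only to make the example repeatable
vocals = [0.4, -0.2, 0.1]
accompaniment = [0.1, 0.3, -0.2]
mixture, target_vocals, gain = augment_mixture(vocals, accompaniment, rng)
```

The attenuated vocals also serve as the training target here, so that input and target stay consistent.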
In order to reduce the computational cost for the exploratory experiments, an epoch was defined as 1,000 iterations, instead of the original 2,000, and the early stopping criterion was reduced from 20 to 10 epochs without improvement of validation loss. Additionally, the tracks were mixed down to mono.
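The early stopping criterion can be illustrated with a minimal sketch. The callables `run_epoch` and `validate` are placeholders for one epoch of training (for example 1,000 iterations) and one validation pass, and the `max_epochs` cap is an assumption added to guarantee termination.

```python
def train_with_early_stopping(run_epoch, validate, patience=10, max_epochs=500):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch in range(max_epochs):
        run_epoch()
        epochs_run += 1
        loss = validate()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss, epochs_run


# Simulated validation losses: three improving epochs, then a plateau.
simulated_losses = iter([1.0, 0.8, 0.7] + [0.75] * 50)
best_epoch, best_loss, epochs_run = train_with_early_stopping(
    run_epoch=lambda: None,
    validate=lambda: next(simulated_losses),
)
# Training stops after 13 epochs: 3 improving ones plus 10 without improvement.
```

A larger `patience` lets training continue through longer plateaus of the validation loss, at the cost of more epochs.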
A further experiment was performed exploring the regularization constant and an increased early stopping criterion. We finally evaluated the singing voice separation performance of the best MHE configuration with the original settings, using the extended dataset and the parametrisation as in [15].
The models are evaluated using the signal-to-distortion ratio (SDR), as proposed in [18]. The audio tracks are partitioned into non-overlapping segments of one second in length, and the SDR is calculated for each individual segment. The resulting SDRs are then averaged over the songs, and over the whole dataset.
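The segment-wise evaluation can be illustrated with a simplified sketch. It uses the plain energy ratio between the reference and the estimation error, which is only a stand-in for the full SDR definition used in the evaluation. The function names are hypothetical, and `segment_length` would equal the sampling rate for one-second segments.

```python
import math
import statistics


def sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over error energy."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal == 0.0 or error == 0.0:
        return None  # undefined for silent references or perfect estimates
    return 10.0 * math.log10(signal / error)


def segment_sdrs(reference, estimate, segment_length):
    """SDR of each non-overlapping segment; undefined segments are skipped."""
    values = []
    for start in range(0, len(reference) - segment_length + 1, segment_length):
        stop = start + segment_length
        value = sdr_db(reference[start:stop], estimate[start:stop])
        if value is not None:
            values.append(value)
    return values


# Three toy segments of two samples each; the middle one is nearly silent.
reference = [1.0, 1.0, 0.01, 0.01, 1.0, 1.0]
estimate = [0.9, 0.9, 0.11, 0.11, 0.9, 0.9]
sdrs = segment_sdrs(reference, estimate, segment_length=2)
median_sdr = statistics.median(sdrs)
mean_sdr = statistics.mean(sdrs)
```

In this toy example the same absolute error gives about 20 dB in the loud segments and about -20 dB in the nearly silent one, so the mean falls well below the median. This is one reason why both statistics are worth reporting.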
4.1 Hyperparameter exploration
Figures 1 and 2 show statistics for the SDR obtained by each model over the MUSDB18 test set. The baseline model obtained the highest median SDR with 3.66 dB for the vocals estimation, followed closely by MHE_0 (full-space MHE, Euclidean distance, $s = 0$) with 3.63 dB. However, MHE_0 obtained the highest mean SDR with -0.31 dB, compared to -0.38 dB for the baseline. For the accompaniment, the MHE_0 model showed the highest median and mean SDR, albeit by a small margin.
Table 2 compares different MHE configurations, aggregating the results of Figures 1 and 2 by groups of parameters. From this, it becomes clear that the ideal $s$ value is $s = 0$ and the preferred distance is Euclidean. When comparing MHE versus half_MHE models, the former obtains better results on the vocals, and the latter on the accompaniment estimation.
4.2 MHE loss curves
Figure 3 shows the development of the MHE loss during training. While full MHE models rapidly reduce the MHE loss in the first epochs of the training process, Euclidean half_MHE models tend to form steps. The behaviour seems to increase in frequency when higher s values are in use.
The MHE loss tends to be relatively stable, with only small changes from one epoch to the next. This is probably due to the large number of parameters being considered and the value chosen for the learning rate.
The following experiments focus on comparing the baseline model with the best performing MHE model, MHE_0, in different scenarios.
4.3 Early stopping and regularization strength
Table 3 shows the SDR results for MHE_0 with an early stopping criterion of 20 epochs and different values of the regularization constant $\lambda_h$. This doubling of the early stopping criterion led to an 88% increase in the number of epochs for the MHE_0 model. In this setting, the MHE_0 model obtained higher results in all statistics than the baseline model in the leftmost column.
The increased early stopping criterion leads to longer training times for the MHE_0 model.
So far, the regularization constant $\lambda_h$ was set following the recommendation in [7]. Table 3 also shows the SDR results for a larger and a smaller value of $\lambda_h$. Both increasing and decreasing the regularization constant have a negative effect on SDR.
4.4 Extended training data
Considering the same conditions reported in [15] for their best vocals separator system, called M4, we re-implemented this configuration and trained it in parallel with an MHE_0 model. The results in Table 4 show that MHE_0 outperforms our M4 implementation (left row) and the originally reported M4 results. This indicates that MHE_0 is currently the best time-domain singing voice separation model on MUSDB18.
The results confirm that the MHE_0 model, i.e. full-space MHE with Euclidean distance and $s = 0$, outperforms the alternative MHE versions, similar to the results in [7]. Training MHE models needs more epochs than training the baseline and benefits from an increased early stopping criterion. Overall, including MHE regularization in the Wave-U-Net loss function improves singing voice separation when sufficient time is given for the training.
Particularly in the vocals source, there are some silent periods. These can produce very low SDR values and clearly audible artifacts. MHE helps to reduce artifacts in these periods, as can be seen in Figure 4.
The results achieved here do not yet match the best singing voice extraction method based on time-frequency representations [16]. Closing the performance gap of around 1.5 dB is a worthwhile research challenge because of the potential of time-domain systems for low latency processing.
This study explores the use of a novel regularization method, minimum hyperspherical energy (MHE), for improving singing voice separation with the Wave-U-Net. It is, to the authors' knowledge, the first time that this technique has been applied to an audio-related problem.
Our results suggest that MHE regularization, combined with the appropriate early stopping criterion, is worth including in the loss function of deep learning separator systems such as the Wave-U-Net, as it leads to a new state of the art in our experiments.
For future work we intend to address other applications, such as speech enhancement and separation, as well as other loss formulations. We are also interested in designing low latency systems based on this approach and aim to reduce the computational cost of the Wave-U-Net.
- [1] Y. Bengio and Y. LeCun (Eds.) (2015) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- [2] (2016) Deep learning. MIT Press. http://www.deeplearningbook.org
- [3] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149 [cs].
- [4] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 [cs].
- [5] (2017) Singing voice separation with deep U-Net convolutional networks. Paper presented at the 18th International Society for Music Information Retrieval Conference, 23-27 Oct 2017, Suzhou, China.
- [6] (2015) Adam: a method for stochastic optimization. In [1].
- [7] (2018) Learning towards minimum hyperspherical energy. arXiv:1805.09298 [cs, stat].
- [8] (2015) Scalable audio separation with light kernel additive modelling. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 76–80.
- [9] (2016) Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (9), pp. 1652–1664.
- [10] (2007) Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech and Language Processing 15 (5), pp. 1564–1578.
- [11] (2013) REpeating Pattern Extraction Technique (REPET): a simple method for music/voice separation. IEEE Transactions on Audio, Speech, and Language Processing 21 (1), pp. 73–84.
- [12] (2017-12) The MUSDB18 corpus for music separation. https://doi.org/10.5281/zenodo.1117372
- [13] (2018) An overview of lead and accompaniment separation in music. arXiv:1804.08300 [cs, eess].
- [14] (2014) Static and dynamic source separation using nonnegative factorizations: a unified view. IEEE Signal Processing Magazine 31 (3), pp. 66–75.
- [15] (2018) Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv:1806.03185 [cs, eess, stat].
- [16] (2019) Open-Unmix - a reference implementation for music source separation. Journal of Open Source Software 4 (41).
- [17] (1904) XXIV. On the structure of the atom: an investigation of the stability and periods of oscillation of a number of corpuscles arranged at equal intervals around the circumference of a circle; with application of the results to the theory of atomic structure. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 7 (39), pp. 237–265.
- [18] (2006) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing 14 (4), pp. 1462–1469.
- [19] (2018) Audio source separation and speech enhancement. John Wiley & Sons.
- [20] (2017) Diversity-promoting Bayesian learning of latent variable models. arXiv:1711.08770 [cs, stat].