Blind source separation (BSS) is a fundamental problem in signal processing. It consists of separating a set of mixture signals into a set of source signals without using any extra information [2]. In this work, we consider the task of Music Source Separation (MSS), which is an ill-posed and underdetermined case of BSS, where multiple sources (instrumental signals) must be separated from a single mixture (music recording). Current MSS methods are based on Deep Neural Networks (DNNs) that need a large amount of labelled data (mixtures and ground-truth isolated instrumental signals) to be trained under a supervised scenario [31, 33]. However, labelled audio data for MSS is difficult to obtain. In the literature, there are only a few large-scale public datasets for MSS, such as MUSDB18 [27] and Slakh [19].
Even though it is known that the use of data augmentation techniques such as random pitch-shifting and random mixing of source signals can improve model generalisation [34, 1], separation performance will always depend on the type of audio data used during training. When the data distribution of the training set is different from that of the test set, the performance of any predictor is degraded. This effect is known as dataset shift [26] and happens due to mismatched characteristics between the data used for training and testing.
Under this scenario, domain adaptation techniques address this problem by adapting predictors from a source domain, where usually a large amount of labelled data is available, to a target domain, where little or no labelled data is available. Domain adaptation is already consolidated as an important research topic in computer vision, where it is used in complex classification tasks. Even in closer fields, such as acoustic scene analysis [10, 37], speech recognition [32] and speech enhancement [20], domain adaptation methods have already been proposed. However, to our knowledge, methods of this nature have not yet been investigated for MSS. Therefore, our work also attempts to fill this gap in the literature.
We propose an adversarial unsupervised domain adaptation approach for MSS. By using the mixtures and the available ground-truth signals from MUSDB18 and a set of unlabelled data (mixtures) from a different domain, we show that our framework is able to improve separation performance in the new domain while maintaining the original performance on MUSDB18, considerably reducing the degradation effect caused by dataset shift. Although our experiments are carried out for the particular task of Harmonic-Percussive Source Separation (HPSS), our framework can be easily adapted to other MSS tasks with different types of sources and domains.
In summary, our contributions include:
The first work focused on unsupervised domain adaptation for MSS;
An adversarial unsupervised domain adaptation framework for MSS that can be used with any neural network architecture, any type of audio representation and any number of sources;
The public release of the “Tap & Fiddle Dataset”, a dataset containing recordings of traditional Scandinavian fiddle tunes with accompanying foot-tapping along with isolated tracks for “foot-tapping” and “violin”. This dataset has different timbral characteristics than MUSDB18 and is useful for domain adaptation experiments;
A prototype experiment where we show an improvement over benchmark methods for the HPSS task.
II Related Work
II-A Harmonic-Percussive Source Separation
The task of HPSS consists of separating a music signal into two source signals, one with the harmonic components and the other with the percussive sounds. Signal processing methods for HPSS perform separation by exploiting the fact that percussive signals form vertical lines in the mixture time-frequency representation, while the harmonic components tend to form horizontal structures, e.g. [7, 3, 15]. However, due to their strict assumptions and hand-crafted features, methods of this nature have intrinsic performance limitations.
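The horizontal/vertical structure assumption can be illustrated with a minimal median-filtering sketch in the spirit of the signal processing methods cited above. It is an illustration only and not the method proposed in this paper; the function names, the kernel length and the binary masking rule are assumptions made for this example.

```python
from statistics import median


def median_filter_1d(values, kernel):
    """Median-filter a list with an odd kernel, clamping indices at the edges."""
    half = kernel // 2
    last = len(values) - 1
    return [
        median(values[min(max(i + k, 0), last)] for k in range(-half, half + 1))
        for i in range(len(values))
    ]


def hpss_median(spec, kernel=3):
    """Split a magnitude spectrogram (list of frequency rows, each a list of
    frames) into harmonic and percussive parts using binary masks."""
    n_freq, n_frames = len(spec), len(spec[0])
    # Filtering along time (within each frequency row) keeps horizontal lines.
    harm = [median_filter_1d(row, kernel) for row in spec]
    # Filtering along frequency (within each frame) keeps vertical lines.
    cols = [median_filter_1d([spec[f][t] for f in range(n_freq)], kernel)
            for t in range(n_frames)]
    perc = [[cols[t][f] for t in range(n_frames)] for f in range(n_freq)]
    # Each bin goes to whichever filtered version is stronger (ties: harmonic).
    harmonic = [[spec[f][t] if harm[f][t] >= perc[f][t] else 0.0
                 for t in range(n_frames)] for f in range(n_freq)]
    percussive = [[spec[f][t] - harmonic[f][t] for t in range(n_frames)]
                  for f in range(n_freq)]
    return harmonic, percussive
```

On a toy spectrogram containing one sustained tone (a horizontal line) and one click (a vertical line), the tone ends up in the harmonic output and the click in the percussive output, which is exactly the structural prior that limits such methods on real recordings.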
Over the years, data-driven approaches have shown significant improvements over traditional methods for HPSS, and current state-of-the-art methods are based on DNNs [28, 14, 4, 17]. In previous work carried out by the authors [17], the W-MDenseNet, an encoder-decoder DNN that uses convolutions with several kernel shapes to perform HPSS, was proposed. In this work, the same architecture is used, but here we add a domain discriminator into the framework and modify the loss function to support adversarial domain adaptation.
Moreover, since our approach is also grounded in Generative Adversarial Networks (GANs), it is important to point out some key aspects in which our proposal is different from other GAN-based source separation methods [30, 5, 22].
Works on GAN-based MSS use a source discriminator, which is trained to differentiate real source signals from fake source signals. This is different from our work, where we use a domain discriminator trained to differentiate mixtures across two different domains.
II-A2 Unlabelled data
In order to train a source discriminator, a large number of single-source signals are required, even though those signals do not necessarily have to be paired with a music mixture. Here, we only need mixtures from each of the two domains to successfully train our domain discriminator.
II-A3 Input to discriminator
In GAN-based MSS works, the input to the source discriminator is the output of the separator network. Our approach instead applies the domain discriminator to the encoded feature-maps in the middle of the separator network, not directly to its output.
II-B Domain Adaptation
Domain adaptation methods can be either supervised or unsupervised depending on the type of data from the target domain that is used. While Supervised Domain Adaptation (SDA) methods use labelled data, Unsupervised Domain Adaptation (UDA) exploits only unlabelled data (mixture signals) from the target domain.
A typical SDA approach is to first train a model using a large number of labelled samples from the source domain and then re-train some (or all) of its layers using a smaller labelled dataset of interest (target domain). This technique is known as fine-tuning [24, 36]. Another SDA approach is joint training, where the two datasets are merged into a new dataset and only a single training stage is done, using labelled data from both domains in every batch [18, 19].
UDA methods usually consider that the system is under the covariate shift paradigm [29], assuming that, even though the marginal distribution of source domain data is different from the marginal distribution of target domain data, the conditional probability of the output remains the same. Therefore, if the marginal distributions can be matched, the same predictor can be applied successfully to samples from either of the two domains. In order to do this, some UDA methods propose to re-weight [13] or select samples from the source domain [11], while others project the data through an embedding function such that not only do the marginals become similar in the embedded space, but the embedded features also keep their discrimination potential [6, 38]. The latter case is also the type of UDA method in our proposal. We look for a transformation that creates an embedded space in which the confusion between the two domains is maximised.
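Written out, the covariate shift assumption and the goal of the embedding-based methods read as follows. The symbols $p_S$, $p_T$, $x$, $y$ and $\phi$ are introduced here for illustration only and are not necessarily the notation used elsewhere in this paper.

```latex
% Covariate shift: the marginals differ, the conditional is shared.
\begin{align}
  p_S(x) &\neq p_T(x), &
  p_S(y \mid x) &= p_T(y \mid x).
\end{align}
% Embedding-based UDA looks for a mapping \phi whose outputs have
% (approximately) matching marginals across the two domains.
\begin{equation}
  p_S\bigl(\phi(x)\bigr) \approx p_T\bigl(\phi(x)\bigr).
\end{equation}
```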
Similar to earlier work on adversarial domain adaptation, we propose to find a domain-invariant and separation-discriminative embedded space that is learned from data via adversarial training. However, unlike that work, we deal with the task of source separation (regression) instead of image recognition (classification). In addition, we use CNNs for the encoder-decoder and the domain discriminator, whereas simple feed-forward networks are used there. Finally, instead of performing adversarial training with the gradient reversal layer method, we conduct conditional GAN iterative optimisation.
III Proposed Framework
We assume that both the input data and the outputs are magnitude spectrograms with a fixed number of frequency bins and a fixed number of frames. To simplify the notation, we treat each spectrogram as a vector whose dimension is the product of the number of frequency bins and the number of frames. Hence, the input (mixture signal) is a single vector and its labels (ground-truth isolated source signals) form a two-column matrix, where the first column is the original harmonic vector and the second column is the original percussive vector. Furthermore, we consider that the mixture-label pairs follow a joint distribution that defines the source domain or, in other words, we say that the data “come from the source domain”. For the general supervised HPSS case, the goal is to train a model based on this data that can be a good predictor of the labels given the mixture.
In [17] we proposed the W-MDenseNet, a convolutional encoder-decoder for HPSS, where the network output is an estimate of the labels. Here, we model the encoder-decoder-based separation process as a sequence of two mappings, each with its own set of parameters. First, the encoder maps the input to an embedded feature space, and then the decoder maps the embedded features to the output.
This separator can be optimised for the general supervised HPSS case using the mean square error as the loss, with separate weights for the harmonic and percussive outputs. We use the same weight for both since we want to assign equal importance to each source. The loss is written in terms of the Euclidean norm of each column error or, in matrix form, the Frobenius norm together with a diagonal matrix formed from the two weights.
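One way to write the separator and its supervised loss, consistent with the verbal description, is sketched below. All symbols are notation assumed for this sketch: $x$ is the mixture vector, $Y = [y_h, y_p]$ the label matrix, $\hat{Y} = [\hat{y}_h, \hat{y}_p]$ its estimate, $z$ the embedded features, $E$ and $D$ the encoder and decoder with parameters $\theta_E$ and $\theta_D$, and $w_h$, $w_p$ the source weights. Any constant normalisation factor of the mean square error is omitted.

```latex
\begin{align}
  z &= E(x;\theta_E), & \hat{Y} &= D(z;\theta_D), \\
  \mathcal{L}_{\mathrm{sep}}
    &= w_h \,\lVert y_h - \hat{y}_h \rVert_2^2
     + w_p \,\lVert y_p - \hat{y}_p \rVert_2^2
     = \bigl\lVert (Y - \hat{Y})\, W^{1/2} \bigr\rVert_F^2, &
  W &= \operatorname{diag}(w_h, w_p).
\end{align}
```

The two forms agree because right-multiplying by $W^{1/2}$ scales each column of the error matrix by the square root of its weight before the squared Frobenius norm sums the squared entries.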
However, in this work we assume there also exists a new domain, the target domain, where mixtures follow a marginal distribution that is considered different from that of the source domain. Our main goal is now to be able to robustly predict labels given that the input can be from either the source or the target domain. Apart from the labelled samples from the source domain, we have access to a set of mixtures from the target domain that can be used for performing UDA.
Our approach adopts a similar methodology to [8] and [9]. We propose to learn encoded features that not only guarantee good separation performance, but are also invariant to domain changes. This means that the encoded features must not contain any discriminative information about the origin of the input (source or target domain). By doing so, we can make the distributions of the encoded features from the two domains as similar as possible. In order to measure their similarity, we use a domain discriminator to discriminate the encoded feature-maps between the two domains. This domain discriminator is a binary classifier that can be trained using only mixture signals by minimising the binary cross-entropy.
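The binary cross-entropy of a domain discriminator can be sketched in a few lines. The function name, the labelling convention (1 for the source domain, 0 for the target domain) and the clipping constant are assumptions made for this illustration and are not taken from the paper.

```python
import math


def domain_bce(p_source, p_target, eps=1e-12):
    """Binary cross-entropy of a domain discriminator.

    p_source / p_target hold the discriminator's predicted probability that
    each sample comes from the source domain (label 1 = source, 0 = target;
    this labelling convention is an assumption of the sketch).
    """
    loss = 0.0
    for p in p_source:
        loss -= math.log(max(p, eps))        # source samples: want p -> 1
    for p in p_target:
        loss -= math.log(max(1.0 - p, eps))  # target samples: want p -> 0
    return loss / (len(p_source) + len(p_target))
```

A discriminator that outputs 0.5 for every sample is maximally confused and yields a loss of log 2, whereas confident correct predictions drive the loss towards zero. Only domain labels are needed, so the loss can be computed from mixtures alone.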
Fig. 1 summarises the domain adaptation scenario.
In addition, we ensure that the encoded features become domain-invariant by forcing the encoder sub-network to generate feature-maps that can fool the domain discriminator. This is achieved by maximising the binary cross-entropy of the domain discriminator when training the encoder weights. Such a min-max game is played by the encoder sub-network and the domain discriminator during training, just like in GAN training [12]. At the same time, the encoded features can keep their separation-discriminative properties if we include the minimisation of the separation loss in the loss function. The final encoder loss is, therefore, a combination of the (unsupervised) adversarial loss, which can be optimised using only mixture signals from each of the two domains, and the (supervised) separation loss, which can be optimised based only on samples from the source domain since it requires labelled data. In summary, each sub-network is trained with its own loss function, and separate weights are given to the unsupervised part and to the supervised part of the loss.
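A possible way to write the three objectives, following the verbal description, is given below. The notation is assumed for this sketch: $\mathcal{L}_{\mathrm{dom}}$ is the binary cross-entropy of the domain discriminator with parameters $\theta_C$, $\mathcal{L}_{\mathrm{sep}}$ is the supervised separation loss, $\theta_E$ and $\theta_D$ are the encoder and decoder parameters, and $\lambda_u$, $\lambda_s$ are the weights of the unsupervised and supervised parts. The decoder objective in particular is an assumption, since only the encoder loss is described in words.

```latex
\begin{align}
  \min_{\theta_C}\; & \mathcal{L}_{\mathrm{dom}}
      && \text{(domain discriminator)} \\
  \min_{\theta_E}\; & \lambda_s \mathcal{L}_{\mathrm{sep}}
      - \lambda_u \mathcal{L}_{\mathrm{dom}}
      && \text{(encoder)} \\
  \min_{\theta_D}\; & \mathcal{L}_{\mathrm{sep}}
      && \text{(decoder)}
\end{align}
```

The negative sign in the encoder objective expresses that the encoder maximises the discriminator loss while still minimising the separation loss.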
It should be noted that the encoder, the decoder and the domain discriminator must be trained together in an iterative way, as in GAN training [12]. If the domain discriminator is optimised to completion, the encoder sub-network will not be able to increase the domain-discriminator confusion, causing the separator to overfit to the source domain. In our experiments, at every training iteration, we perform a fixed number of updates on the domain discriminator before updating the encoder and the decoder. The full training algorithm can be found in the supplementary material of this paper.
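The alternating update schedule can be sketched as follows. This is not the training algorithm from the supplementary material: the function name, the callback interface and the parameter `k` (the number of discriminator updates per iteration) are hypothetical and only illustrate the ordering of the updates.

```python
def train_adversarially(batches, k, update_discriminator, update_separator):
    """Alternate k discriminator updates with one encoder/decoder update.

    `update_discriminator` and `update_separator` are user-supplied callables
    (hypothetical here) that each perform one optimisation step on a batch.
    Returns a log of the update order for inspection.
    """
    log = []
    for batch in batches:
        for _ in range(k):
            # The discriminator learns to tell the two domains apart.
            update_discriminator(batch)
            log.append("D")
        # The encoder tries to fool the discriminator while, together with
        # the decoder, still minimising the supervised separation loss.
        update_separator(batch)
        log.append("ED")
    return log
```

Keeping `k` small means the discriminator is never trained to completion between separator updates, which is the condition the text identifies as necessary for the encoder to keep increasing domain confusion.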
[Table I. Columns: Method; Test Set (MUSDB18, source domain; Tap & Fiddle, target domain); Type of Data.]
MUSDB18 [27] is the largest public dataset for MSS containing real-world audio recordings. It contains full-track songs and includes both the mixtures and the original sources, divided between a training subset and a test subset of music recordings. The available isolated tracks are vocals, bass, drums and “other”. We use the drum track as the ground-truth for the percussive source, while the sum of the other tracks is used as ground-truth for the harmonic source.
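The construction of the two HPSS targets from the four isolated tracks can be sketched as below. The function name and the dictionary keys are assumptions made for this example, and the stems are represented as plain lists of samples of equal length.

```python
def hpss_targets(stems):
    """Return (harmonic, percussive) waveforms from a dict of equal-length
    stems keyed 'vocals', 'bass', 'drums' and 'other' (assumed key names)."""
    # The drum track is used directly as the percussive ground truth.
    percussive = list(stems["drums"])
    # The harmonic ground truth is the sample-wise sum of all other tracks.
    harmonic = [v + b + o for v, b, o in
                zip(stems["vocals"], stems["bass"], stems["other"])]
    return harmonic, percussive
```

For instance, `hpss_targets({"vocals": [0.5], "bass": [0.25], "drums": [1.0], "other": [0.125]})` returns `([0.875], [1.0])`.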
As a different domain, we collected and publicly release the Tap & Fiddle (T&F) dataset [16]. The T&F dataset contains stereo recordings of traditional Scandinavian fiddle tunes with accompanying foot-tapping, which is standard performance practice within these musical styles. It consists of recordings with completely separate fiddle and foot-tapping sounds as well as mixed signals. The dataset is divided into a training set and a test set, and all recordings are solo performances. Detailed information regarding the T&F dataset can be found in [16].
V Experimental Setup
In our experiments, the music signals are converted to mono and resampled. The inputs are normalised magnitude spectrograms generated by the application of an STFT with overlapping frames. A portion of all labelled data available for training is set aside as a validation split.
We use the W-MDenseNet [17] as the separator architecture. As a post-processing step, we apply Wiener filtering [21] to the source estimates and use the mixture phase to return to the time domain. We concatenate the encoded feature-maps of each of the three branches of the W-MDenseNet to form the embedded features that are passed to the domain discriminator. Details about hyper-parameter choices can be found in the paper’s supplementary material. The architecture of the domain-discriminator network is depicted in Fig. 2.
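The post-processing step can be illustrated with a single-channel, Wiener-like soft mask. This is a simplification assumed for this sketch rather than the exact filter used in the experiments; the function name, the flattened list representation of the STFT and the constant `eps` are illustrative choices.

```python
def wiener_masks(harm_mag, perc_mag, mixture_stft, eps=1e-12):
    """Apply single-channel Wiener-like masks to a flattened complex STFT.

    harm_mag / perc_mag are the estimated magnitudes of each source and
    mixture_stft holds the complex mixture coefficients for the same bins.
    """
    harm_out, perc_out = [], []
    for h, p, x in zip(harm_mag, perc_mag, mixture_stft):
        # Power-ratio mask in [0, 1] for the harmonic source.
        mask_h = (h * h) / (h * h + p * p + eps)
        # Multiplying the complex mixture by a real mask keeps its phase.
        harm_out.append(mask_h * x)
        perc_out.append((1.0 - mask_h) * x)
    return harm_out, perc_out
```

Because the two masks sum to one, the filtered source estimates add up to the mixture again, and the complex outputs can be passed to an inverse STFT to return to the time domain.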
After experimentation, we fix the values of the two weights given to the unsupervised and supervised parts of the loss. Training is performed using the Adam optimiser; the initial learning rate is reduced by a constant factor if the supervised validation loss stops improving for a set number of consecutive epochs, and training is stopped if no improvement happens within a set number of epochs. The separation quality is evaluated using the BSS_eval [35] set of objective metrics, which are widely used by the MSS community.
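The learning-rate reduction and early-stopping rule can be sketched as a small function that replays a sequence of validation losses. The function name and all parameter names and values (`factor`, `lr_patience`, `stop_patience`) are hypothetical; the actual settings used in the experiments are not reproduced here.

```python
def run_schedule(val_losses, lr, factor, lr_patience, stop_patience):
    """Replay validation losses and return (final_lr, epochs_run).

    The learning rate is multiplied by `factor` after `lr_patience` epochs
    without improvement, and training stops after `stop_patience` epochs
    without improvement.
    """
    best = float("inf")
    since_best = 0      # epochs since the best validation loss
    since_lr_drop = 0   # non-improving epochs since the last reduction
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_lr_drop = loss, 0, 0
        else:
            since_best += 1
            since_lr_drop += 1
        if since_lr_drop >= lr_patience:
            lr *= factor
            since_lr_drop = 0
        if since_best >= stop_patience:
            return lr, epoch
    return lr, len(val_losses)
```

With a stopping patience larger than the learning-rate patience, the optimiser gets at least one chance to recover at a lower learning rate before training is abandoned.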
Recordings from MUSDB18 represent the source domain, while recordings from the T&F dataset represent the target domain. We aim to investigate how different training scenarios perform across the two domains. We compare our UDA proposal to traditional supervised HPSS approaches that use only labelled data from one of the domains, to SDA frameworks, which include joint training using labelled data from both datasets and fine-tuning over samples from T&F after training on MUSDB18, and to another state-of-the-art DNN for MSS named OpenUnmix [31]. This method was previously trained on an augmented version of MUSDB18 and serves as a baseline in our comparison.
In addition to the mixtures in the T&F dataset, we have a collection of new recordings of Scandinavian fiddle tunes with accompanying foot-tapping. This collection is also part of the target domain and, although no labels are available, it can also be used by our UDA method. We therefore test two versions of our approach: HPSS_UDA_small, which uses the mixtures in the training set of T&F for performing the adaptation to the target domain, and HPSS_UDA_large, which uses the larger set of mixtures from our internal collection. Results are shown in Table I.
By inspecting Table I, we can readily note that models trained only with samples from one dataset performed poorly on the other, from which we conclude that MUSDB18 and T&F have very different priors over the data. This is also reflected in the performance of OpenUnmix, which is much lower on T&F when compared with the performance provided by the ideal masking methods. Moreover, as expected, the jointly trained model, SDA_joint, achieved relatively good performance overall because it uses supervised data from both domains. The SDA_tune model, which is the HPSS_MUSDB model fine-tuned for T&F, was indeed greatly improved when evaluated over this domain but, as a trade-off, it lost much of its original performance on the MUSDB18 dataset. On the other hand, both versions of the proposed UDA approach improved on all of the metrics on T&F without losing any considerable performance on MUSDB18. This means that our proposed UDA approach can perform HPSS on both domains successfully, even though the labelled data used for training came only from the source domain.
The quantity of unlabelled data from the target domain also impacted the performance of the proposed method. Even though the results of UDA_large are similar to those of UDA_small over the source domain, the former performs much better over samples from the target domain because it uses more than double the amount of mixtures from this particular domain during training to perform domain adaptation. Another interesting result is that UDA_large, which is a semi-supervised framework, had similar performance over MUSDB18 but much better performance over T&F when compared to SDA_joint, which is a fully supervised method. This suggests that UDA using large amounts of unlabelled data can be more promising than joint training using a smaller amount of labelled data.
More information about our work can be found in the paper’s supplementary document and supplementary webpage (http://c4dm.eecs.qmul.ac.uk/auda-hpss).
In this work we presented an adversarial UDA model for HPSS. Our proposal is a semi-supervised framework that is able to exploit unlabelled mixtures from a target domain in order to improve HPSS generalisation to samples from this particular domain. Results showed that our framework improves separation performance on the target domain without losing considerable performance on the source domain.
As future work, we plan to investigate how the utilisation of small amounts of labelled samples from the target domain affects domain adaptation performance. We believe that this “few-shot” approach can be useful in improving source separation performance when few data samples are available.
- [1] (2019-09) Improving singing voice separation using deep U-Net and Wave-U-Net with data augmentation. In European Signal Processing Conference (EUSIPCO), A Coruna, Spain.
- [2] P. Comon and C. Jutten (Eds.) (2010) Handbook of blind source separation. 1st edition, Academic Press, Inc., Oxford, UK.
- [3] (2014-10) Extending harmonic-percussive separation of audio signals. In International Society for Music Information Retrieval Conference (ISMIR), Vol. 15, Taipei, Taiwan, pp. 611–616.
- [4] (2018-09) Harmonic-percussive source separation with deep neural networks and phase recovery. In International Workshop on Acoustic Signal Enhancement (IWAENC), Vol. 16, Tokyo, Japan, pp. 421–425.
- [5] (2018-04) SVSGAN: singing voice separation via generative adversarial network. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, pp. 726–730.
- [6] (2013-12) Unsupervised visual domain adaptation using subspace alignment. In IEEE International Conference on Computer Vision, Sydney, Australia, pp. 2960–2967.
- [7] (2010-09) Harmonic/percussive separation using median filtering. In International Conference on Digital Audio Effects (DAFx), Vol. 13, Graz, Austria, pp. 246–253.
- [8] Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), Lille, France, pp. 1180–1189.
- [9] (2016-01) Domain-adversarial training of neural networks. Journal of Machine Learning Research 17 (59), pp. 1–35.
- [10] Unsupervised adversarial domain adaptation for acoustic scene classification. In Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Surrey, UK, pp. 138–142.
- [11] (2013-06) Connecting the dots with landmarks: discriminatively learning domain-invariant features for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), Atlanta, USA, pp. 222–230.
- [12] (2014-12) Generative adversarial nets. In Conference on Neural Information Processing Systems (NeurIPS), Vol. 28, Montreal, Canada, pp. 2672–2680.
- [13] (2006-12) Correcting sample selection bias by unlabeled data. In Conference on Neural Information Processing Systems (NeurIPS), Vancouver, CA, pp. 601–608.
- [14] (2017-09) Harmonic and percussive source separation using a convolutional auto encoder. In European Signal Processing Conference (EUSIPCO), Vol. 25, Kos Island, Greece, pp. 1804–1808.
- [15] (2014) Kernel additive models for source separation. IEEE Transactions on Signal Processing 62 (16), pp. 4298–4310.
- [16] (2020-12) Tap & Fiddle: A dataset with Scandinavian fiddle tunes with accompanying foot-tapping. Zenodo.
- [17] (2019-10) Investigating kernel shapes and skip connections for deep learning-based harmonic-percussive separation. In Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, pp. 40–44.
- [18] (2018-11) Building corpora for single-channel speech separation across multiple domains. arXiv e-print:1811.02641.
- [19] (2019-10) Cutting music source separation some Slakh: a dataset to study the impact of training data quality and quantity. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, pp. 45–49.
- [20] (2018-09) Adversarial feature-mapping for speech enhancement. In Interspeech, Hyderabad, India, pp. 3259–3263.
- [21] (2016-08) Multichannel music separation with deep neural networks. In The European Signal Processing Conference (EUSIPCO), Budapest, Hungary, pp. 1748–1752.
- [22] CASS: cross adversarial source separation via autoencoder. arXiv e-print:1905.09877.
- [23] (2008-08) Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. In European Signal Processing Conference (EUSIPCO), Lausanne, Switzerland, pp. 1–4.
- [24] Learning and transferring mid-level image representations using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, USA, pp. 1717–1724.
- [25] (2019) Moment matching for multi-source domain adaptation. In IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.
- [26] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence (Eds.) (2008) Dataset shift in machine learning. The MIT Press.
- [27] (2017) The MUSDB18 corpus for music separation.
- [28] (2018-09) Stationary/transient audio separation using convolutional autoencoders. In International Conference on Digital Audio Effects (DAFx), Aveiro, Portugal, pp. 65–71.
- [29] (2000-10) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90 (2), pp. 227–244.
- [30] (2018-04) Adversarial semi-supervised audio source separation applied to singing voice extraction. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, pp. 2391–2395.
- [31] Open-Unmix – a reference implementation for music source separation. Journal of Open Source Software 4 (41), pp. 1667.
- [32] (2017-09) An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing 257, pp. 79–87.
- [33] MMDenseLSTM: an efficient combination of convolutional and recurrent neural networks for audio source separation. In International Workshop on Acoustic Signal Enhancement (IWAENC), Vol. 16, Tokyo, Japan, pp. 106–110.
- [34] (2017-03) Improving music source separation based on deep neural networks through data augmentation and network blending. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, pp. 261–265.
- [35] (2006-07) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing 14 (4), pp. 1462–1469.
- [36] (2018-09) Transferring GANs: generating images from limited data. In European Conference on Computer Vision (ECCV), Munich, Germany, pp. 220–236.
- [37] (2020-05) A-CRNN: a domain adaptation model for sound event detection. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp. 276–280.
- [38] (2018-07) TLR: transfer latent representation for unsupervised domain adaptation. In IEEE International Conference on Multimedia and Expo (ICME), San Diego, USA, pp. 1–6.