Music source separation, i.e., separating the sources (instruments) involved in an audio recording, has been a major research topic in the signal processing community, partly due to its wide downstream applications in music upmixing and remixing, karaoke, DJ-related applications, and as a pre-processing tool for other problems [24, 25, 8, 22, 23, 7, 20, 18, 11, 12, 17, 10, 21, 9, 15, 6]. Its technical difficulty has also been well acknowledged. For example, estimating the number of sources involved in a song remains challenging. Even when the number of sources is known or given beforehand, supervised training requires multi-track recordings that provide the ground-truth single-instrument tracks (a.k.a. ‘stems’) that compose a song. Such multi-track recordings are rarely publicly available due to copyright issues [3, 2]. What is typically available instead is the mixed version of a song, in which the multiple tracks have been combined into a monaural (i.e., one-channel) or stereo (two-channel) recording. The lack of data with ground-truth stems not only limits the development of data-driven methods, but also hinders systematic evaluation of new methods proposed for the task.
Over the past decades, the main methods for tackling this task can be roughly classified into two categories: model-based methods and data-centered methods. As discussed in a recent review paper, the performance of model-based methods can degrade dramatically when their core assumptions are not met. On the other hand, data-centered methods rely heavily on the availability of professionally produced or recorded multi-track data, which is hard to come by due to copyright issues.
As some medium-scale multi-track datasets have been released in the past few years, the development of data-driven models has grown quickly. A data-driven source separation model is typically a supervised model that takes an audio mixture (a monaural or stereo recording) as the input and aims to recover the tracks (i.e., multiple monaural or stereo recordings) that compose the mixture. There have been two main approaches in the literature. In the first approach, which we refer to as the no data augmentation approach, the tracks that compose an input mixture are originally from the same song. In other words, when we have N multi-track songs, we have exactly N input/output pairs for model training. In the second approach, or random-mixing data augmentation [21, 23], tracks from different songs are randomly combined to create audio mixtures, leading to artificial input/output pairs. In such a case, we can have far more than N input/output pairs, since each track of a mixture can be drawn from any of the N songs. The downside of this approach is that these input mixtures are not realistic sound mixtures in terms of the tonic, harmonic and rhythmic relations among the tracks that compose them. However, for the purpose of training data-driven models for source separation, the benefit of the resulting large increase in the amount of training data seems to outweigh this potential concern, as demonstrated in the literature [21, 23].
While there are other data augmentation methods, such as adding noise or random dropout [4, 19], the aforementioned random-mixing data augmentation, albeit simple, has been shown to be particularly successful. However, when mixing two tracks, there are actually multiple aspects to consider, suggesting room for the development of more advanced mixing-specific data augmentation methods for source separation. To the best of our knowledge, this has not yet been investigated in the literature. It is therefore our goal to develop, and to empirically evaluate the effectiveness of, new data augmentation methods that stem from the random-mixing approach. We consider in total three types of mixing-specific data augmentation methods for source separation, as conceptually visualized in Figure 1 and detailed in Section IV.
Moreover, currently available data is still insufficient for many common musical instruments, such as the violin. To investigate the benefit of data augmentation for music source separation in general, we consider in this paper the separation of violin and piano tracks in a violin piano ensemble, a task that has rarely been considered in the literature. Furthermore, we consider the case in which no multi-track recordings of violin piano ensembles are available for model training, but only a collection of piano solos and violin solos. In such a case, mixing-based data augmentation becomes a major viable approach.
In sum, the main contribution of this study is to propose a series of data augmentation/selection approaches that enable non-paired violin/piano solo stems to approximate features of realistic paired stems, which in turn facilitates the training of deep learning-based source separation models.
For reproducibility, we share the code, pre-trained models, the audio files of the ground truth and separated stems (by the ‘Wet’ model; see Section IV) of the test data publicly at https://github.com/SunnyCYC/aug4mss and https://sunnycyc.github.io/aug4mss_demo/.
Below, we review related work and the adopted network architecture for source separation in Section II. Section III describes the training and test data employed in our implementation. Section IV presents the proposed data augmentation methods, while Section V discusses the evaluation results. Finally, we conclude the paper in Section VI.
II-A Related Works
Data augmentation for improving the performance of deep neural networks has been an important topic in the field of music information retrieval (MIR) in recent years. Schluter and Grill were among the earliest to systematically explore the utility of music data augmentation for singing voice detection with neural networks. They found pitch shifting combined with time stretching and random frequency filtering to be quite helpful in reducing the classification error. Uhlich et al. proposed two neural network architectures capable of yielding state-of-the-art results for music source separation at that time, and further boosted the performance through data augmentation and network blending. Hawthorne et al. experimented with mixing techniques such as equalization, contrast, and reverberation in an attempt to make their automatic piano transcription model more robust to different recording environments and piano qualities. To our knowledge, such mixing techniques have not been employed in existing work on source separation.
New neural network architectures for blind music source separation have been continuously proposed as well. For example, Manilow et al. exploited the inherent synergy between transcription and source separation to improve both tasks using a multi-task learning architecture. Liu and Yang combined dilated convolution with modified gated recurrent units (GRU) to extend the receptive field of each dilated GRU unit, enabling their model to perform better and faster than state-of-the-art models for separating vocals and accompaniment.
II-B Model Architecture
As our focus is on the data augmentation techniques, we adopt an existing blind source separation framework called Open-Unmix as the backbone architecture in our work. Open-Unmix is an open-source recurrent architecture built around bidirectional LSTM layers (see Figure 2). It takes a fixed-length chunk of the short-time Fourier transform (STFT) spectrogram of the mixture as the input and aims to produce the corresponding separated spectrogram of one of the sources at the output. Although the model is trained on fixed-length chunks, at testing time it can be applied to spectrograms of arbitrary length. The parameters of the network are learned by minimizing the difference between the spectrograms of the ground truth and the separation result, calculated in terms of mean square error.
The original Open-Unmix model does not deal with the separation of piano and violin tracks, but it is easy to use its code base to train the model on our piano and violin data. In comparison, the other well-known open-source separation model, Spleeter, has a built-in function to isolate the piano track from an input mixture, but it does not provide a script for retraining the model to isolate the violin track. Consequently, we use the pre-trained Spleeter model released by its authors as the baseline method for performance comparison, rather than as the backbone architecture of our model.
III-A Violin/Piano Solos as Training Stems
We collected six hours of classical violin solo recordings and six hours of pop piano solo recordings from the Internet as our training and validation data. For each instrument, five hours of data are allocated for training and the remaining one hour for validation. All songs are divided into 10-second chunks. During training and validation, one chunk is randomly selected from each instrument, and the two chunks are mixed to serve as the input data.
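The chunking and random-mixing step described above could be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the representation of waveforms as NumPy arrays, dropping the trailing remainder, and mixing by plain summation are all assumptions.

```python
import numpy as np

SR = 44100       # sampling rate stated in Section V-A
CHUNK_SEC = 10   # chunk length stated in the text


def split_into_chunks(wave, sr=SR, chunk_sec=CHUNK_SEC):
    """Cut a mono waveform into non-overlapping fixed-length chunks.

    A trailing remainder shorter than one chunk is dropped (an assumption).
    """
    size = sr * chunk_sec
    n_chunks = len(wave) // size
    return [wave[i * size:(i + 1) * size] for i in range(n_chunks)]


def random_mix(violin_chunks, piano_chunks, rng):
    """Draw one chunk per instrument and sum them into a mixture.

    Returns the mixture (model input) and the two stems (targets).
    """
    violin = violin_chunks[rng.integers(len(violin_chunks))]
    piano = piano_chunks[rng.integers(len(piano_chunks))]
    mixture = violin + piano  # plain summation; gain handling is not specified in the text
    return mixture, violin, piano
```

Because the two chunks are drawn independently, any violin chunk can be paired with any piano chunk, which is what makes the pool of possible mixtures much larger than the pool of solo recordings.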
III-B MedleyDB as Evaluation Data
To evaluate the performance of our method on real data, we select multi-track songs that contain realistic violin and piano stems from MedleyDB [3, 2] for evaluation. There are only 16 such songs, highlighting the difficulty of collecting multi-track recordings for model training. Moreover, after listening to these songs one by one, we discard 10 of them, as there are severe leakage issues and accordingly the piano/violin stems are not purely piano/violin. We also note that most of the remaining six songs contain more than one violin stem. For such songs, we consider the combination of the piano track and one of the violin tracks as the input mixture, thereby creating multiple partially realistic mixtures from the same song. All the other instruments in these songs (e.g., cello) are excluded. As a result, we have 16 violin piano ensembles in total for evaluation.
We note that, due to copyright restrictions, we are unfortunately not able to share the audio files of the training data. Researchers interested in violin piano separation would have to collect the training data on their own. However, as mentioned at the end of Section I, we make public the aforementioned 16 violin piano ensembles so that others can evaluate their models on the same test set.
|Chroma distance||0.48||Mean within songs: 0.45; mean across songs: 0.51|
|Correlation||20||Within songs: 2–30; across songs: 0–10|
|Reduce silence||20||top db|
IV Proposed Augmentation/Selection Methods
As shown in Figure 1 and Table I, we consider three types of augmentation methods here. The mixing type augmentation includes factors related to the common mixing process, such as the use of equalization, contrast, reverb, and the addition of pink noise. The pairing type methods are designed based on consideration for the tonic, harmonic and rhythmic relations between the real paired stems. Lastly, the silence type method is designed for eliminating the difference of silence duration in violin/piano stems to avoid potential imbalance of the training data. In what follows, we refer to the unprocessed original stems as the original stems.
IV-A Wet Stems
Following Hawthorne et al., we apply the common approaches in the mixing process, together with the addition of pink noise, as our first augmentation method. The pink noise is employed to simulate the background noise found in real recordings. The augmentation parameters are shown in Table I. For each original stem, each specific process (e.g., equalization) is applied with a probability of 30%. The pink noise is applied using colorednoise, and the others are applied using a Python package called pysox. In what follows, we also refer to a model trained with this data augmentation method (i.e., the ‘Wet’ method) as the Wet model.
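As a rough illustration of the probabilistic pink-noise step, the sketch below applies noise to a stem with a probability of 0.3. It is not the authors' pipeline: instead of the colorednoise package it shapes white noise in the frequency domain with NumPy, and the `noise_gain` value and function names are hypothetical.

```python
import numpy as np


def pink_noise(n_samples, rng):
    """Generate peak-normalized pink noise by spectral shaping of white noise."""
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples)
    scale = np.zeros_like(freqs)              # DC bin stays at zero
    scale[1:] = 1.0 / np.sqrt(freqs[1:])      # power ~ 1/f, so amplitude ~ 1/sqrt(f)
    noise = np.fft.irfft(spectrum * scale, n=n_samples)
    return noise / np.max(np.abs(noise))


def maybe_add_pink_noise(stem, rng, prob=0.3, noise_gain=0.01):
    """With probability `prob`, add low-level pink noise to a stem.

    `noise_gain` is a hypothetical value chosen for illustration only.
    Returns the (possibly) processed stem and a flag telling whether noise was added.
    """
    if rng.random() < prob:
        return stem + noise_gain * pink_noise(len(stem), rng), True
    return stem, False
```

The same pattern (draw a random number, compare it with 0.3, apply the effect) would apply independently to each of the other mixing processes.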
IV-B Chroma Distance-based Pairing
Based on general principles of music theory and psychoacoustics, the paired stems in music are usually highly coherent in pitch, or are in similar keys. Therefore, instead of randomly selecting stems for combination, this augmentation method only picks stems that have a small chroma distance from one another, in the hope that the resulting mixture is more similar to real ensembles. Specifically, we average the chromagram, a representation of the time-varying intensities of the twelve pitch classes, over time to derive a 12-dimensional chroma feature for each stem, and then calculate the Euclidean distance between all violin/piano stem pairs. We then need a threshold for selecting qualified training violin/piano stems. To set it, we rely on statistics calculated from the MedleyDB test songs. As shown in Table I, the mean chroma distance between the violin/piano stems from the same song of MedleyDB is 0.45, shorter than the mean distance of 0.51 between violin/piano stems from different songs of MedleyDB. We therefore set the threshold to 0.48, midway between the two. Only stems with chroma distance lower than this threshold are selected and mixed as our training data.
IV-C Correlation-based Pairing
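The pairing rule above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the chromagrams are taken as given (12, n_frames) arrays (they could come from a chroma extractor such as librosa's, which is an assumption), any normalization of the chroma feature is left out, and the function names are hypothetical.

```python
import numpy as np


def mean_chroma(chromagram):
    """Average a (12, n_frames) chromagram over time into a 12-dim feature."""
    chromagram = np.asarray(chromagram, dtype=float)
    if chromagram.ndim != 2 or chromagram.shape[0] != 12:
        raise ValueError("expected a chromagram of shape (12, n_frames)")
    return chromagram.mean(axis=1)


def select_pairs(violin_chromas, piano_chromas, threshold=0.48):
    """Return index pairs (i, j) whose chroma distance is below the threshold.

    Also returns the full (n_violin, n_piano) Euclidean distance matrix.
    """
    v = np.stack([mean_chroma(c) for c in violin_chromas])
    p = np.stack([mean_chroma(c) for c in piano_chromas])
    dist = np.linalg.norm(v[:, None, :] - p[None, :, :], axis=-1)
    pairs = [(int(i), int(j)) for i, j in np.argwhere(dist < threshold)]
    return pairs, dist
```

Only the returned pairs would then be eligible for mixing, in place of the unconstrained random draw used by the Random baseline.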
Another pairing method is to consider whether the piano stem and violin stem are active (i.e., non-silent) at the same time; namely, whether they co-occur. To implement this, we calculate the absolute value of the 2-dimensional cross correlation between the waveform magnitudes (using scipy.signal.correlate2d) of all the violin/piano stem pairs and set a threshold for selecting the stems to be combined for training. As shown in Table I, the cross correlation values for true paired stems in MedleyDB test songs range from 2 to 30. We therefore empirically set the threshold to 20.
IV-D Silence Removal before Mixing
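One possible reading of this co-occurrence score is sketched below. The text does not specify how the 2-dimensional correlation is reduced to a single number, so several choices here are assumptions: each magnitude waveform is treated as a 1-by-N array, `mode="valid"` is used so that two equal-length chunks yield one zero-lag value, no normalization is applied, and pairs are kept when the score is at or above the threshold. The function names are hypothetical.

```python
import numpy as np
from scipy.signal import correlate2d


def cooccurrence_score(violin, piano):
    """Absolute zero-lag cross correlation of two waveform magnitudes."""
    a = np.abs(np.asarray(violin, dtype=float))[np.newaxis, :]
    b = np.abs(np.asarray(piano, dtype=float))[np.newaxis, :]
    if a.shape != b.shape:
        raise ValueError("both chunks must have the same length")
    # With equal shapes, mode="valid" returns a single 1x1 correlation value.
    return float(np.abs(correlate2d(a, b, mode="valid")[0, 0]))


def keep_pair(violin, piano, threshold=20.0):
    """Keep the pair if its score reaches the threshold (direction assumed)."""
    return cooccurrence_score(violin, piano) >= threshold
```

Under this reading, a chunk in which one instrument is silent while the other plays yields a score near zero and is rejected, which matches the stated goal of favoring stems that are active at the same time.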
For the case of violin piano ensembles, we observe empirically that the violin part is sometimes too sparse, making the two sounds unbalanced in the mixture. We consider it worth investigating whether this influences the performance of the resulting separation model. Therefore, we apply librosa.effects.split with ‘top db’ set to 20 to remove the silent parts of the training/validation data, before dividing them into 10-second chunks for mixing. We have also experimented with other values of ‘top db’ and found that 20 works better.
V-A Experiment Settings & Evaluation Metrics
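The idea behind this silence removal can be sketched with NumPy alone. The code below is a simplified stand-in, not librosa.effects.split itself: it uses non-overlapping frames and keeps every frame whose RMS level is within `top_db` dB of the loudest frame, whereas the library works on overlapping frames and returns interval boundaries. The function name and frame length are assumptions.

```python
import numpy as np


def remove_silence(wave, top_db=20.0, frame_length=2048):
    """Drop frames whose RMS is more than `top_db` dB below the loudest frame."""
    wave = np.asarray(wave, dtype=float)
    frames = [wave[i:i + frame_length] for i in range(0, len(wave), frame_length)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    if rms.size == 0 or rms.max() == 0:
        return wave[:0]  # empty or fully silent input
    # Level of each frame in dB relative to the loudest frame (0 dB).
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10) / rms.max())
    kept = [f for f, level in zip(frames, level_db) if level > -top_db]
    return np.concatenate(kept)
```

With librosa installed, the call named in the text, librosa.effects.split(y, top_db=20), returns the start and end sample indices of the non-silent intervals, which can then be concatenated in the same way before chunking.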
As discussed in Section I, there are many common instruments that suffer from the lack of multi-track data. For less common instruments, such as erhu and suona, even the number of solo recordings is limited. Therefore, we study the effectiveness of the proposed augmentation methods in two scenarios: a data-limited one and a data-rich one.
Specifically, for each instrument and each augmentation method, we train (from scratch) a separation model using the Open-Unmix architecture and the corresponding processed stems. In each training epoch, the model randomly selects a fixed number of pairs of stems from the pool of processed stems to mix as the training data. To simulate the data-limited case, we set this number to 250 and adopt only 16 minutes of the training data for each instrument. For the data-rich case, we use the full data for training and set this number to 2,000. After each training epoch, the model also selects 100 pairs of stems from the validation pool of processed stems to mix as validation data. The validation loss drives an early-stopping mechanism and a learning rate scheduler that control the training process. For all experiments, the early-stopping patience is 140 epochs and the learning rate is set to 0.001. All the training and testing songs are monaural tracks with a 44,100 Hz sampling rate. The window size and hop length of the STFT are 4,096 and 1,024 samples, respectively. The predicted spectrograms are converted back to time-domain waveforms using the Griffin-Lim algorithm for phase estimation before inverse STFT.
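To give a sense of scale for these settings, the arithmetic below counts the 10-second chunks and the possible violin/piano pairings in the two scenarios, and the spectrogram size of one chunk. It is a back-of-the-envelope sketch: it ignores silence removal and pairing filters, and the frame count assumes centred, padded STFT framing, which the text does not state.

```python
SR = 44100              # sampling rate (Hz)
CHUNK_SEC = 10          # chunk length (s)
N_FFT, HOP = 4096, 1024  # STFT window size and hop length (samples)


def n_chunks(minutes):
    """Number of whole 10-second chunks in a recording of the given length."""
    return int(minutes * 60) // CHUNK_SEC


def stft_shape(n_samples, n_fft=N_FFT, hop=HOP):
    """(frequency bins, frames) of a one-sided STFT with centred framing (assumed)."""
    return n_fft // 2 + 1, 1 + n_samples // hop


limited = n_chunks(16)        # 16 minutes per instrument
rich = n_chunks(5 * 60)       # 5 hours per instrument
print(limited, limited ** 2)  # chunks per instrument, possible pairings
print(rich, rich ** 2)
print(stft_shape(SR * CHUNK_SEC))
```

Under these assumptions, the 250 or 2,000 pairs drawn per epoch are a small sample of the possible pairings, so each epoch sees largely new mixtures.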
For evaluation metrics, we adopt the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR), as implemented in the BSS_eval toolkit. Following the convention in SiSEC (https://sisec.inria.fr/), we report the median values over the testing songs.
We consider the following two baseline methods. First, we train the Open-Unmix separation model using the existing random-mixing data augmentation method; we refer to this method as the Random method. Second, we use the official pre-trained 5-stem model of Spleeter, the current state-of-the-art for singing voice separation. Specifically, we use the default ‘mask’-based implementation. Given an input mixture, it outputs separated stems of piano, vocals, drums, bass, and others. From the result of Spleeter, we sum all but the piano stem to form the separated violin stem.
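The way the violin estimate is derived from the Spleeter output can be sketched as follows. The stems are assumed to be already available as equal-length NumPy arrays keyed by stem name; the dictionary layout and the function name are assumptions for illustration, and no Spleeter API is called here.

```python
import numpy as np


def violin_from_five_stems(stems):
    """Sum every separated stem except the piano to form the violin estimate.

    `stems` maps a stem name (e.g., "piano", "vocals") to a waveform array.
    """
    if "piano" not in stems:
        raise KeyError("expected a 'piano' stem in the separation output")
    others = [wave for name, wave in stems.items() if name != "piano"]
    return np.sum(others, axis=0)
```

Since the model has no dedicated violin output, the violin energy may be spread over several non-piano stems, which is why they are summed rather than picking a single one.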
Training an Open-Unmix based network using any of the proposed data augmentation methods takes around 15 hours on an NVIDIA GeForce GTX 1080 GPU. At testing time, it takes 3 minutes to complete the separation (for both the piano and violin) of an 18-minute violin piano ensemble on the same GPU.
V-B Results & Discussion
The results of the data-limited case are shown in Table II. The following observations can be made. First, for piano, all the proposed augmented models outperform the two baselines in SDR and SAR. In particular, the improvement made by the ‘Correlation’ method is more than 2 dB for both metrics. Second, for violin, except for the ‘Correlation’ method, which still improves SDR and SIR, the benefit of the other augmentation methods becomes less obvious. The performance drop of the ‘Wet’ method for violin suggests a risk that the randomly applied mixing type method distorts the data in the data-limited scenario. Finally, all the Open-Unmix based models we train greatly outperform the pre-trained, violin-agnostic Spleeter model in almost all the metrics for both instruments, except for the piano in terms of SIR. This is not surprising, as the Spleeter model has not been specifically trained on violin data. Its failure to recognize the violin also hurts its performance on the piano.
The results of the data-rich case are shown in Table III. Across all the metrics, the benefit of the augmentations becomes less obvious for both instruments. For the piano, the baseline ‘Random’ method seems strong enough. For the violin, the baseline ‘Random’ method is inferior to the proposed methods by only a small margin. This suggests that, when the training data is large enough, the diversity of the data also increases and overshadows the benefit of sophisticated augmentation methods.
|Spleeter (pretrained) ||5.20||17.35||2.30|
|Spleeter (pretrained) ||0.28||2.05||0.22|
The spectrograms of the original and predicted audio of a test song are shown in Figure 3. It can be seen that the pre-trained Spleeter model cannot separate the piano from the violin well. In contrast, despite some visible residuals, the models trained with the proposed methods work well in separating the piano and violin. We also see from the violin result that some of the proposed methods (e.g., the ‘wet’ model) can get rid of the leakage of the piano part and the noise in the low frequency bands, while the baseline methods cannot.
Figure 4 provides another example result. As there is a note offset in the latter half of the violin, the performance difference among the methods can be seen more clearly. We can see again that the proposed method suffers less from the leakage of the piano in the low frequency bands. Besides, there are slightly more high-frequency piano residuals in the result of the baseline ‘random’ model than in that of the ‘wet’ model. We also note an extra low-frequency component (highlighted by the red arrow) in the ground truth violin in this example. Since it is below the lowest frequency of the violin sound (i.e., 196 Hz), we listened to it and found that it seems to be some background noise or reverb. The proposed methods can also exclude this low-frequency part from the separation result.
VI Conclusions and Future Work
In this paper, we have proposed and investigated a number of mixing-related data augmentation methods to facilitate the training of deep learning models for music source separation. The results demonstrate the effectiveness of the proposed augmentation methods in the case of small data. When the training data is large enough, the existing random-mixing based approach is strong enough. We have also shown that the implemented models outperform the best currently available model, Spleeter, in violin/piano separation. We believe such training and augmentation methods have the potential to benefit other source separation tasks for which only a limited amount of training data is available. In the future, we are interested in extending our experiments to other popular yet less investigated instruments (e.g., guitar and saxophone), and in experimenting with more advanced mixing methods (e.g., applying mixing after the stems are selected, rather than applying mixing to the stems individually beforehand as done here).
-  (2016) Pysox: Leveraging the audio signal processing power of Sox in Python. Proc. International Society for Music Information Retrieval Conference, pp. 4–6. Cited by: §IV-A.
-  (2016) MedleyDB 2.0: New data and a system for sustainable data collection. Proc. International Society for Music Information Retrieval Conference, pp. 2–4. Cited by: §I, §III-B.
-  (2014) MedleyDB: A multitrack dataset for annotation-intensive MIR research. Proc. International Society for Music Information Retrieval Conference, pp. 155–160. Cited by: §I, §III-B.
-  Investigations on data augmentation and loss functions for deep learning based speech-background separation. Proc. INTERSPEECH, pp. 3499–3503. Cited by: §I.
-  (2019) Enabling factorized piano music modeling and generation with the MAESTRO dataset. Proc. International Conference on Learning Representations, pp. 1–12. Cited by: §II-A, §IV-A.
-  (2020) Addressing the confounds of accompaniments in singer identification. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Cited by: §I.
-  (2017) Singing voice separation with deep U-Net convolutional networks. In Proc. International Society for Music Information Retrieval Conference, Cited by: §I.
-  (2016) Monaural music source separation using convolutional sparse coding. IEEE/ACM Transactions on Audio, Speech, and Language Processing. Cited by: §I.
-  (2019) Spleeter: A fast and state-of-the art music source separation tool with pre-trained models. In International Society for Music Information Retrieval Conference, Late-breaking paper, Cited by: §I, §II-B, §V-A, TABLE II, §VI.
-  (2019) Audio query-based music source separation. In Proc. International Society for Music Information Retrieval Conference, pp. 878–885. Cited by: §I.
-  Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. Proc. IEEE Int. Conf. Machine Learning and Applications, Cited by: §I.
-  Dilated convolution with dilated GRU for music source separation. Proc. Int. Joint Conference on Artificial Intelligence, Cited by: §I, §II-A.
-  (2017) Deep clustering and conventional networks for music separation: stronger together. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 61–65. Cited by: §I.
-  (2017) Towards a better understanding of mix engineering. Ph.D. Thesis, Queen Mary University of London. Cited by: §I.
-  (2020) Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 771–775. Cited by: §I, §II-A.
-  (2015) Librosa: Audio and music signal analysis in Python. Proc. Python in Science Conference, pp. 18–24. Cited by: §IV-D.
-  (2019) Conditioned-U-Net: introducing a control mechanism in the U-Net for multiple source separations. In Proc. International Society for Music Information Retrieval Conference, pp. 159–165. Cited by: §I.
-  (2018) An overview of lead and accompaniment separation in music. IEEE/ACM Trans. Audio Speech Lang. Process. 26 (8), pp. 1307–1335. Cited by: §I, §I.
-  (2015) Exploring data augmentation for improved singing voice detection with neural networks. Proc. International Society for Music Information Retrieval Conference, pp. 121–126. Cited by: §I, §II-A.
-  (2018) Adversarial semi-supervised audio source separation applied to singing voice extraction. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2391–2395. Cited by: §I.
-  (2019) Open-Unmix - A reference implementation for music source separation. Journal of Open Source Software 4 (41), pp. 1667. Cited by: Fig. 2, §I, §I, §I, §II-B, §II-B.
-  (2016) A real-time audio-to-audio karaoke generation system for monaural recordings based on singing voice suppression and key conversion techniques. Journal of Information Processing 24 (3), pp. 470–482. Cited by: §I.
-  (2017) Improving music source separation based on deep neural networks through data augmentation and network blending. Proc. IEEE Int. Conf. Acoust. Speech Signal Process, pp. 261–265. Cited by: §I, §I, §II-A, §VI.
-  (2006-07) Performance measurement in blind audio source separation. IEEE Trans. Audio, Speech, Language Processing 14 (4), pp. 1462–1469. Cited by: §I, §V-A.
-  (2013) Low-rank representation of both singing voice and music accompaniment via learned dictionaries. In Proc. International Society for Music Information Retrieval Conference, Cited by: §I.