The research area of Music Information Retrieval (MIR) is constrained by a lack of labeled datasets, which limits our ability to train robust systems and evaluate them well. Specifically, the task of musical source separation has been hindered by a dearth of well-labeled data. This leads to severe shortcomings in the range of instrument source classes that current systems can separate. Many systems, in fact, only separate the four classes (voice, bass, drums, and “other”) defined in the widely used MUSDB18 dataset, making them unsuitable for separating most musical instruments.
Simultaneously, the recent availability of large pretrained models has revolutionized generative and discriminative tasks in the domains of computer vision and natural language processing. The combination of VQGAN and CLIP has captured the attention of many artists with its ability to turn natural language into generative art. Similarly, researchers have shown how to steer large pretrained language models toward downstream discriminative tasks, either via transfer learning or via so-called few-shot “prompt engineering”. Recent work has brought this ethos to the MIR domain, leveraging the representations learned during the large-scale training of an unsupervised generative music model for downstream MIR tasks like key detection and music tagging.
In this work, we further this ethos by exploring how large, pretrained music models can be used for musical source separation, leveraging the vast amounts of unlabeled or weakly labeled data that these models see during training. We combine the VQ-VAE from OpenAI’s Jukebox, a generative model of musical audio, with a music tagger. We task Jukebox with producing audio that matches a predefined set of tags that correspond to the musical source we wish to separate. To do this, we perform gradient ascent in the embedding space of the VQ-VAE and use the decoded audio as a mask on the input mixture. We demonstrate experimentally that this setup is able to separate a wider variety of sources than previous purpose-built separation systems consider, all without updating the weights of Jukebox or the tagger. We provide additional demos and runnable code on our demo site: https://ethman.github.io/tagbox
2 Prior Work
Recently, many source separation researchers have focused on methods that produce high-quality results on the datasets for which there is sufficient ground truth source data. For instance, the website Papers with Code shows a steady increase in the performance of the best separation systems on the MUSDB18 dataset over the past few years (https://bit.ly/pwc_musdb18). Similarly, the recent Music Demixing Challenge
invited people to compete to determine the best performing system on a test set that had the same source definitions as MUSDB18. As a result, the community has produced a large number of deep learning-based supervised separation systems that are purpose-built to separate sources as defined by MUSDB18. However, the source definitions in MUSDB18 are limiting, including isolated source data for only Vocals, Bass, Drums, and a catchall “Other” source for all other source types. Furthermore, MUSDB18 is relatively small, totalling 150 songs, which leads the authors of many state-of-the-art systems [14, 5, 24] to collect additional data or lean heavily on augmentation.
Prior to the deep learning era, unsupervised source separation was the norm. One of the most popular algorithms was Non-negative Matrix Factorization (NMF). While NMF is theoretically flexible enough to separate any source, it often required hand-designed algorithms to determine how to cluster spectral templates into coherent sources. Musical priors, such as repetition or harmonicity vs. percussiveness, have also been used to create unsupervised separation algorithms. However, such algorithms are limited to separating sources that match the prior (e.g., a backing band) from those that do not (e.g., a singing voice), and they have since been surpassed by deep learning-based methods.
Recent work in speech and environmental sound separation has explored unsupervised deep learning. Mixture Invariant Training (MixIT) is a technique that creates mixtures of mixtures (MoMs) and tasks a network with overseparating each MoM such that, when sources are recombined, a mixture reconstruction loss can be used, forgoing the need for isolated source data altogether. While we are unaware of anyone using MixIT for music separation, MixIT makes an implicit assumption that any two sources in a mix are independent, an assumption that may not hold for music. Similarly, Neri et al. propose a technique for training a variational auto-encoder (VAE) for unsupervised source separation; in our work, by contrast, we do not train networks at all, but rather use frozen, pretrained models for separation.
Previous works have explored using additional networks for separation instead of directly optimizing a separation net on ground truth sources. For instance, the work of Pishdadian et al. is most similar to ours; they use a pretrained sound event detection (SED) system, and the goal of the separator network is to maximize estimated SED labels during training. Similarly, Hung et al. use a pretrained transcription network to train a separator. Our work differs from Pishdadian et al. and Hung et al. in that we do not train any networks; instead, we repurpose off-the-shelf networks that have never been trained for source separation.
3.1 OpenAI’s Jukebox
OpenAI recently released Jukebox, a generative model that creates musical audio. Jukebox is composed of two components: a hierarchical VQ-VAE that learns to turn raw waveforms into tokens and back, and a language model that learns how to generate new tokens, which can be passed through the decoder to create musical audio. Both the VQ-VAE and the language model are unsupervised. In this work, we are interested in the VQ-VAE specifically.
Jukebox’s VQ-VAE is a three-level hierarchical VQ-VAE that generates discrete tokens at different sample rates, compressing the 44.1kHz input audio to tokens with sample rates of 5.51kHz, 1.37kHz, and 344Hz at each level, respectively. Each level has a codebook size of 2048, with each code having 64 dimensions. All levels are trained to reconstruct the input waveform and are optimized with a multi-scale spectral loss. The VQ-VAE also uses a codebook loss, which ensures that non-discretized latent vectors are close to their nearest-neighbor discretized token vectors, and a commitment loss, which stabilizes the encoder. The VQ-VAE is trained on 1.2 million songs scraped from the web. We refer the reader to the Jukebox paper for further training details. Because we are interested in producing the highest-quality separation results possible, we focus only on the “Bottom” level, which compresses the input audio to tokens at a sample rate of 5.51kHz.
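The per-level token rates quoted above follow directly from each level’s temporal compression factor. A quick sketch (the 8x/32x/128x factors are inferred from the stated rates, not quoted from the Jukebox paper here):

```python
# Token rates of a hierarchical VQ-VAE, given per-level temporal
# compression factors. The factors below are inferred from the rates
# stated in the text, not taken from the Jukebox paper directly.
SAMPLE_RATE = 44_100  # Hz, Jukebox's input audio sample rate

compression_factors = {"bottom": 8, "middle": 32, "top": 128}

token_rates = {
    level: SAMPLE_RATE / factor  # tokens per second at this level
    for level, factor in compression_factors.items()
}

for level, rate in token_rates.items():
    print(f"{level}: {rate:.2f} tokens/sec")
# Matches the 5.51kHz, 1.37kHz, and 344Hz figures above.
```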
3.2 Automatic Music Tagging
Music tagging is the task of labeling musical audio clips with semantic labels called “tags” [11, 1]. These tags are useful for music search and recommendation systems, enabling automatic labelling of large music corpora. The content that the tags represent can vary, sometimes indicating information about a song’s genre, the song’s mood or theme, or whether particular instruments are audible.
Music tagging systems are designed to predict a set of multi-hot, binary labels (i.e., tags) based on the acoustic content of an input signal. Many recent works use convolutional neural networks at their core, varying the convolutional filter size and the input representation of the audio. Common datasets for music tagging are an order of magnitude larger than source separation datasets: MagnaTagATune (MTAT) contains 25,877 30-second labeled audio clips (~21x more hours of audio than MUSDB18) and MTG-Jamendo (MTG) contains 55,701 labeled audio clips with a minimum song length of 30 seconds (~46x more hours of audio than MUSDB18). We refer the reader to Won et al. for an overview of recent advances in music tagging.
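A back-of-the-envelope check of the size comparison above (MUSDB18’s total duration of roughly 10 hours is an assumption, estimated from its 150 songs; the clip counts come from the dataset descriptions):

```python
# Approximate total hours of audio per tagging dataset vs. MUSDB18.
mtat_hours = 25_877 * 30 / 3600      # MTAT: 30-second clips
mtg_min_hours = 55_701 * 30 / 3600   # MTG: at least 30 seconds per clip
musdb_hours = 10                     # assumed approximate MUSDB18 total

print(f"MTAT:        ~{mtat_hours:.0f} h (~{mtat_hours / musdb_hours:.1f}x MUSDB18)")
print(f"MTG-Jamendo: >={mtg_min_hours:.0f} h (~{mtg_min_hours / musdb_hours:.1f}x MUSDB18)")
```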
In this work, we use pretrained music taggers provided by Won et al. We examine two pretrained music tagging systems, each with a different input representation: FCN, which takes Mel spectrogram inputs, and HarmonicCNN, which takes a variant of a constant-Q transform that has learnable filters. We also explore taggers trained on different datasets, namely MagnaTagATune (MTAT) and MTG-Jamendo (MTG).
4 Proposed System
[Table 1: Comparison of separation methods (Open-Unmix, Demucs, Cerberus, HPSS, and REPET-SIM), indicating whether each is unsupervised, whether it is a neural network, and per-source SDRi on MUSDB18 and Slakh2100; shaded dashes mark source types a method does not separate.]
At the heart of our proposed system are two components: a pretrained generative music model (i.e., Jukebox) and a pretrained music tagging model. Because our system combines music taggers and Jukebox, we call it TagBox. The core idea is simple: given an input audio clip, the generative model iteratively alters that audio such that, when the altered audio is given to a tagger, the tagger’s output increasingly matches a target set of tags that describe the desired set of sources. An illustration of the proposed approach is shown in Figure 1. Algorithm 1 outlines the approach in pseudocode. We now describe the steps in our method.
We first create a target tag distribution by setting the tags that correspond to the desired instrument sources (e.g., “guitar” or “drums”) to 1 and all other tags to 0. We then use the encoder portion of an autoencoder (in this case, the one in Jukebox) to produce an embedding z from the input audio mixture x. This embedding is then decoded into a waveform x̂ by the decoder.
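Concretely, the target tag distribution is a multi-hot vector over the tagger’s vocabulary. A minimal sketch (the vocabulary below is a toy stand-in, not the actual MTAT or MTG tag list):

```python
# Build a multi-hot target over a tagger's tag vocabulary:
# desired-source tags are set to 1.0, all others to 0.0.
tag_vocab = ["guitar", "drums", "piano", "vocals", "rock", "jazz"]  # toy list

def make_target(desired_tags, vocab):
    """Return a list of 0/1 floats aligned with `vocab`."""
    desired = set(desired_tags)
    return [1.0 if tag in desired else 0.0 for tag in vocab]

target = make_target({"guitar"}, tag_vocab)
print(target)  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```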
Rather than pass x̂ directly to the tagger, we use it as a mask on the input mixture. In this way, the embedding essentially determines what information must be removed from the input mix to produce the desired source as defined by the tags. For an input mixture waveform x and a Jukebox-decoded waveform x̂, both with length T samples, we convert both to spectrogram representations, X and X̂, each with some number of time frames and frequency bins. We then compute a real-valued mask M as follows:

M = X̂ ⊘ max(X, X̂, ε),

where max is an element-wise max function applied to each time-frequency bin of the pair of spectrograms together with a small epsilon ε, which prevents division by zero, and ⊘ indicates element-wise division. This mask is multiplied by the mixture spectrogram to get an estimate of the audio data that should be removed from the mix, V̂ = M ⊙ X, where ⊙ indicates element-wise multiplication. V̂ is then converted to a waveform v̂ of the source estimate using an inverse STFT. This waveform, v̂, is then put into the music tagger to determine the estimated tags. A binary cross-entropy loss L is computed between the estimated tags and the predetermined instrument tags. This loss is used to perform a gradient ascent step in the Jukebox embedding space, z ← z − α ∇_z L, where α governs the step size (descending the loss ascends the likelihood of the target tags). This approach is similar to adversarial example generation, where the goal is also to optimize the input to produce a desired label. Because the mask made by the Jukebox-decoded audio determines what should be removed from the mix, the final estimate for a target source, ŝ, is the difference between the input mixture waveform and the final v̂ produced by gradient ascent. The final source estimate is therefore ŝ = x − v̂.
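The masking step can be sketched in NumPy on magnitude spectrograms (the encoder, decoder, tagger, and STFTs are elided; the epsilon value is an assumption):

```python
import numpy as np

EPS = 1e-8  # assumed small constant preventing division by zero

def tagbox_mask(mix_mag, decoded_mag):
    """M = X_hat / max(X, X_hat, eps), element-wise over
    time-frequency bins; values always lie in [0, 1]."""
    denom = np.maximum(np.maximum(mix_mag, decoded_mag), EPS)
    return decoded_mag / denom

# Toy magnitude spectrograms: 3 frequency bins x 2 time frames.
X = np.array([[1.0, 2.0], [0.5, 0.0], [4.0, 1.0]])      # mixture
X_hat = np.array([[1.0, 1.0], [0.5, 0.0], [2.0, 3.0]])  # Jukebox-decoded

M = tagbox_mask(X, X_hat)
removed = M * X  # spectrogram of the audio to remove from the mix
print(M)  # mask values: [[1, 0.5], [1, 0], [0.5, 1]]
```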
We note that neither the generative model nor the music tagger was trained for source separation, and that no additional training or alteration of the weights of either model happens at any point. These models were, however, trained on datasets with a wider range of audio than is typical for deep models trained specifically for source separation.
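To make the whole loop concrete, here is a self-contained toy in NumPy: the frozen encoder, decoder, and tagger are replaced with small random linear maps (purely illustrative stand-ins, not the real Jukebox or tagger), and the gradient is taken by finite differences instead of backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen toy stand-ins for Jukebox's VQ-VAE and a 2-tag tagger.
W_enc = rng.standard_normal((8, 16)) * 0.1   # "encoder"
W_dec = W_enc.T                              # "decoder"
W_tag = rng.standard_normal((2, 16)) * 0.5   # "tagger" weights

def estimate_removed(x, z):
    """Mask the mix with the decoded audio (toy 'spectrogram' = abs)."""
    x_hat = W_dec @ z
    M = np.abs(x_hat) / np.maximum(np.maximum(np.abs(x), np.abs(x_hat)), 1e-8)
    return M * x

def loss(z, x, target):
    """BCE between tagger output on the removed-audio estimate and target."""
    p = 1 / (1 + np.exp(-W_tag @ estimate_removed(x, z)))
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

x = rng.standard_normal(16)    # "mixture"
target = np.array([1.0, 0.0])  # desired tag distribution
z = W_enc @ x                  # start from the encoding of the mix

lr, fd_eps = 0.2, 1e-4
before = loss(z, x, target)
for _ in range(100):           # gradient steps move only the embedding;
    g = np.zeros_like(z)       # no model weights change
    for i in range(z.size):
        dz = np.zeros_like(z); dz[i] = fd_eps
        g[i] = (loss(z + dz, x, target) - loss(z - dz, x, target)) / (2 * fd_eps)
    z -= lr * g                # descend BCE = ascend target-tag likelihood

after = loss(z, x, target)
s_hat = x - estimate_removed(x, z)  # final source estimate
print(before, "->", after)
```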
Our system is able to produce separation results for a larger set of sources than any previous deep learning system that we are aware of. The system is limited only by the tag vocabulary of the music tagging system: MTG-Jamendo (MTG) has 12 distinct instrument tags, and MagnaTagATune (MTAT) has 31 tags that could be interpreted as instrument tags, although these tags overlap conceptually somewhat (e.g., MTAT contains distinct tags for “vocals”, “voice”, “male vocals”, etc.). Additionally, separating different source types does not require any changes to the system setup other than altering a set of predefined tags. Compare this to typical music separation networks like Open-Unmix, which would require training a whole model for each new source, or Demucs, which would require altering the network architecture to add a new source output.
[Table 2: SDRi on MUSDB18 and Slakh for TagBox under different tagger settings, pairing a tagger architecture (FCN or HCNN) with a training dataset (MTAT or MTG).]
5 Experimental Validation
We conduct a series of experiments to validate our system, aimed at answering two questions. The first and main experiment compares the proposed system to existing systems, taking special care to probe TagBox’s ability to separate many types of sources. The second experiment determines how the choice of the pretrained, frozen tagger model affects separation quality.
In our main experiment, we compare our system to existing systems on two established test sets for source separation, namely MUSDB18 and Slakh2100. In this experiment, we compare our proposed system against recent deep learning-based supervised separation systems as well as unsupervised separation algorithms based on musical priors. We compare the systems on a wide variety of source types across both of these datasets.
The first dataset we examine is MUSDB18. MUSDB18 contains 150 mixtures and corresponding sources from real live recording sessions, 100 of these are reserved for training and the remaining 50 are used for testing. For this experiment, we exclude MUSDB18’s “other” source because it could map to many possible tags using TagBox. The supervised systems that we compare against, namely Open-Unmix  and Demucs , are trained using the MUSDB18 training set. Contrast this to the unsupervised systems we test, HPSS  and REPET-SIM , which are run on the test set without any training. Our proposed system falls into this second camp; it is also unsupervised and therefore does not have a training phase, ignoring the MUSDB18 training set.
The main experiment also uses the Slakh2100 dataset. Slakh2100 contains 2100 mixtures with corresponding sources that were synthesized using professional-grade sample-based synthesis engines. We chose 50 songs from the test set to evaluate on, selecting songs that have source data for the following five source types: bass, drums, guitar, piano, and strings. We select mixes where all five sources are active, where a source is considered active if it has 100 or more note onsets throughout the entirety of the song, as determined by the corresponding MIDI data. We create mixes by instantaneously mixing together the sources and use these mixtures as input to the systems. With this setup, we compare against Cerberus, which was trained to separate these five instruments specifically.
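The song-selection criterion above can be sketched as a filter over per-source note-onset counts (the counts below are illustrative; in practice they would be tallied from each track’s MIDI data):

```python
REQUIRED_SOURCES = {"bass", "drums", "guitar", "piano", "strings"}
MIN_ONSETS = 100  # a source counts as "active" with at least this many onsets

def is_eligible(onset_counts):
    """True if every required source is present and active.
    `onset_counts` maps source name -> note-onset count from the MIDI."""
    return all(onset_counts.get(src, 0) >= MIN_ONSETS
               for src in REQUIRED_SOURCES)

songs = {  # hypothetical per-song onset tallies
    "song_a": {"bass": 340, "drums": 1200, "guitar": 210, "piano": 150, "strings": 101},
    "song_b": {"bass": 340, "drums": 1200, "guitar": 12, "piano": 150, "strings": 400},
}
eligible = [name for name, counts in songs.items() if is_eligible(counts)]
print(eligible)  # ['song_a']
```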
For TagBox, we use a pretrained FCN  tagger trained on the MagnaTagATune (MTAT)  dataset. We run gradient ascent with a learning rate of 5.0 using the Adam optimizer for 10 steps (in the interest of brevity), and use a spectrogram with 1024 FFT bins for the mask. Additionally, we use the “foreground” from REPET-SIM as the vocals estimate, following prior work , and use the “percussion” output from HPSS as the drums estimate. We omit the other source outputs of these systems because they are ill-defined (e.g., HPSS’s “harmonic” could be many possible sources).
In the second experiment, we compare four different configurations of our proposed system, varying the architecture and training data of the music tagger. We look at the FCN and HarmonicCNN architectures, trained on either MagnaTagATune (MTAT) or MTG-Jamendo (MTG). We use the same learning rate and number of steps as in the previous experiment.
6 Results and Discussion
Table 1 shows the results of our main experiment. In terms of SDRi, our system is better than or competitive with both of the hand-designed unsupervised algorithms that we test against, HPSS and REPET-SIM. Additionally, while our system does not perform as well as the purpose-built supervised separation systems (i.e., Open-Unmix, Demucs, and Cerberus), it still shows a considerable SDRi boost for all sources that we test. Importantly, our system is able to boost performance over a wider array of source types than any other system we compare against.
The results from our second experiment are shown in Table 2. Of the two architectures we test, FCN always produces better separation results. Interestingly, the opposite trend was observed when the taggers were evaluated for music tagging performance by Won et al.: HCNN was among the top performing systems and FCN was towards the bottom of the pack.
In many cases, TagBox leaves much to be desired perceptually; its separation performance is mostly not at the same level as the purpose-built separation systems we compare against. However, when listening to the output, there is no doubt that TagBox is able to separate the desired source, despite audible artifacts. We have informally noticed a few tricks that improve perceptual performance, like using multiple FFT sizes when making the masks (à la a multi-scale spectral loss) and running gradient ascent for 100 steps. These tricks, however, were not reflected in the SDR evaluation numbers. Furthermore, because producing each output example requires its own gradient ascent, adding more steps increases the computation time linearly, which can be costly when run on an entire dataset. However, this might be tolerable for musicians needing a flexible source separation solution for a single song.
There are also a few other variants of the TagBox setup that can lead to fun and unexpected creative results. In the first variant, we remove the masking step and allow TagBox to create audio freely, without the constraint of only removing information from the mix. With this setup, TagBox performs a kind of style transfer, mapping certain features of the audio to the desired tag. In one example, a mixture had a singer and we selected the “guitar” tag; TagBox made the resultant audio sound as though a guitar were performing the melody. Another variant involves selecting and optimizing non-instrument tags, like genre tags.
What we find most impressive is that neither Jukebox nor the music taggers were trained for source separation. Furthermore, the weights of both networks do not change during the gradient ascent process; only the location of the audio in the Jukebox embedding space changes. Together, Jukebox and the taggers have seen up to 1.25 million songs, and in combination these systems are able to leverage their shared priors about music and musical sources to isolate individual musical sources in a mixture. We believe that these priors could be leveraged in many ways to overcome the data scarcity problems endemic to many MIR tasks, as has already been investigated to great effect by Castellon et al. We are excited about future explorations in this area.
7 Conclusion
In this paper, we have proposed TagBox, a method for unsupervised source separation that combines pretrained models. We use pretrained music taggers to perform gradient ascent in the embedding space of OpenAI’s Jukebox, with the goal of maximizing a predefined tag corresponding to the source we want to separate. The output of Jukebox is used as a mask on the input audio before being sent to the tagger, which ensures that Jukebox does not generate new audio that is not present in the input mixture. Importantly, neither the tagger nor Jukebox has been trained for source separation, and the weights of both models remain fixed during the gradient ascent process. We demonstrate results showing that our system is able to separate a wider variety of source types than many recent purpose-built, supervised separation systems. We are excited by the promise that pretrained systems hold for the future of MIR and source separation research.
Acknowledgments
The authors would like to thank Ian Simon, Sander Dieleman, Jesse Engel, and Curtis Hawthorne for their fruitful conversations about this work. Additionally, we would like to thank the creators of Jukebox for help with their codebase: Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever.
- (2019) The MTG-Jamendo dataset for automatic music tagging.
- (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- (2021) Codified audio language modeling learns useful representations for music information retrieval. arXiv preprint arXiv:2107.05677.
- (2016) Automatic tagging using deep convolutional neural networks.
- (2019) Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254.
- (2020) Jukebox: a generative model for music.
- Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883.
- (2010) Harmonic/percussive separation using median filtering. In Proceedings of the International Conference on Digital Audio Effects (DAFx), Vol. 13.
- (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- (2021) Transcription is all you need: learning to separate musical mixtures with score as supervision. In ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 46–50.
- (2009) Evaluation of algorithms using games: the case of music tagging. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), pp. 387–392.
- (2020) Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 771–775.
- (2019) Cutting music source separation some Slakh: a dataset to study the impact of training data quality and quantity. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 45–49.
- (2021) Music Demixing Challenge at ISMIR 2021. arXiv preprint arXiv:2108.13559.
- (2021) Unsupervised blind source separation with variational auto-encoders. In 29th European Signal Processing Conference (EUSIPCO 2021).
- (2020) Finding strength in weakness: learning to separate sounds with weak supervision. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2386–2399.
- (2021) Learning transferable visual models from natural language supervision.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, pp. 1–67.
- (2017) The MUSDB18 corpus for music separation.
- (2012) Music/voice separation using the similarity matrix. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR).
- (2019) Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pp. 14866–14876.
- Blind separation of convolved mixtures in the frequency domain. Neurocomputing 22 (1–3), pp. 21–34.
- (2018) The 2018 Signal Separation Evaluation Campaign. In Latent Variable Analysis and Signal Separation: 14th International Conference, LVA/ICA 2018, Surrey, UK, pp. 293–305.
- Open-Unmix: a reference implementation for music source separation. Journal of Open Source Software 4 (41), pp. 1667.
- (2006) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 14 (4), pp. 1462–1469.
- (2020) Unsupervised sound separation using mixture invariant training. In Advances in Neural Information Processing Systems, Vol. 33, pp. 3846–3857.
- (2020) Data-driven harmonic filters for audio representation learning. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 536–540.
- (2020) Evaluation of CNN-based automatic music tagging models. In Proceedings of the 17th Sound and Music Computing Conference.