Unsupervised Source Separation By Steering Pretrained Music Models

10/25/2021
by Ethan Manilow, et al.

We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation, without any retraining. An audio generation model is conditioned on an input mixture, producing a latent encoding of the audio that is used to generate audio. This generated audio is fed to a pretrained music tagger that creates source labels. The cross-entropy loss between the tag distribution for the generated audio and a predefined distribution for an isolated source is used to guide gradient ascent in the (unchanging) latent space of the generative model. This system does not update the weights of the generative model or the tagger, and relies only on moving through the generative model's latent space to produce separated sources. We use OpenAI's Jukebox as the pretrained generative model, and we couple it with four kinds of pretrained music taggers (two architectures and two tagging datasets). Experimental results on two source separation datasets show that this approach can produce separation estimates for a wider variety of sources than any tested supervised or unsupervised system. This work points to the vast and heretofore untapped potential of large pretrained music models for audio-to-audio tasks like source separation.



1 Introduction

The research area of Music Information Retrieval (MIR) is constrained by a lack of labeled datasets, which limits our ability to train robust systems and evaluate them well. Specifically, the task of musical source separation has been hindered by a dearth of well-labeled data [13]. This leads to severe shortcomings in the range of instrument source classes that current systems can separate. Many systems, in fact, only separate the four classes (voice, bass, drums, and “other”) defined in the widely used MUSDB18 [19] dataset, making them unsuitable for separating most musical instruments.

Simultaneously, the recent availability of large pretrained models has revolutionized generative and discriminative tasks in the domains of computer vision and natural language processing. The combination of VQGAN [7] and CLIP [17] has captured the attention of many artists, drawn to the system’s ability to turn natural language into generative art. Similarly, researchers have shown how to steer large pretrained language models toward downstream discriminative tasks, either through transfer learning [18] or so-called few-shot “prompt engineering” [2]. Recent work has brought this ethos to the MIR domain, leveraging the representations learned by a large unsupervised generative music model for downstream MIR tasks like key detection and music tagging [3].

Figure 1: Our system performs gradient ascent in the Jukebox VQ-VAE embedding space such that, when the audio is input into a music tagger, it matches a predefined set of tags. The weights of the VQ-VAE and the music tagger are frozen. With this setup we can perform unsupervised source separation.

In this work, we further this ethos by exploring how large, pretrained music models can be used for musical source separation, leveraging the vast amounts of unlabeled or weakly labeled data that these models see during training. We combine the VQ-VAE from OpenAI’s Jukebox, a generative model of musical audio, with a music tagger. We task Jukebox with producing audio that matches a predefined set of tags corresponding to the musical source we wish to separate. To do this, we perform gradient ascent in the embedding space of the VQ-VAE and use the decoded audio as a mask on the input mixture. We demonstrate experimentally that this setup is able to separate a wider variety of sources than previous purpose-built separation systems consider, all without updating the weights of Jukebox or the tagger. We provide additional demos and runnable code on our demo site: https://ethman.github.io/tagbox

2 Prior Work

Recently, many source separation researchers have focused on methods that produce high-quality results on the datasets for which there is sufficient ground-truth source data. For instance, the website Papers with Code shows a steady increase in the best performing separation systems on the MUSDB18 [19] dataset over the past few years (https://bit.ly/pwc_musdb18). Similarly, the recent Music Demixing Challenge [14] invited participants to compete to determine the best performing system on a test set with the same source definitions as MUSDB18. As a result, the community has produced a large number of deep learning-based supervised separation systems that are purpose-built to separate sources as defined by MUSDB18. However, the source definitions in MUSDB18 are limiting [13], providing isolated source data only for Vocals, Bass, and Drums, plus a catchall “Other” source for all other source types. Furthermore, MUSDB18 is relatively small, totalling 150 songs, which leads the authors of many state-of-the-art systems [14, 5, 24] to collect additional data or lean heavily on augmentation.

Prior to the deep learning era, unsupervised source separation was the norm. One of the most popular algorithms was Non-negative Matrix Factorization (NMF) [22]. While NMF is theoretically flexible enough to separate any source, it often required hand-designed algorithms to determine how to cluster spectral templates into coherent sources. Musical priors, such as repetition [20] or harmonicity vs. percussiveness [8], have also been used to create unsupervised separation algorithms; however, such algorithms are limited to separating sources that match the prior (e.g., a backing band) from those that do not (e.g., a singing voice), and they have since been surpassed by deep learning-based methods.

Recent work in speech and environmental sound separation has explored unsupervised deep learning. Mixture Invariant Training (MixIT) [26] creates mixtures of mixtures (MoMs) and tasks a network with overseparating each MoM such that, when sources are recombined, a mixture reconstruction loss can be used, forgoing the need for isolated source data altogether. While we are unaware of anyone using MixIT for music separation, MixIT makes an implicit assumption that any two sources in a mix are independent [26], an assumption that may not hold for music. Similarly, Neri et al. [15] propose a technique for training a variational auto-encoder (VAE) for unsupervised source separation; in contrast, we do not train networks at all, but rather use frozen, pretrained models for separation.

Previous works have explored using additional networks for separation instead of directly optimizing a separation network on ground-truth sources. The work of Pishdadian et al. [16] is most similar to ours: they train a separator network using a pretrained sound event detection (SED) system, with the goal of maximizing the estimated SED labels during training. Similarly, Hung et al. [10] use a pretrained transcription network to train a separator. Our work differs from Pishdadian et al. and Hung et al. in that we do not train any networks; instead, we repurpose off-the-shelf networks that were never trained for source separation.

3 Background

3.1 OpenAI’s Jukebox

OpenAI recently released Jukebox [6], a generative audio model that creates music. Jukebox is composed of two components: a hierarchical VQ-VAE [21] that learns to turn raw waveforms into tokens and back, and a language model that learns how to generate new tokens which can be passed through the decoder to create musical audio. Both the VQ-VAE and the language model are unsupervised. In this work, we are interested in the VQ-VAE, specifically.

Jukebox’s VQ-VAE is a three-level hierarchical VQ-VAE that generates discrete tokens at different sample rates, compressing the 44.1kHz input audio to tokens with sample rates of 5.51kHz, 1.37kHz, and 344Hz for the three levels, respectively. Each level has a codebook size of 2048, with each code having 64 dimensions. All levels are trained to reconstruct the input waveform and are optimized with a multi-scale spectral loss. The VQ-VAE also uses a codebook loss, which ensures that non-discretized latent vectors are close to their nearest-neighbor discretized token vectors, and a commitment loss, which stabilizes the encoder. The VQ-VAE is trained on 1.2 million songs scraped from the web; we refer the reader to the Jukebox paper for further training details [6]. Because we are interested in producing the highest-quality separation results possible, we focus only on the “Bottom” level, which compresses the input audio to tokens at a sample rate of 5.51kHz.
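As a quick sanity check on these numbers, the short Python sketch below (illustrative only, not Jukebox code) converts the quoted token rates into per-token hop sizes and compression factors:

# Approximate per-level compression of the Jukebox VQ-VAE, derived from the
# token rates quoted above (a small sanity check, not Jukebox code).
AUDIO_SR = 44_100                                                  # input sample rate in Hz
TOKEN_RATES = {"bottom": 5510.0, "middle": 1370.0, "top": 344.0}   # Hz, rounded as in the text

for level, rate in TOKEN_RATES.items():
    hop = AUDIO_SR / rate                                          # audio samples summarized by one token
    print(f"{level:>6} level: ~{hop:5.1f} samples/token (~{round(hop)}x compression)")
# -> roughly 8x, 32x, and 128x compression; each token indexes one of 2048
#    codebook entries, each a 64-dimensional vector.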

1:  Encode the input mixture: z ← E(x);  X ← STFT(x)
2:  repeat
3:      Decode the embedding: x̂ ← D(z);  X̂ ← STFT(x̂)
4:      Build the mask (Eq. 1): M ← X̂ ⊘ max(X, X̂, ε)
5:      Mask the mixture: ŝ ← ISTFT(M ⊙ X)
6:      Get the probability over tags: ŷ ← Tagger(ŝ)
7:      Update the embedding: z ← z + α∇_z BCE(ŷ, y)
8:  until max steps
9:  Output the final source estimate: s_est ← x − ŝ
Algorithm 1 Our Method

3.2 Automatic Music Tagging

Music tagging is the task of labeling musical audio clips with semantic labels called “tags” [11, 1]. These tags are useful for music search and recommendation systems, enabling automatic labelling of large music corpora. The content that the tags represent can vary, sometimes indicating information about a song’s genre, the song’s mood or theme, or whether particular instruments are audible.

Music tagging systems are designed to predict a set of multi-hot, binary labels (i.e., tags) based on the acoustic contents of an input signal. Many recent works use convolutional neural networks at their core, varying the convolutional filter size and the input representation of the audio [28]. Common datasets for music tagging are an order of magnitude larger than source separation datasets: MagnaTagATune (MTAT) [11] contains 25,877 30-second labeled audio clips (21x more hours of audio than MUSDB18) and MTG-Jamendo (MTG) [1] contains 55,701 labeled audio clips with a minimum song length of 30 seconds (46x more hours of audio than MUSDB18). We refer the reader to Won et al. for an overview of recent advances in music tagging [28].

In this work, we use pretrained music taggers provided by Won et al. [28]. We examine two pretrained music tagging architectures, each with a different input representation: FCN [4], which takes Mel spectrogram inputs, and HarmonicCNN [27], which takes a variant of a constant-Q transform with learnable filters. We also explore taggers trained on different datasets, namely MagnaTagATune (MTAT) [11] and MTG-Jamendo (MTG) [1].

4 Proposed System

Method            Unsupervised?  Neural Net?  | MUSDB18 [19]          | Slakh2100 [13]
                                               | Vocals  Bass  Drums   | Bass  Drums  Guitar  Piano  Strings
Open-Unmix [24]   no             yes           |   ✓      ✓     ✓      |  ✓     ✓      –       –       –
Demucs [5]        no             yes           |   ✓      ✓     ✓      |  ✓     ✓      –       –       –
Cerberus [12]     no             yes           |   –      ✓     ✓      |  ✓     ✓      ✓       ✓       ✓
HPSS [8]          yes            no            |   –      –     ✓      |  –     ✓      –       –       –
REPET-SIM [20]    yes            no            |   ✓      –     –      |  –     –      –       –       –
TagBox (Ours)     yes            yes           |   ✓      ✓     ✓      |  ✓     ✓      ✓       ✓       ✓
Table 1: Comparison of source separation systems in terms of mean SDR improvement (dB) over the unprocessed mixture (per-source SDRi values omitted here). “–” indicates that the system is unable to separate that source type; “✓” indicates that it produces an estimate for that source. TagBox is the only system that is able to separate all of the sources we test.

At the heart of our proposed system are two components: a pretrained generative music model (i.e., Jukebox) and a pretrained music tagging model. Because our system combines music taggers and Jukebox, we call it TagBox. The core idea is simple: given an input audio clip, the generative model iteratively alters that audio such that, when the altered audio is given to a tagger, the tagger’s output increasingly matches a target set of tags describing the desired sources. An illustration of the proposed approach is shown in Figure 1, and Algorithm 1 outlines the approach in pseudocode. We now describe the steps in our method.

We first create a target tag distribution y by setting the tags that correspond to the desired instrument sources (e.g., “guitar” or “drums”) to 1 and all other tags to 0. We then use E, the encoder portion of an autoencoder (in this case, the one in Jukebox), to produce an embedding z from the input audio mixture x. This embedding is then decoded into a waveform x̂ by the decoder D.
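As a minimal illustration of this first step, the Python sketch below builds the multi-hot target vector y; the tag list shown is illustrative, and in practice the vocabulary comes from the pretrained tagger (e.g., MTAT or MTG-Jamendo).

import torch

# Build the multi-hot target tag distribution: 1 for each tag naming the
# desired source(s), 0 everywhere else. The tag list here is illustrative.
def make_target_tags(all_tags, desired):
    target = torch.zeros(len(all_tags))
    for i, tag in enumerate(all_tags):
        if tag in desired:
            target[i] = 1.0
    return target

all_tags = ["guitar", "drums", "piano", "vocals", "strings", "rock", "ambient"]
target_tags = make_target_tags(all_tags, desired={"guitar"})   # separate guitar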

Rather than pass x̂ directly to the tagger, we use it as a mask on the input mixture. In this way, the embedding essentially determines what information must be removed from the input mix to produce the desired source as defined by the tags. For an input mixture waveform x and a Jukebox-decoded waveform x̂, both of length T samples, we convert both to spectrogram representations, X and X̂, with N time frames and F frequency bins. We then compute a real-valued mask as follows:

M = X̂ ⊘ max(X, X̂, ε),  (1)

where max(·) is applied element-wise over each time-frequency bin of the pair of spectrograms together with a small epsilon that prevents division by zero, and ⊘ denotes element-wise division. This mask is multiplied by the mixture spectrogram to get an estimate of the audio data that should be removed from the mix, Ŝ = M ⊙ X, where ⊙ indicates element-wise multiplication. Ŝ is then converted to a waveform, ŝ, using an inverse STFT. This waveform ŝ is then put into the music tagger to determine the estimated tags ŷ. A binary cross-entropy loss L is computed between the estimated tags ŷ and the predetermined instrument tags y. This loss is used to perform a gradient ascent step in the Jukebox embedding space, z ← z + α∇_z L, where α governs the step size. This approach is similar to adversarial example generation [9], where the goal is also to optimize the input to produce a desired label. Because the mask made by the Jukebox-decoded audio determines what should be removed from the mix, the final estimate for a target source is the difference between the input mixture waveform and the final ŝ produced by gradient ascent; the final source estimate is therefore s_est = x − ŝ.
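The PyTorch sketch below ties these steps together. It is a minimal approximation rather than the authors' implementation: encoder, decoder, and tagger are placeholder callables standing in for the Jukebox VQ-VAE encoder/decoder and a pretrained tagger (all frozen), and the embedding update is written as an Adam step driven by the binary cross-entropy tag loss, following the gradient-step description above.

import torch
import torch.nn.functional as F

def masked_mixture(z, decoder, X, n_fft, window, length, eps=1e-8):
    # Decode the embedding, build the Eq. (1) mask, and apply it to the mixture STFT.
    x_hat = decoder(z)
    X_hat = torch.stft(x_hat, n_fft, window=window, return_complex=True)
    M = X_hat.abs() / torch.clamp(torch.maximum(X.abs(), X_hat.abs()), min=eps)
    return torch.istft(M * X, n_fft, window=window, length=length)

def tagbox_separate(mixture, encoder, decoder, tagger, target_tags,
                    n_fft=1024, lr=5.0, steps=10):
    # `mixture` is a 1-D waveform tensor; `target_tags` is the multi-hot vector y.
    window = torch.hann_window(n_fft)
    X = torch.stft(mixture, n_fft, window=window, return_complex=True)
    z = encoder(mixture).detach().requires_grad_(True)       # only z is optimized
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        s_hat = masked_mixture(z, decoder, X, n_fft, window, mixture.shape[-1])
        loss = F.binary_cross_entropy(tagger(s_hat), target_tags)   # tag loss
        opt.zero_grad()
        loss.backward()                  # gradients reach z; no model weights are stepped
        opt.step()
    with torch.no_grad():                # masked waveform for the final embedding
        s_hat = masked_mixture(z, decoder, X, n_fft, window, mixture.shape[-1])
    return mixture - s_hat               # final estimate: mixture minus the masked waveform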

We note that neither the generative model nor the music tagger were trained for source separation and that no additional training or alteration of the weights of either model happens at any point. These models were, however, trained on datasets with a wider range of audio than is typical for deep models trained specifically for source separation.

Our system is able to produce separation results for a larger set of sources than any previous deep learning system that we are aware of; it is limited only by the tag vocabulary of the music tagging system. MTG-Jamendo (MTG) has 12 distinct instrument tags, and MagnaTagATune (MTAT) has 31 tags that could be interpreted as instrument tags, although these tags overlap conceptually (e.g., MTAT contains distinct tags for “vocals”, “voice”, “male vocals”, etc.). Additionally, separating a different source type does not require any changes to the system setup other than altering the set of predefined tags. Compare this to typical music separation networks like Open-Unmix [24], which would require training a whole model for each new source, or Demucs [5], which would require altering the network architecture to add a new source output.

Tagger dataset   Architecture | MUSDB18 [19]        | Slakh [13]
                              | Vox   Bass  Drums   | Bass  Drums  Guitar  Piano  Strings
MTAT             FCN          |  ✓     –     ✓      |  –     ✓      ✓       ✓      ✓
MTAT             HCNN         |  ✓     –     ✓      |  –     ✓      ✓       ✓      ✓
MTG-Jamendo      FCN          |  ✓     ✓     ✓      |  ✓     ✓      ✓       ✓      ✓
MTG-Jamendo      HCNN         |  ✓     ✓     ✓      |  ✓     ✓      ✓       ✓      ✓
Table 2: Comparison of different pretrained, frozen taggers used for gradient ascent with TagBox, in terms of mean SDR improvement (dB) over the unprocessed mixture (per-source SDRi values omitted here). “–” marks sources a configuration cannot separate; note that the MTAT taggers have no “bass” tag.

5 Experimental Validation

We conduct a series of experiments to validate our system, aimed at answering two questions. The first and main experiment compares the proposed system to existing systems, paying special attention to TagBox’s ability to separate many types of sources. The second experiment determines how the choice of the pretrained, frozen tagger model affects separation quality.

In our main experiment, we compare our system to existing systems on two established test sets for source separation, namely MUSDB18 [19] and Slakh2100 [13]. In this experiment, we compare our proposed system against recent deep learning-based supervised separation systems as well as unsupervised separation methods based on musical priors. We compare the systems on a wide variety of source types across both datasets.

The first dataset we examine is MUSDB18. MUSDB18 contains 150 mixtures and corresponding sources from real recording sessions; 100 of these are reserved for training and the remaining 50 are used for testing. For this experiment, we exclude MUSDB18’s “other” source because it could map to many possible tags using TagBox. The supervised systems that we compare against, namely Open-Unmix [24] and Demucs [5], are trained using the MUSDB18 training set. Contrast this with the unsupervised systems we test, HPSS [8] and REPET-SIM [20], which are run on the test set without any training. Our proposed system falls into this second camp; it is also unsupervised and therefore has no training phase, ignoring the MUSDB18 training set.

The main experiment also uses the Slakh2100 [13] dataset. Slakh2100 contains 2100 mixtures with corresponding sources that were synthesized using professional-grade sample-based synthesis engines. We choose 50 songs from the test set to evaluate on, selecting songs that have source data for the following five source types: bass, drums, guitar, piano, and strings. We select mixes where all five sources are active, where a source is considered active if it has 100 or more note onsets over the entirety of the song, as determined by the corresponding MIDI data. We create mixes by instantaneously mixing the sources together and use these mixtures as input to the systems. With this setup we compare against Cerberus [12], which was trained to separate these five instruments specifically.
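To make the activity criterion concrete, the sketch below counts note onsets per instrument group directly from a track’s MIDI using pretty_midi; the file path and the grouping by General MIDI instrument class are illustrative, not the exact grouping used for Slakh2100.

import pretty_midi

# A source counts as "active" if its MIDI part has at least 100 note onsets
# over the whole song. Path and instrument grouping are illustrative only.
def active_sources(midi_path, min_onsets=100):
    midi = pretty_midi.PrettyMIDI(midi_path)
    counts = {}
    for inst in midi.instruments:
        name = "Drums" if inst.is_drum else pretty_midi.program_to_instrument_class(inst.program)
        counts[name] = counts.get(name, 0) + len(inst.notes)
    return {name for name, n in counts.items() if n >= min_onsets}

print(active_sources("Track00001/all_src.mid"))   # e.g. {"Bass", "Drums", "Guitar", "Piano", "Strings"}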

For TagBox, we use a pretrained FCN [4] tagger trained on the MagnaTagATune (MTAT) [11] dataset. We run gradient ascent with a learning rate of 5.0 using the Adam optimizer for 10 steps (in the interest of brevity), and we use a spectrogram with 1024 FFT bins for the mask. Additionally, we use the “foreground” output from REPET-SIM as its vocals estimate, following prior work [20], and the “percussion” output from HPSS as its drums estimate. We omit the other source outputs of these systems because they are ill-defined (e.g., HPSS’s “harmonic” output could correspond to many possible sources).

In the second experiment, we compare four configurations of our proposed system, varying the architecture and training data of the music tagger. We look at the FCN [4] and HarmonicCNN [27] architectures, each trained on either MagnaTagATune (MTAT) [11] or MTG-Jamendo [1]. We use the same learning rate and number of steps as in the previous experiment.

We evaluate the outcome of our experiments using the source-to-distortion ratio improvement (SDRi) over the unprocessed mixture [25], computed with the museval toolbox [23].
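For reference, the SDRi computation can be sketched as follows; this assumes museval’s (n_sources, n_samples, n_channels) array convention, and the aggregation over evaluation windows is an illustrative choice.

import numpy as np
import museval

# SDR improvement: BSSEval SDR of the estimates minus the SDR obtained when
# the unprocessed mixture itself is used as the estimate for every source.
def sdr_improvement(references, estimates, mixture):
    sdr_est, _, _, _ = museval.evaluate(references, estimates)
    mix_as_est = np.repeat(mixture[None, ...], references.shape[0], axis=0)
    sdr_mix, _, _, _ = museval.evaluate(references, mix_as_est)
    # Aggregate over evaluation windows, then take the per-source difference.
    return np.nanmean(sdr_est, axis=1) - np.nanmean(sdr_mix, axis=1)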

6 Results and Discussion

Table 1 shows the results of our main experiment. In terms of SDRi, our system is better than or competitive with both of the hand-designed unsupervised algorithms that we test against, HPSS and REPET-SIM. Additionally, while our system does not perform as well as the purpose-built supervised separation systems (i.e., Open-Unmix, Demucs, and Cerberus), it still shows a considerable SDRi boost for every source that we test. Importantly, our system improves performance over a wider array of source types than any other system we compare against.

The results of our second experiment are shown in Table 2. Of the two architectures we test, FCN always produces better separation results. Interestingly, the opposite trend was observed when the taggers were evaluated for music tagging performance by Won et al. [28]: HCNN was among the top-performing systems and FCN was toward the bottom of the pack.

In many cases, TagBox’s output leaves much to be desired perceptually; in most cases its separation quality does not reach that of the purpose-built separation systems we compare against. However, when listening to the output, there is no doubt that TagBox isolates the desired source, despite audible artifacts. We have informally noticed a few tricks that improve perceptual quality, such as using multiple FFT sizes when making the masks (à la a multi-scale spectral loss) and running gradient ascent for 100 steps, although these improvements were not reflected in the SDR numbers. Furthermore, because producing each output example requires its own gradient ascent, adding more steps increases computation time linearly, which can be costly when run on an entire dataset; this may, however, be tolerable for musicians who need a flexible source separation solution for a single song.
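One plausible realization of the multiple-FFT-size trick mentioned above is sketched below: build the Eq. (1) mask at several resolutions and average the resulting masked waveforms. The exact combination used by the authors is not specified, so treat this as an assumption.

import torch

# Build the mask at several FFT sizes and average the masked waveforms.
# The averaging scheme is an illustrative choice, not the authors' exact recipe.
def multi_fft_mask(mixture, x_hat, n_ffts=(512, 1024, 2048), eps=1e-8):
    outs = []
    for n_fft in n_ffts:
        window = torch.hann_window(n_fft)
        X = torch.stft(mixture, n_fft, window=window, return_complex=True)
        X_hat = torch.stft(x_hat, n_fft, window=window, return_complex=True)
        M = X_hat.abs() / torch.clamp(torch.maximum(X.abs(), X_hat.abs()), min=eps)
        outs.append(torch.istft(M * X, n_fft, window=window, length=mixture.shape[-1]))
    return torch.stack(outs).mean(dim=0)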

There are also a few other variants of the TagBox setup that can lead to fun and unexpected creative results. In the first, we remove the masking step and allow TagBox to generate audio freely, without the constraint of only removing information from the mix. With this setup, TagBox performs a kind of style transfer, mapping certain features of the audio onto the desired tag. In one example, a mixture contained a singer and we selected the “guitar” tag; TagBox made the resulting audio sound as if a guitar were performing the melody. Another variant involves selecting and optimizing non-instrument tags, such as genre tags.

What we find most impressive is that neither Jukebox nor the music taggers were trained for source separation. Furthermore, the weights of both networks do not change during the gradient ascent process; only the location of the audio in the Jukebox embedding space changes. Between them, Jukebox and the taggers have seen up to 1.25 million songs, and together these systems leverage their shared priors about music and musical sources to isolate individual sources in a mixture. We believe these priors could be leveraged in many ways to overcome the data scarcity problems endemic to many MIR tasks, as has already been investigated to great effect by Castellon et al. [3]. We are excited about future explorations in this area.

7 Conclusion

In this paper, we have proposed TagBox, a method for unsupervised source separation that combines pretrained models. We use pretrained music taggers to perform gradient ascent in the embedding space of OpenAI’s Jukebox with the goal of maximizing a predefined tag corresponding to the source we want to separate. The output of Jukebox is used as a mask on the input audio before being sent to the tagger, which ensures that Jukebox does not generate new content that is not present in the input mixture. Importantly, neither the tagger nor Jukebox has been trained for source separation, and the weights of both models remain fixed during the gradient ascent process. Our results show that the system separates a wider variety of source types than many recent purpose-built, supervised separation systems. We are excited by the promise that pretrained systems hold for the future of MIR and source separation research.

8 Acknowledgements

The authors would like to thank Ian Simon, Sander Dieleman, Jesse Engel, and Curtis Hawthorne for their fruitful conversations about this work. Additionally, we would like to thank the creators of Jukebox for help with their codebase: Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever.

References

  • [1] D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra (2019) The mtg-jamendo dataset for automatic music tagging. Cited by: §3.2, §3.2, §3.2, §5.
  • [2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1.
  • [3] R. Castellon, C. Donahue, and P. Liang (2021) Codified audio language modeling learns useful representations for music information retrieval. arXiv preprint arXiv:2107.05677. Cited by: §1, §6.
  • [4] K. Choi, G. Fazekas, and M. Sandler (2016) Automatic tagging using deep convolutional neural networks. arXiv preprint arXiv:1606.00298. Cited by: §3.2, §5, §5.
  • [5] A. Défossez, N. Usunier, L. Bottou, and F. Bach (2019) Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254. Cited by: §2, Table 1, §4, §5.
  • [6] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever (2020) Jukebox: a generative model for music. arXiv preprint arXiv:2005.00341. Cited by: §3.1, §3.1.
  • [7] P. Esser, R. Rombach, and B. Ommer (2021) Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883. Cited by: §1.
  • [8] D. Fitzgerald (2010) Harmonic/percussive separation using median filtering. In Proceedings of the International Conference on Digital Audio Effects (DAFx), Vol. 13. Cited by: §2, Table 1, §5.
  • [9] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §4.
  • [10] Y. Hung, G. Wichern, and J. Le Roux (2021) Transcription is all you need: learning to separate musical mixtures with score as supervision. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 46–50. Cited by: §2.
  • [11] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie (2009) Evaluation of algorithms using games: the case of music tagging. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), pp. 387–392. Cited by: §3.2, §3.2, §3.2, §5, §5.
  • [12] E. Manilow, P. Seetharaman, and B. Pardo (2020) Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 771–775. Cited by: Table 1, §5.
  • [13] E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux (2019) Cutting music source separation some slakh: a dataset to study the impact of training data quality and quantity. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 45–49. Cited by: §1, §2, Table 1, Table 2, §5, §5.
  • [14] Y. Mitsufuji, G. Fabbro, S. Uhlich, and F. Stöter (2021) Music demixing challenge at ismir 2021. arXiv preprint arXiv:2108.13559. Cited by: §2.
  • [15] J. Neri, R. Badeau, and P. Depalle (2021) Unsupervised blind source separation with variational auto-encoders. In 29th European Signal Processing Conference (EUSIPCO 2021). Cited by: §2.
  • [16] F. Pishdadian, G. Wichern, and J. Le Roux (2020) Finding strength in weakness: learning to separate sounds with weak supervision. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2386–2399. Cited by: §2.
  • [17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §1.
  • [18] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, pp. 1–67. Cited by: §1.
  • [19] Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, and R. Bittner (2017) The MUSDB18 corpus for music separation. Cited by: §1, §2, Table 1, Table 2, §5.
  • [20] Z. Rafii and B. Pardo (2012) Music/voice separation using the similarity matrix. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR). Cited by: §2, Table 1, §5, §5.
  • [21] A. Razavi, A. van den Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with vq-vae-2. In Advances in neural information processing systems, pp. 14866–14876. Cited by: §3.1.
  • [22] P. Smaragdis (1998) Blind separation of convolved mixtures in the frequency domain. Neurocomputing 22 (1-3), pp. 21–34. Cited by: §2.
  • [23] F. Stöter, A. Liutkus, and N. Ito (2018) The 2018 signal separation evaluation campaign. In Latent Variable Analysis and Signal Separation: 14th International Conference, LVA/ICA 2018, Surrey, UK, pp. 293–305. Cited by: §5.
  • [24] F. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji (2019) Open-Unmix – a reference implementation for music source separation. Journal of Open Source Software 4 (41), pp. 1667. Cited by: §2, Table 1, §4, §5.
  • [25] E. Vincent, R. Gribonval, and C. Févotte (2006) Performance measurement in blind audio source separation. IEEE transactions on audio, speech, and language processing 14 (4), pp. 1462–1469. Cited by: §5.
  • [26] S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. Hershey (2020) Unsupervised sound separation using mixture invariant training. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 3846–3857. External Links: Link Cited by: §2.
  • [27] M. Won, S. Chun, O. Nieto, and X. Serra (2020) Data-driven harmonic filters for audio representation learning. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 536–540. Cited by: §3.2, §5.
  • [28] M. Won, A. Ferraro, D. Bogdanov, and X. Serra (2020) Evaluation of cnn-based automatic music tagging models. In Proc. of 17th Sound and Music Computing, Cited by: §3.2, §3.2, §6.