Official repo of ISMIR-21 publication, “A Benchmarking Initiative for Audio-domain Music Generation using the FreeSound Loop Dataset”.
This paper proposes a new benchmark task for generating musical passages in the audio domain by using the drum loops from the FreeSound Loop Dataset, which are publicly re-distributable. Moreover, we use a larger collection of drum loops from Looperman to establish four model-based objective metrics for evaluation, releasing these metrics as a library for quantifying and facilitating the progress of musical audio generation. Under this evaluation framework, we benchmark the performance of three recent deep generative adversarial network (GAN) models we customize to generate loops, including StyleGAN, StyleGAN2, and UNAGAN. We also report a subjective evaluation of these models. Our evaluation shows that the one based on StyleGAN2 performs the best in both objective and subjective metrics.
Audio-domain music generation involves generating musical sounds either directly as audio waveforms or as time-frequency representations such as Mel spectrograms. Besides modeling musical content in aspects such as pitch and rhythm, it has the additional complexity of modeling the spectral-temporal properties of musical sounds, compared to its symbolic-domain music generation counterpart. In recent years, deep learning models have been proposed for audio-domain music generation, starting with simpler tasks such as generating instrumental single notes [18, 17, 35, 38], a task also known as neural audio synthesis. Researchers have also begun to address the more challenging setting of generating sounds of longer duration [48, 32, 8, 49, 31, 30, 13, 47]. For example, Jukebox  aims to generate realistic minutes-long singing voices conditioned on lyrics, genre, and artists; and UNAGAN  aims to generate musical passages of finite yet arbitrary duration for singing voices, violin, and piano, in an unconditional fashion.
The focus of this paper is on the evaluation of audio-domain music generation. We note that, for model training and evaluation, research on generating single notes quite often adopts NSynth , a large public dataset consisting of individual notes from different instruments. The use of a common dataset for evaluation ensures the validity of performance comparison between different models. Such a standardized dataset for benchmarking, however, is not available when it comes to generating longer musical passages, to our best knowledge. Oftentimes private in-house datasets are employed in existing works; for example, both UNAGAN  and Jukebox  employ audio recordings scraped from the Internet, which cannot be shared publicly. The only exception is MAESTRO, a public dataset with over 172 hours of solo piano performances, employed by MelNet , UNAGAN , and MP3net . However, MAESTRO is piano-only and thus not diverse enough in timbre.
We see new opportunities to address this gap with the recent release of the FreeSound Loop Dataset (FSLD) , which contains 9,455 production-ready, public-domain loops distributed under Creative Commons licenses (https://zenodo.org/record/3967852). We therefore propose to use audio-domain loop generation, a task seldom reported in the literature, to set a benchmark for musical audio generation research.
We deem loops an adequate target for audio generation for the following merits. First, loops are audio excerpts, usually of short duration, that can be played in a seamless manner [39, 44]. Hence, the generated loops can be played repeatedly. Second, loops are fundamental units in the production of many contemporary dance music genres. A loop can usually be associated with a single genre or instrument label , and a certain “role” (e.g., percussion, FX, melody, bass, chord, voice) . Third, loops are fairly diverse in their musical content and timbre, as sound design has been a central part of making loops.
A primary contribution of this paper is therefore the proposal and implementation of using FSLD as a benchmark for audio generation. In particular, we adapt three recent deep generative adversarial network (GAN)  models, train them on the drum-loop subset of FSLD, and report a thorough evaluation of their performance, both objectively and subjectively. This includes UNAGAN  and two state-of-the-art models for image generation, StyleGAN  and StyleGAN2 .
Drum loop generation is interesting in its own right due to its applications in automatic creation of loop-based music . As  indicates, drum beats represent one of the most critical and fundamental elements that form the style of EDM. Moreover, drum loops are already fairly diverse in musical content, as demonstrated in Figure 1. Although we only consider drum loops here for the sake of simplicity, this benchmark can be easily extended to cover all the loops from FSLD in the near future.
Our secondary contribution lies in the development of standardized objective metrics for evaluating audio-domain loop generation, which can be equally important as having a standardized dataset. We collect a larger drum loop dataset from an online library called looperman (https://www.looperman.com/), with roughly 9 times more drum loops than FSLD, and use this looperman dataset to build four model-based metrics (e.g., inception score ) to evaluate the acoustic quality and diversity of the loops generated by the GAN models. (We refer to them as model-based metrics because we need to build a classifier or a clustering model to calculate the metrics; see Section 4.) While this looperman dataset cannot be released publicly due to copyright concerns, we release the metrics and the trained GAN models for drum loop generation at the following GitHub repo: https://github.com/allenhung1025/LoopTest.
Moreover, we put some of the generated drum loops on an accompanying demo website (https://loopgen.github.io/), which we recommend readers to visit and listen to. We also present the results of using the style-mixing method of StyleGAN2 to generate “interpolated” versions of loops.
Existing work on audio-domain music generation can be categorized in many ways. First, an unconditional audio generation model takes as input a latent vector of a fixed number of random variables (or a sequence of such vectors; see below) and generates an audio piece from scratch. When side information of the target audio to be generated is available, we can feed such prior information as another vector and use it as an additional input to the generative model, making it a conditional generation model. For example, GANSynth  uses the pitch of the target audio as a condition. While we focus on unconditional generation in our benchmarking experiments presented in Section 6, it is straightforward to extend all the models presented in Section 5 to take additional conditions.
Second, some existing models can only generate fixed-length output, while others can do variable-length generation. One approach to realize variable-length generation is to use as input to the generative model a sequence of latent vectors, instead of just one latent vector. This is the approach taken by UNAGAN , Jukebox , and VQCPC-GAN .
Third, existing models for generating single notes are typically non-autoregressive models [17, 35, 38], i.e., the target is generated in one shot. When it comes to generating longer phrases, autoregressive models, which generate the target piece one frame or one time sample at a time in chronological order, might perform better [48, 13], as the output of such models depends explicitly on the previous frames (or samples) that have been generated.
Existing models have been trained and evaluated to generate different types of musical audio, including singing voice [31, 30, 13], drums [35, 38, 2, 16], violin , and piano [48, 49, 31, 47]. The only work addressing loop generation is the very recent LoopNet model from Chandna et al. . They also use loops from looperman but not anything from FSLD or other public datasets, hence not constituting a benchmark for audio generation.
For drum generation in particular, work has been done in the symbolic domain to generate drum patterns [20, 45, 1, 46] and a drum track as part of a symbolic multi-track composition [15, 43, 40]. For example, DeepDrummer  employs human-in-the-loop to produce drum patterns preferred by a user. In the audio domain, DrumGAN  and the model proposed by Ramires et al.  both work on only single hits, i.e., one-shot drum sounds. They both use the Audio Commons models  to extract high-level timbral features to condition the generation process. DrumNet  is a model that generates a sequence of (monophonic) kick drum hits, not the sounds of an entire drum kit.
Two datasets are employed in this work. The first one is a subset of drum loops from the public dataset FSLD , which is used to train the generative models for benchmarking. FSLD comes with detailed manual labeling of the loops, with tags for instrumentation, rhythm, tone, and genre. As stated in the FSLD paper , FSLD is balanced in terms of musical genre. By picking loops tagged with the keywords “drum”, “drums” or “drum-loop”, we find 2,608 drum loops out of the 9,455 loops available in FSLD. We do not need to hold out any of them as test data, but use all of them for training our generative models, since we focus on unconditional generation in this paper; i.e., each generative model generates a set of loops randomly for evaluation.
The second dataset is a larger, private collection of drum loops we collect from looperman, a website hosting free music loops. (As stated on https://www.looperman.com/help/terms, “All samples and loops are free to use in commercial and non commercial projects,” but “You may NOT use or re-distribute any media from the loops section of looperman.com as is either for free or commercially on any other web site”; accessed August 1, 2021.) We are able to collect in total 23,983 drum loops, many more than the drum loops in FSLD. We use the looperman dataset mainly for establishing the model-based objective metrics for evaluation (see Section 4). For instance, we train an audio-based genre classifier using looperman to set up the drum-loop version of the “inception score” [42, 3] to measure how likely a machine-generated loop sounds like a drum loop. Figure 2 shows the number of tracks per genre tag in looperman, which exhibits a typical long-tail distribution. We can see that “Trap” is the most frequent genre, with 5,903 loops.
We use looperman instead of FSLD to set up such objective metrics, since a larger dataset increases the validity and generalizability of the metrics. Moreover, although we cannot re-distribute the loops from looperman according to its terms, we can share checkpoints of the pre-trained models for computing the proposed objective metrics.
As we are interested in benchmarking the performance of one-bar loop generation, we perform downbeat tracking using the state-of-the-art recurrent neural network (RNN) model available in the Madmom library [5, 6] to slice every audio file into multiple one-bar loops. (The downbeat tracker in Madmom is fairly accurate for percussive audio such as drum loops; for example, it reaches an F1-score of 0.863 on the Ballroom dataset , according to .) After this processing, we have in total 13,666 and 128,122 one-bar samples from FSLD and looperman, respectively. We refer to these two collections of one-bar drum loops as the freesound and looperman datasets hereafter. We note that all these one-bar samples are of four beats.
As shown in Figure 3, the one-bar samples in both the freesound and looperman datasets have different tempos and hence different lengths. To unify their length to facilitate benchmarking, we use pyrubberband (https://pypi.org/project/pyrubberband/) to temporally stretch each of them to be 2 seconds long, namely to have a tempo of 120 BPM (beats per minute). We listened to some of the stretched samples in both datasets and found most sounded plausible with little perceptible artifacts. (This, however, may not be the case if the loops are not drum loops. Some data filtering might be needed then, e.g., to remove those whose tempi are far from 120 BPM.)
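The slicing-and-stretching pipeline above can be sketched as follows. This is a minimal numpy-only stand-in: the downbeat times are assumed given rather than estimated with Madmom's RNN tracker, and the linear-interpolation stretch does not preserve pitch the way pyrubberband does; the function names are ours, not from the released code.

```python
import numpy as np

SR = 44100
TARGET_LEN = 2 * SR  # 2 seconds, i.e., one 4-beat bar at 120 BPM

def slice_bars(audio, downbeat_times, sr=SR):
    """Slice a waveform into one-bar segments given downbeat times (seconds).

    In the paper the downbeats come from Madmom's RNN downbeat tracker;
    here they are assumed to be precomputed."""
    samples = (np.asarray(downbeat_times) * sr).astype(int)
    return [audio[s:e] for s, e in zip(samples[:-1], samples[1:])]

def stretch_to_two_seconds(bar):
    """Crude linear-interpolation stretch to a fixed 2-second length.

    The paper uses pyrubberband, which time-stretches without changing
    pitch; this numpy-only stand-in does change pitch and is only meant
    to show the shape of the pipeline."""
    x_old = np.linspace(0.0, 1.0, num=len(bar))
    x_new = np.linspace(0.0, 1.0, num=TARGET_LEN)
    return np.interp(x_new, x_old, bar)

# toy example: 4 seconds of noise "audio" with downbeats every 1.5 s
rng = np.random.default_rng(0)
audio = rng.standard_normal(4 * SR)
bars = slice_bars(audio, [0.0, 1.5, 3.0])
loops = [stretch_to_two_seconds(b) for b in bars]
print(len(loops), len(loops[0]))  # 2 loops, each 88200 samples
```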
All the loops have a 44,100 Hz sampling rate. We down-mix the stereo ones into mono. After that, we follow the setting of UNAGAN  to compute the Mel spectrograms of these samples, with a 1,024-sample Hann window and a 275-sample hop size for the short-time Fourier transform (STFT), and 80 Mel channels.
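A numpy-only sketch of this feature extraction is given below. The HTK-style mel filterbank is hand-rolled here for self-containedness; the actual implementation follows UNAGAN's code and may differ in details such as filter normalization and log scaling.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 44100, 1024, 275, 80  # the paper's settings

def hz_to_mel(f):  # HTK mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular mel filters over the STFT bins (HTK-style; some filters at
    the low end may be empty at this resolution, which we tolerate here)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)  # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)  # falling slope
    return fb

def melspectrogram(y):
    """Mel spectrogram with a 1,024-sample Hann window, 275-sample hop,
    and 80 mel bands, as in the paper."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(y) - N_FFT) // HOP
    frames = np.stack([y[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, 513)
    return mel_filterbank() @ mag.T            # (80, n_frames)

y = np.random.default_rng(0).standard_normal(2 * SR)  # a 2-second "loop"
S = melspectrogram(y)
print(S.shape)
```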
We consider four metrics in our benchmark, developing drum-loop versions of them using the looperman dataset.
The inception score (IS) measures the quality of the generated data and detects whether there is mode collapse, by using a pre-trained domain-specific classifier. It is computed from the KL divergence between the conditional probability $p(y|x)$ and the marginal probability $p(y)$:

$$\mathrm{IS} = \exp\Big( \mathbb{E}_{x} \big[ D_{\mathrm{KL}}\big( p(y|x) \,\|\, p(y) \big) \big] \Big),$$

where $x$ denotes a data example (e.g., a generated loop), and $y$ is a pre-defined class. Specifically, the calculation of IS involves building a classifier over the type of data of interest, and it achieves a high score (the higher the better) when 1) each of the generated data can be classified to one of the predefined classes with high confidence, and 2) the generated data as a whole has a close-to-uniform distribution over the predefined classes.
We use looperman to establish such a classifier, using its genre labels to train a 66-class classifier over the Mel spectrograms of one-bar samples. Specifically, we split the data 100,000/10,000/18,111 into training, validation, and test sets, and use the state-of-the-art music auto-tagging model short-chunk CNN  (github.com/minzwon/sota-music-tagging-models) for model training. The classifier achieves 0.748 accuracy on the test set.
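Given the classifier's class probabilities for a set of generated loops, the IS can be computed as below. This is a sketch with synthetic probabilities; the paper's actual score uses the 66-class short-chunk CNN's predictions.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an (N, C) matrix of class probabilities p(y|x).

    IS = exp( mean_x KL( p(y|x) || p(y) ) ), where p(y) is estimated as
    the mean of p(y|x) over the N generated examples."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

rng = np.random.default_rng(0)
# confident, class-diverse predictions -> high IS (up to the number of classes)
confident = np.eye(66)[rng.integers(0, 66, size=1000)]
# uniform predictions -> IS equal to 1
uniform = np.full((1000, 66), 1.0 / 66)
print(inception_score(confident), inception_score(uniform))
```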
The idea of the Fréchet audio distance (FAD), as proposed by Kilgour et al. , is to measure the closeness of the data distribution of the real data versus that of the generated data, in a certain embedding space. Specifically, they pre-train a VGGish-based audio classifier on a large collection of YouTube videos for classifying 300 audio classes and sound events, and then use the second-to-last, 128-dimensional layer (i.e., prior to the final classification layer) for this embedding space . The distributions of the real and generated data in this space are each modeled as a multivariate normal distribution, characterized by $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ respectively. The FAD score is then computed as

$$\mathrm{FAD} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \big),$$

and is the lower the better (down to zero). We use the open source code and pre-trained classifier (github.com/google-research/google-research/tree/master/frechet_audio_distance) to compute the FAD, using the looperman data as the real data and the output of a generative model as the generated data.
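A minimal sketch of the FAD computation from two sets of embeddings follows. Random vectors stand in for the 128-dimensional VGGish embeddings, and the trace of the matrix square root is computed via eigenvalues rather than an explicit matrix square root.

```python
import numpy as np

def fad(emb_real, emb_gen):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    FAD = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
    mu_r, mu_g = emb_real.mean(0), emb_gen.mean(0)
    s_r = np.cov(emb_real, rowvar=False)
    s_g = np.cov(emb_gen, rowvar=False)
    # Tr((S_r S_g)^{1/2}) via eigenvalues of S_r @ S_g: a product of PSD
    # matrices has non-negative real eigenvalues (up to numerical noise)
    eig = np.linalg.eigvals(s_r @ s_g)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r) + np.trace(s_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 8))
same = rng.standard_normal((2000, 8))            # same distribution -> near 0
shifted = rng.standard_normal((2000, 8)) + 3.0   # mean shift -> large FAD
print(fad(real, same), fad(real, shifted))
```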
Following , we measure diversity with the number of statistically-different bins (NDB) and Jensen-Shannon divergence (JSD) metrics proposed by Richardson et al. , via the official open source code (github.com/eitanrich/gans-n-gmms). We first run $K$-means clustering over normalized Mel spectrograms of 10,000 one-bar samples randomly picked from looperman to get $K$ clusters, and count the number of samples per cluster, $n_k$, for each cluster $k$. Then, given a collection of loops randomly generated by a generative model, we fit the loops into the clustering and also count the number of fitted samples per cluster, $\hat{n}_k$. We can then measure the difference between the two distributions $\{n_k\}$ and $\{\hat{n}_k\}$ by either the number of statistically-different bins (among the $K$ bins; the lower the better) or their JSD (the lower the better; down to zero). Richardson et al.  recommend reporting the value of NDB divided by $K$, noting that if the two samples do come from the same distribution, NDB/$K$ should be equal to the significance level of the statistical test, which we set to 0.05.
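Given the per-cluster counts from the k-means partition, NDB and JSD can be sketched as follows. The counts below are hypothetical, and the official gans-n-gmms code also handles the clustering itself; this stand-in only shows the per-bin test and the divergence.

```python
import numpy as np

def ndb_jsd(counts_real, counts_gen, z_crit=1.96):
    """NDB and JSD between per-cluster histograms (after Richardson & Weiss).

    A bin counts as 'statistically different' when a pooled two-proportion
    z-test rejects equality at the 0.05 level (critical |z| = 1.96)."""
    c_r = np.asarray(counts_real, float)
    c_g = np.asarray(counts_gen, float)
    n_r, n_g = c_r.sum(), c_g.sum()
    p_r, p_g = c_r / n_r, c_g / n_g
    # per-bin pooled two-proportion z-test
    p = (c_r + c_g) / (n_r + n_g)
    se = np.sqrt(p * (1.0 - p) * (1.0 / n_r + 1.0 / n_g))
    z = np.abs(p_r - p_g) / np.maximum(se, 1e-12)
    ndb = int((z > z_crit).sum())
    # Jensen-Shannon divergence (in nats) between the two bin distributions
    m = 0.5 * (p_r + p_g)
    def kl(a, b):
        mask = a > 0
        return float((a[mask] * np.log(a[mask] / b[mask])).sum())
    jsd = 0.5 * kl(p_r, m) + 0.5 * kl(p_g, m)
    return ndb, jsd

same = ndb_jsd([100, 100, 100], [98, 101, 101])   # similar histograms
diff = ndb_jsd([100, 100, 100], [10, 10, 280])    # very different histograms
print(same, diff)
```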
We develop and evaluate in total three recent deep generative models, all of which happen to be GAN-based . The first model is StyleGAN2 , which represents the state of the art in image generation, included here to test its applicability to musical audio generation (which has not been reported elsewhere, to our best knowledge). The second model, StyleGAN , is a precursor of StyleGAN2, previously tested on spoken digit generation (akin to single-note generation in music)  but not on musical audio generation. Both StyleGAN and StyleGAN2 generate only fixed-length output, which is fine here since our samples have constant length. The last model, UNAGAN , represents the state of the art in musical audio generation and is capable of generating variable-length output. For a fair comparison, we require UNAGAN to generate two-second samples, like the other two. Schematic plots of the three models can be found in Figure 4.
All three models are trained to generate Mel spectrograms, with phase information missing. But the Mel spectrograms can later be converted into audio waveforms by a separate neural vocoder, such as WaveNet , WaveGlow , DiffWave , or MelGAN . We favor MelGAN because it is non-autoregressive and therefore fast at inference time, and because there is official open source code that is easy to use (github.com/descriptinc/melgan-neurips). We train MelGAN on the looperman dataset and use it in all our experiments.
The input tensor of StyleGAN and StyleGAN2 can be interpreted as 512 tiny images. This tensor is learned and then fixed during the inference stage while generating new images, using a different latent each time. We modify the shape of this tensor in our work, so as to generate a Mel spectrogram through four upsampling blocks.
Our implementation of StyleGAN is based on an open source code (github.com/rosinality/style-based-gan-pytorch). For model training, StyleGAN employs the non-saturating loss with regularization and a progressive-growing training strategy . We use a mixing regularization ratio of 0.9 , and set the batch size to 32, 16, 8, and 4 at the respective scales, from low to high resolution. At every scale, we train with 1.2M samples. We use the Adam optimizer with a learning rate of 1e−3. The total training time is 120 hours on an NVIDIA GTX1080 GPU with 8GB memory.
StyleGAN2  is an improved version of StyleGAN with many structural changes, including replacing AdaIN with a combination of “modulation” and “demodulation” layers, processing the input tensor differently, adding the Gaussian noise outside of the style blocks, etc. The weights in the convolution layers are scaled by the incoming style in the modulation block and normalized by their L2 norm in the demodulation block. We refer readers to the original paper  for details. Our implementation of StyleGAN2 is based on another open source code (github.com/rosinality/stylegan2-pytorch), with similar training strategies as in the StyleGAN case, but a two-times-larger learning rate, no progressive growing, and a constant batch size of 8 for 1M samples. The total training time is 100 hours on a GTX1080.
UNAGAN  is a non-autoregressive model originally designed for generating variable-length singing voices in an unconditional fashion. The authors also demonstrate its effectiveness in learning to generate passages of violin, piano, and speech. What makes UNAGAN different from existing models such as StyleGAN, WaveGAN , DrumGAN , and GANSynth  is that UNAGAN takes a sequence of latent vectors as input, instead of just a single one. This sequence of latent vectors, together with the recurrent units inside its ‘Gblocks’ [30, 31] (see Figure 4(c)), enables UNAGAN to generate variable-length audio with length proportional to the length of the input latent sequence. UNAGAN adopts a hierarchical architecture that generates Mel spectrograms in a coarse-to-fine fashion, similar to the progressive upsampling blocks in StyleGAN and StyleGAN2. UNAGAN uses the BEGAN-based adversarial loss , plus an additional cycle consistency loss  to stabilize training and increase diversity. Our implementation of UNAGAN is based on the official open source code (https://github.com/ciaua/unagan). We fix the number of input latent vectors to 20 and train the model with Adam, a 1e−4 learning rate, and a batch size of 16 for 100k iterations, amounting to 40 hours on a GTX1080.
Table 1 presents the objective evaluation results of the models trained on the freesound dataset. Each model generates 2,000 random loops for computing the scores. We also compute these metrics on the two real datasets and add the results to Table 1, to offer an oracle reference. We see that the IS of StyleGAN2 is the closest to that of the freesound dataset, followed by UNAGAN and then StyleGAN. A Student's t-test shows that the performance edge of StyleGAN2 over either UNAGAN or StyleGAN is statistically significant (p-value < 0.01). This reveals the efficacy of StyleGAN2 for generating fixed-length audio.
The scores in JSD and NDB further support the superiority of StyleGAN2, showing that its output is the most diverse among the three.
The FAD scores, however, show that UNAGAN performs better than StyleGAN2 here. The contrast between IS and FAD suggests that UNAGAN learns to generate samples whose embeddings have a similar distribution as the real data, but its output cannot be easily associated with a genre class by the short-chunk CNN classifier. We also see that StyleGAN has a fairly high FAD, showing that its generations hardly resemble the real data distribution.
Out of curiosity, we also train the models on the private, yet larger, looperman dataset and redo the evaluation. Table 2 shows that StyleGAN2 achieves even higher IS and much lower NDB here. Furthermore, its FAD is now lower than that of UNAGAN. Together with the results in JSD and NDB, we see from this table that StyleGAN2 is more effective in learning to cover the modes of a large dataset. Figure 1 shows the Mel spectrograms of some random drum loops generated by this StyleGAN2 model.
We additionally run an online listening test to evaluate the models subjectively. Each subject is presented with a randomly picked human-made loop from the freesound dataset, and one randomly generated loop from each of the three models trained on freesound, with the ordering of these four loops randomized. The subject is then asked to rate each of these one-bar loops in terms of the following metrics, the first three on a three-point scale, and the last one on a five-point Likert scale:
Drumness: whether the sample contains drum sounds (‘no’/‘yes but vague’/‘yes and clear’);
Loopness: whether the sample can be played repeatedly in a seamless manner (‘no’/‘yes but not so good’/‘yes’);
Audio quality: whether the sample is free of unpleasant noises or artifacts (‘no’/‘no but not so bad’/‘yes’);
Preference: how much you like it (1–5).
To evaluate loopness, we actually repeat each sample four times in the audio recording presented to the subjects. And, since the outputs of the models go through the MelGAN vocoder to become waveforms, we also compute the Mel spectrograms of the human-made loops and render them to audio with the same vocoder, for a fair comparison.
140 anonymous subjects from Taiwan participated in this test (the subjects knew nothing about our models beforehand, nor did they know that one of the loops they heard was human-made), with in total six unique samples by each model evaluated. Overall, the responses indicated an acceptable level of reliability (Cronbach's α). We see from Figure 5 that the result of this subjective evaluation is well aligned with that of the objective evaluation, with StyleGAN2 performing the best and StyleGAN the worst, demonstrating the effectiveness of the objective metrics to some extent. Interestingly, we see no statistically significant difference in the ratings of the StyleGAN2 loops and the (MelGAN-vocoded) freesound loops in Drumness and Preference.
Finally, we correlate the scores of the objective metrics and the subjective metrics for the 18 samples evaluated in the listening test (i.e., six samples by each GAN model). We find a 0.25–0.37 correlation between IS and the four subjective metrics, and a 0.01–0.16 negative correlation between FAD and the subjective metrics. The strongest correlation (0.37) is found between IS and Preference.
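This analysis amounts to a Pearson correlation over 18 paired scores, which can be sketched as follows. The numbers here are synthetic; the paper's actual per-sample values are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical per-sample scores for the 18 loops in the listening test
objective = rng.uniform(1.0, 10.0, size=18)              # e.g., an objective score
preference = 0.4 * objective + rng.normal(0.0, 1.0, 18)  # noisy subjective rating

r = np.corrcoef(objective, preference)[0, 1]  # Pearson correlation coefficient
print(round(float(r), 3))
```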
In this paper, we have proposed using loop generation as a benchmarking task to provide a standardized evaluation of audio-domain music generation models, taking advantage of the public availability of the large collection of loops in FSLD. Moreover, we developed customized metrics to objectively evaluate the performance of such generative models for the particular case of one-bar drum loops at 120 BPM. As references, we implemented and evaluated three recent model architectures using the dataset, and discovered that StyleGAN2 works quite well. The list of models we have evaluated is short and by no means exhaustive. We hope researchers will find this benchmark useful and consider it as part of their evaluation of new models.
This work can be extended in many other directions. First and foremost, we can extend the benchmark to cover all the loops in FSLD (and looperman). The major complexity here is the challenge of building a model that fits them all; we may need separate generative models and vocoders for different types of loops.
Second, we are certainly interested in generating loops with different tempos, rather than a fixed tempo of 120 BPM. This will require the generative models to be capable of generating variable-length output, which seems more realistic in musical audio applications.
We can also extend the benchmark to generating four-bar loops (which are not simply a one-bar loop repeated four times), as there is actually a sizable collection of 6,656 four-bar drum loops in the looperman dataset. We do not evaluate this in this paper, as the public freesound dataset does not contain many such four-bar loops.
We also want to include more objective metrics in the future, such as using the Audio Commons Audio Extractor  to evaluate the “loopness” of the generated samples, or using an automatic drum transcription model [52, 12, 7] to assess the plausibility of the created percussive patterns.
Besides the benchmarking initiative, we are interested in further improving audio-domain loop generation itself and exploring new use cases, e.g., to have a conditional generation model that gives users some control (in similar veins to [35, 38, 9]), or to aim at generating novel loops by means of a creative adversarial network (CAN) .
This research work is supported by the Ministry of Science and Technology (MOST), Taiwan, under grant number 109-2628-E-001-002-MY2.
Joint beat and downbeat tracking with recurrent neural networks. In Proc. Int. Soc. Music Information Retrieval Conf.
Proc. AAAI Conf. Artificial Intelligence.
Neural audio synthesis of musical notes with WaveNet autoencoders. In Proc. Int. Conf. Machine Learning.
MelNet: a generative model for audio in the frequency domain. arXiv preprint arXiv:1906.01083.
Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. IEEE Int. Conf. Computer Vision.