Recent years have seen progress on music generation, thanks largely to advances in machine learning. A music generation pipeline usually consists of several steps: data collection, data preprocessing, model creation, model training and model evaluation, as illustrated in fig:pipeline. While some components need to be customized for each model, others can be shared across systems. For symbolic music generation in particular, a number of datasets, representations and metrics have been proposed in the literature. As a result, an easy-to-use toolkit that implements standard versions of such routines could save a great deal of time and effort and might lead to increased reproducibility. However, such tools are challenging to develop for a variety of reasons.
First, though there are a number of publicly available symbolic music datasets, the diverse organization of these collections and the various formats used to store them present a challenge. These formats are usually designed for different purposes. Some focus on playback capability (e.g., MIDI), some are developed for music notation software (e.g., MusicXML and LilyPond), some are designed for organizing musical documents (e.g., Music Encoding Initiative (MEI)), and others are research-oriented formats that aim for simplicity and readability (e.g., MuseData and Humdrum). Researchers often have to implement their own preprocessing code for each format. Moreover, while researchers can implement their own procedures to access and process the data, issues of reproducibility due to the inconsistency of source data have been raised for audio datasets.
Second, music has hierarchy and structure, and thus different levels of abstraction can lead to different representations. Moreover, a number of music representations designed specifically for generative modeling of music have been proposed in prior work, for example, as a sequence of pitches [25, 12, 3, 31], events [26, 18, 8, 19], notes or a time-pitch matrix (i.e., a piano roll) [34, 10].
Finally, efforts have been made toward more robust objective evaluation metrics for music generation systems, as these metrics provide not only an objective way of comparing different models but also indicators for monitoring training progress in machine learning-based systems. Given the success of mir_eval in evaluating common MIR tasks, a library providing implementations of commonly used evaluation metrics for music generation systems could help improve reproducibility.
To address the above challenges, we believe a toolkit dedicated to music generation is a timely contribution to the MIR community. Hence, we present in this paper a new Python library, MusPy, for symbolic music generation. It provides essential tools for developing a music generation system, including dataset management, data I/O, data preprocessing and model evaluation.
With MusPy, we provide a statistical analysis of the eleven datasets it currently supports, with an eye to unveiling statistical differences between them. Moreover, we conduct three experiments to analyze the relative diversities and cross-dataset domain compatibility of these datasets. These results, together with the statistical analysis, provide a guide for choosing proper datasets for future research. Finally, we also show that combining multiple heterogeneous datasets could help improve generalizability of a music generation system.
2 Related Work
Few attempts, to the best of our knowledge, have been made to develop a dedicated library for music generation. The Magenta project represents the most notable example. While MusPy aims to provide fundamental routines in data collection, preprocessing and analysis, Magenta comes with a number of model instances but is tightly bound to TensorFlow. In MusPy, we leave the model creation and training to dedicated machine learning libraries, and design MusPy to be flexible in working with different machine learning frameworks.
There are several libraries for working with symbolic music. music21 is one of the most representative toolkits and targets studies in computational musicology. While music21 comes with its own corpus, MusPy does not host any dataset. Instead, MusPy provides functions to download datasets from the web, along with tools for managing different collections, which makes it easy to extend support for new datasets in the future. jSymbolic focuses on extracting statistical information from symbolic music data. While jSymbolic can serve as a powerful feature extractor for training supervised classification models, MusPy focuses on generative modeling of music and supports different commonly used representations in music generation. In addition, MusPy provides several objective metrics for evaluating music generation systems.
Related cross-dataset generalizability experiments show that pretraining on cross-domain data can improve music generation results both qualitatively and quantitatively. MusPy’s dataset management system makes it easier for us to thoroughly verify this hypothesis by examining pairwise generalizabilities between various datasets.
|Lakh MIDI Dataset (LMD) ||MIDI||>9000||174,533||misc|
|MAESTRO Dataset ||MIDI||201.21||1,282||classical|
|Wikifonia Lead Sheet Dataset ||MusicXML||198.40||6,405||misc||✓||✓|
|Essen Folk Song Database ||ABC||56.62||9,034||folk||✓||✓|
|NES Music Database||MIDI||46.11||5,278||game||✓||✓|
|Hymnal Tune Dataset ||MIDI||18.74||1,756||hymn||✓|
|Hymnal Dataset ||MIDI||17.50||1,723||hymn|
|Nottingham Database (NMD) ||ABC||10.54||1,036||folk||✓||✓|
|music21 JSBach Corpus ||MusicXML||3.46||410||classical||✓|
|JSBach Chorale Dataset ||MIDI||3.21||382||classical||✓|
MusPy is an open source Python library dedicated to symbolic music generation. fig:system presents the system diagram of MusPy. It provides a core class, the MusPy Music class, as a universal container for symbolic music. The dataset management system, I/O interfaces and model evaluation tools are then built upon this core container. We provide in fig:pipeline_example examples of data preparation and result writing pipelines using MusPy.
3.1 MusPy Music class and I/O interfaces
We aim to find a middle ground among existing formats for symbolic music and design a unified format dedicated to music generation. MIDI, as a communication protocol between musical devices, uses velocities to indicate dynamics, beats per minute (bpm) for tempo markings, and control messages for articulation, but it lacks the concepts of notes, measures and symbolic musical markings. In contrast, MusicXML, as a sheet music exchange format, has the concepts of notes, measures and symbolic musical markings and contains visual layout information, but it falls short on playback-related data. For a music generation system, however, both symbolic and playback-specific data are important. Hence, we follow MIDI’s standard for playback-related data and MusicXML’s standard for symbolic musical markings.
|Note beams and slurs||✓|
|Song/source meta data||✓||✓|
|Concept of notes||✓||✓|
In fact, the MusPy Music class naturally defines a universal format for symbolic music, which we will refer to as the MusPy format, and can be serialized into a human-readable JSON/YAML file. tab:comparison summarizes the key differences among the MIDI, MusicXML and proposed MusPy formats. Using the proposed MusPy Music class as the internal representation for music data, we then provide I/O interfaces for common formats (e.g., MIDI, MusicXML and ABC) and interfaces to other symbolic music libraries (e.g., music21, mido, pretty_midi and Pypianoroll). fig:pipeline_example(b) provides an example of a result writing pipeline using MusPy.
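To make the idea of a serializable container concrete, the following is a minimal sketch of a note/track/music hierarchy that round-trips through JSON. The class names, field names and default values are illustrative assumptions and are not taken from the MusPy source; the sketch covers only playback-style note data and omits symbolic markings.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Note:
    time: int           # onset, in time steps
    pitch: int          # MIDI note number (0-127)
    duration: int       # length, in time steps
    velocity: int = 64  # MIDI-style dynamics (0-127)


@dataclass
class Track:
    program: int = 0    # MIDI program number
    is_drum: bool = False
    notes: List[Note] = field(default_factory=list)


@dataclass
class Music:
    resolution: int = 24  # time steps per quarter note (assumed default)
    tracks: List[Track] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole object tree to a human-readable JSON string."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Music":
        """Rebuild a Music object from the output of to_json()."""
        data = json.loads(text)
        tracks = [
            Track(
                program=t["program"],
                is_drum=t["is_drum"],
                notes=[Note(**n) for n in t["notes"]],
            )
            for t in data["tracks"]
        ]
        return cls(resolution=data["resolution"], tracks=tracks)


song = Music(tracks=[Track(notes=[Note(0, 60, 24), Note(24, 64, 24)])])
restored = Music.from_json(song.to_json())
```

Because every field is a plain integer, boolean or list, the same tree could be written to YAML without changes to the classes.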
3.2 Dataset management
tab:datasets presents the list of datasets currently supported by MusPy and their comparisons. Each supported dataset comes with a class inherited from the base MusPy Dataset class. The modularized and flexible design of the dataset management system makes it easy to handle local data collections or extend support for new datasets in the future. fig:dataset_modes illustrates the two internal processing modes when iterating over a MusPy Dataset object. In addition, MusPy provides interfaces to PyTorch and TensorFlow for creating input pipelines for machine learning (see fig:pipeline_example(a) for an example).
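The text does not spell out the two processing modes, so the sketch below illustrates one plausible reading only: converting each item every time it is accessed versus converting everything once and caching the result. The class `SketchDataset`, its `preconvert` flag and the toy converter are hypothetical and are not MusPy's API.

```python
from typing import Callable, Dict, List


class SketchDataset:
    """Hypothetical base class wrapping raw items and a converter function."""

    def __init__(self, raw_items: List[str], convert: Callable[[str], list],
                 preconvert: bool = False):
        self.raw_items = raw_items
        self.convert = convert
        self.conversions = 0  # counts converter calls, for illustration
        self._cache: Dict[int, list] = {}
        if preconvert:
            # Preconverted mode: run the converter once per item up front.
            for i in range(len(raw_items)):
                self._cache[i] = self._run(i)

    def _run(self, index: int) -> list:
        self.conversions += 1
        return self.convert(self.raw_items[index])

    def __len__(self) -> int:
        return len(self.raw_items)

    def __getitem__(self, index: int) -> list:
        # On-the-fly mode: convert on every access unless a cached copy exists.
        if index in self._cache:
            return self._cache[index]
        return self._run(index)


def parse_pitches(text: str) -> list:
    """Toy converter: a space-separated string of MIDI pitches."""
    return [int(tok) for tok in text.split()]


raw = ["60 62 64", "67 65 64 62"]
lazy = SketchDataset(raw, parse_pitches)
eager = SketchDataset(raw, parse_pitches, preconvert=True)

# Two passes over each dataset, as in two training epochs.
for _ in range(2):
    lazy_items = [lazy[i] for i in range(len(lazy))]
    eager_items = [eager[i] for i in range(len(eager))]
```

An object that defines `__len__` and `__getitem__` in this way follows the same protocol that PyTorch's map-style datasets use, which is one way a framework interface could be layered on top.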
|Pitch-based||note-ons, hold, rest (support only monophonic music)|
|Event-based||note-ons, note-offs, time shifts, velocities|
|Piano-roll||or||for binary piano rolls; for piano rolls with velocities|
|Note-based||or||List of tuples|
3.3 Representations
Music has multiple levels of abstraction, and thus can be expressed in various representations. For music generation in particular, several representations designed for generative modeling of symbolic music have been proposed and used in the literature. These representations can be broadly categorized into four types: the pitch-based [25, 12, 3, 31], the event-based [26, 18, 8, 19], the note-based and the piano-roll [34, 10] representations. tab:representations presents a comparison of them. We provide in MusPy implementations of these representations and their integration with the dataset management system. fig:pipeline_example(a) provides an example of preparing training data in the piano-roll representation from the NES Music Database using MusPy.
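As an illustration of how the same notes map to two of these representations, the sketch below converts a short note list into a binary piano roll and into an event sequence of note-ons, note-offs and time shifts. The token layout (offsets 0, 128 and 256) and the maximum shift of 100 steps are assumptions made for this example, and velocity events are left out for brevity.

```python
from typing import List, Tuple

Note = Tuple[int, int, int]  # (onset, pitch, duration), in time steps

NOTE_ON, NOTE_OFF, TIME_SHIFT = 0, 128, 256  # token offsets (assumed layout)
MAX_SHIFT = 100  # longest single time-shift token (assumed)


def to_pianoroll(notes: List[Note], length: int) -> List[List[int]]:
    """Binary time-pitch matrix of shape (length, 128)."""
    roll = [[0] * 128 for _ in range(length)]
    for onset, pitch, duration in notes:
        for t in range(onset, min(onset + duration, length)):
            roll[t][pitch] = 1
    return roll


def to_events(notes: List[Note]) -> List[int]:
    """Flatten notes into note-on, note-off and time-shift tokens."""
    # Collect (time, order, token); note-offs sort before note-ons at ties.
    timeline = []
    for onset, pitch, duration in notes:
        timeline.append((onset, 1, NOTE_ON + pitch))
        timeline.append((onset + duration, 0, NOTE_OFF + pitch))
    timeline.sort()

    events, now = [], 0
    for time, _, token in timeline:
        gap = time - now
        while gap > 0:  # split long gaps into several shift tokens
            step = min(gap, MAX_SHIFT)
            events.append(TIME_SHIFT + step - 1)
            gap -= step
        events.append(token)
        now = time
    return events


melody = [(0, 60, 4), (4, 62, 4)]
roll = to_pianoroll(melody, length=8)
events = to_events(melody)
```

The piano roll has a fixed shape regardless of how many notes there are, whereas the event sequence grows with the number of notes and gaps; this difference is one reason the choice of representation matters for model design.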
3.4 Model evaluation tools
Model evaluation is another critical component in developing music generation systems. Hence, we also integrate into MusPy tools for audio rendering as well as score and piano-roll visualizations. These tools could also be useful for monitoring the training progress or demonstrating the final results. Moreover, MusPy provides implementations of several objective metrics proposed in the literature [24, 10, 33]. These objective metrics, as listed below, could be used to evaluate a music generation system by comparing the statistical difference between the training data and the generated samples, as discussed in the literature.
Pitch-related metrics—polyphony, polyphony rate, pitch-in-scale rate, scale consistency, pitch entropy and pitch class entropy.
Rhythm-related metrics—empty-beat rate, drum-in-pattern rate, drum pattern consistency and groove consistency.
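To give a sense of what such metrics compute, here is a minimal sketch of three of them on a plain list of notes. The definitions are simplified readings of the metric names (entropy in bits over a 12-bin histogram, a major scale only, and a beat counted as empty when no note sounds during it) and may differ from the definitions used in MusPy or in the cited papers.

```python
import math
from collections import Counter
from typing import List, Tuple

Note = Tuple[int, int, int]  # (onset, pitch, duration), in time steps
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}  # semitone offsets from the root


def pitch_class_entropy(notes: List[Note]) -> float:
    """Shannon entropy (bits) of the 12-bin pitch-class histogram."""
    counts = Counter(pitch % 12 for _, pitch, _ in notes)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def pitch_in_scale_rate(notes: List[Note], root: int = 0) -> float:
    """Fraction of notes whose pitch class lies in the major scale on root."""
    hits = sum((pitch - root) % 12 in MAJOR_SCALE for _, pitch, _ in notes)
    return hits / len(notes)


def empty_beat_rate(notes: List[Note], steps_per_beat: int, n_beats: int) -> float:
    """Fraction of beats during which no note sounds."""
    occupied = set()
    for onset, _, duration in notes:
        first = onset // steps_per_beat
        last = (onset + duration - 1) // steps_per_beat
        occupied.update(range(first, min(last, n_beats - 1) + 1))
    return 1 - len(occupied) / n_beats


# Four notes over four beats; the third beat is silent (assumes a non-empty list).
notes = [(0, 60, 4), (4, 64, 4), (12, 67, 4), (12, 61, 4)]
entropy = pitch_class_entropy(notes)
in_scale = pitch_in_scale_rate(notes, root=0)
empty = empty_beat_rate(notes, steps_per_beat=4, n_beats=4)
```

Computing the same quantities on training data and on generated samples, then comparing the two sets of values, is the kind of statistical comparison described above.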
To summarize, MusPy features the following:
Dataset management system for commonly used datasets with interfaces to PyTorch and TensorFlow.
Data I/O for common symbolic music formats (e.g., MIDI, MusicXML and ABC) and interfaces to other symbolic music libraries (e.g., music21, mido, pretty_midi and Pypianoroll).
Implementations of common music representations for music generation, including the pitch-based, the event-based, the piano-roll and the note-based representations.
Model evaluation tools for music generation systems, including audio rendering, score and piano-roll visualizations and objective metrics.
All source code and documentation can be found at https://github.com/salu133445/muspy.
4 Dataset Analysis
Analyzing datasets is critical in developing music generation systems. With MusPy’s dataset management system, we can easily work with different music datasets. Below we compute the statistics of three key elements of a song—length, tempo and key using MusPy, with an eye to unveiling statistical differences among these datasets. First, fig:length_dist shows the distributions of song lengths for different datasets. We can see that they differ greatly in their ranges, medians and variances.
Second, we present in fig:tempo_dist the distributions of initial tempo for datasets that come with tempo information. We can see that all of them are generally bell-shaped but with different ranges and variances. We also note that there are two peaks, and quarter notes per minute (qpm), in Lakh MIDI Dataset (LMD), which is possibly because these two values are often set as the default tempo values in music notation programs and MIDI editors/sequencers. Moreover, in Hymnal Tune Dataset, only around ten percent of songs have an initial tempo other than qpm.
Finally, fig:key_hist shows the histograms of keys for different datasets. We can see that the key distributions are rather imbalanced. Moreover, fewer than 3% of songs are in minor keys for most datasets except the music21 Corpus. In particular, LMD has the most imbalanced key distribution, which might be due to the fact that C major is often set as the default key in music notation programs and MIDI editors/sequencers. (Note that key information is stored as a meta message in a MIDI file; it does not affect playback and can therefore be unreliable.) These statistics could provide a guide for choosing proper datasets in future research.
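The per-dataset comparison of ranges, medians and variances can be sketched with the standard library alone. The dataset names and song lengths below are made-up placeholder values for illustration, not measurements from any of the supported datasets.

```python
import statistics
from typing import Dict, List


def summarize(lengths: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per-dataset range, median and (population) variance of song lengths."""
    return {
        name: {
            "min": min(values),
            "max": max(values),
            "median": statistics.median(values),
            "variance": statistics.pvariance(values),
        }
        for name, values in lengths.items()
    }


# Made-up song lengths in seconds, for illustration only.
toy_lengths = {
    "homogeneous": [30.0, 32.0, 34.0],
    "mixed": [20.0, 180.0, 400.0],
}
stats = summarize(toy_lengths)
```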
5 Experiments and Results
In this section, we conduct three experiments to analyze the relative complexities and the cross-dataset generalizabilities of the eleven datasets currently supported by MusPy (see tab:datasets). We implement four autoregressive models: a recurrent neural network (RNN), a long short-term memory (LSTM) network, a gated recurrent unit (GRU) network and a Transformer network.
5.1 Experiment settings
For the data, we use the event representation as specified in tab:representations and discard velocity events as some datasets have no velocity information (e.g., datasets using ABC format). Moreover, we also include an end-of-sequence event, leading to in total possible events. For simplicity, we downsample each song into four time steps per quarter note and fix the sequence length to , which is equivalent to four measures in time. In addition, we discard repeat information in MusicXML data and use only melodies in the Wikifonia dataset. We split each dataset into train–test–validation sets with a ratio of . For training, the models are trained to predict the next event given the previous events. We use the cross-entropy loss and the Adam optimizer. For evaluation, we randomly sample sequences of length from the test split, and compute the perplexity of these sequences. We implement the models in Python using PyTorch. For reproducibility, source code and hyperparameters are available at https://github.com/salu133445/muspy-exp.
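The text does not define perplexity explicitly; the sketch below assumes the standard definition, namely the exponential of the mean cross-entropy (in nats) between the predicted next-event distribution and the true next event. The four-event vocabulary and the probability values are toy assumptions.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-likelihood (nats) of the target events."""
    nll = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(nll) / len(nll)


def perplexity(probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(probs, targets))


# A model that is uniform over a hypothetical 4-event vocabulary ...
uniform = [[0.25] * 4 for _ in range(3)]
# ... versus one that puts most of its mass on the correct next event.
confident = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
targets = [0, 1, 2]

ppl_uniform = perplexity(uniform, targets)
ppl_confident = perplexity(confident, targets)
```

Under this definition a uniform model has a perplexity equal to the vocabulary size, which gives a natural upper reference point when reading the reported values.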
5.2 Autoregressive models on different datasets
In this experiment, we train each model on a dataset and test it on the same dataset. We present in fig:exp_perplexity the perplexities for different models on different datasets. We can see that all models show similar tendencies. In general, they achieve smaller perplexities for smaller, homogeneous datasets, but result in larger perplexities for larger, more diverse datasets. That is, the test perplexity could serve as an indicator of the diversity of a dataset. Moreover, fig:exp_hour_perplexity shows perplexities versus dataset sizes (in hours). By categorizing datasets into multi-pitch (i.e., accepting any number of concurrent notes) and monophonic datasets, we can see that the perplexity is positively correlated with the dataset size within each group.
5.3 Cross-dataset generalizability
In this experiment, we train a model on one dataset and, in addition to testing it on the same dataset, also test it on every other dataset. We present in fig:exp_cross_datasets the perplexities for each train–test dataset pair. Here are some observations:
Cross-dataset generalizability is not symmetric in general. For example, a model trained on LMD generalizes well to all other datasets, while not all models trained on other datasets generalize to LMD, which is possibly due to the fact that LMD is a large, cross-genre dataset.
Models trained on multi-pitch datasets generalize well to monophonic datasets, while models trained on monophonic datasets do not generalize to multi-pitch datasets (see the red block in fig:exp_cross_datasets).
The model trained on JSBach Chorale Dataset does not generalize to any of the other datasets (see the orange block in fig:exp_cross_datasets). This is possibly because its samples are downsampled to a quarter-note resolution, which leads to a distinct note duration distribution.
Most datasets generalize worse to NES Music Database compared to other datasets (see the green block in fig:exp_cross_datasets). This is possibly due to the fact that NES Music Database contains only game soundtracks.
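The train-on-one, test-on-all protocol and the asymmetry noted above can be reproduced on a toy scale. The sketch below deliberately swaps the autoregressive networks for an add-one-smoothed unigram model so that it stays self-contained; the two corpora and the four-event vocabulary are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

VOCAB = 4  # hypothetical vocabulary size


def train_unigram(sequences: List[List[int]]) -> List[float]:
    """Add-one-smoothed unigram distribution over VOCAB events."""
    counts = Counter(e for seq in sequences for e in seq)
    total = sum(counts.values()) + VOCAB
    return [(counts[e] + 1) / total for e in range(VOCAB)]


def perplexity(model: List[float], sequences: List[List[int]]) -> float:
    """Exponential of the mean negative log-likelihood over all events."""
    events = [e for seq in sequences for e in seq]
    nll = -sum(math.log(model[e]) for e in events) / len(events)
    return math.exp(nll)


# Toy corpora: "broad" uses every event, "narrow" only events 0 and 1.
datasets: Dict[str, List[List[int]]] = {
    "broad": [[0, 1, 2, 3], [3, 2, 1, 0], [0, 2, 1, 3]],
    "narrow": [[0, 1, 0, 1], [1, 0, 0, 1]],
}

# matrix[train][test] holds the perplexity for each train-test pair.
models = {name: train_unigram(seqs) for name, seqs in datasets.items()}
matrix = {
    train: {test: perplexity(model, seqs) for test, seqs in datasets.items()}
    for train, model in models.items()
}
```

Even in this toy setting the model trained on the broad corpus transfers to the narrow one more gracefully than the reverse, which mirrors the direction of the asymmetry described for large, diverse datasets.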
5.4 Effects of combining heterogeneous datasets
From fig:exp_cross_datasets we can see that LMD has the best generalizability, possibly because it is large, diverse and cross-genre. However, a model trained on LMD does not generalize well to NES Music Database (see the brown block in the close-up of fig:exp_cross_datasets). We are thus interested in whether combining multiple heterogeneous datasets could help improve generalizability.
We combine all eleven datasets listed in tab:datasets into one large unified dataset. Since these datasets differ greatly in their sizes, simply concatenating the datasets might lead to a severe imbalance problem and a bias toward the largest dataset. Hence, we also consider a version that adopts stratified sampling during training. Specifically, to acquire a data sample in the stratified dataset, we uniformly choose one dataset out of the eleven datasets, and then randomly pick one sample from that dataset. Note that stratified sampling is disabled at test time.
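The two-step sampling procedure described above can be sketched as follows, next to the plain concatenation baseline. The two toy collections, their 99:1 size ratio and the fixed random seed are assumptions chosen only to make the imbalance visible.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def stratified_sample(datasets: Dict[str, List[str]],
                      rng: random.Random) -> Tuple[str, str]:
    """Pick a dataset uniformly, then a sample uniformly within it."""
    name = rng.choice(sorted(datasets))
    return name, rng.choice(datasets[name])


# Toy collections with a 99:1 size imbalance.
toy = {
    "large": [f"L{i}" for i in range(990)],
    "small": [f"S{i}" for i in range(10)],
}
rng = random.Random(0)

# Stratified: each draw first chooses the source dataset.
stratified = Counter(stratified_sample(toy, rng)[0] for _ in range(10_000))

# Baseline: draw uniformly from the simple concatenation of all samples.
pool = [(name, s) for name, items in toy.items() for s in items]
pooled = Counter(rng.choice(pool)[0] for _ in range(10_000))
```

With stratified sampling each source contributes roughly half of the draws, whereas under concatenation the small collection appears in only about one draw in a hundred; the cost is that samples from the small collection are repeated far more often.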
We also include in Figures 8, 9 and 10 the results for these two datasets. We can see from fig:exp_cross_datasets that combining datasets from different sources improves the generalizability of the model. This is consistent with the earlier finding that models trained on certain cross-domain datasets generalize better to other unseen datasets. Moreover, stratified sampling alleviates the source imbalance problem by reducing perplexities on most datasets, at the cost of an increased perplexity on LMD.
6 Conclusion
We have presented MusPy, a new toolkit that provides essential tools for developing music generation systems. We discussed the designs and features of the library, along with data pipeline examples. With MusPy’s dataset management system, we conducted a statistical analysis and experiments on the eleven currently supported datasets to analyze their relative diversities and cross-dataset generalizabilities. These results could help researchers choose appropriate datasets in future research. Finally, we showed that combining heterogeneous datasets could help improve generalizability of a machine learning model.
-  (2016) TensorFlow: a system for large-scale machine learning. In Proc. of the 12th USENIX Symp. on Operating Systems Design and Implementation (OSDI), Cited by: §2, §3.2.
-  (2019) Mirdata: software for reproducible usage of datasets. In Proc. of the 20th International Society for Music Information Retrieval Conference (ISMIR), Cited by: §1.
-  (2012) Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. In Proc. of the 29th International Conference on Machine Learning (ICML), Cited by: §1, §3.3, Table 1.
-  (2017) Deep learning techniques for music generation: a survey. arXiv preprint arXiv:1709.01620. Cited by: §1, §3.3.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §5.
-  (2010) Music21: a toolkit for computer-aided musicology and symbolic music data. In Proc. of the 11th International Society for Music Information Retrieval Conference (ISMIR), Cited by: §2, §3.1, Table 1.
-  (1993) A brief survey of music representation issues, techniques, and systems. Computer Music Journal 17 (3), pp. 20–30. Cited by: §1.
-  (2019) LakhNES: improving multi-instrumental music generation with cross-domain pre-training. In Proc. of the 20th International Society for Music Information Retrieval Conference (ISMIR), Cited by: §1, §2, §3.3, §5.4.
-  (2018) The NES music database: a multi-instrumental dataset with expressive performance attributes. In Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR), Cited by: Table 1.
- MuseGAN: multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Proc. of the 32nd AAAI Conference on Artificial Intelligence (AAAI), Cited by: §1, §3.3, §3.4.
-  (2018) Pypianoroll: open source Python package for handling multitrack pianorolls. In Late-Breaking Demos of the 19th International Society for Music Information Retrieval Conference (ISMIR), Cited by: §3.1.
-  (2002) Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Proc. of the IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756. Cited by: §1, §3.3.
-  (2001) MusicXML for notation and analysis. In The Virtual Score: Representation, Retrieval, Restoration, W. B. Hewlett and E. Selfridge-Field (Eds.), pp. 113–124. Cited by: §1.
-  (2011) The music encoding initiative as a document-encoding framework. In Proc. of the 12th International Society for Music Information Retrieval Conference (ISMIR), Cited by: §1.
-  (2019) Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proc. of the 7th International Conference on Learning Representations (ICLR), Cited by: Table 1.
-  (1997) MuseData: multipurpose representation. In Beyond MIDI: The Handbook of Musical Codes, E. Selfridge-Field (Ed.), pp. 402–447. Cited by: §1.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §5.
-  (2019) Music transformer: generating music with long-term structure. In Proc. of the 7th International Conference on Learning Representations (ICLR), Cited by: §1, §3.3.
-  (2020) Pop music transformer: generating music with rhythm and harmony. arXiv preprint arXiv:2002.00212. Cited by: §1, §3.3.
-  (1997) Humdrum and Kern: selective feature encoding. In Beyond MIDI: The Handbook of Musical Codes, E. Selfridge-Field (Ed.), pp. 375–401. Cited by: §1.
-  (2014) Adam: a method for stochastic optimization. In Proc. of the 3rd International Conference on Learning Representations (ICLR), Cited by: §5.1.
- Torchvision: the machine-vision package of torch. In Proc. of the 18th ACM International Conference on Multimedia, Cited by: §3.2.
-  (2006) JSymbolic: a feature extractor for MIDI files. In Proc. of the 2006 International Computer Music Conference (ICMC), Cited by: §2.
-  (2016) C-RNN-GAN: continuous recurrent neural networks with adversarial training. In NeurIPS Workshop on Constructive Machine Learning, Cited by: §1, §3.3, §3.4.
-  (1994) Neural network music composition by prediction: exploring the benefits of psychoacoustic constraints and multi-scale processing. Connection Science 6, pp. 247–280. Cited by: §1, §3.3.
-  (2018) This time with feeling: learning expressive musical performance. Neural Computing and Applications 32. Cited by: §1, §3.3.
-  (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS), pp. 8024–8035. Cited by: §3.2.
-  (2014) Intuitive analysis, creation and manipulation of MIDI data with pretty_midi. In Late-Breaking Demos of the 15th International Society for Music Information Retrieval Conference (ISMIR), Cited by: §3.1.
-  (2014) Mir_eval: a transparent implementation of common MIR metrics. In Proc. of the 15th International Society for Music Information Retrieval Conference (ISMIR), Cited by: §1.
-  (2016) Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching. Ph.D. Thesis, Columbia University. Cited by: Table 1.
- A hierarchical latent vector model for learning long-term structure in music. In Proc. of the 35th International Conference on Machine Learning (ICML), Cited by: §1, §3.3.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS), Cited by: §5.
-  (2020) The jazz transformer on the front line: exploring the shortcomings of AI-composed music through quantitative measures. In Proc. of the 21st International Society for Music Information Retrieval Conference (ISMIR), Cited by: §3.4.
-  (2017) MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. In Proc. of the 18th International Society for Music Information Retrieval Conference (ISMIR), Cited by: §1, §3.3.
-  (2018) On the evaluation of generative models in music. Neural Computing and Applications 32, pp. 4773–4784. Cited by: §1, §3.4.
-  Essen folk song database. Note: https://ifdo.ca/~seymour/runabc/esac/esacdatabase.html Cited by: Table 1.
-  Hymnal. Note: https://www.hymnal.net/ Cited by: Table 1.
-  LilyPond. Note: https://lilypond.org/ Cited by: §1.
-  Magenta. Note: https://magenta.tensorflow.org/ Cited by: §2.
-  Mido: MIDI objects for Python. Note: https://github.com/mido/mido Cited by: §3.1.
-  Nottingham database. Note: https://ifdo.ca/~seymour/nottingham/nottingham.html Cited by: Table 1.
-  TensorFlow datasets. Note: https://www.tensorflow.org/datasets Cited by: §3.2.
-  Wikifonia. Note: http://www.wikifonia.org/ Cited by: Table 1.