Deep learning has rapidly become the state-of-the-art approach for music generation (Briot et al., 2017). However, training a deep model typically requires a large training set, and it is well known that the performance of deep learning systems scales with more data, even when the data is noisy. When training a model to emulate the style of a particular composer, however, the size of the dataset is inherently limited by the number of compositions by that musician.
In this paper, we present augmentative generation (Aug-Gen), a method of dataset augmentation for any music generation system trained on a resource-constrained domain. The key intuition of this method is that the training data for a generative system can be augmented by examples the system produces during the course of training, provided these examples are of sufficiently high quality and variety. To the best of our knowledge, our paper is the first to introduce a framework in which a generative system continuously generates output and adds it to its training dataset.
We apply Aug-Gen to chorale generation in the style of J.S. Bach. We perform experiments using a Transformer model (Vaswani et al., 2017) as the music generation system. To select model output to include in the training data, we use the grading function in (Fang et al., 2020), which evaluates a given chorale along nine musical features important for Bach chorales. This allows us to incorporate non-differentiable domain knowledge (e.g. 18th century counterpoint rules) into the training procedure.
Prior work has similarly used generated examples to augment limited training data. However, these works aim to train a classifier rather than a generative system like ours. Moreover, in those approaches the informativeness of candidate training examples is measured relative to the model, whereas in our work, selection of training examples is done by an external critic that does not depend on the model's current performance.
2 Augmentative Generation
In Aug-Gen, the true dataset is first split into training and validation data. During model training, the training dataset is continuously augmented by model output, while the validation data is fixed and is used to determine when to terminate training. Each epoch of training includes a generation step and a training step. In the generation step, examples are generated and graded by the grading function. We add to the training dataset all generated output that (1) passes a pre-determined quality threshold, and (2) is sufficiently diverse. In this work, we use a simple uniqueness criterion (i.e. the chorale must not have been seen before) as the diversity requirement. Next, in the training step of the epoch, we train the model on the augmented dataset. This consists of training on a fixed number of fixed-size batches, so that the amount of training in each epoch is independent of the size of the entire training dataset. These steps continue until the validation loss ceases to improve. Algorithm 1 shows the Aug-Gen algorithm in pseudocode. Note that the grading function returns a measure of distance from Bach chorales, so lower grades represent better chorales.
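The loop above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function names (`generate`, `grade`, `train_step`, `val_loss`) and all hyperparameter values are placeholders for whatever model and grading function are plugged in.

```python
import random

def aug_gen_train(bach_chorales, grade, generate, train_step, val_loss,
                  threshold, gens_per_epoch=5, batches_per_epoch=10,
                  batch_size=4, patience=3):
    """Sketch of the Aug-Gen loop. `grade` returns a distance from real
    Bach chorales (lower is better); `generate`, `train_step`, and
    `val_loss` stand in for the generative model's methods."""
    random.shuffle(bach_chorales)
    split = int(0.85 * len(bach_chorales))
    train_data, val_data = bach_chorales[:split], bach_chorales[split:]
    seen = set(train_data)
    best_val, epochs_since_best = float("inf"), 0

    while epochs_since_best < patience:
        # Generation step: keep output that passes the quality threshold
        # and satisfies the uniqueness (diversity) criterion.
        for _ in range(gens_per_epoch):
            chorale = generate()
            if grade(chorale) <= threshold and chorale not in seen:
                train_data.append(chorale)
                seen.add(chorale)
        # Training step: a fixed number of fixed-size batches, so the
        # amount of training per epoch does not grow with the dataset.
        for _ in range(batches_per_epoch):
            batch = random.sample(train_data, min(batch_size, len(train_data)))
            train_step(batch)
        # Terminate when the loss on the fixed validation set stops improving.
        v = val_loss(val_data)
        if v < best_val:
            best_val, epochs_since_best = v, 0
        else:
            epochs_since_best += 1
    return train_data
```

Note that only the training split is augmented; holding the validation set fixed is what lets validation loss remain a meaningful stopping criterion as the training set grows.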
3 Experiments
In our experiments (code available at https://github.com/asdfang/constraint-transformer-bach), we evaluated the effectiveness of Aug-Gen in improving the output quality of a Transformer model trained to generate Bach-style chorales. We encoded Bach chorales in XML notation using music21 (Cuthbert and Ariza, 2010), and used the same data representation as in (Hadjeres et al., 2017). Our generative model consists of a Transformer network with relative attention (Huang et al., 2019).
3.1 Comparison of Training Methods
We compared three training methods on the same model architecture. In each method, we initialized the dataset to the set of 351 Bach chorales and split it into training and validation sets. In the generation step of each training epoch, we generated a batch of chorales, and included a generated example in the training dataset if it passed the threshold and was unique. In the training step, we trained on randomly selected fixed-size batches from the full training dataset. We trained each model for the same number of epochs.
The three models have equivalent hyperparameters and differ only in the threshold used for including generated chorales in the training dataset: (1) Aug-Gen, whose threshold is set at the third quartile of Bach grades, so that it includes only generated chorales that receive a better grade than the worst quarter of Bach chorales; (2) Baseline-none, whose threshold admits no generated chorales, equivalent to training a model on only Bach chorales; (3) Baseline-all, whose threshold admits all generated chorales, regardless of quality.
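A quartile-based threshold of this kind can be computed directly from the grades of the real chorales. The sketch below uses only the standard library; the function names are illustrative, not from the paper.

```python
from statistics import quantiles

def third_quartile_threshold(bach_grades):
    """Q3 of the Bach grade distribution. Since grades are distances
    (lower is better), a generated chorale passing this threshold is
    graded better than the worst quarter of real Bach chorales."""
    q1, q2, q3 = quantiles(bach_grades, n=4)
    return q3

def select(generated, grade, threshold):
    # Keep only generations whose grade beats the threshold.
    return [c for c in generated if grade(c) <= threshold]
```

Baseline-none and Baseline-all then correspond to degenerate thresholds that no generation, or every generation, can pass.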
3.2 Analysis of Training
Figure 1 shows the grade distribution of output generated by Aug-Gen during each epoch of training. We see that model improvement is clearly reflected in the grading function, suggesting that the grading function can be used to assess model quality independently of the loss function. The first generated chorale is added to the dataset early in training; by the final epoch, generated examples make up a substantial fraction of the training dataset. Baseline-none reaches its lowest validation loss at an earlier epoch than Aug-Gen does. This suggests Aug-Gen allows a model to be trained for longer without overfitting.
3.3 Grade Distribution
To compare the three methods, we used each model's checkpoint from the epoch that achieved the lowest validation loss. Figure 2 shows the grade distribution of the 351 Bach chorales and 351 generations from each method. We see that a threshold that selects only high-quality generations results in a tighter grade distribution that more closely resembles Bach's, compared to thresholds that select all or no generations for training.
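"Tighter" can be made concrete by comparing the spread of each grade distribution, for instance via the interquartile range. This is a hypothetical helper for readers reproducing the comparison, not the paper's own analysis code:

```python
from statistics import quantiles, median

def summarize(grades):
    """Median and interquartile range of a grade distribution;
    a smaller IQR indicates a tighter distribution."""
    q1, q2, q3 = quantiles(grades, n=4)
    return {"median": median(grades), "iqr": q3 - q1}
```

Comparing `summarize(bach_grades)` against the summaries for each method's generations then quantifies how closely each grade distribution resembles Bach's.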
4 Conclusion
Our experimental evidence suggests that Aug-Gen, paired with an effective grading function, allows for longer training and results in better generative output. Listening to generated chorales indicates that remaining errors tend to lie along dimensions not measured by the grading function, including excessive modulation, weak metric structure, and unmusical repetition. Future work includes improving the grading function to account for these issues, exploring richer measures of diversity within a dataset, applying Aug-Gen to different models and musical domains, and devising other training methods that utilize generated music data.
References
- Anaby-Tavor et al. (2020). Not enough data? Deep learning to the rescue! In Association for the Advancement of Artificial Intelligence (AAAI).
- Briot et al. (2017). Deep learning techniques for music generation – a survey. arXiv:1709.01620.
- Cuthbert and Ariza (2010). music21: a toolkit for computer-aided musicology and symbolic music data. In Conference of the International Society for Music Information Retrieval (ISMIR), pp. 637–642.
- Fang et al. (2020). Bach or Mock? A grading function for chorales in the style of J.S. Bach. In Machine Learning for Media Discovery (ML4MD) Workshop at the International Conference on Machine Learning (ICML).
- Vector quantized contrastive predictive coding for template-based music generation. arXiv preprint arXiv:2004.10120.
- Hadjeres et al. (2017). DeepBach: a steerable model for Bach chorales generation. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1362–1371.
- Huang et al. (2019). An improved relative self-attention mechanism for transformer with application to music generation. In Proceedings of the 35th International Conference on Machine Learning (ICML).
- Active generative adversarial network for image classification. In Association for the Advancement of Artificial Intelligence (AAAI).
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).