Deep generative systems that learn probabilistic models from a corpus of existing music do not explicitly encode knowledge of a musical style, unlike traditional rule-based systems (Ebcioğlu, 1988; Nierhaus, 2009). Thus, it can be difficult to determine whether deep models generate stylistically correct output without expert evaluation. However, human evaluation is expensive and time-consuming, which limits when it can be performed during the research cycle. Moreover, the experimental setup and execution vary greatly across human-subject studies, hindering comparability of results. Therefore, there is a need for automatic, interpretable, and musically motivated evaluation measures of generated music. Such grading functions can allow researchers to efficiently evaluate their models, offer insight into the musical strengths and limitations of generated output, and serve as a consistent benchmark for comparing different models.
In this paper, we introduce a grading function that evaluates four-part chorales in the style of J.S. Bach along important musical features. The Bach chorales represent a canonical dataset for music generation models that has been used in multiple prior works (Liang et al., 2017; Huang et al., 2017; Hadjeres et al., 2017) because of its size and stylistic consistency. We use the grading function to evaluate the output of a Transformer model, and show that the function is interpretable and outperforms human experts at discriminating Bach chorales from model-generated ones.
|S Intervals||A Intervals||T Intervals||B Intervals||
|Bach||0.24 (0.15)||0.23 (0.14)||0.0 (0.69)||0.41 (0.2)||0.47 (0.28)||0.49 (0.23)||0.53 (0.24)||0.69 (0.4)||1.29 (0.88)||4.91 (1.63)|
|Generated||0.37 (0.22)||0.26 (0.14)||2.16 (3.22)||0.54 (0.31)||0.53 (0.35)||0.71 (0.34)||0.73 (0.38)||0.89 (0.68)||1.86 (2.81)||8.94 (4.64)|
Table 1: Median value (standard deviation) of every feature in the grading function, as well as the overall grade, for Bach chorales and generated chorales. Lower values represent better chorales. The model struggles with avoiding parallelisms.
2 Grading Function for Four-Part Chorales
Given a four-part chorale, our grading function outputs a real-valued grade; the code is available at https://github.com/asdfang/constraint-transformer-bach/tree/master/Grader. We represent a chorale as a set of distributions, each corresponding to a musical feature important for evaluating Bach-style chorales. We implement our grading function using music21 (Cuthbert and Ariza, 2010).
For each feature (described in Section 2.1), we use the Wasserstein metric (Rüschendorf, 1985) to measure the distance between the distribution of the given chorale and the ground-truth distribution over the set of true Bach chorales. By taking a weighted sum of the Wasserstein distances over all the features (Eq. 1), we obtain the overall grade for a chorale. Note that the output of the grading function is non-negative, and a lower grade represents a better chorale.
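The weighted sum described above can be sketched in a few lines. This is a minimal illustration rather than the released implementation: it assumes every feature is a discrete distribution over numeric values stored as a `{value: probability}` dictionary, and the names `wasserstein_1d`, `grade`, and the fallback weight of 1.0 are hypothetical. In practice a library routine such as `scipy.stats.wasserstein_distance` could replace the hand-written distance.

```python
def wasserstein_1d(p, q):
    """1-D Wasserstein distance between two discrete distributions.

    `p` and `q` map numeric values to probabilities (assumed to sum to 1).
    The distance is the area between the two cumulative distributions.
    """
    support = sorted(set(p) | set(q))
    distance = 0.0
    cdf_gap = 0.0  # running difference between the two CDFs
    for left, right in zip(support, support[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        distance += abs(cdf_gap) * (right - left)
    return distance


def grade(chorale, bach, weights=None):
    """Weighted sum of per-feature Wasserstein distances (lower is better).

    `chorale` and `bach` map feature names to distributions. Features
    missing from `weights` fall back to 1.0, which is an assumption made
    for this sketch only.
    """
    weights = weights or {}
    return sum(
        weights.get(name, 1.0) * wasserstein_1d(chorale[name], bach[name])
        for name in bach
    )
```

Because each distance is non-negative, the grade is zero only when every feature distribution of the chorale matches the reference distribution exactly.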
2.1 Features
In this section, we describe each feature used to represent a chorale (or set of chorales). Every feature uses the same default weight unless stated otherwise.
2.1.1 Pitch
The pitch distribution is the distribution of a chorale’s pitches in scale degrees. We consider enharmonic spellings as distinct, but not octave displacements. For a concrete example, if a chorale in C Major had 60 C’s, 25 F’s, and 15 G’s, the probabilities for $\hat{1}$ (“scale degree 1”), $\hat{4}$, and $\hat{5}$ are $0.6$, $0.25$, and $0.15$, respectively. The pitch distribution feature evaluates a Bach-like usage of tonality, distinguishing pieces that are too chromatic (e.g. twelve-tone pieces) and ones that are too stagnant (e.g. never using any chromaticism).
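The C Major example can be reproduced with a short sketch. It assumes note names without accidentals or octaves and a hypothetical lookup table `C_MAJOR_DEGREES`; a fuller version would need chromatic scale degrees to keep enharmonic spellings distinct.

```python
from collections import Counter

# Scale degree of each natural note name in C major (illustrative only).
C_MAJOR_DEGREES = {"C": 1, "D": 2, "E": 3, "F": 4, "G": 5, "A": 6, "B": 7}


def pitch_distribution(note_names, degrees=C_MAJOR_DEGREES):
    """Map note names to scale degrees and normalize counts to probabilities."""
    counts = Counter(degrees[name] for name in note_names)
    total = sum(counts.values())
    return {degree: count / total for degree, count in counts.items()}


notes = ["C"] * 60 + ["F"] * 25 + ["G"] * 15
print(pitch_distribution(notes))  # {1: 0.6, 4: 0.25, 5: 0.15}
```

Octave displacements collapse automatically here because the note names carry no octave information.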
2.1.2 Rhythm
The rhythm distribution is the distribution of note lengths in units of quarter notes, e.g. eighth notes are $0.5$ units and quarter notes are $1$ unit. This feature serves to measure whether chorales use rhythm like Bach does: eighths and quarters as the main body, and others for decoration and variety.
2.1.3 Intervals
The interval distribution is the distribution of directed melodic interval sizes, i.e. ascending and descending intervals of the same distance are treated as distinct. Each voice (soprano, alto, tenor, bass) serves a different musical function; specifically, melodies in soprano parts have the most intervallic variety, bass parts leap more frequently for harmony, and tenor and alto parts tend to employ mostly small intervals. Therefore, we measure the interval distribution separately for each voice, for a total of four interval distributions.
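A per-voice interval distribution might be computed as follows. The sketch assumes each voice is a list of MIDI pitch numbers and measures intervals in semitones; both choices, along with the function names, are assumptions made for illustration, and the unit used in the actual grader may differ.

```python
from collections import Counter


def interval_distribution(midi_pitches):
    """Distribution of directed melodic intervals (in semitones) for one voice.

    Positive values ascend and negative values descend, so +2 and -2
    are counted as different intervals.
    """
    steps = [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]
    total = len(steps)
    if total == 0:
        return {}
    return {step: count / total for step, count in Counter(steps).items()}


def per_voice_interval_distributions(voices):
    """One distribution per voice; `voices` maps a voice name to its pitches."""
    return {name: interval_distribution(pitches) for name, pitches in voices.items()}
```

Calling `per_voice_interval_distributions` with soprano, alto, tenor, and bass entries yields the four distributions described above.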
2.1.4 Harmonic qualities
The harmonic qualities distribution describes the usage of vertical harmony by keeping only the quality of each chord, e.g. “D Major” would be reduced to “major.” This feature also helps encourage a Bach-like usage of 18th-century tonality, in which major, minor, and dominant-seventh chords form the majority.
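Reducing a sonority to its quality can be illustrated with a small pitch-class lookup. This is a simplified sketch that assumes chords are given as MIDI pitches and recognizes only four qualities; the table `QUALITIES` and the function `chord_quality` are hypothetical names, and a toolkit such as music21 offers richer chord analysis.

```python
# Interval patterns (semitones above the root) for a few chord qualities.
QUALITIES = {
    frozenset({0, 4, 7}): "major",
    frozenset({0, 3, 7}): "minor",
    frozenset({0, 3, 6}): "diminished",
    frozenset({0, 4, 7, 10}): "dominant-seventh",
}


def chord_quality(midi_pitches):
    """Return the quality of a sonority, ignoring its root and inversion."""
    pitch_classes = {pitch % 12 for pitch in midi_pitches}
    # Try every pitch class as the root until a known pattern matches.
    for root in sorted(pitch_classes):
        pattern = frozenset((pc - root) % 12 for pc in pitch_classes)
        if pattern in QUALITIES:
            return QUALITIES[pattern]
    return "other"
```

For instance, the pitches D, F-sharp, A, D (`[62, 66, 69, 74]`) reduce to "major" regardless of how the chord is voiced.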
2.1.5 Parallel errors
The parallel errors distribution is the distribution of occurrences of the hallmark part-writing errors: parallel fifths and octaves (including unisons) in similar and contrary motion. What matters is not only the relative frequency of parallel fifths versus octaves, but also the count of these errors relative to the total number of notes. Therefore, the Wasserstein distance for this feature is multiplied by a weight that is large if the given chorale has a large error-to-note ratio compared to real Bach chorales, thereby penalizing the chorale.
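Detecting these errors between one pair of voices can be sketched as below. The sketch assumes the two voices are aligned lists of MIDI pitches with one entry per time step, and the function name `parallel_errors` is hypothetical; it does not reproduce the weighting by error-to-note ratio.

```python
def parallel_errors(upper, lower):
    """Find parallel fifths and octaves/unisons between two aligned voices.

    `upper` and `lower` are equal-length lists of MIDI pitches. Intervals
    are reduced modulo 12, so a fifth moving to a twelfth by contrary
    motion is caught as well. Returns (time step, error name) tuples.
    """
    names = {7: "fifth", 0: "octave"}
    errors = []
    for i in range(1, len(upper)):
        before = abs(upper[i - 1] - lower[i - 1]) % 12
        after = abs(upper[i] - lower[i]) % 12
        # Repeated notes are not an error: both voices must actually move.
        both_move = upper[i] != upper[i - 1] and lower[i] != lower[i - 1]
        if both_move and before == after and before in names:
            errors.append((i, names[before]))
    return errors
```

Running the function over all six voice pairs of a four-part chorale and counting the results would give the error counts that feed this feature.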
2.1.6 Repeated sequences
The repeated sequence distribution is the distribution of the lengths (in units of quarter notes) of sequences that contain at least two notes and appear at least twice in the chorale. This feature promotes a Bach-like handling of recurring motifs and intentional musical repetition. To identify repeated sequences, we use the dynamic programming algorithm of Hsu et al. (1998).
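To make the definition concrete, the following sketch finds repeated sequences by brute force. It is not the dynamic programming algorithm of Hsu et al. (1998): it simply counts every contiguous subsequence, allows overlapping occurrences, and also reports sub-sequences of longer repeats. The input format, a list of `(midi_pitch, quarter_length)` tuples for one voice, is an assumption.

```python
from collections import Counter


def repeated_sequence_lengths(notes, min_notes=2):
    """Lengths (in quarter notes) of note sequences that occur at least twice.

    `notes` is a list of (midi_pitch, quarter_length) tuples for one voice.
    """
    counts = Counter(
        tuple(notes[start:start + size])
        for size in range(min_notes, len(notes))
        for start in range(len(notes) - size + 1)
    )
    return sorted(
        sum(duration for _, duration in sequence)
        for sequence, count in counts.items()
        if count >= 2
    )
```

The enumeration grows quadratically with the number of notes, which is acceptable for a single chorale but is one reason a dedicated pattern-finding algorithm is preferable at scale.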
We now show that the grading function provides interpretable output and is a promising substitute for human evaluation. We used the grading function to evaluate the output of a Transformer model (Vaswani et al., 2017) with relative attention (Huang et al., 2018) trained on a corpus of 351 Bach chorales, using the same data representation as in (Hadjeres et al., 2017).
The grade distribution for Bach chorales and generated chorales is very well-separated with a Kolmogorov–Smirnov test p-value of e. In Table 1, we compare the median value of every feature in the grading function. Generated chorales do worse than Bach chorales in every feature.
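For readers unfamiliar with the test, the two-sample Kolmogorov–Smirnov statistic is the largest gap between the empirical cumulative distributions of the two sets of grades. The sketch below computes only that statistic with a hypothetical helper `ks_statistic`; obtaining a p-value would require a statistics library, for instance `scipy.stats.ks_2samp`.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute difference between the empirical CDFs
    of the two samples, a value between 0 and 1.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(value <= x for value in a) / len(a)
        cdf_b = sum(value <= x for value in b) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A statistic close to 1 means the two grade distributions barely overlap, while a statistic of 0 means the empirical distributions coincide.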
To further show the grading function’s interpretability, we display a badly graded generated chorale in Figure 1. We see especially large distances for its parallel error and repeated sequence features. Indeed, the grading function automatically found six total parallel errors and identified an abnormally long sequence of repeated quarter notes (the repeated E’s in measures 1–3 of the alto voice).
To compare our grading function to human performance, we collected responses to a paired discrimination test. We assessed the musical expertise of our participants through a series of pre-test questions, and assigned them to one of three groups: novice, intermediate, and expert. In the paired discrimination test, we presented three pairs of audio examples representing complete chorales, one Bach and one generated, and asked participants to select the one composed by Bach. In Figure 2, we compare the human pick to selecting the chorale that receives a better grade. We find that the grading function achieves 92.6% accuracy, outperforming human experts at 86.7%.
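The grading function's decision rule in this test amounts to choosing the chorale with the lower grade in each pair. A minimal sketch, with a hypothetical function name and made-up grades that are not taken from the study:

```python
def grader_accuracy(pairs):
    """Fraction of pairs in which the Bach chorale gets the better grade.

    `pairs` is a list of (bach_grade, generated_grade) tuples. A pair
    counts as correct when the Bach grade is strictly lower.
    """
    correct = sum(bach < generated for bach, generated in pairs)
    return correct / len(pairs)


# Illustrative grades only: the third pair is misclassified.
example_pairs = [(4.0, 9.0), (5.0, 6.5), (7.0, 6.0), (3.0, 8.0)]
print(grader_accuracy(example_pairs))  # 0.75
```

Ties are counted as incorrect here, which is a conservative choice for this sketch.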
References
- Allan and Williams (2005). Harmonising chorales by probabilistic inference. In International Conference on Neural Information Processing Systems (NIPS).
- Boulanger-Lewandowski et al. (2012). Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription.
- Cuthbert and Ariza (2010). music21: A toolkit for computer-aided musicology and symbolic music data. In Conference of the International Society of Music Information Retrieval (ISMIR), pp. 637–642.
- Ebcioğlu (1988). An expert system for harmonizing four-part chorales. Computer Music Journal 12.
- Hadjeres et al. (2017). DeepBach: a steerable model for Bach chorales generation. In Proceedings of the 34th International Conference on Machine Learning, pp. 1362–1371.
- Hsu et al. (1998). Efficient repeating pattern finding in music databases. In International Conference on Information and Knowledge Management, New York, NY, USA, pp. 281–288.
- Huang et al. (2017). Counterpoint by convolution. In Conference of the International Society of Music Information Retrieval (ISMIR).
- Huang et al. (2018). An improved relative self-attention mechanism for transformer with application to music generation. CoRR abs/1809.04281.
- Liang et al. (2017). Automatic stylistic composition of Bach chorales with deep LSTM. In Conference of the International Society of Music Information Retrieval (ISMIR).
- Nierhaus (2009). Algorithmic composition: paradigms of automated music generation. Mathematics and Statistics, Springer Vienna.
- Rüschendorf (1985). The Wasserstein distance and approximation theorems. Probability Theory and Related Fields 70 (1), pp. 117–129.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.).