Bach or Mock? A Grading Function for Chorales in the Style of J.S. Bach

06/23/2020 · Alexander Fang, et al.

Unlike traditional rule-based systems, deep generative systems that learn probabilistic models from a corpus of existing music do not explicitly encode knowledge of a musical style. Thus, it can be difficult to determine whether deep models generate stylistically correct output without expert evaluation, which is expensive and time-consuming. Therefore, there is a need for automatic, interpretable, and musically-motivated evaluation measures of generated music. In this paper, we introduce a grading function that evaluates four-part chorales in the style of J.S. Bach along important musical features. We use the grading function to evaluate the output of a Transformer model, and show that the function is interpretable and outperforms human experts at discriminating Bach chorales from model-generated ones.


1 Introduction

Unlike traditional rule-based systems (Ebcioğlu, 1988; Nierhaus, 2009), deep generative systems that learn probabilistic models from a corpus of existing music do not explicitly encode knowledge of a musical style. Thus, it can be difficult to determine whether deep models generate stylistically correct output without expert evaluation. However, human evaluation is expensive and time-consuming, limiting when it can be performed during the research cycle. Moreover, the experimental setup and execution vary greatly across human subject studies, hindering comparability of results. Therefore, there is a need for automatic, interpretable, and musically-motivated evaluation measures of generated music. Such grading functions can allow researchers to efficiently evaluate their models, shed insight into the musical strengths and limitations of generated output, and serve as a consistent benchmark for comparing different models.

In this paper, we introduce a grading function that evaluates four-part chorales in the style of J.S. Bach along important musical features. The Bach chorales represent a canonical dataset for music generation models that has been used in multiple prior works (Liang et al., 2017; Huang et al., 2017; Hadjeres et al., 2017), due to the dataset's size and stylistic consistency. We use the grading function to evaluate the output of a Transformer model, and show that the function is interpretable and outperforms human experts at discriminating Bach chorales from model-generated ones.

Feature             Bach         Generated
Note                0.24 (0.15)  0.37 (0.22)
Rhythm              0.23 (0.14)  0.26 (0.14)
Parallel Errors     0.0 (0.69)   2.16 (3.22)
Harmonic Quality    0.41 (0.2)   0.54 (0.31)
S Intervals         0.47 (0.28)  0.53 (0.35)
A Intervals         0.49 (0.23)  0.71 (0.34)
T Intervals         0.53 (0.24)  0.73 (0.38)
B Intervals         0.69 (0.4)   0.89 (0.68)
Repeated Sequence   1.29 (0.88)  1.86 (2.81)
Overall Grade       4.91 (1.63)  8.94 (4.64)

Table 1: The median value (standard deviation) for every feature in the grading function, as well as the overall grade, for Bach chorales and generated chorales. Lower values represent better chorales. We can see that the model struggles with avoiding parallelisms.

2 Grading Function for Four-Part Chorales

Given a four-part chorale, our grading function (implementation available at https://github.com/asdfang/constraint-transformer-bach/tree/master/Grader) outputs a real-valued grade. We represent a chorale as a set of distributions, each corresponding to a musical feature important for evaluating Bach-style chorales. We implement our grading function using music21 (Cuthbert and Ariza, 2010).

For each feature (described in Section 2.1), we use the Wasserstein metric (Rüschendorf, 1985) to measure the distance between the distribution of the given chorale and the ground-truth distribution over the set of true Bach chorales. By taking a weighted sum of the Wasserstein distances over all the features (Eq. 1), we obtain the overall grade for a chorale. Note that the output of the grading function is positive, and a lower grade represents a better chorale.

\[
\text{grade}(c) = \sum_{f \in \mathcal{F}} w_f \cdot W\big(P_f(c),\, P_f(\text{Bach})\big) \tag{1}
\]

where \(\mathcal{F}\) is the set of features, \(P_f(c)\) is the distribution of feature \(f\) in chorale \(c\), \(P_f(\text{Bach})\) is the ground-truth distribution over the Bach chorales, \(W\) is the Wasserstein metric, and \(w_f\) is the weight of feature \(f\).
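As a concrete sketch of Eq. 1, the overall grade can be computed with SciPy's wasserstein_distance. The distribution format, function names, and default-weight handling below are our own assumptions, not the authors' exact implementation (which lives in the repository linked above):

```python
from scipy.stats import wasserstein_distance

def grade(chorale_dists, bach_dists, weights=None):
    """Overall grade: weighted sum of per-feature Wasserstein distances.

    Each argument maps a feature name to a distribution, itself a dict
    from feature values to probabilities. Values are assumed numeric
    (categorical features would first be mapped to indices). Lower
    grades are better.
    """
    weights = weights or {}
    total = 0.0
    for feature, chorale_dist in chorale_dists.items():
        bach_dist = bach_dists[feature]
        # Evaluate both distributions over the union of their supports.
        support = sorted(set(chorale_dist) | set(bach_dist))
        distance = wasserstein_distance(
            support, support,
            u_weights=[chorale_dist.get(v, 0.0) for v in support],
            v_weights=[bach_dist.get(v, 0.0) for v in support],
        )
        total += weights.get(feature, 1.0) * distance  # w_f defaults to 1
    return total
```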

2.1 Features

In this section, we describe each feature used to represent a chorale (or set of chorales). The weight \(w_f = 1\) unless stated otherwise.

2.1.1 Pitch

The pitch distribution is the distribution of a chorale's pitches in scale degrees. We consider enharmonic spellings as distinct, but not octave displacements. For a concrete example, if a chorale in C major had 60 C's, 25 F♯'s, and 15 G♯'s, the probabilities for scale degree 1, the raised fourth (♯4), and the raised fifth (♯5) are 0.6, 0.25, and 0.15, respectively. The pitch distribution feature evaluates a Bach-like usage of tonality, distinguishing pieces that are too chromatic (e.g. twelve-tone pieces) from ones that are too stagnant (e.g. pieces that never use any chromaticism).
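A minimal sketch of how this distribution might be extracted with music21; the key-finding step and the scale-degree label format are our assumptions rather than the paper's exact procedure:

```python
from collections import Counter
from music21 import corpus

def pitch_distribution(chorale):
    """Distribution of scale degrees: enharmonic spellings are distinct,
    octave displacements are not."""
    key = chorale.analyze('key')  # music21's key estimate; the paper may
                                  # rely on a labeled key instead
    counts = Counter()
    for n in chorale.recurse().notes:
        for p in n.pitches:  # covers single notes and chords alike
            degree, accidental = key.getScaleDegreeAndAccidentalFromPitch(p)
            label = (accidental.modifier if accidental else '') + str(degree)
            counts[label] += 1
    total = sum(counts.values())
    return {deg: c / total for deg, c in counts.items()}

# Example: pitch_distribution(corpus.parse('bach/bwv66.6'))
```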

2.1.2 Rhythm

The rhythm distribution is the distribution of note lengths in units of quarter notes, e.g. eighth notes are 0.5 units, quarter notes are 1. This feature serves to measure whether chorales use rhythm like Bach does: eighths and quarters as the main body, with other durations for decoration and variety.
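A corresponding sketch for the rhythm feature; music21's quarterLength already expresses durations in quarter-note units:

```python
from collections import Counter

def rhythm_distribution(chorale):
    """Distribution of note durations in quarter-note units
    (0.5 = eighth note, 1.0 = quarter note)."""
    counts = Counter(float(n.duration.quarterLength)
                     for n in chorale.recurse().notes)
    total = sum(counts.values())
    return {length: c / total for length, c in counts.items()}
```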

2.1.3 Intervals

The interval distribution is the distribution of directed melodic interval sizes, i.e. ascending and descending intervals of the same size are counted separately. Each voice (soprano, alto, tenor, bass) serves a different musical function: melodies in soprano parts have the most intervallic variety, bass parts leap more frequently to support the harmony, and tenor and alto parts tend to employ mostly small intervals. Therefore, we measure the interval distribution separately for each voice, for a total of four interval distributions.
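A sketch of the per-voice extraction, assuming each part is a single melodic line of individual notes:

```python
from collections import Counter
from music21 import interval

def interval_distributions(chorale):
    """One directed melodic-interval distribution per voice (S, A, T, B)."""
    dists = {}
    for part in chorale.parts:
        notes = list(part.recurse().notes)
        # directedName distinguishes e.g. ascending 'M2' from descending 'M-2'
        counts = Counter(
            interval.Interval(noteStart=a, noteEnd=b).directedName
            for a, b in zip(notes, notes[1:]))
        total = sum(counts.values())
        dists[part.partName] = {ivl: c / total for ivl, c in counts.items()}
    return dists
```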

2.1.4 Harmonic qualities

The harmonic qualities distribution describes the usage of vertical harmony by keeping only each chord's quality, e.g. "D major" would be reduced to "major." This feature also helps encourage a Bach-like usage of 18th-century tonality, dominated by major, minor, and dominant-seventh chords.
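A sketch using music21's chordify and chord-quality helpers; the exact set of quality categories used in the paper may differ:

```python
from collections import Counter

def harmonic_quality_distribution(chorale):
    """Distribution of vertical-sonority qualities, with roots discarded."""
    counts = Counter()
    for c in chorale.chordify().recurse().getElementsByClass('Chord'):
        if c.isDominantSeventh():
            counts['dominant-seventh'] += 1
        else:
            counts[c.quality] += 1  # 'major', 'minor', 'diminished', ...
    total = sum(counts.values())
    return {q: n / total for q, n in counts.items()}
```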

2.1.5 Parallel errors

The parallel errors distribution is the distribution of occurrences of the hallmark part-writing errors: parallel fifths and octaves (including unisons), in both similar and contrary motion. Observe that what matters is not only the distribution between parallel fifths and octaves, but also the count of these errors relative to the total number of notes. Therefore, the Wasserstein distance for this feature is multiplied by an additional weight derived from the chorale's ratio of parallel errors to total notes. This weight is large if the given chorale has a large error-to-note ratio compared to real Bach chorales, thereby penalizing the chorale.
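A rough sketch of the error counting using music21's VoiceLeadingQuartet. The note-by-note pairing below assumes a homorhythmic texture and omits the contrary-motion cases described above; the authors' actual implementation is in the linked repository:

```python
from collections import Counter
from itertools import combinations
from music21 import voiceLeading

def parallel_errors(chorale):
    """Counts of parallel unisons (P1), fifths (P5), and octaves (P8)."""
    counts = Counter()
    for upper, lower in combinations(chorale.parts, 2):
        top = list(upper.recurse().notes)
        bot = list(lower.recurse().notes)
        # Pair the nth melodic motion in one voice with the nth in the
        # other; this only lines up correctly when the voices move together.
        for (t1, t2), (b1, b2) in zip(zip(top, top[1:]), zip(bot, bot[1:])):
            vlq = voiceLeading.VoiceLeadingQuartet(t1, t2, b1, b2)
            if vlq.parallelUnison():
                counts['P1'] += 1
            elif vlq.parallelFifth():
                counts['P5'] += 1
            elif vlq.parallelOctave():
                counts['P8'] += 1
    return counts
```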

2.1.6 Repeated sequences

The repeated sequence distribution is the distribution of the lengths (in units of quarter notes) of sequences that contain at least two notes and appear at least twice in the chorale. It promotes a Bach-like handling of recurring motifs and intentional musical repetition. To identify repeated sequences, we use the dynamic programming algorithm in (Hsu et al., 1998).
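A brute-force stand-in for the Hsu et al. (1998) algorithm, shown only to make the feature concrete; the input format, a list of (pitch, quarter_length) pairs for one voice, is our assumption:

```python
from collections import Counter

def repeated_sequence_lengths(notes):
    """Lengths, in quarter notes, of sequences of at least two notes
    that occur at least twice in one voice.

    `notes` is a list of (pitch, quarter_length) pairs.
    """
    occurrences = Counter()
    n = len(notes)
    # Count every contiguous subsequence of two or more notes.
    for length in range(2, n):
        for start in range(n - length + 1):
            occurrences[tuple(notes[start:start + length])] += 1
    lengths = Counter()
    for seq, count in occurrences.items():
        if count >= 2:
            lengths[sum(ql for _, ql in seq)] += 1
    return lengths
```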

Figure 1: A generated chorale that receives a poor overall grade, with especially large parallel error and repeated sequence distances. The features with the largest distances represent weaknesses of the composition. In the figure, P1 marks parallel unisons, P5 parallel fifths, and P8 parallel octaves.
Figure 2: Results of the paired discrimination experiment carried out on human listeners. The grading function "picks" the chorale that receives the better grade and achieves 92.6% accuracy, outperforming human experts at 86.7%.

3 Experiments

We now show that the grading function provides interpretable output and is a promising substitute for human evaluation. We used the grading function to evaluate the output of a Transformer model (Vaswani et al., 2017) with relative attention (Huang et al., 2018) trained on a corpus of 351 Bach chorales, using the same data representation as in (Hadjeres et al., 2017).

The grade distributions for Bach chorales and generated chorales are very well-separated, with a near-zero Kolmogorov–Smirnov test p-value. In Table 1, we compare the median value of every feature in the grading function. Generated chorales do worse than Bach chorales on every feature.
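For illustration, such a separation can be checked with SciPy's two-sample Kolmogorov–Smirnov test; the grade values below are toy stand-ins, not the paper's data:

```python
from scipy.stats import ks_2samp

# Toy stand-ins for the two grade samples; in practice these are the
# overall grades of the real Bach chorales and of the generated ones.
bach_grades = [4.2, 4.9, 5.1, 5.6, 4.7, 5.0]
generated_grades = [7.8, 9.1, 8.5, 10.2, 8.9, 9.4]

statistic, p_value = ks_2samp(bach_grades, generated_grades)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.2e}")
```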

To further show the grading function’s interpretability, we display a badly graded generated chorale in Figure 1. We see especially large distances for its parallel error and repeated sequence features. Indeed, the grading function automatically found six total parallel errors and identified an abnormally long sequence of repeated quarter notes (the repeated E’s in measures 1–3 of the alto voice).

To compare our grading function to human performance, we conducted a paired discrimination test. We assessed the musical expertise of our participants through a series of pre-test questions, and assigned them to one of three groups: novice, intermediate, and expert. In the test, we presented three pairs of audio examples, each pair consisting of two complete chorales, one by Bach and one generated, and asked participants to select the one composed by Bach. In Figure 2, we compare the human picks to selecting the chorale that receives the better grade. We find that the grading function achieves 92.6% accuracy, outperforming human experts at 86.7%.

References

  • M. Allan and C. K. I. Williams (2004) Harmonising chorales by probabilistic inference. In International Conference on Neural Information Processing Systems (NIPS).
  • N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent (2012) Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription.
  • M. S. Cuthbert and C. Ariza (2010) music21: a toolkit for computer-aided musicology and symbolic music data. In Conference of the International Society for Music Information Retrieval (ISMIR), pp. 637–642.
  • K. Ebcioğlu (1988) An expert system for harmonizing four-part chorales. Computer Music Journal 12.
  • G. Hadjeres, F. Pachet, and F. Nielsen (2017) DeepBach: a steerable model for Bach chorales generation. In Proceedings of the 34th International Conference on Machine Learning, pp. 1362–1371.
  • J. Hsu, A. L. P. Chen, and C.-C. Liu (1998) Efficient repeating pattern finding in music databases. In International Conference on Information and Knowledge Management, New York, NY, USA, pp. 281–288.
  • C. A. Huang, T. Cooijmans, A. Roberts, A. Courville, and D. Eck (2017) Counterpoint by convolution. In Conference of the International Society for Music Information Retrieval (ISMIR).
  • C. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck (2018) An improved relative self-attention mechanism for transformer with application to music generation. CoRR abs/1809.04281.
  • F. T. Liang, M. Gotham, M. Johnson, and J. Shotton (2017) Automatic stylistic composition of Bach chorales with deep LSTM. In Conference of the International Society for Music Information Retrieval (ISMIR).
  • G. Nierhaus (2009) Algorithmic Composition: Paradigms of Automated Music Generation. Springer Vienna. ISBN 9783211755402.
  • L. Rüschendorf (1985) The Wasserstein distance and approximation theorems. Probability Theory and Related Fields 70 (1), pp. 117–129.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems.