1 Introduction
For over two centuries, scholars have observed that tonal harmony, like language, is characterized by the logical ordering of successive events, what has commonly been called harmonic syntax. In Western music of the common-practice period (1700–1900), pitch events group (or cohere) into discrete, primarily tertian sonorities, and the succession of these sonorities over time produces meaningful syntactic progressions. To characterize the passage from the first two measures of Bach's "Aus meines Herzens Grunde", for example, theorists and composers developed a chord typology that specifies both the scale steps on which tertian sonorities are built (Stufentheorie), and the functional (i.e., temporal) relations that bind them (Funktionstheorie). Shown beneath the staff in fig:bach_example, this Roman numeral system allows the analyst to recognize and describe these relations using a simple lexicon of symbols.
In the presence of such language-like design features, music scholars have increasingly turned to string-based methods from the natural language processing (NLP) community for the purposes of pattern discovery [6], classification [7], similarity estimation [18], and prediction [19]. In sequential prediction tasks, for example, probabilistic language models have been developed to predict the next event in a sequence, whether it consists of letters, words, DNA sequences, or in our case, chords.

Although corpus studies of tonal harmony have become increasingly commonplace in the music research community, applications of language models for chord prediction remain somewhat rare. This is likely because language models take as their starting point a sequence of chords, but the musical surface is often a dense web of chordal and non-chordal tones, making automatic harmonic analysis a tremendous challenge. Indeed, such is the scope of the computational problem that a number of researchers have instead elected to start with a particular chord typology right from the outset (e.g., Roman numerals, figured bass nomenclature, or pop chord symbols), and then identify chord events using either human annotators [3], or rule-based computational classifiers [25]. As a consequence, language models for tonal harmony frequently train on relatively small, heavily curated datasets [3], or use data augmentation methods to increase the size of the corpus [15]. And since the majority of these corpora reflect pop, rock, or jazz idioms, vocabulary reduction is a frequent preliminary step to ensure improved model performance, with the researcher typically retaining only specific chord types (e.g., major, minor, seventh, etc.), thus ignoring properties of tonal harmony relating to inversion [15] or chordal extension [11].

Given this annotation bottleneck, we propose a complementary method for the implementation and evaluation of language models for chord prediction. Rather than assume a particular chord typology a priori and train our models on the chord classes found therein, we instead propose a data-driven method for the construction of harmonic corpora using chord onsets derived from the musical surface. It is our hope that such a bottom-up approach to chord prediction could provide a springboard for the implementation of chord class models in future studies [2], the central purpose of which is to use predictive methods to reduce the musical surface to a sequence of syntactic progressions by discovering a small vocabulary of chord types.
We begin in Section 2 by describing the datasets used in the present research and then present the tonal encoding scheme that reduces the combinatoric explosion of potential chord types to a vocabulary consisting of roughly two hundred types for each scale degree in the lowest instrumental part. Next, Section 3 describes the two most common state-of-the-art architectures employed in the NLP community: Finite Context (or n-gram) models and Recurrent Neural Networks (RNNs). Section 4 presents the experiments, which (1) evaluate the two aforementioned model architectures in a chord prediction task; (2) compare predictive accuracy from the best-performing models for each dataset; and (3) attempt to explain the differences between the two models using a regression analysis. We conclude in Section 5 by considering limitations of the present approach, and offering avenues for future research.
2 Corpus
This section presents the datasets used in the present research and then describes the chord representation scheme that permits model comparison across datasets.
2.1 Datasets
Shown in tab:corpus, this study includes nine datasets of Western tonal music (1710–1910) featuring symbolic representations of the notated score (e.g., metric position, rhythmic duration, pitch, etc.). The Chopin dataset consists of 155 works for piano that were encoded in MusicXML format [10]. The Assorted symphonies dataset consists of symphonic movements by Beethoven, Berlioz, Bruckner, and Mahler that were encoded in MATCH format [26]. All other datasets were downloaded from the KernScores database (http://kern.ccarh.org/) in MIDI format. In total, the composite corpus includes the complete catalogues for Beethoven's string quartets and piano sonatas, Joplin's rags, and Chopin's piano works, and consists of over 1,000 compositions containing more than 1 million chord tokens.
Composer    Genre      N    N    N

Bach        Chorale
Haydn       Quartet
Mozart      Quartet
Beethoven   Quartet    *
Mozart      Piano
Beethoven   Piano      *
Chopin      Piano      *
Joplin      Piano      *
Assorted    Symphony

Total

Note. * denotes the complete catalogue.
Datasets and descriptive statistics for the corpus.
2.2 Chord Representation Scheme
To derive chord progressions from symbolic corpora using data-driven methods, music analysis software frameworks typically perform a full expansion of the symbolic encoding, which duplicates overlapping note events at every unique onset time. Shown in fig:bach_example2, expansion identifies 9 unique onset times in the first two measures of Bach's chorale harmonization, "Aus meines Herzens Grunde."
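The expansion step can be sketched in a few lines; `full_expansion` is a hypothetical helper (not the paper's implementation), with note events given as (onset, duration, pitch) triples:

```python
from collections import defaultdict

def full_expansion(notes):
    """Slice overlapping note events at every unique onset time.

    `notes` is a list of (onset, duration, pitch) triples; the result maps
    each unique onset to the sorted list of pitches sounding at that moment.
    """
    onsets = sorted({onset for onset, _, _ in notes})
    slices = defaultdict(set)
    for onset, duration, pitch in notes:
        # A note contributes to every onset slice that falls within its span.
        for t in onsets:
            if onset <= t < onset + duration:
                slices[t].add(pitch)
    return {t: sorted(ps) for t, ps in sorted(slices.items())}

# A held whole note against two quarter notes yields slices at 0.0 and 1.0:
print(full_expansion([(0.0, 4.0, 60), (0.0, 1.0, 67), (1.0, 1.0, 69)]))
# → {0.0: [60, 67], 1.0: [60, 69]}
```

The held note is duplicated into both slices, which is precisely why data-driven corpora contain many adjacent repetitions (see the Repetition predictor in Section 4).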
Previous studies have represented each chord according to the simultaneous relations between its note-event members (e.g., vertical intervals) [23], the sequential relations between its chord-event neighbors (e.g., melodic intervals) [6], or some combination of the two [22]. For the purposes of this study, we have adopted a chord typology that models every possible combination of note events in the corpus. The encoding scheme consists of an ordered tuple for each chord onset in the sequence: the first element is a set of up to three intervals above the bass in semitones modulo the octave (i.e., 12), resulting in 13^3 (or 2197) possible combinations, since each vertical interval is either undefined or represents one of twelve possible interval classes, where 0 denotes a perfect unison or octave, 7 denotes a perfect fifth, and so on; the second element is the chromatic scale degree (again modulo the octave) of the bass, where 0 represents the tonic, 7 the dominant, and so on.
Because this encoding scheme makes no distinction between chord tones and non-chord tones, the syntactic domain of chord types is still very large. To reduce the domain to a more reasonable number, we have excluded pitch-class repetitions (i.e., voice doublings), and we have allowed permutations. Following [22], the assumption here is that the precise location and repeated appearance of a given interval are inconsequential to the identity of the chord. By allowing permutations, differently ordered voicings of the major triad reduce to a single interval set; similarly, by eliminating repetitions, voicings that double one of the triad's members reduce to that same set. This procedure restricts the domain to roughly two hundred unique chord types above each bass scale degree.
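Under these reductions, the encoding of a single onset can be sketched as follows (the function name and MIDI-pitch input are illustrative, not the paper's implementation):

```python
def encode_chord(midi_pitches, tonic_pc):
    """Encode a chord onset as (vertical intervals above the bass modulo 12,
    chromatic scale degree of the bass).

    Repeated intervals and octave doublings of the bass are discarded, and
    interval ordering is ignored, following the reductions described above.
    """
    bass = min(midi_pitches)
    # Set comprehension removes duplicates; interval 0 (a doubling of the
    # bass pitch class) is excluded as a voice doubling.
    intervals = sorted({(p - bass) % 12 for p in midi_pitches} - {0})
    bass_sd = (bass - tonic_pc) % 12  # chromatic scale degree of the bass
    return tuple(intervals), bass_sd

# Two different voicings of a G major triad in G major (tonic pc = 7)
# reduce to the same chord type:
print(encode_chord([55, 62, 67, 71], 7))  # → ((4, 7), 0)
print(encode_chord([43, 59, 62, 67], 7))  # → ((4, 7), 0)
```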
To determine the underlying tonal context of each chord onset, we employ the key-finding algorithm in [1], which tends to outperform other distributional methods (with an accuracy of around 90% for both major and minor keys). Since the movements in this dataset typically feature modulations, we compute the Pearson correlation between the distributional weights of the selected key-finding algorithm and the pitch-class distribution identified in a moving window of 16 quarter-note beats centered around each chord onset in the sequence. The algorithm interprets the passage in Figure 2 in G major, for example, so the bass note of the first harmony is 0 (i.e., the tonic).
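The windowed correlational procedure can be sketched as follows. For illustration, the Krumhansl–Kessler major-key profile stands in for the Albrecht & Shanahan weights used in the paper, and only major keys are scored:

```python
import numpy as np

# Krumhansl–Kessler major-key profile (illustrative stand-in for the
# Albrecht & Shanahan distributional weights).
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def estimate_key(pc_counts):
    """Return the major-key tonic (0-11) whose rotated profile correlates
    best, via Pearson r, with the windowed pitch-class distribution."""
    pc_counts = np.asarray(pc_counts, dtype=float)
    rs = [np.corrcoef(np.roll(MAJOR_PROFILE, tonic), pc_counts)[0, 1]
          for tonic in range(12)]
    return int(np.argmax(rs))

# A window dominated by G major scale tones is read as tonic 7 (G):
g_major = np.zeros(12)
g_major[[7, 9, 11, 0, 2, 4, 6]] = [5, 2, 3, 2, 4, 2, 1]
print(estimate_key(g_major))  # → 7
```

In the full procedure this estimate would be recomputed for the 16-beat window around every chord onset, so that local modulations shift the scale-degree interpretation of the bass.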
3 Language Models
The goal of language models is to estimate the probability of event e_i given a preceding sequence of events e_1 to e_{i−1}, notated here as p(e_i | e_1^{i−1}). In principle, these models predict e_i by acquiring knowledge through unsupervised statistical learning of a training corpus, with the model architecture determining how this learning process takes place. For this study we examine the two most common and best-performing language models in the NLP community: (1) Markovian finite-context (or n-gram) models using the PPM algorithm, and (2) recurrent neural networks (RNNs) using both long short-term memory (LSTM) layers and gated recurrent units (GRUs).
3.1 Finite Context Models
Context models estimate the probability of each event in a sequence by stipulating a global order bound (or deterministic context) such that e_i depends only on the previous n−1 events, or p(e_i | e_{i−n+1}^{i−1}). For this reason, context models are also sometimes called n-gram models, since the sequence e_{i−n+1}^{i} is an n-gram consisting of a context e_{i−n+1}^{i−1} and a single-event prediction e_i. These models first acquire the frequency counts for a collection of n-grams from a training set, and then apply these counts to estimate the probability distribution governing the identity of e_i in a test sequence using maximum likelihood (ML) estimation.

Unfortunately, the number of potential n-grams increases dramatically as the value of n increases, so high-order models often suffer from the zero-frequency problem, in which n-grams encountered in the test set do not appear in the training set [27]. The most common solution to this problem has been the Prediction by Partial Match (PPM) algorithm, which adjusts the ML estimate for e_i by combining (or smoothing) predictions generated at higher orders with less sparsely estimated predictions from lower orders [5]. Specifically, PPM assigns some portion of the probability mass to accommodate predictions that do not appear in the training set using an escape method. The best-performing smoothing method is called mixtures (or interpolated smoothing), which computes a weighted combination of higher-order and lower-order models for every event in the sequence.
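The interpolated-smoothing idea can be illustrated with a toy bigram model that mixes higher- and lower-order ML estimates; this sketch omits PPM's escape mechanism and variable-order machinery, and the mixing weight is an arbitrary illustrative choice:

```python
from collections import Counter

def interpolated_bigram(train, lam=0.7):
    """Toy interpolated language model: a weighted mixture of bigram and
    unigram ML estimates over a training sequence of chord symbols."""
    unigrams = Counter(train)
    bigrams = Counter(zip(train, train[1:]))
    n = len(train)

    def prob(context, event):
        uni = unigrams[event] / n                       # order-1 estimate
        if unigrams[context]:
            bi = bigrams[(context, event)] / unigrams[context]  # order-2
        else:
            bi = 0.0  # unseen context: fall back entirely on the unigram
        return lam * bi + (1 - lam) * uni

    return prob

prob = interpolated_bigram(["I", "IV", "V", "I", "IV", "V", "I"])
# "V" is always followed by "I" in the training data, so P(I | V) is high,
# yet the unigram component keeps P(IV | V) nonzero:
print(prob("V", "I") > prob("V", "IV") > 0.0)  # → True
```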
3.1.1 Model Selection
To implement this model architecture, we apply the variable-order Markov model (called IDyOM) developed in [19], which is available for download at http://code.soundsoftware.ac.uk/projects/idyom-project. The model accommodates many possible configurations based on the selected global order bound, escape method, and training type. Rather than select a global order bound, researchers typically prefer an extension to PPM called PPM*, which uses simple heuristics to determine the optimal high-order context length for each prediction, and which has been shown to outperform the traditional PPM scheme in several prediction tasks (e.g., [21]), so we apply that extension here. Regarding the escape method, recent studies have demonstrated the potential of method C to minimize model uncertainty in melodic and harmonic prediction tasks [12, 21], so we also employ that method here.

To improve model performance, Finite Context models often separately estimate and then combine two subordinate models trained on different subsets of the corpus: a long-term model (LTM+), which is trained on the entire corpus; and a short-term (or cache) model (STM), which is initially empty for each individual composition and then is trained incrementally (e.g., [8]). As a result, the LTM+ reflects inter-opus statistics from a large corpus of compositions, whereas the STM only reflects intra-opus statistics, some of which may be specific to that composition. Finally, the model implemented here also includes a model that combines the LTM+ and STM models using a weighted geometric mean (BOTH+) [20]. Thus, we report the LTM+, STM, and BOTH+ models for the analyses that follow. (The models featuring the + symbol represent both the statistics from the training set and the statistics from that portion of the test set that has already been predicted.)

3.2 Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are powerful models designed for sequential modelling tasks. RNNs transform an input sequence x to an output sequence y through a nonlinear projection into a hidden layer h, parameterised by weight matrices W_xh, W_hh, and W_hy:

(1)  h_t = σ_h(W_xh x_t + W_hh h_{t−1})
(2)  y_t = σ_y(W_hy h_t)

where σ_h and σ_y are the activation functions for the hidden layer (e.g. the sigmoid function) and the output layer (e.g. the softmax), respectively. We excluded bias terms for simplicity.
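A single step of this recurrence can be sketched in NumPy. The 64-dimensional input and 128 hidden units follow the architecture described in Section 3.2.1; the 10-symbol output vocabulary and random weights are purely illustrative:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One step of Eqs. (1)-(2): sigmoid hidden layer, softmax output.
    Bias terms are omitted, as in the text."""
    h_t = 1.0 / (1.0 + np.exp(-(W_xh @ x_t + W_hh @ h_prev)))  # Eq. (1)
    z = W_hy @ h_t
    e = np.exp(z - z.max())          # subtract max for numerical stability
    y_t = e / e.sum()                # Eq. (2): softmax over the vocabulary
    return h_t, y_t

rng = np.random.default_rng(0)
d_in, d_h, d_out = 64, 128, 10
h, y = rnn_step(rng.normal(size=d_in), np.zeros(d_h),
                rng.normal(size=(d_h, d_in)) * 0.1,
                rng.normal(size=(d_h, d_h)) * 0.1,
                rng.normal(size=(d_out, d_h)) * 0.1)
print(round(float(y.sum()), 6))  # → 1.0 (a valid distribution over symbols)
```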
RNNs have become popular models for natural language processing due to their superior performance compared to Finite Context models [17]. Here, the input at each time step t is a (learnable) vector representation of the preceding symbol, x_t. The network's output y_t is interpreted as the conditional probability over the next symbol. As outlined in Figure 3, this probability depends on all preceding symbols through the recurrent connection in the hidden layer.

During training, the categorical cross-entropy between the output y_t and the true chord symbol is minimised by adapting the weight matrices in Eqs. 1 and 2 using stochastic gradient descent and backpropagation through time. However, this training procedure suffers from vanishing and exploding gradients because of the recursive dot product in Eq. 1. The latter problem can be averted by clipping the gradient values; the former, however, is trickier to prevent, and necessitates more complex recurrent structures such as the long short-term memory unit (LSTM) [13] or the gated recurrent unit (GRU) [4]. These units have become standard features of RNN-based language modeling architectures [16].

3.2.1 Model Selection
Selecting good hyperparameters is crucial for neural networks to perform well. To this end, we performed a number of preliminary experiments to tune the networks. Our final architecture comprises two layers of 128 recurrent units each (either LSTM or GRU), a learnable input embedding of 64 dimensions (i.e. it maps each chord class to a vector in R^64), and skip connections between the input and all other layers.
RNNs are prone to overfit the training data. We use the network's performance on held-out data to detect this. Since we employ 4-fold cross-validation (see Sec. 4 for details), we hold out one of the three training folds as a validation set. If the results on these data do not improve for 10 epochs, we stop training and select the model with the lowest cross-entropy on the validation data.
We trained the networks for a maximum of 200 epochs, using stochastic gradient descent with a mini-batch size of 4. Each of these 4 data points is a sequence of at most 300 chords. The gradient updates are scaled using the Adam update rule [14] with standard parameters. To prevent exploding gradients, we clip gradient values larger than 1.
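The early-stopping rule just described can be sketched as a framework-agnostic skeleton; `val_loss_fn` is a hypothetical stand-in for one epoch of training followed by validation:

```python
def train_with_early_stopping(epochs, val_loss_fn, patience=10):
    """Stop training when validation cross-entropy has not improved for
    `patience` epochs; return the best epoch and its loss."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch in range(epochs):
        loss = val_loss_fn(epoch)   # train one epoch, then validate
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # new best model
        else:
            wait += 1
            if wait >= patience:    # no improvement for `patience` epochs
                break
    return best_epoch, best_loss

# Validation loss bottoms out at epoch 5, then plateaus; training halts
# 10 epochs later and the epoch-5 model is kept:
losses = [3.0, 2.5, 2.2, 2.1, 2.05, 2.0] + [2.1] * 30
print(train_with_early_stopping(200, lambda e: losses[e]))  # → (5, 2.0)
```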
4 Experiments
4.1 Evaluation
To evaluate performance using a more refined method than one simply based on the accuracy of the model's prediction, we use a statistic called corpus cross-entropy, denoted by H_m:

(3)  H_m(p_m, s) = −(1/n) Σ_{i=1}^{n} log2 p_m(e_i | e_1^{i−1})

H_m represents the average information content for the model probabilities estimated by p_m over all e_i in the sequence s. That is, cross-entropy provides an estimate of how uncertain a model is, on average, when predicting a given sequence of events [21], regardless of whether the correct symbol for each event was assigned the highest probability in the distribution.
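Given the probability a model assigned to each event in a sequence, Eq. (3) reduces to a few lines of code:

```python
import math

def corpus_cross_entropy(probs):
    """Average information content, in bits, of the model probabilities
    assigned to the events of a sequence (Eq. 3)."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# A model that assigns probability 0.5 to every event is, on average,
# exactly 1 bit uncertain per event:
print(corpus_cross_entropy([0.5, 0.5, 0.5, 0.5]))  # → 1.0
```

Lower values indicate a less uncertain (better-performing) model; a perfect model that always assigned probability 1 to the correct symbol would score 0 bits.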
Finally, we employ 4-fold cross-validation stratified by dataset for both model architectures, using cross-entropy as a measure of performance.
4.2 Results
We first compare the average cross-entropy estimates across the entire corpus using Finite Context models and RNNs, and then examine the estimates across datasets for the best-performing model configuration from each architecture. We conclude by examining the differences between these models in a regression analysis.
4.2.1 Comparing Models
tab:modelcomparison presents the average cross-entropy estimates for each model configuration. For the purposes of statistical inference, we also include the 95% bootstrap confidence interval using the bias-corrected and accelerated percentile method [9]. For the Finite Context models, BOTH+ produced the lowest cross-entropy estimates on average, though the difference between BOTH+ and LTM+ was negligible. STM was the worst-performing model overall, which is unsurprising given the restrictions placed on the model's training parameters (i.e., that it only trains on the already-predicted portion of the test set).
Of the RNN models, LSTM slightly outperformed GRU, but again this difference was negligible. What is more, the long-term Finite Context models (BOTH+ and LTM+) significantly outperformed both RNNs. This finding could suggest that context models are better suited to music corpora, since the datasets for melodic and harmonic prediction are generally minuscule relative to those in the NLP community [15]. The encoding scheme for this study also produced a large vocabulary (2590 symbols), so the PPM* algorithm might be useful when the model is forced to predict particularly rare types in the corpus.
4.2.2 Comparing Datasets
To identify the differences between these models for each of the datasets in the corpus, fig:barplot presents the bar plots for the best-performing model configurations from each model architecture: BOTH+ from the Finite Context model, and LSTM from the RNN model. On average, BOTH+ produced the lowest cross-entropy estimates for the piano datasets (Mozart, Beethoven, Joplin), but much higher estimates for the other datasets. This effect was not observed for LSTM, however, with the datasets' genre (chorale, piano work, quartet, and symphony) apparently playing no role in the model's overall performance.
The difference between these two model architectures for the Joplin and Mozart piano datasets is particularly striking. Given that piano works generally feature fewer homorhythmic textures than the other genres in this corpus, the piano datasets may contain a larger proportion of rare, monophonic chord types relative to the other datasets. The next section examines this hypothesis using a regression model.
4.2.3 A Regression Model
Given the complexity of the corpus, a number of factors might explain the performance of these models. Thus, we have included the following five predictors in a multiple linear regression (MLR) model to explain the average cross-entropy estimates for the compositions in the corpus. (Four of the 1116 compositions were further subdivided in the selected datasets, producing an additional 20 sequences in the analyses: Beethoven, Quartet No. 6, Op. 18, iv (2); Chopin, Op. 12 (2); Mozart, Piano Sonata No. 6, K. 284, iii (13); Mozart, Piano Sonata No. 11, K. 331, i (7).)
Tokens. Cache (i.e., STM) and RNN-based language models often benefit from datasets that feature longer sequences by exploiting statistical regularities in the portion of the test sequence that was already predicted. Thus, Tokens represents the number of tokens in each sequence. Compositions featuring more tokens should receive lower cross-entropy estimates on average.

Types. Language models struggle with data sparsity as the vocabulary increases (i.e., the zero-frequency problem). One solution is to select corpora for which the vocabulary of possible distinct types is relatively small. Thus, Types represents the number of types in each sequence. Compositions with larger vocabularies should receive higher cross-entropy estimates on average.

Improbable. Events that occur with low probability in the zeroth-order distribution are particularly difficult to predict due to the data sparsity problem just mentioned. Thus, Improbable represents the proportion of tokens in each sequence that appear in the bottom 10% of types in the zeroth-order probability distribution. Compositions with a large proportion of these particularly rare types should receive higher cross-entropy estimates on average.

Monophonic. Chorales feature homorhythmic textures in which each temporal onset includes multiple coincident pitch events. The chord types representing these tokens should be particularly common in this corpus, but some genres might also feature polyphonic textures in which the number of coincident events is potentially quite low (e.g., piano). Thus, Monophonic represents the proportion of tokens in each sequence that consist of only one pitch event. Compositions with a large proportion of these monophonic events should receive higher cross-entropy estimates on average.

Repetition. Compared to chord-class corpora, data-driven corpora are far more likely to feature adjacent repetitions of tokens. Thus, Repetition represents the proportion of tokens in each sequence that feature adjacent repetitions. Compositions with a large proportion of repetitions should receive lower cross-entropy estimates on average.
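The five predictors can be computed per sequence along the following lines; the two type sets (the bottom 10% of the zeroth-order distribution, and the single-pitch chord types) are assumed to be derived elsewhere over the whole corpus, and all names here are illustrative:

```python
def sequence_predictors(tokens, improbable_types, monophonic_types):
    """Compute the five regression predictors for one chord sequence.

    `improbable_types`: chord types in the bottom 10% of the corpus-wide
    zeroth-order distribution; `monophonic_types`: types with one pitch.
    """
    n = len(tokens)
    return {
        "tokens": n,
        "types": len(set(tokens)),
        "improbable": sum(t in improbable_types for t in tokens) / n,
        "monophonic": sum(t in monophonic_types for t in tokens) / n,
        # Proportion of tokens immediately repeating their predecessor:
        "repetition": sum(a == b for a, b in zip(tokens, tokens[1:])) / n,
    }

seq = ["A", "A", "B", "C", "C", "C", "D"]
print(sequence_predictors(seq, improbable_types={"D"},
                          monophonic_types={"B"}))
```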
tab:regression presents the results of a stepwise regression analysis predicting the average cross-entropy estimates with the aforementioned predictors. R^2 refers to the fit of the model, where a value of 1 indicates that the model accounts for all of the variance in the outcome variable (i.e., a perfectly linear relationship between the predictors and the cross-entropy estimates). The slope of the line measured for each predictor, denoted by β, represents the change in the outcome resulting from a unit change in the predictor.

For the Finite Context model (BOTH+), four of the five predictors explained 53% of the variance in the cross-entropy estimates. As predicted, cross-entropy decreased as the number of tokens increased, suggesting that the model learned from past tokens in the sequence. What is more, cross-entropy increased as the vocabulary increased, as well as when the proportion of monophonic or improbable tokens increased, though the latter two predictors had little effect on the model.
For the RNN model, the effect of these predictors was strikingly different. In this case, cross-entropy increased with the proportion of improbable events. Note that this predictor played only a minor role for the Finite Context model, which suggests PPM* may be responsible for that model's superior performance. For the remaining predictors, cross-entropy estimates decreased when the proportion of adjacent repeated tokens increased. Like the Finite Context model, the RNN model also struggled when the proportion of monophonic tokens increased, but benefited from longer sequences featuring smaller vocabularies.
5 Conclusion
This study examined the potential for language models to predict chords in a large-scale corpus of tonal compositions from the common-practice period. To that end, we developed a flexible chord representation scheme that (1) made minimal a priori assumptions about the chord typology underlying tonal music, and (2) allowed us to create a much larger corpus relative to those based on chord annotations. Our findings demonstrate that Finite Context models outperform RNNs, particularly on the piano datasets, which suggests PPM* is responsible for the superior performance, since it assigns a portion of the probability mass to potentially rare, as-yet-unseen types. A regression analysis generally confirmed this hypothesis, with the LSTM struggling to predict the improbable types from the piano datasets.
To our knowledge, this is the first language-modeling study to use such a large vocabulary of chord types, though this approach is far more common in the NLP community, where the selected corpus can sometimes contain millions of distinct word types. Our goal in doing so was to bridge the gulf between the most current data-driven methods for melodic and harmonic prediction on the one hand [24], and applications of chord typologies for the creation of corpora using expert analysts on the other [3]. Indeed, despite recent efforts to determine the efficacy of language models for annotated corpora [15, 11], relatively little has been done to develop unsupervised methods for the discovery of tonal harmony in predictive contexts.
One serious limitation of the architectures examined in this study is their unwavering commitment to the surface. Rather than skipping seemingly inconsequential onsets, such as those containing embellishing tones or repetitions, these models predict every onset in their path. As a result, the model configurations examined here attempted to predict tonal (pitch) content rather than tonal harmonic progressions per se. In our view, word class models could provide the necessary bridge between the bottom-up and top-down approaches just described by reducing the vocabulary of surface simultaneities to its most essential harmonies [2]. Along with prediction tasks, these models could then be adapted for sequence generation and automatic harmonic analysis, and in so doing, provide converging evidence that the statistical regularities characterizing a tonal corpus also reflect the order in which its constituent harmonies occur.
6 Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 670035).
References
 [1] J. Albrecht and D. Shanahan. The use of large corpora to train a new type of key-finding algorithm: An improved treatment of the minor mode. Music Perception, 31(1):59–67, 2013.
 [2] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
 [3] J. A. Burgoyne, J. Wild, and I. Fujinaga. An expert ground truth set for audio chord recognition and music analysis. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), Miami, USA, 2011.
 [4] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259 [cs, stat], September 2014.
 [5] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, 1984.

 [6] D. Conklin. Representation and discovery of vertical patterns in music. In C. Anagnostopoulou, M. Ferrand, and A. Smaill, editors, Music and Artificial Intelligence: Lecture Notes in Artificial Intelligence 2445, volume 2445, pages 32–42. Springer-Verlag, 2002.
 [7] D. Conklin. Multiple viewpoint systems for music classification. Journal of New Music Research, 42(1):19–26, 2013.
 [8] D. Conklin and I. H. Witten. Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1):51–73, 1995.
 [9] T. J. DiCiccio and B. Efron. Bootstrap confidence intervals. Statistical Science, 11(3):189–228, 1996.
 [10] S. Flossmann, W. Goebl, M. Grachten, B. Niedermayer, and G. Widmer. The Magaloff project: An interim report. Journal of New Music Research, 39(4):363–377, 2010.
 [11] B. Di Giorgi, S. Dixon, M. Zanoni, and A. Sarti. A datadriven model of tonal chord sequence complexity. IEEE/ACM Transactions on Audio, Speech and Language Processing, 25(11):2237–2250, 2017.
 [12] T. Hedges and G. A. Wiggins. The prediction of merged attributes with multiple viewpoint systems. Journal of New Music Research, 2016.
 [13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997.
 [14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [15] F. Korzeniowski, D. R. W. Sears, and G. Widmer. A largescale study of language models for chord prediction. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018.
 [16] G. Melis, C. Dyer, and P. Blunsom. On the state of the art of evaluation in neural language models. In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada, April 2018.
 [17] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pages 1045–1048, Chiba, Japan, September 26–30, 2010.
 [18] D. Müllensiefen and M. Pendzich. Court decisions on music plagiarism and the predictive value of similarity algorithms. Musicæ Scientiæ, Discussion Forum 4B:257–295, 2009.
 [19] M. T. Pearce. The Construction and Evaluation of Statistical Models of Melodic Structure in Music Perception and Composition. PhD thesis, City University, London, 2005.
 [20] M. T. Pearce, D. Conklin, and G. A. Wiggins. Methods for Combining Statistical Models of Music, pages 295–312. Springer Verlag, Heidelberg, Germany, 2005.
 [21] M. T. Pearce and G. A. Wiggins. Improved methods for statistical modelling of monophonic music. Journal of New Music Research, 33(4):367–385, 2004.
 [22] I. Quinn. Are pitchclass profiles really “key for key”? Zeitschrift der Gesellschaft der Musiktheorie, 7:151–163, 2010.
 [23] D. R. W. Sears. The Classical Cadence as a Closing Schema: Learning, Memory, and Perception. PhD thesis, McGill University, Montreal, Canada, 2016.
 [24] D. R. W. Sears, M. T. Pearce, W. E. Caplin, and S. McAdams. Simulating melodic and harmonic expectations for tonal cadences using probabilistic models. Journal of New Music Research, 47(1):29–52, 2018.
 [25] D. Temperley and D. Sleator. Modeling meter and harmony: A preferencerule approach. Computer Music Journal, 23(1):10–27, 1999.

 [26] G. Widmer. Using AI and machine learning to study expressive music performance: Project survey and first report. AI Communications, 14(3):149–162, 2001.
 [27] I. H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094, 1991.