Automatic melody harmonization, a sub-task of automatic music generation (fernandez13jair), refers to the task of creating computational models that can generate a harmonic accompaniment for a given melody (chuan07; simon08). Here, the term harmony, or harmonization, is used to refer to chordal accompaniment, where an accompaniment is defined relative to the melody as the supporting section of the music. Figure 1 illustrates the inputs and outputs for a melody harmonization model.
Melody harmonization is a challenging task as there are multiple ways to harmonize the same melody; what makes a particular harmonization pleasant is subjective, and often dependent on musical genre and other contextual factors. Tonal music, which encompasses most of Western music, defines specific motivic relations between chords based on scales such as those defined in functional harmony (riemann1893). While these relations still stand and are taught today, their application towards creating pleasant music often depends on subtleties, long term dependencies and cultural contexts which may be readily accessible to a human composer, but very difficult to learn and detect for a machine. While a particular harmonization may be deemed technically correct in some cases, it can also be seen as uninteresting in a modern context.
There have been several efforts made towards this task in the past (makris2016automatic). Before the rise of deep learning, the most actively employed approach was based on hidden Markov models (HMMs). For example, (paiement2006pmh) proposed a tree-structured HMM that allows for learning the non-local dependencies of chords, and encoded probabilities for chord substitution taken from psycho-acoustics. They additionally presented a novel representation for chords that encodes relative scale degrees rather than absolute note values, and included a sub-graph in their model specifically for processing it. (tsushima17ismir) similarly presented a hierarchical tree-structured model combining probabilistic context-free grammars (PCFG) for chord symbols and HMMs for chord rhythms. (temperley2009unified) presented a statistical model that would generate and analyze music along three sub-structures: metrical structure, harmonic structure, and stream structure. In the generative portion of this model, a metrical structure defining the emphasis of beats and sub-beats is first generated, and then harmonic structure and progression are generated conditioned on that metrical structure.
There are several previous works which attempt to formally and probabilistically analyze tonal harmony and harmonic structure. For example, (rohrmeier2008statistical) applied a number of statistical techniques to harmony in Bach chorales in order to uncover a proposed underlying harmonic syntax that naturally produces common perceptual and music theoretic patterns including functional harmony. (jacoby2015information) attempted to categorize common harmonic symbols (scale degrees, roman numerals, or sets of simultaneous notes) into higher level functional groups, seeking underlying patterns that produce and generalize functional harmony. (tsushima2018generative) used unsupervised learning in training generative HMM and PCFG models for harmonization, showing that the patterns learned by these models match the categorizations presented by functional harmony.
More recently, researchers have begun to explore the use of deep learning for a variety of music generation tasks (briot17survey). For melody harmonization, (lim17) proposed a model that employed two bidirectional long short-term memory (BiLSTM) recurrent layers (hochreiter97LSM) and one fully-connected layer to learn the correspondence between pairs of melody and chord sequences. The model architecture is depicted in Figure 1. According to the experiments reported in (lim17), this model outperforms a simple HMM model and a more complicated DNN-HMM model (hinton12spm) for melody harmonization with major and minor triad chords.
We note that, while many new models are being proposed for melody harmonization, at present there is no comparative study evaluating a wide array of different approaches for this task using the same training set and test set. Comparing models trained on different training sets is problematic, as it is hard to have a standardized definition of improvement and quality. Moreover, as there is to date no standardized test set for this task, it is hard to make consistent comparisons between different models.
In this paper, we aim to bridge this gap with the following three contributions:
We implement a set of melody harmonization models which span a number of canonical approaches to the task, including template matching, hidden Markov model (HMM) (simon08), genetic algorithm (GA) (kitahara2018), and two variants of deep recurrent neural network models (lim17). We then present a comparative study of the performance of these models. To the best of our knowledge, a comparative study that considers such a diverse set of approaches for melody harmonization using a standardized dataset has not been attempted before.
We compile a new dataset, called the Hooktheory Pianoroll Triad Dataset (HTPD3), to evaluate the implemented models over well-annotated lead sheet samples of music. A lead sheet is a form of musical notation that specifies the essential elements of a song—the melody, harmony, and where present, lyrics (liu18icmla). HTPD3 provides melody lines and accompanying chords specifying both chord symbol and harmonic function useful for our study. We consider 48 triad chords in this study, including major, minor, diminished, and augmented triad chords. We use the same training split of HTPD3 to train the implemented models and evaluate them on the same test split.
We employ six objective metrics for evaluating the performance of melody harmonization models. These metrics consider either the distribution of chord labels in a chord sequence, or how the generated chord sequence fits with the given melody. In addition, we conduct an online user study and collect the feedback from 202 participants around the world to assess the quality of the generated chordal accompaniment.
We discuss the findings of the comparative study, hoping to gain insights into the strengths and weaknesses of the evaluated methods. Moreover, we show that taking functional harmony (chen18ismir) into account while harmonizing melodies greatly improves the result of the model presented by (lim17).
In what follows, we present in Section 2 the models we consider and evaluate in this comparative study. Section 3 provides the details of the HTPD3 dataset we build for this study, and Section 4 the objective metrics we consider. Section 5 presents the setup and result of the study. We discuss the findings and limitations of this study in Section 6, and then conclude the paper in Section 7.
2 Automatic Melody Harmonization Models
A melody harmonization model takes a melody sequence of $T$ bars as input and generates a corresponding chord sequence as output. Chord Sequence is defined here as a series of chord labels $Y = y_1, y_2, \ldots, y_M$, where $M$ denotes the length of the sequence. In this work, each model predicts a chord label for every half bar, i.e., $M = 2T$. Each label $y_m$ is chosen from a finite chord vocabulary $C$. To reduce the complexity of this task, we consider here only the triad chords, i.e., chords composed of three notes. Specifically, we consider major, minor, diminished, and augmented triad chords, all in root position. We also consider No Chord (N.C.), or rest, so the size of the chord vocabulary is $|C| = 49$. Melody Sequence is a time-series of monophonic musical notes in MIDI format. We compute a sequence of features $X = x_1, x_2, \ldots, x_N$ to represent the melody and use them as the inputs to our models. Unless otherwise specified, we set $N = M$, computing a feature vector for each half bar.
Given a set of melody and corresponding chord sequences, a melody harmonization model $f(\cdot)$ can be trained by minimizing the loss computed between the ground truth $Y$ and the model output $\hat{Y} = f(X)$, where $X$ is the input melody.
We consider three non-deep learning based and two deep learning based models in this study. While the majority are adaptations of existing methods, one (deep learning based) is a novel method which we introduce in this paper (see Section 2.5). All models are carefully implemented and trained using the training split of HTPD3. We present the technical details of these models below.
2.1 Template Matching-based Model
This model is based on an early work on audio-based chord recognition (fujishima99).
The model segments training melodies into half-bars, and constructs a pitch profile for each segment. The chord label for a new segment is then selected based on the label for the training segment whose pitch profile it most closely matches.
When there is more than one possible chord template that has the highest matching score, we choose a chord randomly based on uniform distribution among the possibilities. We refer to this model as template matching-based as the underlying method compares the profile of a given melody segment with those of the template chords.
We use Fujishima’s pitch class profile (PCP) (fujishima99) as the pitch profile representing the melody and the chord, respectively, for each half bar. A PCP is a 12-dimensional feature vector where each element corresponds to the activity of a pitch class. The PCP for each of the chord labels is constructed by setting the elements corresponding to the pitch classes that are part of the chord to one, and all the others to zero. Because we consider only triad chords in this work, there will be exactly three ones in the PCP of a chord label for each half bar. The PCP for melody is constructed similarly, but additionally considering the duration of notes. Specifically, the activity of the $i$-th pitch class, i.e., $\mathrm{PCP}_i$, is set by the ratio of time the pitch class is active during the corresponding half bar.
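The PCP construction and the tie-breaking rule described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the inner product as the matching score, the `(root, quality)` chord encoding, and all function names are assumptions made for the example.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Semitone offsets from the root for the four triad qualities.
TRIAD_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7), "dim": (0, 3, 6), "aug": (0, 4, 8)}

Chord = Tuple[int, str]  # (root pitch class with C = 0, quality); hypothetical encoding


def chord_pcp(root: int, quality: str) -> List[float]:
    """Binary PCP with exactly three ones, one per chord tone."""
    pcp = [0.0] * 12
    for offset in TRIAD_INTERVALS[quality]:
        pcp[(root + offset) % 12] = 1.0
    return pcp


def melody_pcp(notes: Sequence[Tuple[int, float]], segment_length: float) -> List[float]:
    """Duration-weighted PCP for one half bar.

    `notes` holds (MIDI pitch, duration) pairs; durations use the same unit
    as `segment_length` (for example, beats).
    """
    pcp = [0.0] * 12
    for pitch, duration in notes:
        pcp[pitch % 12] += duration / segment_length
    return pcp


def best_chords(melody: Sequence[float], templates: Dict[Chord, List[float]]) -> List[Chord]:
    """Return every chord whose template ties for the highest score.

    The score is an inner product, which is an assumption of this sketch.
    """
    scores = {chord: sum(m * t for m, t in zip(melody, pcp)) for chord, pcp in templates.items()}
    top = max(scores.values())
    return sorted(chord for chord, score in scores.items() if abs(score - top) < 1e-9)


TEMPLATES = {(root, q): chord_pcp(root, q) for root in range(12) for q in TRIAD_INTERVALS}

# Half bar of two beats: C4 for one beat, then E4 for one beat.
segment = melody_pcp([(60, 1.0), (64, 1.0)], segment_length=2.0)
candidates = best_chords(segment, TEMPLATES)
chosen = random.choice(candidates)  # uniform choice among tied templates
```

In this example five templates contain both C and E (C major, A minor, and the C, E, and G# augmented triads), so the final label is drawn uniformly from those five.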
The results of this model are conservative by design, featuring intensive use of chord tones. Moreover, this model sets the chord label independently for each half bar, without considering the neighboring chord labels or the chord progression over time.
We note that, to remove the effect of the input representations on the harmonization result, we use PCP as the model input representation for all the other models we implement for melody harmonization.
2.2 HMM-based Model
An HMM is a probabilistic framework for modeling sequences with latent or hidden variables. Our HMM-based harmonization model regards chord labels as latent variables and estimates the most likely chord sequence for a given set of melody notes. Unlike the template matching-based model, this model considers the relationship between neighboring chord labels. HMM-based models similar to this one were widely used in chord generation and melody harmonization research before the current era of deep learning (simon08; raczynski13).
We adopt a simple HMM architecture employed in (lim17). This model makes the following assumptions:
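The observed melody sequence $X = x_1, \ldots, x_M$ is statistically biased due to the hidden chord sequence $Y = y_1, \ldots, y_M$, which is to be estimated.
$x_m$ depends on only $y_m$, $\forall m \in [1, M]$.
$y_m$ depends on only $y_{m-1}$, $\forall m \in [2, M]$.
The task is to estimate the most likely hidden sequence $\hat{Y}$ given $X$. This amounts to maximizing the posterior probability:
$$\hat{Y} = \arg\max_{Y} P(Y \mid X) = \arg\max_{Y} P(X \mid Y)\,P(Y) = \arg\max_{Y} \prod_{m=1}^{M} P(x_m \mid y_m)\,P(y_m \mid y_{m-1}),$$
where $P(y_1 \mid y_0)$ is equal to $P(y_1)$. The term $P(x_m \mid y_m)$ is also called the emission probability, and the term $P(y_m \mid y_{m-1})$ is called the transition probability. This optimization problem can be solved by the Viterbi algorithm (viterbi).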
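For readers unfamiliar with Viterbi decoding, the following is a minimal log-domain sketch for a generic HMM with integer-indexed states. The function and argument names are illustrative and are not taken from the implementation evaluated in this study.

```python
import math
from typing import List, Sequence


def log_table(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convert a table of strictly positive probabilities to log-probabilities."""
    return [[math.log(p) for p in row] for row in rows]


def viterbi(
    log_init: Sequence[float],
    log_trans: Sequence[Sequence[float]],
    log_emit: Sequence[Sequence[float]],
) -> List[int]:
    """Most likely state path.

    log_init[k]     = log P(y_1 = k)
    log_trans[j][k] = log P(y_m = k | y_{m-1} = j)
    log_emit[m][k]  = log P(x_m | y_m = k) for each time step m
    """
    n_states = len(log_init)
    score = [log_init[k] + log_emit[0][k] for k in range(n_states)]
    backpointers: List[List[int]] = []
    for m in range(1, len(log_emit)):
        new_score, pointers = [], []
        for k in range(n_states):
            # Best predecessor of state k at step m.
            best_prev = max(range(n_states), key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][k] + log_emit[m][k])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy example with two "sticky" states.
init = [math.log(0.5), math.log(0.5)]
trans = log_table([[0.9, 0.1], [0.1, 0.9]])
sticky_path = viterbi(init, trans, log_table([[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]))
switch_path = viterbi(init, trans, log_table([[0.99, 0.01], [0.01, 0.99]]))
```

In the first toy sequence the weak evidence for state 1 at the middle step is overridden by the transition probabilities, which is the kind of neighbor-aware smoothing that the template matching-based model cannot provide.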
Departing from the HMM in (lim17), our implementation uses the PCPs described in Section 2.1 to represent melody notes, i.e., to compute $P(x_m \mid y_m)$. Accordingly, we use multivariate Gaussian distributions to model the emission probabilities, as demonstrated by (sheh03). For each chord label, we set the covariance matrix of the corresponding Gaussian distribution to be a diagonal matrix, and calculate the mean and variance for each dimension from the PCP features of melody segments that are associated with that chord label in the training set.
To calculate the transition probabilities, we count the number of transitions between successive chord labels (i.e., bi-grams), then normalize those counts to sum to one for each preceding chord label. A uniform distribution is used when there is no bi-gram count for the preceding chord label. To avoid zero probabilities, we smooth the distribution by interpolating $P(y_m \mid y_{m-1})$ with the prior probability $P(y_m)$ as follows,
$$P'(y_m \mid y_{m-1}) = (1 - \beta)\,P(y_m \mid y_{m-1}) + \beta\,P(y_m),$$
yielding the revised transition probability $P'(y_m \mid y_{m-1})$. The hyperparameter $\beta$ is empirically set to 0.08 via experiments on a random 10% subset of the training set.
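The counting and smoothing procedure can be sketched as below. The sketch assumes that the interpolation weight (0.08) is applied to the prior and its complement to the bi-gram estimate; the function name and data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smoothed_transitions(
    sequences: Sequence[Sequence[str]],
    vocabulary: Sequence[str],
    beta: float = 0.08,
) -> Dict[str, Dict[str, float]]:
    """Bi-gram transition table interpolated with the unigram prior.

    Assumed form: P'(c | prev) = (1 - beta) * P(c | prev) + beta * P(c).
    """
    unigram: Counter = Counter()
    bigram: Counter = Counter()
    for seq in sequences:
        unigram.update(seq)
        bigram.update(zip(seq, seq[1:]))
    total = sum(unigram.values())
    prior = {c: unigram[c] / total for c in vocabulary}

    table: Dict[str, Dict[str, float]] = {}
    for prev in vocabulary:
        row_total = sum(bigram[(prev, c)] for c in vocabulary)
        row = {}
        for c in vocabulary:
            if row_total > 0:
                p = bigram[(prev, c)] / row_total
            else:
                p = 1.0 / len(vocabulary)  # no bi-gram seen for this preceding chord
            row[c] = (1.0 - beta) * p + beta * prior[c]
        table[prev] = row
    return table


table = smoothed_transitions([["C", "G", "C", "F"]], vocabulary=["C", "F", "G"])
```

Note that a chord which never appears in the training data still receives zero probability after a preceding chord with observed bi-grams, because its prior is also zero.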
2.3 Genetic Algorithm (GA)-based Model
A GA is a flexible algorithm that generally maximizes an objective function or fitness function. GAs have been used for melody generation and harmonization in the past (phon99; leo2016), justifying their inclusion in this study. A GA can be used in both rule-based and probabilistic approaches. In the former case, we need to design a rule set of what conditions must be satisfied for musically acceptable melodies or harmonies—the fitness function is formulated based on this rule set. In the latter, the fitness function is formulated based on statistics of a data set.
Here, we design a GA-based melody harmonization model by adapting the GA-based melody generation model proposed by (kitahara2018). Unlike the other implemented models, the GA-based model takes as input a computed feature vector for every 16th note (i.e., a quarter of a beat). Thus, the melody representation has a temporal resolution 8 times that of the chord progression (i.e., $N = 8M$). This means that $x_n$ and $y_{\lceil n/8 \rceil}$ point to the same temporal position.
Our model uses a probabilistic approach, determining a fitness function based on the following elements. First, the (logarithmic) conditional probability of the chord progression given the melody is represented as:
$$F_1(X, Y) = \sum_{n=1}^{N} \log P(y_{\lceil n/8 \rceil} \mid x_n),$$
where $\lceil \cdot \rceil$ is the ceiling function. The chord transition probability is computed as:
$$F_2(Y) = \sum_{m=3}^{M} \log P(y_m \mid y_{m-2}, y_{m-1}).$$
The conditional probability of each chord given its temporal position is defined as:
$$F_3(Y) = \sum_{m=1}^{M} \log P(y_m \mid \mathrm{Pos}_m),$$
where $\mathrm{Pos}_m$ is the temporal position of the chord $y_m$. For simplicity, we defined $\mathrm{Pos}_m = \mathrm{mod}(m, 8)$, where $\mathrm{mod}$ is the modulo function. With this term, the model may learn that the tonic chord tends to appear at the first half of the first bar, while the dominant (V) chord tends to occur at the second half of the second bar.
Finally, we use the entropy to evaluate a chord sequence’s complexity, which should not be too low, so as to avoid monotonous chord sequences. The entropy is defined as $E(Y) = -\sum_{c \in C} P(Y = c) \log P(Y = c)$. In the fitness function, we evaluate how likely this entropy is in a given data set, i.e., $F_4(Y) = \log P(E = E(Y))$.
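The fitness function is calculated as a weighted sum of the four terms above (the melody-conditioned chord probability, the chord transition probability, the positional probability, and the entropy likelihood):
$$F(Y) = w_1 F_1(X, Y) + w_2 F_2(Y) + w_3 F_3(Y) + w_4 F_4(Y).$$
We simply set all the weights $w_1, \ldots, w_4$ to 1.0 here.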
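A fitness function of this shape can be sketched as follows. The four probability models are passed in as callables because their estimation is outside the scope of the sketch; the use of the two preceding chords in the transition term, the 1-based position modulo 8, and every name below are assumptions for illustration. The genetic operators (selection, crossover, mutation) are not shown.

```python
import math
from collections import Counter
from typing import Callable, Sequence

STEPS_PER_CHORD = 8  # 16th-note melody frames per half-bar chord
CYCLE = 8            # chord positions repeat every 8 half bars


def chord_entropy(chords: Sequence[str]) -> float:
    """Entropy (natural log) of the chord label distribution within one sequence."""
    counts = Counter(chords)
    total = len(chords)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def fitness(
    chords: Sequence[str],
    melody: Sequence[object],
    log_p_melody: Callable[[str, object], float],
    log_p_trans: Callable[[str, str, str], float],
    log_p_pos: Callable[[str, int], float],
    log_p_entropy: Callable[[float], float],
    weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0),
) -> float:
    # Frame i (0-based) aligns with chord i // 8, i.e. ceil((i + 1) / 8) in 1-based terms.
    f1 = sum(log_p_melody(chords[i // STEPS_PER_CHORD], frame) for i, frame in enumerate(melody))
    # Transition term, assuming a model conditioned on the two preceding chords.
    f2 = sum(log_p_trans(chords[m - 2], chords[m - 1], chords[m]) for m in range(2, len(chords)))
    # Positional term with the 1-based chord index taken modulo 8.
    f3 = sum(log_p_pos(chord, (m + 1) % CYCLE) for m, chord in enumerate(chords))
    # Likelihood of the sequence entropy under a data-derived distribution.
    f4 = log_p_entropy(chord_entropy(chords))
    w1, w2, w3, w4 = weights
    return w1 * f1 + w2 * f2 + w3 * f3 + w4 * f4


# Dummy models that return log(0.5) everywhere, only to exercise the bookkeeping.
half = math.log(0.5)
score = fitness(
    chords=["C", "F", "G", "C"],
    melody=[0] * 32,
    log_p_melody=lambda chord, frame: half,
    log_p_trans=lambda a, b, c: half,
    log_p_pos=lambda chord, pos: half,
    log_p_entropy=lambda e: half,
)
```

With four chords and 32 melody frames, the dummy models contribute 32, 2, 4, and 1 terms respectively.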
2.4 Deep BiLSTM-based Model
This first deep learning model is adapted from the one proposed by (lim17), which uses BiLSTM layers. This model extracts contextual information from the melody sequentially from both the positive and negative time directions. The original model makes a chord prediction for every bar, using a vocabulary of only the major and minor triad chords (i.e., $|C| = 24$). We slightly extend this model such that the harmonic rhythm is a half bar, and the output chord vocabulary includes diminished and augmented chords, and the N.C. symbol (i.e., $|C| = 49$).
As shown in Figure 1, this model has two BiLSTM layers, followed by a fully-connected layer. Dropout (dropout) is applied with probability 0.2 at the output layer. This dropout rate, as well as the number of hidden layers and hidden units, are empirically chosen by maximizing the chord prediction accuracy on a random held-out subset of the training set. We train the model using minibatch gradient descent with categorical cross entropy as the cost function. We use Adam as the optimizer and regularize by early stopping at the 10-th epoch to prevent over-fitting.
2.5 Deep Multitask Model: MTHarmonizer
From our empirical observation on the samples generated by the aforementioned BiLSTM model, we find that the model has two main defects for longer phrases:
overuse of common chords: common chords like C, F, and G major are repeated and overused, making the chord progression monotonous.
incorrect phrasing: non-congruent phrasing between the melody and chords similarly results from the frequent occurrence of common chords. The resulting frequent occurrence of progressions like F→C or G→C in generated sequences implies a musical cadence in an unfit location, potentially bringing an unnecessary sense of ending in the middle of a chord sequence.
We propose an extension of the BiLSTM model to address these two defects. The core idea is to train the model to predict not only the chord labels but also the chord functions (chen18ismir), as illustrated in Figure 2. We call the resulting model a deep multitask model, or MTHarmonizer, since it deals with two tasks at the same time. We note that the use of the chord functions for melody harmonization has been found useful by (tsushima2018generative), using an HMM-based model.
Functional harmony elaborates the relationship between chords and scales, and describes how harmonic motion guides musical perception and emotion (chen18ismir). While a chord progression consisting of randomly selected chords generally feels aimless, chord progressions which follow the rules of functional harmony establish or contradict a tonality. Music theorists assign each scale degree a tonic, subdominant, or dominant function based on what chord is associated with that degree in a particular scale. This function explains what role a given scale degree, and its associated chord relative to the scale, plays in musical phrasing and composition. We briefly describe each of these functions below:
the tonic function serves to stabilize and reinforce the tonal center.
the subdominant function pulls a progression out of the tonal center.
the dominant function provides a strong sense of motion back to the tonal center. For example, a progression that moves from a dominant function scale degree chord to a tonic scale degree chord first creates tension, then resolves it.
As will be introduced in Section 3, all the pieces in HTPD3 are in either C major or c minor. Therefore, all chords share the same tonal center. We can directly map the chords into ‘tonic,’ ‘dominant,’ and ‘others’ (which includes the subdominant) functional groups, by name, without worrying about their relative functions in other keys, for other tonal centers. Specifically, we consider C, Am, Cm, A♭ as tonic chords, G and B diminished as dominant chords, and the others as subdominant chords.
We identify two potential benefits of adding chord functions to the target output. First, in contrast to the distribution of chord labels, the distribution of chord functions is relatively balanced, making it easier for the model to learn the chord functions. Second, as the chord functions and chord labels are interdependent, adding the chord functions as a target informs the model which chord labels share the same function and may therefore be interchangeable. We hypothesize that this multi-task learning will help our model learn proper functional progression, which in turn will produce better harmonic phrasing relative to the melody.
Specifically, the loss function of MTHarmonizer is defined as:
$$L = L_{\mathrm{chord}} + \gamma L_{\mathrm{function}} = H(f(X), Y_{\mathrm{chord}}) + \gamma\, H(g(X), Y_{\mathrm{function}}),$$
where $H(\cdot)$ denotes the categorical cross entropy function, $f(\cdot)$ the chord label prediction branch, and $g(\cdot)$ the chord function prediction branch. When $\gamma = 0$, the model reduces to the uni-task model proposed by (lim17), and we can simply write $Y_{\mathrm{chord}}$ as $Y$. In our work, we choose $\gamma$ to ensure that the loss values from $L_{\mathrm{chord}}$ and $L_{\mathrm{function}}$ are equally scaled. The two branches $f$ and $g$ share the two BiLSTM layers but not the fully-connected layer. Empirically, we found that if $\gamma$ is too small, the model will tend to harmonize the melody with the chords with tonic and dominant functions; the resulting chord sequences would therefore lack diversity.
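As a small numerical illustration of a two-term loss of this kind, the sketch below computes a categorical cross entropy for each branch and adds them with a weight on the function term. It operates on plain lists of predicted probabilities rather than on a neural network, and the weight value used in the example is hypothetical, not the one used in the study.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean categorical cross entropy (natural log) over time steps.

    probs[m][k] is the predicted probability of class k at step m;
    targets[m] is the index of the ground-truth class at step m.
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def multitask_loss(
    chord_probs: Sequence[Sequence[float]],
    chord_targets: Sequence[int],
    function_probs: Sequence[Sequence[float]],
    function_targets: Sequence[int],
    gamma: float,
) -> float:
    """Chord-label loss plus gamma times the chord-function loss."""
    chord_loss = cross_entropy(chord_probs, chord_targets)
    function_loss = cross_entropy(function_probs, function_targets)
    return chord_loss + gamma * function_loss


# One time step, two chord classes and two function classes (toy sizes).
chord_probs, chord_targets = [[0.5, 0.5]], [0]
function_probs, function_targets = [[0.25, 0.75]], [1]
uni_task = multitask_loss(chord_probs, chord_targets, function_probs, function_targets, gamma=0.0)
multi_task = multitask_loss(chord_probs, chord_targets, function_probs, function_targets, gamma=2.0)
```

Setting the weight to zero leaves only the chord-label term, which corresponds to the uni-task case.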
The outputs of $f$ and $g$ are likelihood values for each chord label and chord function given an input melody. As Figure 2 shows, in predicting the final chord sequence, we rely on a weighted combination of the outputs of $f$ and $g$ in the following way:
$$\hat{y}_m = \arg\max_{c \in C} \big( f(X)_{m,c} + \alpha_c\, h(g(X))_{m,c} \big),$$
where $h(\cdot)$ is simply a look-up table that maps the three chord functions to the chord labels, and $\alpha_c$ is a pre-defined hyperparameter that allows us to boost the importance of correctly predicting the chord function over that of correctly predicting the chord label, for each chord. In our implementation, we set a larger $\alpha_c$ for the subdominant chords than for all the other chords, to encourage the model to select chord labels that have lower likelihood to increase the overall diversity, yet without degrading the phrasing. This is because, in the middle of a musical phrase, the likelihood of observing a subdominant chord is more likely to be close to that of a tonic chord or a dominant chord. Emphasizing the subdominant chords by using a larger $\alpha_c$ would therefore have the chance to replace a tonic chord or a dominant chord by a subdominant chord. This is less likely to occur in the beginning or the end of a phrase, as the likelihood of observing subdominant chords there would tend to be low. As we will mainly “edit” the middle part of a chord sequence with subdominant chords, we would not compromise the overall chord progression and phrasing.
3 Proposed Dataset
For the purpose of this study, we first collect a new dataset called the Hooktheory Lead Sheet Dataset (HLSD), which consists of lead sheet samples scraped from the online music theory forum called TheoryTab, hosted by Hooktheory (https://www.hooktheory.com/theorytab), a company that produces pedagogical music software and books. The majority of lead sheet samples found on TheoryTab are user-contributed. Each piece contains high-quality, human-transcribed melodies alongside their corresponding chord progressions, which are specified by both literal chord symbols (e.g., Gmaj7), and chord functions (e.g., VI7) relative to the provided key. Chord symbols specify inversion if applicable, and the full set of chord extensions (e.g., #9, b11). The metric timing/placement of the chords is also provided. Due to copyright concerns, TheoryTab prohibits uploading full length songs. Instead, users upload snippets of a song (here referred to as lead sheet samples), which they voluntarily annotate with structural labels (e.g. “Intro,” “Verse,” and “Chorus”) and genre labels. A music piece can be associated with multiple genres.
HLSD contains 11,329 lead sheet samples, all in 4/4 time signature. It contains up to 704 different chord classes, which is deemed too many for the current study. We therefore take the following steps to process and simplify HLSD, resulting in the final HTPD3 dataset employed in the performance study.
We remove lead sheet samples that do not contain a sufficient number of notes. Specifically, we remove samples whose melodies comprise more than 40% rests (relative to their lengths). One can think of this as correcting class imbalance, another common issue for machine learning models: if the model sees too much of a single event, it may overfit and only produce or classify that event.
We then filter out lead sheets that are shorter than 4 bars or longer than 32 bars, so that $4 \le T \le 32$. This is done because 4 bars is commonly seen as the minimum length for a complete musical phrase in 4/4 time signature. At the other end, 32 bars is a common length for a full lead sheet, one that is relatively long. Hence, as the majority of our dataset consists of mere song sections, we choose not to include samples longer than 32 bars.
The HLSD provides the key signature of every sample. We transpose every sample to either C major or c minor based on the provided key signature.
In general, a chord label can be specified by the pitch class of its root note (among 12 possible pitch classes, i.e., C, C#, $\ldots$, B, in a chromatic scale), and its chord quality, such as ‘triad’, ‘sixths’, ‘sevenths’, and ‘suspended.’ HLSD contains 704 possible chord labels, including inversions. However, the distribution of these labels is highly skewed. In order to even out the distribution and simplify our task, we reduce the chord vocabulary by converting each label to its root position triad form, i.e., the major, minor, diminished, and augmented chords without 7ths or additional extensions. Suspended chords are mapped to the major and minor chords. As a result, only 48 chord labels (i.e., 12 root notes by 4 qualities) and N.C. are considered (i.e., $|C| = 49$).
We standardize the dataset so that a chord change can occur only every bar or every half bar.
We do admit that this simplification can decrease the chord color and reduce the intensity of tension/release patterns, and can sometimes convert a vibrant, subtle progression into a monotonous one (e.g., because both CMaj7 and C7 are mapped to C chord). We plan to make full use of the original chord vocabulary in future work.
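The two sample-level filters in the steps above (rest ratio and length) can be expressed compactly. The sketch below assumes a hypothetical representation in which a sample is described by its length in bars and the durations of its melody notes, also in bars; it does not cover transposition or chord simplification.

```python
from typing import Sequence


def keep_sample(
    num_bars: float,
    note_durations: Sequence[float],
    max_rest_ratio: float = 0.4,
    min_bars: int = 4,
    max_bars: int = 32,
) -> bool:
    """Return True if a lead sheet sample passes both filters.

    `note_durations` are the durations (in bars) of the monophonic melody
    notes, so their sum is the sounding time and the remainder is rest.
    """
    if not (min_bars <= num_bars <= max_bars):
        return False
    rest_ratio = 1.0 - sum(note_durations) / num_bars
    return rest_ratio <= max_rest_ratio


kept = keep_sample(8, [1.0] * 6)      # 25% rests, 8 bars
too_sparse = keep_sample(8, [1.0] * 4)  # 50% rests
too_short = keep_sample(2, [1.0, 1.0])
too_long = keep_sample(40, [1.0] * 40)
```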
Having pre-defined train and test splits helps to facilitate the use of HTPD3 for evaluating new models of melody harmonization via the standardization of the training procedure. As HTPD3 includes paired melody and chord sequences, it can also be used to evaluate models for chord-conditioned melody generation. With these use cases in mind, we split the dataset so that the training set contains 80% of the pieces, and the test set contains 10% of the pieces. There are in total 923 lead sheet samples in the test set. The remaining 10% is reserved for future use. When splitting, we imposed the additional requirement that lead sheet samples from the same song are in the same subset.
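One way to honor the same-song constraint is to split at the level of songs rather than samples, as in the sketch below. This is an assumed procedure for illustration, not the one used to build HTPD3; because songs contribute different numbers of samples, the resulting sample-level proportions are only approximately 80/10/10.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def split_by_song(
    samples: Sequence[Tuple[str, str]],
    seed: int = 0,
    train_fraction: float = 0.8,
    test_fraction: float = 0.1,
) -> Dict[str, List[str]]:
    """Split (sample_id, song_id) pairs so that no song spans two subsets."""
    by_song: Dict[str, List[str]] = defaultdict(list)
    for sample_id, song_id in samples:
        by_song[song_id].append(sample_id)

    songs = sorted(by_song)
    random.Random(seed).shuffle(songs)  # seeded for reproducibility
    n_train = round(train_fraction * len(songs))
    n_test = round(test_fraction * len(songs))

    groups = {
        "train": songs[:n_train],
        "test": songs[n_train:n_train + n_test],
        "reserved": songs[n_train + n_test:],
    }
    return {name: [s for song in members for s in by_song[song]] for name, members in groups.items()}


# Toy data: 10 songs with two lead sheet samples each.
toy_samples = [(f"sample{i}", f"song{i // 2}") for i in range(20)]
split = split_by_song(toy_samples)
```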
4 Proposed Objective Metrics
To our knowledge, there are at present no standardized, objective evaluation metrics for the melody harmonization task. The only objective metric adopted by (lim17) in evaluating the models they built is a categorical cross entropy-based chord prediction error, representing the discrepancy between the ground truth chords and predicted chords. The chord prediction error is calculated for each half bar individually and then averaged, not considering the chord sequence as a whole. In addition, it does not directly measure how the generated chord sequence fits with the given melody.
For the comparative study, we introduce here a set of six objective metrics defined below. These metrics are split into two categories, namely three chord progression metrics and three chord/melody harmonicity metrics. Please note that we do not evaluate the melody itself, as the melody is provided by the ground truth data.
Chord progression metrics evaluate each chord sequence as a whole, independent from the melody, and relate to the distribution of chord labels in a sequence.
Chord histogram entropy (CHE): Given a chord sequence, we create a histogram of chord occurrences with $|C|$ bins. Then, we normalize the counts to sum to 1, and calculate its entropy:
$$H = -\sum_{i=1}^{|C|} p_i \log p_i,$$
where $p_i$ is the relative probability of the $i$-th bin. The entropy is greatest when the histogram follows a uniform distribution, and lowest when the chord sequence uses only one chord throughout.
Chord coverage (CC): The number of chord labels with non-zero counts in the chord histogram in a chord sequence.
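Both histogram-based metrics can be computed directly from a list of chord labels, as in the following sketch. The natural logarithm is assumed, since the base is not stated here, and empty bins are skipped because they contribute nothing to the sum.

```python
import math
from collections import Counter
from typing import Sequence


def chord_histogram_entropy(chords: Sequence[str]) -> float:
    """Entropy of the normalized chord histogram (natural log assumed)."""
    counts = Counter(chords)
    total = len(chords)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def chord_coverage(chords: Sequence[str]) -> int:
    """Number of chord labels with a non-zero count."""
    return len(set(chords))


static = ["C", "C", "C", "C"]
varied = ["C", "F", "G", "C"]
che_static, cc_static = chord_histogram_entropy(static), chord_coverage(static)
che_varied, cc_varied = chord_histogram_entropy(varied), chord_coverage(varied)
```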
Chord tonal distance (CTD): The tonal distance proposed by (tonalDist) is a canonical way to measure the closeness of two chords. It is calculated by first computing the PCP features of the two chords, then projecting the PCP features to a derived 6-D tonal space, and finally calculating the Euclidean distance between the two 6-D feature vectors. CTD is the average value of the tonal distance computed between every pair of adjacent chords in a given chord sequence. The CTD is highest when there are abrupt changes in the chord progression (e.g., from C chord to B chord).
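A sketch of the CTD computation is given below. The projection constants (three circles with per-semitone angles of 7π/6, 3π/2, and 2π/3 and radii 1, 1, and 0.5, with L1 normalization of the PCP) are stated here from memory of the tonal centroid formulation and should be treated as assumptions to be checked against (tonalDist).

```python
import math
from typing import List, Sequence

# Assumed tonal centroid constants (fifths, minor thirds, major thirds circles).
ANGLES = (7 * math.pi / 6, 3 * math.pi / 2, 2 * math.pi / 3)
RADII = (1.0, 1.0, 0.5)


def tonal_centroid(pcp: Sequence[float]) -> List[float]:
    """Project a 12-D PCP onto a 6-D tonal space, normalized by its L1 norm."""
    norm = sum(abs(v) for v in pcp)
    if norm == 0:
        return [0.0] * 6
    centroid = []
    for radius, angle in zip(RADII, ANGLES):
        centroid.append(sum(radius * math.sin(pc * angle) * v for pc, v in enumerate(pcp)) / norm)
        centroid.append(sum(radius * math.cos(pc * angle) * v for pc, v in enumerate(pcp)) / norm)
    return centroid


def tonal_distance(pcp_a: Sequence[float], pcp_b: Sequence[float]) -> float:
    """Euclidean distance between two PCPs in the 6-D tonal space."""
    a, b = tonal_centroid(pcp_a), tonal_centroid(pcp_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chord_tonal_distance(chord_pcps: Sequence[Sequence[float]]) -> float:
    """Average tonal distance over adjacent chord pairs (0.0 for fewer than two chords)."""
    pairs = list(zip(chord_pcps, chord_pcps[1:]))
    if not pairs:
        return 0.0
    return sum(tonal_distance(a, b) for a, b in pairs) / len(pairs)


def triad_pcp(pitch_classes: Sequence[int]) -> List[float]:
    return [1.0 if pc in pitch_classes else 0.0 for pc in range(12)]


C_MAJOR = triad_pcp((0, 4, 7))
G_MAJOR = triad_pcp((7, 11, 2))
ctd_static = chord_tonal_distance([C_MAJOR, C_MAJOR, C_MAJOR])
ctd_moving = chord_tonal_distance([C_MAJOR, G_MAJOR, C_MAJOR])
```

The same `tonal_distance` helper could be applied to a one-hot melody-note PCP and a chord PCP when a melody-to-chord distance is needed.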
Chord/melody harmonicity metrics, on the other hand, aim to evaluate the degree to which a generated chord sequence successfully harmonizes a given melody sequence.
Chord tone to non-chord tone ratio (CTnCTR): In reference to the chord sequence, we count the number of chord tones and non-chord tones in the melody sequence. Chord tones are defined as melody notes whose pitch class is part of the current chord (i.e., one of the three pitch classes that make up a triad) for the corresponding half bar. All the other melody notes are viewed as non-chord tones. One way to measure the harmonicity is to simply compute the ratio of the number of the chord tones ($n_c$) to the number of the non-chord tones ($n_n$). However, we find it useful to further take into account the number of a subset of non-chord tones ($n_p$) that lie within two semitones of the note immediately following them, where the subscript $p$ denotes a “proper” non-chord tone. We define CTnCTR as
$$\mathrm{CTnCTR} = \frac{n_c + n_p}{n_c + n_n}.$$
CTnCTR equals one when there are no non-chord tones at all, or when $n_p = n_n$.
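The counting behind CTnCTR can be sketched as follows, assuming the ratio of chord tones plus proper non-chord tones to all melody notes, which is consistent with the two cases in which the metric equals one. The input layout, the treatment of the last note (never counted as proper, since nothing follows it), and the value returned for an empty melody are assumptions of this sketch.

```python
from typing import Sequence, Set, Tuple


def ctnctr(melody: Sequence[Tuple[int, int]], chords: Sequence[Set[int]]) -> float:
    """Chord tone to non-chord tone ratio.

    `melody` is a time-ordered list of (MIDI pitch, half-bar index) pairs and
    `chords[k]` is the set of pitch classes of the chord in half bar k.
    """
    n_c = n_n = n_p = 0
    for i, (pitch, segment) in enumerate(melody):
        if pitch % 12 in chords[segment]:
            n_c += 1
            continue
        n_n += 1
        # "Proper" non-chord tone: within two semitones of the next melody note.
        if i + 1 < len(melody) and abs(melody[i + 1][0] - pitch) <= 2:
            n_p += 1
    if n_c + n_n == 0:
        return 0.0  # empty melody; the value here is a convention of this sketch
    return (n_c + n_p) / (n_c + n_n)


C_MAJOR = {0, 4, 7}
# C4, D4, E4, F4 over a single C major half bar.
stepwise = ctnctr([(60, 0), (62, 0), (64, 0), (65, 0)], [C_MAJOR])
arpeggio = ctnctr([(60, 0), (64, 0), (67, 0)], [C_MAJOR])
```

In the stepwise example, D4 is a proper non-chord tone because it moves by two semitones to E4, while the final F4 is a non-chord tone with no following note.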
Pitch consonance score (PCS): For each melody note, we calculate a consonance score with each of the three notes of its corresponding chord label. The consonance scores are computed based on the musical interval between the pitch of the melody notes and the chord notes, assuming that the pitch of the melody notes is always higher. This is always the case in our implementation, because we always place the chord notes lower than the melody notes. The consonance score is set to 1 for consonant intervals including unison, major/minor 3rd, perfect 5th, and major/minor 6th, set to 0 for a perfect 4th, and set to –1 for other intervals, which are considered dissonant. PCS for a pair of melody and chord sequences is computed by averaging these consonance scores across 16th-note windows, excluding rest periods.
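The interval scoring can be sketched as below. Intervals are reduced modulo 12 (so compound intervals share the score of their simple form), and each window's score is taken as the mean over the three chord notes; both choices, along with the input layout, are assumptions made for this illustration.

```python
from typing import Optional, Sequence, Tuple

# Interval in semitones (mod 12) -> score: unison, m3, M3, P5, m6, M6 are consonant.
CONSONANT = {0, 3, 4, 7, 8, 9}
PERFECT_FOURTH = 5


def interval_score(melody_pitch: int, chord_pitch_class: int) -> int:
    """Score the interval from a chord note up to the melody note."""
    interval = (melody_pitch - chord_pitch_class) % 12
    if interval in CONSONANT:
        return 1
    if interval == PERFECT_FOURTH:
        return 0
    return -1


def pitch_consonance_score(
    melody_frames: Sequence[Optional[int]],
    chord_frames: Sequence[Tuple[int, int, int]],
) -> float:
    """Average consonance over 16th-note windows, skipping rests (None)."""
    scores = []
    for pitch, chord in zip(melody_frames, chord_frames):
        if pitch is None:
            continue
        scores.append(sum(interval_score(pitch, pc) for pc in chord) / len(chord))
    return sum(scores) / len(scores) if scores else 0.0


C_MAJOR = (0, 4, 7)
# Three 16th-note windows over C major: C4, a rest, then E4.
pcs = pitch_consonance_score([60, None, 64], [C_MAJOR] * 3)
```

For C4 over C major the three intervals are a unison, a minor 6th, and a perfect 4th, giving 2/3; for E4 all three intervals are consonant, giving 1.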
Melody-chord tonal distance (MCTD): Extending the idea of tonal distance, we represent a melody note by a PCP feature vector (which would be a one-hot vector) and compare it against the PCP of a chord label in the 6-D tonal space (tonalDist) to calculate the closeness between a melody note and a chord label. MCTD is the average of the tonal distance between every melody note and the corresponding chord label calculated across a melody sequence, with each distance weighted by the duration of the corresponding melody note.
5 Comparative Study
We train all the five models described in Section 2 using the training split of HTPD3 and then apply them to the test split of HTPD3 to get the predicted chord sequences for each melody sequence. Examples of the harmonization result of the evaluated models can be found in Figures 3 and 4.
The chord accuracies of the template matching-based, HMM-based, GA-based, BiLSTM-based, and MTHarmonizer models are 29%, 31%, 20%, 35%, and 38%, respectively. We note that, since one cannot judge the full potential of each algorithm only from our simplified setting of melody harmonization, we do not intend to find what method is the best in general. Rather, we attempt to compare different harmonization methods that have not been directly compared before because of the different contexts that each approach assumes.
In what follows, we use the harmonization result for a random subset of the test set comprising 100 pieces in a user study for subjective evaluation. The result of this subjective evaluation is presented in Section 5.1. Then, in Section 5.2, we report the results of an objective evaluation wherein we compute the mean values of the chord/melody harmonicity and chord progression metrics presented in Section 4 for the harmonization results for each test set piece.
5.1 Subjective Evaluation
We conducted an online survey where we invited human subjects to listen to and assess the harmonization results of different models. The subjects evaluated the harmonizations in terms of the following criteria:
Harmonicity: The extent to which a chord progression successfully or pleasantly harmonizes a given melody. This is designed to correspond to what the melody/chord harmonicity metrics described in Section 4 aim to measure.
Interestingness: The extent to which a chord progression sounds exciting, unexpected and/or generates “positive” stimulation. This criterion corresponds to the chord-related metrics described in Section 4. Please note that we use a less technical term “interestingness” here since we intend to solicit feedback from people either with or without musical backgrounds.
The Overall quality of the given harmonization.
Given a melody sequence, we have in total six candidate chord sequences to accompany it: those generated by the five models presented in Section 2, and the human-composed, ground-truth progression retrieved directly from the test set. We intend to compare the results of the automatically generated progression with the original human-composed progression. Yet, given the time and cognitive load required, it was not possible to ask each subject to evaluate the results of every model for every piece of music in the test set (there are sequences in total). We describe below how our user study is designed to make the evaluation feasible.
5.1.1 Design of the User Study
First, we randomly select 100 melodies from the test set of HTPD3. For each human subject, we randomly select three melody sequences from this pool, and present to the subject the harmonization results of two randomly selected models for each melody sequence. For each of the three melodies, the subject first listens to the melody without accompaniment, and then to the melody with two different harmonizations. Thus, the subject has to listen to nine music pieces in total: three melody sequences and six harmonized ones. As we have six methods for melody harmonization (including the original human-composed harmonization), we select the methods for each set such that each method is presented once and only once to each subject. The subjects are not aware of which harmonization is generated by which method, but are informed that at least one of the harmonized sequences is human-composed.
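The assignment described above can be sketched as follows. This is a minimal illustration, not the survey's actual implementation: the method labels, the function name `assign_sets`, and the use of integer melody IDs are assumptions made for the example. Shuffling the six methods and pairing consecutive entries guarantees that each method appears exactly once per subject.

```python
import random

# Hypothetical labels for the six sources compared in the study
# (five models plus the original human-composed progression).
METHODS = ["template", "hmm", "ga", "bilstm", "mtharmonizer", "human"]

def assign_sets(melody_pool, rng):
    """Draw three melodies and pair the six methods so each appears exactly once."""
    melodies = rng.sample(melody_pool, 3)   # three distinct melodies per subject
    order = METHODS[:]                      # copy, so the constant is untouched
    rng.shuffle(order)                      # random permutation of the six methods
    # Consecutive pairs of the permutation become the two harmonizations of a set.
    return [(melodies[i], (order[2 * i], order[2 * i + 1])) for i in range(3)]

rng = random.Random(0)                      # fixed seed only for reproducibility
sets = assign_sets(list(range(100)), rng)   # melody IDs 0..99 stand in for the pool
```

Because the pairs are read off a uniformly random permutation, every pair of methods is equally likely to share a set under this sketch.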
In each set, the subject has to listen to the two harmonized sequences and decide which version is better according to the three criteria mentioned earlier. This ranking task is mandatory. In addition, the subject can choose to further grade the harmonized sequences on a five-point Likert scale with respect to the criteria mentioned earlier. Here, we break “harmonicity” into the following two criteria in order to get more feedback from subjects:
Coherence: the coherence between the melody and the chord progression in terms of harmonicity and phrasing.
Chord Progression: how coherent, pleasant, or reasonable the chord progression is on its own, independent of the melody.
This optional rating task thus has four criteria in total.
The user study opens with an “instructions” page that informs the subjects that we consider only root-positioned triad chords in the survey. Moreover, they are informed that there is no “ground truth” in melody harmonization, as the task is by nature subjective. After collecting a small amount of relevant personal information from the subjects, we present them with a random audio sample and encourage them to put on their headsets and adjust the volume to a comfortable level. After that, they are prompted to begin evaluating the three sets (i.e., one set for each melody sequence), one by one on consecutive pages.
We spread the online survey over the Internet openly, without restriction, to solicit voluntary, non-paid participation. The webpage of the survey can be found at [URL removed for double-blind review].
[Figure panels: (a) Harmonicities; (b) Interestingness]
5.1.2 User Study Results
In total, 202 participants from 16 countries took part in the survey. We had more male participants than female (ratio of 1.82:1), and the average age of participants was 30.8 years. 122 participants indicated that they have a music background, and 69 of them are familiar with or have expertise in harmonic theory. The participants took on average 14.2 minutes to complete the survey.
We performed the following two data cleaning steps: First, we discarded both the ranking and rating results from participants who spent less than 3 minutes to complete the survey, which is considered too short. Second, we disregarded rating results when the relative ordering of the methods contradicted that from the ranking results. As a result, 9.1% and 21% of the ranking and rating records were removed, respectively.
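The two cleaning steps can be sketched as below. The record layout (`seconds`, `winner`, `rating_a`, `rating_b`) is hypothetical, and the sketch assumes that a rating contradicts the ranking only when the ranked winner is rated strictly lower than the loser, so ties are kept; the text does not specify how ties were handled.

```python
MIN_SECONDS = 3 * 60  # from the text: sessions under 3 minutes are discarded

def rating_contradicts_ranking(record):
    """True if the ranked winner received a strictly lower rating than the loser."""
    a, b = record["rating_a"], record["rating_b"]
    if record["winner"] == "A":
        return a < b
    return b < a

def clean(records):
    """Apply the two cleaning steps; return (kept_rankings, kept_ratings)."""
    rankings, ratings = [], []
    for r in records:
        if r["seconds"] < MIN_SECONDS:
            continue                      # step 1: drop both ranking and rating
        rankings.append(r)
        has_rating = r["rating_a"] is not None and r["rating_b"] is not None
        if has_rating and not rating_contradicts_ranking(r):
            ratings.append(r)             # step 2: keep only consistent ratings
    return rankings, ratings

# Toy records (made up for illustration, not survey data).
records = [
    {"seconds": 120, "winner": "A", "rating_a": 4, "rating_b": 2},        # too fast
    {"seconds": 600, "winner": "A", "rating_a": 2, "rating_b": 4},        # contradictory
    {"seconds": 600, "winner": "B", "rating_a": 3, "rating_b": 3},        # tie, kept
    {"seconds": 600, "winner": "A", "rating_a": None, "rating_b": None},  # no rating given
]
rankings, ratings = clean(records)
```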
We first discuss the results of the pairwise ranking task, which are shown in Figure 5. The following observations are made:
The human-composed progressions have the highest “win probabilities” on average in all three ranking criteria. They perform particularly well in Harmonicity.
In general, the deep learning methods have higher probabilities to win over the non-deep learning methods in Harmonicity and Overall.
For Interestingness, GA performs the best among the five automatic methods, which we suspect stems from its entropy term (Eq. (6)).
Among the two deep learning methods, the MTHarmonizer consistently outperforms the BiLSTM in all ranking criteria, especially for Interestingness. We (subjectively) observe that MTHarmonizer indeed generates more diverse chord progressions compared to the vanilla BiLSTM, perhaps due to the consideration of functions.
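One plausible way to obtain such win probabilities from pairwise rankings is sketched below. It assumes that the win probability of method A over method B is the fraction of A-versus-B comparisons that A won, and that the per-method average is taken over the opponents it actually faced; the function names and the toy data are illustrative, not taken from the study.

```python
from collections import Counter

def pairwise_win_probabilities(comparisons):
    """comparisons: list of (winner, loser) pairs for a single criterion."""
    wins = Counter(comparisons)                # (winner, loser) -> number of wins
    probs = {}
    for a, b in list(wins):
        total = wins[(a, b)] + wins[(b, a)]    # Counter yields 0 for unseen pairs
        probs[(a, b)] = wins[(a, b)] / total
        probs[(b, a)] = wins[(b, a)] / total
    return probs

def mean_win_probability(probs, method):
    """Average win probability of `method` over the opponents it was compared with."""
    values = [p for (a, _), p in probs.items() if a == method]
    return sum(values) / len(values)

# Toy comparisons (made up for illustration).
comparisons = [("human", "ga"), ("human", "ga"), ("ga", "human"), ("ga", "hmm")]
probs = pairwise_win_probabilities(comparisons)
ga_mean = mean_win_probability(probs, "ga")
```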
The results of the rating task shown in Figure 6, on the other hand, lead to the following observations:
Congruent with the results of the ranking task, the MTHarmonizer model achieves the second best performance here, only losing out to the original human-composed chord progressions. The MTHarmonizer consistently outperforms the other four automatic methods in all four metrics. With a paired t-test, we find that there is a significant performance difference between the MTHarmonizer progressions and the original human-composed progressions in terms of Coherence and Chord Progression (p-value < 0.005), but no significant difference in terms of Interestingness and Overall.
Among the four metrics, the original human-composed progressions score higher in Coherence (3.81) and Overall (3.78), and the lowest in Interestingness (3.43). This suggests that the way we simplify the data (e.g., using only root-positioned triad chords) may have limited the perceptual qualities of the music, in particular its diversity.
Generally speaking, the results in Chord Progression (i.e., the coherence of the chord progression on its own) seem to correlate better with the results in Coherence (i.e., the coherence between the melody and chord sequences) than with the Interestingness of the chord progression. This suggests that a chord progression rated as being interesting may not sound coherent.
Although the GA performs worse than the MTHarmonizer on all the four metrics, it actually performs fairly well in Interestingness (3.23), as we have observed from the ranking result. A paired t-test showed no significant performance difference between the GA generated progressions and original human-composed progressions in Interestingness. A hybrid model that combines GA and deep learning may be a promising direction for future research.
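For reference, the statistic behind a paired t-test can be computed as in the sketch below. How the ratings were paired in the study is not stated here, so the two lists are simply assumed to hold matched scores for two methods; the numbers are made up. The p-value follows from a Student t distribution with the returned degrees of freedom (for instance via `scipy.stats.ttest_rel`, which this sketch deliberately avoids so that it stays dependency-free).

```python
import math
import statistics

def paired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for matched samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]      # per-pair differences
    sd = statistics.stdev(diffs)               # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Toy matched ratings on a five-point scale (illustrative only).
scores_a = [4, 5, 3, 4, 5]
scores_b = [3, 4, 3, 2, 4]
t_value, dof = paired_t_statistic(scores_a, scores_b)
```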
Table 1: Objective evaluation results.

| Melody/chord harmonicity metrics | CTnCTR | PCS | MCTD |
|---|---|---|---|
| HMM (adapted from (lim17)) | 0.89 | 1.93 | 0.85 |
| GA-based (adapted from (kitahara2018)) | 0.74 | 0.43 | 1.31 |
| BiLSTM (adapted from (lim17)) | 0.87 | 1.84 | 0.91 |
| MTHarmonizer (proposed here) | 0.82 | 1.77 | 0.94 |

| Chord progression metrics | CHE | CC | CTD |
|---|---|---|---|
| HMM (adapted from (lim17)) | 0.88 | 1.89 | 0.56 |
| GA-based (adapted from (kitahara2018)) | 1.58 | 2.47 | 0.96 |
| BiLSTM (adapted from (lim17)) | 1.07 | 2.07 | 0.71 |
| MTHarmonizer (proposed here) | 1.29 | 2.31 | 1.02 |
From the rating and ranking tasks, we see that, in terms of harmonicity, automatic methods still fall behind the human composition. However, the results of the two deep learning-based methods are closer to those of the human-composed progressions.
5.2 Objective Evaluation
The results are displayed in Table 1. We discuss the results of the melody/chord harmonicity metrics first. We can see that the results for the two deep learning methods are in general closer to the results for the original human-composed progressions than those of the three non-deep learning methods for all three harmonicity metrics, most significantly on the latter two. The template matching-based and HMM-based methods score high in PCS and low in MCTD, indicating that the harmonizations these two methods generate may be too conservative. In contrast, the GA scores low in PCS and high in MCTD, indicating overly low harmonicity. These results are consistent with the subjective evaluation, suggesting that these metrics can perhaps reflect human perception of the harmonicity between melody and chords.
From the result of the chord progression metrics, we also see from CHE and CC that the progressions generated by the template matching-based and HMM-based methods seem to lack diversity. In contrast, the output of GA features high diversity.
As the GA based method was rated lower than the template matching and HMM methods in terms of the Overall criterion in our subjective evaluation, it seems that the subjects care more about the harmonicity than the diversity of chord progressions.
Comparing the two deep learning methods, we see that the MTHarmonizer uses more non-chord tones (smaller CTnCTR) and uses a greater number of unique chords (larger CC) than the BiLSTM model. The CHE of the MTHarmonizer is very close to that of the original human-composed progressions.
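To make the two diversity-related quantities concrete, the sketch below computes a chord-histogram entropy and a count of unique chords for a single progression. It assumes that CHE is the Shannon entropy of the chord histogram and, following the text, that CC counts the unique chords used; the natural logarithm and the per-progression scope are assumptions, and the authoritative definitions are those of Section 4.

```python
import math
from collections import Counter

def chord_histogram_entropy(chords):
    """Shannon entropy (natural log) of the chord label histogram."""
    counts = Counter(chords)
    total = len(chords)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def chord_coverage(chords):
    """Number of distinct chord labels used in the progression."""
    return len(set(chords))

# Toy progressions (illustrative only).
repetitive = ["C", "C", "C", "C"]
varied = ["C", "F", "G", "Am"]
che_rep, cc_rep = chord_histogram_entropy(repetitive), chord_coverage(repetitive)
che_var, cc_var = chord_histogram_entropy(varied), chord_coverage(varied)
```

Under these assumptions, a progression that repeats a single chord has zero entropy and a coverage of one, while a progression that uses four chords equally often reaches the maximum entropy for four labels.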
In general, the results of the objective evaluation appear consistent with those of the subjective evaluation. It is difficult to quantify which metrics are better for what purposes, and how useful and accurate these metrics are overall. Therefore, our suggestion is to use them mainly to gain practical insights into the results of automatic melody harmonization models, rather than to judge their quality. As pointed out by (musegan), objective metrics can be used to track the performance of models during development, before committing to running the user study. Yet, human evaluations are still needed to evaluate the quality of the generated music.
We admit that the comparative study presented above has some limitations. First, because of the various preprocessing steps taken for data cleaning and for making the melody harmonization task manageable (cf. Section 3), the “human-composed” harmonizations are actually simplified versions of those found on TheoryTab. We considered triad chords only, and we did not consider performance-level attributes such as velocity and rhythmic pattern of chords. This limits the perceptual quality of the human-composed chord progressions, and therefore also limits the results that can be achieved by automatic methods. The reduction from extended chords to triads reduces the “color” of the chords and creates many inaccurate chord repetitions in the dataset (e.g., alternating CMaj7 and C7 chords are both reduced to the C triad). We believe it is important to properly inform the human subjects of such limitations, as we did in the instruction phase of our user study. We plan to compile other datasets from HLSD to extend the comparative study in the future.
Second, in our user study we asked human subjects to rank and rate the results of two randomly chosen methods in each of the three presented sets. After analyzing the results, we found that the subjects’ ratings are in fact relative. For example, the MTHarmonizer’s average score in Overall is 3.04 when presented alongside the human-composed progressions, and 3.57 when presented alongside the genetic algorithm-based model. We made sure in our user study that all the methods are equally likely to be presented together with every other method, so the average rating scores presented in Figure 6 do not favor a particular method. Still, caution is needed when interpreting the rating scores. Humans may not have a clear idea of how to consistently assign a score to a harmonization. While it is certainly easier to compare multiple methods objectively with the provided rating scores, we still recommend asking human subjects to make pairwise rankings in order to make the result more reliable.
Third, we note that the HMM used in this study implements only the essential functions and does not include extensions that improve the model, such as tying the probabilities, using tri-grams, or extending the hidden layers, which have been widely discussed in the literature (paiement2006pmh; temperley2009unified; tsushima17ismir). Our aim is to observe how the essential functions of the HMM characterize the harmonization results, rather than to explore the full potential of HMM-based models.
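The relativity of ratings noted above can be inspected by grouping each method's scores by the method it was presented with. The sketch below is only illustrative: the tuple layout `(method, opponent, score)` and the toy scores are assumptions, not survey data.

```python
from collections import defaultdict
from statistics import mean

def mean_rating_by_opponent(ratings):
    """ratings: iterable of (method, opponent, score) tuples for one criterion."""
    grouped = defaultdict(list)
    for method, opponent, score in ratings:
        grouped[(method, opponent)].append(score)
    # Average score of each method, conditioned on what it was presented with.
    return {key: mean(scores) for key, scores in grouped.items()}

# Toy ratings (made up for illustration).
toy = [
    ("mtharmonizer", "human", 3),
    ("mtharmonizer", "human", 3),
    ("mtharmonizer", "ga", 4),
    ("mtharmonizer", "ga", 3),
]
by_opponent = mean_rating_by_opponent(toy)
```

A gap between the per-opponent means of the same method, as in the toy data, is the kind of context effect that motivates relying on pairwise rankings.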
Reviewing the properties of harmonization algorithms that imitate the styles in a dataset, as in our research, remains important, although recent music generation research is shifting towards measuring how well systems can generate content that extrapolates meaningfully from what the model has learned (zacharakis18me). Extrapolation could build on a model that also achieves interpolation or maintains particular styles among data points. We believe that extrapolation can be discussed further on the basis of an understanding of how methods imitate data.
In this paper, we have presented a comparative study implementing and evaluating a number of canonical methods and one new method for melody harmonization, including deep learning and non-deep learning based approaches. The evaluation has been done using a lead sheet dataset we newly collected for training and evaluating melody harmonization. In addition to conducting a subjective evaluation, we employed in total six objective metrics with which to evaluate a chord progression given a melody. Our evaluation shows that deep learning models indeed perform better than non-deep learning ones in a variety of aspects, including harmonicity and interestingness. Moreover, a deep learning model that takes the function of chords into account reaches the best result among the evaluated models.