Classical Music Composition Using State Space Models

Algorithmic composition of music has a long history and with the development of powerful deep learning methods, there has recently been increased interest in exploring algorithms and models to create art. We explore the utility of state space models, in particular hidden Markov models (HMMs) and variants, in composing classical piano pieces from the Romantic era and consider the models' ability to generate new pieces that sound like they were composed by a human. We find that the models we explored are fairly successful at generating new pieces that have largely consonant harmonies, especially when trained on original pieces with simple harmonic structure. However, we conclude that the major limitation in using these models to generate music that sounds like it was composed by a human is the lack of melodic progression in the composed pieces. We also examine the performance of the models in the context of music theory.


1 Introduction

The objective of constructing algorithms to compose music, or of composing music with minimal human intervention, has a long history. Indeed, in the 18th century a musical game called Musikalisches Würfelspiel was developed that took small music fragments and combined them at random, often by tossing dice (Cope, 1996). The first musical composition completely generated by a computer was the “Illiac Suite”, produced from 1955 to 1956 (Nierhaus, 2009). A wide variety of probabilistic models and learning algorithms have been utilized for automated composition, including hidden Markov models (HMMs) to compose jazz improvisations, generative grammars to generate folk song rounds and artificial neural networks to harmonize existing melodies (Nierhaus, 2009). Additional methods of algorithmic composition include the use of transition networks by Cope (1991) and multiple viewpoint systems by Whorley and Conklin (2016). Dirst and Weigend (1993) explored data driven approaches to analyzing and completing Bach’s last, unfinished fugue, and Meehan (1980) discussed efforts to develop models for tonal music theory using artificial intelligence.

Artificial neural networks and deep learning methods have recently been used for automated music generation. Specifically, recurrent neural networks (RNNs) were used for algorithmic composition by Mozer (1994) and Eck and Schmidhuber (2002), and Boulanger-Lewandowski et al. (2012) used variants of RNNs for modeling polyphonic (multiple-voice) music. Hadjeres and Pachet (2016) developed the interactive DeepBach system, which used Long Short-Term Memory (LSTM) units to harmonize Bach chorales with no musical knowledge built into the system. Johnson (2015) explored the use of RNNs for composing classical piano pieces. There was recently a concert in London featuring work composed by RNNs that sought to emulate the styles of known composers (Vincent, 2016). Lastly, Google’s Magenta project (Magenta, 2016; Google Developers, 2017) has been focused on generating art and media through deep learning and machine learning and maintains an open source code repository and blog detailing its efforts.

In this paper we explore using probabilistic time series models, specifically variations of hidden Markov models and time varying autoregression models, for algorithmic composition of piano pieces from the Romantic era. We focus on these probabilistic time series models rather than RNNs because they are easy to implement and are able to model complex time series. Focusing on piano pieces from the Romantic era allows us to develop automated metrics on how well the generated compositions compare to the original pieces with respect to originality, musicality, and temporal structure. It would be very challenging to develop metrics that would make sense across the wide variety of musical genres.

While HMMs have also been utilized for a variety of musical analysis and algorithmic composition tasks, most applications have focused on music classification and the harmonization of melodic lines, with applications to the composition of melodies more limited. A survey of previous work using Markov models for music composition appears in Ames (1989). More recently, Suzuki and Kitahara (2014) used Bayesian networks to harmonize soprano melodies, Weiland et al. (2005) modeled the pitch structure of Bach chorales with Hierarchical HMMs, Pikrakis et al. (2006) used variable duration HMMs to classify musical patterns, Allan and Williams (2004) used HMMs to harmonize Bach chorales, and Ren et al. (2010) explored the use of HMMs with time varying parameters to model and analyze musical pieces.

The main goal and contribution of this paper is to use classical probabilistic time series models to generate piano pieces from the Romantic era that sound like they were composed by a human, and to develop metrics to compare the generated compositions with the originals with respect to originality, musicality, and temporal structure. Specifically, we explore the dissonance or consonance of the harmonies produced by the generated piece as compared to the original piece and the extent of global structure and melodic progression in the generated pieces.

The paper is organized as follows. In Section 2, we discuss the representation of the music used in our models, specify the probabilistic models, and describe the musical pieces used to train the models. In Section 3, we describe the metrics we developed to assess the quality of the automated compositions. Finally, in Section 4, we discuss the quality of the automated compositions using both the quantitative metrics we developed and a qualitative survey of listeners, including musicians. We also relate the performance of the generated compositions to some ideas in music theory. We close with conclusions and future work.

2 Methods and Models

Automated generation of compositions requires a representation of the music that can be generated by time series models, specifically variations on hidden Markov models (HMMs). In this section we specify the representation of the music we used, the 14 time series models we used to generate compositions, and the 10 piano pieces from the Romantic period we used as training data.

2.1 Music Representation

Two standard representations of compositions are sheet music and the Musical Instrument Digital Interface (MIDI) format. MIDI is a data communications protocol that allows for the exchange of information between software and musical equipment through a digital representation of the data that is needed to generate the musical sounds (Rothstein, 1992). Sheet music is the standard representation used by composers and musicians, by which musical pieces are actually performed. The automated compositions we generate can be output in both MIDI and sheet music formats. However, we do not use the MIDI or sheet music formats in our statistical models, either in generating compositions or for learning compositions from example pieces. We consider a composition as a symbolic sequence of note pitches over discrete fixed time steps, where the discrete time steps are usually one sixteenth note or one eighth note in length and the pitches are encoded as integers following the MIDI note numbering, which ranges from 0 to 127. This representation allows us to model both single notes and chords. See Figure 1 for an example of transforming sheet music to our pitch representation. There is open source code (Walker, 2008) to convert from MIDI formats to our pitch representation and back. Walker (2008) provides additional explanation about the exact format of the MIDI and CSV converted files. GarageBand software (Apple) is used to convert pieces in MIDI format to the equivalent representation in sheet music.

50 62 66 50 50 62 66 64 67 50 50
Figure 1: The first bar of sheet music from Beethoven’s Ode to Joy is displayed at the top. The equivalent symbolic representation for the first bar of Beethoven’s Ode to Joy is displayed below. A specific note’s duration is the difference in time stamps from the first occurrence of that pitch (the note is “turned on”) to when the same note pitch next occurs (the note is “turned off”). For example, the first note pitch is 50 (corresponding to a D) and next occurs three pitches and one time stamp later, indicating that this D has a duration of one quarter note (highlighted in red above).
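The on/off convention described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `note_durations`, the `(time_stamp, pitch)` event layout and the toy events are hypothetical and are not taken from the converter of Walker (2008).

```python
def note_durations(events):
    """Pair each 'note on' with the next occurrence of the same pitch ('note off').

    events: list of (time_stamp, pitch) tuples sorted by time stamp.
    Returns a list of (pitch, onset, duration) tuples.
    """
    active = {}   # pitch -> time stamp at which the note was turned on
    notes = []
    for time_stamp, pitch in events:
        if pitch in active:
            onset = active.pop(pitch)        # second occurrence turns the note off
            notes.append((pitch, onset, time_stamp - onset))
        else:
            active[pitch] = time_stamp       # first occurrence turns the note on
    return notes


# Hypothetical toy events (time stamps in quarter notes), not the paper's data.
toy_events = [(0, 50), (0, 62), (0, 66), (1, 50), (1, 62), (1, 66)]
print(note_durations(toy_events))
```

In this toy input each pitch is turned on at time stamp 0 and off at time stamp 1, so every note lasts one quarter note.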

2.2 Time Series Models

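The time series models considered in this paper are all variations of HMMs (Rabiner, 1989) with the observable variables given by our pitch representation. We will denote an observed sequence as $x_{1:T} = (x_1, \ldots, x_T)$. The likelihood for the basic HMM is

$$p(x_{1:T} \mid \theta, \pi) = \sum_{z_{1:T}} \pi(z_1)\, p(x_1 \mid z_1, \theta) \prod_{t=2}^{T} p(z_t \mid z_{t-1}, \theta)\, p(x_t \mid z_t, \theta),$$

where $z_{1:T} = (z_1, \ldots, z_T)$ are the hidden states, the model parameters $\theta$ specify the emission probabilities $p(x_t \mid z_t, \theta)$ and the transition probabilities $p(z_t \mid z_{t-1}, \theta)$, and the initial distribution is $\pi$. The standard approach to infer model parameters is the Baum-Welch Algorithm (Baum, 1972).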
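The sum over hidden paths in an HMM likelihood is usually evaluated with the forward recursion rather than by enumeration. The sketch below is a minimal, unscaled version for a discrete HMM, checked against brute-force enumeration on a toy model. The function names and the toy parameters are assumptions made for illustration, and a practical implementation would rescale or work in log space to avoid underflow on long pieces.

```python
from itertools import product


def hmm_likelihood(obs, init, trans, emit):
    """Forward recursion for p(obs) under a discrete HMM.

    init[i]     : probability that the first hidden state is i
    trans[i][j] : probability of moving from hidden state i to j
    emit[i][x]  : probability that hidden state i emits symbol x
    """
    n_states = len(init)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n_states)]
    for x in obs[1:]:
        alpha = [
            emit[j][x] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)


def brute_force_likelihood(obs, init, trans, emit):
    """Sum the joint probability over every hidden path (tiny models only)."""
    total = 0.0
    for path in product(range(len(init)), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total


# Toy two-state model over three symbols (illustrative values only).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 2, 1]
print(hmm_likelihood(obs, init, trans, emit))
print(brute_force_likelihood(obs, init, trans, emit))
```

The forward recursion costs on the order of $T K^2$ operations for $K$ hidden states, whereas the enumeration grows as $K^T$.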

In addition to the standard HMM, we considered a variety of statistically richer Markov models. More complicated and longer range dependence structures were modeled using higher-order Markov models (k-HMMs), where k specifies the order of dependence, and Autoregressive HMMs (ARHMMs). Time varying transition probabilities were modeled via Left-Right HMMs (LRHMMs), hidden semi-Markov models (HSMMs), and Non-stationary HMMs (NSHMMs). More complex hierarchical state structure was modeled using two hidden state HMMs (TSHMMs), factorial HMMs (FHMMs), and Layered HMMs (LHMMs). Time-Varying Autoregression (TVAR) models were explored to model the non-stationarity and periodicity of the note pitches for the original pieces. We provide a brief summary of each of the models below.

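k-HMMs specify the state transition probabilities based on the current state and the previous $(k-1)$ states rather than just the current state. For example, writing $x_{1:T}$ for the observed sequence and $z_{1:T}$ for the hidden states, the likelihood of a 2-HMM can be specified as

$$p(x_{1:T} \mid \theta, \pi) = \sum_{z_{1:T}} \pi(z_1)\, p(x_1 \mid z_1, \theta)\, p(z_2 \mid z_1, \theta)\, p(x_2 \mid z_2, \theta) \prod_{t=3}^{T} p(z_t \mid z_{t-1}, z_{t-2}, \theta)\, p(x_t \mid z_t, \theta).$$

The Baum-Welch algorithm for parameter inference can be easily extended to these higher order models (Mari and Schott, 2001).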

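ARHMMs are extremely common in time series applications (Rabiner, 1989; Guan et al., 2016) and in our setting allow each note $x_t$ to depend not just on the hidden state $z_t$ at time $t$ but also on the previous observed note $x_{t-1}$. The likelihood for the ARHMM model is

$$p(x_{1:T} \mid \theta, \pi) = \sum_{z_{1:T}} \pi(z_1)\, p(x_1 \mid z_1, \theta) \prod_{t=2}^{T} p(z_t \mid z_{t-1}, \theta)\, p(x_t \mid z_t, x_{t-1}, \theta).$$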

LRHMMs adapt standard HMMs by constraining the state transition matrix to be upper triangular. This constraint allows for time varying transitions as the chain starts at an initial state and traverses a set of intermediate states and terminates at a final state without being able to go “backwards.”
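As an illustration of the left-right constraint, the sketch below draws a random upper-triangular, row-stochastic transition matrix. The function name and the random initialization scheme are assumptions rather than the procedure used in the paper. Transition probabilities that start at zero stay at zero under Baum-Welch re-estimation, so the constraint is commonly imposed through the initial matrix.

```python
import random


def random_left_right_transitions(n_states, seed=0):
    """Draw a random upper-triangular transition matrix whose rows sum to one."""
    rng = random.Random(seed)
    matrix = []
    for i in range(n_states):
        # Only transitions to the same state or to a later state get positive weight.
        weights = [rng.uniform(0.1, 1.0) if j >= i else 0.0 for j in range(n_states)]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


for row in random_left_right_transitions(4):
    print([round(p, 3) for p in row])
```

The last row places all of its mass on the final state, which matches the description of a chain that terminates at a final state without going backwards.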


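HSMMs explicitly model time varying transitions by modeling the duration for each hidden state. This relaxes the stationarity condition of standard HMMs and specifies a semi-Markov chain (Yu, 2010; Murphy, 2002). The motivation for HSMMs in music is that we might expect a piece to remain in the same state over the course of a few bars, for example over a motif, and the HSMM allows us to model this variable state duration. The likelihood for a hidden semi-Markov model with $N$ state segments can be specified as

$$p(x_{1:T} \mid \theta, \pi) = \sum_{z_{1:N},\, d_{1:N}} \pi(z_1)\, p(d_1 \mid z_1) \prod_{t=t_1}^{t_1+d_1-1} p(x_t \mid z_1, \theta) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, \theta)\, p(d_n \mid z_n) \prod_{t=t_n}^{t_n+d_n-1} p(x_t \mid z_n, \theta),$$

where $d_n$ is a random count that determines the duration of state $z_n$, $T = \sum_{n=1}^{N} d_n$ is the length of the observed sequence $x_{1:T}$, and $t_n$ is the index of the first observation generated from hidden state $z_n$, with $t_1 = 1$ and $t_n = t_{n-1} + d_{n-1}$ for $n > 1$. Note that in this model the process generating the sequence $d_{1:N}$ also needs to be specified and its parameters need to be inferred.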
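The generative reading of an explicit-duration HSMM can be sketched as follows: draw a state, draw how long to stay in it, emit one pitch per time step, then move to a new state. The duration tables, the toy parameters and the function name are hypothetical, and the paper does not state which duration distribution was used.

```python
import random


def sample_hsmm(length, init, trans, emit, durations, seed=0):
    """Sample pitches and the hidden state path from an explicit-duration HSMM.

    init[i]      : initial state probabilities
    trans[i][j]  : state transition probabilities (no self-transitions here)
    emit[i]      : dict mapping pitch -> emission probability in state i
    durations[i] : dict mapping duration in time steps -> probability in state i
    """
    rng = random.Random(seed)

    def draw(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]

    states = list(range(len(init)))
    state = rng.choices(states, weights=init)[0]
    pitches, path = [], []
    while len(pitches) < length:
        for _ in range(draw(durations[state])):   # remain in the state for d steps
            pitches.append(draw(emit[state]))
            path.append(state)
        state = rng.choices(states, weights=trans[state])[0]
    return pitches[:length], path[:length]


# Deterministic toy model: state 0 lasts 2 steps, state 1 lasts 3 steps.
toy_init = [1.0, 0.0]
toy_trans = [[0.0, 1.0], [1.0, 0.0]]
toy_emit = [{60: 1.0}, {67: 1.0}]
toy_durations = [{2: 1.0}, {3: 1.0}]
print(sample_hsmm(8, toy_init, toy_trans, toy_emit, toy_durations))
```

With these deterministic toy tables the state path alternates between two steps in state 0 and three steps in state 1, which makes the role of the duration variable easy to see.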

NSHMMs also model time varying state durations explicitly by specifying the duration of state transition probabilities as functions of time (Djuric and Chun, 1999; Sin and Kim, 1995). The model specification and properties of NSHMMs are very similar to those of HSMMs; however, inference in NSHMMs can be more tractable than in HSMMs (Djuric and Chun, 1999; Sin and Kim, 1995; Djuric and Chun, 2002). We use the Markov chain Monte Carlo (MCMC) algorithm specified in Djuric and Chun (2002) for parameter inference.

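TSHMMs impose a hierarchical structure on the hidden states by considering two hidden state sequences $r_{1:T}$ and $s_{1:T}$, where $s_t$ is conditionally dependent on both its previous state $s_{t-1}$ and the current state $r_t$, while $r_t$ is only dependent on its previous state $r_{t-1}$. For an observed sequence $x_{1:T}$, the likelihood for this model is

$$p(x_{1:T} \mid \theta, \pi) = \sum_{r_{1:T}} \sum_{s_{1:T}} \pi(r_1)\, \pi(s_1 \mid r_1)\, p(x_1 \mid s_1, \theta) \prod_{t=2}^{T} p(r_t \mid r_{t-1}, \theta)\, p(s_t \mid s_{t-1}, r_t, \theta)\, p(x_t \mid s_t, \theta).$$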

A derivation of the Baum-Welch Algorithm for parameter inference for this model is given in Appendix A.


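FHMMs allow for a distributed representation for the hidden states (Ghahramani and Jordan, 1997). An example of a factorial HMM is $M$ independent Markov chains that update the hidden states, with the observation distribution specified by a function of the hidden states. For an observed sequence $x_{1:T}$, the FHMM model we specified is

$$p(x_{1:T} \mid \theta, \pi) = \sum_{z^{(1)}_{1:T}, \ldots, z^{(M)}_{1:T}} \prod_{m=1}^{M} \left[ \pi^{(m)}\big(z^{(m)}_1\big) \prod_{t=2}^{T} p\big(z^{(m)}_t \mid z^{(m)}_{t-1}, \theta\big) \right] \prod_{t=1}^{T} p(x_t \mid \bar{z}_t, \theta),$$

where $z^{(m)}_{1:T}$ is the sequence of the hidden states for the $m$-th chain and the observation process is based on the average of the hidden states,

$$\bar{z}_t = \frac{1}{M} \sum_{m=1}^{M} z^{(m)}_t.$$

In our case $M = 3$ and the number of states in each chain may vary. In our application the FHMM allows us to model different dynamic processes independently and combine them to form a generated piece of music.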

LHMMs are complex models based on stacking HMMs and at each layer a standard HMM is learned on top of the previous layers (Oliver et al., 2004). The lowest level is a standard HMM. Once this HMM is trained the most likely sequence of hidden states is calculated and then used as observations for the HMM in the next layer. We considered three layers of HMMs, all with the same number of possible hidden states.

TVAR models (Prado and West, 2010) are able to model non-stationarity and longer dependence between note pitches over time. From both a musical and modeling perspective, we expect the note pitches for each piece to be volatile over time. The TVAR model of order $p$ for a sequence $x_{1:T}$ is specified as

$$x_t = \sum_{j=1}^{p} \phi_{t,j}\, x_{t-j} + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma_t^2),$$

where the variance $\sigma_t^2$ is time varying, the order $p$ can also be time varying, and the parameters $\phi_t = (\phi_{t,1}, \ldots, \phi_{t,p})$ are also time varying. We used forward filtering to fit each TVAR model, and backwards sampling (Prado and West, 2010) was used to generate new pieces. The order and discount factors of the TVAR models were selected by a grid search to maximize the marginal likelihood. The orders of the TVAR models ranged from 7 to 14, depending on the training piece. Since the note pitch values were discrete (rather than continuous) and only certain note pitches occurred in the original piece, the generated note pitches were binned to the nearest pitch that occurred in the original piece.
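The binning step for the real-valued TVAR draws can be sketched as a nearest-neighbor lookup over the pitches that occur in the training piece. The function name and the tie-breaking rule (ties go to the lower pitch) are assumptions, since the text does not say how ties were resolved.

```python
def snap_to_observed(value, observed_pitches):
    """Return the observed pitch closest to a real-valued draw.

    Ties are resolved in favor of the lower pitch (an assumption).
    """
    candidates = sorted(set(observed_pitches))
    return min(candidates, key=lambda pitch: abs(pitch - value))


original = [50, 62, 66, 50, 50, 62, 66, 64, 67, 50, 50]   # pitches from Figure 1
generated = [63.2, 40.7, 66.9, 63.0]                       # hypothetical TVAR draws
print([snap_to_observed(v, original) for v in generated])
```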

We considered 15 models, M1-M15, to generate compositions: model M1 was a standard HMM with 25 hidden states, model M2 was a 2-HMM with 25 hidden states, model M3 was a 3-HMM with 10 hidden states, model M4 was an LRHMM with 25 hidden states, model M5 was a 2-LRHMM with 25 hidden states, model M6 was a 3-LRHMM with 10 hidden states, model M7 was an ARHMM with 25 hidden states, model M8 was an HSMM with 25 hidden states, model M9 was an NSHMM with 25 hidden states, model M10 was a TSHMM with 10 hidden states in the first layer and 5 in the second layer, model M11 was a TSHMM with 5 hidden states in the first layer and 10 in the second layer, model M12 was an FHMM with three independent HMMs with 15, 10 and 5 hidden states respectively, model M13 was an LHMM with three layers and 25 hidden states in each layer, model M14 was a TVAR with order between 7 and 14, and model M15 was a baseline model specified by an HMM with randomly assigned parameters. Parameter inference for all the models except M14 and M15 was executed by the appropriate Baum-Welch algorithm, run until convergence.

2.3 Romantic Era Compositions

All 14 models were trained on ten piano pieces from the Romantic period, downloaded from (2017); Krueger (2016) and MIDIworld (2009). The pieces were in MIDI format which was converted to our music representation of note pitches. In addition to the note pitch, there was a relative time-stamp associated with each observation. The time stamps at which note pitch observations occurred were largely regular in time. However, to simplify modeling, we assumed all observations equally spaced in time. Additionally, some time stamps had multiple note pitches occurring at the same time, indicating a musical chord. However, we assumed that the note pitches were a univariate time series and treated each observation as sequential in time, whether or not subsequent observations occurred at the same time stamp.

All training pieces considered were either originally composed for piano or were arranged for piano and were selected to have a range of keys, forms, meters, tempos and composers. The Romantic period of music lasted from the end of the eighteenth century to the beginning of the twentieth century. This period was marked by a large amount of change in politics, economics and society. In general, music from the Romantic era tended to value emotion, novelty, technical skill and the sharing of ideas between the arts and other disciplines (Warrack, 1983). The ten original training pieces, as well as their keys and time signatures are listed in Table 1 below.

Composer | Piece | Key | Time Signature
Beethoven | Piano Sonata No. 14 (Moonlight Sonata), 1st Movement | C# minor | 3/4
Chopin | Piano Sonata No. 2, 3rd Movement (Marche funebre) | B♭ minor | 4/4
Mussorgsky | Pictures at an Exhibition, Promenade - Gnomus | B♭ major | 6/4
Mendelssohn | Song without Words Book 1, No. 6 | G minor | 6/8
Mendelssohn | Song without Words Book 5, No. 3 | A minor | 4/4
Liszt | Hungarian Rhapsody, No. 2 | C# minor | 4/4
Tchaikovsky | The Seasons, November - Troika Ride | E major | 4/4
Beethoven | Ode to Joy (Hymn Tune) | D major | 4/4
Hopkins | We Three Kings (Hymn Tune) | G major | 3/4
Mendelssohn | Hark! The Herald Angels Sing (Hymn Tune) | F major | 4/4
Table 1: Summary of the Romantic era piano training pieces modeled, including the composer, key and time signature of each piece.

3 Evaluation Metrics

In this section we propose a set of metrics to compare the generated compositions with the original compositions in terms of originality, musicality, and temporal structure. The two main concepts in music theory that our metrics need to capture are harmony and melody; see Laitz (2003) and Gauldin (2004) for details on music theory. Melody is a sequence of tones that are perceived as a single entity and is often considered a combination of pitch and rhythm. We do not summarize the music theory behind melody in this paper, as our metrics for melody, or temporal structure, do not require it. On the other hand, we require some knowledge of the music theory underlying harmony to describe the metrics we use for musicality. Harmony refers to the arrangement of chords and pitches and the relation between chords and pitches.

3.1 Harmony

Romantic era music is tonal, meaning that the music is oriented around and pulled towards the tonic pitch. For this paper, we primarily consider a high-level view of the intervals and chords present in the generated pieces. Briefly, there are two types of musical intervals, the melodic interval, where two notes are sounded sequentially in time, and the harmonic interval, where two or more notes are sounded simultaneously. Intervals can be either consonant or dissonant and consonant intervals can further be categorized as perfect or imperfect consonances.

Consonant intervals are stable intervals that do not require a resolution. Perfect consonances are the most stable intervals, while imperfect consonances are only moderately stable. However, neither type of consonance requires a resolution. Dissonant intervals, on the other hand, are unstable and in Romantic era music need to be resolved to a stable, consonant interval. Dissonant intervals sound incomplete and transient, and through the resolution to a stable, consonant interval, the phrase sounds complete.

In Romantic era music, very few, if any, dissonant intervals are left unresolved. Dissonance serves to add interest and variation to the music and is often used to build tension in a phrase or melody. The resolution to a consonant interval relieves this tension and restores the phrase to stability.

A chord is a combination of three or more different pitches, and the predominant chords occurring in the Romantic era were the triad and the seventh chord. The triad is composed of a root pitch and the third and fifth above the root, while a seventh chord is the root, third, fifth and seventh. Chords can be closed or open depending on the spacing between pitches in the higher and lower registers.

For examples of dissonant and consonant intervals, as well as various chords, see

3.2 Metrics

The metrics we used in this paper capture how well the generated pieces conform to characteristics of the Romantic era. The metrics fall into three broad categories: originality, musicality and temporal structure. The originality metrics measured how complex the generated pieces were, as well as how different the generated pieces were from the original training pieces. The musicality metrics primarily attempted to measure the harmonic aspects of the generated pieces, while the temporal structure metrics attempted to measure the melodic qualities of the generated pieces.

Originality Metrics: The three originality metrics are based on information theory and distances between strings. The first metric is the empirical entropy of a musical piece, either a training piece or a generated piece. Empirical entropy was proposed by Coffman (1992) as a measure of musical predictability. Simple hymn tunes like Beethoven’s Ode to Joy have lower empirical entropy than complex, unpredictable pieces like Liszt’s Hungarian Rhapsody, No. 2. The entropies of the training pieces are given in Table 2.
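A minimal sketch of an empirical entropy computation is given below. It assumes the entropy is taken over the relative frequencies of the note pitches in a piece; the logarithm base and the function name are also assumptions, since the text does not state them.

```python
import math
from collections import Counter


def empirical_entropy(pitches, base=2.0):
    """Entropy of the empirical pitch distribution of a piece."""
    counts = Counter(pitches)
    n = len(pitches)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


first_bar = [50, 62, 66, 50, 50, 62, 66, 64, 67, 50, 50]   # pitches from Figure 1
print(round(empirical_entropy(first_bar), 3))
```

A piece that uses a few pitches very often has a concentrated pitch distribution and therefore a low entropy, while a piece that spreads its notes evenly over many pitches has a high entropy.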

The second metric was mutual information (Cover and Thomas, 2006), which captured the relative entropy of the generated piece with respect to the training piece. The idea is that generated pieces that have greater mutual information with respect to the training pieces are considered less original. For a given statistical model M we compute the mutual information between the posterior of the training sequence and the generated sequence. The third metric, the minimum edit distance, also measures the dissimilarity between the generated sequence and the training sequence, but it is not based on a probability model. The minimum edit distance is the minimum number of insertions, deletions and substitutions necessary to transform one string into another (Jurafsky and Martin, 2009).
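The minimum edit distance can be computed with the standard dynamic programming recursion. The sketch below assumes unit cost for insertions, deletions and substitutions, which matches the definition in the text, and treats each pitch as one symbol of the string; the function name is illustrative.

```python
def min_edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions (unit costs)."""
    prev = list(range(len(target) + 1))      # distances from the empty prefix
    for i, x in enumerate(source, start=1):
        curr = [i]
        for j, y in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete x
                curr[j - 1] + 1,             # insert y
                prev[j - 1] + (x != y),      # substitute x with y, free if equal
            ))
        prev = curr
    return prev[-1]


print(min_edit_distance([50, 62, 66], [50, 64, 66, 67]))
```

Only two rows of the table are kept at a time, so memory grows with the length of one sequence rather than with the product of both lengths.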

Composer | Piece | Entropy
Beethoven | Ode to Joy (Hymn Tune) | 2.328
Hopkins | We Three Kings (Hymn Tune) | 2.521
Chopin | Piano Sonata No. 2, 3rd Movement (Marche funebre) | 2.759
Mendelssohn | Hark! The Herald Angels Sing (Hymn Tune) | 2.780
Mendelssohn | Song without Words Book 5, No. 3 | 2.897
Beethoven | Piano Sonata No. 14 (Moonlight Sonata), 1st Movement | 3.000
Tchaikovsky | The Seasons, November - Troika Ride | 3.063
Mendelssohn | Song without Words Book 1, No. 6 | 3.227
Liszt | Hungarian Rhapsody, No. 2 | 3.436
Mussorgsky | Pictures at an Exhibition, Promenade - Gnomus | 3.504
Table 2: Empirical entropy for the ten original training pieces considered, ordered from lowest to highest entropy.

Musicality Metrics: We considered three metrics to capture harmonic aspects of the generated pieces. The first metric was a normalized count of dissonant melodic and harmonic intervals. The minor second, major second, minor seventh and major seventh intervals are considered dissonant (Gauldin, 2004). The number of dissonant melodic and harmonic intervals in the generated piece were counted and normalized by the length of the piece. We expected the amount of dissonance in the generated pieces to be similar to the amount of dissonance in the original training piece. In particular, if the original piece contained no dissonance or very little dissonance (as was the case for all of the pieces considered, particularly the hymn tunes), we did not want a lot of dissonance in the generated piece, as unresolved dissonance was not common in Romantic era pieces.
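A normalized dissonance count of this kind can be sketched as follows. The sketch works on the univariate pitch sequence and compares consecutive pitches only; reducing compound intervals modulo an octave and the function names are assumptions, and separating harmonic from melodic intervals would additionally require the time stamps.

```python
# Semitone classes of the minor second, major second, minor seventh and major seventh.
DISSONANT_CLASSES = {1, 2, 10, 11}


def is_dissonant(pitch_a, pitch_b):
    """True if the interval between two pitches, reduced to one octave, is dissonant."""
    return abs(pitch_a - pitch_b) % 12 in DISSONANT_CLASSES


def dissonance_rate(pitches):
    """Number of dissonant consecutive intervals divided by the length of the piece."""
    if not pitches:
        return 0.0
    hits = sum(is_dissonant(a, b) for a, b in zip(pitches, pitches[1:]))
    return hits / len(pitches)


print(dissonance_rate([60, 62, 64, 67]))   # two major seconds, then a minor third
```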

The next musicality metric was a normalized count of large intervals. Following the metrics considered in Whorley and Conklin (2016) and Suzuki and Kitahara (2014) and general composition practice of the Romantic era, we did not expect to see many notes in either the melodic line or the bass line that were more than an octave apart from the previous note in their line. We counted the number of such large interval jumps that occurred in the generated piece and normalized the count by the length of the piece. The original pieces tended to have few, if any, melodic or harmonic intervals larger than an octave.

Finally we considered the distribution of note pitches. The occurrence of each unique pitch throughout the length of the piece was counted and normalized by the number of total notes in the piece. We expected pitches that were used less in the original piece to also be less prevalent in the generated pieces.
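The large interval count and the pitch distribution can be sketched in the same style. The sketch ignores the separation into melodic and bass lines and simply compares consecutive pitches, which is a simplification; the function names and the default threshold of 12 semitones (one octave) are assumptions.

```python
from collections import Counter


def large_interval_rate(pitches, max_semitones=12):
    """Number of consecutive jumps larger than an octave, divided by the piece length."""
    if not pitches:
        return 0.0
    jumps = sum(abs(b - a) > max_semitones for a, b in zip(pitches, pitches[1:]))
    return jumps / len(pitches)


def pitch_distribution(pitches):
    """Relative frequency of each unique pitch, keyed by pitch in ascending order."""
    counts = Counter(pitches)
    return {pitch: count / len(pitches) for pitch, count in sorted(counts.items())}


print(large_interval_rate([50, 62, 75, 74]))
print(pitch_distribution([50, 50, 62, 66]))
```

Comparing the pitch distribution of a generated piece with that of its training piece then amounts to comparing two dictionaries over the same set of pitches.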

Temporal Structure Metrics: We used measures of the decay of correlations in a time series to capture the amount of temporal structure in a musical piece, a metric that captured a sense of melody. The first metric we considered was the autocorrelation function (ACF) of a sequence (Prado and West, 2010). The ACF is the correlation coefficient between an observation $x_t$ at time $t$ and the observation $x_{t-h}$ at a lag $h$:

$$\rho(h) = \frac{\mathrm{Cov}(x_t, x_{t-h})}{\sqrt{\mathrm{Var}(x_t)\, \mathrm{Var}(x_{t-h})}}.$$

The idea behind the second metric considered, the partial autocorrelation function (PACF) (Prado and West, 2010), is to measure the correlation between $x_t$ and $x_{t-h}$ once the influence of $x_{t-1}, \ldots, x_{t-h+1}$ has been removed:

$$\alpha(h) = \mathrm{Corr}(x_t, x_{t-h} \mid x_{t-1}, \ldots, x_{t-h+1}).$$

A generated piece with a high degree of global structure and melody would be expected to have ACF and PACF plots with some structure out to high lags. The ACF and PACF were calculated for each generated piece out to lag 40. ACF and PACF plots for Beethoven’s Ode to Joy are given in Figure 4.
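The sample ACF of a pitch sequence can be sketched directly from its definition. The estimator below is the usual one that normalizes by the lag-zero sum of squares; the function name is an assumption, and the series must be non-constant and longer than the largest lag.

```python
def acf(series, max_lag=40):
    """Sample autocorrelation of a sequence for lags 0, 1, ..., max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)      # lag-zero sum of squares
    return [
        sum(centered[t] * centered[t + h] for t in range(n - h)) / denom
        for h in range(max_lag + 1)
    ]


# A strictly alternating toy sequence has strongly negative correlation at lag 1.
toy = [60, 67] * 10
print([round(r, 3) for r in acf(toy, max_lag=3)])
```

A sequence with a repeating pattern keeps sizeable autocorrelations at high lags, whereas a sequence without temporal structure has autocorrelations near zero beyond lag 0.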


After the considered model converged, 1000 new pieces were sampled from the learned model using the appropriate generative description of each model and the root mean squared error (RMSE) for each metric (except mutual information and edit distance) was calculated. The RMSE was primarily used to rank the generated pieces, to select the top generated pieces for evaluation by human listeners and to gain insight into some general trends observed in the generated pieces. The RMSE can be calculated as

$$\mathrm{RMSE}(m) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(m_i - m_0\right)^2},$$

where $m$ is the considered metric, $m_i$ is the value of the metric for the $i$-th generated piece, $m_0$ is the value of the metric for the original piece and $N$ is the number of generated pieces, in this case $N = 1000$.
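The RMSE of a metric across the generated pieces, measured against the value of the same metric for the original piece, can be sketched as below. The function name and the toy values are illustrative and are not results from the paper.

```python
import math


def rmse(generated_values, original_value):
    """Root mean squared deviation of a metric from its value on the original piece."""
    n = len(generated_values)
    return math.sqrt(sum((v - original_value) ** 2 for v in generated_values) / n)


# Hypothetical entropies of four generated pieces against an original value of 2.5.
print(rmse([2.4, 2.6, 2.5, 2.7], 2.5))
```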

The RMSE was calculated for the musicality and temporal structure metrics, as well as for the empirical entropy, but not for the mutual information or minimum edit distance. For the mutual information and minimum edit distance, both metrics were calculated for each of the 1000 generated pieces with respect to the original training piece, then the average of these 1000 values was taken to use for comparison between models and training pieces, in lieu of the RMSE.

4 Results on Compositions

Our overarching goal was to generate new compositions that recapitulated several well-defined harmonic and melodic characteristics of Romantic music. We were also interested in two questions about automated compositions: (a) which time series models are best at generating Romantic era compositions, and (b) which original compositions are more amenable to generating realistic Romantic era pieces, and whether music theory offers some insight into what properties of these pieces may drive the variance. Ultimately, judging the quality of a generated composition requires evaluation by human listeners. However, it is not feasible to provide pieces generated by all models and all original pieces to listeners, so we used the quantitative metrics described in the previous section to select the top compositions that were presented to human listeners.

The Romantics valued emotion and virtuosity in their music so the generated compositions needed to elicit some emotional response in human listeners. To evaluate subjective elements of the generated pieces, such as Romantic era style and the human-like qualities of the generated pieces, several human listeners of varying musical backgrounds listened to and evaluated several generated pieces. This was the ultimate test of our generated compositions.

In this section we first outline the evaluation procedures for both the numerical metrics and the human listeners. We then discuss in detail analysis and summaries that arose from the numerical and human evaluations. We close the section by relating some of our observations from the evaluations to music theory.

4.1 Evaluation Procedure

Numerical Evaluation

Compositions were generated for all ten original pieces and models M1-M14. Given the large number of piece and model combinations (140), the purpose of the numerical evaluation was to select the top compositions with respect to originality, musicality or harmony, and temporal structure or melody. Originality was assessed by the RMSE of the empirical entropy, harmony by the average RMSE of the musicality metrics, and melody by the average RMSE of the temporal structure metrics. In each case the composition with the minimum RMSE was preferred.

Listening Evaluation

The top three pieces were evaluated by sixteen human listeners of varying musical backgrounds. Eight of the individuals were currently in a musical ensemble and were considered “more” musically knowledgeable, while the other eight individuals were not currently in a musical ensemble. Each individual was told that all three of the pieces had been generated by a computer using statistical models trained on an original piano piece from the Romantic era. The pieces were labeled A, B and C, so the listeners did not know a priori the original training piece for each generated piece. In addition to ranking the three pieces in order from their favorite to least favorite, each individual was asked:

  1. What did you like and not like about each piece?

  2. Did any of the pieces sound like they were composed by a human?

  3. If so, why and if not, why not?

  4. Other general comments.

4.2 Analysis of Numerical Evaluation

Top Scoring Pieces

The top three pieces according to the numerical metrics were Mendelssohn’s Hark! The Herald Angels Sing modeled by a layered HMM, Beethoven’s Ode to Joy modeled by a first order HMM, and Chopin’s Piano Sonata No. 2, 3rd Movement (Marche funebre) modeled by a layered HMM; see Table 3 for the RMSE scores. These top three pieces were used for the human evaluations. The Chopin Piano Sonata had the lowest RMSE with respect to both the temporal structure metrics and the empirical entropy. Because it was already selected for temporal structure, the Mendelssohn piece, which had the second lowest RMSE with respect to empirical entropy, was selected for originality. The Beethoven piece had the lowest RMSE for the musicality metrics.

| Training Piece | Model | Entropy RMSE | Average Musicality RMSE | Average Temporal Structure RMSE |
|---|---|---|---|---|
| Ode to Joy | First Order HMM | 0.042 | 0.018 | 0.142 |
| Marche funebre | Layered HMM | 0.021 | 0.019 | 0.049 |
| Hark! The Herald Angels Sing | Layered HMM | 0.022 | 0.027 | 0.075 |

Table 3: Summary of the entropy, musicality average, and temporal structure average RMSE values for the three pieces selected for the listening evaluation.
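The selection rule described above can be sketched as a greedy pass over the three criteria, taking the lowest-RMSE piece not already chosen for each. The values are those reported in Table 3; the function name `select_top` and the order in which the criteria are visited are assumptions made for this illustration.

```python
# RMSE values from Table 3: (entropy, average musicality, average temporal structure).
scores = {
    "Ode to Joy / First Order HMM": (0.042, 0.018, 0.142),
    "Marche funebre / Layered HMM": (0.021, 0.019, 0.049),
    "Hark! The Herald Angels Sing / Layered HMM": (0.022, 0.027, 0.075),
}

def select_top(scores, criteria_order):
    """For each criterion in turn, pick the lowest-RMSE piece not yet selected."""
    chosen = []
    for idx in criteria_order:
        ranked = sorted(scores, key=lambda name: scores[name][idx])
        pick = next(name for name in ranked if name not in chosen)
        chosen.append(pick)
    return chosen

# Assumed order: temporal structure (index 2), musicality (1), then entropy (0).
top_three = select_top(scores, [2, 1, 0])
```

With this ordering the Chopin piece is taken for temporal structure, the Beethoven piece for musicality, and the Mendelssohn piece for entropy because the Chopin piece is already included.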

General Trends

The numerical metrics suggested some general trends in model performance. The layered HMM had one of the lowest RMSE values for all of the metrics considered and the generated pieces tended to have minimal dissonance and the most structure of the pieces generated by the different models. The layered HMM also produced compositions with the lowest edit distance and highest mutual information as compared to the original piece. Pieces generated by the layered HMMs were the most similar to the original piece, which led to more pleasing musical results, but also pieces that might be considered too similar to the original piece.
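To make the edit distance comparison concrete, here is a minimal sketch of the standard Levenshtein distance between two pitch sequences. Whether the metric above uses exactly this unit-cost variant is an assumption, and the function name is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

A generated piece that copies long stretches of its training piece yields a small distance, which is consistent with the concern that layered HMM output may be too similar to the original.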

The simplest model, the first order HMM, also tended to perform well in terms of the musicality measures. The pieces generated by the first order HMM were not too dissonant, did not have many large melodic or harmonic intervals and had pitch distributions similar to that of the original piece. However, the pieces generated by the first order HMM did not have as much global structure as those generated by the layered HMM. The two hidden state HMM, under either of the two settings considered, tended to repeat the same note for a long time; its pieces thus sounded repetitive and had pitch distributions that were quite different from that of the original piece. The “random HMM” tended to produce very dissonant pieces that obviously lacked any kind of long-term structure and especially suffered from several large harmonic and melodic intervals. These models thus tended to perform poorly on all metrics, especially the large interval metric.
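One simple way to quantify the large interval behaviour is the fraction of consecutive melodic intervals wider than an octave. The sketch below assumes pitches are MIDI note numbers and a threshold of 12 semitones; the exact definition of the large interval metric is not restated here, so this is an approximation.

```python
def large_interval_fraction(pitches, threshold=12):
    """Fraction of consecutive melodic intervals wider than `threshold` semitones."""
    intervals = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    if not intervals:
        return 0.0  # a piece with fewer than two notes has no intervals
    return sum(size > threshold for size in intervals) / len(intervals)
```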

To focus on these general trends in some more detail, we considered as a case study Mussorgsky’s Pictures at an Exhibition, Promenade - Gnomus. To provide an indication of the performance of each model, the RMSE for each metric was calculated for each model trained only on this piece. The model results ranked from lowest to highest RMSE are listed in Table 4.

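| Model | RMSE | Model | RMSE | Model | RMSE |
|---|---|---|---|---|---|
| 15 | 2.872 | 15 | 0.860 | 11 | 0.077 |
| 9 | 2.760 | 10 | 0.835 | 10 | 0.116 |
| 2 | 2.619 | 11 | 0.814 | 7 | 0.219 |
| 3 | 2.600 | 9 | 0.805 | 5 | 0.219 |
| 6 | 2.510 | 3 | 0.796 | 8 | 0.220 |
| 13 | 2.510 | 2 | 0.794 | 6 | 0.221 |
| 5 | 2.506 | 7 | 0.766 | 4 | 0.221 |
| 4 | 2.506 | 8 | 0.764 | 1 | 0.223 |
| 7 | 2.505 | 6 | 0.763 | 2 | 0.225 |
| 8 | 2.505 | 5 | 0.763 | 3 | 0.225 |
| 1 | 2.504 | 4 | 0.753 | 9 | 0.241 |
| 12 | 2.454 | 12 | 0.742 | 15 | 0.248 |
| 10 | 1.962 | 1 | 0.700 | 12 | 0.717 |
| 11 | 1.614 | 13 | 0.535 | 13 | 1.085 |

| Model | RMSE | Model | RMSE | Model | RMSE | Model | RMSE | Model | RMSE |
|---|---|---|---|---|---|---|---|---|---|
| 13 | 0.003 | 13 | 0.017 | 13 | 0.006 | 10 | 0.068 | 10 | 0.059 |
| 1 | 0.004 | 1 | 0.019 | 4 | 0.007 | 12 | 0.069 | 11 | 0.060 |
| 10 | 0.004 | 2 | 0.021 | 8 | 0.008 | 13 | 0.069 | 12 | 0.060 |
| 11 | 0.004 | 8 | 0.023 | 7 | 0.008 | 1 | 0.070 | 13 | 0.061 |
| 15 | 0.006 | 7 | 0.024 | 5 | 0.008 | 11 | 0.070 | 4 | 0.061 |
| 9 | 0.006 | 4 | 0.027 | 6 | 0.008 | 4 | 0.071 | 1 | 0.061 |
| 7 | 0.006 | 5 | 0.028 | 1 | 0.008 | 15 | 0.072 | 15 | 0.062 |
| 6 | 0.006 | 6 | 0.029 | 2 | 0.012 | 3 | 0.072 | 3 | 0.062 |
| 3 | 0.006 | 3 | 0.039 | 3 | 0.012 | 5 | 0.072 | 2 | 0.062 |
| 4 | 0.006 | 10 | 0.062 | 9 | 0.012 | 2 | 0.072 | 7 | 0.062 |
| 8 | 0.006 | 9 | 0.095 | 15 | 0.019 | 8 | 0.072 | 8 | 0.062 |
| 5 | 0.006 | 11 | 0.113 | 12 | 0.022 | 6 | 0.072 | 6 | 0.062 |
| 2 | 0.007 | 12 | 0.157 | 10 | 0.023 | 9 | 0.072 | 9 | 0.062 |
| 12 | 0.009 | 15 | 0.167 | 11 | 0.026 | 7 | 0.072 | 5 | 0.062 |

Table 4: RMSE for various metrics for all 15 models (listed as 1-15) ordered from best to worst when trained on Pictures at an Exhibition, Promenade - Gnomus. Ent = Entropy, Dis = Dissonance, ED = Edit Distance, MI = Mutual Information, NC = Note Count, LI = Large Interval.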

We also examined in some more depth what the sheet music generated by some of the time series models looked like for the Mussorgsky piece. In Figure 2(a) we display the first few lines of the original training piece by Mussorgsky. There is no dissonance in the opening to this piece, there is a clear melodic progression with the repetition of the motif in the first two bars and there are also no harmonic or melodic intervals of more than an octave. The repetition of a melodic motif, in particular, was a clear trait of the original training piece that was not seen in any of the generated pieces.

In Figure 2(b), we display the opening bars of the piece generated using the layered HMM and trained on Pictures at an Exhibition, Promenade - Gnomus. There is some dissonance in this excerpt from the generated piece, especially beginning in the last two beats of the second bar with the C followed by the D (a dissonant minor second interval). While this generated piece is certainly more dissonant than the original training piece, it is the least dissonant of the generated pieces. There is not any clear melodic progression or repetition of motifs as there is in the original piece; however, the rhythmic structure of the first two bars is similar to that of the second two bars. There is a melodic interval larger than an octave in the first bar, from the F to the G, and there are more large intervals present than in the original piece.

We display the first four measures of the piece generated by the first order HMM in Figure 2(c). This piece was similar to the one generated by the layered HMM, with slightly greater dissonance. This is seen in the last two beats of the third measure, where there are multiple dissonant intervals and chords as well as several accidentals. There are also some large intervals of more than an octave in the first measures of this generated piece and some slight melodic progression (see the descending treble line in the first bar). The structure is minimal and there are no clear motifs in this generated piece.

In Figure 2(d) we display the first four measures of the piece generated by the two hidden state HMM. There are only three distinct note pitches in the first two measures. The second measure in the generated piece only consists of the note G, while none of the notes in the second measure of the original training piece are the same pitch. There are additionally some large intervals in the bass line in the fourth measure. There is again very little, if any, melodic progression, in large part due to the repetition of the same few notes throughout the first few bars.

In Figure 2(e) we see the piece generated by a first order HMM with random parameters. The piece is highly dissonant, has a large number of large intervals, and nearly every melodic and harmonic interval is dissonant. The rhythm of the generated piece closely matches the rhythm of the original piece, though the large amount of dissonance practically hinders any sense of melodic progression.

Figure 2: (a) First measures of Pictures at an Exhibition, Promenade - Gnomus. (b) The first few bars from a piece generated by a layered HMM. (c) The first few bars from a piece generated by a first order HMM. (d) The first few bars from a piece generated by a two hidden state HMM. (e) The first few bars from a piece generated by a first order HMM with random parameters.

4.3 Analysis of Human Evaluation

The human listeners were split into two groups of eight; group A consisted of individuals in a musical ensemble and group B consisted of individuals not in a musical ensemble. All individuals listened to the top three pieces according to the numerical metrics. In Figure 3 we plot the ranking (1-3) of each of the pieces, with 1 as the most favorite and 3 as the least favorite. Listener comments on each of the generated pieces are given in Table 5.

Figure 3: The rankings of generated pieces from a layered HMM trained on Chopin’s Marche funebre and Mendelssohn’s Hark! The Herald Angels Sing and the piece generated from a first order HMM trained on Beethoven’s Ode to Joy as evaluated by eight listeners who were currently in a musical ensemble (left) and not currently in a musical ensemble (right).

Several members in group A thought that the pieces sounded like they could have been composed by a human, but that the composition style was typical of a “twentieth century”, “modern”, “contemporary” or “post-classical” composer and commented that the pieces did not sound like they had been composed in the Romantic era. In particular, the “atonal chords and sustained base chords” and the “harmonic variation and non-diatonic chords” were provided as justifications for the more modern sound of the composed pieces, though one listener noted that this could have been due to the fact that modern composers “are less obligated to obey the rules of tonal harmony”. Several of the listeners in group A commented on how they wished there was more phrasing or structure in the generated pieces, as there was not much “complexity” in the pieces. Two listeners also thought that a human performer could improve the interpretation of the generated pieces by adding some phrasing to the music itself.

One of the listeners in group B commented that overall each piece sounded “distinct” and suggested that this “experimental” method of composition might be applied to “free jazz”. Another listener thought that each piece sounded “related” to the piece it was trained on, in particular the piece generated by a layered HMM trained on Mendelssohn’s Hark! The Herald Angels Sing, a piece that they were familiar with.

Generated Piece Musical Ensemble Group Non-Musical Ensemble Group
Layered HMM - Chopin’s Marche funebre - Melody on top of bass line added complexity
- Movement between octaves made piece sound human composed
- Dissonance in piece more resolved than in other pieces
- “Somewhat” of a melody that built towards end
- Sounded like “funeral” piece or Chopin nocturne
- Half thought complex piece, melodic with a few different themes, more likely to be human-composed
- Half thought “rote”, repetitive, less likely to be human-composed
- Abrupt ending
- Lack of dissonance
- Piece sounded modern or like Chopin
Layered HMM - Mendelssohn’s Hark! The Herald Angels Sing - Too many half-steps / dissonances to sound like human-composed
- Too simple, predictable and “choppy”
- 5/8 thought piece sounded like original training piece, more missed notes and dissonance
- Unintended, random dissonance, less likely to be human-composed (or in style of Modern composer)
- Note progressions sounded like human composition
- 1/8 thought piece sounded like original training piece
First Order HMM - Beethoven’s Ode to Joy - Repetitive melodies, lower voice “boring”
- Piece could be developed into larger work, used as interlude music
- Atonal, out-of-place dissonance, sounded like a Modern composition
- Better phrases, melodic progression than other pieces, “good structure” to piece
- “Jarring” dissonance, less so than some of previous pieces
- Enjoyed depth/texture, faster tempo
- Sounded like human-composed “New-Age” piano music
- Repetitive, piece “forgot” what had previously occurred and repeated itself
- Sounded like Christmas music, too dissonant to be human-composed
Table 5: Listener comments by the two listening groups on each of the three generated pieces.

4.4 Observations in Relation to Music Theory

One of the observations from the listening evaluations was that the piece generated by training a layered HMM on Chopin’s Marche funebre was in general the most well received, with the fewest comments complaining of unresolved or out-of-place dissonance. Chopin’s Marche funebre is built on chords that are fifths, an open, perfect interval widely used in the music of non-Western cultures. Thus, even when there was a passing dissonance in the piece generated by training an HMM on Chopin’s Marche funebre, the dissonance could be resolved by relaxation to a pure interval, resulting in a piece that sounded less dissonant. The majority of the dissonance in the generated piece could be resolved in this way. We believe this simplicity of the harmony in the original piece by Chopin contributed to the relative success in the generation of new pieces from HMMs trained on Chopin’s Marche funebre.

In contrast, Mendelssohn’s Hark! The Herald Angels Sing and Beethoven’s Ode to Joy are built on major chords comprised of thirds. There is a greater “potential” for dissonance in major chords comprised of thirds, as there are multiple ways that one can obtain dissonance. Indeed, the only way for there to be consonance is for an interval to be either: (a) a perfect unison, or (b) a third, fifth or octave in the chord. For pieces built on thirds, the dissonance could not be resolved as easily as in the case of Chopin’s Marche funebre built on open fifths; thus the generated pieces sounded much more dissonant. In the examination of the pieces generated by HMMs trained on Mussorgsky’s Pictures at an Exhibition, Promenade - Gnomus, the generated pieces were also quite dissonant, especially compared to the pieces generated by HMMs trained on Chopin’s Marche funebre, and again, Mussorgsky’s original piece is built on major chords where the third of the chord is present.

This intuition from music theory is confirmed in Table 6 where we display the percentage of simple harmonic intervals (harmonic intervals that are an octave or less) that are either thirds (minor or major), perfect fourths or fifths, or dissonant for both the original training piece and the generated pieces. The third in the major or minor triad is an imperfect consonance and is not as easily resolved as perfect consonances like the fifth in the triad. This leads to more dissonance in the generated pieces. All of the generated pieces in Table 6 have a much higher percentage of dissonant simple harmonic intervals than their respective training pieces. However, Chopin’s Marche funebre is the only training piece which is primarily perfect fourths or fifths (in this case fifths) as opposed to thirds, and the generated piece has the lowest percentage of dissonant intervals of the three generated pieces.

| Piece | Thirds | Fourths / Fifths | Dissonant |
|---|---|---|---|
| Chopin’s Marche funebre | 0.0964 | 0.4608 | 0.0060 |
| Chopin’s Marche funebre - Layered HMM | 0.2071 | 0.2893 | 0.1286 |
| Mendelssohn’s Hark! The Herald Angels Sing | 0.2690 | 0.2398 | 0.0741 |
| Mendelssohn’s Hark! The Herald Angels Sing - Layered HMM | 0.2067 | 0.2667 | 0.2433 |
| Beethoven’s Ode to Joy | 0.4710 | 0.2101 | 0.0145 |
| Beethoven’s Ode to Joy - First Order HMM | 0.2927 | 0.2134 | 0.2317 |

Table 6: Percentage of simple harmonic intervals that are thirds, perfect fourths or fifths and dissonant.
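The interval breakdown in Table 6 can be approximated with a short sketch that classifies every pair of simultaneously sounding pitches by its size in semitones. Several choices here are assumptions and not taken from the text: the input is a list of chords given as MIDI note numbers, all pairs within a chord are counted, unisons are ignored, and seconds, the tritone and sevenths are treated as dissonant.

```python
from itertools import combinations

# Interval sizes in semitones; the dissonant set is an assumption of this sketch.
THIRDS = {3, 4}                 # minor and major thirds
FOURTHS_FIFTHS = {5, 7}         # perfect fourth and perfect fifth
DISSONANT = {1, 2, 6, 10, 11}   # seconds, tritone, sevenths

def harmonic_interval_profile(chords):
    """Proportion of simple harmonic intervals (an octave or less) per category."""
    counts = {"thirds": 0, "fourths_fifths": 0, "dissonant": 0}
    total = 0
    for chord in chords:
        for low, high in combinations(sorted(set(chord)), 2):
            size = high - low
            if size > 12:
                continue  # compound interval, not a simple one
            total += 1
            if size in THIRDS:
                counts["thirds"] += 1
            elif size in FOURTHS_FIFTHS:
                counts["fourths_fifths"] += 1
            elif size in DISSONANT:
                counts["dissonant"] += 1
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: c / total for name, c in counts.items()}

# Hypothetical input: a C major triad, an open fifth, and a minor second.
profile = harmonic_interval_profile([[60, 64, 67], [60, 67], [60, 61]])
```

The three proportions need not sum to one, since sixths and octaves fall into none of the categories.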

All the generated pieces lacked the extent of overall melodic progression or global structure of the original pieces, a shortcoming that the listeners commented on and that was evident in the metrics. Figure 4 displays the ACF and PACF for both the piece generated by training a first order HMM on Beethoven’s Ode to Joy and the original piece. It is clear that the original piece has much greater dependence over longer lags. One can also see from the models trained on Mussorgsky’s Pictures at an Exhibition, Promenade - Gnomus, that even relatively short, simple motifs that are highly repetitive were not modeled well by any of the HMMs considered. This lack of global structure is to be expected, as the models considered made quite restrictive assumptions about long-term dependence between hidden states.

Figure 4: (a) Plots of the ACF and PACF for Beethoven’s Ode to Joy. (b) Plots of the ACF and PACF for the piece generated by a first order HMM trained on Beethoven’s Ode to Joy.
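The dependence over longer lags discussed above is what the sample autocorrelation function measures. A minimal sketch of the ACF for a pitch sequence follows; it uses the usual biased estimator with the lag-0 sum of squares as the denominator, which may differ in detail from the routine used to produce Figure 4, and the PACF is omitted.

```python
def acf(series, max_lag):
    """Sample autocorrelation of `series` at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]
    denom = sum(v * v for v in x)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(x[t] * x[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

# Hypothetical four-note motif repeated eight times (MIDI note numbers).
motif_piece = [60, 64, 67, 64] * 8
r = acf(motif_piece, 8)
```

For this strictly repeating motif the autocorrelation stays high at lags 4 and 8, the kind of long-lag dependence that the original piece shows and the generated piece largely lacks.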

None of the evaluated pieces sounded like they were composed during the Romantic era. Most listeners seemed to think that the pieces could have been composed by a human, albeit one in the modern or contemporary period (especially due to the prevalence of large intervals of more than an octave and the often unresolved dissonances that are much more common in modern music than in Romantic era music). The HMMs considered did not model the traits of Romantic music well and this was reflected in the generated pieces. Additionally, the layered HMM, in particular, may have suffered from “overfitting”, as several listeners were able to identify that one of the evaluated pieces was indeed trained on Hark! The Herald Angels Sing. However, we do expect the generated pieces to bear similarity to the training piece and this potential “overfitting” is not necessarily an issue. Many listeners could not identify the training piece, only overall themes, and none of the models reproduced the training piece exactly.

4.5 Validation of Trends

Two hypotheses suggested by both the listener responses and the metrics were that the generated pieces lacked overall or melodic progression and that models trained on harmonically simple pieces tended to produce more consonant pieces that were preferred by listeners, while original pieces mainly consisting of triads led to more dissonant pieces that listeners did not prefer.

To validate these two hypotheses we considered an additional training piece from the Baroque era, Pachelbel’s Canon in D. Pachelbel’s Canon in D is harmonically simple and is also built on a perfect interval, in this case the perfect fourth. Our prediction was that pieces generated from training on Pachelbel’s Canon in D should result in pieces that were less dissonant than most of the ten pieces from the Romantic era.

We again used the numeric metrics to obtain the top three ranked pieces with respect to lowest entropy, musicality, and temporal RMSE values. The resulting pieces were a layered HMM generated piece trained on Ode to Joy (lowest musicality RMSE), a layered HMM generated piece trained on We Three Kings (lowest entropy RMSE), and a TVAR(11) generated piece trained on Pachelbel’s Canon (lowest temporal RMSE). These pieces were then evaluated and ranked by a new set of human listeners, five currently in a musical ensemble (group A) and eight not currently in a musical ensemble (group B). In addition to ranking each piece from favorite to least favorite, each listener was asked to quantitatively score each piece on a scale of 1 to 5 according to:

  1. how much the generated piece sounded like it was composed by a human, 1 = piece sounded completely random, 5 = piece sounded just like a human composition;

  2. how harmonically pleasing each piece was, 1 = not at all harmonically pleasing, 5 = very harmonically pleasing;

  3. how melodically pleasing each piece was, 1 = not at all melodically pleasing, 5 = very melodically pleasing.

Members of group A had unanimous rankings with the piece generated from Ode to Joy their favorite and the piece generated from We Three Kings as their least favorite, see Figure 5. The members of group B had much greater variation in preference with about half preferring the piece generated from Pachelbel’s Canon as their favorite, again see Figure 5.

One of our hypotheses was that harmonically simple pieces tended to be more consonant. To test this we examined the generated pieces from Pachelbel’s Canon in more detail. Pachelbel’s Canon is built upon the perfect fourth, though it does not have as high a percentage of perfect fourths or fifths among the harmonic intervals observed; see Table 8 for a comparison of the harmonic intervals in the three pieces. When we consider melodic intervals as well, the two pieces that are primarily built upon perfect consonances (We Three Kings and Pachelbel’s Canon) show a decrease in the percentage of dissonant melodic intervals in the generated pieces as compared to the original training pieces, see Table 9. The piece generated by a layered HMM trained on Beethoven’s Ode to Joy showed an increase in the percentage of dissonant melodic intervals. However, the piece generated by a layered HMM trained on Ode to Joy had the lowest percentage of dissonant melodic intervals of the three generated pieces considered, likely explaining why the harmonic aspects of this generated piece were ranked highly by both groups of listeners. Although the piece generated by a TVAR(11) trained on Pachelbel’s Canon had the highest percentage of dissonant melodic intervals, the majority of these intervals were resolved in the piece, leading to a favorable harmonic rating by both groups of human listeners. This is in contrast to the layered HMM trained on We Three Kings, which had the smallest percentage of dissonant harmonic intervals, though it was clear from the human listeners that the dissonant harmonic intervals that it did contain were particularly jarring. This is why we think that it was ranked the worst in terms of harmonic qualities.

Our other hypothesis was that the generated pieces lacked overall or melodic progression as compared to the originals. We explored this hypothesis by examining whether the TVAR model, which is capable of modeling longer lags than the layered HMMs, improved the musicality metrics, see Figure 6(c). In Figure 6(c) we see that the decay in correlations is not as steep for the TVAR model as it is for the layered HMM, though the autocorrelation function for both models is impoverished with respect to the original. However, human listeners were not always able to distinguish this additional temporal structure, as the TVAR(11) generated piece trained on Pachelbel’s Canon was not consistently scored higher for melodic qualities than the pieces generated by a layered HMM.

Figure 5: The rankings of generated pieces from a layered HMM trained on Ode to Joy and We Three Kings and the piece generated from a TVAR(11) model trained on Pachelbel’s Canon as evaluated by eight listeners who were not currently in a musical ensemble and five listeners currently in a musical ensemble.

| Generated Piece | Human-like (Musical Ensemble Group) | Harmony (Musical Ensemble Group) | Melody (Musical Ensemble Group) | Human-like (No Musical Ensemble Group) | Harmony (No Musical Ensemble Group) | Melody (No Musical Ensemble Group) |
|---|---|---|---|---|---|---|
| Ode to Joy - Layered HMM | 3.125 | 3.250 | 3.125 | 3.8 | 4.0 | 3.8 |
| We Three Kings - Layered HMM | 3.500 | 3.125 | 3.125 | 2.4 | 1.8 | 2.2 |
| Pachelbel’s Canon - TVAR(11) | 3.438 | 3.375 | 3.063 | 3.5 | 3.6 | 3.2 |

Table 7: Average scores for how much each evaluated generated piece sounded like it was composed by a human, how harmonically pleasing it was and how melodically pleasing it was, on a scale of 1 (lowest) to 5 (highest).

Figure 6: (a) ACF and PACF for the original training piece, Pachelbel’s Canon. (b) ACF and PACF for a piece generated by a layered HMM trained on Pachelbel’s Canon. (c) ACF and PACF for a piece generated by a TVAR(11) model trained on Pachelbel’s Canon.

| Piece | Thirds | Fourths / Fifths | Dissonant |
|---|---|---|---|
| Beethoven’s Ode to Joy | 0.4710 | 0.2101 | 0.0145 |
| Beethoven’s Ode to Joy - Layered HMM | 0.1955 | 0.2849 | 0.2793 |
| We Three Kings | 0.2845 | 0.3724 | 0.0 |
| We Three Kings - Layered HMM | 0.2049 | 0.3394 | 0.2294 |
| Pachelbel’s Canon | 0.4286 | 0.1203 | 0.0075 |
| Pachelbel’s Canon - TVAR(11) | 0.1477 | 0.2727 | 0.267 |

Table 8: Percentage of simple harmonic intervals that are thirds, perfect fourths or fifths and dissonant.

| Piece | Thirds | Fourths / Fifths | Dissonant |
|---|---|---|---|
| Beethoven’s Ode to Joy | 0.0769 | 0.3427 | 0.1119 |
| Beethoven’s Ode to Joy - Layered HMM | 0.1266 | 0.2848 | 0.1709 |
| We Three Kings | 0.0757 | 0.2919 | 0.4324 |
| We Three Kings - Layered HMM | 0.2054 | 0.2865 | 0.1784 |
| Pachelbel’s Canon | 0.1497 | 0.2389 | 0.5191 |
| Pachelbel’s Canon - TVAR(11) | 0.625 | 0.275 | 0.3000 |

Table 9: Percentage of simple melodic intervals that are thirds, perfect fourths or fifths and dissonant.

5 Conclusions and future work

We were able to train state space models to generate pieces from the Romantic era and some of the pieces were considered to be possibly composed by humans when heard by human evaluators. The models were more successful at modeling harmony than melodic progression. The state space models were more successful at generating consonant pieces when trained on pieces with simple harmonies, particularly original pieces that were built on perfect intervals. Models with greater hierarchical structure, particularly the layered HMM, tended to be more successful at generating pieces with less dissonance and slightly more melodic progression than other models considered. Listeners felt that the generated pieces sounded more like pieces composed in the Modern era than like pieces composed in the Romantic era. The greatest shortcoming was that the generated pieces lacked global structure or long-term melodic progression.

Based on the shortcomings of the state space models, several directions for future work are suggested that attempt to resolve some of the problems with the generated pieces of the considered models. These future directions include considering hierarchical models, natural language models, and recurrent neural networks (RNNs).

5.1 Hierarchical Models

In original piano pieces from the Romantic era, there are several layers to the music that evolve over different time periods. These include short reoccurring motifs, longer melodies and the global form of the piece. Of the models we considered, those that had a hierarchical component tended to perform better in terms of the RMSE and listening evaluations and came closer to modeling the actual hidden structure of an original piece. Thus, models with a greater hierarchical structure such as the hierarchical HMM (Fine et al., 1998), dynamic Bayesian networks with a hierarchical structure (Ghahramani, 1997), factorial HMMs with dependence structure between the underlying processes (Ghahramani and Jordan, 1997), and hierarchical RNNs (Hihi and Bengio, 1996) are all interesting candidates that may improve the ability to model both short-term and global structure in musical pieces. In particular, explicitly modeling the hierarchy of rhythm, melody and harmony in the original training piece could improve the ability of hierarchical models to generate music that sounds like it was composed by a human.

5.2 Natural Language Models and Musical Grammar

Classical music composition follows basic rules and guidelines just like a spoken language. The composer and conductor Leonard Bernstein expressed great interest in developing a concrete musical grammar; this was the main topic Bernstein explored in a series of talks at Harvard (Bernstein, 1976). Inspired by these talks, Lerdahl and Jackendoff (1983) developed a generative theory of tonal music, Baroni et al. (1983) proposed a grammar of melody based on Bach chorales, and Pearce and Wiggins (2006) explored the concept of expectation in melodic music using statistical models based on n-grams. Nierhaus (2009) surveyed additional work using natural language models to generate or model music.

The development of a grammar for music with production rules and syntax would likely improve the harmonic and melodic aspects of the generated pieces. This grammar would also have utility beyond music generation, such as allowing for the use of existing topic models for insight into common aspects in works by a particular composer. The concept of a musical grammar would also contribute to a more concrete understanding and interpretation of the hidden states in the state space models and could allow for the ability to explicitly build musical theory into these models. HMMs are in part so successful in speech recognition applications because knowledge about physical speech production and speech signals can be implemented in the models as constraints in the transition matrix. In general, the more prior knowledge about the series of interest that can be built into a state space model, the better the model is expected to perform. The concept of a musical grammar could thus introduce music theoretic constraints, leading to generated music that obeys these grammatical rules.

Stochastic context-free grammars are an extension of HMMs that are well understood and algorithms for parameter inference exist (Lari and Young, 1990). Stochastic context-free grammars have probabilistic production rules and are hierarchical in structure, although they are computationally intensive. However, stochastic context-free grammars are likely a promising next step in exploring and understanding the underlying processes in original musical pieces.
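To illustrate what probabilistic production rules and hierarchical structure mean in practice, the following is a small sketch of sampling from a stochastic context-free grammar. The grammar itself (the nonterminals `PIECE`, `PHRASE`, `MOTIF`, `CADENCE`, the note names and the probabilities) is entirely hypothetical and was invented for this example; it is not a grammar proposed in this work or in the cited literature.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to (probability, expansion) rules.
GRAMMAR = {
    "PIECE":   [(1.0, ["PHRASE", "PHRASE", "CADENCE"])],
    "PHRASE":  [(0.6, ["MOTIF", "MOTIF"]), (0.4, ["MOTIF", "PHRASE"])],
    "MOTIF":   [(0.5, ["C4", "E4", "G4"]), (0.5, ["D4", "F4", "A4"])],
    "CADENCE": [(1.0, ["G4", "C4"])],
}

def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal note names."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: a note name
    rules = GRAMMAR[symbol]
    weights = [prob for prob, _ in rules]
    _, rhs = rng.choices(rules, weights=weights, k=1)[0]
    notes = []
    for child in rhs:
        notes.extend(expand(child, rng))
    return notes

notes = expand("PIECE", random.Random(0))
```

Because motifs are generated inside phrases and phrases inside the piece, structure at several time scales is built into every sample, unlike the flat state sequence of a first order HMM.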

5.3 Recurrent Neural Networks

One of the main criticisms of the music composed by the state space models we considered was the lack of global structure in the generated pieces. By model design, HMMs have very limited memory and are thus incapable of modeling the longer term structure that occurs in original musical pieces. RNNs are neural networks that are specialized for processing sequential data (Goodfellow et al., 2016). RNNs have been used as an alternative to HMMs in music composition (Mozer, 1994; Eck and Schmidhuber, 2002; Boulanger-Lewandowski et al., 2012; Johnson, 2015; Google Developers, 2017; Magenta, 2016; Hadjeres and Pachet, 2016).
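The limited memory of a first order HMM is visible in how a sequence is sampled from it: each hidden state depends only on the previous hidden state, and each note only on the current hidden state. The sketch below uses hypothetical parameters (two hidden states and a handful of note names) chosen purely for illustration; they are not fitted values from any of the models above.

```python
import random

def sample_hmm(initial, transition, emission, length, rng):
    """Sample hidden states and notes from a first order HMM."""
    n_states = len(initial)
    states, notes = [], []
    state = rng.choices(list(range(n_states)), weights=initial, k=1)[0]
    for _ in range(length):
        states.append(state)
        symbols = list(emission[state].keys())
        probs = list(emission[state].values())
        # The emitted note depends only on the current hidden state.
        notes.append(rng.choices(symbols, weights=probs, k=1)[0])
        # The next hidden state depends only on the current hidden state.
        state = rng.choices(list(range(n_states)), weights=transition[state], k=1)[0]
    return states, notes

# Hypothetical parameters for illustration only.
initial = [0.7, 0.3]
transition = [[0.8, 0.2], [0.3, 0.7]]
emission = [
    {"C4": 0.5, "E4": 0.3, "G4": 0.2},
    {"D4": 0.4, "F4": 0.4, "A4": 0.2},
]

states, notes = sample_hmm(initial, transition, emission, 16, random.Random(1))
```

Nothing in the sampler looks further back than one step, so a motif heard several bars earlier cannot influence what is generated next.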

However, Mozer (1994), Eck and Schmidhuber (2002) and Johnson (2015) note that RNNs suffer from a similar problem as HMMs in music composition, a lack of global structure. RNNs by themselves are unable to capture long-term dependencies that occur in classical pieces of music; thus RNNs can produce pieces that are repetitive (Johnson, 2015) or that lack global structure (Mozer, 1994; Eck and Schmidhuber, 2002). Eck and Schmidhuber (2002) note that this is likely due to the problem of vanishing gradients in RNNs and use Long Short Term Memory (LSTM) units to successfully capture longer-term structure in a corpus of blues music. Boulanger-Lewandowski et al. (2012) use recurrent temporal restricted Boltzmann machines and a generalization that they call the recurrent neural network restricted Boltzmann machine to model polyphonic music. They find that this model outperforms other models of polyphonic music, including HMMs, at learning harmonic and rhythmic probabilistic rules, where they are primarily interested in musical transcription of polyphonic music.

Hadjeres and Pachet (2016) likewise utilize LSTM units to generate harmonized Bach chorales. Exploring models, like the LSTM, that are capable of modeling longer-term dependencies in the original musical pieces would likely improve the global structure of the generated musical pieces and attempt to solve the problem of the lack of melodic progression in the pieces generated by HMMs.

Software and Data

All the software to generate the music, compute the metrics, as well as the original pieces and the generated pieces are available at In addition examples of music theory concepts are provided and discussed.


A.Y. would like to acknowledge Jeff Miller for discussions concerning the Two Hidden State HMM, Mike West for providing TVAR code in Matlab, which was converted to Python code, and Mike Kris for many discussions about the musical aspects of this work. A.Y. is currently affiliated with MIT Lincoln Laboratory. S.M. would like to acknowledge NSF DMS 16-13261, NSF IIS 15-46331, NSF DMS 14-18261, NSF DMS-17-13012, NIH R21 AG055777-01A, and NSF ABI 16-61386 for partial support.

Acronyms used in the paper

  • ACF: Autocorrelation Function

  • ARHMM: Autoregressive Hidden Markov Model

  • FHMM: Factorial Hidden Markov Model

  • HMM: Hidden Markov Model

  • HSMM: Hidden Semi-Markov Model

  • LHMM: Layered Hidden Markov Model

  • LRHMM: Left-Right Hidden Markov Model

  • LSTM: Long Short Term Memory

  • MIDI: Musical Instrument Digital Interface

  • NSHMM: Non-stationary Hidden Markov Model

  • PACF: Partial Autocorrelation Function

  • RMSE: Root Mean Squared Error

  • RNN: Recurrent Neural Network

  • SCFG: Stochastic Context-Free Grammar

  • THSMM: Two Hidden State Markov Model

  • TVAR: Time-Varying Autoregression

Appendix A Baum-Welch Algorithm for the Two Hidden State HMM

For the HMM with two hidden states, and are the hidden states, see Figure 7. Each state in the hidden process can take on one of possible values, while each state in the hidden process can take on one of possible values. The length of both series is still . We define the following parameters:

Figure 7: Directed graph of the HMM with two hidden states. Both the and the are hidden states.

The constraints are

Let be the model parameters, where and are the initial state distribution and emission distribution, respectively, as defined for the first order HMM. Let be the current values of these parameters at time in the Baum-Welch Algorithm. Define to be a constant. Then, the auxiliary function for the E step of the update Baum-Welch Algorithm for the HMM with two hidden states can be written as:


We have that

Then we can find the parameter values that maximize the auxiliary function, using a Lagrange multiplier to handle the constraints placed on the parameters:
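As a generic sketch of this kind of constrained maximization (not the paper's exact expressions), suppose the part of the auxiliary function that involves one row of a transition matrix has the form $\sum_j c_j \log a_j$, where the $c_j \ge 0$ are expected transition counts from the E step and the row must satisfy $\sum_j a_j = 1$. The symbols $a_j$, $c_j$, and $\lambda$ are illustrative assumptions introduced here.

```latex
\begin{align}
\mathcal{L}(a, \lambda) &= \sum_j c_j \log a_j + \lambda \Bigl( 1 - \sum_j a_j \Bigr), \\
\frac{\partial \mathcal{L}}{\partial a_j} &= \frac{c_j}{a_j} - \lambda = 0
  \quad\Longrightarrow\quad a_j = \frac{c_j}{\lambda}, \\
\sum_j a_j = 1 &\quad\Longrightarrow\quad \lambda = \sum_j c_j,
  \qquad \hat{a}_j = \frac{c_j}{\sum_{j'} c_{j'}}.
\end{align}
```

Under these assumptions, each updated row is simply the vector of expected counts normalized to sum to one, which is the familiar form of the Baum-Welch transition update.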


The Forward-Backward Algorithm is exactly the same as in the first-order HMM case, with the transition matrix defined above used as the transition matrix. The initial state distribution and the emission distribution are updated exactly as in the Baum-Welch Algorithm for the first-order HMM (Miller, 2016).
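One way to see why the first-order Forward-Backward Algorithm carries over is to flatten the two hidden states into a single composite state and run the usual recursions on the resulting transition matrix. The sketch below is a minimal illustration under stated assumptions, not the paper's parameterization: the factorization into `A_R` and `A_S`, the per-composite-state emission matrix `B`, and the function names `joint_transition` and `forward_backward` are all hypothetical.

```python
import numpy as np

def joint_transition(A_R, A_S):
    """Flatten two hidden chains into one composite chain.

    Assumed (hypothetical) factorization:
      A_R[r, r2]     = P(R_t = r2 | R_{t-1} = r),            shape (k, k)
      A_S[r2, s, s2] = P(S_t = s2 | S_{t-1} = s, R_t = r2),  shape (k, m, m)
    The composite state (r, s) is mapped to the index r * m + s.
    """
    k, m = A_R.shape[0], A_S.shape[1]
    # T4[r, s, r2, s2] = A_R[r, r2] * A_S[r2, s, s2]
    T4 = np.einsum("ij,jkl->ikjl", A_R, A_S)
    return T4.reshape(k * m, k * m)

def forward_backward(pi, T, B, obs):
    """Scaled forward-backward for a discrete HMM.

    pi: initial distribution over composite states, shape (K,)
    T:  transition matrix, shape (K, K)
    B:  emission matrix, B[i, x] = P(X_t = x | state i), shape (K, n_symbols)
    Returns the posterior state marginals gamma (n, K) and the log-likelihood.
    """
    n, K = len(obs), len(pi)
    alpha = np.zeros((n, K))
    beta = np.ones((n, K))
    scale = np.zeros(n)

    # Forward pass, normalizing at each step to avoid underflow.
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ T) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, reusing the forward scaling factors.
    for t in range(n - 2, -1, -1):
        beta[t] = (T @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta  # each row sums to one
    return gamma, np.log(scale).sum()

# Small synthetic example with made-up sizes and random parameters.
rng = np.random.default_rng(0)
k, m, n_symbols = 2, 3, 4
A_R = rng.dirichlet(np.ones(k), size=k)            # shape (k, k)
A_S = rng.dirichlet(np.ones(m), size=(k, m))       # shape (k, m, m)
B = rng.dirichlet(np.ones(n_symbols), size=k * m)  # shape (k*m, n_symbols)
pi = np.full(k * m, 1.0 / (k * m))
obs = [0, 2, 1, 3, 3]

T = joint_transition(A_R, A_S)
gamma, loglik = forward_backward(pi, T, B, obs)
```

Because the composite chain is an ordinary first-order Markov chain on `k * m` states, nothing in `forward_backward` needs to know that the state was built from two processes; only the construction of `T` changes.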


References

  • Allan and Williams (2004) Allan, M. and C. Williams (2004). Harmonising chorales by probabilistic inference. In Advances in Neural Information Processing Systems 17, pp. 25–32.
  • Ames (1989) Ames, C. (1989). The Markov Process as a Compositional Model: A Survey and Tutorial. Leonardo 22(2), 175–187.
  • Apple (Apple) Apple. GarageBand for Mac.
  • Baroni et al. (1983) Baroni, M., S. Maguire, and W. Drabkin (1983). The concept of musical grammar. Music Analysis 2(2), 175–208.
  • Baum (1972) Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1–8.
  • Bernstein (1976) Bernstein, L. (1976). The Unanswered Question: Six Talks at Harvard. Harvard University Press.
  • Boulanger-Lewandowski et al. (2012) Boulanger-Lewandowski, N., Y. Bengio, and P. Vincent (2012). Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1159–1166.
  • Coffman (1992) Coffman, D. D. (1992). Measuring musical originality using information theory. Psychology of Music 20, 154–161.
  • Cope (1991) Cope, D. (1991). Computers and Musical Style. The Computer Music and Digital Audio Series. A-R Editions, Inc.
  • Cope (1996) Cope, D. (1996). Experiments in musical intelligence. A-R Editions, Inc.
  • Cover and Thomas (2006) Cover, T. M. and J. A. Thomas (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience.
  • Dirst and Weigend (1993) Dirst, M. and A. S. Weigend (1993). Time Series Prediction: Forecasting the Future and Understanding the Past, Chapter Baroque Forecasting: On Completing J. S. Bach’s Last Fugue. Addison-Wesley.
  • Djuric and Chun (1999) Djuric, P. M. and J.-H. Chun (1999). Estimation of Nonstationary Hidden Markov Models by MCMC Sampling. In Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’99), Volume 3, pp. 1737–1740.
  • Djuric and Chun (2002) Djuric, P. M. and J.-H. Chun (2002). An MCMC Sampling Approach to Estimation of Nonstationary Hidden Markov Models. IEEE Transactions on Signal Processing 50(5), 1113–1123.
  • Eck and Schmidhuber (2002) Eck, D. and J. Schmidhuber (2002). A First Look at Music Composition using LSTM Recurrent Neural Networks. Technical report, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale.
  • Fine et al. (1998) Fine, S., Y. Singer, and N. Tishby (1998). The Hierarchical Hidden Markov Model: Analysis and Applications. Machine Learning 32, 41–62.
  • Gauldin (2004) Gauldin, R. (2004). Harmonic Practice in Tonal Music (Second ed.). W. W. Norton & Company.
  • Ghahramani (1997) Ghahramani, Z. (1997). Learning Dynamic Bayesian Networks. Technical report, University of Toronto.
  • Ghahramani and Jordan (1997) Ghahramani, Z. and M. I. Jordan (1997). Factorial Hidden Markov Models. Machine Learning, 1–31.
  • Goodfellow et al. (2016) Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. MIT Press.
  • Google Developers (2017) Google Developers (2017, February). Magenta: Music and Art Generation (TensorFlow Dev Summit 2017).
  • Guan et al. (2016) Guan, X., R. Raich, and W.-K. Wong (2016). Efficient Multi-Instance Learning for Activity Recognition from Time Series Data Using an Auto-Regressive Hidden Markov Model. In Proceedings of The 33rd International Conference on Machine Learning, Volume 48.
  • Hadjeres and Pachet (2016) Hadjeres, G. and F. Pachet (2016). DeepBach: a Steerable Model for Bach chorales generation. CoRR abs/1612.01010.
  • Hihi and Bengio (1996) Hihi, S. E. and Y. Bengio (1996). Hierarchical Recurrent Neural Networks for Long-Term Dependencies. In Advances in Neural Information Processing Systems 8 (NIPS’95). MIT Press.
  • Johnson (2015) Johnson, D. (2015, August). Composing Music with Recurrent Neural Networks.
  • Jurafsky and Martin (2009) Jurafsky, D. and J. H. Martin (2009). Speech and Language Processing (Second ed.). Prentice Hall.
  • Krueger (2016) Krueger, B. (2016). Classical Piano MIDI Page.
  • Laitz (2003) Laitz, S. G. (2003). The Complete Musician. Oxford University Press.
  • Lari and Young (1990) Lari, K. and S. J. Young (1990). The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language 4, 35–56.
  • Lerdahl and Jackendoff (1983) Lerdahl, F. and R. Jackendoff (1983). A Generative Theory of Tonal Music. The MIT Press.
  • Magenta (2016) Magenta (2016). Magenta.
  • Mari and Schott (2001) Mari, J.-F. and R. Schott (2001). Probabilistic and Statistical Methods in Computer Science. Springer.
  • Meehan (1980) Meehan, J. R. (1980). An Artificial Intelligence Approach to Tonal Music Theory. Computer Music Journal 4(2), 60–65.
  • (2017) (2017). mfiles.
  • MIDIworld (2009) MIDIworld (2009).
  • Miller (2016) Miller, J. W. (2016, April). Two Hidden State Markov Model. Discussions.
  • Mozer (1994) Mozer, M. C. (1994). Neural network music composition by prediction: Exploring the benefits of psychophysical constraints and multiscale processing. Connection Science 6, 247–280.
  • Murphy (2002) Murphy, K. P. (2002). Hidden semi-Markov models (HSMMs). Technical report, MIT.
  • Nierhaus (2009) Nierhaus, G. (2009). Algorithmic Composition: Paradigms of Automated Music Generation. Springer, Wien, New York.
  • Oliver et al. (2004) Oliver, N., A. Garg, and E. Horvitz (2004). Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding 96, 163–180.
  • Pearce and Wiggins (2006) Pearce, M. T. and G. A. Wiggins (2006). Expectation in Melody: The Influence of Context and Learning. Music Perception: An Interdisciplinary Journal 23(5), 377–405.
  • Pikrakis et al. (2006) Pikrakis, A., S. Theodoridis, and D. Kamarotos (2006). Classification of musical patterns using variable duration hidden Markov models. IEEE Transactions on Audio, Speech, and Language Processing 14(5), 1795–1807.
  • Prado and West (2010) Prado, R. and M. West (2010). Time Series: Modeling, Computation and Inference. Chapman & Hall/CRC.
  • Rabiner (1989) Rabiner, L. R. (1989, February). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77(2), 257–286.
  • Ren et al. (2010) Ren, L., D. Dunson, S. Lindroth, and L. Carin (2010). Dynamic Nonparametric Bayesian Models for Analysis of Music. Journal of the American Statistical Association 105(490), 458–472.
  • Rothstein (1992) Rothstein, J. (1992). MIDI: A Comprehensive Introduction. The Computer Music and Digital Audio Series. A-R Editions, Inc.
  • Sin and Kim (1995) Sin, B. and J. H. Kim (1995). Nonstationary hidden Markov model. Signal Processing 46, 31–46.
  • Suzuki and Kitahara (2014) Suzuki, S. and T. Kitahara (2014). Four-part Harmonization Using Bayesian Networks: Pros and Cons of Introducing Chord Notes. Journal of New Music Research 43(3), 331–353.
  • Vincent (2016) Vincent, J. (2016). A night at the AI jazz club.
  • Walker (2008) Walker, J. (2008, January). Midi-csv.
  • Warrack (1983) Warrack, J. (1983). The New Oxford Companion to Music. Oxford University Press.
  • Weiland et al. (2005) Weiland, M., A. Smaill, and P. Nelson (2005). Learning musical pitch structures with hierarchical hidden Markov models. Journées d’Informatique Musicale.
  • Whorley and Conklin (2016) Whorley, R. P. and D. Conklin (2016). Music Generation from Statistical Models of Harmony. Journal of New Music Research 45(2), 160–183.
  • Yu (2010) Yu, S.-Z. (2010). Hidden semi-Markov models. Artificial Intelligence 174, 215–243.