Modeling polyphonic music is a particularly challenging task because of the intricate interplay between melody and harmony. A good model should satisfy three requirements: statistical accuracy (capturing faithfully the statistics of correlations at various ranges, horizontally and vertically), flexibility (coping with arbitrary user constraints), and generalization capacity (inventing new material, while staying in the style of the training corpus). Models proposed so far fail on at least one of these requirements. We propose a statistical model of polyphonic music, based on the maximum entropy principle. This model is able to learn and reproduce pairwise statistics between neighboring note events in a given corpus. The model is also able to invent new chords and to harmonize unknown melodies. We evaluate the invention capacity of the model by assessing the amount of cited, re-discovered, and invented chords on a corpus of Bach chorales. We discuss how the model enables the user to specify and enforce user-defined constraints, which makes it useful for style-based, interactive music generation.
Polyphonic tonal music is often considered a highlight of Western civilization. Today’s music is still largely based on complex structures invented and developed since the Renaissance period, and modeled, e.g., by Jean-Philippe Rameau in the XVIII century. In particular, polyphonic music is characterized by an intricate interplay between melody (single-voice stream of notes) and harmony (progression of simultaneously-heard notes). Additionally, composers tend to develop a specific style that influences the way notes are combined to form a musical piece.
Many models of polyphonic music have been proposed since the 50s (see 
for a comprehensive survey), starting with the famous Illiac Suite, which used Markov chains to produce 4-voice music, controlled by hand-made rules. In this paper we address the issue of learning agnostically the style of a polyphonic composer, with the aim of producing new musical pieces, that satisfy additional user constraints.
In practice, a good model should satisfy three requirements: statistical accuracy (capturing faithfully statistics of correlations at various ranges, horizontally and vertically), flexibility (coping with arbitrary user constraints), and generalization capacity (inventing new material, while staying in the style of the training corpus). Models proposed so far fail on at least one of these requirements.  propose a chord invention framework but is not based on agnostic learning, and requires a hand-made ontology. The approach described in  consists in a dynamic programming template enriched by constrained Markov chains. This approach generates musically convincing results  but is ad hoc and specialized for jazz. Furthermore it does not invent any new voicing by construction (the vertical ordering of the notes in a chord).  and  describe a HMM approach trained on an annotated corpus. This model imitates the style of Bach chorales, as shown by cross entropy measures. However, it is also not able to produce new voicings by construction, and only replicates voicings found in the training corpus. Another related approach is  which uses HMMs on chord representations (based on an expert knowledge of the common-practice harmony) called General Chord Type (GCT) 
to generate homorhythmic sequences. Those models are not agnostic in the sense that they include a priori knowledge about music such as the concept of dissonance, consonance, tonality or scale degrees. Agnostic approaches using neural networks have been investigated with promising results. In
, chords are modeled with Restricted Boltzmann Machines (RBMs). Their temporal dependencies are learned using Recurrent Neural networks (RNNs). Variations of these architectures have been developed, based on Long Short-Term Memory (LSTM) units
or GRU (Gated Recurrent Units). However, these models require large and coherent training sets which are not always available. More importantly, it is not clear how to enforce additional user constraints (flexibility). Moreover, their invention capacity is not demonstrated.
In this paper we introduce a graphical model based on the maximum entropy principle for learning and generating polyphony. Such models have been used for music retrieval applications , but never, to our knowledge, for polyphonic music generation.
This model requires no expert knowledge about music and can be trained on small corpora. Moreover, generation is extremely fast.
We show that this model can capture and reproduce pairwise statistics at possibly long range, both horizontally and vertically. These pairwise statistics are also able, to some extent, to capture implicitly higher order correlations, such as the structure of 4-note chords. The model is flexible, as it allows the user to post arbitrary unary constraints on any voice. Most importantly, we show that this model exhibits a remarkable chord invention capacity. In particular we show that it produces harmonically consistent sequences using chords which did not appear in the original corpus.
we discuss a range of interactive applications in music generation. Finally, we discuss how the “musical interest” of the generated sequences depends on the choice of our model’s hyperparameters in Sect.3.5.
The model we propose is based on a maximum entropy model described in . This model is extended to handle several voices instead of one, and to establish vertical as well as diagonal interactions between notes. We formulate the model as an exponential family obtained by a product of experts (one for each voice).
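We aim to learn sequences of $n$-part chord sequences. A sequence $s = [c_1, \dots, c_l]$ is composed of $l$ chords where the $j$th chord is denoted
$$c_j := [s_{1j}, \dots, s_{nj}]$$
with note $s_{ij}$ considered as an integer pitch belonging to the pitch range $A_i \subset \mathbb{Z}$. The $i$th part or voice corresponds to $s_i := [s_{i1}, \dots, s_{il}]$.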
Our model is based on the idea that chord progressions can be generated by replicating the occurrences of pairs of neighboring notes. It is invariant by translation in time, it aims at capturing the local “texture” of the chord sequences. Similar ideas have been shown to be successful in modeling highly combinatorial and arbitrary structures such as English four-letter words .
We denote by $K$ the model scope, which means that we consider that chords distant by more than $K$ time steps are conditionally independent given all other variables. We focus on the interaction between neighbouring notes and try to replicate the co-occurrences of notes. A natural way to formalize this is to introduce a family of functions (or features) such that each member of this family counts the number of occurrences of a given pair of notes. As a result, the finite number of features we want to learn can be written as a family
$$\{ f_{ab,ijk} \}, \quad a \in A_i,\ b \in A_j,\ i, j \in [1, n],\ k \in [0, K] \tag{1}$$
of functions over chord sequences, where $A_i$ is the pitch range of voice $i$, $n$ is the number of voices, and $f_{ab,ijk}(s)$ stands for the number of occurrences of pairs of notes $(a, b)$ such that note $a$ at voice $i$ precedes by $k$ time steps note $b$ at voice $j$ in the chord sequence $s$. We can represent this family of binary connections as a graphical model, as can be seen in Fig. 1, where each subfamily $\{ f_{ab,ijk} \}_{a \in A_i,\, b \in A_j}$ is represented as a link between two notes. Our model also has unary parameters, acting on single notes and modeling the single-note marginal distributions. For notational convenience we will treat these unary parameters as binary connections between pairs of identical notes (connections such that $k = 0$, $i = j$, $a = b$) and call them local fields after the corresponding statistical physics terminology. We differentiate four types of connections: unary (local fields) and binary “horizontal”, “vertical” and “diagonal” connections. We implicitly identify $f_{ab,ij0}$ with $f_{ba,ji0}$, for all $a \in A_i$, $b \in A_j$.
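To make the feature family concrete, the following minimal sketch counts pair occurrences on a toy two-voice sequence. It is an illustration under stated assumptions, not the authors' implementation: the function name `pair_counts`, the tuple encoding `(a, b, i, j, k)` of an index, and the toy pitches are all hypothetical choices made for this example.

```python
from collections import Counter


def pair_counts(seq, scope):
    """Count pairs (a, b, i, j, k): pitch a in voice i precedes pitch b in voice j by k steps.

    seq is a list of chords; each chord is a tuple of integer pitches, one per voice.
    Keys with k == 0, i == j and a == b play the role of the unary local fields.
    """
    counts = Counter()
    n_voices = len(seq[0])
    for t, chord in enumerate(seq):
        for k in range(scope + 1):
            if t + k >= len(seq):
                break  # no chord k steps later: border term, ignored
            later = seq[t + k]
            for i in range(n_voices):
                for j in range(n_voices):
                    if k == 0 and j < i:
                        continue  # same-time pairs are symmetric: keep i <= j only
                    counts[(chord[i], later[j], i, j, k)] += 1
    return counts


# Toy two-voice sequence (MIDI-like pitches chosen arbitrarily).
chords = [(60, 64), (62, 65), (60, 64)]
counts = pair_counts(chords, scope=1)
```

Here `counts[(60, 64, 0, 1, 0)]` is a vertical connection, `counts[(60, 62, 0, 0, 1)]` a horizontal one, and `counts[(64, 62, 1, 0, 1)]` a diagonal one.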
From now on, we will denote the set of indexes of the family (1) by $\mathcal{P}$. We note that there are approximately
$$|\mathcal{P}| \approx n^2 (K + 1) |A|^2 \tag{2}$$
indexes, where $|A|$ stands for the mean alphabet size and $n$ for the number of voices.
Using only a subset of the family given by (1) can reduce the number of parameters while leading to good results. Indeed, if we consider that notes in different voices are conditionally independent if they are distant by more than $L$ time steps, we obtain an index set of size approximately equal to
$$\big( n (K + 1) + n (n - 1) (L + 1) \big) |A|^2. \tag{3}$$
In the following, $\mathcal{P}$ can designate the whole set of indexes within scope as well as any of its subsets.
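Let $\mu_p$ for $p \in \mathcal{P}$ be real numbers, where $\mathcal{P}$ is the index set of the features $f_p$. From all distributions $P$ such that the averages over all possible sequences of length $l$ verify
$$\mathbb{E}_{s \sim P}\left[ f_p(s) \right] = \mu_p, \quad \forall p \in \mathcal{P}, \tag{4}$$
it is known that the exponential distribution with statistics $f_p$ and parameters $\theta_p$ is the one of maximum entropy (i.e. the one which makes the least amount of assumptions) satisfying Eq. (4). We thus consider an energy-based model of parameter $\theta = (\theta_p)_{p \in \mathcal{P}}$
$$P(s \mid \theta) = \frac{e^{-E(s, \theta)}}{Z(\theta)}, \tag{5}$$
where
$$E(s, \theta) = -\sum_{p \in \mathcal{P}} \theta_p f_p(s)$$
is usually called the energy of the sequence $s$ and
$$Z(\theta) = \sum_{s} e^{-E(s, \theta)}$$
the normalizer or partition function such that $P(\cdot \mid \theta)$ defines a probability function over sequences of size $l$. The sum in the partition function is over every sequence $s$ of size $l$. There are approximately $|A|^{nl}$ such sequences, for $n$ voices and a mean alphabet size $|A|$, which makes the exact computation of the partition function intractable in general.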
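The intractability of the partition function can be illustrated with a brute-force computation on a toy problem. The sketch below assumes a sparse dictionary `theta` mapping hypothetical index tuples `(a, b, i, j, k)` to weights; the helper names and the toy alphabet are illustrative and do not come from the paper.

```python
import itertools
import math


def features(seq, scope):
    """Yield every (a, b, i, j, k) pair occurrence in a chord sequence."""
    n = len(seq[0])
    for t in range(len(seq)):
        for k in range(min(scope, len(seq) - 1 - t) + 1):
            for i in range(n):
                for j in range(n):
                    if k > 0 or i <= j:  # same-time pairs counted once
                        yield (seq[t][i], seq[t + k][j], i, j, k)


def energy(seq, theta, scope):
    """E(s) = -sum_p theta_p * f_p(s); parameters absent from theta count as zero."""
    return -sum(theta.get(p, 0.0) for p in features(seq, scope))


def partition(theta, alphabet, n_voices, length, scope):
    """Brute-force Z: enumerates all |A|^(n*l) sequences, so only usable on toy sizes."""
    chords = list(itertools.product(alphabet, repeat=n_voices))
    return sum(math.exp(-energy(s, theta, scope))
               for s in itertools.product(chords, repeat=length))


# With all parameters at zero every sequence has weight 1, so Z = |A|^(n*l).
z_uniform = partition({}, alphabet=(60, 62), n_voices=2, length=2, scope=1)
```

Even for four voices, a few dozen pitches per voice and a handful of chords, the number of terms in this sum is far beyond what can be enumerated, which motivates the pseudo-likelihood training and the ratio-based sampling used in the rest of the paper.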
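We are given a training dataset $\mathcal{D}$ composed of $n$-part sequences that we suppose, for clarity, to be concatenated in one sequence $s$ of length $l$. Since we are dealing with discrete data, gradient techniques such as score matching cannot be used. Instead, we choose to minimize the negative pseudo-log-likelihood [24, 9] of the data to find an approximation of the true maximum likelihood estimator. It consists in approximating the negative log-likelihood function
$$\mathcal{L}(\theta, s) = -\log P(s \mid \theta)$$
by the mean of conditional log-likelihoods of a note given the others, that is
$$\mathcal{L}(\theta, s) = -\frac{1}{nl} \sum_{i=1}^{n} \sum_{j=1}^{l} \log P(s_{ij} \mid s \setminus s_{ij}, \theta), \tag{8}$$
where $s_{ij}$ is the note of voice $i$ at time $j$ and $s \setminus s_{ij}$ denotes all notes in $s$ except $s_{ij}$. The conditional probabilities are calculated as
$$P(s_{ij} \mid s \setminus s_{ij}, \theta) = \frac{P(s \mid \theta)}{\sum_{s'} P(s' \mid \theta)}, \tag{9}$$
where the sum in the denominator is over chord sequences $s'$ equal to $s$ except for the note in position $(i, j)$.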
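Due to the particular structure of the probability density function (5) and the choice of the statistics (1), we note that we can write
$$P(s_{ij} \mid s \setminus s_{ij}, \theta) = P(s_{ij} \mid N_K(i, j, s), \theta),$$
where $N_K(i, j, s)$ stands for the neighbors of note $s_{ij}$ in $s$ which are within $K$ time steps. We now express our dataset as a set of samples $(x, N)$ consisting of a note and its $K$-distant neighbors, ignoring border terms whose effect is negligible. More precisely, we write the dataset
$$\mathcal{D} = \{ (s_{ij}, N_K(i, j, s)) : i \in [1, n],\ j \in [1, l] \}$$
and split it into $n$ datasets $\mathcal{D}_i$ such as
$$\mathcal{D}_i = \{ (s_{ij}, N_K(i, j, s)) : j \in [1, l] \}.$$
Each element in this dataset can still be considered as the subsequence it comes from, also written $s$. Those notations set, we can rewrite Eq. 8 as
$$\mathcal{L}(\theta, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_i(\theta, \mathcal{D}_i),$$
where
$$\mathcal{L}_i(\theta, \mathcal{D}_i) = -\frac{1}{l} \sum_{(x, N) \in \mathcal{D}_i} \log P(x \mid N, \theta)$$
is the negative conditional log-likelihood function for voice $i$. This consists in minimizing the mean of $n$ negative log-likelihood functions (one for each voice) over the data $\mathcal{D}_i$. This method has the advantage of being tractable, since there are only $|A_i|$ terms in the denominator of Eq. 9, and can lead to good estimates. This can be seen as the likelihood of a product of experts using modified (addition of vertical and diagonal connections) copies of the model presented in .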
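The pseudo-likelihood objective can be sketched in a few lines. The code below is a deliberately naive illustration: it re-evaluates the full energy for each candidate pitch instead of restricting the computation to the neighbors within scope, which is what makes the method cheap in practice. The names `conditional` and `neg_pseudo_log_likelihood`, the sparse dictionary `theta`, and the use of a single shared `alphabet` for all voices are assumptions of this example.

```python
import math


def features(seq, scope):
    """Yield every (a, b, i, j, k) pair occurrence in a chord sequence."""
    n = len(seq[0])
    for t in range(len(seq)):
        for k in range(min(scope, len(seq) - 1 - t) + 1):
            for i in range(n):
                for j in range(n):
                    if k > 0 or i <= j:
                        yield (seq[t][i], seq[t + k][j], i, j, k)


def energy(seq, theta, scope):
    """E(s) = -sum_p theta_p * f_p(s)."""
    return -sum(theta.get(p, 0.0) for p in features(seq, scope))


def conditional(seq, t, i, theta, alphabet, scope):
    """Return {c: P(note at (t, i) = c | all other notes)}; only |alphabet| terms are needed."""
    scores = {}
    for c in alphabet:
        chord = seq[t][:i] + (c,) + seq[t][i + 1:]
        scores[c] = -energy(seq[:t] + [chord] + seq[t + 1:], theta, scope)
    top = max(scores.values())  # subtract the maximum for numerical stability
    norm = sum(math.exp(v - top) for v in scores.values())
    return {c: math.exp(v - top) / norm for c, v in scores.items()}


def neg_pseudo_log_likelihood(seq, theta, alphabet, scope):
    """Mean over all notes of -log P(note | all other notes)."""
    total = 0.0
    for t in range(len(seq)):
        for i in range(len(seq[0])):
            total -= math.log(conditional(seq, t, i, theta, alphabet, scope)[seq[t][i]])
    return total / (len(seq) * len(seq[0]))


# With all parameters at zero every conditional is uniform, so the loss equals log|A|.
loss_uniform = neg_pseudo_log_likelihood([(60, 64), (62, 60)], {}, (60, 62, 64), scope=1)
```

A local field that favors a pitch actually present in the data lowers this loss, which is the direction a gradient-based optimizer follows.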
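We need to find the parameters $\theta$ minimizing the sum of $n$ convex functions $\mathcal{L}_i$. Computing the gradient of $\mathcal{L}_i$ for $i \in [1, n]$ with respect to any parameter $\theta_p$, $p \in \mathcal{P}$, gives us
$$\frac{\partial \mathcal{L}_i(\theta, \mathcal{D}_i)}{\partial \theta_p} = \frac{1}{l} \sum_{(x, N) \in \mathcal{D}_i} \left( \frac{\sum_{s'} f_p(s')\, e^{-E(s', \theta)}}{\sum_{s'} e^{-E(s', \theta)}} - f_p(s) \right),$$
where the sums over $s'$ run over the sequences equal to $s$ except for the note $x$. This can be written as
$$\frac{\partial \mathcal{L}_i(\theta, \mathcal{D}_i)}{\partial \theta_p} = \frac{1}{l} \sum_{(x, N) \in \mathcal{D}_i} \left( \mathbb{E}_{x' \sim P(\cdot \mid N, \theta)}\left[ f_p(s') \right] - f_p(s) \right),$$
where $s'$ denotes $s$ with the note $x$ replaced by $x'$, which is the difference between the average value of $f_p$ taken with respect to the conditional distribution (9) and its empirical value.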
A preprocessing of the corpus is introduced in order to efficiently compute the gradient sums.
Finally, the function we will optimize is the $\ell_1$-regularized version of the negative pseudo-log-likelihood $\mathcal{L}$ with regularization parameter $\lambda$, which means that we consider
$$\mathcal{L}(\theta, \mathcal{D}) + \lambda \|\theta\|_1, \tag{13}$$
where $\|\cdot\|_1$ is the usual $\ell_1$-norm, which is the sum of the absolute values of the coordinates of the parameter. This is known as the Lasso regularization, whose effects are widely discussed throughout the statistical learning literature .
Generation is performed with the Metropolis-Hastings algorithm, which is an extensively used sampling algorithm (see  for an introduction). Its main feature is the possibility to sample from an unnormalized distribution since it only needs to compute ratios of probabilities
$$\alpha = \frac{P(s' \mid \theta)}{P(s \mid \theta)} \tag{14}$$
between two sequences $s$ and $s'$, in which the partition function cancels. With the specific proposals detailed below, we just need to start from a random sequence as our current sequence $s$ and repeat the following: choose a proposal $s'$, compute $\alpha$ and then accept $s'$ as our current sequence with probability $\min(1, \alpha)$ or reject this proposal with probability $1 - \min(1, \alpha)$ and keep $s$. After a sufficient number of iterations of this procedure, we are assured that the sequences obtained are distributed according to the objective distribution $P(\cdot \mid \theta)$.
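The accept/reject loop described above can be sketched as follows. This is a generic single-note Metropolis sampler written for illustration: `log_weight` stands for any unnormalized log-probability (for the model of this paper it would be minus the energy), and the toy weight used in the example is an arbitrary assumption, not a trained model. The proposal picks a position and a pitch uniformly, which may re-propose the current pitch; the proposal remains symmetric, so the simple ratio suffices.

```python
import math
import random


def metropolis(log_weight, init, alphabet, steps, rng):
    """Single-note Metropolis sampler for the unnormalized distribution exp(log_weight(s))."""
    seq = [list(chord) for chord in init]
    current = log_weight(seq)
    for _ in range(steps):
        # Symmetric proposal: change one note, chosen uniformly, to a uniform pitch.
        t = rng.randrange(len(seq))
        i = rng.randrange(len(seq[0]))
        old = seq[t][i]
        seq[t][i] = rng.choice(alphabet)
        proposed = log_weight(seq)
        # Accept with probability min(1, P(s') / P(s)); the partition function cancels.
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed
        else:
            seq[t][i] = old  # reject: restore the previous note
    return [tuple(chord) for chord in seq]


def toy_log_weight(seq):
    """Toy unnormalized log-probability that strongly favors pitch 60 everywhere."""
    return 50.0 * sum(note == 60 for chord in seq for note in chord)


sample = metropolis(toy_log_weight, [(62, 64), (64, 62)], (60, 62, 64),
                    steps=1000, rng=random.Random(0))
```

In an efficient implementation only the terms involving the modified note and its neighbors would be recomputed, rather than the full `log_weight`.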
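We used this algorithm to generate chord sequences by choosing the proposal $s'$ uniformly among all sequences differing by only one note from the current sequence $s$. In fact, with this algorithm, we can also enforce unary constraints on the produced sequences. If we only propose sequences containing a sequence of imposed notes $\bar{s}_{ij}$, i.e. such that
$$s_{ij} = \bar{s}_{ij}, \quad \forall (i, j) \in \mathcal{C},$$
where $\mathcal{C}$ contains the indexes of the constrained notes, the Metropolis-Hastings algorithm samples from the distribution
$$P\big( s \mid s_{ij} = \bar{s}_{ij}\ \ \forall (i, j) \in \mathcal{C},\ \theta \big).$$
This enables us to provide reharmonizations of a given melody. We can also add pitch range constraints on given notes, which means that we are given a set
$$\{ R_{ij} \subset A_i : (i, j) \in \mathcal{C} \}$$
of allowed pitches, where $A_i$ is the pitch range of voice $i$, such that we only propose sequences where
$$s_{ij} \in R_{ij}, \quad \forall (i, j) \in \mathcal{C}.$$
This can be used to impose chord labels without imposing their voicing.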
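Both kinds of unary constraints amount to restricting the proposals. The sketch below encodes them with a hypothetical table `allowed[t][i]` listing the pitches permitted for voice `i` at time `t`: a one-element entry imposes a note, a longer entry imposes a pitch range. The scoring function is a toy stand-in for the trained model, and the initial sequence is assumed to satisfy the constraints.

```python
import math
import random


def constrained_metropolis(log_weight, init, allowed, steps, rng):
    """Metropolis sampling where each note may only take pitches from allowed[t][i]."""
    seq = [list(chord) for chord in init]
    # Notes with a single allowed pitch are fixed and never proposed.
    free = [(t, i) for t, row in enumerate(allowed)
            for i, pitches in enumerate(row) if len(pitches) > 1]
    if not free:
        return [tuple(chord) for chord in seq]
    current = log_weight(seq)
    for _ in range(steps):
        t, i = rng.choice(free)
        old = seq[t][i]
        seq[t][i] = rng.choice(allowed[t][i])
        proposed = log_weight(seq)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed
        else:
            seq[t][i] = old
    return [tuple(chord) for chord in seq]


def toy_log_weight(seq):
    """Toy score (not a trained model): reward octaves, thirds and fifths between voices."""
    return sum(1.0 for top, low in seq if (top - low) % 12 in (0, 4, 7))


melody = [72, 74, 76]                    # imposed upper voice
bass_range = (48, 52, 55)                # pitch range constraint on the lower voice
allowed = [[(m,), bass_range] for m in melody]
init = [(m, 48) for m in melody]         # starting point that satisfies the constraints
harmonized = constrained_metropolis(toy_log_weight, init, allowed, 200, random.Random(1))
```

Because the constrained positions are never modified, every intermediate sequence satisfies the constraints, and the chain explores only the conditional distribution.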
We report experiments made using a set of 343 four-voice ($n = 4$) chorale harmonizations by Johann Sebastian Bach . In order to evaluate the chord creation capabilities, we only retained in our chord sequences the notes heard on beats. Sect. 3.6 shows how we can easily produce rhythm using our model. We transposed every chorale to the key of C and considered 2 corpora: a corpus with chorales in a major key and a corpus with chorales in a minor key.
The quadratic number of parameters (thanks to the exclusive use of binary connections) makes the learning phase computationally tractable.
We used the L-BFGS method from the Stanford CoreNLP optimization package  to perform the optimization.
In the next sections, we report on accuracy (the style imitation capacity of the model), its invention capacity, and flexibility.
We investigated the capabilities of the system to reproduce pair-wise correlations of the training set. Fig.2 shows a scatter plot comparing the (normalized) values of each binary connection in the generated sequences versus the ones of the original training corpus. The model was trained on a corpus of major chorales, which represents the equivalent of a -beat long chord sequence. We chose to differentiate horizontal connections from vertical and diagonal ones by introducing a parameter as mentioned in Sect. 2.1. We took , , as parameters and generated a -beat long sequence. For a discussion on the choice of the regularization parameter, see Sect. 3.5. We see that despite the small amount of data, the alignment between the generated pair occurrences and the original ones is quite convincing.
The generation procedure needs solely to compute the ratios (14), which can be done in approximately operations. Indeed, since the sequences differ by only one note, only the contributions of its neighboring notes have to be taken into account. This has to be compared with the approximate number (3) of parameters. Experimentally, we found that the number of Metropolis steps to achieve convergence is of order which enables these models to be used in real-time applications.
We argue that this model does not only reproduce pairwise statistics but can in fact capture higher-order interactions, which makes it suitable for style imitation. Indeed, the way the binary connections are combined in Eq. 5 makes the model able to reproduce correct voicings even if the way the notes composing a chord are distributed among the different voices is conceptually an interaction of order $n$, the number of voices. Fig. 3 shows chords with nice voicings: voices do not cross, triads have correct doublings and each separate voice has a coherent shape. A detailed analysis of the chord creation capacity of the system is made in Sec. 3.2. We discuss the capability to reproduce other higher-order patterns in Sect. 3.3.
We claim that the competition between the horizontal, vertical and diagonal correlations of our probabilistic model can generate new chords in the learned “style”. Three categories of generated chords can be distinguished: the cited chords, which appear in the model’s training set; the discovered chords, which do not appear in the training set but can be found in other Bach chorales; and the invented chords, which do not belong to any of the above categories. We used the same model as above to plot in Fig. 4 the mean distribution of chords among these categories during the Metropolis-Hastings generation as a function of the number of Metropolis steps (divided by ). These curves highly depend on the parameters chosen for the model and on the corpus’s size and structure. Nonetheless we can note the characteristic time for the model to sample from the equilibrium distribution. For every parameter set we tested, we observed that when convergence is reached, the proportion of invented and discovered chords seems fixed and significant.
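The three categories can be computed with simple set membership tests. The sketch below is illustrative only: the function names are hypothetical, chords are assumed to be encoded as tuples of pitches, and the toy chords are not taken from the corpus.

```python
from collections import Counter


def classify_chords(generated, training, other):
    """Label each generated chord as 'cited', 'discovered' or 'invented'."""
    train_set, other_set = set(training), set(other)
    labels = []
    for chord in generated:
        if chord in train_set:
            labels.append("cited")        # seen in the training corpus
        elif chord in other_set:
            labels.append("discovered")   # unseen in training, present in other chorales
        else:
            labels.append("invented")     # found in neither
    return labels


def proportions(labels):
    """Fraction of chords in each category."""
    counts = Counter(labels)
    return {name: counts[name] / len(labels)
            for name in ("cited", "discovered", "invented")}


training = [(48, 55, 64, 72)]
other = [(50, 57, 65, 74)]
generated = [(48, 55, 64, 72), (50, 57, 65, 74), (47, 55, 62, 74), (48, 55, 64, 72)]
labels = classify_chords(generated, training, other)
shares = proportions(labels)
```

Evaluating `proportions` on the current sequence at regular intervals during sampling would give curves of the kind described for Fig. 4.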
A closer investigation shows that most of these “invented” chords can in fact be classified as valid in the style of Bach by an expert. The majority of the invented chords is composed of “correct” voicings of minor or major triads, seventh chords and chords with nonchord tones. Fig. 3 exhibits interesting “inventions” such as an (unprepared) resolution, a dominant ninth and a diminished seventh. Other invented chords are discordant. A blindfolded evaluation was conducted to assess to which extent listeners distinguish invented chords from cited ones. Three non-professional music-loving adult listeners were presented with a series of invented chords extracted from generated sequences, and played with their context (i.e. 4 chords before and 4 chords after). They were asked whether the central chord was “good” or not. Results show that on average of the invented chords were considered as acceptable.
The same analysis as in Sec. 3.2 can be made for other structures than chords. We chose to investigate to which extent the model is able to reproduce the occurrences of quadrilateral tuples. For a sequence $s$, we define the quadrilateral tuple (see Fig. 3) between voices $i$ and $i'$ at position $j$ to be the tuple
$$(s_{i,j},\ s_{i,j+1},\ s_{i',j},\ s_{i',j+1}),$$
where $s_{i,j}$ denotes the note of voice $i$ at time $j$.
These tuples are of particular interest since many harmonic rules, such as the prohibition of consecutive fifths and consecutive octaves in counterpoint, apply to them. Table 1 compares the percentage of cited/discovered/invented quadrilateral tuples generated by the model of Sec. 3.1, by a model containing only vertical interactions and by an independent model (which reproduces only pitch frequencies).
It shows that an important part of those higher order structures is reproduced. However, analysis exhibits limitations on the higher-order statistics that can be captured (see for instance Fig. 3). Indeed, even if our preprocessed corpus contains some of these “rules violations”, our model is unable to statistically reproduce the number of such structures (they are to times more frequent than in the original corpus). We discuss non agnostic methods to integrate these particular rules in Sec. 4.
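Quadrilateral tuples, and the rule violations they expose, are easy to extract from a chord sequence. The following sketch uses hypothetical helper names and a simplified, non-agnostic definition of consecutive fifths and octaves (both voices move and the interval modulo the octave is unchanged); it is meant as an illustration of the structure, not as the analysis procedure used for Table 1.

```python
def quadrilateral_tuples(seq, i, j):
    """Tuples (voice i now, voice i next, voice j now, voice j next) for consecutive chords."""
    return [(seq[t][i], seq[t + 1][i], seq[t][j], seq[t + 1][j])
            for t in range(len(seq) - 1)]


def is_consecutive(tup, interval):
    """True if both voices move and stay `interval` semitones apart, modulo the octave.

    interval=7 detects consecutive fifths, interval=0 consecutive octaves (or unisons).
    """
    a0, a1, b0, b1 = tup
    both_move = a0 != a1 and b0 != b1
    return (both_move
            and abs(a0 - b0) % 12 == interval
            and abs(a1 - b1) % 12 == interval)


# Toy two-voice sequence: the first transition is a pair of consecutive fifths.
seq = [(67, 60), (69, 62), (71, 62)]
tuples = quadrilateral_tuples(seq, 0, 1)
fifths = [is_consecutive(tup, 7) for tup in tuples]
```

Counting such tuples in the corpus and in generated sequences gives the kind of comparison discussed above.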
As claimed in Sec. 2.3, we can use our model to generate new harmonizations of a melody. Indeed, the simplicity and adaptability of the model allows it to be “twisted” in order to enforce unary constraints while still generating sequences in the learned style. As our model is in a specified key (all chorales were transposed to the same key), we can thus provide convincing Bach-like harmonizations of plainsong melodies provided they “fit” in the training key. Fig. 5 shows two reharmonizations of Beethoven’s Ode to Joy with different unary constraints (music examples can be heard on the http://flowmachines.jimdo.com/ website). It is worth noting that even if we put constraints on isolated notes and not on full chords, the constraints propagate well both vertically (the voicings are correct) and horizontally (the progression of chords around the constrained notes is coherent). This opens up a wide range of applications. Those examples show how enforcing simple unary constraints can be used to produce interesting musical phenomena during reharmonization.
By modifying the constrained Metropolis sampling scheme of Sect. 2.3 we can harmonize any melody of size $l$. For each beat $j$, we use a melody analyzer to yield the current key $\kappa_j$ at beat $j$. If the new proposed sequence differs from the current one by a note at beat $j$, we compute the acceptance ratio (14) by using the probability distribution of the model trained in key $\kappa_j$. By doing so, we choose the appropriate model for each chunk of the melody and “glue” the results together seamlessly.
In this section we discuss the choice of the regularization parameter $\lambda$ of Eq. 13. The benefits of introducing an $\ell_1$-regularization are multiple: it makes the loss function (13) strictly convex (in our case, we do not need to determine if the family (1) is a sufficient family), tends to reduce the number of non-zero parameter coordinates and prevents overfitting. As our model possesses a large number of parameters compared to the number of samples, adding a regularization term during the training phase appears to be mandatory in obtaining good results in the applications we mentioned in Sect. 2.3 and 3.4.
We thus evaluate the impact of the choice of on the cited/discovered/invented classification curves (Fig. 3). We compare the mean repartition of chords in the unconstrained generation case and in the reharmonization case where the first voice is constrained. The same training corpus as in Sect. 2.3 is used, with , and varying . We use the first voice of chorales from the testing corpus as constraints. Results are presented in Fig. 6.
A clear influence of the regularization parameter $\lambda$ appears for both the unconstrained and constrained generations. However, these curves are not sharply peaked and it is not clear which regularization parameter would be the most musically interesting one.
For this, we investigated to which extent the regularization parameter influences the model’s ability to reproduce a wide variety of chords, either seen in the training set or rediscovered in the testing set.
We introduce two quantities revealing the diversity in the generated sequences: the percentage of restitution of the training corpus (the number of cited chords counted without repetitions normalized by the total number of different chords in the training corpus) and the percentage of discovery of the testing corpus (the number of different discovered chords normalized by the number of chords counted without repetitions in the testing corpus which are not in the training corpus). The evolution of the restitution/discovery percentages as a function of the regularization parameter $\lambda$ is plotted in Fig. 7 for both the constrained and unconstrained generation cases.
Both figures exhibit the same behavior. High values of $\lambda$ lead to uniform models and low values of $\lambda$ to models which overfit the training data. But the most interesting observation is that their maximum is not attained for the same value of $\lambda$, revealing the possibility to control the tradeoff between the invention capacity, the diversity and the faithfulness with respect to the training corpus in the generated sequences.
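The two diversity measures defined above reduce to set operations on distinct chords. The sketch below is a direct transcription of those definitions under the assumption that chords are hashable (here, single letters stand in for chords); the function name is hypothetical.

```python
def restitution_and_discovery(generated, training, testing):
    """Return (restitution %, discovery %) computed on distinct chords.

    restitution: distinct cited chords / distinct training chords.
    discovery: distinct discovered chords / distinct testing chords absent from training.
    """
    gen, train, test = set(generated), set(training), set(testing)
    unseen_test = test - train
    restitution = len(gen & train) / len(train)
    discovery = len(gen & unseen_test) / len(unseen_test) if unseen_test else 0.0
    return 100.0 * restitution, 100.0 * discovery


# Toy corpora: letters stand in for chords.
training = ["A", "B", "A"]
testing = ["B", "C", "D"]
generated = ["A", "C", "C", "E"]
scores = restitution_and_discovery(generated, training, testing)
```

Evaluating this function on sequences generated for several values of the regularization parameter would trace the kind of curves described for Fig. 7.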
In this work we focus on chord reproduction and invention. It is an interesting question since the model is by construction pairwise and chords are, in our examples, four-note objects. In order to simplify the analysis on four-note chords we chose to work on homorhythmic sequences and we discarded all notes not falling on the beat in our corpus, J.S. Bach’s chorales. However, real music is not necessarily homorhythmic. It has rhythm, i.e. notes come in varying durations to form temporal patterns which take place with respect to some periodic pulse, a kind of temporal canvas.
We propose a simple way to extend our model in order to account for rhythmic patterns. The model initially presented in  and extended in Sect. 2 is translation invariant. For rhythmic patterns to emerge we need to break this translation invariance. We do that by introducing position-dependent parameters. More specifically, we choose a cycle which is repeated over time and within which the translation invariance is broken. Such a cycle can be for example one or two bars of music. We then divide this cycle into equal time bins which correspond to all possible positions where a note can start or end. We call these time bins metrical positions. We could then define different parameters between notes on different metrical positions but that would lead to a very large number of parameters, yielding very inaccurate learning (it can be argued that the number of parameters must be smaller than the number of data points). We have found that a good compromise is to let the unary parameters (local fields) be position-dependent while keeping the translation invariance for the true binary parameters. This leads to a negligible increase in the number of parameters since the unary parameters are of order whereas the true binary ones are of order . Finally, in order to obtain a variety of note durations as well as rests, we introduce two additional symbols in the alphabet: one symbol for rests and one symbol that signifies the continuation of the previous note in the current metrical position.
The above procedure has the following effect: the position-dependent parameters locally bias the occurrence of symbols (pitches, rests or continuations of the previous pitch) in a way that is consistent with the original corpus, which leads to the emergence of rhythmic patterns of the same kind as the ones found in the corpus. An example can be seen in Fig. 8. For this example we used a cycle of one bar and divided it into 8 equal parts, corresponding to eighth notes (quavers), which is also the smallest division found in the corpus (here Missa Sanctorum Meritis, Credo by Giovanni Pierluigi da Palestrina).
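The encoding with rest and continuation symbols can be sketched as follows. This is one possible encoding, written for illustration: the symbol names `REST` and `HOLD`, the `(pitch, duration)` input format and the choice to repeat the rest symbol for long rests are assumptions of this example, not details given in the paper.

```python
from collections import Counter

REST, HOLD = "rest", "hold"   # the two extra alphabet symbols (names are hypothetical)


def encode_voice(notes, bins_per_cycle):
    """Encode (pitch or None, duration in bins) events as (metrical position, symbol) pairs.

    A note occupies its first bin with its pitch and the following bins with HOLD;
    a rest (pitch None) is written as REST in every bin it covers.
    """
    symbols = []
    for pitch, duration in notes:
        if pitch is None:
            symbols.extend([REST] * duration)
        else:
            symbols.append(pitch)
            symbols.extend([HOLD] * (duration - 1))
    return [(t % bins_per_cycle, symbol) for t, symbol in enumerate(symbols)]


# One voice over two cycles of two bins each.
encoded = encode_voice([(60, 2), (None, 1), (62, 1)], bins_per_cycle=2)
# Position-dependent unary statistics: how often each symbol occurs at each metrical position.
position_counts = Counter(encoded)
```

The counts in `position_counts` are the statistics that position-dependent local fields would be fitted to, while the binary connections would still be counted independently of the metrical position.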
We proposed a probabilistic model for chord sequences that captures pairwise dependencies between neighboring notes. The model is able to reproduce harmonic progressions without any prior information and invent new “stylistically correct” chords. The possibility to sample with arbitrary unary user-defined constraints makes this model applicable in a wide range of situations. We focused mainly on the chord creation and restitution capabilities, which are from our point of view its most interesting feature, an analysis of the plagiarism of monophonic graphical models being made in . We showed that even if the original training set is highly combinatorial, these probabilistic methods behave impressively well, even if high-order hard constraints such as parallel fifths or octaves cannot be captured. This method is general and applies to all discrete $n$-tuple sequences. Indeed, we used as features the occurrences of notes, but any other family of functions could be selected. For instance, adding occurrences of parallel fifths or parallel octaves in the family (1) would be possible and would only require parameters, which does not increase our model’s complexity.
The utmost importance of the regularization parameter suggests investigating finer and more problem-dependent regularizations such as group lasso  or other hierarchical sparsity-inducing norms . We believe that having more than a single scalar regularization parameter can lead to a better control of the “creativity” of our model.
This research is conducted within the Flow Machines project which received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 291156.
Boulanger-Lewandowski, N., Bengio, Y., Vincent, P.: Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). pp. 1159–1166 (2012)
Eppe, M., Confalonieri, R., Maclean, E., Kaliakatsos, M., Cambouropoulos, E., Schorlemmer, M., Codescu, M., Kühnberger, K.U.: Computational invention of cadences and chord progressions by conceptual chord-blending. In: International Joint Conference on Artificial Intelligence (IJCAI) (2015)
Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. ACL 2014 p. 55 (2014)
Ravikumar, P., Wainwright, M.J., Lafferty, J.D., et al.: High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics 38(3), 1287–1319 (2010)