code accompanying "DeepBach: a Steerable Model for Bach Chorales Generation" paper
This paper introduces DeepBach, a graphical model aimed at modeling polyphonic music and specifically hymn-like pieces. We claim that, after being trained on the chorale harmonizations by Johann Sebastian Bach, our model is capable of generating highly convincing chorales in the style of Bach. DeepBach's strength comes from the use of pseudo-Gibbs sampling coupled with an adapted representation of musical data. This is in contrast with many automatic music composition approaches which tend to compose music sequentially. Our model is also steerable in the sense that a user can constrain the generation by imposing positional constraints such as notes, rhythms or cadences in the generated score. We also provide a plugin on top of the MuseScore music editor making the interaction with DeepBach easy to use.
The composition of polyphonic chorale music in the style of J.S. Bach has represented a major challenge in automatic music composition over the last decades. The corpus of the chorale harmonizations by Johann Sebastian Bach is remarkable by its homogeneity and its size (389 chorales in (Bach, 1985)). All these short pieces (approximately one minute long) are written for a four-part chorus (soprano, alto, tenor and bass) using similar compositional principles: the composer takes a well-known (at that time) melody from a Lutheran hymn and harmonizes it i.e. the three lower parts (alto, tenor and bass) accompanying the soprano (the highest part) are composed, see Fig.1 for an example.
Moreover, since the aim of reharmonizing a melody is to give more power or new insights to its text, the lyrics have to be understood clearly. We say that voices are in homophony, i.e. they articulate syllables simultaneously. This implies characteristic rhythms, variety of harmonic ideas as well as characteristic melodic movements which make the style of these chorale compositions easily distinguishable, even for non experts.
The difficulty, from a compositional point of view, comes from the intricate interplay between harmony (notes sounding at the same time) and voice movements (how a single voice evolves through time). Furthermore, each voice has its own “style” and its own coherence. Finding a chorale-like reharmonization which combines Bach-like harmonic progressions with musically interesting melodic movements is a problem which often takes years of practice for musicians.
From the point of view of automatic music generation, the first solution to this apparently highly combinatorial problem was proposed by (Ebcioglu, 1988) in 1988. This problem is seen as a constraint satisfaction problem, where the system must fulfill numerous hand-crafted constraints characterizing the style of Bach. It is a rule-based expert system which contains no less than 300 rules and tries to reharmonize a given melody with a generate-and-test method and intelligent backtracking. Among the short examples presented at the end of the paper, some are flawless. The drawbacks of this method are, as stated by the author, the considerable effort to generate the rule base and the fact that the harmonizations produced “do not sound like Bach, except for occasional Bachian patterns and cadence formulas.” In our opinion, the requirement of an expert knowledge implies a lot of subjective choices.
A neural-network-based solution was later developed by (Hild et al., 1992). This method relies on several neural networks, each one trained for solving a specific task: a harmonic skeleton is first computed, then refined and ornamented. A similar approach is adopted in (Allan & Williams, 2005), but uses Hidden Markov Models (HMMs) instead of neural networks. Chords are represented as lists of intervals and form the states of the Markov models. These approaches produce interesting results even if they both use expert knowledge and bias the generation by imposing their compositional process. In (Whorley et al., 2013; Whorley & Conklin, 2016), the authors elaborate on those methods by introducing multiple viewpoints and variations on the sampling method (generated sequences which violate “rules of harmony” are put aside for instance). However, this approach does not produce a convincing chorale-like texture, rhythmically as well as harmonically, and the resort to hand-crafted criteria to assess the quality of the generated sequences might rule out many musically-interesting solutions.
Recently, agnostic approaches (requiring no knowledge about harmony, Bach’s style or music) using neural networks have been investigated with promising results. In (Boulanger-Lewandowski et al., 2012), chords are modeled with Restricted Boltzmann Machines (RBMs). Their temporal dependencies are learned using Recurrent Neural Networks (RNNs). Variations of these architectures based on Long Short-Term Memory (LSTM) units (Hochreiter & Schmidhuber, 1997; Mikolov et al., 2014) or GRUs (Gated Recurrent Units) have been developed by (Lyu et al., 2015) and (Chung et al., 2014) respectively. However, these models, which work on piano-roll representations of the music, are too general to capture the specificity of Bach chorales. Also, a major drawback is their lack of flexibility. Generation is performed from left to right. A user cannot interact with the system: it is impossible to do reharmonization, for instance, which is essentially how the corpus of Bach chorales was composed. Moreover, their invention capacity and non-plagiarism abilities are not demonstrated.
A method that addresses the rigidity of sequential generation in music was first proposed in (Sakellariou et al., 2015, 2016) for monophonic music and later generalized to polyphony in (Hadjeres et al., 2016). These approaches advocate for the use of Gibbs sampling as a generation process in automatic music composition.
The most recent advance in chorale harmonization is arguably the BachBot model (Liang, 2016), an LSTM-based approach specifically designed to deal with Bach chorales. This approach relies on little musical knowledge (all chorales are transposed to a common key) and is able to produce high-quality chorale harmonizations. However, compared to our approach, this model is less general (produced chorales are all in the C key for instance) and less flexible (only the soprano can be fixed). Similarly to our work, the authors evaluate their model with an online Turing test. They also take into account the fermata symbols (Fig. 4), which are indicators of the structure of the chorales.
In this paper we introduce DeepBach, a dependency network (Heckerman et al., 2000) capable of producing musically convincing four-part chorales in the style of Bach by using a Gibbs-like sampling procedure. Contrary to models based on RNNs, we do not sample from left to right which allows us to enforce positional, unary user-defined constraints such as rhythm, notes, parts, chords and cadences. DeepBach is able to generate coherent musical phrases and provides, for instance, varied reharmonizations of melodies without plagiarism. Its core features are its speed, the possible interaction with users and the richness of harmonic ideas it proposes. Its efficiency opens up new ways of composing Bach-like chorales for non experts in an interactive manner similarly to what is proposed in (Papadopoulos et al., 2016) for leadsheets.
In Sect. 2 we present the DeepBach model for four-part chorale generation. We discuss in Sect. 3 the results of an experimental study we conducted to assess the quality of our model. Finally, we provide generated examples in Sect. 4.3 and elaborate on the possibilities offered by our interactive music composition editor in Sect. 4. All examples can be heard on the accompanying web page (https://sites.google.com/site/deepbachexamples/) and the code of our implementation is available on GitHub (https://github.com/Ghadjeres/DeepBach). Even if our presentation focuses on Bach chorales, this model has been successfully applied to other styles and composers, ranging from Monteverdi five-voice madrigals to Palestrina masses.
In this paper we introduce a generative model which takes into account the distinction between voices. Sect. 2.1 presents the data representation we used. This representation is both suited to our sampling procedure and more accurate than many data representations commonly used in automatic music composition. Sect. 2.2 presents the model’s architecture and Sect. 2.3 our generation method. Finally, Sect. 2.4 provides implementation details and indicates how we preprocessed the corpus of Bach chorale harmonizations.
We use MIDI pitches to encode notes and choose to model voices separately. We consider that only one note can be sung at a given time and discard chorales with voice divisions.
Since Bach chorales only contain simple time signatures, we discretize time with sixteenth notes, which means that each beat is subdivided into four equal parts. Since there is no smaller subdivision in Bach chorales, there is no loss of information in this process.
In this setting, a voice $\mathcal{V}_i = \{\mathcal{V}_i^t\}_t$ is a list of notes indexed by $t \in \{1, \dots, T\}$, where $T$ is the duration of the piece (in sixteenth notes).
We choose to model rhythm by simply adding a hold symbol “__” coding whether or not the preceding note is held to the list of existing notes. This representation is thus unambiguous, compact and well-suited to our sampling method (see Sect. 2.3.4).
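The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `encode_voice` and the input format (pairs of MIDI pitch and duration in sixteenth notes) are hypothetical and are not taken from the DeepBach code base.

```python
HOLD = "__"  # hold symbol: the preceding note is sustained

def encode_voice(notes):
    """Expand (midi_pitch, duration_in_sixteenths) pairs into one symbol per sixteenth."""
    symbols = []
    for pitch, duration in notes:
        if duration < 1:
            raise ValueError("duration must be at least one sixteenth note")
        symbols.append(pitch)                    # onset of the note
        symbols.extend([HOLD] * (duration - 1))  # remaining time steps are holds
    return symbols

# A quarter-note C4 (MIDI 60) followed by an eighth-note D4 (MIDI 62).
soprano = encode_voice([(60, 4), (62, 2)])
```

Each note contributes exactly one pitch symbol, at its onset, however long it lasts.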
The music sheet (Fig. (b)) conveys more information than only the notes played. We can cite:
the key signature,
the time signature,
the beat index,
an implicit metronome (on which subdivision of the beat the note is played),
the fermata symbols (see Fig. 4),
current key signature,
current mode (major/minor/dorian).
In the following, we will only take into account the fermata symbols, the subdivision indexes and the current key signature. To this end, we introduce:
The fermata list $\mathcal{F}$, which indicates whether there is a fermata symbol (see Fig. 4) over the current note. Each element is a Boolean value. If a fermata is placed over a note on the music sheet, we consider that it is active for all time indexes within the duration of the note.
The subdivision list $\mathcal{S}$, which contains the subdivision indexes of the beat. Each element is an integer between 1 and 4: there is no distinction between beats in a bar, so that our model is able to deal with chorales with three and four beats per measure.
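As a rough illustration of these two metadata lists, the following sketch builds them for a short excerpt. The function names and the input format (pairs of duration in sixteenth notes and a fermata flag) are assumptions made for this example, and the excerpt is assumed to start exactly on a beat.

```python
def subdivision_list(length):
    """Subdivision index (1 to 4) of each sixteenth-note time step within its beat."""
    return [t % 4 + 1 for t in range(length)]

def fermata_list(notes):
    """One Boolean per time step; a fermata stays active for the whole note."""
    flags = []
    for duration, has_fermata in notes:
        flags.extend([has_fermata] * duration)
    return flags

# Two quarter notes, the second one carrying a fermata.
F = fermata_list([(4, False), (4, True)])
S = subdivision_list(len(F))
```

Because the subdivision index restarts at every beat rather than at every bar, the same code applies to measures of three or four beats.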
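We represent a chorale as a couple
$$(\mathcal{V}, \mathcal{M}) \qquad (1)$$
composed of voices and metadata. For Bach chorales, $\mathcal{V}$ is a list of 4 voices $\mathcal{V}_i$ for $i \in \{1, 2, 3, 4\}$ (soprano, alto, tenor and bass) and $\mathcal{M}$ a collection of metadata lists ($\mathcal{F}$ and $\mathcal{S}$).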
Our choices are very general and do not involve expert knowledge about harmony or scales but are only mere observations of the corpus. The subdivision list $\mathcal{S}$ acts as a metronome. The fermata list $\mathcal{F}$ is added since fermatas in Bach chorales indicate the end of each musical phrase. The use of fermatas to this end is a specificity of Bach chorales that we want to take advantage of.
We choose to consider the metadata sequences in $\mathcal{M}$ as given. For clarity, we suppose in this section that our dataset is composed of only one chorale written as in Eq. 1 of size $T$. We define a dependency network on the finite set of variables $\mathcal{V} = \{\mathcal{V}_i^t\}$ by specifying a set of conditional probability distributions (parametrized by parameter $\theta_{i,t}$)
$$\left\{ p_{i,t}\left(\mathcal{V}_i^t \mid \mathcal{V}_{\setminus i,t}, \mathcal{M}, \theta_{i,t}\right) \right\}_{i \in \{1,\dots,4\},\ t \in \{1,\dots,T\}} \qquad (2)$$
where $\mathcal{V}_i^t$ indicates the note of voice $i$ at time index $t$ and $\mathcal{V}_{\setminus i,t}$ all variables in $\mathcal{V}$ except the variable $\mathcal{V}_i^t$. As we want our model to be time invariant so that we can apply it to sequences of any size, we share the parameters between all conditional probability distributions on variables lying in the same voice, i.e.
$$\theta_i := \theta_{i,t}, \quad p_i := p_{i,t} \quad \text{for all } t \in \{1,\dots,T\}.$$
Finally, we fit each of these conditional probability distributions on the data by maximizing the log-likelihood. Due to weight sharing, this amounts to solving four classification problems of the form:
$$\max_{\theta_i} \sum_t \log p_i\left(\mathcal{V}_i^t \mid \mathcal{V}_{\setminus i,t}, \mathcal{M}, \theta_i\right), \quad \text{for } i \in \{1,\dots,4\}, \qquad (3)$$
where the aim is to predict a note knowing the value of its neighboring notes, the subdivision of the beat it is on and the presence of fermatas. The advantage of this formulation is that each classifier only has to make predictions within a small set of notes, namely those lying within the usual range of its voice (see Sect. 2.4).
For accurate predictions and in order to take into account the sequential aspect of the data, each classifier is modeled using four neural networks: two Deep Recurrent Neural Networks (Pascanu et al., 2013), one summing up past information and another summing up information coming from the future, together with a non-recurrent neural network for notes occurring at the same time. Only the last output from the uppermost RNN layer is kept. These three outputs are then merged and passed as the input of a fourth neural network whose output is $p_i(\mathcal{V}_i^t \mid \mathcal{V}_{\setminus i,t}, \mathcal{M}, \theta_i)$. Figure 8 shows a graphical representation for one of these models. Details are provided in Sect. 2.4. These architectural choices loosely match real compositional practice on Bach chorales. Indeed, when reharmonizing a given melody, it is often simpler to start from the cadence and write music “backwards.”
Generation in dependency networks is performed using the pseudo-Gibbs sampling procedure. This Markov Chain Monte Carlo (MCMC) algorithm is described in Alg. 1. It is similar to the classical Gibbs sampling procedure (Geman & Geman, 1984), with the difference that the conditional distributions are potentially incompatible (Chen & Ip, 2015). This means that the conditional distributions of Eq. (2) do not necessarily come from a joint distribution $p(\mathcal{V})$ and that the theoretical guarantees that the MCMC converges to this stationary joint distribution vanish. We experimentally verified that it was indeed the case by checking that the Markov Chain of Alg. 1 violates Kolmogorov’s criterion (Kelly, 2011): it is thus not reversible and cannot converge to a joint distribution whose conditional distributions match the ones used for sampling.
However, this Markov chain converges to another stationary distribution and applications on real data demonstrated that this method yielded accurate joint probabilities, especially when the inconsistent probability distributions are learned from data (Heckerman et al., 2000). Furthermore, nonreversible MCMC algorithms can in particular cases be better at sampling than reversible Markov Chains (Vucelja, 2014).
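The sampling loop can be sketched as follows. This is a minimal illustration and not a transcription of Alg. 1: the `conditional` callable is a hypothetical stand-in for the trained classifiers, and the uniform choice of voice and time index at each iteration is an assumption of this sketch.

```python
import random

def pseudo_gibbs(chorale, conditional, num_iterations, rng=None):
    """Resample one randomly chosen note per iteration from its conditional.

    chorale: list of voices, each a list of symbols of equal length (modified in place).
    conditional: hypothetical callable (i, t, chorale) -> {symbol: probability}.
    """
    rng = rng or random.Random()
    num_voices, length = len(chorale), len(chorale[0])
    for _ in range(num_iterations):
        i = rng.randrange(num_voices)   # pick a voice uniformly
        t = rng.randrange(length)       # pick a time index uniformly
        probs = conditional(i, t, chorale)
        symbols = list(probs)
        weights = [probs[s] for s in symbols]
        chorale[i][t] = rng.choices(symbols, weights=weights, k=1)[0]
    return chorale

# Toy conditional standing in for the trained classifiers: always predicts 60.
def toy_conditional(i, t, chorale):
    return {60: 1.0, 62: 0.0}

result = pseudo_gibbs([[62] * 4 for _ in range(4)], toy_conditional, 500, random.Random(0))
```

Nothing in the loop requires the conditionals to be consistent with a single joint distribution, which is what distinguishes this procedure from classical Gibbs sampling.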
The advantage of this method is that we can enforce user-defined constraints by tweaking Alg. 1:
instead of choosing voice $i$ from 1 to 4, we can choose to fix the soprano and only resample voices 2, 3 and 4 in step (3) in order to provide reharmonizations of the fixed melody
we can choose the fermata list $\mathcal{F}$ in order to impose ends of musical phrases at some places
more generally, we can impose any metadata
for any $t$ and any $i$, we can fix specific subsets $\mathcal{R}_i^t$ of notes within the range of voice $i$. We then restrict ourselves to some specific chorales by re-sampling from $p_i(\mathcal{V}_i^t \mid \mathcal{V}_{\setminus i,t}, \mathcal{M}, \theta_i, \mathcal{V}_i^t \in \mathcal{R}_i^t)$ at step (5). This allows us for instance to fix rhythm (since the hold symbol is considered as a note), impose some chords in a soft manner or restrict the vocal ranges.
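Restricting a note to a user-chosen subset amounts to conditioning the predicted distribution on that subset and renormalizing. A minimal sketch, assuming the classifier output is available as a dictionary from symbols to probabilities (the name `restrict` is hypothetical):

```python
def restrict(probs, allowed):
    """Condition a distribution {symbol: probability} on the symbol lying in `allowed`."""
    kept = {s: p for s, p in probs.items() if s in allowed}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no allowed symbol has positive probability")
    return {s: p / total for s, p in kept.items()}

# Forbid the hold symbol at this position, which forces a note onset.
constrained = restrict({60: 0.5, 62: 0.25, "__": 0.25}, allowed={60, 62})
```

Fixing a note is the special case where the allowed subset contains a single symbol, and fixing a rhythm is the case where the subset either is, or excludes, the hold symbol.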
Note that it is possible to make generation faster by making parallel Gibbs updates on GPU. Steps (3) to (5) from Alg. 1 can be run simultaneously to provide significant speedups. Even if it is known that this approach is biased (De Sa et al., 2016) (since we can simultaneously update variables which are not conditionally independent), we experimentally observed that for small batch sizes, DeepBach still generates samples of great musicality while running ten times faster than the sequential version. This allows DeepBach to generate chorales in a few seconds.
It is also possible to use the hard-disk-configurations generation algorithm (Alg. 2.9 in (Krauth, 2006)) to appropriately choose all the time indexes at which we resample in parallel so that:
every time index is at distance at least $\delta$ from the other time indexes
configurations of time indexes satisfying the relation above are equally sampled.
This trick ensures that we do not simultaneously update a variable and its local context.
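The two requirements above can be met with plain rejection sampling, sketched below. This is a naive substitute used for illustration, not the algorithm from (Krauth, 2006); the function name and its parameters are hypothetical, and `min_distance` is assumed to be at least 1.

```python
import random

def sample_spaced_indexes(length, count, min_distance, rng=None, max_tries=100000):
    """Draw `count` time indexes in range(length), pairwise at least `min_distance` apart."""
    rng = rng or random.Random()
    if (count - 1) * min_distance >= length:
        raise ValueError("no valid configuration exists")
    for _ in range(max_tries):
        candidate = sorted(rng.randrange(length) for _ in range(count))
        gaps_ok = all(b - a >= min_distance for a, b in zip(candidate, candidate[1:]))
        if gaps_ok:
            return candidate  # accepted draws are uniform over valid configurations
    raise RuntimeError("rejection sampling did not find a valid configuration")

indexes = sample_spaced_indexes(length=128, count=4, min_distance=16, rng=random.Random(0))
```

Since every candidate is drawn uniformly and invalid ones are simply discarded, all valid configurations are equally likely. The cost is a growing rejection rate as the indexes become densely packed.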
We emphasize in this section the importance of our particular choice of data representation with respect to our sampling procedure. The fact that we obtain great results using pseudo-Gibbs sampling relies exclusively on our choice to integrate the hold symbol into the list of notes.
Indeed, Gibbs sampling fails to sample the true joint distribution when variables are highly correlated, creating isolated regions of high probability states in which the MCMC chain can be trapped. However, many data representations used in music modeling such as
the piano-roll representation,
the couple (pitch, articulation) representation where articulation is a Boolean value indicating whether or not the note is played or held,
tend to make the musical data suffer from this drawback.
As an example, in the piano-roll representation, a long note is represented as the repetition of the same value over many variables. In order to only change its pitch, one needs to change simultaneously a large number of variables (which is exponentially rare) while this is achievable with only one variable change with our representation.
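The difference can be made concrete by counting how many variables must change to turn a long note into another pitch under each representation. The helper below is purely illustrative:

```python
HOLD = "__"

def changes_needed(old, new):
    """Number of positions at which two equally long sequences differ."""
    return sum(a != b for a, b in zip(old, new))

# A half-note C4 (8 sixteenths) changed into a half-note D4.
piano_roll_cost = changes_needed([60] * 8, [62] * 8)
hold_cost = changes_needed([60] + [HOLD] * 7, [62] + [HOLD] * 7)
```

A sampler that updates one variable at a time can make the second change in a single step, whereas the first requires eight coordinated updates.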
We implemented DeepBach using Keras (Chollet, 2015) with the Tensorflow (Abadi et al., 2015) backend. We used the database of chorale harmonizations by J.S. Bach included in the music21 toolkit (Cuthbert & Ariza, 2010). After removing chorales with instrumental parts and chorales containing parts with two simultaneous notes (bass parts sometimes divide for the last chord), we ended up with 352 pieces. Contrary to other approaches which transpose all chorales to the same key (usually C major or A minor), we choose to augment our dataset by adding all chorale transpositions which fit within the vocal ranges defined by the initial corpus. This gives us a corpus of 2503 chorales, which we split between a training set (80%) and a validation set (20%). The vocal ranges contain fewer than 30 different pitches for each voice (21, 21, 21 and 28 for the soprano, alto, tenor and bass parts respectively).
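The transposition-based augmentation can be sketched as follows. This is an illustration under assumptions: the helper names, the limit of eleven semitones in each direction and the toy ranges are made up for the example and do not come from the DeepBach implementation.

```python
HOLD = "__"

def valid_transpositions(voices, ranges, max_shift=11):
    """Semitone shifts for which every voice stays inside its (low, high) range."""
    shifts = []
    for shift in range(-max_shift, max_shift + 1):
        fits = all(
            low <= pitch + shift <= high
            for voice, (low, high) in zip(voices, ranges)
            for pitch in voice
            if pitch != HOLD
        )
        if fits:
            shifts.append(shift)
    return shifts

def transpose(voices, shift):
    """Shift every pitch by `shift` semitones, leaving hold symbols untouched."""
    return [[s if s == HOLD else s + shift for s in voice] for voice in voices]

# Toy two-voice excerpt with made-up ranges (not the ranges measured on the corpus).
voices = [[67, HOLD, 69, HOLD], [48, HOLD, 50, HOLD]]
ranges = [(60, 72), (45, 55)]
shifts = valid_transpositions(voices, ranges)
```

Here the lower voice limits downward shifts and the upper voice limits upward shifts, so only shifts between -3 and +3 semitones are kept.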
As shown in Fig. 8, we model only local interactions between a note $\mathcal{V}_i^t$ and its context ($\mathcal{V}_{\setminus i,t}$, $\mathcal{M}$), i.e. only elements with time index between $t - \Delta t$ and $t + \Delta t$ are taken as inputs of our model for some scope $\Delta t$. This approximation appears to be accurate since musical analysis reveals that Bach chorales do not exhibit clear long-term dependencies.
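Extracting such a local context might look like the sketch below. Several choices here are assumptions rather than details given in the text: metadata is left out for brevity, the window is truncated at the edges of the piece instead of being padded, and the future part is read backwards so that the slice closest to the predicted note comes last.

```python
def local_context(voices, i, t, scope):
    """Split the notes around (voice i, time t) into left, central and right parts."""
    length = len(voices[0])
    start, stop = max(0, t - scope), min(length, t + scope + 1)
    left = [[voice[u] for voice in voices] for u in range(start, t)]          # past time slices
    right = [[voice[u] for voice in voices] for u in range(stop - 1, t, -1)]  # future, read backwards
    central = [voice[t] for j, voice in enumerate(voices) if j != i]          # simultaneous notes
    target = voices[i][t]                                                     # note to predict
    return left, central, right, target

voices = [[60, 61, 62, 63, 64], [50, 51, 52, 53, 54]]
left, central, right, target = local_context(voices, i=0, t=2, scope=1)
```

The three context parts correspond to the inputs of the two recurrent networks and the non-recurrent network, and `target` is the label of the classification problem.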
We use as the “neural network brick” a neural network with one hidden layer of size 200 and ReLU (Nair & Hinton, 2010) nonlinearity and as the “Deep RNN brick” two stacked LSTMs (Hochreiter & Schmidhuber, 1997; Mikolov et al., 2014), each one being of size 200 (see Fig. 2 (f) in (Li & Wu, 2015)). The “embedding brick” applies the same neural network to each time slice $(\mathcal{V}_t, \mathcal{M}_t)$. We apply 20% dropout on the input and 50% dropout after each layer.
We experimentally found that sharing weights between the left and right embedding layers improved neither validation accuracy nor the musical quality of our generated chorales.
We evaluated the quality of our model with an online test conducted on human listeners. We compared DeepBach with two other models: a Maximum Entropy (MaxEnt) model and a Multilayer Perceptron (MLP) model.
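The Maximum Entropy model is a neural network with no hidden layer. It is given by:
$$p_i\left(\mathcal{V}_i^t \mid \mathcal{V}_{\setminus i,t}, \mathcal{M}, A_i, b_i\right) = \mathrm{Softmax}(A_i X + b_i)$$
where $X$ is a vector containing the elements in $\mathcal{V}_{\setminus i,t} \cup \mathcal{M}_t$, $A_i$ an $(n_i, m_i)$ matrix and $b_i$ a vector of size $n_i$, with $m_i$ being the size of $X$, $n_i$ the number of notes in the range of voice $i$, and Softmax the softmax function given by
$$\mathrm{Softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j \in \{1, \dots, K\},$$
for a vector $z = (z_1, \dots, z_K)$.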
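A model with no hidden layer followed by a softmax can be written directly in plain Python. The sketch below is illustrative only: the function names and the toy weights are assumptions, and the weight matrix is stored as a list of rows, one per candidate note.

```python
import math

def softmax(z):
    """Softmax of a list of real numbers, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def maxent_predict(A, b, x):
    """Softmax(Ax + b) with A given as a list of rows, one row per candidate note."""
    logits = [sum(a * v for a, v in zip(row, x)) + bias for row, bias in zip(A, b)]
    return softmax(logits)

# Two candidate notes, three input features; the weights are arbitrary toy values.
probs = maxent_predict(A=[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], b=[0.0, 0.0], x=[0.0, 1.0, 1.0])
```

Subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow for large inputs.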
The Multilayer Perceptron model we chose takes as input the elements in $\mathcal{V}_{\setminus i,t} \cup \mathcal{M}$. It is a neural network with one hidden layer of size 500 and uses a ReLU (Nair & Hinton, 2010) nonlinearity.
All models are local and have the same scope $\Delta t$, see Sect. 2.4.
Subjects were asked to give information about their musical expertise. They could choose which category fits them best among:
1. I seldom listen to classical music
2. Music lover or musician
3. Student in music composition or professional musician.
The musical extracts have been obtained by reharmonizing 50 chorales from the validation set by each of the three models (MaxEnt, MLP, DeepBach). We rendered the MIDI files using the Leeds Town Hall Organ soundfont (https://www.samplephonics.com/products/free/sampler-instruments/the-leeds-town-hall-organ) and cut two extracts of 12 seconds from each chorale, which gives us 400 musical extracts for our test: 4 versions for each of the 100 melody chunks. We chose our rendering so that the generated parts (alto, tenor and bass) can be distinctly heard and differentiated from the soprano part (which is fixed and identical for all models): in our mix, dissonances are easily heard, the velocity is the same for all notes as in a real organ performance and the sound does not decay, which is important when evaluating the reharmonization of long notes.
Subjects were presented with a series of single musical extracts, each together with the binary choice “Bach” or “Computer”. Fig. 9 shows how the votes are distributed depending on the level of musical expertise of the subjects for each model. For this experiment, 1272 people took this test, 261 with musical expertise 1, 646 with musical expertise 2 and 365 with musical expertise 3.
The results are quite clear: the percentage of “Bach” votes increases as the model’s complexity increases. Furthermore, the distinction between computer-generated extracts and Bach’s extracts is more accurate when the level of musical expertise is higher. When presented with a DeepBach-generated extract, around 50% of the voters would judge it as composed by Bach. We consider this to be a good score given the complexity of Bach’s compositions and the ease with which even non-musicians detect badly-sounding chords.
We also plotted specific results for each of the 400 extracts. Fig. 10 shows for each reharmonization extract the percentage of Bach votes it collected: more than half of DeepBach’s automatically-composed extracts have a majority of votes considering them as being composed by J.S. Bach, while this is the case for only a third of the MLP model’s extracts.
We developed a plugin on top of the MuseScore music editor allowing a user to call DeepBach on any rectangular region. Even if the interface is minimal (see Fig.11), the possibilities are numerous: we can generate a chorale from scratch, reharmonize a melody and regenerate a given chord, bar or part. We believe that this interplay between a user and the system can boost creativity and can interest a wide range of audience.
We made two major changes between the model we described for the online test and the interactive composition tool.
We changed the MIDI encoding of the notes to a full name encoding of the notes. Indeed, some information is lost when reducing a music sheet to its MIDI representation since we cannot differentiate between two enharmonic notes (notes that sound the same but are written differently, e.g. F# and Gb). This difference in Bach chorales is unambiguous and it is thus natural to consider the full name of the notes, like C#3, Db3 or E#4. From a machine learning point of view, these notes would appear in totally different contexts. This improvement enables the model to generate notes with the correct spelling, which is important when we focus on the music sheet rather than on its audio rendering.
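The loss of information can be checked with a small converter from note names to MIDI numbers. The function below is a hypothetical helper written for this illustration and uses the common convention C4 = 60; it is not the encoding code used by the plugin.

```python
import re

PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def name_to_midi(name):
    """MIDI number of a note name such as 'F#4' or 'Gb4' (convention: C4 = 60)."""
    match = re.fullmatch(r"([A-G])([#b]*)(-?\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized note name: {name!r}")
    letter, accidentals, octave = match.groups()
    offset = accidentals.count("#") - accidentals.count("b")
    return 12 * (int(octave) + 1) + PITCH_CLASS[letter] + offset
```

Distinct spellings such as F#4 and Gb4, or C#3 and Db3, collapse to the same MIDI number, so a model trained on MIDI numbers alone cannot learn which spelling fits a given harmonic context.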
We added the current key signature list $\mathcal{K}$ to the metadata $\mathcal{M}$. This allows users to impose modulations and key changes. Each element $\mathcal{K}^t$ of this list contains the number of sharps of the estimated key for the current bar. It is an integer between -7 and 7. The current key is computed using the key analyzer algorithm from music21.
We now provide and comment on examples of chorales generated using the DeepBach plugin. Our aim is to show the quality of the solutions produced by DeepBach. For these examples, no note was set by hand and we asked DeepBach to generate regions longer than one bar and covering all four voices.
Despite some compositional errors like parallel octaves, the musical analysis reveals that the DeepBach compositions reproduce typical Bach-like patterns, from characteristic cadences to the expressive use of nonchord tones. As discussed in Sect. 4.2, DeepBach also learned the correct spelling of the notes. Among examples in Fig. 15, examples (a) and (b) share the same metadata. This demonstrates that even with fixed metadata it is possible to generate contrasting chorales.
Since we aimed at producing music that could not be distinguished from actual Bach compositions, we had all provided extracts sung by the Wishful Singing choir. These audio files can be heard on the accompanying website.
We described DeepBach, a probabilistic model together with a sampling method which is flexible, efficient and provides musically convincing results even to the ears of professionals. The strength of our method is the possibility to let users impose unary constraints, which is a feature often neglected in probabilistic models of music. Through our graphical interface, the composition of polyphonic music becomes accessible to non-specialists. The playful interaction between the user and this system can boost creativity and help explore new ideas quickly. We believe that this approach could form a starting point for a novel compositional process that could be described as a constructive dialogue between a human operator and the computer. This method is general and its implementation simple. It is not only applicable to Bach chorales but embraces a wider range of polyphonic music.
Future work aims at refining our interface, speeding up generation and handling datasets with small corpora.
Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1(Oct):49–75, 2000.
Proceedings of the 24th International Conference on Artificial Intelligence, pp. 4138–4139. AAAI Press, 2015.