Counterpoint is the process of placing notes against notes to construct a polyphonic musical piece. This is a challenging task, as each note has strong musical influences on its neighbors and notes beyond. Human composers have developed systems of rules to guide their compositional decisions. However, these rules sometimes contradict each other, and can fail to prevent their users from going down musical dead ends. Statistical models of music, which are our current focus, are one of the many computational approaches that can help composers try out ideas more quickly, thus reducing the cost of exploration.
Whereas previous work in statistical music modeling has relied mainly on sequence models such as Hidden Markov Models (HMMs [2]) and Recurrent Neural Networks (RNNs [31]), we instead employ convolutional neural networks due to their invariance properties and emphasis on capturing local structure. Nevertheless, they have also been shown to successfully model large-scale structure [38, 37]. Moreover, convolutional neural networks have been shown to be extremely versatile once trained, as demonstrated by a variety of creative uses such as DeepDream [29] and style transfer [10].
We introduce Coconet, a deep convolutional model trained to reconstruct partial scores. Once trained, Coconet provides direct access to all conditionals of the form $p(x_i \mid x_C)$, where $C$ selects a fragment of a musical score $x$ and $i \notin C$ is in its complement. Coconet is an instance of deep orderless Nade [36], which learns an ensemble of factorizations of the joint $p(x)$, each corresponding to a different ordering. A related approach is the multi-prediction training of deep Boltzmann machines (MP-DBM) [12], which also gives a model that can predict any subset of variables given its complement.
However, the sampling procedure for orderless Nade treats the ensemble as a mixture and relies heavily on ordering. Sampling from an orderless Nade involves (randomly) choosing an ordering, and sampling variables one by one according to the chosen ordering. This process is called ancestral sampling, as the order of sampling follows the directed structure of the model. We have found that this produces poor results for the highly structured and complex domain of musical counterpoint.
Instead, we propose to use blocked-Gibbs sampling, a Markov Chain Monte Carlo method to sample from a joint probability distribution by repeatedly resampling subsets of variables using conditional distributions derived from the joint probability distribution. An instance of this was previously explored by Yao et al. [40], who employed a Nade in the transition operator for a Markov chain, yielding a Generative Stochastic Network (GSN). The transition consists of a corruption process that masks out a subset $x_{\neg C}$ of variables, followed by a process that independently resamples variables $x_i$ (with $i \notin C$) according to the distribution $p_\theta(x_i \mid x_C)$ emitted by the model with parameters $\theta$. Crucially, the effects of independent sampling are amortized by annealing the probability with which variables are masked out. Whereas Yao et al. [40] treat their procedure as a cheap approximation to ancestral sampling, we find that it produces superior samples. Intuitively, the resampling process allows the model to iteratively rewrite the score, giving it the opportunity to correct its own mistakes.
Coconet addresses the general task of completing partial scores; special cases of this task include “bridging” two musical fragments, and temporal upsampling and extrapolation. Figure 1 shows an example of Coconet populating a partial piano roll using blocked-Gibbs sampling. Code and samples are publicly available (code: https://github.com/czhuang/coconet, data: https://github.com/czhuang/JSB-Chorales-dataset, samples: https://coconets.github.io/). Our samples on a variety of generative tasks such as rewriting, melodic harmonization and unconditioned polyphonic music generation show the versatility of our model. In this work we focus on Bach chorales, and assume four voices are active at all times. However, our model can be easily adapted to the more general, arbitrarily polyphonic representation as used in [4].
Section 2 discusses related work in modeling music composition, with a focus on counterpoint. The details of our model and training procedure are laid out in Section 3. We discuss evaluation under the model in Section 4, and sampling from the model in Section 5. Results of quantitative and qualitative evaluations are reported in Section 6.
2 Related Work
Computers have been used since their early days for experiments in music composition. A notable composition is Hiller and Isaacson’s string quartet Illiac Suite [18], which experiments with statistical sequence models such as Markov chains. One challenge in adapting such models is that music consists of multiple interdependent streams of events. Compare this to typical sequence domains such as speech and language, which involve modeling a single stream of events: a single speaker or a single stream of words. In music, extensive theories in counterpoint have been developed to address the challenge of composing multiple streams of notes that coordinate. One notable theory is due to Fux [9] from the Baroque period, which introduces species counterpoint as a pedagogical scheme to gradually introduce students to the complexity of counterpoint. In first species counterpoint only one note is composed against every note in a given fixed melody (cantus firmus), with all notes bearing equal durations and the resulting vertical intervals consisting of only consonances.
Computer music researchers have taken inspiration from this pedagogical scheme by first teaching computers to write species counterpoint as opposed to full-fledged counterpoint. Farbood [7] uses Markov chains to capture transition probabilities of different melodic and harmonic transition rules. Herremans [16, 17] takes an optimization approach by writing down an objective function that consists of existing rules of counterpoint and using a variable neighbourhood search (VNS) algorithm to optimize it.
J.S. Bach chorales have been the main corpus in computer music that serves as a starting point to tackle full-fledged counterpoint. A wide range of approaches have been used to generate music in the style of Bach chorales, for example rule-based and instance-based approaches such as Cope’s recombinancy method [6]. This method involves first segmenting existing Bach chorales into smaller chunks based on music theory, analyzing their function and stylistic signatures and then re-concatenating the chunks into new coherent works. Other approaches range from constraint-based [30] to statistical methods [5]. In addition, [8] gives a comprehensive survey of AI methods used not just for generating Bach chorales, but also for algorithmic composition in general.
Sequence models such as HMMs and RNNs are natural choices for modeling music. Successful application of such models to polyphonic music often requires serializing or otherwise re-representing the music to fit the sequence paradigm. For instance, Liang in BachBot [27] serializes four-part Bach chorales by interleaving the parts, while Allan and Williams [1] construct a chord vocabulary. Boulanger-Lewandowski et al. [4] adopt a piano roll representation, a binary matrix $x \in \{0,1\}^{T \times P}$ where $x_{t,p} = 1$ iff some instrument is playing pitch $p$ at time $t$. To model the joint probability distribution of the multi-hot pitch vector $x_t$, they employ a Restricted Boltzmann Machine (RBM [32, 19]) or Neural Autoregressive Distribution Estimator (NADE [25]) at each time step. Similarly, Goel et al. [11] employ a Deep Belief Network on top of an RNN.
Hadjeres et al. [14] instead employ an undirected Markov model to learn pairwise relationships between neighboring notes up to a specified number of steps away in a score. Sampling involves Markov Chain Monte Carlo (MCMC) using the model as a Metropolis-Hastings (MH) objective. The model permits constraints on the state space to support tasks such as melody harmonization. However, the Markov assumption can limit the expressivity of the model.
Hadjeres and Pachet in DeepBach [13] model each note prediction by breaking down its full context into three parts, with the past and the future modeled by stacked LSTMs going in the forward and backward directions respectively, and the present harmonic context modeled by a third neural network. The three are then combined by a fourth neural network and used in Gibbs sampling for generation.
Lattner et al. [26] impose higher-level structure by interleaving selective Gibbs sampling on a convolutional RBM with gradient descent that minimizes the cost relative to a template piece on features such as self-similarity. This procedure itself is wrapped in simulated annealing to ensure steps do not lower the solution quality too much.
We opt for an orderless Nade training procedure which enables us to train a mixture of all possible directed models simultaneously. Finally, an approximate blocked Gibbs sampling procedure [40] allows fast generation from the model.
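3 Model
We employ machine learning techniques to obtain a generative model of musical counterpoint in the form of piano rolls. Given a dataset of observed musical pieces $x^{(1)}, \dots, x^{(n)}$ posited to come from some true distribution $p(x)$, we introduce a model $p_\theta(x)$ with parameters $\theta$. In order to make $p_\theta(x)$ close to $p(x)$, we maximize the data log-likelihood $\sum_i \log p_\theta(x^{(i)})$ (an approximation of $\mathbb{E}_{x \sim p(x)} \log p_\theta(x)$). In the Nade [25] framework, the joint over $D$ variables $x_1, \dots, x_D$ is factorized autoregressively, one variable at a time, according to some ordering $o = o_1, \dots, o_D$, where $x_{o_{<d}}$ denotes the variables $x_{o_1}, \dots, x_{o_{d-1}}$ that precede $x_{o_d}$:
$$p_\theta(x) = \prod_{d=1}^{D} p_\theta(x_{o_d} \mid x_{o_{<d}})$$
For example, it can be factorized in chronological order:
$$p_\theta(x) = p_\theta(x_1)\, p_\theta(x_2 \mid x_1) \cdots p_\theta(x_D \mid x_{D-1}, \dots, x_1)$$
In general, Nade permits any one fixed ordering, and although all orderings are equivalent from a theoretical perspective, they differ in practice due to effects of optimization and approximation.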
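The claim that all orderings are equivalent in theory can be checked on a toy case. The following is a minimal sketch, assuming a hypothetical joint table over two binary variables (the numbers are illustrative and not taken from the paper); it verifies that both chain-rule factorizations recover the same joint when the conditionals are exact.

```python
from itertools import product

# Hypothetical joint distribution over two binary variables (x1, x2).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal(index, value):
    """p(x_index = value), obtained by summing the joint over the other variable."""
    return sum(p for x, p in joint.items() if x[index] == value)

def conditional(index, value, given_index, given_value):
    """p(x_index = value | x_given_index = given_value)."""
    match = sum(p for x, p in joint.items()
                if x[index] == value and x[given_index] == given_value)
    return match / marginal(given_index, given_value)

def factorized(x, ordering):
    """Chain-rule product for a two-variable ordering, (0, 1) or (1, 0)."""
    first, second = ordering
    return marginal(first, x[first]) * conditional(second, x[second], first, x[first])

# Both orderings reproduce the joint up to floating-point error.
for x in product((0, 1), repeat=2):
    assert abs(factorized(x, (0, 1)) - joint[x]) < 1e-12
    assert abs(factorized(x, (1, 0)) - joint[x]) < 1e-12
```

With a learned model the conditionals are only approximate, so the two products generally disagree, which is the practical difference between orderings noted above.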
Instead, we can train Nade for all orderings $o$ simultaneously using the orderless Nade [36] training procedure. This procedure relies on the observation that, thanks to parameter sharing, computing $p_\theta(x_{o_{d'}} \mid x_{o_{<d}})$ for all $d' \geq d$ is no more expensive than computing it only for $d' = d$. (Here $x_{o_{<d}}$ is used as shorthand for the variables $x_{o_1}, \dots, x_{o_{d-1}}$ that occur earlier in the ordering.) Hence for a given $o$ and $d$ we can simultaneously obtain partial losses for all orderings that agree with $o$ up to $d$:
$$\mathcal{L}(x; o_{<d}, \theta) = -\sum_{o_d} \log p_\theta(x_{o_d} \mid x_{o_{<d}}, o_{<d}, o_d) \tag{3}$$
An orderless Nade model offers direct access to all distributions of the form $p_\theta(x_i \mid x_C)$ conditioned on any set of contextual variables $x_C = x_{o_{<d}}$ that might already be known. This gives us a very flexible generative model; in particular, we can use these conditional distributions to complete arbitrarily partial musical scores.
To train the model, we sample a training example $x$ and a context $C$, and update $\theta$ based on the gradient of the loss given by Equation 3. This loss consists of $D - d + 1$ terms, each of which corresponds to one ordering. To ensure all orderings are trained evenly we must reweight the gradients by $1/(D - d + 1)$. This correction, due to Uria et al. [36], ensures consistent estimation of the joint negative log-likelihood $-\log p_\theta(x)$.
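The training step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `orderless_nade_loss` and `uniform_model` are hypothetical names, the model is a stand-in that returns a uniform distribution at every position, and the context size is assumed to be drawn uniformly. It only shows how one forward pass yields a loss term for every variable outside the context, and how the sum is reweighted by the number of such terms.

```python
import math
import random

def orderless_nade_loss(x, model, rng):
    """One stochastic estimate of the reweighted orderless Nade loss.

    x is a list of D categorical values. model(context) returns, for every
    position, a list of probabilities over categories, where context maps
    the already-known positions to their values.
    """
    D = len(x)
    ordering = list(range(D))
    rng.shuffle(ordering)                 # random ordering o
    d = rng.randint(1, D)                 # assumed: d uniform on 1..D
    context_positions = ordering[:d - 1]  # the d - 1 earlier variables
    targets = ordering[d - 1:]            # the D - d + 1 candidates for position d
    context = {i: x[i] for i in context_positions}
    probs = model(context)                # a single model evaluation
    nll = -sum(math.log(probs[i][x[i]]) for i in targets)
    return nll / (D - d + 1)              # reweight so orderings are trained evenly

def uniform_model(context, D=4, K=3):
    """Stand-in model: ignores the context, predicts uniform over K categories."""
    return [[1.0 / K] * K for _ in range(D)]

loss = orderless_nade_loss([0, 2, 1, 1], uniform_model, random.Random(0))
```

For the uniform stand-in the reweighted loss equals log K regardless of the sampled ordering and context size, which makes the effect of the normalization easy to verify.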
In this work, the model $p_\theta(x)$ is implemented by a deep convolutional neural network [23]. This choice is motivated by the locality of contrapuntal rules and their near-invariance to translation, both in time and in pitch space.
We represent the music as a stack of piano rolls encoded in a binary three-dimensional tensor $x \in \{0,1\}^{I \times T \times P}$. Here $I$ denotes the number of instruments, $T$ the number of time steps, $P$ the number of pitches, and $x_{i,t,p} = 1$ iff the $i$th instrument plays pitch $p$ at time $t$. We will assume each instrument plays exactly one pitch at a time, that is, $\sum_p x_{i,t,p} = 1$ for all $i, t$.
Our focus is on four-part Bach chorales as used in prior work [1, 4, 11, 27, 14]. Hence we assume $I = 4$ throughout. We constrain ourselves to only the range that appears in our training data (MIDI pitches 36 through 88). Time is discretized at the level of 16th notes for similar reasons. To curb memory requirements, we enforce a fixed number of time steps $T$ by randomly cropping the training examples.
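The representation described above can be sketched with nested lists. The helper name `encode_piano_rolls` and the two-step fragment are assumptions made for illustration; only the pitch range and the one-pitch-per-voice constraint come from the text.

```python
MIN_PITCH, MAX_PITCH = 36, 88          # MIDI range stated in the text
P = MAX_PITCH - MIN_PITCH + 1          # number of pitch bins

def encode_piano_rolls(pitches):
    """Turn pitches[i][t] (one MIDI pitch per instrument and time step)
    into a nested-list tensor x[i][t][p] with exactly one 1 per (i, t)."""
    x = []
    for voice in pitches:
        roll = []
        for pitch in voice:
            if not MIN_PITCH <= pitch <= MAX_PITCH:
                raise ValueError(f"pitch {pitch} outside modeled range")
            one_hot = [0] * P
            one_hot[pitch - MIN_PITCH] = 1
            roll.append(one_hot)
        x.append(roll)
    return x

# Hypothetical two-step, four-voice fragment (soprano to bass).
fragment = [[72, 74], [67, 67], [64, 62], [48, 55]]
x = encode_piano_rolls(fragment)
```

A held note simply repeats the same pitch over consecutive time steps, as the second voice does here.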
Given a training example $x \sim p(x)$, we present the model with values of only a strict subset of its elements $x_C = \{x_{(i,t)} \mid (i,t) \in C\}$ and ask it to reconstruct its complement $x_{\neg C}$. The input $h^0$ is obtained by masking the piano rolls $x$ to obtain the context $x_C$ and concatenating this with the corresponding mask:
$$h^0_{i,t,p} = \mathbb{1}_{(i,t) \in C}\, x_{i,t,p}$$
$$h^0_{I+i,t,p} = \mathbb{1}_{(i,t) \in C}$$
where the time and pitch dimensions are treated as spatial dimensions to convolve over. Each instrument’s piano roll $h^0_i$ and mask $h^0_{I+i}$ are each treated as a separate channel and convolved independently.
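The input construction can be sketched as below, assuming the piano rolls are stored as nested lists indexed by instrument, time and pitch. `build_input` is a hypothetical helper name, and the tiny example tensor is illustrative only.

```python
def build_input(x, context):
    """Return a list of 2*I channels: I masked piano rolls followed by I masks.

    x[i][t][p] is the binary piano-roll tensor; context is a set of (i, t)
    pairs whose pitches are revealed to the model.
    """
    I, T, P = len(x), len(x[0]), len(x[0][0])
    masked = [[[x[i][t][p] if (i, t) in context else 0 for p in range(P)]
               for t in range(T)] for i in range(I)]
    masks = [[[1 if (i, t) in context else 0 for _ in range(P)]
              for t in range(T)] for i in range(I)]
    return masked + masks

# Two instruments, two time steps, three pitches; reveal (0, 0) and (1, 1).
x = [[[1, 0, 0], [0, 1, 0]],
     [[0, 0, 1], [1, 0, 0]]]
h0 = build_input(x, {(0, 0), (1, 1)})
```

Supplying the mask as extra channels lets the network distinguish a masked-out position from a position where the pitch bin is genuinely zero.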
With the exception of the first and final layers, all convolutions preserve the size of the hidden representation. That is, we use “same” padding throughout and all activations have the same number of channels $H$, such that $h^l \in \mathbb{R}^{H \times T \times P}$ for all hidden layers $l$. Throughout our experiments we used $L$ layers and $H$ channels. After each convolution we apply batch normalization [21] (denoted by $\mathrm{BN}(\cdot)$) with statistics tied across time and pitch. Batch normalization rescales activations at each layer to have mean 0 and variance 1, which greatly improves optimization. After every second convolution, we introduce a skip connection from the hidden state two levels below to reap the benefits of residual learning [15].
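The final activations $h^L \in \mathbb{R}^{I \times T \times P}$ are passed through the softmax function to obtain predictions for the pitch at each instrument/time pair:
$$p_\theta(x_{i,t,p} \mid x_C, C) = \frac{\exp(h^L_{i,t,p})}{\sum_{p'} \exp(h^L_{i,t,p'})} \tag{8}$$
The loss function from Equation 3 is then given by
$$\mathcal{L}(x; C, \theta) = -\sum_{(i,t) \notin C} \log p_\theta(x_{i,t} \mid x_C, C) = -\sum_{(i,t) \notin C} \sum_p x_{i,t,p} \log p_\theta(x_{i,t,p} \mid x_C, C)$$
where $p_\theta$ denotes the probability under the model with parameters $\theta$. We train the model by minimizing
$$\mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{C \sim p(C)}\, \frac{1}{|\neg C|}\, \mathcal{L}(x; C, \theta)$$
with respect to $\theta$ using stochastic gradient descent with step size determined by Adam [22]. The expectations are estimated by sampling piano rolls $x$ from the training set and drawing a single context $C$ per sample.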
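The prediction and loss computation can be sketched for a single example as follows. This assumes the final activations are available as a nested list `logits[i][t][p]`; the function names are hypothetical, and the all-zero logits stand in for a real network output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pitch axis."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_nll(x, logits, context):
    """Cross-entropy summed over the (i, t) pairs outside the context,
    divided by the number of such pairs."""
    I, T = len(x), len(x[0])
    targets = [(i, t) for i in range(I) for t in range(T) if (i, t) not in context]
    total = 0.0
    for i, t in targets:
        probs = softmax(logits[i][t])
        true_pitch = x[i][t].index(1)   # position of the one-hot target
        total -= math.log(probs[true_pitch])
    return total / len(targets)

# Two instruments, two time steps, three pitches; all-zero logits are uniform.
x = [[[1, 0, 0], [0, 1, 0]],
     [[0, 0, 1], [1, 0, 0]]]
logits = [[[0.0] * 3 for _ in range(2)] for _ in range(2)]
loss = masked_nll(x, logits, context={(0, 0)})
```

Because the context is a strict subset of the score, there is always at least one target pair, so the division is well defined.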
4 Evaluation
The log-likelihood of a given example is computed according to Algorithm 1 by repeated application of Equation 8. Evaluation occurs one frame at a time, within which the model conditions on its own predictions and does not see the ground truth. Unlike notewise teacher-forcing, where the ground truth is injected after each prediction, the framewise evaluation is thus sensitive to accumulation of error. This gives a more representative measure of quality of the generative model. For each example, we repeat the evaluation process a number of times to average over multiple orderings, and finally average across frames and examples. For chronological evaluation, we draw only orderings that visit the time steps $t$ in increasing order.
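Table 1: Framewise negative log-likelihood on the test data at quarter-note, eighth-note and sixteenth-note temporal resolution.
|Model||Quarter||Eighth||Sixteenth|
|RNN-Nade (our implementation)||5.03||3.78||2.05|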
5 Sampling
We can sample from the model using the orderless Nade ancestral sampling procedure, in which we first sample an ordering and then sample variables one by one according to the ordering. However, we find that this yields poor samples, and we propose instead to use Gibbs sampling.
5.1 Orderless Nade Sampling
Sampling according to orderless Nade involves first randomly choosing an ordering and then sampling variables one by one according to the chosen ordering. We use an equivalent procedure in which we arrive at a random ordering by randomly choosing, at each step, the next variable to sample. We start with an empty (zero everywhere) piano roll $x$ and empty context $C$ and populate them iteratively by the following process. We feed the piano roll $x$ and context $C$ into the model to obtain a set of categorical distributions $p_\theta(x_{i,t} \mid x_C, C)$ for $(i,t) \notin C$. As the $x_{i,t}$ are not conditionally independent, we cannot simply sample from these distributions independently. However, if we sample from one of them, we can compute new conditional distributions for the others. Hence we randomly choose one $(i,t) \notin C$ to sample from, and let $x_{i,t}$ equal the one-hot realization. We then augment the context with $(i,t)$ and repeat until the piano roll is populated. This procedure is easily generalized to tasks such as melody harmonization and partial score completion by starting with a nonempty piano roll.
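The procedure above can be sketched as follows. This is a minimal illustration, not the released implementation: `ancestral_sample` and `make_uniform_model` are hypothetical names, and the stand-in model returns uniform pitch distributions while counting how often it is called. Shuffling the unknown positions up front is equivalent to picking the next position at random at each step.

```python
import random

def ancestral_sample(model, I, T, P, rng, x=None, context=None):
    """Fill in every (i, t) outside the context, one variable per model call."""
    if x is None:
        x = [[[0] * P for _ in range(T)] for _ in range(I)]   # empty piano roll
    else:
        x = [[list(frame) for frame in roll] for roll in x]   # do not mutate input
    context = set(context or ())
    remaining = [(i, t) for i in range(I) for t in range(T) if (i, t) not in context]
    rng.shuffle(remaining)                        # random ordering of the unknowns
    for i, t in remaining:
        probs = model(x, context)[i][t]           # categorical over P pitches
        pitch = rng.choices(range(P), weights=probs)[0]
        x[i][t] = [1 if p == pitch else 0 for p in range(P)]
        context.add((i, t))                       # later draws condition on this one
    return x

def make_uniform_model(I, T, P):
    """Stand-in for a trained network: uniform predictions plus a call counter."""
    calls = [0]
    def model(x, context):
        calls[0] += 1
        return [[[1.0 / P] * P for _ in range(T)] for _ in range(I)]
    return model, calls

model, calls = make_uniform_model(4, 8, 5)
sample = ancestral_sample(model, 4, 8, 5, random.Random(0))
```

Passing a partially filled `x` together with its `context` gives the conditional variants such as melody harmonization; the number of model calls equals the number of positions left to fill.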
Unfortunately, samples thus generated are of low quality, which we surmise is due to accumulation of errors. This is a well-known weakness of autoregressive models [39, 3, 20, 24]. While the model provides conditionals $p_\theta(x_{i,t} \mid x_C, C)$ for all $(i,t) \notin C$, some of these conditionals may be better modeled than others. We suspect in particular those conditionals used early on in the procedure, for which the context $C$ consists of very few variables. Moreover, although the model is trained to be order-agnostic, different orderings invoke different distributions, which is another indication that some conditionals are poorly learned. We test this hypothesis in Section 6.2.
5.2 Gibbs Sampling
To remedy this, we allow the model to revisit its choices: we repeatedly mask out some part of the piano roll and then repopulate it. This is a form of blocked Gibbs sampling [28]. Blocked sampling is crucial for mixing, as the high temporal resolution of our representation causes strong correlations between consecutive notes. For instance, without blocked sampling, it would take many steps to snap out of a long-held note. Similar considerations hold for the Ising model from statistical mechanics, leading to the Swendsen-Wang algorithm [33] in which large clusters of variables are resampled at once.
We consider two strategies for resampling a given block of variables: ancestral sampling and independent sampling. Ancestral sampling invokes the orderless Nade sampling procedure described in Section 5.1 on the masked-out portion $x_{\neg C}$ of the piano roll. Independent sampling simply treats the masked-out variables $x_{\neg C}$ as independent given the context $x_C$.
Using independent blocked Gibbs to sample from a Nade model has been studied by Yao et al. [40], who propose to use an annealed masking probability $\alpha_n = \max\left(\alpha_{\min},\ \alpha_{\max} - \frac{n(\alpha_{\max} - \alpha_{\min})}{\eta N}\right)$ for some minimum and maximum probabilities $\alpha_{\min}, \alpha_{\max}$, total number of Gibbs steps $N$ and fraction $\eta$ of time spent before settling onto the minimum probability $\alpha_{\min}$. Initially, when the masking probability is high, the chain mixes fast but samples are poor due to independent sampling. As $\alpha_n$ decreases, the blocked Gibbs process with independent resampling approaches standard Gibbs where one variable at a time is resampled, thus amortizing the effects of independent sampling.
$N$ is a hyperparameter which as a rule of thumb we set equal to $IT$; it can be set lower than that to save computation at a slight loss of sample quality.
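A schedule matching this description can be sketched as a linear decay from the maximum to the minimum masking probability over the first fraction of the steps, held at the minimum afterwards. The function name, argument names and the numeric settings below are assumptions chosen for illustration, not values from the paper.

```python
def masking_probability(n, N, alpha_min, alpha_max, eta):
    """Masking probability at Gibbs step n out of N.

    Decays linearly from alpha_max and reaches alpha_min after eta * N steps,
    then stays at alpha_min for the remaining steps.
    """
    return max(alpha_min, alpha_max - n * (alpha_max - alpha_min) / (eta * N))

# Hypothetical settings: 100 steps, decay finished halfway through.
schedule = [masking_probability(n, 100, 0.1, 0.9, 0.5) for n in range(100)]
```

Early steps resample most of the score at once, which helps mixing; late steps resample only a few positions at a time, which limits the damage done by treating them as independent.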
Yao et al. [40] treat independent blocked Gibbs as a cheap approximation to ancestral sampling. Whereas plain ancestral sampling (Section 5.1) requires $O(IT)$ model evaluations, ancestral blocked Gibbs requires a prohibitive $O(ITN)$ model evaluations and independent Gibbs requires only $O(N)$, where $N$ can be chosen to be less than $IT$. Moreover, we find that independent blocked Gibbs sampling in fact yields better samples than plain ancestral sampling.
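Independent blocked Gibbs can be sketched as a loop that uses exactly one model evaluation per step. As before, this is an illustration under assumptions: `independent_gibbs` and `make_uniform_model` are hypothetical names, the model is a uniform stand-in with a call counter, and the schedule settings are arbitrary. A real model would be fed the masked input rather than the full piano roll.

```python
import random

def independent_gibbs(model, x, N, schedule, rng):
    """Run N steps of blocked Gibbs with independent resampling of the block."""
    I, T, P = len(x), len(x[0]), len(x[0][0])
    x = [[list(frame) for frame in roll] for roll in x]       # work on a copy
    everything = {(i, t) for i in range(I) for t in range(T)}
    for n in range(N):
        alpha = schedule(n)
        # Mask each (i, t) independently with probability alpha.
        masked = [(i, t) for i in range(I) for t in range(T) if rng.random() < alpha]
        context = everything - set(masked)
        probs = model(x, context)                             # one evaluation per step
        for i, t in masked:                                   # resample independently
            pitch = rng.choices(range(P), weights=probs[i][t])[0]
            x[i][t] = [1 if p == pitch else 0 for p in range(P)]
    return x

def make_uniform_model(I, T, P):
    """Stand-in for a trained network: uniform predictions plus a call counter."""
    calls = [0]
    def model(x, context):
        calls[0] += 1
        return [[[1.0 / P] * P for _ in range(T)] for _ in range(I)]
    return model, calls

I, T, P, N = 4, 8, 5, 20
model, calls = make_uniform_model(I, T, P)
start = [[[1] + [0] * (P - 1) for _ in range(T)] for _ in range(I)]
schedule = lambda n: max(0.1, 0.9 - n * 0.8 / (0.5 * N))
result = independent_gibbs(model, start, N, schedule, random.Random(0))
```

The call counter makes the cost comparison concrete: this loop uses N model evaluations in total, whereas resampling each block ancestrally would need one evaluation per masked variable at every step.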
6 Experiments
We evaluate our approach on a popular corpus of four-part Bach chorales. While the literature features many variants of this dataset [1, 4, 27, 14], we report results on that used by Boulanger-Lewandowski et al. [4]. As the quarter-note temporal resolution used by [4] is frankly too coarse to accurately convey counterpoint, we also evaluate on eighth-note and sixteenth-note quantizations of the same data.
It should be noted that quantitative evaluation of generative models is fundamentally hard [34]. The gold standard for evaluation is qualitative comparison by humans, and we therefore report human evaluation results as well.
6.1 Data Log-likelihood
Table 1 compares the framewise log-likelihood of the test data under variants of our model and those reported in [4]. We find that the temporal resolution has a dramatic influence on the performance, which we suspect is an artifact of the performance metric. The log-likelihood is evaluated by teacher-forcing, that is, the prediction of a frame is conditioned on the ground truth of all previously predicted frames. As temporal resolution increases, chord changes become increasingly rare, and the model is increasingly rewarded for simply holding notes over time.
We evaluate Coconet on both chronological and random orderings, in both cases averaging likelihoods across an ensemble of orderings. The chronological orderings differ only in the ordering of instruments within each frame. We see in Table 1 that fully random orderings lead to significantly better performance. We believe the members of the more diverse random ensemble are more mutually complementary. For example, a forward ordering is uncertain at the beginning of a piece and more certain toward the end, whereas a backward ordering is more certain at the beginning and less certain toward the end.
6.2 Sample Quality
In Section 5 we conjectured that the low quality of Nade samples is due to poorly modeled conditionals $p_\theta(x_{i,t} \mid x_C, C)$ where $C$ is small. We test this hypothesis by evaluating the likelihood under the model of samples generated by the ancestral blocked Gibbs procedure with $C$ chosen according to independent Bernoulli variables. When we set the inclusion probability $\rho$ to 0, we obtain Nade. Increasing $\rho$ increases the expected context size $|C|$, which should yield better samples if our hypothesis is true. The results shown in Table 2 confirm that this is the case. For these experiments, we used a fixed sample length $T$ (in time steps) and a fixed number of Gibbs steps $N$.
|Sampling scheme||Framewise NLL|
|Ancestral Gibbs, inclusion probability $\rho = 0$ (Nade)|
|Independent Gibbs [40]|
Figure 2 shows the convergence behavior of the various Gibbs procedures, averaged over 100 runs. We see that for low values of the inclusion probability $\rho$ (small $C$), the chains hardly make progress beyond Nade in terms of likelihood. Higher values of $\rho$ (large $C$) enable the model to get off the ground and reach significantly better likelihood.
6.3 Human Evaluations
To further compare the sample quality of different sampling procedures, we carried out a listening test on Amazon’s Mechanical Turk (MTurk). The procedures include orderless Nade ancestral sampling and independent Gibbs [40]; with each, we generate four unconditioned samples, eight measures in length, from empty piano rolls. To have an absolute reference for the quality of samples, we include the first eight measures of four random Bach chorale pieces from the validation set. Each fragment lasts thirty-four seconds after synthesis.
For each MTurk HIT, participants are asked to rate on a Likert scale which of two random samples they perceive as more Bach-like. A total of 96 ratings were collected, with each source involved in 64 (= 96 × 2/3) pairwise comparisons. Figure 3 shows the number of times each source was perceived as closer to Bach’s style. We perform a Kruskal-Wallis H test on the ratings, which shows there are statistically significant differences between models. A post-hoc analysis using the Wilcoxon signed-rank test with Bonferroni correction showed that participants perceived samples from independent Gibbs as more Bach-like than those from ancestral sampling (Nade). This confirms the log-likelihood comparisons on sample quality in Section 6.2 that independent Gibbs produces better samples. There was also a significant difference between Bach and ancestral samples but not between Bach and independent Gibbs.
7 Conclusion
We introduced a convolutional approach to modeling musical scores based on the orderless Nade [36] framework. Our experiments show that the Nade ancestral sampling procedure yields poor samples, which we have argued is because some conditionals are not captured well by the model. We have shown that sample quality improves significantly when we use blocked Gibbs sampling to iteratively rewrite parts of the score. Moreover, annealed independent blocked Gibbs sampling as proposed by Yao et al. [40] is not only faster but in fact produces better samples.
We thank Kyle Kastner, Guillaume Alain, Gabriel Huang, Curtis (Fjord) Hawthorne, the Google Brain Magenta team, as well as Jason Freidenfelds for helpful feedback, discussions, suggestions and support. We also thank Calcul Québec and Compute Canada for computational support.
- [1] Moray Allan and Christopher KI Williams. Harmonising chorales by probabilistic inference. Advances in Neural Information Processing Systems, 17:25–32, 2005.
- [2] Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
- [3] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
- [4] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. International Conference on Machine Learning, 2012.
- [5] Darrell Conklin. Music generation from statistical models. In Proceedings of the AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences, pages 30–35. Citeseer, 2003.
- [6] David Cope. Computers and musical style. 1991.
- [7] Mary Farbood and Bernd Schöner. Analysis and synthesis of Palestrina-style counterpoint using Markov chains. In ICMC, 2001.
- [8] Jose D Fernández and Francisco Vico. AI methods in algorithmic composition: A comprehensive survey. Journal of Artificial Intelligence Research, 48:513–582, 2013.
- [9] Johann Joseph Fux. The study of counterpoint from Johann Joseph Fux’s Gradus ad Parnassum. Number 277. WW Norton & Company, 1965.
- [10] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
- [11] Kratarth Goel, Raunaq Vohra, and JK Sahoo. Polyphonic music generation by modeling temporal dependencies using a RNN-DBN. In International Conference on Artificial Neural Networks, pages 217–224. Springer, 2014.
- [12] Ian Goodfellow, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 548–556, 2013.
- [13] Gaëtan Hadjeres and François Pachet. DeepBach: a steerable model for Bach chorales generation. arXiv preprint arXiv:1612.01010, 2016.
- [14] Gaëtan Hadjeres, Jason Sakellariou, and François Pachet. Style imitation and chord invention in polyphonic music with exponential families. arXiv preprint arXiv:1609.05152, 2016.
- [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- [16] Dorien Herremans and Kenneth Sörensen. Composing first species counterpoint with a variable neighbourhood search algorithm. Journal of Mathematics and the Arts, 6(4):169–189, 2012.
- [17] Dorien Herremans and Kenneth Sörensen. Composing fifth species counterpoint music with a variable neighborhood search algorithm. Expert Systems with Applications, 40(16):6427–6437, 2013.
- [18] Lejaren A Hiller Jr and Leonard M Isaacson. Musical composition with a high speed digital computer. In Audio Engineering Society Convention 9. Audio Engineering Society, 1957.
- [19] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
- [20] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.
- [21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- [22] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [24] Alex M Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pages 4601–4609, 2016.
- [25] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AISTATS, volume 1, page 2, 2011.
- [26] Stefan Lattner, Maarten Grachten, and Gerhard Widmer. Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints. arXiv preprint arXiv:1612.04742, 2016.
- [27] Feynman Liang. BachBot: Automatic composition in the style of Bach chorales. Masters thesis, University of Cambridge, 2016.
- [28] Jun S Liu. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association, 89(427):958–966, 1994.
- [29] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper into neural networks, 2015.
- [30] François Pachet and Pierre Roy. Musical harmonization with constraints: A survey. Constraints, 6(1):7–19, 2001.
- [31] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
- [32] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.
- [33] Robert H Swendsen and Jian-Sheng Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58(2):86, 1987.
- [34] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
- [35] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. arXiv preprint arXiv:1605.02226, 2016.
- [36] Benigno Uria, Iain Murray, and Hugo Larochelle. A deep and tractable density estimator. In ICML, pages 467–475, 2014.
- [37] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- [38] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of The 33rd International Conference on Machine Learning, pages 1747–1756, 2016.
- [39] Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multi-step prediction of learned time series models. In AAAI, pages 3024–3030, 2015.
- [40] Li Yao, Sherjil Ozair, Kyunghyun Cho, and Yoshua Bengio. On the equivalence between deep NADE and generative stochastic networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 322–336. Springer, 2014.