Counterpoint by Convolution

03/18/2019 ∙ by Cheng-Zhi Anna Huang, et al. ∙ Université de Montréal

Machine learning models of music typically break up the task of composition into a chronological process, composing a piece of music in a single pass from beginning to end. On the contrary, human composers write music in a nonlinear fashion, scribbling motifs here and there, often revisiting choices previously made. In order to better approximate this process, we train a convolutional neural network to complete partial musical scores, and explore the use of blocked Gibbs sampling as an analogue to rewriting. Neither the model nor the generative procedure is tied to a particular causal direction of composition. Our model is an instance of orderless NADE (Uria et al., 2014), which allows more direct ancestral sampling. However, we find that Gibbs sampling greatly improves sample quality, which we demonstrate to be due to some conditional distributions being poorly modeled. Moreover, we show that even the cheap approximate blocked Gibbs procedure from Yao et al. (2014) yields better samples than ancestral sampling, based on both log-likelihood and human evaluation.

1 Introduction

Figure 1: Blocked Gibbs inpainting of a corrupted Bach chorale by Coconet. At each step, a random subset of notes is removed, and the model is asked to infer their values. New values are sampled from the probability distribution put out by the model, and the process is repeated. Left: annealed masks show resampled variables. Colors distinguish the four voices. Middle: grayscale heatmaps show the model's predictions summed across instruments. Right: complete pianorolls after resampling the masked variables. Bottom: a sample from NADE (left) and the original Bach chorale fragment (right).

Counterpoint is the process of placing notes against notes to construct a polyphonic musical piece [9]. This is a challenging task, as each note has strong musical influences on its neighbors and on notes beyond them. Human composers have developed systems of rules to guide their compositional decisions. However, these rules sometimes contradict each other, and can fail to prevent their users from going down musical dead ends. Statistical models of music, which are our current focus, are among the many computational approaches that can help composers try out ideas more quickly, thus reducing the cost of exploration [8].

Whereas previous work in statistical music modeling has relied mainly on sequence models such as Hidden Markov Models (HMMs [2]) and Recurrent Neural Networks (RNNs [31]), we instead employ convolutional neural networks due to their invariance properties and emphasis on capturing local structure. Nevertheless, they have also been shown to successfully model large-scale structure [38, 37]. Moreover, convolutional neural networks have been shown to be extremely versatile once trained, as demonstrated by a variety of creative uses such as DeepDream [29] and style transfer [10].

We introduce Coconet, a deep convolutional model trained to reconstruct partial scores. Once trained, Coconet provides direct access to all conditionals of the form $p(x_i \mid x_C)$, where $C$ selects a fragment of a musical score $x$ and $i \notin C$ is in its complement. Coconet is an instance of deep orderless Nade [36], which learns an ensemble of factorizations of the joint $p(x)$, each corresponding to a different ordering. A related approach is the multi-prediction training of deep Boltzmann machines (MP-DBM) [12], which also gives a model that can predict any subset of variables given its complement.

However, the sampling procedure for orderless Nade treats the ensemble as a mixture and relies heavily on ordering. Sampling from an orderless Nade involves (randomly) choosing an ordering, and sampling variables one by one according to the chosen ordering. This process is called ancestral sampling, as the order of sampling follows the directed structure of the model. We have found that this produces poor results for the highly structured and complex domain of musical counterpoint.

Instead, we propose to use blocked Gibbs sampling, a Markov Chain Monte Carlo method that samples from a joint probability distribution by repeatedly resampling subsets of variables using conditional distributions derived from that joint distribution. An instance of this was previously explored by [40], who employed a Nade in the transition operator for a Markov Chain, yielding a Generative Stochastic Network (GSN). The transition consists of a corruption process that masks out a subset $x_{\neg C}$ of variables, followed by a process that independently resamples variables $x_i$ (with $i \notin C$) according to the distribution $p_\theta(x_i \mid x_C)$ emitted by the model with parameters $\theta$. Crucially, the effects of independent sampling are amortized by annealing the probability with which variables are masked out. Whereas [40] treat their procedure as a cheap approximation to ancestral sampling, we find that it produces superior samples. Intuitively, the resampling process allows the model to iteratively rewrite the score, giving it the opportunity to correct its own mistakes.

Coconet addresses the general task of completing partial scores; special cases of this task include "bridging" two musical fragments, and temporal upsampling and extrapolation. Figure 1 shows an example of Coconet populating a partial piano roll using blocked Gibbs sampling. Code and samples are publicly available (code: https://github.com/czhuang/coconet, data: https://github.com/czhuang/JSB-Chorales-dataset, samples: https://coconets.github.io/). Our samples on a variety of generative tasks such as rewriting, melodic harmonization and unconditioned polyphonic music generation show the versatility of our model. In this work we focus on Bach chorales, and assume four voices are active at all times. However, our model can be easily adapted to the more general, arbitrarily polyphonic representation as used in [4].

Section 2 discusses related work in modeling music composition, with a focus on counterpoint. The details of our model and training procedure are laid out in Section 3. We discuss evaluation under the model in Section 4, and sampling from the model in Section 5. Results of quantitative and qualitative evaluations are reported in Section 6.

2 Related Work

Computers have been used since their early days for experiments in music composition. A notable composition is Hiller and Isaacson's string quartet Illiac Suite [18], which experiments with statistical sequence models such as Markov chains. One challenge in adapting such models is that music consists of multiple interdependent streams of events. Compare this to typical sequence domains such as speech and language, which involve modeling a single stream of events: a single speaker or a single stream of words. In music, extensive theories of counterpoint have been developed to address the challenge of composing multiple streams of notes that coordinate. One notable theory is due to Fux [9] from the Baroque period, who introduces species counterpoint as a pedagogical scheme to gradually introduce students to the complexity of counterpoint. In first species counterpoint only one note is composed against every note in a given fixed melody (cantus firmus), with all notes bearing equal durations and the resulting vertical intervals consisting of only consonances.

Computer music researchers have taken inspiration from this pedagogical scheme by first teaching computers to write species counterpoint as opposed to full-fledged counterpoint. Farbood [7] uses Markov chains to capture the transition probabilities of different melodic and harmonic transition rules. Herremans [16, 17] takes an optimization approach by writing down an objective function that consists of existing rules of counterpoint and using a variable neighbourhood search (VNS) algorithm to optimize it.

J.S. Bach chorales have been the main corpus in computer music that serves as a starting point to tackle full-fledged counterpoint. A wide range of approaches have been used to generate music in the style of Bach chorales, for example rule-based and instance-based approaches such as Cope's recombinancy method [6]. This method involves first segmenting existing Bach chorales into smaller chunks based on music theory, analyzing their function and stylistic signatures and then re-concatenating the chunks into new coherent works. Other approaches range from constraint-based [30] to statistical methods [5]. In addition, [8] gives a comprehensive survey of AI methods used not just for generating Bach chorales, but also for algorithmic composition in general.

Sequence models such as HMMs and RNNs are natural choices for modeling music. Successful application of such models to polyphonic music often requires serializing or otherwise re-representing the music to fit the sequence paradigm. For instance, Liang in BachBot [27] serializes four-part Bach chorales by interleaving the parts, while Allan and Williams [1] construct a chord vocabulary. Boulanger et al. [4] adopt a piano roll representation, a binary matrix $x \in \{0,1\}^{T \times P}$ where $x_{t,p} = 1$ iff some instrument is playing pitch $p$ at time $t$. To model the joint probability distribution of the multi-hot pitch vector $x_t$, they employ a Restricted Boltzmann Machine (RBM [32, 19]) or Neural Autoregressive Distribution Estimator (NADE [25]) at each time step. Similarly, Goel et al. [11] employ a Deep Belief Network [19] on top of an RNN.

Hadjeres et al. [14] instead employ an undirected Markov model to learn pairwise relationships between neighboring notes up to a specified number of steps away in a score. Sampling involves Markov Chain Monte Carlo (MCMC) using the model as a Metropolis-Hastings (MH) objective. The model permits constraints on the state space to support tasks such as melody harmonization. However, the Markov assumption can limit the expressivity of the model.

Hadjeres and Pachet in DeepBach [13] model note predictions by breaking down its full context into three parts, with the past and the future modeled by stacked LSTMs going in the forward and backward directions respectively, and the present harmonic context modeled by a third neural network. The three are then combined by a fourth neural network and used in Gibbs sampling for generation.

Lattner et al. impose higher-level structure by interleaving selective Gibbs sampling on a convolutional RBM [26] with gradient descent that minimizes the cost to a template piece on features such as self-similarity. This procedure itself is wrapped in simulated annealing to ensure steps do not lower the solution quality too much.

We opt for an orderless Nade training procedure which enables us to train a mixture of all possible directed models simultaneously. Finally, an approximate blocked Gibbs sampling procedure [40] allows fast generation from the model.

3 Model

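We employ machine learning techniques to obtain a generative model of musical counterpoint in the form of piano rolls. Given a dataset of observed musical pieces $x^{(1)}, \ldots, x^{(n)}$ posited to come from some true distribution $p(x)$, we introduce a model $p_\theta(x)$ with parameters $\theta$. In order to make $p_\theta(x)$ close to $p(x)$, we maximize the data log-likelihood $\sum_i \log p_\theta(x^{(i)})$ (an approximation of $\mathbb{E}_{x \sim p(x)} \log p_\theta(x)$) by stochastic gradient descent.

The joint distribution $p(x)$ over $D$ variables $x_1, \ldots, x_D$ is often difficult to model directly and hence we construct our model from simpler factors. In the Nade [25] framework, the joint is factorized autoregressively, one variable at a time, according to some ordering $o = o_1, \ldots, o_D$, such that

$$p_\theta(x) = \prod_{d=1}^{D} p_\theta(x_{o_d} \mid x_{o_{<d}}) \tag{1}$$

For example, it can be factorized in chronological order:

$$p_\theta(x) = p_\theta(x_1)\, p_\theta(x_2 \mid x_1) \cdots p_\theta(x_D \mid x_{D-1}, \ldots, x_1) \tag{2}$$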

In general, Nade permits any one fixed ordering, and although all orderings are equivalent from a theoretical perspective, they differ in practice due to effects of optimization and approximation.
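As a small worked illustration of Equation 1 (a sketch added for clarity, not taken from the original experiments), the following code builds an arbitrary joint distribution over three binary variables and checks that the chain-rule product of exact conditionals recovers the joint under every ordering. The helper names `make_joint`, `marginal` and `chain_rule_prob` are hypothetical.

```python
import itertools
import math
import random

def make_joint(num_vars, seed=0):
    """Arbitrary normalized joint distribution over binary vectors of length num_vars."""
    rng = random.Random(seed)
    states = list(itertools.product([0, 1], repeat=num_vars))
    weights = [rng.random() + 0.1 for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}

def marginal(joint, assignment):
    """Probability that the variables in `assignment` (index -> value) take those values."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))

def chain_rule_prob(joint, x, ordering):
    """Product over d of p(x_{o_d} | x_{o_<d}) for the given ordering o."""
    prob = 1.0
    known = {}
    for idx in ordering:
        denom = marginal(joint, known)        # p(x_{o_<d})
        known[idx] = x[idx]
        prob *= marginal(joint, known) / denom  # p(x_{o_d} | x_{o_<d})
    return prob

joint = make_joint(3)
x = (1, 0, 1)
# One chain-rule evaluation per ordering of the three variables.
results = {o: chain_rule_prob(joint, x, o)
           for o in itertools.permutations(range(3))}
```

With exact conditionals every ordering gives the same value. The orderings only start to differ once each conditional is approximated by a learned model, which is the practical effect mentioned above.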

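Instead, we can train Nade for all orderings simultaneously using the orderless Nade [36] training procedure. This procedure relies on the observation that, thanks to parameter sharing, computing $p_\theta(x_{o_{d'}} \mid x_{o_{<d}})$ for all $d' \geq d$ is no more expensive than computing it only for $d' = d$. (Here $x_{o_{<d}}$ is used as shorthand for the variables $x_{o_1}, \ldots, x_{o_{d-1}}$ that occur earlier in the ordering.) Hence for a given $o$ and $d$ we can simultaneously obtain partial losses for all orderings that agree with $o$ up to $d$:

$$\mathcal{L}(x; o_{<d}, \theta) = -\sum_{d' \geq d} \log p_\theta(x_{o_{d'}} \mid x_{o_{<d}}) \tag{3}$$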

An orderless Nade model offers direct access to all distributions of the form $p_\theta(x_i \mid x_C)$ conditioned on any set of contextual variables $x_C = x_{o_{<d}}$ that might already be known. This gives us a very flexible generative model; in particular, we can use these conditional distributions to complete arbitrarily partial musical scores.

To train the model, we sample a training example $x$ and a context $C$ whose size $|C| = d - 1$ is drawn uniformly at random, and update $\theta$ based on the gradient of the loss given by Equation 3. This loss consists of $D - d + 1$ terms, each of which corresponds to one ordering. To ensure all orderings are trained evenly we must reweight the gradients by $1/(D - d + 1)$. This correction, due to [36], ensures consistent estimation of the joint negative log-likelihood $-\log p_\theta(x)$.
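The reweighted loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' training code: `uniform_model` is a hypothetical stand-in for the network, variables are flattened into a list of categorical values, and the context size is drawn uniformly from 0 to D-1 so that at least one variable is always left to predict.

```python
import math
import random

def uniform_model(x_context, context, num_vars, num_classes):
    """Hypothetical stand-in for p_theta: uniform predictions for every variable."""
    return [[1.0 / num_classes] * num_classes for _ in range(num_vars)]

def orderless_loss(x, num_classes, model, rng):
    """One stochastic estimate of the reweighted orderless loss for example x."""
    num_vars = len(x)
    # Draw a context size, then a random context of that size (a strict subset).
    context_size = rng.randrange(0, num_vars)
    context = set(rng.sample(range(num_vars), context_size))
    # Variables outside the context are hidden from the model.
    x_context = [x[i] if i in context else None for i in range(num_vars)]
    probs = model(x_context, context, num_vars, num_classes)
    targets = [i for i in range(num_vars) if i not in context]
    # Sum of negative log-probabilities over all masked-out variables ...
    loss = -sum(math.log(probs[i][x[i]]) for i in targets)
    # ... divided by the number of terms so every ordering is weighted evenly.
    return loss / len(targets)

example = [0, 2, 1, 3]
estimate = orderless_loss(example, 4, uniform_model, random.Random(0))
```

With the uniform stand-in, every term equals log 4, so the reweighted estimate is log 4 regardless of which context is drawn.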

In this work, the model is implemented by a deep convolutional neural network [23]. This choice is motivated by the locality of contrapuntal rules and their near-invariance to translation, both in time and in pitch space.

We represent the music as a stack of piano rolls encoded in a binary three-dimensional tensor $x \in \{0,1\}^{I \times T \times P}$. Here $I$ denotes the number of instruments, $T$ the number of time steps, $P$ the number of pitches, and $x_{i,t,p} = 1$ iff the $i$th instrument plays pitch $p$ at time $t$. We will assume each instrument plays exactly one pitch at a time, that is, $\sum_p x_{i,t,p} = 1$ for all $i, t$.

Our focus is on four-part Bach chorales as used in prior work [1, 4, 11, 27, 14]. Hence we assume $I = 4$ throughout. We constrain ourselves to only the range that appears in our training data (MIDI pitches 36 through 88). Time is discretized at the level of 16th notes for similar reasons. To curb memory requirements, we enforce a fixed number of time steps $T$ by randomly cropping the training examples.
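The representation can be sketched in a few lines. The code below is an illustration under assumptions: it treats the MIDI range 36 through 88 as inclusive (giving 53 pitches), and the function name `encode_pianoroll` is hypothetical.

```python
MIN_PITCH, MAX_PITCH = 36, 88            # MIDI range stated in the text
NUM_PITCHES = MAX_PITCH - MIN_PITCH + 1  # assumes the range is inclusive

def encode_pianoroll(pitches):
    """pitches[i][t] is the MIDI pitch of instrument i at time step t.

    Returns x with x[i][t][p] = 1 iff instrument i plays pitch MIN_PITCH + p
    at time t, so each (i, t) cell is a one-hot vector over pitches.
    """
    roll = []
    for voice in pitches:
        frames = []
        for pitch in voice:
            if not MIN_PITCH <= pitch <= MAX_PITCH:
                raise ValueError(f"pitch {pitch} outside supported range")
            one_hot = [0] * NUM_PITCHES
            one_hot[pitch - MIN_PITCH] = 1
            frames.append(one_hot)
        roll.append(frames)
    return roll

# Four voices (soprano, alto, tenor, bass) holding a C major chord
# for two sixteenth-note steps.
chord = [[72, 72], [67, 67], [64, 64], [48, 48]]
x = encode_pianoroll(chord)
```

Because each instrument plays exactly one pitch per step, a held note is simply the same one-hot vector repeated over consecutive time steps.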

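Given a training example $x \sim p(x)$, we present the model with values of only a strict subset of its elements $x_C = \{x_{(i,t)} \mid (i,t) \in C\}$ and ask it to reconstruct its complement $x_{\neg C}$. The input $h^0 \in \{0,1\}^{2I \times T \times P}$ is obtained by masking the piano rolls $x$ to obtain the context $x_C$ and concatenating this with the corresponding mask:

$$h^0_{i,t,p} = \mathbb{1}_{(i,t) \in C}\, x_{i,t,p} \tag{4}$$
$$h^0_{I+i,t,p} = \mathbb{1}_{(i,t) \in C} \tag{5}$$

where the time and pitch dimensions are treated as spatial dimensions to convolve over. Each instrument's piano roll $h^0_i$ and mask $h^0_{I+i}$ are each treated as a separate channel and convolved independently.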
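The input construction can be sketched as below. This is an illustrative reading of the description above rather than the released implementation: the mask is assumed to be broadcast across the pitch dimension, and `build_input` is a hypothetical name.

```python
def build_input(x, context):
    """Build the network input from a piano roll and a set of observed cells.

    x[i][t][p] is a binary piano roll; context is a set of (i, t) pairs whose
    values are visible. Returns 2*I channels: the I masked piano rolls followed
    by the I masks (assumed here to be constant across pitch).
    """
    num_instruments = len(x)
    num_steps = len(x[0])
    num_pitches = len(x[0][0])
    # Zero out every cell that is not part of the context.
    masked = [[[x[i][t][p] if (i, t) in context else 0
                for p in range(num_pitches)]
               for t in range(num_steps)]
              for i in range(num_instruments)]
    # The mask tells the model which zeros mean "unknown" rather than "silent".
    mask = [[[1 if (i, t) in context else 0
              for _ in range(num_pitches)]
             for t in range(num_steps)]
            for i in range(num_instruments)]
    return masked + mask

x = [[[1, 0, 0], [0, 1, 0]],
     [[0, 0, 1], [1, 0, 0]]]
context = {(0, 0), (1, 1)}
h0 = build_input(x, context)
```

Passing the mask alongside the masked roll lets the model distinguish a cell that was removed from a cell whose values are genuinely zero.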

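With the exception of the first and final layers, all convolutions preserve the size of the hidden representation. That is, we use "same" padding throughout and all activations have the same number of channels $H$, such that $h^l \in \mathbb{R}^{H \times T \times P}$ for all $1 \leq l < L$. Throughout our experiments we used a fixed number of layers $L$ and channels $H$. After each convolution we apply batch normalization [21] (denoted by $\mathrm{BN}(\cdot)$) with statistics tied across time and pitch. Batch normalization rescales activations at each layer to have mean $\beta$ and standard deviation $\gamma$, which greatly improves optimization. After every second convolution, we introduce a skip connection from the hidden state two levels below to reap the benefits of residual learning [15].

$$a^l = \mathrm{BN}(W^l * h^{l-1};\ \gamma^l, \beta^l) \tag{6}$$
$$h^l = \begin{cases} \mathrm{ReLU}(a^l + h^{l-2}) & \text{if } 3 < l < L - 1 \text{ and } l \bmod 2 = 0 \\ \mathrm{ReLU}(a^l) & \text{otherwise} \end{cases} \qquad h^L = a^L \tag{7}$$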
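To make the role of the learned scale and shift concrete, here is a simplified sketch of the normalization step on a flat list of activations. It is an assumption-laden illustration: the actual layer computes statistics per channel over the batch, time and pitch dimensions, and the function name `batch_norm` and the `eps` value are hypothetical.

```python
import math

def batch_norm(values, gamma, beta, eps=1e-5):
    """Normalize values to zero mean and unit variance, then scale and shift.

    After this step the outputs have mean close to beta and standard
    deviation close to gamma (eps only guards against division by zero).
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in out) / len(out))
```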

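The final activations $h^L \in \mathbb{R}^{I \times T \times P}$ are passed through the softmax function to obtain predictions for the pitch at each instrument/time pair:

$$p_\theta(x_{i,t,p} \mid x_C, C) = \frac{\exp(h^L_{i,t,p})}{\sum_{p'} \exp(h^L_{i,t,p'})} \tag{8}$$

The loss function from Equation 3 is then given by

$$\mathcal{L}(x; C, \theta) = -\sum_{(i,t) \notin C} \log p_\theta(x_{i,t} \mid x_C, C) = -\sum_{(i,t) \notin C} \sum_p x_{i,t,p} \log p_\theta(x_{i,t,p} \mid x_C, C) \tag{9}$$

where $p_\theta$ denotes the probability under the model with parameters $\theta$. We train the model by minimizing

$$\mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{C \sim p(C)}\, \frac{1}{|\neg C|}\, \mathcal{L}(x; C, \theta) \tag{10}$$

with respect to $\theta$ using stochastic gradient descent with step size determined by Adam [22]. The expectations are estimated by sampling piano rolls $x$ from the training set and drawing a single context $C$ per sample.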
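The softmax of Equation 8 and the masked loss of Equation 9 can be sketched as follows, assuming the final activations are available as nested lists of logits indexed by instrument, time and pitch. The function names are hypothetical, and the division by the number of masked cells corresponds to the reweighting used in the training objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of pitch logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_nll(logits, x, context):
    """Negative log-likelihood of the cells outside the context.

    logits[i][t] and x[i][t] are lists over pitch; x[i][t] is one-hot.
    The sum over masked (i, t) cells is divided by their count, so the
    context must be a strict subset of all cells.
    """
    total, count = 0.0, 0
    for i, voice in enumerate(x):
        for t, one_hot in enumerate(voice):
            if (i, t) in context:
                continue  # observed cells contribute no loss
            probs = softmax(logits[i][t])
            total -= math.log(probs[one_hot.index(1)])
            count += 1
    return total / count

x = [[[1, 0, 0], [0, 1, 0]],
     [[0, 0, 1], [1, 0, 0]]]
zero_logits = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
loss = masked_nll(zero_logits, x, context={(0, 0)})
```

With all logits equal, each masked cell is predicted uniformly over three pitches, so the loss is log 3.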

4 Evaluation

The log-likelihood of a given example is computed according to Algorithm 1 by repeated application of Equation 8. Evaluation occurs one frame at a time, within which the model conditions on its own predictions and does not see the ground truth. Unlike notewise teacher-forcing, where the ground truth is injected after each prediction, the framewise evaluation is thus sensitive to accumulation of error. This gives a more representative measure of quality of the generative model. For each example, we repeat the evaluation process a number of times to average over multiple orderings, and finally average across frames and examples. For chronological evaluation, we draw only orderings that have the $t_k$s in increasing order.

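Given a piano roll $x$
$L_{m,i,t} \leftarrow 0$ for all $m, i, t$
for multiple orderings $m = 1 \ldots M$ do
     $C \leftarrow \emptyset$, $\hat{x} \leftarrow x$
     Sample an ordering $t_1, t_2, \ldots, t_T$ over frames
     for $k = 1 \ldots T$ do
         Sample an ordering $i_1, i_2, \ldots, i_I$ over instruments
         for $j = 1 \ldots I$ do
               $\pi_p \leftarrow p_\theta(x_{i_j,t_k,p} \mid \hat{x}_C, C)$ for all $p$
               $L_{m,i_j,t_k} \leftarrow \sum_p x_{i_j,t_k,p} \log \pi_p$
               $\hat{x}_{i_j,t_k} \sim \mathrm{Cat}(P, \pi)$
               $C \leftarrow C \cup (i_j, t_k)$
         end for
         $\hat{x}_C \leftarrow x_C$
     end for
end for
return $-\frac{1}{T} \sum_t \log \frac{1}{M} \sum_m \exp \sum_i L_{m,i,t}$
Algorithm 1 Framewise log-likelihood evaluation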
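The framewise evaluation described in this section can be sketched in plain Python. This is a hedged illustration, not the authors' evaluation code: the piano roll is stored as pitch indices rather than one-hot vectors, `uniform_model` is a hypothetical stand-in for the network, and likelihoods are averaged across orderings per frame before taking the logarithm.

```python
import math
import random

def uniform_model(context, cell, num_pitches):
    """Hypothetical stand-in for the network: uniform pitch distribution."""
    return [1.0 / num_pitches] * num_pitches

def framewise_nll(x, model, num_pitches, num_orderings, rng):
    """x[i][t] is the true pitch index. Returns the NLL per frame."""
    num_instruments, num_steps = len(x), len(x[0])
    # log_probs[m][t]: log-likelihood of frame t under ordering m.
    log_probs = [[0.0] * num_steps for _ in range(num_orderings)]
    for m in range(num_orderings):
        context = {}  # (i, t) -> pitch index currently visible to the model
        frames = list(range(num_steps))
        rng.shuffle(frames)
        for t in frames:
            voices = list(range(num_instruments))
            rng.shuffle(voices)
            for i in voices:
                probs = model(context, (i, t), num_pitches)
                log_probs[m][t] += math.log(probs[x[i][t]])
                # Within a frame the model sees its own sample, not the truth.
                context[(i, t)] = rng.choices(range(num_pitches), weights=probs)[0]
            # Once the frame is complete, the ground truth is revealed.
            for i in voices:
                context[(i, t)] = x[i][t]
    total = 0.0
    for t in range(num_steps):
        # Log of the mean likelihood across orderings (log-mean-exp).
        vals = [log_probs[m][t] for m in range(num_orderings)]
        peak = max(vals)
        total += peak + math.log(sum(math.exp(v - peak) for v in vals) / num_orderings)
    return -total / num_steps

x = [[0, 1, 2], [3, 4, 0], [1, 1, 1], [2, 0, 4]]  # 4 instruments, 3 frames
nll = framewise_nll(x, uniform_model, num_pitches=5, num_orderings=3,
                    rng=random.Random(0))
```

Under the uniform stand-in each frame of four instruments has likelihood (1/5)^4, so the framewise NLL is 4 log 5 whatever orderings are drawn.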
| Model | Temporal resolution: quarter | eighth | sixteenth |
| --- | --- | --- | --- |
| Nade [4] | 7.19 | | |
| RNN-RBM [4] | 6.27 | | |
| RNN-Nade [4] | 5.56 | | |
| RNN-Nade (our implementation) | 5.03 | 3.78 | 2.05 |
| Coconet (chronological) | | | |
| Coconet (random) | | | |

Table 1: Framewise negative log-likelihoods (NLLs) on the Bach corpus. We compare against [4], who used quarter-note resolution. We also compare on higher temporal resolutions (eighth notes, sixteenth notes), against our own reimplementation of RNN-Nade. Coconet is an instance of orderless Nade, and as such we evaluate it on random orderings. However, the baselines support only chronological frame ordering, and hence we evaluate our model in this setting as well.

5 Sampling

We can sample from the model using the orderless Nade ancestral sampling procedure, in which we first sample an ordering and then sample variables one by one according to the ordering. However, we find that this yields poor samples, and we propose instead to use Gibbs sampling.

5.1 Orderless Nade Sampling

Sampling according to orderless Nade involves first randomly choosing an ordering and then sampling variables one by one according to the chosen ordering. We use an equivalent procedure in which we arrive at a random ordering by randomly choosing, at each step, the next variable to sample. We start with an empty (zero everywhere) piano roll $x$ and empty context $C$ and populate them iteratively by the following process. We feed the piano roll $x$ and context $C$ into the model to obtain a set of categorical distributions $p_\theta(x_{i,t} \mid x_C, C)$ for $(i,t) \notin C$. As the $x_{i,t}$ are not conditionally independent, we cannot simply sample from these distributions independently. However, if we sample from one of them, we can compute new conditional distributions for the others. Hence we randomly choose one $(i,t) \notin C$ to sample from, and let $x_{i,t}$ equal the one-hot realization. We then augment the context with $(i,t)$ and repeat until the piano roll is populated. This procedure is easily generalized to tasks such as melody harmonization and partial score completion by starting with a nonempty piano roll.
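The procedure above can be sketched as a short loop. The code is a minimal illustration under assumptions: `uniform_model` is a hypothetical placeholder for the trained network, and shuffling the unknown cells up front is used as an equivalent way of picking the next cell at random at each step.

```python
import random

def uniform_model(roll, context, num_instruments, num_steps, num_pitches):
    """Hypothetical stand-in for the trained network: uniform pitch distributions."""
    return [[[1.0 / num_pitches] * num_pitches for _ in range(num_steps)]
            for _ in range(num_instruments)]

def ancestral_sample(model, num_instruments, num_steps, num_pitches, rng,
                     roll=None, context=None):
    """Fill in every (instrument, time) cell outside the context, one per model call."""
    if roll is None:
        roll = [[[0] * num_pitches for _ in range(num_steps)]
                for _ in range(num_instruments)]
    else:
        roll = [[list(cell) for cell in voice] for voice in roll]  # do not mutate input
    context = set() if context is None else set(context)
    remaining = [(i, t) for i in range(num_instruments)
                 for t in range(num_steps) if (i, t) not in context]
    rng.shuffle(remaining)  # a uniformly random ordering over the unknown cells
    calls = 0
    for i, t in remaining:
        # Recompute the conditionals given everything sampled so far.
        probs = model(roll, context, num_instruments, num_steps, num_pitches)
        calls += 1
        pitch = rng.choices(range(num_pitches), weights=probs[i][t])[0]
        roll[i][t] = [1 if p == pitch else 0 for p in range(num_pitches)]
        context.add((i, t))
    return roll, calls

sample, model_calls = ancestral_sample(uniform_model, 4, 8, 5, random.Random(0))
```

Each unknown cell costs one model evaluation, which is why generating a full piano roll from scratch takes as many evaluations as there are instrument/time pairs.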

Unfortunately, samples thus generated are of low quality, which we surmise is due to accumulation of errors. This is a well-known weakness of autoregressive models [39, 3, 20, 24]. While the model provides conditionals $p_\theta(x_{i,t} \mid x_C, C)$ for all $(i,t) \notin C$, some of these conditionals may be better modeled than others. We suspect in particular those conditionals used early on in the procedure, for which the context $C$ consists of very few variables. Moreover, although the model is trained to be order-agnostic, different orderings invoke different distributions, which is another indication that some conditionals are poorly learned. We test this hypothesis in Section 6.2.

5.2 Gibbs Sampling

To remedy this, we allow the model to revisit its choices: we repeatedly mask out some part of the piano roll and then repopulate it. This is a form of blocked Gibbs sampling [28]. Blocked sampling is crucial for mixing, as the high temporal resolution of our representation causes strong correlations between consecutive notes. For instance, without blocked sampling, it would take many steps to snap out of a long-held note. Similar considerations hold for the Ising model from statistical mechanics, leading to the Swendsen-Wang algorithm [33] in which large clusters of variables are resampled at once.

We consider two strategies for resampling a given block of variables: ancestral sampling and independent sampling. Ancestral sampling invokes the orderless Nade sampling procedure described in Section 5.1 on the masked-out portion of the piano roll. Independent sampling simply treats the masked-out variables $x_{\neg C}$ as independent given the context $x_C$.

Using independent blocked Gibbs to sample from a Nade model has been studied by [40], who propose to use an annealed masking probability $\alpha_n = \max\left(\alpha_{\min},\ \alpha_{\max} - \frac{n}{\eta N}(\alpha_{\max} - \alpha_{\min})\right)$ for some minimum and maximum probabilities $\alpha_{\min}, \alpha_{\max}$, total number of Gibbs steps $N$ and fraction $\eta$ of time spent before settling onto the minimum probability $\alpha_{\min}$. Initially, when the masking probability is high, the chain mixes fast but samples are poor due to independent sampling. As $\alpha_n$ decreases, the blocked Gibbs process with independent resampling approaches standard Gibbs where one variable at a time is resampled, thus amortizing the effects of independent sampling. $N$ is a hyperparameter which as a rule of thumb we set equal to $IT$; it can be set lower than that to save computation at a slight loss of sample quality.
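A schedule of this kind can be written as a one-line function. The sketch below assumes a linear decay from the maximum to the minimum masking probability over the first fraction of the Gibbs steps, consistent with the description above; the numeric values in the example are illustrative and are not reported settings.

```python
import math

def masking_probability(step, num_steps, alpha_min, alpha_max, eta):
    """Linearly anneal from alpha_max to alpha_min over the first eta * num_steps steps."""
    annealed = alpha_max - step * (alpha_max - alpha_min) / (eta * num_steps)
    return max(alpha_min, annealed)

# Illustrative values only: 100 Gibbs steps, settling after 75% of them.
schedule = [masking_probability(n, 100, alpha_min=0.05, alpha_max=0.9, eta=0.75)
            for n in range(100)]
```

The schedule starts at the maximum probability, decreases by a constant amount per step, and stays at the minimum for the remaining steps.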

[40] treat independent blocked Gibbs as a cheap approximation to ancestral sampling. Whereas plain ancestral sampling (Section 5.1) requires $O(IT)$ model evaluations, ancestral blocked Gibbs requires a prohibitive $O(ITN)$ model evaluations and independent Gibbs requires only $O(N)$, where $N$ can be chosen to be less than $IT$. Moreover, we find that independent blocked Gibbs sampling in fact yields better samples than plain ancestral sampling.
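Independent blocked Gibbs sampling with an annealed mask can be sketched as follows. This is an illustration under assumptions rather than the released sampler: `uniform_model` is a hypothetical stand-in for the network, the piano roll is stored as a mapping from (instrument, time) cells to pitch indices, cells that are still empty are always resampled, and the default schedule values are illustrative.

```python
import random

def uniform_model(context, shape):
    """Hypothetical stand-in for the network: uniform predictions for every cell."""
    num_instruments, num_steps, num_pitches = shape
    return {(i, t): [1.0 / num_pitches] * num_pitches
            for i in range(num_instruments) for t in range(num_steps)}

def independent_gibbs(model, shape, num_gibbs_steps, rng,
                      alpha_min=0.05, alpha_max=0.9, eta=0.75):
    """Repeatedly mask a random block of cells and resample them independently."""
    num_instruments, num_steps, num_pitches = shape
    cells = [(i, t) for i in range(num_instruments) for t in range(num_steps)]
    roll = {cell: None for cell in cells}  # None marks an empty cell
    calls = 0
    for n in range(num_gibbs_steps):
        # Annealed masking probability for this step.
        alpha = max(alpha_min,
                    alpha_max - n * (alpha_max - alpha_min) / (eta * num_gibbs_steps))
        # Corruption: mask each cell independently with probability alpha.
        masked = {c for c in cells if roll[c] is None or rng.random() < alpha}
        context = {c: roll[c] for c in cells if c not in masked}
        probs = model(context, shape)  # a single model evaluation per Gibbs step
        calls += 1
        # Resampling: treat the masked cells as independent given the context.
        for c in cells:
            if c in masked:
                roll[c] = rng.choices(range(num_pitches), weights=probs[c])[0]
    return roll, calls

roll, model_calls = independent_gibbs(uniform_model, (4, 8, 5), 20, random.Random(0))
```

Each Gibbs step costs a single model evaluation regardless of how many cells are masked, so the total cost grows with the number of steps only. Replacing the independent resampling by the cell-by-cell procedure of Section 5.1 would instead cost one evaluation per masked cell in every step.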

6 Experiments

We evaluate our approach on a popular corpus of four-part Bach chorales. While the literature features many variants of this dataset [1, 4, 27, 14], we report results on that used by [4]. As the quarter-note temporal resolution used by [4] is frankly too coarse to accurately convey counterpoint, we also evaluate on eighth-note and sixteenth-note quantizations of the same data.

It should be noted that quantitative evaluation of generative models is fundamentally hard [34]. The gold standard for evaluation is qualitative comparison by humans, and we therefore report human evaluation results as well.

6.1 Data Log-likelihood

Table 1 compares the framewise log-likelihood of the test data under variants of our model and those reported in [4]. We find that the temporal resolution has a dramatic influence on the performance, which we suspect is an artifact of the performance metric. The log-likelihood is evaluated by teacher-forcing, that is, the prediction of a frame is conditioned on the ground truth of all previously predicted frames. As temporal resolution increases, chord changes become increasingly rare, and the model is increasingly rewarded for simply holding notes over time.

We evaluate Coconet on both chronological and random orderings, in both cases averaging likelihoods across an ensemble of orderings. The chronological orderings differ only in the ordering of instruments within each frame. We see in Table 1 that fully random orderings lead to significantly better performance. We believe the members of the more diverse random ensemble are more mutually complementary. For example, a forward ordering is uncertain at the beginning of a piece and more certain toward the end, whereas a backward ordering is more certain at the beginning and less certain toward the end.

6.2 Sample Quality

In Section 5 we conjectured that the low quality of Nade samples is due to poorly modeled conditionals $p_\theta(x_{i,t} \mid x_C, C)$ where $C$ is small. We test this hypothesis by evaluating the likelihood under the model of samples generated by the ancestral blocked Gibbs procedure with $C$ chosen according to independent Bernoulli variables. When we set the inclusion probability $\rho$ to 0, we obtain Nade. Increasing $\rho$ increases the expected context size $|C|$, which should yield better samples if our hypothesis is true. The results shown in Table 2 confirm that this is the case. For these experiments, we used a fixed sample length $T$ in time steps and a fixed number of Gibbs steps $N$.

Sampling scheme Framewise NLL
Ancestral Gibbs, (Nade)
Ancestral Gibbs,
Ancestral Gibbs,
Ancestral Gibbs,
Ancestral Gibbs,
Independent Gibbs [40]
Table 2: Mean (± SEM) NLL under the model of unconditioned samples generated from the model by various schemes.

Figure 2 shows the convergence behavior of the various Gibbs procedures, averaged over 100 runs. We see that for low values of $\rho$ (small $|C|$), the chains hardly make progress beyond Nade in terms of likelihood. Higher values of $\rho$ (large $|C|$) enable the model to get off the ground and reach significantly better likelihood.

Figure 2: Likelihood under the model for ancestral Gibbs samples obtained with various context distributions $p(C)$. Nade ($\rho = 0$) is included for reference.

6.3 Human Evaluations

To further compare the sample quality of different sampling procedures, we carried out a listening test on Amazon's Mechanical Turk (MTurk). The procedures include orderless Nade ancestral sampling and independent Gibbs [40]; with each of them we generate four unconditioned samples, each eight measures long, from empty piano rolls. To have an absolute reference for the quality of samples, we include the first eight measures of four random Bach chorale pieces from the validation set. Each fragment lasts thirty-four seconds after synthesis.

Figure 3: Human evaluations from MTurk on how many times a sampling procedure or Bach is perceived as more Bach-like. Error bars show the standard deviation of a binomial distribution fitted to each source's binary win/loss counts.

For each MTurk hit, participants are asked to rate on a Likert scale which of the two random samples they perceive as more Bach-like. A total of 96 ratings were collected, with each source involved in 64 (= 96 × 2/3) pairwise comparisons. Figure 3 shows the number of times each source was perceived as closer to Bach's style. We perform a Kruskal-Wallis H test on the ratings, which shows that there are statistically significant differences between models. A post-hoc analysis using the Wilcoxon signed-rank test with Bonferroni correction showed that participants perceived samples from independent Gibbs as more Bach-like than samples from ancestral sampling (Nade). This confirms the log-likelihood comparisons of sample quality in Section 6.2, which indicate that independent Gibbs produces better samples. There was also a significant difference between Bach and ancestral samples but not between Bach and independent Gibbs.
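The comparison count and the binomial error bars mentioned in the caption of Figure 3 follow from simple arithmetic, sketched below. The win count passed to `binomial_std` in the example is hypothetical and is not a reported result; the function name is also an assumption.

```python
import math

def binomial_std(wins, comparisons):
    """Standard deviation of the win count under a binomial fit with p = wins / comparisons."""
    p_hat = wins / comparisons
    return math.sqrt(comparisons * p_hat * (1.0 - p_hat))

total_ratings = 96
# Each rating compares 2 of the 3 sources, so each source appears in 2/3 of them.
comparisons_per_source = total_ratings * 2 // 3

# Hypothetical example: a source that wins half of its comparisons.
example_std = binomial_std(32, comparisons_per_source)
```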

7 Conclusion

We introduced a convolutional approach to modeling musical scores based on the orderless Nade [35] framework. Our experiments show that the Nade ancestral sampling procedure yields poor samples, which we have argued is because some conditionals are not captured well by the model. We have shown that sample quality improves significantly when we use blocked Gibbs sampling to iteratively rewrite parts of the score. Moreover, annealed independent blocked Gibbs sampling as proposed by [40] is not only faster but in fact produces better samples.

Acknowledgments

We thank Kyle Kastner, Guillaume Alain, Gabriel Huang, Curtis (Fjord) Hawthorne, the Google Brain Magenta team, as well as Jason Freidenfelds for helpful feedback, discussions, suggestions and support. We also thank Calcul Québec and Compute Canada for computational support.

References