Harmonic Recomposition using Conditional Autoregressive Modeling

Kyle Kastner et al. · November 18, 2018

We demonstrate a conditional autoregressive pipeline for efficient music recomposition, based on methods presented in van den Oord et al. (2017). Recomposition (Casal & Casey, 2010) focuses on reworking existing musical pieces, adhering to structure at a high level while also re-imagining other aspects of the work. This can involve reuse of pre-existing themes or parts of the original piece, while also requiring the flexibility to generate new content at different levels of granularity. Applying the aforementioned modeling pipeline to recomposition, we show diverse and structured generation conditioned on chord sequence annotations.


1 Introduction

Since the early days of computation, composers have explored methods of combining aleatoric music and algorithmic composition with generic computing devices (Agon et al., 2003; Cope, 1989). Authors have taken a wide variety of data-driven approaches to "creative generation" in various domains (Barbieri et al., 2012; Ha & Eck, 2017; Graves, 2013), with extensive application to music modeling (Briot et al., 2017; Roberts et al., 2018; Eck & Schmidhuber, 2002; Sturm et al., 2015; Hadjeres et al., 2016; Boulanger-Lewandowski et al., 2012; Bretan et al., 2017).

In this paper, we focus on the task of harmonic recomposition (Casal & Casey, 2010). Melody generation and evaluation are difficult tasks, even in monophonic music (Jaques et al., 2016), so we use the term harmonic recomposition to emphasize our focus on aspects of agreement and structure between voices. Our pipeline is also applicable to purely sequential and iterative generation, as has been shown in prior work (Huang et al., 2017; van den Oord et al., 2017).

1.1 Related Work

Autoregressive models have proven to be powerful distribution estimators for images and sequence data, showing excellent results in generative settings (van den Oord et al., 2016a). They have also performed well in related prior work for polyphonic music generation (Briot et al., 2017). Most related to the work described in this paper is CoCoNet (Huang et al., 2017, 2018), which also uses an autoregressive convolutional model over image-like structures for polyphonic music generation and was a direct inspiration for our approach. One key difference of our approach is the use of a two-stage pipeline (first seen in the work of van den Oord et al. (2017)), which greatly improves training and generation speed while creating an implicit separation between local voice agreement (first stage) and global consistency over measures (second stage).

2 Implementation Details

In this section, we describe the data, model, and training details for our recomposition approach. An open source implementation of our setup (including audio samples) is available online at https://github.com/kastnerkyle/harmonic_recomposition_workshop.

2.1 Data

We use a subset of the scores associated with the composer Josquin des Prez, as compiled by the Josquin project (http://josquin.stanford.edu/). Only pieces with the required number of parts are considered, resulting in a dataset of pieces divided into individual measures. We hold out a contiguous block of measures for use as a source of harmonic chord sequences during conditional generation.

After extracting individual measures, we convert each to a "piano roll" style multichannel image, with quantized timesteps (regardless of time signature) on the horizontal axis and possible tones on the vertical axis, where the set of tones comes from all notes used in the key-normalized data (Hadjeres et al., 2016). These images are padded for compatibility with the strided convolutional layers used in the VQ-VAE, and each voice is assigned its own channel in an image-like container described in examples, height, width, and channels format (NHWC). The overall result can be seen in Fig. 1, where each color represents a separate channel.
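
As a concrete illustration of this representation, the following is a minimal NumPy sketch of building such an NHWC piano-roll batch; the helper names, dimensions, and padding target are illustrative assumptions rather than values taken from the paper's implementation.

    import numpy as np

    def measure_to_piano_roll(measure_notes, n_timesteps, n_tones, n_voices):
        """Convert one measure into an image-like array of shape
        (height=n_tones, width=n_timesteps, channels=n_voices).

        measure_notes: iterable of (voice_index, tone_index, start_step,
        end_step) tuples, assumed already quantized and key-normalized.
        """
        roll = np.zeros((n_tones, n_timesteps, n_voices), dtype=np.float32)
        for voice, tone, start, end in measure_notes:
            roll[tone, start:end, voice] = 1.0
        return roll

    def batch_measures(measures, n_timesteps, n_tones, n_voices, pad_to=None):
        """Stack measures into an NHWC batch, optionally zero-padding the
        spatial axes (e.g. so they divide evenly by the VQ-VAE stride)."""
        rolls = [measure_to_piano_roll(m, n_timesteps, n_tones, n_voices)
                 for m in measures]
        batch = np.stack(rolls, axis=0)  # (N, H, W, C)
        if pad_to is not None:
            pad_h = pad_to[0] - batch.shape[1]
            pad_w = pad_to[1] - batch.shape[2]
            batch = np.pad(batch, ((0, 0), (0, pad_h), (0, pad_w), (0, 0)))
        return batch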

2.2 Conditional Information

We extract the chord function and voicing of all measures using the music21 software package (Cuthbert & Ariza, 2010), and form "function triplets" from the chords of the previous, current, and next measure. The first and last chords are repeated so that border measures also receive a full triplet.
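
A minimal sketch of this triplet construction follows; the chord labels in the comment are illustrative, not the annotations used in the paper.

    def chord_triplets(chords):
        """Form (previous, current, next) chord triplets per measure,
        repeating the first and last chords to handle border issues."""
        padded = [chords[0]] + list(chords) + [chords[-1]]
        return [(padded[i - 1], padded[i], padded[i + 1])
                for i in range(1, len(padded) - 1)]

    # e.g. chord_triplets(["C", "F", "G"]) ->
    #   [("C", "C", "F"), ("C", "F", "G"), ("F", "G", "G")]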

2.3 Models

The model pipeline is a two-stage generative setup, as described by van den Oord et al. (2017), wherein an initial stage (denoted VQ-VAE) is unconditionally trained to compress inputs to a spatially reduced, discrete representation, and to uncompress that representation back into the input space. Once the VQ-VAE stage is trained, we use it to produce a compressed representation for each element in the dataset, and train an autoregressive generative "prior model" on this representation. The prior model learns to generate the components of the representation (which takes the form of a spatial map of discrete codes in this work) one at a time, conditioning each generation step on all previously generated components.

The prior model may also take conditioning as one or multiple vectors (separate embeddings for the previous, current, and next chord, each indexed by a chord integer), as a spatial map (the compressed representation of a previous measure), or as a combination of both during the generation process. The effect of conditioning type can be seen in Fig. 2.
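
The sketch below summarizes the two-stage training and generation flow at a high level; the vqvae and prior interfaces are hypothetical stand-ins for exposition, not the API of the released implementation.

    # Hypothetical interfaces; see the linked repository for actual code.

    def train_two_stage(dataset, vqvae, prior):
        # Stage 1: the VQ-VAE is trained unconditionally to compress each
        # measure image to a small grid of discrete codes and reconstruct it.
        vqvae.fit(dataset.images)

        # Stage 2: the autoregressive prior is trained on the discrete code
        # maps, conditioned on chord-triplet embeddings (and optionally the
        # code map of a previous measure).
        codes = [vqvae.encode(x) for x in dataset.images]
        prior.fit(codes, conditioning=dataset.chord_triplets)

    def generate(vqvae, prior, chord_triplet, previous_codes=None):
        # Sample code indices one at a time, each step conditioned on all
        # previously sampled codes plus the global conditioning vectors.
        codes = prior.sample(conditioning=chord_triplet,
                             spatial_conditioning=previous_codes)
        # Decode the sampled code map back to a piano-roll measure.
        return vqvae.decode(codes)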

2.4 Experiment Details

The first-stage VQ-VAE has 2 strided convolutional layers (strided on both spatial axes), followed by an additional 2 convolutional layers, using rectified linear activations (Glorot et al., 2011) and batch normalization (Ioffe & Szegedy, 2015). These layers compress each input measure to a spatially reduced map of indices into a discrete VQ codebook. The encoding procedure is inverted using transpose convolutions for the decoder, which is trained with a binary cross-entropy reconstruction loss alongside the codebook and commitment losses of the VQ-VAE, averaged over all channels and spatial dimensions. Training was performed over minibatches with an Adam optimizer (Kingma & Ba, 2014).

In the second stage, a gated conditional PixelCNN (van den Oord et al., 2016b) is configured as a stack of gated convolutional layers and used as the prior model over the latent code map. The first layer has no residual connection; layers after the first utilize residual connections (He et al., 2016) and are followed by convolution, a rectified linear activation, and a final convolution whose output channel count matches the VQ codebook size. Training was performed over minibatches with an Adam optimizer (configured as before) and a categorical cross-entropy loss averaged over the output.
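
A minimal sketch of how such a trained prior can be sampled autoregressively over the discrete code map is given below; the prior_logits_fn interface and the dimensions in the usage comment are hypothetical assumptions, and in practice the PixelCNN would also receive the chord-triplet embeddings through its conditional input.

    import numpy as np

    def sample_code_map(prior_logits_fn, height, width, codebook_size, cond, rng):
        """Sample a discrete latent code map one position at a time.

        prior_logits_fn(codes, cond) is assumed to return unnormalized logits
        of shape (height, width, codebook_size) from the conditional PixelCNN.
        """
        codes = np.zeros((height, width), dtype=np.int64)
        for i in range(height):
            for j in range(width):
                logits = prior_logits_fn(codes, cond)[i, j]
                # Softmax over the codebook, then sample the code at (i, j).
                probs = np.exp(logits - logits.max())
                probs /= probs.sum()
                codes[i, j] = rng.choice(codebook_size, p=probs)
        return codes

    # Usage sketch (illustrative sizes):
    # codes = sample_code_map(model_fn, 4, 8, 512, chord_triplet,
    #                         np.random.default_rng(0))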

3 Results

We experiment with two types of conditioning combined with the aforementioned architecture. The higher-level information contained in the chord sequences alone seems sufficient to produce directed, coherent trajectories, without the need for spatial conditioning information. When spatial conditioning from the previous measure is included, the resulting generations are punctuated by dissonant intervals or long silent gaps. Finding better ways to combine local note-level information with chord annotations will be an important step toward improving this pipeline.

Figure 1: Two 8-bar sequences generated over the same conditioning chord sequence using different random seeds.
Figure 2: 8-bar sequences generated over a single conditioning chord sequence. Left: chord triplet conditioning only. Right: chord triplet conditioning along with the previous generation as a conditioning spatial map.

4 Conclusion

Chord-conditional generative models are an ideal fit for harmonic recomposition. We find that a two-stage pipeline reminiscent of van den Oord et al. (2017) and Huang et al. (2017) captures musical structure, and allows for chord-conditional generation. Our work demonstrates note-level realizations of given chordal sequences and provides an open-source implementation with examples.

References