An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation

09/12/2018 ∙ by Cheng-Zhi Anna Huang, et al. ∙ Google 2

Music relies heavily on self-reference to build structure and meaning. We explore the Transformer architecture (Vaswani et al., 2017) as a generative model for music, as self-attention has shown compelling results on tasks that require long-term structure such as Wikipedia summary generation (Liu et al., 2018). However, timing information is critical for polyphonic music, and Transformer does not explicitly model absolute or relative timing in its structure. To address this challenge, Shaw et al. (2018) introduced relative position representations to self-attention to improve machine translation. However, the formulation was not scalable to longer sequences. We propose an improved formulation which reduces the memory requirements of the relative position computation from O(l^2d) to O(ld), making it possible to train on much longer sequences and achieve faster convergence. In experiments on symbolic music we find that relative self-attention substantially improves sample quality for unconditioned generation and is able to generate sequences longer than those in the training set. When primed with an initial sequence, the model generates continuations that develop the prime coherently and exhibit long-term structure. Relative self-attention can be instrumental in capturing richer relationships within a musical piece.




1 Introduction

A musical piece often consists of recurring elements, from motives to phrases to sections such as verse-chorus. To generate a coherent piece, a model needs to remember the elements that came before, in order to repeat them, vary and further develop them, and also to know how to create contrast and surprise. Self-attention mechanisms are a natural fit for this challenge, as they offer direct access to the generated history, allowing the model to choose the level of detail or summarization to derive from it.

The attention-based Transformer architecture vaswani2017attention has shown compelling results in generating Wikipedia articles with coherent long-term structure and consistency liu2018generatin. Like written language, a musical score can be represented with discrete symbols and has a hierarchical syntax, with notes being roughly like characters and motives of notes like words. Unlike text, which can be represented as a single sequence, music unfolds with multiple streams of events occurring simultaneously. Hence, building a language model for music requires making decisions on how to serialize a polyphonic texture. Furthermore, there is the question of how timing corresponds, or does not correspond, to position in the sequence.

We explore Transformer on two distinct musical genres that demand very different representations, allowing us to investigate how well the architecture can capture different properties of music. J.S. Bach Chorales is a canonical dataset used for evaluating generative models for music allan2005harmonising ; boulanger2012modeling ; goel2014polyphonic ; liang2016bachbot ; hadjeres2016style . It is small and constrained, consisting of pieces with traditional harmony that always have four voices and move predominantly on a 16th-note timing grid. After serializing the score into a single sequence (described in Section 3.1), the most common sequence length is 1024. Here there is a direct correspondence between position in sequence and time in piece. Samples from the baseline Transformer show that the architecture is not able to maintain the strict timing grid, causing voices to misalign.

This corroborates the observation in shaw2018self that Transformer does not have a strong notion of either absolute or relative timing. The only source of timing information is the positional sinusoids added to the input embedding, which can be difficult to disentangle. shaw2018self proposes adding a relative positional representation that allows similarity comparisons to be sensitive to how far apart the tokens are in a sequence. For the Bach dataset this is a good match, as distance in sequence position corresponds to time deltas. We indeed see a drastic improvement in both sample quality and perplexity after adding relative attention. The samples maintained the timing grid, and with this foundation, the model was able to capture timing on a more global level, giving rise to regular phrases. We also generalize the notion of relative representations to other attributes of music such as pitch, allowing for domain-informed invariance.

However, the Bach dataset is a special case where the musical score can be discretized on a fixed grid. In contrast, the second dataset we explore, Piano-e-Competition, consists of MIDI recorded from actual contestant performances, which involve expressive timing and dynamics. Time advances in milliseconds, and the sequences become too long if discretized at a fixed time unit. Here we break the direct correspondence between position in sequence and actual time in music and adopt a MIDI-like event-based representation, which advances time in variable deltas as consecutive events happen (described in Section 3.2). This dataset is much larger, and the music is also more complex, consisting of virtuosic piano music composed in the 18th through 20th centuries, which requires much larger models. In order to test whether relative attention is still beneficial on this more fluid event-based representation, we had to improve memory usage.

1.1 Contributions

This paper makes two main contributions. First, we improve the memory consumption of the relative attention mechanism proposed in shaw2018self. By transforming between absolute and relative position based tensors, as further described in Section 4.3, we reduce memory usage of the relative-position based tensor instantiations from $O(L^2 D)$ to $O(LD)$, where $L$ is the length of the sequence and $D$ the hidden size of the model.

Second, we show in a series of experiments on music generation that relative attention is critical for both grid-like and event-based representations as they carry strong sequential dependencies. Relative self-attention results in more consistency in sample quality for unconditioned generation and models that can generate sequences longer than those in the training set. When primed, the model generates continuations that develop the prime in a coherent fashion and exhibit long-term structure. We conducted a listening test and found that samples from models with relative self-attention were perceived as more coherent than the baseline Transformer model from vaswani2017attention . With these experiments we demonstrate that relative self-attention has potential as a productive tool for musicians as it can listen more attentively to input sequences and generate in a fashion that builds on existing material.

2 Related work

Sequence models have been the canonical choice for modeling music, from Hidden Markov Models (HMMs; baum1966statistical ; allan2005harmonising ; farbood2001analysis), to Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs; hochreiter1997long ; eck2002finding ; liang2016bachbot ; simon2017performance), to bidirectional LSTMs (BLSTMs; graves2005framewise ; hadjeres2016deepbach). Successful application of sequential models to polyphonic music often requires serializing the musical score or performance into a single sequence, e.g. by interleaving different instruments or voices. Alternatively, a 2D pianoroll-like representation (further described in Section 3.1) can be decomposed into a sequence of multi-hot pitch vectors, and their joint probability distributions can be captured using Restricted Boltzmann Machines (RBMs; smolensky1986information ; hinton2006fast) or Neural Autoregressive Distribution Estimators (NADE). Pianorolls are also image-like and can be modeled by Convolutional Neural Networks (CNNs), trained under the objectives of, for example, Generative Adversarial Networks (GANs; goodfellow2014generative ; dong2017musegan) or orderless NADE (uria2014deep ; uria2016neural), which trains over all orderings of factorizations (huang2017coconet).

We show the first use of the self-attention based Transformer model for music generation, and explore the robustness of relative attention on two distinct representations and datasets that span from fixed to highly expressive timing, among other differences. We see that Transformer outperforms LSTMs on modeling the Piano-e-Competition dataset. This task requires, among other things, repeating and varying motives through a long sequence. This could be easier for Transformers because the maximum path length between any two tokens is constant, while for LSTMs it is O(n). In particular, self-attention can see distant motifs as relevant because it has direct access to its entire past regardless of distance and is designed to attend based on similarity. Relative attention amplifies this by learning the typical periods of repetition. This inductive bias toward learning relational information, as opposed to absolute-position based patterns, also allows Transformers to generalize beyond observed lengths.

Self-attention can be thought of as related to self-similarity: the former maps the input through different projections to queries and keys, whereas the latter uses the same projection for both. Self-similarity has been used, for example, in lattner2016imposing in a style-transfer-like fashion, where the self-similarity structure of a piece serves as a template objective for gradient descent to modify an input score so that it bears a similar repetition structure.

The Transformer architecture has also been extended to other non-language domains, such as image generation and speech recognition parmar2018image ; poveytime . The primary challenge there has been computing attention over long sequences, because the time and memory complexity of self-attention grow quadratically with the number of positions. The common solution has been to restrict the attention computation to local windows around the query position. parmar2018image also extended self-attention to two-dimensional grids, again with local windows to avoid the quadratic factor. Partly to reduce memory requirements for complex music, we represent music as 1D sequences (see Section 3). As timing is central to music, we focus on improving timing through a more efficient implementation of relative attention, building upon shaw2018self .

3 Domain-specific representations

Adapting sequence models for music requires making decisions on how to serialize a polyphonic texture. The data type, whether score or performance, makes certain representations more natural for encoding all the information needed while still resulting in reasonable sequence lengths.

3.1 Serialized instrument/time grid (J.S.Bach Chorales)

The first dataset, J.S. Bach Chorales, consists of four-part score-based choral music. The time resolution is sixteenth notes, making it possible to use a serialized grid-like representation. Figure 1 shows how a pianoroll (left) can be represented as a grid (right), following huang2017coconet . The rows show the MIDI pitch number of each of the four voices, from top to bottom being soprano ($S$), alto ($A$), tenor ($T$) and bass ($B$), while the columns are discretized time, advancing in sixteenth notes. Here longer notes such as quarter notes are broken down into multiple repetitions. To serialize the grid into a sequence, we interleave the parts by first iterating through all the voices at time step 1, then moving to the next column and iterating again from top to bottom, and so on. The resulting sequence is $S_1 A_1 T_1 B_1 S_2 A_2 T_2 B_2 \ldots$, where the subscript gives the time step. After serialization, the most common sequence length is 1024. Each token is represented as a one-hot vector over pitch.

S: 67, 67, 67, 67
A: 62, 62, 62, 62
T: 59, 59, 57, 57
B: 43, 43, 45, 45

Figure 1: The opening measure of BWV 428 is visualized as a pianoroll (left, where the x-axis is discretized time and y-axis is MIDI pitch number), and encoded in grid representation with sixteenth note resolution (right). The soprano and alto voices have quarter notes at pitches G4 (67) and D4 (62), the tenor has eighth notes at pitches B3 (59) and A3 (57), and the bass has eighth notes at pitches A2 (45) and G2 (43).
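The serialization described in Section 3.1 can be sketched in a few lines of NumPy, using the grid of Figure 1 as input. The function names `serialize_grid` and `deserialize` are illustrative and are not taken from the paper or its code.

```python
import numpy as np

# Grid from Figure 1: rows are voices (S, A, T, B),
# columns are sixteenth-note time steps.
grid = np.array([
    [67, 67, 67, 67],  # soprano
    [62, 62, 62, 62],  # alto
    [59, 59, 57, 57],  # tenor
    [43, 43, 45, 45],  # bass
])

def serialize_grid(grid):
    """Interleave voices: S1 A1 T1 B1 S2 A2 T2 B2 ..."""
    # Transposing puts time on the first axis, so a row-major flatten
    # visits every voice within a time step before advancing time.
    return grid.T.reshape(-1)

def deserialize(sequence, num_voices=4):
    """Inverse of serialize_grid: recover the (voices, time) grid."""
    return np.asarray(sequence).reshape(-1, num_voices).T

sequence = serialize_grid(grid)
print(sequence.tolist())
```

With this layout, tokens that are `num_voices` positions apart belong to the same voice at consecutive time steps, which is the regularity that relative position information can exploit.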

3.2 MIDI-like event-based (Piano-e-Competition)

The second dataset, Piano-e-Competition, consists of polyphonic piano performances with expressive timing and dynamics. The time resolution here is on the millisecond level, so a grid representation would result in sequences that are too long. Instead, the polyphonic performance is serialized into a sequence of one-hot encoded events as proposed in simon2017performance .

First, the input MIDI files are preprocessed to extend note durations based on sustain pedal control events. The sustain pedal is considered to be down whenever a sustain control change is encountered with a value $\geq 64$; the sustain pedal is then considered up after a control change with a value $< 64$. Within a period where the sustain pedal is down, the duration of each note is extended to either the beginning of the next note of the same pitch or the end of the sustain period, whichever happens first. If the original duration extends beyond the time when the sustain pedal is down, that original duration is used.
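The sustain rule above can be illustrated with a small sketch. It assumes the pedal-down periods have already been extracted from the control change events and are passed in as `(down, up)` time pairs; the function name, the tuple layout, and the example notes are hypothetical.

```python
def extend_with_sustain(notes, sustain_periods):
    """Extend note ends according to the sustain pedal.

    notes: list of (pitch, start, end) in seconds.
    sustain_periods: list of (down, up) in seconds, non-overlapping.
    """
    extended = []
    for pitch, start, end in notes:
        new_end = end
        for down, up in sustain_periods:
            # Only notes released while the pedal is down are affected.
            if down <= end < up:
                # A later note of the same pitch cuts the sustain short.
                later_onsets = [s for p, s, _ in notes
                                if p == pitch and s > start]
                limit = min(later_onsets + [up])
                # Never shorten a note: keep the original end if later.
                new_end = max(end, limit)
                break
        extended.append((pitch, start, new_end))
    return extended

notes = [(60, 0.0, 0.5), (64, 0.5, 1.0), (60, 1.2, 1.4), (67, 2.5, 3.0)]
print(extend_with_sustain(notes, sustain_periods=[(0.0, 2.0)]))
```

In the example, the first C (pitch 60) is held until the second C starts, the E and the second C are held until the pedal is released at 2 seconds, and the last note is untouched because it ends outside the sustain period.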

Next, the MIDI note events are converted into a sequence over the following vocabulary: 128 NOTE_ON events for starting a note with one of the 128 MIDI pitches, 128 NOTE_OFF events for ending a note with one of the 128 MIDI pitches, 100 TIME_SHIFT events representing forward time shifts in 10ms increments from 10ms to 1s, and 32 SET_VELOCITY events representing the velocity for future NOTE_ON events in the form of the 128 possible MIDI velocities quantized into 32 bins. An example performance encoding is illustrated in Figure 2.
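As a rough illustration of this vocabulary, the sketch below maps each event type to an integer token. The ordering of the four blocks, the uniform velocity binning, and the choice to split shifts longer than 1s into several TIME_SHIFT tokens are assumptions made for this example; the text above only fixes the number of events of each type.

```python
NUM_PITCHES = 128
NUM_TIME_SHIFTS = 100    # 10 ms, 20 ms, ..., 1 s
NUM_VELOCITY_BINS = 32

# Assumed token layout: [NOTE_ON | NOTE_OFF | TIME_SHIFT | SET_VELOCITY]
NOTE_ON_OFFSET = 0
NOTE_OFF_OFFSET = NOTE_ON_OFFSET + NUM_PITCHES
TIME_SHIFT_OFFSET = NOTE_OFF_OFFSET + NUM_PITCHES
VELOCITY_OFFSET = TIME_SHIFT_OFFSET + NUM_TIME_SHIFTS
VOCAB_SIZE = VELOCITY_OFFSET + NUM_VELOCITY_BINS  # 128 + 128 + 100 + 32

def note_on(pitch):
    return NOTE_ON_OFFSET + pitch

def note_off(pitch):
    return NOTE_OFF_OFFSET + pitch

def set_velocity(velocity):
    # Quantize the 128 MIDI velocities into 32 bins of width 4.
    return VELOCITY_OFFSET + velocity * NUM_VELOCITY_BINS // 128

def time_shifts(milliseconds):
    """Tokens for a forward shift, emitted in chunks of at most 1 s."""
    steps = round(milliseconds / 10)
    tokens = []
    while steps > 0:
        chunk = min(steps, NUM_TIME_SHIFTS)
        tokens.append(TIME_SHIFT_OFFSET + chunk - 1)
        steps -= chunk
    return tokens

# A single note: set velocity 80, play middle C for half a second.
events = [set_velocity(80), note_on(60)] + time_shifts(500) + [note_off(60)]
print(VOCAB_SIZE, events)
```

Because time only advances through TIME_SHIFT tokens, the number of tokens between two notes no longer corresponds to the time between them, which is the break between sequence position and musical time discussed in the introduction.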


Figure 2: A snippet of a piano performance visualized as a pianoroll (left) and encoded as performance events (right, serialized from left to right and then down the rows). A C Major chord is arpeggiated with the sustain pedal active. At the 2-second mark, the pedal is released, ending all of the notes. At the 3-second mark, an F is played for .5 seconds. The C chord is played at velocity 80 and the F is played at velocity 100.

4 Model

We investigate the Transformer architecture vaswani2017attention as a generative model for symbolic polyphonic music. While typically cast as an encoder-decoder model, it can also be adapted for language modeling by instantiating a decoder-only model liu2018generatin . We first introduce the basic building blocks of Transformer, and then summarize how shaw2018self incorporates relative positional representation. In contrast to translation, music relies heavily on timing and its sequence lengths are much longer. We show in Section 4.3 how we can reduce the memory requirements of relative attention to scale to more complex music data.

4.1 Background: Transformer decoder architecture

The Transformer decoder is an autoregressive generative model, using primarily self-attention mechanisms. Positional sinusoids are added to the embeddings of the input sequence vaswani2017attention . Each layer consists of two sub-layers, a self-attention sub-layer and a feedforward sub-layer. The self-attention sub-layer computes for each position in the input sequence a weighted sum of its past based on how relevant past positions are to the current query.

More specifically, each input position is represented as a $D$-dimensional vector, which is then projected through different weight matrices to form a query $Q$, key $K$ and value $V$. Multiple heads are typically used to allow the model to focus on different parts of the history. These are supported by first splitting the queries, keys, and values into $H$ parts on the depth dimension, each of depth $D_h = D/H$. Equation 1 shows how multi-head attention between the queries and keys is computed in tensor form as $QK^\top$, a batch matrix multiplication of shapes $(B, H, L, D_h)$ by $(B, H, L, D_h)$ transposed, resulting in shape $(B, H, L, L)$, where $L$ is the length of an entire sequence and $B$ is the batch size.

For tasks akin to language modeling, the upper triangle of the last two dimensions (corresponding to the query and key positions respectively) is masked out, because queries can only attend to past key positions in an autoregressive model. The softmax is then performed over key positions, resulting in a weighted summary of $V$ of shape $(B, H, L, D_h)$.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{D_h}}\right)V \qquad (1)$$
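The masked multi-head attention described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names are made up for the example, the $1/\sqrt{D_h}$ scaling follows the standard Transformer, and the projection matrices are passed in as plain arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, num_heads):
    """(B, L, D) -> (B, H, L, D/H): split on the depth dimension."""
    B, L, D = x.shape
    return x.reshape(B, L, num_heads, D // num_heads).transpose(0, 2, 1, 3)

def causal_self_attention(X, Wq, Wk, Wv, num_heads):
    Q = split_heads(X @ Wq, num_heads)
    K = split_heads(X @ Wk, num_heads)
    V = split_heads(X @ Wv, num_heads)
    L, Dh = Q.shape[-2:]
    # Batch matmul (B,H,L,Dh) x (B,H,Dh,L) -> (B,H,L,L)
    logits = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dh)
    # Mask the upper triangle: queries may not attend to future keys.
    future = np.triu(np.ones((L, L), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    weights = softmax(logits, axis=-1)         # softmax over key positions
    return weights @ V, weights                # (B,H,L,Dh), (B,H,L,L)
```

The returned attention weights have shape $(B, H, L, L)$, so their memory already grows quadratically in $L$; the relative term discussed next must not add a further factor of depth on top of that.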


The feedforward (FF) sub-layer then takes the output $Z$ from the previous attention sub-layer, and applies two point-wise dense layers on the depth dimension, as shown in Equation 2. $W_1$, $W_2$, $b_1$, $b_2$ are the weights and biases of those two layers.

$$\mathrm{FF}(Z) = \mathrm{ReLU}(ZW_1 + b_1)W_2 + b_2 \qquad (2)$$


4.2 Relative positional self-attention

As the Transformer model relies solely on positional sinusoids to represent timing information vaswani2017attention , Shaw et al. shaw2018self introduced relative position representations to allow attention to be informed by how far apart two positions are in a sequence. This involves learning a separate relative position embedding $E^r$ in which each pairwise distance is given an embedding entry. These embeddings are then used to modulate relevance, as shown in Equation 3.

In practice, Shaw et al. shaw2018self implement Equation 4 by first multiplying $Q$ and $R$ before adding the result to $QK^\top$. $R$ is an intermediate tensor of shape $(L, L, D_h)$, where $D_h = D/H$ is the depth per head, hence of memory complexity $O(L^2 D)$. It is constructed by first computing pairwise distances for each of the query-by-key positions ($L$ by $L$), and then gathering the corresponding embeddings from $E^r$ (Table 1). $R$ is then left multiplied by $Q$ to obtain an output of shape $(B, H, L, L)$ that can be directly added to the result of $QK^\top$. Even though $R$ is shared across batch and heads, $(L, L, D_h)$ is still the dominating factor and is prohibitively large for long sequences (Table 2).
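To make the cost of the gather concrete, here is a NumPy sketch of the relative term computed by explicitly building the intermediate tensor. The layout of the embedding table (row $L-1-d$ holds the embedding for "$d$ steps back"), its sharing across heads, and the function name are assumptions made for this illustration; entries for future keys are filled with the distance-zero embedding because they are masked out afterwards.

```python
import numpy as np

def relative_logits_gather(Q, Er):
    """Relative attention term via an explicit (L, L, Dh) gather.

    Q:  (B, H, L, Dh) queries.
    Er: (L, Dh) relative embeddings; row L-1-d embeds "d steps back".
    """
    L = Q.shape[-2]
    i = np.arange(L)[:, None]                 # query positions
    j = np.arange(L)[None, :]                 # key positions
    # Distance from query back to key; future keys are clipped to 0
    # since those entries are masked in causal attention anyway.
    distance = np.clip(i - j, 0, L - 1)
    R = Er[L - 1 - distance]                  # (L, L, Dh): the costly part
    # For every query i and key j, take the dot product of Q[i] and R[i, j].
    logits = np.einsum('bhid,ijd->bhij', Q, R)
    return logits, R
```

The tensor `R` holds one $D_h$-dimensional vector for every query-key pair, which is the quadratic-times-depth memory cost that Section 4.3 removes.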


4.3 Memory efficient implementation of relative position-based attention

We improve the implementation of relative attention by reducing the intermediate memory requirement from $O(L^2 D)$ to $O(LD)$. We observe that all of the terms we need from $QR^\top$ are already available if we directly multiply $Q$ of shape $(B, H, L, D_h)$ with $E^{r\top}$, where $E^r$ has shape $(L, D_h)$ (Table 1). As $Q$ is indexed by absolute query positions and $E^r$ is indexed by relative distances, the result is an absolute-by-relative indexed tensor. However, because the rows of $QK^\top$ are indexed by absolute query positions and its columns by absolute key positions, we need to “skew” $QE^{r\top}$ so that the relative distance terms are added to the query-by-key entries that bear that relative distance (Equation 5).

Implementation | Instantiating $R$, tensor shape | Memory complexity
Shaw et al. shaw2018self | Gather $R$ from $E^r$, shape $(L, L, D_h)$ | $O(L^2 D)$
Ours | Directly use $E^r$, which has shape $(L, D_h)$ | $O(LD)$
Table 1: Memory requirements for instantiating the relative positional representation $R$ or its analogous term.

Implementation | Relative term | Batch matmul tensor shapes | Memory | Time
Shaw et al. shaw2018self
Table 2: Overall memory and time complexity for computing the relative attention term.
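A quick back-of-the-envelope calculation shows what the two memory complexities mean in practice. The sequence length, hidden size, and 4-byte floats used here are illustrative assumptions, not values taken from the tables above.

```python
def relative_memory_bytes(L, D, bytes_per_float=4):
    """Memory for the relative representation under each formulation."""
    gathered = L * L * D * bytes_per_float   # explicit gather: O(L^2 D)
    direct = L * D * bytes_per_float         # embeddings only: O(L D)
    return gathered, direct

gathered, direct = relative_memory_bytes(L=2048, D=512)
print(f"gathered: {gathered / 1e9:.2f} GB")
print(f"direct:   {direct / 1e6:.2f} MB")
print(f"ratio:    {gathered // direct}x")
```

The ratio between the two is exactly $L$, so the saving grows with sequence length, which is why the improvement matters most for long performances.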

4.3.1 The “skewing” procedure

Hence, we propose a “skewing” procedure to transform an absolute-by-relative indexed matrix into an absolute-by-absolute indexed matrix. For simplicity, the matrices illustrated in Figure 3 correspond to skewing the last two dimensions of $QE^{r\top}$, an absolute-by-relative indexed matrix of shape $(L, L)$, with rows indexed by query positions and columns indexed by relative distance, so that column $c$ corresponds to $L - 1 - c$ time steps back. Hence its leftmost column corresponds to $L - 1$ time steps back and the rightmost column corresponds to 0 time steps back. The goal is to skew this matrix into one that is absolute-by-absolute indexed, with rows indexed by queries and columns by keys. The “skewing” procedure is as follows.

  1. Pad a dummy column vector of length $L$ before the leftmost column (Figure 3 middle panel).

  2. Reshape the matrix to have shape $(L + 1, L)$ (Figure 3 right panel).

  3. Slice that matrix to retain only the last $L$ rows and all the columns, resulting in an $(L, L)$ matrix again, but now absolute-by-absolute indexed (Figure 3 right panel).

Figure 3: Steps for “skewing” an absolute-by-relative indexed matrix into an absolute-by-absolute indexed matrix. The grey portions are either masked in self-attention or dummy entries introduced by the skewing procedure. Zeros correspond to positions where the relative distance is zero. For instance, the resultant matrix has relative distance of zero on its diagonal. The purple dotted outlined rectangle in the middle and right panel correspond to the entries that will be removed.
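The three steps of the skewing procedure translate almost directly into NumPy. The sketch below assumes the relative embedding table has one row per distance, with row $L-1-d$ embedding "$d$ steps back", and is shared across heads; the function names are illustrative.

```python
import numpy as np

def skew(rel):
    """(..., L, L) absolute-by-relative -> (..., L, L) absolute-by-absolute."""
    L = rel.shape[-1]
    # 1. Pad a dummy column before the leftmost column: (..., L, L + 1).
    pad_width = [(0, 0)] * (rel.ndim - 1) + [(1, 0)]
    padded = np.pad(rel, pad_width)
    # 2. Reshape to (..., L + 1, L); this shifts row i right by L - 1 - i.
    reshaped = padded.reshape(rel.shape[:-2] + (L + 1, L))
    # 3. Keep only the last L rows.
    return reshaped[..., 1:, :]

def relative_logits_skew(Q, Er):
    """Relative term without the (L, L, Dh) intermediate tensor.

    Q:  (B, H, L, Dh) queries.
    Er: (L, Dh) relative embeddings; row L-1-d embeds "d steps back".
    """
    return skew(Q @ Er.T)   # (B,H,L,Dh) x (Dh,L) -> (B,H,L,L)

# Tiny example: after skewing, entry (i, j) with j <= i holds rel[i, L-1-(i-j)].
rel = np.arange(9).reshape(3, 3)
print(skew(rel))
```

Entries above the diagonal of the result are either dummy zeros or wrapped-around values from the next row; they correspond to future keys and are removed by the causal mask, so only the lower triangle needs to be correct.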

5 Experiments

5.1 J.S. Bach Chorales

J.S. Bach chorales is a canonical dataset used for evaluating generative models for music (e.g. allan2005harmonising ; boulanger2012modeling ; liang2016bachbot ; hadjeres2016style ; huang2017coconet ). It consists of score-based four-part chorales. We first discretize the scores onto a 16th-note grid, and then serialize them by iterating through all the voices within a time step and then advancing time (see Section 3.1 for more details). As there is a direct correspondence between position in sequence and position on the timing/instrument grid in a piece, adding relative position representations could make it easier to learn this grammar. We indeed see relative attention drastically improve negative log-likelihood (NLL) over baseline Transformer (Table 3). This improvement is also reflected in sample quality. The samples now maintain the necessary timing/instrument grid, always advancing four steps before advancing in time. As local timing is maintained, the model is able to capture timing on a more global level, giving rise to regular phrasing, as shown in Figure 4.

Figure 4: Comparing unconditioned generation from baseline self-attention (left) and relative self-attention (right). The green vertical boxes indicate the endings of (sub)phrases where cadences are held.

In addition to relative attention, we also explored enhancing absolute timing through concatenating instead of adding the sinusoids to the input embeddings. This allows the model to more directly learn its absolute positional mapping. This further improves performance for both the baseline and relative transformer (Table 3).
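The difference between adding and concatenating positional sinusoids can be sketched as follows. The sinusoid layout (sine half followed by cosine half), the separate depth for the concatenated signal, and the function names are assumptions for this example; the text above does not specify them.

```python
import numpy as np

def positional_sinusoids(length, depth):
    """Sinusoidal position signal of shape (length, depth); depth must be even."""
    positions = np.arange(length)[:, None]
    rates = 1.0 / (10000 ** (np.arange(0, depth, 2) / depth))
    angles = positions * rates                      # (length, depth / 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def add_positions(embeddings):
    # Baseline: position and content share the same dimensions.
    return embeddings + positional_sinusoids(*embeddings.shape)

def concat_positions(embeddings, pos_depth):
    # Variant: position gets its own dimensions, so the model does not
    # have to disentangle it from the content embedding.
    length = embeddings.shape[0]
    pos = positional_sinusoids(length, pos_depth)
    return np.concatenate([embeddings, pos], axis=-1)
```

Concatenation widens the input from $D$ to $D$ plus the positional depth, so the first layer's projections need to accept the larger input.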

To compare to prior work, we choose Coconet as it is one of the best performing models that has also been evaluated on the 16th-note grid using the canonical dataset split in the literature. To compare directly, we re-evaluated Coconet to obtain note-wise losses on the validation set. (Frame-wise losses were reported in some earlier papers to compare to models such as RNN-RBM, which model “chords”. Coconet can be evaluated under both note-wise and frame-wise losses, and its open-sourced code supports both.) For the Transformer models (abbreviated as TF), we tuned hyperparameters over the number of layers (L in {4,5,6}), attention hidden size (att in {256, 512}) and pointwise feedforward hidden size (ff in {512, 1024}).

5.1.1 Generalizing relative attention to capture relational information

A musical event bears multiple attributes, such as timing, pitch, and instrument. To capture more relational information, we extend relative attention to capture pairwise distances on additional attributes. For example, separate relative embeddings can be learned for timing, $E^t$, and for pitch, $E^p$. $E^t$ has entries corresponding to how many sixteenth notes apart two positions are, while $E^p$ embeds the pairwise pitch interval. We call this relative music attention, where additional relative terms can be added to modulate attention (Equation 6). However, this approach is not directly scalable beyond J.S. Bach Chorales because it involves explicitly gathering relative embeddings for $R^t$ and $R^p$, resulting in a memory complexity of $O(L^2 D)$ as in Shaw et al. shaw2018self . This is due to relative information being computed based on content as opposed to content-invariant information such as position in sequence. In our experiments, we found it helpful to add these terms only to the first attention layer, perhaps because it is closest to the raw input content.
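The content-dependent relations can be illustrated on the serialized Bach representation. The sketch below computes the pairwise timing and pitch-interval matrices and gathers embeddings for them; the embedding table sizes, the index offsets, and the function names are assumptions for this example, and it does not reproduce Equation 6.

```python
import numpy as np

def pairwise_relations(pitches, num_voices=4):
    """Pairwise timing and pitch relations for a serialized chorale."""
    pitches = np.asarray(pitches)
    # Tokens are interleaved by voice, so the time step is position // voices.
    steps = np.arange(len(pitches)) // num_voices
    time_delta = steps[:, None] - steps[None, :]          # in sixteenth notes
    pitch_interval = pitches[:, None] - pitches[None, :]  # in semitones
    return time_delta, pitch_interval

def gather_relative(E, relation, offset):
    """Look up one embedding per query-key pair: (L, L) -> (L, L, Dh)."""
    return E[relation + offset]

# First two time steps of Figure 1, serialized as S A T B S A T B.
pitches = [67, 62, 59, 43, 67, 62, 59, 43]
time_delta, pitch_interval = pairwise_relations(pitches)

rng = np.random.default_rng(0)
Dh = 4
E_pitch = rng.normal(size=(2 * 127 + 1, Dh))   # intervals -127 ... 127
R_pitch = gather_relative(E_pitch, pitch_interval, offset=127)
print(time_delta[4, 0], pitch_interval[1, 0], R_pitch.shape)
```

Because the pitch intervals depend on the notes themselves, the index matrix differs for every sequence and the gathered tensor cannot be replaced by a single shared matrix multiplication followed by skewing, which is why this variant keeps the quadratic-times-depth memory cost.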

Model variation Validation NLL
Coconet (CNN, chronological, 64L, 128 3x3f)
Coconet (CNN, orderless, 64L, 128 3x3f)
Transformer (TF) baseline (decoder-only) vaswani2017attention (5L, 256att, 1024ff)
TF baseline + concat positional sinusoids (cps)
TF baseline + concat positional sinusoids, instrument labels (cpsi)
Relative Transformer shaw2018self (5L, 512att, 512ff, 256r)
Relative Transformer + cpsi
Relative Music Transformer + cpsi
Table 3: Note-wise validation NLL: J.S. Bach Chorales at 16th notes.
Note on the orderless Coconet row: Coconet is an instance of OrderlessNADE, an ensemble over orderings. The chronological loss evaluates the model as autoregressive, from left to right. We can also evaluate the model as a mixture, by averaging its losses over multiple random orderings. This is a lower bound. It is intractable to sample from but can be approximated through Gibbs sampling.

5.2 Piano-e-Competition

We use the first 6 years of Piano-e-Competition because these years have corresponding MIDI data released, resulting in about 1100 pieces, split 80/10/10. The MIDI consists of performed classical piano music with expressive dynamics and timing, calling for a MIDI-like event-based representation (see Section 3.2 for more details). We compare to Magenta’s PerformanceRNN (LSTM, which first used this dataset) simon2017performance and LookBack RNN (LSTM with attention) waite2016generating . LookBack RNN uses an input representation that requires monophonic music with barlines, which is information that is not present in performed polyphonic music data; hence we simply adopt its architecture. Table 4 shows that Transformer-based architectures fit this dataset better than LSTM-based models.

Model variation Validation NLL
Performance RNN (LSTM) (3L, 1024hs)
LSTM with attention (3L, 1024hs, 1024att)
Baseline Transformer (decoder-only) vaswani2017attention
Relative Transformer shaw2018self with our efficient formulation
Table 4: Validation NLL for Piano-e-Competition dataset, with event-based representation

5.2.1 Qualitative priming experiments

When primed with an initial motif as shown in Figure 5, we see that Transformer with relative attention, baseline Transformer with regular attention and LSTM perform qualitatively differently. Relative Transformer reuses the motif in a diverse set of ways, while baseline Transformer uses the motif in a more uniform fashion. LSTM uses the motif initially but soon drifts off to other material.

Note that the samples are generated at twice the length the models were trained on. Relative attention was able to generalize to lengths longer than those seen in training, but the baseline Transformer deteriorates beyond its training length. In the listening test, we kept our samples at the length the models were trained on to remove the effect of generalizing beyond training length.

Figure 5: Comparing primed samples from different models trained on the Piano-e-Competition dataset. The top left small score is the prime, which is from Chopin’s Étude Op. 10, No. 5 (Black Keys). The top, middle, and bottom rows are samples from Transformer with relative attention, baseline Transformer with regular attention, and PerformanceRNN (LSTM), respectively. Repeated motives and structure are seen in samples from Transformer with relative attention (top), but less so in samples from the other models.

5.2.2 Human evaluations

To compare the perceived sample quality of the different models trained on the Piano-e-Competition dataset, and their ability to generate a continuation for a priming sequence, we carried out a listening test comparing the baseline (vanilla) Transformer, Transformer with relative attention, PerformanceRNN, and the validation set of the dataset. The study procedure was as follows. Participants were presented with two musical excerpts that shared a common priming sequence. For each excerpt, the priming sequence was played, followed by 2.5 seconds of silence, followed by the priming sequence again and a continuation of that sequence. The continuations were either sampled from one of the models or extracted from our validation set. We evaluated all possible pairs in the space of data and model samples, except pairs from the same model. Each continuation had a length of 512 events using the encoding described in Section 3.2. This corresponds to the length the models were trained on, which removes the deteriorating effect that occurs when the baseline Transformer is asked to generate beyond the length it was trained on. Participants were asked which excerpt they thought was more musical on a Likert scale. 180 such ratings were collected, with each source involved in 30 pair-wise comparisons and each comparison completed by 3 different participants.

Figure 6 shows the number of comparisons in which an excerpt from each model was selected as more musical. Our listening test clearly demonstrates the improvement in sample quality gained by using relative attention over the baseline Transformer model.

Figure 6: Results of our listening tests, showing the number of times each model/real data won in a pairwise comparison. Black error bars indicate estimated standard deviation of means. Vanilla Transformer is the baseline Transformer, while Relative Transformer is Transformer with relative attention.

Further, a Kruskal-Wallis H test of the ratings showed that there was a statistically significant difference between the models. A post-hoc analysis using the Wilcoxon signed-rank test with Bonferroni correction showed that participants rated samples from the relative Transformer as more musical than samples from the baseline Transformer. We did not observe a statistically significant difference between samples from our validation set and the relative Transformer model, or between samples from the relative Transformer and the LSTM. The effects are probably less pronounced between relative Transformer and LSTM because we used samples half the length of those shown in Figure 5 to prevent the baseline Transformer from deteriorating. This weakens the comparison on long-term structure.

6 Conclusion

We showed that Transformer with relative attention is well suited for generative modeling of symbolic music. As relative attention has also been shown to improve performance in other domains such as machine translation shaw2018self , this could imply that the mechanism is able to capture varying notions of distances and periodicities, ranging from actual timing to word ordering, from phrasing to grammar, and from fine-grained positional distances to possibly bucketed longer distances. The ability of self-attention based models to expand upon a prime suggests that this approach may also be relevant for other problems in text generation. The original formulation was memory inefficient, deterring researchers from studying it on longer sequences. Our algorithmic contribution on scaling relative attention makes this possible and allows relative attention to be a viable solution for practitioners in other domains. For example, it can be useful for dialogue tasks, where sequence lengths vary widely and generation deteriorates for long sequences. The success of this approach for music motivates further research on the inductive biases of Transformer, in addition to the benefits of augmenting content-invariant information.