1 Introduction
A musical piece often consists of recurring elements, from motives to phrases to sections such as verse-chorus. To generate a coherent piece, a model needs to remember the elements that came before, in order to repeat them, vary and further develop them, and also to know how to create contrast and surprise. Self-attention mechanisms are a natural fit for this challenge, as they offer direct access to the generated history, allowing the model to choose the level of detail or summarization to derive from it.
The attention-based Transformer architecture vaswani2017attention has shown compelling results in generating Wikipedia articles with coherent long-term structure and consistency liu2018generatin . Like written language, a musical score can be represented with discrete symbols and has a hierarchical syntax, with notes being roughly like characters and motives of notes like words. Unlike text that can be represented as a single sequence, music unfolds with multiple streams of events occurring simultaneously. Hence, building a language model for music requires making decisions on how to serialize a polyphonic texture. Furthermore, there is the question of how timing corresponds or does not correspond to position in sequence.
We explore Transformer on two distinct musical genres that demand very different representations, allowing us to investigate how well the architecture can capture different properties of music. J.S. Bach Chorales is a canonical dataset used for evaluating generative models for music allan2005harmonising ; boulanger2012modeling ; goel2014polyphonic ; liang2016bachbot ; hadjeres2016style . It is small and constrained, consisting of pieces with traditional harmony, always in four voices, and predominantly moving on a sixteenth-note timing grid. After serializing the score into a single sequence (described in Section 3.1), the most common sequence length is 1024. Here there is a direct correspondence between position in sequence and time in the piece. Samples from the baseline Transformer show that the architecture is not able to maintain the strict timing grid, causing voices to misalign.
This corroborates the observation in shaw2018self that Transformer does not have a strong notion of either absolute or relative timing. The only source of timing information comes from the positional sinusoids added to the input embedding, which can be difficult to disentangle. shaw2018self proposes adding a relative positional representation that allows similarity comparisons to be sensitive to how far apart two tokens are in a sequence. For the Bach dataset this is a good match, as distance in sequence position corresponds to time deltas. We indeed see a drastic improvement in both sample quality and perplexity after adding relative attention. The samples maintain the timing grid, and with this foundation the model is able to capture timing on a more global level, giving rise to regular phrases. We also generalize the notion of relative representations to other attributes of music such as pitch, allowing for domain-informed invariance.
However, the Bach dataset is a special case where the musical score can be discretized on a fixed grid. In contrast, the second dataset we explore, Piano-e-Competition, consists of MIDI recorded from actual contestant performances, which involve expressive timing and dynamics. Time advances in milliseconds and the sequences become too long if discretized at a fixed time unit. Here we break the direct correspondence between position in sequence and actual time in music and adopt a MIDI-like event based representation, which advances time in variable deltas as consecutive events happen (described in Section 3.2). This dataset is much larger, and the music is also more complex, consisting of virtuosic piano music composed in the 18th through 20th centuries, which requires much larger models. In order to test if relative attention is still beneficial on this more fluid event-based representation, we had to improve memory usage.
1.1 Contributions
This paper makes two main contributions. First, we improve the memory consumption of the relative attention mechanism proposed in shaw2018self . By transforming between absolute- and relative-position-based tensors, as further described in Section 4.3, we reduce the memory usage of the relative-position-based tensor instantiations from $O(L^2 D)$ to $O(LD)$, where $L$ is the length of the sequence and $D$ the hidden size of the model. Second, we show in a series of experiments on music generation that relative attention is critical for both grid-like and event-based representations, as they carry strong sequential dependencies. Relative self-attention results in more consistent sample quality for unconditioned generation and in models that can generate sequences longer than those in the training set. When primed, the model generates continuations that develop the prime in a coherent fashion and exhibit long-term structure. We conducted a listening test and found that samples from models with relative self-attention were perceived as more coherent than those from the baseline Transformer model of vaswani2017attention . With these experiments we demonstrate that relative self-attention has potential as a productive tool for musicians, as it can listen more attentively to input sequences and generate in a fashion that builds on existing material.
2 Related work
Sequence models have been the canonical choice for modeling music, from Hidden Markov Models (HMMs baum1966statistical ) allan2005harmonising ; farbood2001analysis , to Recurrent Neural Networks (RNNs rumelhart1988learning ) and Long Short-Term Memory networks (LSTMs hochreiter1997long ) eck2002finding ; liang2016bachbot ; simon2017performance , to bidirectional LSTMs (BLSTMs graves2005framewise ) hadjeres2016deepbach . Successful application of sequential models to polyphonic music often requires serializing the musical score or performance into a single sequence, e.g. by interleaving different instruments or voices. Alternatively, a 2D pianoroll-like representation (further described in Section 3.1) can be decomposed into a sequence of multi-hot pitch vectors, whose joint probability distributions can be captured using Restricted Boltzmann Machines (RBMs smolensky1986information ; hinton2006fast ) or Neural Autoregressive Distribution Estimators (NADE larochelle2011neural ). Pianorolls are also image-like and can be modeled by Convolutional Neural Networks (CNNs), trained for example under the objectives of Generative Adversarial Networks (GANs goodfellow2014generative ) dong2017musegan or of orderless NADE uria2014deep ; uria2016neural , which trains over all orderings of factorizations huang2017coconet .

We show the first use of the self-attention-based Transformer model for music generation, and explore the robustness of relative attention on two distinct representations and datasets that span from fixed to highly expressive timing, among other differences. We see that Transformer outperforms LSTMs on modeling the Piano-e-Competition dataset. This task requires, among other things, repeating and varying motives across a long sequence. This could be easier for Transformers because the maximum path length between any two tokens is constant, while for LSTMs it is O(n). In particular, self-attention can see distant motifs as relevant because it has direct access to its entire past regardless of distance and is designed to attend based on similarity. Relative attention amplifies this by learning the typical periods of repetition. This inductive bias toward learning relational information, as opposed to absolute-position-based patterns, allows Transformers to also generalize beyond observed lengths.
Self-attention can be thought of as related to self-similarity: the former maps the input through different projections to queries and keys, while the latter uses the same projection for both. Self-similarity has been used, for example, in lattner2016imposing in a style-transfer-like fashion, where the self-similarity structure of a piece serves as a template objective for gradient descent to modify an input score to bear a similar repetition structure.
The Transformer architecture has also been extended to other non-language domains, such as image generation and speech recognition parmar2018image ; poveytime . The primary challenges there have been to compute attention over long sequences, because the time and memory complexity of self-attention grows quadratically with the number of positions. The common solution has been to restrict the attention computation to local windows around the query position. parmar2018image also extended self-attention to 2 dimensional grids, again with local windows to avoid the quadratic factor. Partly to reduce memory requirements for complex music, we represent music as 1D sequences (see Section 3). As timing is central to music, we focus on improving timing through a more efficient implementation of relative attention, building upon shaw2018self .
3 Domain-specific representations
Adapting sequence models for music requires making decisions on how to serialize a polyphonic texture. The data type, whether score or performance, makes certain representations more natural for encoding all the information needed while still resulting in reasonable sequence lengths.
3.1 Serialized instrument/time grid (J.S.Bach Chorales)
The first dataset, J.S. Bach Chorales, consists of four-part score-based choral music. The time resolution is sixteenth notes, making it possible to use a serialized grid-like representation. Figure 1 shows how a pianoroll (left) can be represented as a grid (right), following huang2017coconet . The rows show the MIDI pitch number of each of the four voices, from top to bottom soprano ($S$), alto ($A$), tenor ($T$) and bass ($B$), while the columns are discretized time, advancing in sixteenth notes. Longer notes such as quarter notes are broken down into multiple repetitions of the same pitch. To serialize the grid into a sequence, we interleave the parts: we first iterate through all the voices at time step 1, then move to the next column and iterate through the voices again from top to bottom, and so on. The resulting sequence is $S_1 A_1 T_1 B_1 S_2 A_2 T_2 B_2 \ldots$, where the subscript gives the time step. After serialization, the most common sequence length is 1024. Each token is represented as a one-hot encoding of its pitch.
Figure 1: A pianoroll (left) and the corresponding serialized grid representation (right); for example, the soprano voice holds MIDI pitch 67 over four sixteenth-note steps (S: 67, 67, 67, 67).
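To make the serialization concrete, below is a minimal sketch (ours, not the authors' code) that interleaves a pianoroll-style grid into a single token sequence; the `grid` array, its shape, and the pitch values are hypothetical.

```python
import numpy as np

# Hypothetical 4-voice grid: rows are S, A, T, B; columns are sixteenth-note steps.
# Each entry is a MIDI pitch number (longer notes are repeated across columns).
grid = np.array([
    [67, 67, 67, 67],   # soprano
    [62, 62, 62, 62],   # alto
    [59, 59, 57, 57],   # tenor
    [43, 43, 45, 45],   # bass
])

def serialize(grid):
    """Interleave voices column by column: S1 A1 T1 B1 S2 A2 T2 B2 ..."""
    num_voices, num_steps = grid.shape
    return [int(grid[v, t]) for t in range(num_steps) for v in range(num_voices)]

tokens = serialize(grid)
# tokens == [67, 62, 59, 43, 67, 62, 59, 43, 67, 62, 57, 45, 67, 62, 57, 45]
# Each token would then be mapped to a one-hot vector over the pitch vocabulary.
```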
3.2 MIDI-like event-based (Piano-e-Competition)
The second dataset, Piano-e-Competition, consists of polyphonic piano performances with expressive timing and dynamics. The time resolution here is on the millisecond level, so a grid representation would result in sequences that are too long. Instead, the polyphonic performance is serialized into a sequence of one-hot encoded events, as proposed in simon2017performance .

First, the input MIDI files are preprocessed to extend note durations based on sustain pedal control events. The sustain pedal is considered to be down whenever a sustain control change is encountered with a value of 64 or more; the sustain pedal is then considered up after a control change with a value below 64. Within a period where the sustain pedal is down, the duration of each note is extended to either the beginning of the next note of the same pitch or the end of the sustain period, whichever happens first. If the original duration extends beyond the time when the sustain pedal is down, that original duration is used.
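A minimal sketch of this preprocessing step, assuming notes and pedal spans are given as simple Python structures with times in seconds; the data structures and helper name are ours, not the authors' pipeline.

```python
def extend_sustained_notes(notes, pedal_spans):
    """notes: list of dicts {'pitch': int, 'start': float, 'end': float}.
    pedal_spans: list of (pedal_down_time, pedal_up_time) intervals.
    Returns notes with durations extended while the sustain pedal is down."""
    extended = []
    for note in notes:
        end = note['end']
        for down, up in pedal_spans:
            # Only notes that end while the pedal is down are affected.
            if down <= note['end'] <= up:
                # Extend to the next onset of the same pitch, or to the pedal release,
                # whichever happens first; never shorten the original duration.
                next_same_pitch = min(
                    (n['start'] for n in notes
                     if n['pitch'] == note['pitch'] and n['start'] > note['start']),
                    default=up)
                end = max(note['end'], min(next_same_pitch, up))
        extended.append({**note, 'end': end})
    return extended
```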
Next, the MIDI note events are converted into a sequence drawn from the following vocabulary: 128 NOTE_ON events for starting a note with one of the 128 MIDI pitches, 128 NOTE_OFF events for ending a note with one of the 128 MIDI pitches, 100 TIME_SHIFT events representing forward time shifts in 10ms increments from 10ms to 1s, and 32 SET_VELOCITY events setting the velocity for subsequent NOTE_ON events, with the 128 possible MIDI velocities quantized into 32 bins. An example performance encoding is illustrated in Figure 2.
Figure 2: An example performance encoding; the event sequence begins SET_VELOCITY<80>, NOTE_ON<60>, ...
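As an illustration, here is a minimal sketch of the event vocabulary and of encoding a single note; the helper name and the example note are hypothetical and only approximate the representation described above.

```python
# Vocabulary sizes from the representation above.
NUM_PITCHES = 128          # NOTE_ON / NOTE_OFF
NUM_TIME_SHIFTS = 100      # 10 ms ... 1 s in 10 ms steps
NUM_VELOCITY_BINS = 32     # 128 MIDI velocities quantized into 32 bins

def encode_note(pitch, velocity, onset_s, offset_s):
    """Encode one note (hypothetical helper) as a list of symbolic events."""
    events = [f"SET_VELOCITY<{(velocity * NUM_VELOCITY_BINS) // 128}>",
              f"NOTE_ON<{pitch}>"]
    # Advance time in 10 ms increments, emitting at most 1 s per TIME_SHIFT event.
    remaining_ms = int(round((offset_s - onset_s) * 1000))
    while remaining_ms > 0:
        shift = min(remaining_ms, 1000)
        events.append(f"TIME_SHIFT<{shift}>")
        remaining_ms -= shift
    events.append(f"NOTE_OFF<{pitch}>")
    return events

# A middle C (pitch 60) at velocity 80 held for 1.5 seconds:
print(encode_note(60, 80, 0.0, 1.5))
# ['SET_VELOCITY<20>', 'NOTE_ON<60>', 'TIME_SHIFT<1000>', 'TIME_SHIFT<500>', 'NOTE_OFF<60>']
```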
4 Model
We investigate the Transformer architecture vaswani2017attention as a generative model for symbolic polyphonic music. While typically cast as an encoder-decoder model, it can also be adapted for language modeling by instantiating a decoder-only model liu2018generatin . We first introduce the basic building blocks of Transformer, and then summarize how shaw2018self incorporates relative positional representation. In contrast to translation, music relies heavily on timing and its sequence lengths are much longer. We show in Section 4.3 how we can reduce the memory requirements of relative attention to scale to more complex music data.
4.1 Background: Transformer decoder architecture
The Transformer decoder is an autoregressive generative model, using primarily self-attention mechanisms. Positional sinusoids are added to the embeddings of the input sequence vaswani2017attention . Each layer consists of two sub-layers, a self-attention sub-layer and a feedforward sub-layer. The self-attention sub-layer computes for each position in the input sequence a weighted sum of its past based on how relevant past positions are to the current query.
More specifically, each input position is represented as a $D$-dimensional vector, which is then projected through different weight matrices to form a query $Q$, key $K$ and value $V$. Multiple attention heads are typically used to allow the model to focus on different parts of the history. These are supported by first splitting the queries, keys, and values into $H$ parts along the depth dimension. Equation 1 shows how multi-head attention between the queries and keys is computed in tensor form as $QK^\top$, a batch matrix multiplication of a tensor of shape $(B, H, L, D_h)$ by another of the same shape transposed, resulting in shape $(B, H, L, L)$, where $L$ is the length of an entire sequence, $B$ is the batch size, $H$ the number of heads, and $D_h = D/H$ the per-head depth.

For tasks akin to language modeling, the upper triangle of the last two dimensions (corresponding to the query and key positions respectively) is masked out, because queries can only attend to past key positions, resulting in an autoregressive model. The softmax is then performed over key positions, yielding a weighted summary of $V$ of shape $(B, H, L, D_h)$:

$$Z = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{D_h}}\right) V \qquad (1)$$
The feedforward (FF) sub-layer then takes the output of the attention sub-layer and applies two point-wise dense layers along the depth dimension, as shown in Equation 2, where $W_1, W_2$ and $b_1, b_2$ are the weights and biases of those two layers.

$$\mathrm{FF}(Z) = \mathrm{ReLU}(Z W_1 + b_1) W_2 + b_2 \qquad (2)$$
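The following NumPy sketch (our illustration, not the original implementation) computes masked scaled dot-product attention for a single head; shapes and random inputs are hypothetical.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Q, K, V: arrays of shape (L, D_h). Returns the weighted summary Z of shape (L, D_h)."""
    L, D_h = Q.shape
    logits = Q @ K.T / np.sqrt(D_h)                 # (L, L): query-by-key relevance
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)
    logits = np.where(mask, -1e9, logits)           # queries attend only to past/current keys
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V                              # (L, D_h)

L, D_h = 6, 8
rng = np.random.default_rng(0)
Z = masked_attention(rng.normal(size=(L, D_h)),
                     rng.normal(size=(L, D_h)),
                     rng.normal(size=(L, D_h)))
```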
4.2 Relative positional self-attention
As the Transformer model relies solely on positional sinusoids to represent timing information vaswani2017attention , Shaw et al. shaw2018self introduced relative position representations to allow attention to be informed by how far apart two positions are in a sequence. This involves learning a separate relative position embedding, where each pairwise distance is given an embedding entry. These embeddings are then used to modulate relevance, as shown in Equation 3.
In practice, Shaw et al. shaw2018self implement Equation 4 by first multiplying $Q$ and an intermediate tensor $R$ before adding the result to $QK^\top$. $R$ has shape $(L, L, D_h)$, hence a memory complexity of $O(L^2 D)$. It is constructed by first computing pairwise distances for each of the query-by-key positions ($L$ by $L$), and then gathering the corresponding embeddings from the relative embedding matrix $E^r$ (Table 1). $R$ is then left-multiplied by $Q$ to obtain $S^{rel}$ of shape $(B, H, L, L)$, which can be directly added to the result of $QK^\top$. Even though $R$ is shared across batch $B$ and heads $H$, its $O(L^2 D)$ size is still the dominating factor and is prohibitively large for long sequences (Table 2).

$$\mathrm{RelativeAttention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top + S^{rel}}{\sqrt{D_h}}\right) V \qquad (3)$$

$$S^{rel} = Q R^\top \qquad (4)$$
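For concreteness, here is a NumPy sketch (ours, with hypothetical shapes) of the gather-based construction of $R$ and of the relative term $QR^\top$; note the explicit $(L, L, D_h)$ intermediate, which Section 4.3 eliminates.

```python
import numpy as np

L, D_h = 6, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(L, D_h))    # queries for one batch element and one head
# Relative embeddings: row r corresponds to a key r - (L-1) positions from the query,
# so row 0 is L-1 steps back and row L-1 is 0 steps back (matching Section 4.3.1).
E_r = rng.normal(size=(L, D_h))

# Shaw et al.: gather an (L, L, D_h) intermediate R, where R[i, j] is the embedding
# of the (clipped) relative distance between query i and key j.
i, j = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
rel_index = np.clip(j - i + (L - 1), 0, L - 1)
R = E_r[rel_index]                      # the O(L^2 D) intermediate tensor
S_rel = np.einsum("id,ijd->ij", Q, R)   # relative logits Q R^T, shape (L, L)
```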
4.3 Memory efficient implementation of relative position-based attention
We improve the implementation of relative attention by reducing its intermediate memory requirement from $O(L^2 D)$ to $O(LD)$. We observe that all of the terms we need from $QR^\top$ are already available if we directly multiply $Q$, of shape $(B, H, L, D_h)$, with $E^r$, of shape $(L, D_h)$, transposed (Table 1). As $Q$ is indexed by absolute query positions and $E^r$ is indexed by relative distances, the result is an absolute-by-relative indexed tensor. However, since the rows of $S^{rel}$ are indexed by absolute query positions and its columns by absolute key positions, we need to “skew” $QE^{r\top}$ so that the relative distance terms are added to the query-by-key entries that bear that relative distance (Equation 5).

Table 1: Instantiating the relative position information.

Implementation | Instantiating relative embeddings, tensor shape | Memory complexity
---|---|---
Shaw et al. shaw2018self | Gather $R$ of shape $(L, L, D_h)$ from $E^r$ | $O(L^2 D)$
Ours | Directly use $E^r$, which has shape $(L, D_h)$ | $O(LD)$
Table 2: Computing the relative attention term.

Implementation | Relative term | Batch matmul tensor shapes | Memory | Time
---|---|---|---|---
Shaw et al. shaw2018self | $QR^\top$ | $(B, H, L, D_h) \times (L, L, D_h)$ | $O(L^2 D)$ | $O(L^2 D)$
Ours | $\mathrm{Skew}(QE^{r\top})$ | $(B, H, L, D_h) \times (L, D_h)$ | $O(LD)$ | $O(L^2 D)$
$$S^{rel} = \mathrm{Skew}(Q E^{r\top}) \qquad (5)$$
4.3.1 The “skewing” procedure
Hence, we propose a “skewing” procedure to transform an absolute-by-relative indexed matrix into an absolute-by-absolute indexed matrix. For simplicity, the matrices illustrated in Figure 3 correspond to skewing the last two dimensions of $QE^{r\top}$, an absolute-by-relative indexed matrix of shape $(L, L)$, with rows indexed by query positions and columns indexed by the pairwise distance between query and key positions subtracted from $L-1$. Hence its leftmost column corresponds to keys $L-1$ time steps back and its rightmost column to 0 time steps back. The goal is to skew this matrix into one that is absolute-by-absolute indexed, with rows indexed by queries and columns by keys. The “skewing” procedure amounts to a pad, a reshape, and a slice, as sketched below.
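A minimal NumPy sketch of the skewing steps (our illustration of the procedure: pad a dummy column on the left, reshape, then slice); variable names are ours.

```python
import numpy as np

def skew(rel_logits):
    """Convert absolute-by-relative logits Q @ E_r.T of shape (L, L) into
    absolute-by-absolute logits S_rel, where entry (i, j) holds the term for
    query i attending to key j (valid for j <= i; the rest is masked anyway)."""
    L = rel_logits.shape[0]
    padded = np.pad(rel_logits, ((0, 0), (1, 0)))   # pad a dummy column on the left: (L, L+1)
    reshaped = padded.reshape(L + 1, L)             # reshaping shifts each row one step further
    return reshaped[1:, :]                          # slice off the first row: (L, L)

L, D_h = 6, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(L, D_h))
E_r = rng.normal(size=(L, D_h))                     # row L-1 is 0 steps back, row 0 is L-1 back
S_rel = skew(Q @ E_r.T)                             # relative logits to add to Q @ K.T
```

On the causal (lower-triangular) entries this reproduces the gather-based $S^{rel}$ sketched in Section 4.2, without ever instantiating the $(L, L, D_h)$ tensor.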
5 Experiments
5.1 J.S. Bach Chorales
J.S. Bach chorales is a canonical dataset used for evaluating generative models for music (J.S. Bach chorales dataset: https://github.com/czhuang/JSB-Chorales-dataset), e.g. allan2005harmonising ; boulanger2012modeling ; liang2016bachbot ; hadjeres2016style ; huang2017coconet . It consists of score-based four-part chorales. We first discretize the scores onto a sixteenth-note grid, and then serialize them by iterating through all the voices within a time step and then advancing time (see Section 3.1 for more details). As there is a direct correspondence between position in sequence and position on the timing/instrument grid in a piece, adding relative position representations could make it easier to learn this grammar. We indeed see relative attention drastically improve negative log-likelihood (NLL) over the baseline Transformer (Table 3). This improvement is also reflected in sample quality. The samples now maintain the necessary timing/instrument grid, always advancing four steps (one per voice) before advancing in time. As local timing is maintained, the model is able to capture timing on a more global level, giving rise to regular phrasing, as shown in Figure 4.
In addition to relative attention, we also explored enhancing absolute timing through concatenating instead of adding the sinusoids to the input embeddings. This allows the model to more directly learn its absolute positional mapping. This further improves performance for both the baseline and relative transformer (Table 3).
To compare to prior work, we choose Coconet, as it is one of the best-performing models that has also been evaluated on the sixteenth-note grid using the canonical dataset split in the literature. To compare directly, we re-evaluated Coconet to obtain note-wise losses on the validation set. (Frame-wise losses were reported in some earlier papers to compare to models such as RNN-RBM, which model “chords”. Coconet can be evaluated under both note-wise and frame-wise losses, and its open-sourced code supports this.) For the Transformer models (abbreviated as TF), we tuned hyperparameters over the number of layers (L in {4, 5, 6}), attention hidden size (att in {256, 512}) and pointwise feedforward hidden size (ff in {512, 1024}).
5.1.1 Generalizing relative attention to capture relational information
A musical event bears multiple attributes, such as timing, pitch, instrument, etc. To capture more relational information, we extend relative attention to capture pairwise distances on additional attributes. For example, separate relative embeddings can be learned for timing ($E^t$) and for pitch ($E^p$). $E^t$ has entries corresponding to how many sixteenth notes apart two positions are, while $E^p$ embeds the pairwise pitch interval. We call this relative music attention, where additional relative terms are added to modulate attention (Equation 6). However, this approach is not directly scalable beyond J.S. Bach Chorales, because it involves explicitly gathering relative embeddings for the timing and pitch terms, resulting in a memory complexity of $O(L^2 D)$, as in Shaw et al.’s shaw2018self . This is because the relative information is computed based on content, as opposed to content-invariant information such as position in sequence. In our experiments, we found it helpful to add these terms only to the first attention layer, perhaps because it is closest to the raw input content.
$$\mathrm{RelativeMusicAttention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top + S^{rel} + S^{t} + S^{p}}{\sqrt{D_h}}\right) V \qquad (6)$$
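Below is a sketch (ours, with hypothetical embedding tables and inputs) of how such an attribute-based relative term could be gathered for pitch intervals; note the explicit $(L, L, D_h)$ gather that keeps this variant at $O(L^2 D)$ memory.

```python
import numpy as np

L, D_h = 8, 4
MAX_INTERVAL = 12                        # hypothetical: clip pitch intervals to +/- one octave
rng = np.random.default_rng(0)

Q = rng.normal(size=(L, D_h))            # queries for one head
pitches = rng.integers(48, 84, size=L)   # MIDI pitch of each token (hypothetical)
E_p = rng.normal(size=(2 * MAX_INTERVAL + 1, D_h))  # one embedding per signed interval

# Pairwise pitch intervals, clipped and shifted to index into E_p.
interval = np.clip(pitches[:, None] - pitches[None, :], -MAX_INTERVAL, MAX_INTERVAL)
R_p = E_p[interval + MAX_INTERVAL]       # (L, L, D_h) gather: the O(L^2 D) bottleneck

S_p = np.einsum("id,ijd->ij", Q, R_p)    # pitch-based relative term added to Q @ K.T
```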
Table 3: Validation NLL on the J.S. Bach Chorales dataset.

Model variation | Validation NLL
---|---
Coconet (CNN, chronological, 64L, 128 3x3f) |
Coconet (CNN, orderless, 64L, 128 3x3f) |
Transformer (TF) baseline (decoder-only) vaswani2017attention (5L, 256att, 1024ff) |
TF baseline + concat positional sinusoids (cps) |
TF baseline + concat positional sinusoids, instrument labels (cpsi) |
Relative Transformer shaw2018self (5L, 512att, 512ff, 256r) |
Relative Transformer + cpsi |
Relative Music Transformer + cpsi |

(Note: Coconet is an instance of OrderlessNADE, an ensemble over orderings. The chronological loss evaluates the model as autoregressive, from left to right. We can also evaluate the model as a mixture, by averaging its losses over multiple random orderings; this is a lower bound. It is intractable to sample from but can be approximated through Gibbs sampling.)
5.2 Piano-e-Competition
We use the first six years of the Piano-e-Competition because these years have corresponding MIDI data released (Piano-e-Competition dataset, under competition history: http://www.piano-e-competition.com/), resulting in about 1100 pieces, split 80/10/10. The MIDI consists of performed classical piano music with expressive dynamics and timing, calling for a MIDI-like event-based representation (see Section 3.2 for more details). We compare to Magenta’s PerformanceRNN (LSTM, which first used this dataset) simon2017performance and LookBack RNN (LSTM with attention) waite2016generating . LookBack RNN uses an input representation that requires monophonic music with barlines, information that is not present in performed polyphonic music data, hence we adopt only their architecture. Table 4 shows that Transformer-based architectures fit this dataset better than LSTM-based models.
Table 4: Validation NLL on the Piano-e-Competition dataset.

Model variation | Validation NLL
---|---
Performance RNN (LSTM) (3L, 1024hs) |
LSTM with attention (3L, 1024hs, 1024att) |
Baseline Transformer (decoder-only) vaswani2017attention |
Relative Transformer shaw2018self with our efficient formulation |
5.2.1 Qualitative priming experiments
When primed with an initial motif as shown in Figure 5, we see that Transformer with relative attention, baseline Transformer with regular attention and LSTM perform qualitatively differently. Relative Transformer reuses the motif in a diverse set of ways, while baseline Transformer uses the motif in a more uniform fashion. LSTM uses the motif initially but soon drifts off to other material.
Note that these samples are generated at twice the length the models were trained on. Relative attention was able to generalize to lengths longer than those seen in training, while the baseline Transformer deteriorates beyond its training length. For the listening test, we kept our samples at the length the models were trained on to remove the effect of generalizing beyond training length.
5.2.2 Human evaluations
To compare the perceived sample quality of the different models trained on the Piano-e-Competition dataset, and their ability to generate a continuation for a priming sequence, we carried out a listening test comparing the baseline (vanilla) Transformer, Transformer with relative attention, PerformanceRNN, and the validation set of the dataset. The procedure was as follows: participants were presented with two musical excerpts that shared a common priming sequence. For each excerpt, the priming sequence was played, followed by 2.5 seconds of silence, followed by the priming sequence again and a continuation of that sequence. The continuations were either sampled from one of the models or extracted from our validation set. We evaluated all possible pairs of sources, excluding pairs drawn from the same source. Each continuation was 512 events long, using the encoding described in Section 3.2; this corresponds to the length the models were trained on, chosen to avoid the deterioration the baseline Transformer exhibits when generating beyond its training length. Participants were asked which excerpt they thought was more musical, on a Likert scale. 180 such ratings were collected, with each source involved in 30 pairwise comparisons and each comparison completed by 3 different participants.
Figure 6 shows the number of comparisons in which an excerpt from each model was selected as more musical. Our listening test clearly demonstrates the improvement in sample quality gained by using relative attention over the baseline Transformer model.
Figure 6: Results of our listening tests, showing the number of times each model/real data won in a pairwise comparison. Black error bars indicate the estimated standard deviation of the means. Vanilla Transformer is the baseline Transformer, while Relative Transformer is Transformer with relative attention.
Further, a Kruskal-Wallis H test of the ratings showed a statistically significant difference between the models. A post-hoc analysis using the Wilcoxon signed-rank test with Bonferroni correction showed that participants rated samples from the relative Transformer as significantly more musical than samples from the baseline Transformer. We did not observe a statistically significant difference between samples from our validation set and the relative Transformer model, or between samples from the relative Transformer and the LSTM. The effects are probably less pronounced between the relative Transformer and the LSTM because we used samples half the length of those shown in Figure 5 to prevent the baseline Transformer from deteriorating, which weakens the comparison on long-term structure.
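As an illustration of this analysis (not the authors' code; the ratings below are hypothetical placeholders), both tests are available in SciPy:

```python
from itertools import combinations
from scipy.stats import kruskal, wilcoxon

# Hypothetical Likert ratings (1-5) aggregated per source; real data came from the listening test.
ratings = {
    "relative_transformer": [5, 4, 5, 4, 5, 4, 5, 5],
    "baseline_transformer": [3, 2, 3, 3, 2, 3, 2, 3],
    "performance_rnn":      [4, 3, 4, 3, 4, 3, 4, 4],
    "validation_data":      [5, 5, 4, 5, 4, 5, 5, 4],
}

# Omnibus test: is there any difference between the sources?
h_stat, p_value = kruskal(*ratings.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")

# Post-hoc pairwise Wilcoxon signed-rank tests with Bonferroni correction.
pairs = list(combinations(ratings, 2))
alpha = 0.05 / len(pairs)
for a, b in pairs:
    stat, p = wilcoxon(ratings[a], ratings[b])
    verdict = "significant" if p < alpha else "n.s."
    print(f"{a} vs {b}: p = {p:.4f} ({verdict} at corrected alpha)")
```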
6 Conclusion
We showed that Transformer with relative attention is well suited for generative modeling of symbolic music. As relative attention has also been shown to improve performance in other domains such as machine translation shaw2018self , this could imply that the mechanism is able to capture varying notions of distance and periodicity, ranging from actual timing to word ordering, from phrasing to grammar, and from fine-grained positional distances to possibly bucketed longer distances. The ability of self-attention-based models to expand upon a prime suggests that this approach may also be relevant for other problems in text generation. The original formulation was memory-inefficient, deterring researchers from studying it on longer sequences. Our algorithmic contribution on scaling relative attention makes this possible and makes relative attention a viable option for practitioners in other domains; for example, it could be useful for dialogue tasks, where sequence lengths vary widely and generation deteriorates for long sequences. The success of this approach for music motivates further research on the inductive biases of Transformer, in addition to the benefits of augmenting content-invariant information.
References
- [1] Moray Allan and Christopher KI Williams. Harmonising chorales by probabilistic inference. Advances in neural information processing systems, 17:25–32, 2005.
- [2] Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics, 37(6):1554–1563, 1966.
- [3] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. International Conference on Machine Learning, 2012.
- [4] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. Musegan: Symbolic-domain music generation and accompaniment with multi-track sequential generative adversarial networks. arXiv preprint arXiv:1709.06298, 2017.
- [5] Douglas Eck and Juergen Schmidhuber. Finding temporal structure in music: Blues improvisation with lstm recurrent networks. In Neural Networks for Signal Processing, 2002. Proceedings of the 2002 12th IEEE Workshop on, pages 747–756. IEEE, 2002.
- [6] Mary Farbood and Bernd Schöner. Analysis and synthesis of palestrina-style counterpoint using markov chains. In ICMC, 2001.
- [7] Kratarth Goel, Raunaq Vohra, and JK Sahoo. Polyphonic music generation by modeling temporal dependencies using a rnn-dbn. In International Conference on Artificial Neural Networks, pages 217–224. Springer, 2014.
- [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- [9] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
- [10] Gaëtan Hadjeres and François Pachet. Deepbach: a steerable model for bach chorales generation. arXiv preprint arXiv:1612.01010, 2016.
- [11] Gaëtan Hadjeres, Jason Sakellariou, and François Pachet. Style imitation and chord invention in polyphonic music with exponential families. arXiv preprint arXiv:1609.05152, 2016.
- [12] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
- [13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [14] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Doug Eck. Counterpoint by convolution. In International Conference on Music Information Retrieval, 2017.
- [15] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AISTATS, volume 1, page 2, 2011.
- [16] Stefan Lattner, Maarten Grachten, and Gerhard Widmer. Imposing higher-level structure in polyphonic music generation using convolutional restricted boltzmann machines and constraints. arXiv preprint arXiv:1612.04742, 2016.
- [17] Feynman Liang. Bachbot: Automatic composition in the style of bach chorales. Masters thesis, University of Cambridge, 2016.
- [18] Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.
- [19] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
- [20] Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur. A time-restricted self-attention layer for asr.
- [21] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.
- [22] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
- [23] Ian Simon and Sageev Oore. Performance rnn: Generating music with expressive timing and dynamics. https://magenta.tensorflow.org/performance-rnn, 2017.
- [24] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.
- [25] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. arXiv preprint arXiv:1605.02226, 2016.
- [26] Benigno Uria, Iain Murray, and Hugo Larochelle. A deep and tractable density estimator. In International Conference on Machine Learning, pages 467–475, 2014.
- [27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
- [28] Elliot Waite. Generating long-term structure in songs and stories. https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn, 2016.