1 Introduction
Over the last decade, machine learning techniques have emerged as the tool of choice for the design of symbolic music generation models [1], with deep learning being the most widely used [2]. Deep generative models have been successfully applied to several different music generation tasks, e.g., monophonic music generation [3, 4, 5], polyphonic music generation [6, 7], and creating musical renditions with expressive timing and dynamics [8, 9]. However, most of these models assume sequential generation of music, i.e., the generated music depends only on the music that has preceded it. In other words, the models rely only on the past musical context. This approach does not align with typical human compositional practices, which are often iterative and non-sequential in nature. In addition, the sequential generation paradigm places severe limitations on the degree of interactivity allowed by these models [10, 11]. Once the music is generated, there is no way to tweak specific parts of it so as to conform to users’ aesthetic sensibilities or compositional requirements.
In this paper, we seek to address these problems by incorporating future musical context into the generation process. Specifically, the task is to train models to fill in missing information in musical scores, taking into account the complete musical context, both past and future. In essence, this is similar to inpainting, where the objective is to reconstruct missing or degraded parts of any kind of media [12]. For music, inpainting has traditionally been used for restoration purposes [13] or to remove unwanted artifacts such as clipping [14, 15] and packet loss [16]. In contrast, we investigate models for Musical Score Inpainting (see fig:inpaint_schematic and Section 3.1) as tools for music creation which can aid people in (i) getting new musical ideas based on specific styles, (ii) joining different musical sections together, and (iii) modifying or extending solos. In addition, such models can allow interactive music generation by enabling users to change the musical context and get new suggestions based on the updated context.
Our main technical contribution is a novel approach for musical score inpainting which relies on latent representation-based deep generative models. These models are trained to compress information from high-dimensional spaces, e.g., the space of all bar melodies, to low-dimensional latent spaces. While these latent spaces have been shown to be able to encode hidden attributes of musical data (see Section 2.3), the primary form of interaction with latent spaces has been through simple operations such as attribute vectors [17, 18] or linear interpolations [19, 20]. Using the proposed method (see Section 3), we demonstrate that Recurrent Neural Networks (RNNs) can be trained using latent embeddings to learn complex trajectories in the latent space. This, in turn, is used to predict how to fill in missing measures in a piece of symbolic music. Our secondary contributions are: (i) a stochastic training scheme which helps model training and generalization (see Section 3.4), and (ii) a novel data encoding scheme using uneven tick durations that allows encoding triplets without a substantial increase in sequence length (see Section 3.5). The effectiveness of the proposed method is demonstrated using several objective and subjective evaluation methods in Section 4.
2 Related Work
2.1 Audio & Music Inpainting
The first applications of audio inpainting methods were restoration-oriented [13, 14, 21, 16, 22], using different methods such as matrix factorization [14], non-local similarity measures [22], and audio similarity graphs [16]. While these techniques have been useful for audio-based tasks, they are not easily extendable to symbolic music.
For inpainting in the symbolic domain, the early attempts were based on Markov Chain Monte Carlo (MCMC) methods which allowed users to specify certain constraints, e.g., which notes to generate and which to retain [23, 24]. Another approach, proposed by Lattner et al., used iterative gradient descent to force the output of a deep generative model to conform to a specified structural plan [25]. However, methods based on MCMC (which rely on repeated sampling) and those using iterative gradient descent are slow at inference time and hence unsuitable for interactive applications. More recently, Hadjeres et al. proposed the AnticipationRNN framework [10], which used a pair of stacked RNNs to enforce user-defined constraints during inference. This allowed selective regeneration of specific parts of the music (generated or otherwise) using only two forward passes through the RNN pair and enabled real-time generation.
2.2 Variational Auto-Encoders
The Variational Auto-Encoder (VAE) [26] is a type of generative model which uses an auto-encoding [27] framework; during training, the model is forced to reconstruct its input. The architecture comprises an encoder and a decoder. The encoder learns to map real data-points $\mathbf{x}$ from a high-dimensional data-space $X$ to points in a low-dimensional space $Z$, which is referred to as the latent space. The decoder learns to map the latent vectors back to the data-space. VAEs treat the latent vector $\mathbf{z}$ as a random variable and model the generative process as a sequence of sampling operations: $\mathbf{z} \sim p(\mathbf{z})$, and $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$, where $p(\mathbf{z})$ is a prior distribution over the latent space, and $p_\theta(\mathbf{x} \mid \mathbf{z})$ is the conditional pdf. Variational inference [28] is used to approximate the posterior: the KL-divergence [29] between the approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ and the true posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$ is minimized by maximizing the evidence lower bound (ELBO) [26]. The training ensures that the reconstruction accuracy is maximized and that realistic samples are generated when latent vectors are sampled from the prior $p(\mathbf{z})$.
2.3 Leveraging Latent Spaces for Music Generation
Latent representation-based models such as VAEs have been found to be quite useful for several music generation tasks. Bretan et al. used the latent representation of an autoencoder-based model to generate musical phrases [30]. Lattner et al. forced the latent space of a gated autoencoder to learn pitch interval-based representations, which improved the performance of predictive models of music [31, 32]. Latent spaces of music generation models have also been used to explicitly encode and control musical attributes [20, 33, 3, 34], inter-track dependencies [35], and musical genre [36]. These studies show that trained latent spaces are able to encode hidden attributes of musical data which can be leveraged for different music generation tasks. However, latent space traversals have so far relied on simple methods such as attribute vectors [17, 18] or linear interpolations [19, 20].
3 Method
3.1 Problem Statement
We define the score inpainting problem as follows: given a past musical context $C_p$ and a future musical context $C_f$, the modeling task is to generate an inpainted sequence $C_i$ which can connect $C_p$ and $C_f$ in a musically meaningful manner. In other words, the model should be trained to maximize the likelihood $p(C_i \mid C_p, C_f)$. Without much loss of generality, we assume that $C_p$, $C_f$, and $C_i$ comprise $n_p$, $n_f$, and $n_i$ measures of music, respectively.
3.2 Approach
The key motivation behind the proposed method is that the latent embeddings of deep generative models of music encode hidden attributes of music which can be leveraged to perform inpainting. Firstly, we train a VAE model, referred to as MeasureVAE, to reconstruct single measures of music, i.e., the latent vectors of this model map to individual measures of music. Once trained, the encoder of this model can be used to process the past and future context sequences $C_p$ and $C_f$ and output the corresponding latent vector sequences $Z_p$ and $Z_f$. Secondly, we train an RNN-based model, referred to as LatentRNN, to take as input the past and future latent vector sequences ($Z_p$ and $Z_f$) and output a third latent vector sequence $Z_i$, which can be passed through the decoder of MeasureVAE to obtain the inpainted sequence $C_i$.
Effectively, the LatentRNN model learns to traverse the latent space of the MeasureVAE model so as to connect the provided contexts in a musically meaningful manner. The inference is fast since it only requires forward passes through the two models. This overall approach is shown in fig:approach_schematic. We call this joint architecture InpaintNet. While we restrict ourselves to monophonic melodic sequences in this paper, the approach can be extended to other time signatures and polyphonic sequences as well. The individual model architectures are discussed next.
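The data flow of the two-stage approach can be illustrated with a minimal sketch. It is not the released implementation: the function `inpaint` and the three `toy_*` stand-ins are hypothetical names, the "latent space" is a single number (the count of note onsets in a four-tick toy measure), and plain linear interpolation takes the place of the trained LatentRNN purely so that the example runs.

```python
from typing import Callable, List, Sequence

Measure = List[str]   # one measure as a list of tick-level tokens
Latent = List[float]  # one latent vector per measure


def inpaint(
    past: Sequence[Measure],
    future: Sequence[Measure],
    n_inpaint: int,
    encode: Callable[[Measure], Latent],
    latent_rnn: Callable[[List[Latent], List[Latent], int], List[Latent]],
    decode: Callable[[Latent], Measure],
) -> List[Measure]:
    """Encode both contexts, predict the missing latents, decode them."""
    z_past = [encode(m) for m in past]        # measure-wise encoding of the past context
    z_future = [encode(m) for m in future]    # measure-wise encoding of the future context
    z_inpaint = latent_rnn(z_past, z_future, n_inpaint)
    return [decode(z) for z in z_inpaint]     # one decoded measure per predicted latent


# --- Toy stand-ins (assumptions for illustration only) ---
TICKS = 4  # toy measure length


def toy_encode(measure: Measure) -> Latent:
    # "Latent" = number of note onsets, i.e. tokens that are not the hold symbol.
    return [float(sum(tok != "__" for tok in measure))]


def toy_latent_rnn(z_past: List[Latent], z_future: List[Latent], n: int) -> List[Latent]:
    # Placeholder trajectory: interpolate between the two nearest context latents.
    start, end = z_past[-1][0], z_future[0][0]
    return [[start + (k / (n + 1)) * (end - start)] for k in range(1, n + 1)]


def toy_decode(z: Latent) -> Measure:
    onsets = min(TICKS, max(0, round(z[0])))
    return ["C4"] * onsets + ["__"] * (TICKS - onsets)


past = [["C4", "__", "__", "__"]]
future = [["C4", "D4", "E4", "__"]]
result = inpaint(past, future, 1, toy_encode, toy_latent_rnn, toy_decode)
print(result)
```

Only forward calls are involved, which is why inference in this kind of pipeline is cheap; in the actual model the three callables would be the frozen MeasureVAE encoder, the LatentRNN, and the MeasureVAE decoder.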
3.3 Model Architectures
3.3.1 MeasureVAE
The MeasureVAE architecture (see fig:measurevae_schematic) is loosely based on the hierarchical recurrent MusicVAE architecture [3], which proved successful in modeling individual measures of music.
The encoder consists of a learnable embedding layer (operating on tick-level) followed by a bidirectional RNN [37]. The concatenated hidden state from both directions of the RNN is then passed through two identical parallel linear stacks to obtain the mean $\boldsymbol{\mu}$ and variance $\boldsymbol{\sigma}^2$, which are used to sample the latent vector via $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$.
The decoder follows a hierarchical structure where the sampled latent vector is used to initialize the hidden state of a beat-RNN which is unrolled $B$ times (where $B$ is the number of beats in a measure). The output at each step of the beat-RNN is passed through a linear stack before being used to initialize the hidden state of a tick-RNN which is unrolled $T$ times (where $T$ is the number of events/ticks in a beat). The outputs of the tick-RNN are individually passed through a second linear stack which maps them back to the data-space. The hierarchical architecture mitigates the autoregressive nature of the RNN and forces the decoder to use the latent vector more efficiently (as advocated in [3]).
3.3.2 LatentRNN
The LatentRNN model (see fig:latentrnn_schematic) consists of the following sub-components. There are two identical bidirectional RNNs, referred to as PastContextRNN and FutureContextRNN, which process the latent vector sequences for the past and future contexts ($Z_p$ and $Z_f$), respectively. These are unrolled $n_p$ and $n_f$ times, respectively, in order to encode the context sequences. The final hidden states of the two context RNNs are concatenated and then used to initialize the hidden state of a third RNN, referred to as the GenerationRNN, which is unrolled $n_i$ times. The outputs of the GenerationRNN are passed through a linear stack to obtain latent vectors corresponding to the inpainted measures.
The hyperparameters for the model configurations are chosen based on initial experiments and are provided in Table 1. For the RNN layers in both models, Gated Recurrent Units (GRU) [38] are used.

| Model | Component | Configuration |
|---|---|---|
| MeasureVAE | Embedding Layer | i=dict size, o=10 |
| MeasureVAE | EncoderRNN | n=2, i=10, h=512, d=0.5 |
| MeasureVAE | Linear Stack 1 & 2 | i=1024, o=256, n=2, nonlinearity=SELU |
| MeasureVAE | BeatRNN | n=2, i=1, h=512, d=0.5 |
| MeasureVAE | TickRNN | n=2, i=522, h=512, d=0.5 |
| MeasureVAE | Linear Stack 3 | i=512, o=1024, n=1, nonlinearity=ReLU |
| MeasureVAE | Linear Stack 4 | i=512, o=dict size, n=1, nonlinearity=ReLU |
| LatentRNN | PastContextRNN / FutureContextRNN | n=2, i=256, h=512, d=0.5 |
| LatentRNN | GenerationRNN | n=2, i=1, h=1024, d=0.5 |
| LatentRNN | Linear Stack | i=2048, o=256, n=1, nonlinearity=None |

Table 1: Configurations of both models. n: Number of Layers, i: Input Size, o: Output Size, h: Hidden Size, d: Dropout Probability, SELU: Scaled Exponential Linear Unit [39], ReLU: Rectified Linear Unit.
3.4 Stochastic Training Scheme
We propose a novel stochastic training scheme for training the model. For each training batch, the number of measures to be inpainted $n_i$ and the number of measures in the past context $n_p$ are randomly sampled from a uniform distribution. Thus, the number of measures in the future context becomes $n_f = N - n_i - n_p$, where $N$ is the total number of measures in each sequence of the training batch. Using these, the input sequences are split into past, future, and target sequences, and the model is trained to predict the target sequence given the past and future context sequences. This stochastic training scheme ensures that the model learns to deal with variable-length contexts and can perform inpaintings at arbitrary locations.
3.5 Data Encoding Scheme
We use a variant of the encoding scheme proposed by Hadjeres et al. [24] for our data representation. The original encoding scheme quantizes time uniformly using the sixteenth note as the smallest subdivision. For each subdivision or tick, the note which starts on that tick is represented by a token corresponding to the note name. If no note starts on a tick, a special continuation symbol ‘__’ is used to denote that the previous note is held. A rest is treated as a note and has its own special token. The main advantages of this encoding scheme are that (i) it uses only a single sequence of tokens, and (ii) it uses real note names (e.g., separate tokens for A# and Bb), which allows generation of readable sheet music.
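As an illustration of the uniform sixteenth-note encoding described above, the following sketch converts a list of (note name, duration) pairs into a single token sequence. The function name `encode_melody`, the `"rest"` token spelling, and the example melody are assumptions made for this example; only the continuation symbol `__` comes from the text.

```python
from typing import List, Tuple

HOLD = "__"  # continuation symbol: the previous note (or rest) is held


def encode_melody(notes: List[Tuple[str, int]]) -> List[str]:
    """Encode (name, duration) pairs, with durations in sixteenth-note ticks.

    Each event contributes its name on the tick where it starts and the
    hold symbol on every following tick of its duration.
    """
    tokens: List[str] = []
    for name, duration in notes:
        if duration < 1:
            raise ValueError("durations must be at least one tick")
        tokens.append(name)
        tokens.extend([HOLD] * (duration - 1))
    return tokens


# One toy measure in 4/4: quarter note, eighth note, eighth rest, half note.
melody = [("D4", 4), ("F#4", 2), ("rest", 2), ("A4", 8)]
tokens = encode_melody(melody)
print(" ".join(tokens))
```

Because onsets and holds share one sequence, a model only has to predict a single token stream, and keeping spelled note names (F#4 rather than a MIDI number) preserves enharmonic information for notation.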
However, a limitation of using the sixteenth note as the smallest subdivision is that it cannot encode triplets. The naive approach of evenly subdividing the sixteenth note divisions to encode triplets increases the sequence length a factor of which can make the sequence modeling task harder. To mitigate this limitation, we propose a novel uneven subdivision scheme. Each beat is divided into uneven ticks (shown in fig:data_rep). This allows encoding triplets while only increasing the sequence length by a factor of . Consequently, each time signature measure is a sequence of tokens.
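The trade-off between an even fine grid and an uneven grid can be checked with a small calculation. The sketch below assumes, purely for illustration, that the uneven ticks are the union of the sixteenth-note onsets and the eighth-note-triplet onsets within a beat; the scheme actually used is the one shown in fig:data_rep and may differ. The helper `onset_grid` is a hypothetical name.

```python
from fractions import Fraction
from typing import List, Sequence


def onset_grid(divisions: Sequence[int]) -> List[Fraction]:
    """Sorted union of evenly spaced onset grids within one beat.

    `divisions=[4]` gives the sixteenth-note grid, `[3]` the triplet grid,
    and `[4, 3]` the union of both (assumed uneven scheme).
    """
    return sorted({Fraction(k, d) for d in divisions for k in range(d)})


sixteenths = onset_grid([4])   # uniform sixteenth-note ticks
even_fine = onset_grid([12])   # smallest even grid containing both 1/4 and 1/3
uneven = onset_grid([4, 3])    # only the onsets that are actually needed

# Duration of each uneven tick, as a fraction of a beat.
tick_durations = [b - a for a, b in zip(uneven, uneven[1:] + [Fraction(1)])]

print("even fine grid:", len(even_fine), "ticks per beat")
print("uneven grid:   ", len(uneven), "ticks per beat")
print("uneven onsets: ", [str(t) for t in uneven])
print("tick durations:", [str(d) for d in tick_durations])
```

Under this assumption the even grid needs three times as many ticks per beat as the sixteenth-note grid, whereas the uneven grid needs only one and a half times as many, at the cost of ticks that no longer have equal durations.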
4 Experiments
The proposed method is compared with two baseline methods (see Section 4.1) using a dataset of monophonic folk melodies in the Scottish and Irish style taken from the Session website [5]. For the purposes of this work, only melodies with time signature in which the shortest note is greater than or equal to the sixteenth note are considered resulting in approx. melodies. Implementation details and source code are available online (https://github.com/ashispati/InpaintNet).
4.1 Baseline
The performance of the proposed method is compared with the AnticipationRNN model proposed by Hadjeres et al. [10]. This model, referred to as BaseARNN, uses a stack of LSTM-based [40] RNN layers. Each of the RNNs comprises of layers with a hidden size of . In addition to the note-sequence tokens, this model also uses additional metadata information, i.e., tokens to indicate beat and downbeat locations as part of the user-defined constraints. For more details, the readers are directed to [10].
The original model operates on tick-level sequences, and inpainting locations are specified in terms of individual tick locations. Hence, the inpainting locations may or may not be contiguous. In order to make a fair comparison, a second variant of the AnticipationRNN model is considered, referred to as RegARNN, where the stochastic training scheme from Section 3.4 is used instead.
4.2 Training Configuration
The MeasureVAE model was pre-trained using single measures following the standard VAE optimization equation [26] with the $\beta$ weighting scheme [41, 42]. In order to prioritize high reconstruction accuracy, a low value of $\beta$ was used. Pre-training was done for epochs resulting in a reconstruction accuracy of approx. . While this seems to be better than results in [3], we attribute this to the shorter duration of generation (single measures) and the differences in datasets and data encoding. MeasureVAE parameters were frozen after pre-training and no gradient-based updates were performed on these parameters during the InpaintNet model training.
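For reference, a weighted VAE objective of the kind referred to above is commonly written as follows. This is the standard form from the VAE literature [26, 41] in generic notation (encoder $q_\phi$, decoder $p_\theta$, prior $p(\mathbf{z})$, weight $\beta$), not an equation reproduced from this paper:

```latex
\mathcal{L}(\theta, \phi; \mathbf{x}) =
  \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}
    \left[ \log p_\theta(\mathbf{x} \mid \mathbf{z}) \right]
  - \beta \, D_{\mathrm{KL}}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \right)
```

The first term rewards accurate reconstruction and the second pulls the approximate posterior toward the prior. Choosing $\beta < 1$ weakens the second term, which is consistent with the stated goal of prioritizing reconstruction accuracy.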
The Adam algorithm [43] was used for model training, with a learning rate of e, , , and e. To ensure consistency, all models were trained for epochs (with early stopping) with the same batch size using a sub-sequence length of measures ( ticks). For the InpaintNet and RegARNN models, the number of measures to be inpainted and the number of past measures were randomly selected: , . This ensured that past and future contexts each contain at least measure. For the baseline models, teacher forcing was used with a probability of .
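The random selection of inpainting length and location can be sketched as below. The function names, the parameter names, and the numeric bounds in the usage lines (16 measures, 2 to 6 inpainted measures) are assumptions chosen for illustration and are not the values used in the experiments. The sketch only shows one way to draw the two counts uniformly while guaranteeing a non-empty past and future context.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_split(
    num_measures: int,
    min_inpaint: int,
    max_inpaint: int,
    min_context: int = 1,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int, int]:
    """Draw (n_past, n_inpaint, n_future) that sum to `num_measures`."""
    rng = rng or random.Random()
    # Leave room for at least `min_context` measures on each side.
    max_inpaint = min(max_inpaint, num_measures - 2 * min_context)
    if max_inpaint < min_inpaint:
        raise ValueError("sequence too short for the requested bounds")
    n_inpaint = rng.randint(min_inpaint, max_inpaint)  # inclusive bounds
    n_past = rng.randint(min_context, num_measures - n_inpaint - min_context)
    n_future = num_measures - n_inpaint - n_past
    return n_past, n_inpaint, n_future


def split_sequence(
    measures: Sequence[T], n_past: int, n_inpaint: int
) -> Tuple[List[T], List[T], List[T]]:
    """Split one sequence of measures into (past, target, future)."""
    past = list(measures[:n_past])
    target = list(measures[n_past:n_past + n_inpaint])
    future = list(measures[n_past + n_inpaint:])
    return past, target, future


rng = random.Random(0)
n_past, n_inpaint, n_future = sample_split(16, 2, 6, rng=rng)
past, target, future = split_sequence(list(range(16)), n_past, n_inpaint)
print(n_past, n_inpaint, n_future, target)
```

Drawing one split per batch, rather than per sequence, keeps all sequences in a batch the same shape, which avoids padding inside the context and generation RNNs.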
4.3 Predictions on Test Data
Two experiments were conducted to evaluate the predictive power of the models.
The first experiment considered the average token-wise negative log-likelihood (NLL) on a held-out test set. The results (see first 3 rows of Table 2) indicate that our proposed model outperforms both baselines, showing an improvement of approx. in the NLL over the RegARNN model and approx. over the BaseARNN model.
| Model Variant | Test NLL |
|---|---|
| BaseARNN | 0.662 |
| RegARNN | 0.402 |
| InpaintNet (Our Method) | 0.300 |
| PastInpaintNet | 0.643 |
| FutureInpaintNet | 0.481 |
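To make the metric concrete, the sketch below computes an average token-wise NLL from the probabilities a model assigns to the correct tokens, and then derives relative NLL reductions from the test values in the table above. The helper names are hypothetical, the natural logarithm is an assumption, and reading "improvement" as relative reduction is one possible interpretation rather than the definition used in the paper.

```python
import math
from typing import Sequence


def average_token_nll(target_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood over tokens.

    `target_probs[k]` is the probability the model assigned to the
    correct token at position k (natural logarithm assumed).
    """
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which `proposed` lowers the baseline value."""
    return (baseline - proposed) / baseline


# Test NLL values copied from the table above.
test_nll = {"BaseARNN": 0.662, "RegARNN": 0.402, "InpaintNet": 0.300}

for name in ("RegARNN", "BaseARNN"):
    gain = relative_reduction(test_nll[name], test_nll["InpaintNet"])
    print(f"InpaintNet vs {name}: {gain:.1%} lower NLL")
```

Under these assumptions the reduction is roughly a quarter relative to RegARNN and somewhat more than half relative to BaseARNN.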
The next experiment compared the models by varying the number of measures to be inpainted $n_i$. fig:n_bar shows the average token-wise NLL when $n_i$ was increased from to . Again, our proposed model outperforms both baselines. It should be noted that since the sub-sequence length is constant at measures, increasing $n_i$ means that the available context is reduced. Thus, there is an expected drop in the performance with increasing $n_i$ as the models are forced to make longer predictions with less contextual information. However, the InpaintNet model performs better even when forced to predict beyond the training limit of measures.
4.4 Ablation Studies
In order to further ascertain the effectiveness of the proposed approach, ablation studies were conducted to evaluate the benefit of adding past and future context information. Specifically, we trained two variants of the InpaintNet model which relied on only one type of contextual information. The first model, referred to as PastInpaintNet, considered only the past context $C_p$ as input, whereas the second model, referred to as FutureInpaintNet, considered only the future context $C_f$. The last two rows of Table 2 summarize the performance of these ablation models. It is clear that both past and future contexts are important for the modeling process. In addition, we also tried training a variant of the InpaintNet model with an untrained (randomly initialized) MeasureVAE model. This model failed to train properly achieving an NLL of approx. . This indicates that a structured latent space where latent vectors are trained to encode hidden data attributes is important for training the LatentRNN model.
4.5 Qualitative Analysis
Considering that we are primarily interested in the aesthetic quality of the inpaintings, we encourage the readers to browse through the inpainting examples provided in the supplementary material (https://ashispati.github.io/inpaintnet/). We consider some of those examples in the analysis below.
fig:inpaint_example shows sample inpaintings by the models for one of the melodies in the test set. While the BaseARNN model collapses to produce long half notes which do not effectively reflect the surrounding context, the other two models do better. Both the RegARNN and InpaintNet models generate rhythmically consistent inpaintings. The InpaintNet, in particular, mimics the rhythmic properties of the context better. For instance, measures and of the inpainted measures match the rhythm of measures , , and . Also, measure matches measure . However, the use of G (subdominant scale degree in D major) in the half note to end measure is unusual. We observed that in other examples also, the InpaintNet model occasionally produces pitches which are anomalous, either out-of-key or not fitting in the context. The RegARNN model, on the other hand, tends to stay in key. Additional examples are provided in the supplementary material.
One advantage of working with the latent space is that the sampling operation, inherent in the VAE inference process, ensures that for the same context we can get different inpainting results. fig:inpaint_ex2 shows three such generations for the context of fig:inpaint_example. It is interesting to note that the base rhythm is retained across all three inpaintings. This feature is particularly interesting from an interactive music generation perspective, as this model can be used to quickly provide users with multiple ideas and will be investigated further in future work.
4.6 Subjective Listening Study
To evaluate the perceived quality of the inpainted measures, a listening test was conducted to compare our proposed model against the two baselines. A set of melodies from the held-out test set were randomly selected and their first measures were extracted. The models were then used to inpaint measures (measure number to ) in these melodic excerpts. Participants were presented with pairs of melodic excerpts and asked to select the one in which they thought the inpainted measures fit better within the surrounding context. In some of the pairs, one melodic excerpt was the real data (without any inpainting). Each participant was presented with such pairs. A total of individuals participated in the study ( comparisons). The location of the inpainted measures was kept consistent across all examples so as to prevent confusion among participants and allow them to focus better on the inpainted measures.
The Bradley-Terry model [44, 45] for paired comparisons was used to get an estimate of how the proposed model performs against the baselines and the real data (see fig:bt). While the proposed model expectedly has a very low probability of winning against the real data (wins approx. out of times), it performs only at par with the baseline models (with probability approx. ). Significance tests using the Wilcoxon signed rank test were further conducted, which validated that differences between the proposed model and the baselines were not statistically significant ($p$-value ). This was unexpected since the proposed model showed significant improvement over the baselines in the NLL metric. Further dividing the study population into two groups differing in musical proficiency (based on the Ollen index [46]) showed that, comparatively, the group with greater musical proficiency favored the generations from the InpaintNet model more than the group with less musical proficiency.
Additional analysis revealed that cases where the InpaintNet model performed the worst (maximum losses against the baselines) had anomalies in the predicted pitch similar to those discussed in Section 4.5. Specifically, they either had a single out-of-key note (e.g., F note in the G major scale) or used a pitch or interval not used in the provided contexts. We conjecture that it is these anomalous pitch predictions which lead to poor perceptual ratings in spite of the model performing better in terms of modeling rhythmic features. This will be analyzed further in future studies.
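For readers unfamiliar with the Bradley-Terry model used in the listening study, the sketch below fits it to pairwise preference counts with the minorization-maximization (MM) update described in [45]. The function names and the win counts are hypothetical illustration data, not the study's results, and the code assumes every item wins at least one comparison so that all strengths stay positive.

```python
from typing import Dict, Tuple


def bradley_terry(wins: Dict[Tuple[str, str], int], iters: int = 200) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths from pairwise win counts.

    `wins[(a, b)]` is the number of times `a` was preferred over `b`.
    Returns strengths normalized to sum to 1. Assumes each item has at
    least one win (otherwise its strength collapses to zero).
    """
    items = sorted({x for pair in wins for x in pair})
    p = {i: 1.0 / len(items) for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
            # Number of comparisons with each opponent, weighted by current strengths.
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in items
                if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else p[i]
        norm = sum(updated.values())
        p = {i: v / norm for i, v in updated.items()}
    return p


def win_probability(p: Dict[str, float], a: str, b: str) -> float:
    """Model probability that `a` is preferred over `b`."""
    return p[a] / (p[a] + p[b])


# Hypothetical counts: each pair of systems was compared 10 times.
toy_wins = {
    ("real", "model"): 8, ("model", "real"): 2,
    ("real", "base"): 7, ("base", "real"): 3,
    ("model", "base"): 5, ("base", "model"): 5,
}
strengths = bradley_terry(toy_wins)
print({k: round(v, 3) for k, v in strengths.items()})
print(round(win_probability(strengths, "model", "real"), 3))
```

The fitted strengths turn many sparse pairwise votes into a single scale, from which the probability of one system being preferred over another follows directly.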
5 Conclusion
This paper investigates the problem of musical score inpainting and proposes a novel approach to generate multiple measures of music to connect two musical excerpts by using a conditional RNN which learns to traverse the latent space of a VAE. We also improve upon the data encoding and introduce a stochastic training process, which facilitate model training and improve generalization. The proposed model shows good performance across different objective and subjective evaluation experiments. The architecture also enables multiple generations with the same contexts, thereby making it suitable for interactive applications [47]. We think the idea of learning to traverse latent spaces could also be useful for other music generation tasks. For instance, the architecture of the LatentRNN model can be changed to add contextual information from other voices/instruments to perform multi-instrument music generation. Future work will include a more thorough investigation of the anomalies in pitch prediction. A possible way to address them would be to add the context embedding as input at each step of unrolling the LatentRNN or to use additional regularizers. Another promising avenue for future work is substituting RNNs with attention-based models [48], which have had success in sequential music generation tasks [9].
References
 [1] Rebecca Fiebrink, Baptiste Caramiaux, R Dean, and A McLean. The machine learning algorithm as creative musical tool. Oxford University Press, 2016.
 [2] Jean-Pierre Briot and François Pachet. Deep learning for music generation: Challenges and directions. Neural Computing and Applications, Oct 2018.
 [3] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. In Proc. of the 35th International Conference on Machine Learning (ICML), pages 4364–4373, Stockholmsmässan, Stockholm Sweden, 2018.
 [4] Florian Colombo, Samuel P. Muscinelli, Alexander Seeholzer, Johanni Brea, and Wulfram Gerstner. Algorithmic composition of melodies with deep recurrent neural networks. In Proc. of the 1st Conference on Computer Simulation of Musical Creativity (CSMC), 2016.
 [5] Bob L Sturm, Joao Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. Music transcription modelling and composition using deep learning. In Proc. of the 1st Conference on Computer Simulation of Musical Creativity (CSMC), Huddersfield, UK, 2016.
 [6] Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In Proc. of International Society of Music Information Retrieval Conference (ISMIR), pages 324–331, Suzhou, China, 2017.
 [7] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proc. of 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, 2012.
 [8] Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, and Karen Simonyan. This time with feeling: Learning expressive musical performance. Neural Computing and Applications, pages 1–13, 2018.
 [9] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M Dai, Matthew D Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer. In Proc. of International Conference of Learning Representations (ICLR), New Orleans, USA, 2019.
 [10] Gaëtan Hadjeres and Frank Nielsen. AnticipationRNN: Enforcing unary constraints in sequence generation, with application to interactive music generation. Neural Computing and Applications, Nov 2018.
 [11] Jean-Pierre Briot, Gaëtan Hadjeres, and François Pachet. Deep learning techniques for music generation - A survey. arXiv preprint arXiv:1709.01620, 2017.
 [12] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proc. of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 417–424. ACM Press/Addison-Wesley Publishing Co., 2000.
 [13] Amir Adler, Valentin Emiya, Maria G Jafari, Michael Elad, Rémi Gribonval, and Mark D. Plumbley. Audio inpainting. IEEE Transactions on Audio, Speech, and Language Processing, 20(3):922–932, 2012.
 [14] Çağdaş Bilen, Alexey Ozerov, and Patrick Pérez. Audio declipping via nonnegative matrix factorization. In Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 1–5. IEEE, 2015.
 [15] Christopher Laguna and Alexander Lerch. An efficient algorithm for clipping detection and declipping audio. In Proc. of the 141st AES Convention, Los Angeles, USA, 2016.
 [16] Nathanael Perraudin, Nicki Holighaus, Piotr Majdak, and Peter Balazs. Inpainting of long audio segments with similarity graphs. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(6):1083–1094, 2018.
 [17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pages 3111–3119, 2013.

 [18] Shan Carter and Michael Nielsen. Using artificial intelligence to augment human intelligence. Distill, 2017. https://distill.pub/2017/aia.
 [19] Adam Roberts, Jesse Engel, Sageev Oore, and Douglas Eck. Learning latent representations of music to generate interactive musical palettes. In Proc. of IUI Workshops, 2018.

 [20] Gaëtan Hadjeres, Frank Nielsen, and François Pachet. GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures. In Proc. of IEEE Symp. Series on Computational Intelligence (SSCI), pages 1–7. IEEE, 2017.
 [21] Çağdaş Bilen, Alexey Ozerov, and Patrick Pérez. Joint audio inpainting and source separation. In Proc. of International Conference on Latent Variable Analysis and Signal Separation, pages 251–258. Springer, 2015.
 [22] Ichrak Toumi and Valentin Emiya. Sparse nonlocal similarity modeling for audio inpainting. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 576–580. IEEE, 2018.
 [23] Jason Sakellariou, Francesca Tria, Vittorio Loreto, and François Pachet. Maximum entropy model for melodic patterns. In Proc. of the ICML Workshop on Constructive Machine Learning, Lille, France, 2015.
 [24] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. DeepBach: A steerable model for Bach chorales generation. In Proc. of the 34th International Conference on Machine Learning (ICML), volume 70, pages 1362–1371, Sydney, Australia, 2017.

 [25] Stefan Lattner, Maarten Grachten, and Gerhard Widmer. Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints. Journal of Creative Music Systems, 2, March 2018.
 [26] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proc. of International Conference of Learning Representations (ICLR), Banff, Canada, 2014.

 [27] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proc. of the 25th International Conference on Machine Learning (ICML), pages 1096–1103, Helsinki, Finland, 2008.
 [28] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 [29] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
 [30] Mason Bretan, Gil Weinberg, and Larry Heck. A unit selection methodology for music generation using deep neural networks. In Proc. of the 8th International Conference on Computational Creativity (ICCC), Atlanta, USA, 2016.
 [31] Stefan Lattner, Maarten Grachten, and Gerhard Widmer. A predictive model for music based on learned interval representations. In Proc. of International Society of Music Information Retrieval Conference (ISMIR), pages 26–33, Paris, France, 2018.
 [32] Andreas Arzt and Stefan Lattner. Audio-to-score alignment using transposition-invariant features. In Proc. of International Society of Music Information Retrieval Conference (ISMIR), pages 592–599, Paris, France, 2018.
 [33] Jesse Engel, Matthew Hoffman, and Adam Roberts. Latent constraints: Learning to generate conditionally from unconditional generative models. In Proc. of International Conference on Learning Representations (ICLR), Toulon, France, 2017.
 [34] Ashis Pati and Alexander Lerch. Latent space regularization for explicit control of musical attributes. In ICML Machine Learning for Music Discovery Workshop (ML4MD), Extended Abstract, Long Beach, CA, USA, 2019.
 [35] Ian Simon, Adam Roberts, Colin Raffel, Jesse Engel, Curtis Hawthorne, and Douglas Eck. Learning a latent space of multitrack measures. In Proc. of the 2nd Workshop on Machine Learning for Creativity and Design, Montréal, Québec, 2018.
 [36] Gino Brunner, Andres Konrad, Yuyi Wang, and Roger Wattenhofer. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. In Proc. of International Society of Music Information Retrieval Conference (ISMIR), pages 747–754, Paris, France, 2018.
 [37] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
 [38] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proc. of 32nd International Conference on Machine Learning (ICML), pages 2342–2350, Lille, France, 2015.
 [39] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 971–980, 2017.
 [40] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [41] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proc. of International Conference on Learning Representations (ICLR), Toulon, France, 2017.
 [42] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proc. of the 20th Conference on Computational Language Processing, 2015.
 [43] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of International Conference on Learning Representations (ICLR), San Diego, USA, 2015.
 [44] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
 [45] David R Hunter et al. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004.
 [46] Joy E Ollen. A criterion-related validity test of selected indicators of musical sophistication using expert ratings. PhD thesis, The Ohio State University, 2006.
 [47] Théis Bazin, Ashis Pati, and Gaëtan Hadjeres. A model-agnostic web interface for interactive music composition by inpainting. Neural Information Processing Systems (NeurIPS), 2018. Demonstration Track.
 [48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017.