Music FaderNets: Controllable Music Generation Based On High-Level Features via Low-Level Feature Modelling

High-level musical qualities (such as emotion) are often abstract, subjective, and hard to quantify. Given these difficulties, it is not easy to learn good feature representations with supervised learning techniques, either because of the insufficiency of labels, or the subjectiveness (and hence large variance) in human-annotated labels. In this paper, we present a framework that can learn high-level feature representations with a limited amount of data, by first modelling their corresponding quantifiable low-level attributes. We refer to our proposed framework as Music FaderNets, which is inspired by the fact that low-level attributes can be continuously manipulated by separate "sliding faders" through feature disentanglement and latent regularization techniques. High-level features are then inferred from the low-level representations through semi-supervised clustering using Gaussian Mixture Variational Autoencoders (GM-VAEs). Using arousal as an example of a high-level feature, we show that the "faders" of our model are disentangled and change linearly w.r.t. the modelled low-level attributes of the generated output music. Furthermore, we demonstrate that the model successfully learns the intrinsic relationship between arousal and its corresponding low-level attributes (rhythm and note density), with only around 1% of labelled data. Finally, using the learnt high-level feature representations, we explore the application of our framework in style transfer tasks across different arousal states. The effectiveness of this approach is verified through a subjective listening test.






1 Introduction

We consider low-level musical attributes to be those that are relatively straightforward to quantify, extract and calculate from music, such as rhythm, pitch, harmony, etc. On the other hand, high-level musical attributes refer to semantic descriptors or qualities of music that are relatively abstract, such as emotion, style, genre, etc. Due to the abstractness and subjectivity of these high-level musical qualities, obtaining labels for them typically requires human annotation. However, training conditional models on top of these human-annotated labels using supervised learning might result in sub-par performance because, firstly, obtaining such labels can be costly, so the number of labels collected may be insufficient to train a model that generalizes well [16]; secondly, the annotated labels can have high variance among raters due to the subjectivity of these musical qualities [3, 13].

Instead of inferring high-level features directly from the music sample, we propose to use low-level features as a “bridge” between the music and the high-level features. This is because the relationship between the sample and its low-level features is relatively easy to learn, as the labels are easier to obtain. In addition, we learn the relationship between the low-level features and the high-level features in a data-driven manner. In this paper, we show that the latter works well even with a limited amount of labelled data. Our work relies heavily on the concept that each high-level feature is intrinsically related to a set of low-level attributes: by tweaking the levels of each low-level attribute in a constrained manner, we can achieve a desired change in the high-level feature. This idea is heavily exploited in rule-based systems [4, 32, 11]; however, rule-based systems are often not robust enough, as their capabilities are constrained by the fixed set of predefined rules handcrafted by the authors. Hence, we propose an alternative path, which is to learn these implicit relationships with semi-supervised learning techniques.

To achieve the goals stated above, we intend to build a framework which can fulfill these two objectives:

  • Firstly, the model should be able to control multiple low-level attributes of the music sample in a continuous manner, as if they were controlled by sliding knobs on a console (also known as faders). Each knob should be independent of the others, and should control only the single feature that it is assigned to.

  • Secondly, the model should be able to learn the relationship between the levels of the sliding knobs controlling the low-level features, and the selected high-level feature. This is analogous to learning a preset of the sliding knobs on a console.

We named our model “Music FaderNets”, with reference to musical “faders” and “presets” as described above. Achieving the first objective requires representation learning and feature disentanglement techniques. This motivates us to use latent variable models [29] as we can learn separate latent spaces for each low-level feature to obtain disentangled controllability. Achieving the second objective requires the latent space to have a hierarchical structure, such that high-level information can be inferred from low-level representations. This is achieved by incorporating Gaussian Mixture VAEs [27] in our model.

2 Related Work

2.1 Controllable Music Generation

The application of deep learning techniques for music generation has been rapidly advancing [20, 5, 19, 35, 24]; however, embedding control and interactivity in these systems still remains a critical challenge [5]. Variants of conditional generative models (such as CGAN [34] and CVAE [40]) are used to allow control during generation, and have attained much success, mainly in the image domain. Fader Networks [31] is one of the main inspirations of this work (hence the name Music FaderNets), in which users can modify different visual features of an image using “sliding faders”. However, their approach is built upon a CVAE with an additional adversarial component, which is very different from our approach. Recently, controllable music generation has gained much research interest, both for modelling low-level [37, 17, 36, 12] and high-level features [10, 8]. Specifically, [17] and [36] each proposed a novel latent regularization method to encode attributes along specific latent dimensions, which inspired the "sliding knob" application in this work.

2.2 Disentangled Representation Learning for Music

Disentangled representation learning has been widely used across both the visual [7, 21, 28, 44] and speech domain [22, 16, 41] to learn disjoint subsets of attributes. Such techniques have also been applied to music in several recent works, both in the audio [33, 25, 26] and symbolic domain [6, 43, 2]. The discriminator component in our model draws inspiration from both the explicit conditioning component in the EC2-VAE model [43], and the extraction component in the Ext-Res model [2]. We find that most of the work on disentanglement in symbolic music focuses on low-level features, and is done on monophonic music.

This research distinguishes itself from other related work through the following novel contributions:

  • We combine latent regularization techniques with disentangled representation learning to build a framework that can control various continuous low-level musical attribute values using “faders”, and apply the framework on polyphonic music modelling.

  • We show that it is possible to infer high-level features from low-level latent feature representations, even under weakly supervised scenario. This opens up possibilities to learn good representations for abstract, high-level musical qualities even under data scarcity conditions. We further demonstrate that the learnt representations can be used for controllable generation based on high-level features.

3 Proposed Framework

3.1 Gaussian Mixture Variational Autoencoders

VAEs [30] combine the power of both latent variable models and deep generative models, hence they provide both representation learning and generation capabilities. Given observations X and latent variables z, the VAE learns a graphical model by maximizing the evidence lower bound (ELBO) of the marginal likelihood:

log p(X) ≥ E_{q(z|X)}[log p(X|z)] − D_KL(q(z|X) ‖ p(z)),

where q(z|X) and p(z) represent the learnt posterior and prior distribution respectively. In vanilla VAEs, p(z) is an isotropic, unimodal Gaussian. Gaussian Mixture VAEs (GM-VAEs) [27] extend the prior to a mixture of Gaussian components, which corresponds to learning a graphical model with an extra hierarchy of dependency: p(z) = Σ_y p(y) p(z|y). The newly introduced categorical variable y, whereby y ∈ {1, ..., K}, is a discrete representation of the observations. Hence, a new distribution q(y|X) is introduced to infer the class of each observation, which enables semi-supervised and unsupervised clustering applications.

Following [27], the ELBO of a GM-VAE is derived as:

log p(X) ≥ E_{q(z|X)}[log p(X|z)] − Σ_y q(y|X) D_KL(q(z|X) ‖ p(z|y)) − D_KL(q(y|X) ‖ p(y)).

The original KL loss term from the vanilla VAE is modified into two new terms: (i) the KL divergence between the approximate posterior q(z|X) and the conditional prior p(z|y), marginalized over all Gaussian components; (ii) the KL divergence between the cluster inferring distribution q(y|X) and the categorical prior p(y).
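The two KL terms described above can be sketched numerically as follows. This is a minimal numpy illustration under the diagonal-Gaussian assumption; the function names and the epsilon constant are our own, not from the paper:

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians.
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def gmvae_kl_terms(mu_q, var_q, q_y, mu_p, var_p):
    """q_y[k]: inferred probability of component k; mu_p[k], var_p[k]: priors of component k."""
    K = len(q_y)
    # (i) KL to each component's conditional prior, marginalized over q(y|X).
    kl_z = sum(q_y[k] * kl_diag_gaussians(mu_q, var_q, mu_p[k], var_p[k]) for k in range(K))
    # (ii) KL between q(y|X) and the uniform categorical prior p(y) = 1/K.
    kl_y = np.sum(q_y * np.log(q_y * K + 1e-12))
    return kl_z, kl_y
```

When the posterior matches one of the component priors and the cluster inference is uniform, both terms vanish, which is the expected behaviour of the regularizer.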

Figure 1: Music FaderNets model architecture.

3.2 Model Formulation

Figure 1 shows the model formulation of our proposed Music FaderNets. Input X is a sequence of performance tokens converted from MIDI following [35, 24]. Assume that we want to model a high-level feature with K discrete states, which is related to a set of n low-level features. We denote the latent variable learnt for each low-level feature as z_i, the label for each low-level feature as y_i, and the class inferred from each latent variable as c_i, with i ∈ {1, ..., n}.

The joint probability of X, the latent codes and the classes is written as:

p(X, z_1, ..., z_n, c_1, ..., c_n) = p(X | z_1, ..., z_n) ∏_{i=1}^{n} p(z_i | c_i) p(c_i).

We assume that each categorical prior p(c_i), c_i ∈ {1, ..., K}, is uniformly distributed, and the conditional distributions p(z_i | c_i) are diagonal-covariance Gaussians with learnable means and constant variances. For each low-level attribute, we learn an approximate posterior q(z_i | X), parameterized by an encoder neural network, that samples the latent code z_i which represents the i-th low-level feature.

The latent codes are then passed through the remaining three components: (1) Discriminator: to ensure that z_i incorporates information of the assigned low-level feature, it is passed through a discriminator network to reconstruct the low-level feature label y_i; (2) Reconstruction: all latent codes are fed into a global decoder network which parameterizes the conditional probability p(X | z_1, ..., z_n) to reconstruct the input X; (3) Cluster Inference: this component parameterizes the cluster inference probability q(c_i | X), with c_i representing the state of the selected high-level feature. It can be approximated by q(c_i | z_i) [23], where the cluster state is predicted from each latent code instead of X.

To incorporate the “sliding knob” concept, we need to map the change of value of an arbitrary dimension of z_i (denoted as z_i^r, shown in Figure 1 as the darkened dimension) linearly to the change of value of the low-level feature label y_i. After comparing previous methods on conditioning and regularization [17, 40, 31, 36], we choose to adopt [36], which applies a latent regularization loss term written as L_reg,i = MAE(tanh(D_{z_i^r}), sgn(D_{y_i})), where D_{z_i^r} and D_{y_i} denote the distance matrices of the z_i^r values and the y_i values within a training batch, respectively, and MAE denotes the mean absolute error. We provide a detailed comparison study across each proposed method in Section 4.2. Hence, if we define, for each low-level feature:

L_KL,i = Σ_{c_i} q(c_i | z_i) D_KL(q(z_i | X) ‖ p(z_i | c_i)) + D_KL(q(c_i | z_i) ‖ p(c_i)),

then the entire training objective can be derived as:

L = −E[log p(X | z_1, ..., z_n)] + Σ_{i=1}^{n} ( β L_KL,i + L_reg,i + L_disc,i ),   (2)

where L_disc,i is the loss of the discriminator reconstructing the label y_i from z_i, and β is the KL weight hyperparameter [21]. The first term in Eq. 2 represents the reconstruction loss. The second, KL loss term (derived from the ELBO of the GM-VAE) corresponds to the cluster inference component, which allows both supervised and unsupervised training settings, depending on the availability of the label c_i. If we omit the cluster inference component, the model conforms to a vanilla VAE by replacing this term with the KL loss term of the VAE. The third term is the latent regularization loss applied during the encoding process. The last term is the reconstruction loss of the low-level feature labels, which corresponds to the discriminator component. All encoders and decoders are implemented with gated recurrent units (GRUs), and teacher forcing is used to train all decoders.
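The latent regularization term of [36] can be sketched as follows. This is a minimal numpy illustration in which we assume the distance matrices hold signed pairwise differences within the batch; the function name is ours:

```python
import numpy as np

def latent_reg_loss(z_dim_vals, attr_vals):
    """Pati-et-al.-style regularization: the ordering of values along the
    regularized latent dimension is pushed to match the attribute ordering."""
    d_z = z_dim_vals[:, None] - z_dim_vals[None, :]   # D_z: batch x batch signed differences
    d_y = attr_vals[:, None] - attr_vals[None, :]     # D_y: same for attribute labels
    return np.mean(np.abs(np.tanh(d_z) - np.sign(d_y)))
```

A batch whose latent-dimension values are ordered like its attribute values yields a near-zero loss, while a reversed ordering is heavily penalized.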

4 Experimental Setup

In this work, we chose arousal (which refers to the energy level conveyed by the song [38]) as the high-level feature to be modelled. In order to select relevant low-level features, we refer to musicology studies such as [11, 14, 15], which suggest that arousal is related to features including rhythm density, note density, key, dynamics, tempo, etc. Among these low-level features, we focus on modelling the score-level features in this work (i.e., rhythm, note, and key).

4.1 Data Representation and Hyperparameters

We use two polyphonic piano music datasets for training: the Yamaha Piano-e-Competition dataset [35], and the VGMIDI dataset [13], which contains piano arrangements of 95 video game soundtracks in MIDI, annotated with valence and arousal values in the range of -1 to 1. The arousal labels are used to guide the cluster inference component in our GM-VAE model using semi-supervised learning. We extract every 4-beat segment from each music sample, with a beat resolution of 4 (quarter-note granularity). Each segment is encoded into event-based tokens following [35] with a maximum sequence length of 100. This results in a total of 103,934 and 1,013 sequences from the Piano-e-Competition and VGMIDI datasets respectively, which are split into train/validation/test sets with a ratio of 80/10/10.

Inspired by [43], we represent each rhythm label, y_r, as a sequence of 16 one-hot vectors with 3 dimensions, denoting an onset for any pitch, a holding state, or a rest. The rhythm density value is calculated as the number of onsets in the sequence divided by the total sequence length. Each note label, y_n, is represented by a sequence of 16 one-hot vectors with 16 dimensions, each dimension denoting the number of notes being played or held at that time step (we assume a minimum polyphony of 0 and a maximum of 15). The note density value is the average number of notes being played or held per time step. For key, we use the key analysis tool from music21 [9] to extract the estimated global key of each 4-beat segment. The key is represented using a 24-dimensional one-hot vector, accounting for major and minor modes. In this work, we directly concatenate the key vector as a conditioning signal with z_r and z_n as an input to the global decoder for reconstruction. For representing arousal, we split the arousal ratings into two clusters: a high arousal cluster for positive labels, and a low arousal cluster for negative labels. We remove labels annotated within the range [-0.1, 0.1] so as to reduce ambiguity in the annotations.
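The rhythm density and note density computations described above can be sketched as a toy illustration. The step encoding constants below are our own naming, not from the paper:

```python
# Toy encodings (shapes assumed from the text): each rhythm step is one of
# {onset, hold, rest}; each note step stores the number of notes sounding.
ONSET, HOLD, REST = 0, 1, 2

def rhythm_density(rhythm_seq):
    """Number of onsets divided by total sequence length."""
    return sum(1 for step in rhythm_seq if step == ONSET) / len(rhythm_seq)

def note_density(note_counts):
    """Average number of notes played or held per time step."""
    return sum(note_counts) / len(note_counts)
```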

The hyperparameters are tuned according to the results on the validation set using grid search. The mean vectors of the Gaussian components are all randomly initialized with Xavier initialization, whereas the variance vectors are kept fixed at a constant value. We observe that the following annealing strategy for β leads to the best balance between reconstruction and controllability: β is set to 0 in the first 1,000 training steps, and is slowly annealed up to 0.2 over the next 10,000 training steps. We set the batch size to 128, all hidden sizes to 512, and the encoded z dimensions to 128. The Adam optimizer is used.
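The β annealing strategy described above can be sketched as a simple schedule function. This is our own sketch of the stated schedule (0 for the first 1,000 steps, then a linear ramp to 0.2 over the next 10,000); the exact ramp shape is an assumption:

```python
def beta_at_step(step, warmup=1000, anneal=10000, beta_max=0.2):
    """KL weight: zero during warmup, then a linear ramp to beta_max."""
    if step < warmup:
        return 0.0
    return min(beta_max, beta_max * (step - warmup) / anneal)
```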

4.2 Measuring the Controllability of Latent Features

| Model | Consistency (Rhythm Density) | Consistency (Note Density) | Restrictiveness (Rhythm Density) | Restrictiveness (Note Density) | Linearity (Rhythm Density) | Linearity (Note Density) |
|---|---|---|---|---|---|---|
| Proposed (Vanilla VAE) | 0.4367 ± 0.0258 | 0.3490 ± 0.0360 | 0.6645 ± 0.0169 | 0.6481 ± 0.0154 | **0.7805 ± 0.0142** | **0.8255 ± 0.0107** |
| Proposed (GM-VAE) | **0.5096 ± 0.0248** | 0.4207 ± 0.0309 | 0.6603 ± 0.0164 | 0.6457 ± 0.0132 | 0.7580 ± 0.0124 | 0.7792 ± 0.0177 |
| Pati et al. [36] | 0.4625 ± 0.0264 | **0.5100 ± 0.0150** | 0.6417 ± 0.0171 | 0.5497 ± 0.0206 | 0.7613 ± 0.0171 | 0.8220 ± 0.0143 |
| CVAE [40] | 0.2613 ± 0.0376 | 0.4997 ± 0.0355 | **0.6863 ± 0.0221** | 0.7140 ± 0.0130 | 0.4969 ± 0.0166 | 0.3997 ± 0.0411 |
| Fader Networks [31] | 0.2730 ± 0.0366 | 0.4983 ± 0.0425 | 0.6861 ± 0.0163 | **0.7379 ± 0.0149** | 0.5482 ± 0.0283 | 0.4647 ± 0.0292 |
| GLSR [17] | 0.1891 ± 0.0346 | 0.1969 ± 0.0831 | 0.6365 ± 0.0276 | 0.7136 ± 0.0185 | 0.2465 ± 0.0197 | 0.1799 ± 0.0209 |

Table 1: Experimental results (conducted on the Yamaha dataset test split) on the controllability of low-level features (rhythm density and note density) using disentangled latent variables, reported as mean ± standard deviation. Bold marks the best score in each column.

The proposed Music FaderNets model should meet two requirements: (i) Each “fader” independently controls one low-level musical feature without affecting other features (disentanglement), and (ii) the “faders” should change linearly with the controlled attribute of the generated output (linearity). For disentanglement, we follow the definition proposed in [39] which decomposes the concept of disentanglement into generator consistency and generator restrictiveness. Using rhythm density as an example:

  • Consistency on rhythm density means that for the same value of the regularized dimension of z_r, the value of the output’s rhythm density should be consistent.

  • Restrictiveness on rhythm density means that changing the value of the regularized dimension of z_r does not affect the attributes other than rhythm density (in our case, note density).

  • Linearity on rhythm density means that the value of rhythm density is directly proportional to the value of the regularized dimension of z_r, which is analogous to a sliding fader.

We evaluate all three of these properties in our experiment. For evaluating linearity, [36] proposed a slightly modified version of the interpretability metric by [1], which includes the following steps: (1) encode each sample in the test set, obtaining the rhythm latent code z_r, and find the dimension of z_r which has the maximum mutual information with regard to the attribute; (2) learn a linear regressor to predict the input attribute values based on that dimension. The linearity score is then the coefficient of determination (R² score) of the linear regressor. However, this method evaluates only the encoder and not the decoder. As we want the sliding knobs to directly impact the output, we argue that the relationship between the regularized dimension and the output attributes is more important. Hence, we propose to “slide” the values of the regularized dimension within a given range and decode them into reconstructed outputs. Then, instead of predicting the input attributes given the encoded values, the linear regressor learns to predict the corresponding output attributes given the “slid” values of the regularized dimension.

Figure 2: Workflow of obtaining evaluation metrics for “faders” controlling rhythm density.

We demonstrate a single workflow to calculate the consistency, restrictiveness and linearity scores of a given model based on the low-level features (we use rhythm density as an example low-level feature for the discussion below), as depicted in Figure 2. After obtaining the rhythm density latent code z_r for all samples in the training set and finding the minimum and maximum values of its regularized dimension, z_min and z_max, we “slide” the dimension for K steps by calculating α_k = z_min + (k / K)(z_max − z_min). This results in a list of values denoted as {α_1, ..., α_K}. Then, we conduct the following steps:

  1. Randomly select N samples from the test set, and encode each sample into z_r and z_n;

  2. Alter the regularized dimension in z_r using each value in {α_1, ..., α_K}, to obtain K new rhythm density latent codes for each sample;

  3. Decode each new rhythm density latent code together with the unchanged note density latent code z_n to obtain the reconstructed outputs;

  4. Calculate the rhythm density and note density of each reconstructed output;

  5. Pair up each slid value α_k with the resulting rhythm density of the output as training data points for a linear regressor.
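The sliding procedure in the steps above can be sketched as follows. This is a minimal numpy sketch; the function names are ours, and we assume evenly spaced values spanning the observed range of the regularized dimension:

```python
import numpy as np

def slide_values(z_dim_train, K=8):
    """K evenly spaced values spanning the dimension's range in the training set."""
    lo, hi = float(np.min(z_dim_train)), float(np.max(z_dim_train))
    return [lo + k * (hi - lo) / (K - 1) for k in range(K)]

def altered_codes(z_r, dim, alphas):
    """Copies of a latent code with the regularized dimension set to each slid value."""
    out = []
    for a in alphas:
        z = z_r.copy()
        z[dim] = a          # only the darkened (regularized) dimension changes
        out.append(z)
    return out
```

Each altered code is then decoded together with the sample's unchanged note density code, as described in step 3.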

The final evaluation scores are then calculated as follows:

consistency = 1 − (1/K) Σ_{k=1}^{K} σ({rd_{k,1}, ..., rd_{k,N}}),
restrictiveness = 1 − (1/N) Σ_{m=1}^{N} σ({nd_{1,m}, ..., nd_{K,m}}),
linearity = R² score of R,

where rd_{k,m} and nd_{k,m} denote the output rhythm density and note density of sample m decoded with slid value α_k, σ denotes the standard deviation, and R denotes the linear regressor model. In other words, consistency calculates the average standard deviation across all output rhythm density values given the same slid value, whereas restrictiveness calculates the average standard deviation across all output note density values given the changing slid values. In a perfectly disentangled and linear model, the consistency, restrictiveness and linearity scores should be equal to 1, and higher scores indicate better performance.

5 Experiments and Results

We compare the evaluation scores of our proposed model, using both a vanilla VAE (omitting the cluster inference component) and a GM-VAE, with several models proposed in related work on controllable synthesis: CVAE [40], Fader Networks [31], GLSR [17] and Pati et al. [36]. We repeat the above steps for 10 runs for each model and report the mean and standard deviation of each score. Table 1 shows the evaluation results. Overall, our proposed models achieve good all-round performance on every metric compared to the other models. In particular, in terms of linearity, models that use [36]’s regularization method largely outperform the other models. Our model achieves similar results to [36]; however, compared to their work, we encode a multi-dimensional, regularized latent space instead of a single dimension value for each low-level feature, thus allowing more flexibility. Our model can also be used for “generation via analogy” as mentioned in EC2-VAE [43], by mix-matching z_r from one sample with z_n from another. Moreover, the feature latent vectors can be used to infer interpretable and semantically meaningful clusters.

5.1 Inferring High-Level Features from Latent Low-Level Representations

Figure 3: Visualization of rhythm (top) and note (bottom) density latent space in the GM-VAE. Each column is colored in terms of: (left) original density values, (middle) regularized values, (right) arousal cluster labels (0 refers to low arousal and 1 refers to high arousal).

Figure 3 visualizes the rhythm and note density latent space learnt by GM-VAE using t-SNE dimensionality reduction. We observe that both spaces successfully learn a Gaussian-mixture space with two well-separated components, which correspond to high and low arousal clusters, even though it was trained with only around 1% of labelled data. We also find that the regularized values capture the overall trend of the actual rhythm and note density values. Interestingly, the model learns the implicit relationship between high/low arousal and the corresponding levels of rhythm/note density. From Figure 3, we observe that the high arousal cluster corresponds to higher rhythm density and lower note density, whereas the low arousal cluster corresponds to lower rhythm density and higher note density. This is reasonable as music segments with high arousal often consist of fast running notes and arpeggios, being played one note at a time, whereas music segments with low arousal often exhibit a chordal texture with more sustaining notes and relatively less melodic activity.

To further inspect the importance of using low-level features, we train a separate GM-VAE model with only one encoder (without the discriminator component), which encodes only a single latent vector for each segment. The model is trained to infer the arousal label from this single latent vector, similarly in a semi-supervised manner, with the hyperparameters kept the same. From Figure 4, we observe that the latent space learnt without using low-level features is not well segregated into two separate components, suggesting that the right choice of low-level features helps the learning of a more discriminative and disentangled feature latent space.

The major advantage demonstrated by the results above is that by carefully choosing low-level features supported by domain knowledge, semi-supervised (or weakly supervised) training can be leveraged to learn interpretable representations that capture implicit relationships between high-level and low-level features, overcoming the difficulties mentioned in the introduction. This is an important insight for learning representations of abstract musical qualities under label scarcity conditions in the future.

Figure 4: Arousal cluster visualization of GM-VAE with (left), and without (right) using low-level features.
Figure 5: Examples of arousal transfer on music samples.
Figure 6: Subjective listening test results. Left: heat map of annotated arousal level change against actual arousal level change. Right: bar plot of opinion scores for each musical quality, with 95% confidence intervals.

5.2 Style Transfer on High Level Features

Utilizing the learnt high-level feature representations enables the application of feature style transfer. Following [33], given the means of the two Gaussian components, μ_high and μ_low, the “shifting vector” from high arousal to low arousal is μ_low − μ_high, and vice versa. To shift a music segment from high to low arousal, we modify each latent code by adding the shifting vector of the corresponding latent space. Both new latent codes z_r and z_n are then fed into the global decoder for reconstruction. For cases where some of the latent codes already lie within the target arousal cluster, we choose to perform shifting only on the latent codes which do not lie within the target arousal cluster. Figure 5 shows several examples of arousal shift performed on given music segments. We can observe that the shift is clearly accompanied by the desired changes in rhythm density and note density, as mentioned in Section 5.1. More examples are available online. We also conducted a subjective listening test to evaluate the quality of the arousal shift performed by Music FaderNets. We randomly chose 20 music segments from our dataset, and performed a low-to-high arousal shift on 10 segments and a high-to-low arousal shift on the other 10. Each subject listened to the original sample and then the transformed sample, and was asked (1) whether the arousal level changes after the transformation, and (2) how well the transformed sample sounds in terms of rhythm, melody, harmony and naturalness, each on a Likert scale of 1 to 5.
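The shift described above is simple vector arithmetic on the latent codes. A minimal sketch (the function name is ours):

```python
import numpy as np

def shift_arousal(z, mu_source, mu_target):
    """Move a latent code along the vector between the Gaussian component means,
    e.g. from the high-arousal mean towards the low-arousal mean."""
    return z + (mu_target - mu_source)
```

Applying the opposite shifting vector moves the code back, so the operation is reversible in latent space.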

A total of 48 subjects participated in the survey. We found that 81.45% of the responses agreed with the actual direction of the arousal level change applied by the model. This shows that our model is capable of shifting the arousal level of a piece to a desired state. From the heat map shown in Figure 6, we observe that shifting from high to low arousal has a higher rate of agreement (92.5%) than shifting from low to high arousal (70.41%). Meanwhile, the mean opinion scores for rhythm, melody, harmony and naturalness were 3.53, 3.39, 3.41 and 3.33 respectively, showing that the quality of the generated samples is generally above a moderate level.

6 Conclusion and Future Work

We propose a novel framework called Music FaderNets (source code available online), which can generate new variations of music samples by controlling the levels (“sliding knobs”) of low-level attributes, trained with latent regularization and feature disentanglement techniques. We also show that the framework is capable of inferring high-level feature representations (“presets”, e.g. arousal) on top of latent low-level representations by utilizing the GM-VAE framework. Finally, we demonstrate the application of using the learnt high-level feature representations to perform arousal transfer, which was confirmed in a user experiment. The key advantage of this framework is that it can learn interpretable mixture components that reveal the intrinsic relationship between low-level and high-level features using semi-supervised learning, so that abstract musical qualities can be quantified in a more concrete manner with a limited amount of labels.

While the strength of the arousal transfer is gradually increased, we find that the identity of the original piece also gradually shifts. A recent work on text generation using VAEs [42] observed a similar trait and attributed its cause to the “latent vacancy” problem via topological analysis. A possible solution is to adopt the Constrained-Posterior VAE [42], which we aim to explore in future work. Future work will also focus on applying the framework to other sets of abstract musical qualities (such as valence [38], tension [18], etc.), and on extending the framework to model multi-track music with longer duration to produce more complete music.

7 Acknowledgements

We would like to thank the anonymous reviewers for their constructive reviews. We also thank Yin-Jyun Luo for the insightful discussions on GM-VAEs. This work is supported by MOE Tier 2 grant no. MOE2018-T2-2-161 and SRG ISTD 2017 129. The subjective listening test is approved by the Institutional Review Board under SUTD-IRB 20-315. We would also like to thank the volunteers for taking the subjective listening test.


  • [1] T. Adel, Z. Ghahramani, and A. Weller (2018) Discovering interpretable representations for both deep generative and discriminative models. In International Conference on Machine Learning, pp. 50–59. Cited by: §4.2.
  • [2] T. Akama (2019) Controlling symbolic music generation based on concept learning from domain knowledge. In Proc. of the International Society for Music Information Retrieval Conference, Cited by: §2.2.
  • [3] A. Aljanaki, Y. Yang, and M. Soleymani (2015) Emotion in music task at mediaeval 2015.. In MediaEval, Cited by: §1.
  • [4] R. Bresin and A. Friberg (2000) Emotional coloring of computer-controlled music performances. Computer Music Journal 24 (4), pp. 44–63. Cited by: §1.
  • [5] J. Briot, G. Hadjeres, and F. Pachet (2019) Deep learning techniques for music generation. Vol. 10, Springer. Cited by: §2.1.
  • [6] G. Brunner, A. Konrad, Y. Wang, and R. Wattenhofer (2018) MIDI-vae: modeling dynamics and instrumentation of music with applications to style transfer. In Proc. of the International Society for Music Information Retrieval Conference, Cited by: §2.2.
  • [7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180. Cited by: §2.2.
  • [8] K. Choi, C. Hawthorne, I. Simon, M. Dinculescu, and J. Engel (2020) Encoding musical style with transformer autoencoders. In International Conference of Machine Learning, Cited by: §2.1.
  • [9] M. S. Cuthbert and C. Ariza (2010) Music21: a toolkit for computer-aided musicology and symbolic music data. In Proc. of the International Society for Music Information Retrieval Conference, Cited by: §4.1.
  • [10] S. Dai, Z. Zhang, and G. G. Xia (2018) Music style transfer: a position paper. In Proc. of International Workshop on Musical Metacreation, Cited by: §2.1.
  • [11] S. K. Ehrlich, K. R. Agres, C. Guan, and G. Cheng (2019) A closed-loop, music-based brain-computer interface for emotion mediation. PloS one 14 (3). Cited by: §1, §4.
  • [12] J. Engel, M. Hoffman, and A. Roberts (2017) Latent constraints: learning to generate conditionally from unconditional generative models. In International Conference of Learning Representations, Cited by: §2.1.
  • [13] L. N. Ferreira and J. Whitehead (2019) Learning to generate music with sentiment. In Proc. of the International Society for Music Information Retrieval Conference, Cited by: §1, §4.1.
  • [14] A. Gabrielsson and E. Lindström (2001) The influence of musical structure on emotional expression.. Cited by: §4.
  • [15] P. Gomez and B. Danuser (2007) Relationships between musical structure and psychophysiological measures of emotion. Emotion 7 (2), pp. 377. Cited by: §4.
  • [16] R. Habib, S. Mariooryad, M. Shannon, E. Battenberg, R. Skerry-Ryan, D. Stanton, D. Kao, and T. Bagby (2020) Semi-supervised generative modeling for controllable speech synthesis. In International Conference on Learning Representations, Cited by: §1, §2.2.
  • [17] G. Hadjeres, F. Nielsen, and F. Pachet (2017) GLSR-VAE: geodesic latent space regularization for variational autoencoder architectures. In IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7. Cited by: §2.1, §3.2, Table 1, §5.
  • [18] D. Herremans and E. Chew (2017) MorpheuS: generating structured music with constrained patterns and tension. IEEE Transactions on Affective Computing. Cited by: §6.
  • [19] D. Herremans, C. Chuan, and E. Chew (2017) A functional taxonomy of music generation systems. ACM Computing Surveys (CSUR) 50 (5), pp. 1–30. Cited by: §2.1.
  • [20] D. Herremans and C. Chuan (2020) The emergence of deep learning: new opportunities for music and audio technologies. Neural Computing and Applications 32, pp. 913–914. Cited by: §2.1.
  • [21] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, Cited by: §2.2, §3.2.
  • [22] W. Hsu, Y. Zhang, and J. Glass (2017) Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, pp. 1878–1889. Cited by: §2.2.
  • [23] W. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, et al. (2019) Hierarchical generative modeling for controllable speech synthesis. In International Conference on Learning Representations, Cited by: §3.2.
  • [24] C. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck (2019) Music transformer: generating music with long-term structure. In International Conference on Learning Representations, Cited by: §2.1, §3.2.
  • [25] Y. Hung, Y. Chen, and Y. Yang (2018) Learning disentangled representations for timbre and pitch in music audio. arXiv preprint arXiv:1811.03271. Cited by: §2.2.
  • [26] Y. Hung, I. Chiang, Y. Chen, Y. Yang, et al. (2019) Musical composition style transfer via disentangled timbre representations. In International Joint Conference on Artificial Intelligence, Cited by: §2.2.
  • [27] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou (2016) Variational deep embedding: an unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148. Cited by: §1, §3.1, §3.1.
  • [28] H. Kim and A. Mnih (2018) Disentangling by factorising. In International Conference on Machine Learning, Cited by: §2.2.
  • [29] Y. Kim, S. Wiseman, and A. M. Rush (2018) A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834. Cited by: §1.
  • [30] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations, Cited by: §3.1.
  • [31] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato (2017) Fader networks: manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pp. 5967–5976. Cited by: §2.1, §3.2, Table 1, §5.
  • [32] S. R. Livingstone, R. Muhlberger, A. R. Brown, and W. F. Thompson (2010) Changing musical emotion: a computational rule system for modifying score and performance. Computer Music Journal 34 (1), pp. 41–64. Cited by: §1.
  • [33] Y. Luo, K. Agres, and D. Herremans (2019) Learning disentangled representations of timbre and pitch for musical instrument sounds using Gaussian mixture variational autoencoders. In Proc. of the International Society for Music Information Retrieval Conference, Cited by: §2.2, §5.2.
  • [34] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.1.
  • [35] S. Oore, I. Simon, S. Dieleman, D. Eck, and K. Simonyan (2018) This time with feeling: learning expressive musical performance. Neural Computing and Applications, pp. 1–13. Cited by: §2.1, §3.2, §4.1.
  • [36] A. Pati and A. Lerch (2019) Latent space regularization for explicit control of musical attributes. In ICML Machine Learning for Music Discovery Workshop (ML4MD), Extended Abstract, Long Beach, CA, USA, Cited by: §2.1, §3.2, §4.2, Table 1, §5.
  • [37] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck (2018) A hierarchical latent vector model for learning long-term structure in music. In International Conference on Machine Learning, Cited by: §2.1.
  • [38] J. A. Russell (1980) A circumplex model of affect. Journal of Personality and Social Psychology 39 (6), pp. 1161. Cited by: §4, §6.
  • [39] R. Shu, Y. Chen, A. Kumar, S. Ermon, and B. Poole (2020) Weakly supervised disentanglement with guarantees. In International Conference on Learning Representations, Cited by: §4.2.
  • [40] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491. Cited by: §2.1, §3.2, Table 1, §5.
  • [41] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous (2018) Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning, Cited by: §2.2.
  • [42] P. Xu, J. C. K. Cheung, and Y. Cao (2020) On variational learning of controllable representations for text without supervision. In International Conference on Machine Learning, Cited by: §6.
  • [43] R. Yang, D. Wang, Z. Wang, T. Chen, J. Jiang, and G. Xia (2019) Deep music analogy via latent representation disentanglement. In Proc. of the International Society for Music Information Retrieval Conference, Cited by: §2.2, §4.1, §5.
  • [44] Y. Li and S. Mandt (2018) Disentangled sequential autoencoder. In International Conference on Machine Learning, pp. 5670–5679. Cited by: §2.2.