Semi-Recurrent CNN-based VAE-GAN for Sequential Data Generation

06/01/2018 ∙ by Mohammad Akbari, et al.

A semi-recurrent hybrid VAE-GAN model for generating sequential data is introduced. In order to consider the spatial correlation of the data in each frame of the generated sequence, CNNs are utilized in the encoder, generator, and discriminator. The subsequent frames are sampled from the latent distributions obtained by encoding the previous frames. As a result, the dependencies between the frames are maintained. Two testing frameworks for synthesizing a sequence with any number of frames are also proposed. The promising experimental results on piano music generation indicate the potential of the proposed framework in modeling other sequential data such as video.




1 Introduction

One important problem in unsupervised learning is generating sequential data such as music. Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) have performed well in this area. However, they have difficulty handling the vanishing and exploding gradient problems [1]. In order to deal with these issues, RNNs have been combined with recent deep generative architectures such as Variational Auto-encoders (VAEs) and Generative Adversarial Networks (GANs) [2, 3, 4, 5, 6, 7], which are typically used for learning complex structures of data.

VAEs are generally easy to train, but the generated results have low quality due to imperfect measures such as the squared error. On the other hand, GANs generate samples with higher quality, but they suffer from training instability. In order to improve the training process and the quality of the generated samples, some researchers suggested hybrid VAE-GAN models [8, 9].

Although most of the sequential data generation methods are based on RNNs, some recent works have shown that Convolutional Neural Networks (CNNs) are also capable of generating realistic sequential data such as music [10, 11]. One advantage of CNNs is that they are faster to train in practice and easier to parallelize than RNNs. In addition, applying convolutions along the time dimension can yield strong performance in some applications [12].

(a) Training framework.
(b) Two testing frameworks.
Figure 1: The training and testing frameworks of the proposed semi-recurrent hybrid VAE-GAN model ($E$: encoder, $G$: generator, and $D$: discriminator).

Considering sequential data generation as a problem of generating a sequence of discrete frames, two problems need to be addressed: the strong spatial correlation of the data within each frame, and the dependencies between frames (temporal correlation). In this work, we propose a semi-recurrent convolution-based VAE-GAN for generating a sequence of individual frames in which both problems are addressed. In order to maintain the strong local correlation of the data in each generated frame, we use CNNs, which are a very effective architecture for this purpose. Moreover, each frame is generated from the latent distribution of the previous frame encoded by an encoder. As a result, the dependencies across the frames are also kept.

Figure 1 illustrates the overall training and testing frameworks proposed in this work. The model includes an encoder, a generator (decoder), and a discriminator. To the best of our knowledge, this is the first hybrid VAE-GAN framework introduced for generating sequential data. The feasibility of this model is evaluated on piano music generation, which shows that the proposed framework is a viable way of training networks that model music, and has potential for modelling many other types of sequential data such as videos.

2 Preliminaries and Related Works

In recent years, deep generative models have achieved significant success, especially in generating natural images [13, 14, 15, 16, 17]. In these models, complex structures of the data can be learned using deep architectures with multiple layers. VAEs [13, 15] and GANs [14, 16, 17] are two powerful frameworks for learning deep generative models in an unsupervised manner.

2.1 Variational Auto-encoder (VAE)

A VAE consists of an encoder and a decoder [13]. The encoder, denoted by $Enc$, encodes a data sample $x$ to a latent (hidden) representation $z$: $z \sim Enc(x) = q(z|x)$. The decoder, denoted by $Dec$, decodes the latent representation back to the probability distribution of the data (in data space): $\tilde{x} \sim Dec(z) = p(x|z)$.

The VAE regularizes the encoder by imposing a prior over the latent distribution $p(z)$, where $z \sim \mathcal{N}(0, I)$. The loss function of the VAE is the expected log likelihood with a regularizer:

$$L_{VAE} = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{KL}(q(z|x) \,\|\, p(z)) \tag{1}$$

where the first and second terms are the reconstruction loss and a prior regularization term that is the Kullback-Leibler (KL) divergence, respectively.
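When the encoder outputs a diagonal Gaussian, the KL term has a well-known closed form. The worked equation below is a standard result rather than something stated in this paper; it assumes $q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ with latent dimension $d$ (the symbols $\sigma$ and $d$ are introduced here for illustration) and the prior $p(z) = \mathcal{N}(0, I)$:

```latex
D_{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
```

Each summand is non-negative and vanishes only when $\mu_j = 0$ and $\sigma_j^2 = 1$, so the regularizer pulls the encoder's distribution toward the prior.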

2.2 Generative Adversarial Network (GAN)

Another popular generative model is the GAN, in which two models are trained at the same time [14]. The generator model $G(z)$ captures the data distribution by mapping the latent $z$ to data space, while the discriminator model $D(x)$ estimates the probability that $x$ is a real training sample rather than a fake sample synthesized by $G$. These two models compete in a two-player minimax game in which the objective is to find a binary classifier $D$ that discriminates the real data from the fake (generated) ones, and simultaneously to encourage $G$ to fit the true data distribution. This goal is achieved by minimizing/maximizing the binary cross entropy:

$$L_{GAN} = \log D(x) + \log(1 - D(G(z))) \tag{2}$$

where $G$ tries to minimize this objective against $D$ that tries to maximize it.
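To make the minimax objective concrete, the following minimal sketch evaluates the binary cross-entropy value for given discriminator outputs. The function name `gan_objective` and the sample probabilities are hypothetical and chosen only for illustration; averaging over a batch is also an assumption.

```python
import math

def gan_objective(d_real, d_fake):
    """Batch average of log D(x) + log(1 - D(G(z))).

    d_real: discriminator outputs on real samples, each in (0, 1).
    d_fake: discriminator outputs on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes the value up toward 0.
confident = gan_objective([0.99, 0.98], [0.01, 0.02])
# A discriminator that outputs 0.5 everywhere gives 2 * log(0.5).
fooled = gan_objective([0.5, 0.5], [0.5, 0.5])
```

The discriminator benefits from a larger value, while the generator benefits from a smaller one, which is why the two networks pull the same quantity in opposite directions.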

Although GANs are powerful generative models, they suffer from training instability and low-quality generated samples. Different approaches have been proposed to improve GANs. For example, Wasserstein GAN (WGAN) [18] used the Wasserstein distance as a training objective to improve the stability of learning, Laplacian GAN (LAPGAN) [19] achieved coarse-to-fine conditional generation through Laplacian pyramids, and Deep Convolutional GAN (DCGAN) [16] proposed an effective and stable architecture for the generator $G$ and the discriminator $D$ using deeper CNNs to achieve remarkable image synthesis results.

2.3 Sequential Data Generation: Music Generation

Different learning-based approaches for sequential data generation, especially music, have been introduced by various researchers. In [20], an RNN-based architecture using LSTMs was proposed in which a piano-roll sequence of notes and chords was generated using an iterative feed-forward strategy. In [21], a Restricted Boltzmann Machine (RBM) was utilized for modeling and generating polyphonic music by learning a model from an audio corpus. The DeepBach architecture [22], which was specialized for Bach’s chorales, combined two LSTMs and two feed-forward networks (forward and backward in time networks).

VAE, as one of the effective approaches considered for content generation, has been used by some researchers in order to generate musical content. In [2], a VAE-based method named Variational Recurrent Auto-Encoder (VRAE) was proposed in which the encoder and decoder parts were LSTMs. Variational Recurrent Autoencoder Supported by History (VRASH) [3] used the same architecture as in VRAE, but also fed the output of the decoder back into the decoder. In [4], the objective function used in DeepBach was reformulated using VAE to have better control over the embedding of the data into the latent space.

Although RNNs are more commonly used to model time-series signals, some non-RNN approaches have been introduced using CNNs [10, 23, 11]. A system for generating raw audio music waveforms named WaveNet was proposed in [10] in which an extended CNN called dilated causal convolution was incorporated. In this work, the authors argued that dilated convolutions allowed the receptive field to grow longer in a much cheaper way than using LSTMs. Another CNN-based architecture is convolutional RBM (C-RBM) [23], which was developed for the generation of MIDI polyphonic music. In this work, convolution was performed in the time domain to model temporally invariant motives.

Some works have exploited GANs for generating music [5, 11]. An example of the use of GAN is C-RNN-GAN [5], with both the generator $G$ and the discriminator $D$ being LSTMs, in which the goal was to transform random noise into melodies. A bidirectional RNN was utilized in $D$ to take contexts from both past and future. In [11], a convolutional GAN architecture named MidiNet was proposed to generate pop music melodies from random noise (in a piano-roll-like format). In this approach, both $G$ and $D$ were composed of convolutional networks. Similar to what a recurrent network would do in considering the history, the information from the previous musical measure was incorporated into intermediate layers.

3 Semi-Recurrent CNN-based VAE-GAN

In this section, the semi-recurrent hybrid VAE-GAN model proposed for generating temporal data such as music is described. As illustrated in Figure 1, the model is composed of three units: the encoder ($E$), the generator/decoder ($G$), and the discriminator ($D$). In this work, the VAE decoder and the GAN generator are collapsed into one by letting them share parameters and training them jointly.

The main architecture of the three networks used in this work is CNNs. Convolutions are rarely used in modelling signals with invariance in time such as music, but they have been very successful in the models whose data has strong spatially local correlation such as images, which is also important for sequential data. In this work, we consider the input time-dependent data as a sequence of individual frames, which have internal spatial correlation. Thus, we exploit CNNs for separate generation of each of these frames, while keeping the dependencies across them as follows.

For each pair of sequential frames, the previous frame is encoded to its corresponding latent representation using $E$. Next, $G$ tries to generate (predict) the subsequent frame from the latent distribution of the previous frame. As a result, the history and the information from previous frames are incorporated for generating the next ones. The current real training frame in each pair and the synthesized frame are then forwarded to $D$ as real and fake data, respectively.
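The pairing of each frame with its successor can be sketched as follows. This is only an illustration of how training pairs might be formed; the helper name `consecutive_pairs` and the string placeholders for frames are hypothetical.

```python
def consecutive_pairs(frames):
    """Return (previous, current) pairs for a sequence of frames."""
    return list(zip(frames[:-1], frames[1:]))

bars = ["bar0", "bar1", "bar2", "bar3"]
pairs = consecutive_pairs(bars)
# pairs == [("bar0", "bar1"), ("bar1", "bar2"), ("bar2", "bar3")]
```

In each pair, the first element would be the encoder input and the second the real frame that the discriminator compares against the synthesized one.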

3.1 Formulation and Objective

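Let $x = \{x_t\}_{t=1}^{n}$ be a sequence from the training data with $n$ frames. The network $E$ maps a training frame $x_{t-1}$ (previous time frame) to the mean $\mu$ and the covariance $\Sigma$ of the latent vector:

$$(\mu, \Sigma) = E(x_{t-1}) \tag{3}$$

Then, the latent vector $z$ can be sampled as follows:

$$z = \mu + \Sigma^{1/2} \odot \epsilon \tag{4}$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ is the element-wise multiplication. In order to reduce the gap between the prior $p(z)$ and the encoder’s distribution $q(z|x_{t-1})$ and measure how much information is lost, KL loss is used:

$$L_{KL} = D_{KL}(q(z|x_{t-1}) \,\|\, p(z)) \tag{5}$$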
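The sampling step and the KL loss can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the covariance is diagonal and represented as a vector of per-dimension variances; the function names and the example values are hypothetical.

```python
import math
import random

def sample_latent(mu, var, rng):
    """Draw z = mu + sqrt(var) * eps element-wise, with eps ~ N(0, I)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) )."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 0.0]
var = [1.0, 0.25, 4.0]
z = sample_latent(mu, var, rng)
kl = kl_to_standard_normal(mu, var)
```

Writing the sample as a deterministic function of the mean, the variance, and external noise is what lets gradients flow back into the encoder during training.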


The network $G$ then generates two frames $\tilde{x}_t$ and $x_p$ by decoding the latent representations $z$ (sampled using $q(z|x_{t-1})$) and $z_p$ (sampled from a normal distribution) back to the data space, respectively:

$$\tilde{x}_t = G(z), \quad x_p = G(z_p) \tag{6}$$


Element-wise reconstruction errors are generally inadequate for signals with invariances [8]. As a result, in order to measure the quality of the reconstructed samples in this work, the following pair-wise feature matching loss between the real data $x_t$ and the synthesized data $\tilde{x}_t$ and $x_p$ is utilized:

$$L_{pair} = \| f_D(x_t) - f_D(\tilde{x}_t) \|_2^2 + \| f_D(x_t) - f_D(x_p) \|_2^2 \tag{7}$$

where $f_D(\cdot)$ denotes the features (hidden representation) of an intermediate layer of the network $D$. Thus, the loss of network $E$ is calculated:

$$L_E = L_{KL} + L_{pair} \tag{8}$$


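In order to distinguish the real training data $x_t$ from the synthesized frames $\tilde{x}_t$ and $x_p$, the following objective function is minimized by $D$:

$$L_D = -\left( \log D(x_t) + \log(1 - D(\tilde{x}_t)) + \log(1 - D(x_p)) \right) \tag{9}$$

while $G$ tries to fool $D$ by minimizing

$$L_G = \log(1 - D(\tilde{x}_t)) + \log(1 - D(x_p)) + L_{pair} \tag{10}$$

where $L_{pair}$ is the pair-wise feature matching loss (Equation 7), which is a shared error signal between $E$ and $G$.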
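A pair-wise feature matching loss of this kind can be sketched as below. The sketch assumes a squared Euclidean distance between discriminator feature vectors, summed over the two synthesized frames, which is one common choice; the weighting used in the actual implementation may differ. The feature vectors and function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pairwise_feature_matching(f_real, f_recon, f_prior):
    """Feature distance from the real frame to each synthesized frame.

    f_real:  discriminator features of the real frame.
    f_recon: features of the frame decoded from the encoded latent.
    f_prior: features of the frame decoded from a prior sample.
    """
    return squared_l2(f_real, f_recon) + squared_l2(f_real, f_prior)

f_real = [1.0, 2.0, 3.0]
f_recon = [1.0, 2.5, 3.0]
f_prior = [0.0, 2.0, 5.0]
loss = pairwise_feature_matching(f_real, f_recon, f_prior)  # 0.25 + 5.0 = 5.25
```

Comparing frames in the discriminator's feature space rather than element by element tolerates small shifts that leave the musical content unchanged.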

Finally, our goal is to minimize the following hybrid loss function: .

4 Experiments: Piano Music Generation

We applied the proposed approach to piano music generation. The source code and some generated samples are shared on GitHub. In this experiment, we used the Nottingham dataset (lisa/deep/data) as our training data, which contains 695 pieces of folk piano music in MIDI file format. Each MIDI file was divided into separate bars, and a bar is represented by a real-valued 2-D matrix $x \in \mathbb{R}^{h \times w}$, where $h$ and $w$ represent the number of MIDI notes/pitches and the number of time steps (i.e., $w = 16$ for a two-second bar with pitch sampling of 0.125 sec), respectively. The value of each element of the matrix is the velocity (volume) of a note at a certain time step. The sequence of bars is denoted by $\{x_t\}_{t=1}^{n}$, where $x_{t-1}$ and $x_t$ are two sequential bars.
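The bar representation can be illustrated with a small rasterization routine. This is a sketch under stated assumptions: 128 pitch rows (the full MIDI note range, assumed here rather than taken from the text), 16 time steps of 0.125 sec for a two-second bar, and notes supplied as hypothetical `(pitch, start_sec, end_sec, velocity)` tuples instead of being parsed from a MIDI file.

```python
import math

def bar_to_matrix(notes, n_pitches=128, n_steps=16, step_sec=0.125):
    """Rasterize one bar into an n_pitches x n_steps velocity matrix.

    notes: iterable of (pitch, start_sec, end_sec, velocity) tuples, with
    times measured from the start of the bar.
    """
    roll = [[0.0] * n_steps for _ in range(n_pitches)]
    for pitch, start, end, velocity in notes:
        first = int(start / step_sec)
        last = min(n_steps, int(math.ceil(end / step_sec)))
        for step in range(first, last):
            roll[pitch][step] = float(velocity)
    return roll

# Middle C for half a second, then E for half a second.
notes = [(60, 0.0, 0.5, 90), (64, 0.5, 1.0, 70)]
roll = bar_to_matrix(notes)
```

Each row of `roll` is one pitch and each column one time step, so a held note shows up as a horizontal run of its velocity value.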

The details of the networks $E$, $G$, and $D$ are summarized in Table 1. The output layer of $E$ is a fully-connected layer with 256 hidden units, where its first and second 128 units are respectively considered as the mean $\mu$ and covariance $\Sigma$ used for representing the latent $z$ of dimension 128 (Equations 3 and 4). The latent $z$ and a sample $z_p$ from a normal distribution (of dimension 128) are fed to $G$ to output the synthesized bars. Before the Tanh layer of $G$, another convolution is applied to map to the number of output channels (that is 1 in this work). An extra convolution is also applied before the Sigmoid layer of $D$ to represent the output by a 1-D feature map, which is used as $f_D$ for calculating the pair-wise feature matching loss (Equation 7). This network takes the 2-D matrices of the real and synthesized bars as inputs and predicts whether they are real or generated MIDI bars.

Network | Layers (filters) | Size | AF | In | Out
E | conv (8, 16, 32), fully-connected layer | | ELU | | {$\mu$, $\Sigma$}
G | deconv (64, 32, 16, 8), Tanh layer | | | |
D | conv (8, 16, 32, 64), Sigmoid layer | | LeakyReLU | | 0 or 1
Table 1: The network architecture of the encoder (E), generator (G), and discriminator (D). AF, In, and Out are respectively the activation functions used after each conv/deconv layer, the input, and the output of each network.

All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 64. The Adam optimizer was used with momentum of 0.5 and a learning rate of 0.0005 for $E$ and $G$, and 0.0001 for $D$. In order to keep the losses corresponding to $E$, $G$, and $D$ balanced in each iteration, we trained $E$ and $G$ twice and $D$ once.
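The update schedule can be sketched as a small loop. This assumes the reading in which the encoder and generator share the 0.0005 learning rate and are each updated twice per iteration, while the discriminator uses 0.0001 and is updated once; the ordering of updates within an iteration and the `update_fn` callback are hypothetical.

```python
# Assumed schedule: E and G at lr 0.0005, two updates each per iteration;
# D at lr 0.0001, one update per iteration.
LEARNING_RATES = {"E": 0.0005, "G": 0.0005, "D": 0.0001}
UPDATES_PER_ITERATION = {"E": 2, "G": 2, "D": 1}

def run_schedule(n_iterations, update_fn):
    """Call update_fn(name, lr) for each network according to the schedule."""
    counts = {name: 0 for name in UPDATES_PER_ITERATION}
    for _iteration in range(n_iterations):
        for name, repeats in UPDATES_PER_ITERATION.items():
            for _ in range(repeats):
                update_fn(name, LEARNING_RATES[name])
                counts[name] += 1
    return counts

log = []
counts = run_schedule(3, lambda name, lr: log.append((name, lr)))
```

In a real training loop, `update_fn` would compute the corresponding loss on a mini-batch of 64 bar pairs and take one optimizer step.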

Two models illustrated in Figure 1 were proposed to sequentially generate music with an arbitrary number of bars. In model 1 (top model in Figure 1), the input to $E$ is a bar randomly selected from the training data, which is taken as the first bar of the generated music. This bar is then mapped to the latent $z$ using $E$, and $G$ synthesizes the next bar by decoding $z$ back to the data space. By feeding the generated bar back to $E$, this process is repeated to generate a sequence of bars. In model 2 (bottom model in Figure 1), the same recurrent process is applied, but the first bar is itself synthesized by $G$ from random noise. Two 5-bar music samples generated using model 1 (top model in Figure 1) are illustrated in Figure 2.
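The two testing procedures differ only in how the first bar is obtained, which the following sketch makes explicit. The `encode` and `decode` callables stand in for the trained encoder and generator; the toy versions used here are hypothetical placeholders so that the loop can run without a trained model.

```python
import random

def generate_sequence(encode, decode, n_bars, first_bar=None,
                      latent_dim=128, rng=None):
    """Generate n_bars bars by repeatedly encoding the latest bar.

    Model 1: pass a real training bar as first_bar.
    Model 2: leave first_bar as None so it is decoded from random noise.
    """
    rng = rng or random.Random()
    if first_bar is None:
        noise = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        first_bar = decode(noise)
    bars = [first_bar]
    while len(bars) < n_bars:
        z = encode(bars[-1])    # latent obtained from the previous bar
        bars.append(decode(z))  # next bar decoded from that latent
    return bars

# Toy stand-ins (not real networks): "encoding" adds 1, "decoding" copies.
toy_encode = lambda bar: [v + 1 for v in bar]
toy_decode = lambda z: list(z)

model1 = generate_sequence(toy_encode, toy_decode, 3, first_bar=[0, 0])
model2 = generate_sequence(toy_encode, toy_decode, 4, latent_dim=3,
                           rng=random.Random(1))
```

Because each bar depends only on the one before it, the loop can run for any number of bars, which is what allows sequences of arbitrary length.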

Figure 2: Two 5-bar music samples generated using the proposed testing model 1 (top model in Figure 1).

4.1 Results

In order to evaluate the music samples generated using our approach, the following measurements were taken into account [5]: scale consistency (the percentage for the best matching musical scale that a sample is part of), uniqueness (the percentage of unique tones used in a sample), velocity span (the velocity range in which the tones are played), recurrence (repetitions of short subsequences of length 2 in a sample), tone span (the number of half-tone steps between the lowest and the highest tones in a sample), and diversity (the average pairwise Levenshtein edit distance [24] of the generated data). Figure 3 shows the results of evaluating generated pieces of music of length 10 seconds (i.e., 5 two-second bars).
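Two of these measurements are simple enough to sketch directly. The code below is illustrative only: it treats each sample as a list of MIDI pitch numbers, and it normalizes each edit distance by the longer sequence length, which is an assumption made here so that diversity falls in [0, 1] and is not necessarily the normalization used for the reported numbers.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def diversity(samples):
    """Average pairwise edit distance over non-empty samples,
    each distance divided by the longer sequence length (assumption)."""
    pairs = list(combinations(samples, 2))
    total = sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs)
    return total / len(pairs)

def tone_span(tones):
    """Half-tone steps between the lowest and highest MIDI pitch."""
    return max(tones) - min(tones)

samples = [[60, 62, 64], [60, 62, 65], [60, 62, 64]]
d = diversity(samples)
span = tone_span([60, 64, 67, 72])
```

A diversity near 0 would indicate that the generated pieces are nearly identical, while a value near 1 would indicate that they share almost no common structure.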

Figure 3: Measurements used for evaluating music samples generated at 300 epochs: scale consistency, intensity span, uniqueness, tone span, and recurrence.

As seen in Figure 3, the scale consistency (with an average of ) shows that the generated music significantly follows the standard scales in all samples, which outperforms C-RNN-GAN [5] with an average of . A variety of velocities exist in the music generated, which is illustrated by the oscillating velocity span. The average percentage of the unique tones used in the generated piece is . Compared to the velocity span, less variability is seen in the tone span (with minimum and maximum of 10 and 21) of the generated music due to the low tone span in the training samples (the majority of the music in the dataset is played in 1 or 2 octaves). The number of 2-tone repetitions is in average. Diversity is another metric we took into account to evaluate how realistic the generated music sounds. Compared to ORGAN [6] with an average of 0.551, a higher diversity with an average of was achieved in this work.

5 Conclusion

A semi-recurrent VAE-GAN model for generating sequential data was presented in this work. The model consisted of three networks (encoder, generator, and discriminator) in which convolutions were utilized to spatially learn the local correlation of the data in individual frames. Each frame was sampled from a latent distribution obtained by mapping the previous frame using the encoder. As a consequence, the consistency between the frames in a generated sequence was also preserved. Our experiments on piano music generation presented promising results, which were comparable to the state of the art. One potential direction of this work is to use this framework for modelling and generating other types of sequential data such as video.

6 Acknowledgement

This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada under grant RGPIN312262 and RGPAS478109.