Stochastic WaveNet: A Generative Latent Variable Model for Sequential Data

06/15/2018 ∙ by Guokun Lai, et al.

How to model the distribution of sequential data, including but not limited to speech and human motion, is an important ongoing research problem. It has been demonstrated that model capacity can be significantly enhanced by introducing stochastic latent variables into the hidden states of recurrent neural networks. Meanwhile, WaveNet, equipped with dilated convolutions, achieves astonishing empirical performance on the natural speech generation task. In this paper, we combine the ideas of stochastic latent variables and dilated convolutions, and propose a new architecture for modeling sequential data, termed Stochastic WaveNet, in which stochastic latent variables are injected into the WaveNet structure. We argue that Stochastic WaveNet enjoys powerful distribution modeling capacity as well as the advantage of parallel training from dilated convolutions. In order to efficiently infer the posterior distribution of the latent variables, a novel inference network structure is designed based on the characteristics of the WaveNet architecture. Stochastic WaveNet obtains state-of-the-art performance on benchmark datasets for natural speech modeling, and it can also generate high-quality human handwriting samples.




1 Introduction

Learning to capture the complex distribution of sequential data is an important machine learning problem and has been extensively studied in recent years. Autoregressive neural network models, including the Recurrent Neural Network (Hochreiter and Schmidhuber, 1997; Chung et al., 2014), PixelCNN (Oord et al., 2016) and WaveNet (Van Den Oord et al., 2016), have shown strong empirical performance in modeling natural language, images and human speech. All these methods aim at learning a deterministic mapping from the data input to the output. Recently, evidence has been found (Fabius and van Amersfoort, 2014; Gan et al., 2015; Gu et al., 2015; Goyal et al., 2017; Shabanian et al., 2017) that probabilistic modeling with neural networks can benefit from uncertainty introduced into their hidden states, namely from including stochastic latent variables in the network architecture. Without such uncertainty in the hidden states, RNN, PixelCNN and WaveNet parameterize the randomness only in the final layer, by shaping an output distribution from a specific distribution family. Hence the output distribution (which is often assumed to be Gaussian for continuous data) is unimodal or a mixture of unimodal distributions given the input data, which may be insufficient to capture the complex true data distribution and to describe the complex correlations among different output dimensions (Boulanger-Lewandowski et al., 2012). Even for the non-parametrized discrete output distribution modeled by the softmax function, a phenomenon referred to as the softmax bottleneck (Yang et al., 2017a) still limits the family of output distributions. By injecting stochastic latent variables into the hidden states and transforming their uncertainty into outputs through non-linear layers, a stochastic neural network is equipped with the ability to model the data with a much richer family of distributions.

Motivated by this, numerous variants of RNN-based stochastic neural networks have been proposed. STORN (Bayer and Osendorfer, 2014) is the first to integrate stochastic latent variables into the hidden states of an RNN. In VRNN (Chung et al., 2015), the prior of the stochastic latent variables is assumed to be a function of the historical data and stochastic latent variables, which allows them to capture temporal dependencies. SRNN (Fraccaro et al., 2016) and Z-forcing (Goyal et al., 2017) offer more powerful versions with augmented inference networks, which better capture the correlation between the stochastic latent variables and the whole observed sequence. By introducing stochasticity into the hidden states, these RNN-based models achieve significant improvements over vanilla RNN models in log-likelihood evaluations on multiple benchmark datasets from various domains (Goyal et al., 2017; Shabanian et al., 2017).

In parallel with RNN, WaveNet (Van Den Oord et al., 2016) provides another powerful way of modeling sequential data with dilated convolutions, especially in the natural speech generation task. While RNN-based models must be trained in a sequential manner, training a WaveNet can be easily parallelized. Furthermore, the parallel WaveNet proposed in (Oord et al., 2017) is able to generate new sequences in parallel. WaveNet, or more generally dilated convolution, has also been adopted as the encoder or decoder in the VAE framework, and produces reasonable results in text (Semeniuta et al., 2017; Yang et al., 2017b) and music (Engel et al., 2017) generation tasks.

In light of the advantage of introducing stochastic latent variables into RNN-based models, it is natural to ask whether this benefit carries over to WaveNet-based models. To this end, in this paper we propose Stochastic WaveNet, which associates stochastic latent variables with every hidden state in the WaveNet architecture. Compared with the vanilla WaveNet, Stochastic WaveNet is able to capture a richer family of data distributions via the added stochastic latent variables. It also inherits the ease of parallel training with dilated convolutions from the WaveNet architecture. Because of the added stochastic latent variables, an inference network is also designed and trained jointly with Stochastic WaveNet to maximize the data log-likelihood. We believe that after model training, the multi-layer structure of the latent variables leads them to reflect both the hierarchical and the sequential structure of the data. This hypothesis is validated empirically by controlling the number of layers of stochastic latent variables.

The rest of this paper is organized as follows: we briefly review the background in Section 2. The proposed model and optimization algorithm are introduced in Section 3. We evaluate and analyze the proposed model on multiple benchmark datasets in Section 4. Finally, the summary of this paper is included in Section 5.

2 Preliminary

2.1 Notation

We first define the mathematical symbols used in the rest of this paper. We denote a set of vectors by a bold symbol, such as $\mathbf{x}$, which may use one or two subscripts as indices, such as $x_i$ or $x_{i,j}$. $f(\cdot)$ represents a general function that transforms an input vector into an output vector, and $f_\theta(\cdot)$ is a neural network function parametrized by $\theta$. For a sequential data sample $\mathbf{x}$, $T$ represents its length.

2.2 Autoregressive Neural Network

An autoregressive network model is designed to model the joint distribution of high-dimensional data with sequential structure, by factorizing the joint distribution of a data sample as

$$p_\theta(\mathbf{x}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}) \qquad (1)$$

where $x_{<t} = [x_1, \dots, x_{t-1}]$, $t$ indexes the time stamps, and $\theta$ represents the model parameters. The autoregressive model can then compute the likelihood of a sample and generate a new data sample in a sequential manner.
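As a small illustration of the chain-rule factorization, the sketch below evaluates the log-likelihood of a sequence and generates a new one step by step. The first-order Gaussian conditional, the coefficient `coef`, and the function names are toy assumptions chosen for brevity; they are not the models discussed in this paper.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log density of a scalar Gaussian N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def ar1_log_likelihood(x, coef=0.9, std=1.0):
    """Chain-rule log-likelihood: sum over t of log p(x_t | x_{<t}).

    Toy assumption: p(x_t | x_{<t}) = N(coef * x_{t-1}, std^2), with x_0 = 0.
    """
    prev = np.concatenate([[0.0], x[:-1]])  # x_{t-1} for every t
    return float(np.sum(gaussian_logpdf(x, coef * prev, std)))

def ar1_sample(length, coef=0.9, std=1.0, seed=0):
    """Generate a sequence one step at a time, each step conditioned on the last."""
    rng = np.random.default_rng(seed)
    x = np.zeros(length)
    prev = 0.0
    for t in range(length):
        x[t] = coef * prev + std * rng.standard_normal()
        prev = x[t]
    return x

x = ar1_sample(100)
print(ar1_log_likelihood(x))
```

Note that the likelihood is computed for all time stamps at once, whereas sampling needs an explicit loop because each step consumes the previous output.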

In order to capture richer stochasticity in the sequential generation process, stochastic latent variables for each time stamp have been introduced, and the resulting models are referred to as stochastic neural networks (Chung et al., 2015; Fraccaro et al., 2016; Goyal et al., 2017). The joint distribution of the data together with the latent variables is then factorized as,

$$p_\theta(\mathbf{x}, \mathbf{z}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, z_{\le t})\, p_\theta(z_t \mid x_{<t}, z_{<t}) \qquad (2)$$

where $\mathbf{z} \in \mathbb{R}^{T \times d}$ has the same sequence length as the data sample and $d$ is its dimension for one time stamp. $\mathbf{z}$ is also generated sequentially, namely the prior of $z_t$ is a conditional probability given $x_{<t}$ and $z_{<t}$.

2.3 WaveNet

WaveNet (Van Den Oord et al., 2016) is a convolutional autoregressive neural network which adopts dilated causal convolutions (Yu and Koltun, 2015) to extract the sequential dependencies in the data distribution. Unlike recurrent neural networks, dilated convolution layers can be computed in parallel during the training process, which makes WaveNet much faster than RNN in modeling sequential data. A typical WaveNet structure is visualized in Figure 1. Besides the computational advantage, WaveNet has shown state-of-the-art results in the speech generation task (Oord et al., 2017).

Figure 1: Visualization of a WaveNet structure from (Van Den Oord et al., 2016)
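To make the dilated causal convolution concrete, here is a minimal NumPy sketch with kernel size 2 and scalar features. The doubling dilation schedule, the unit weights, and the function names are assumptions made for illustration; a real WaveNet layer also uses gated activations and residual connections, which are omitted here.

```python
import numpy as np

def causal_dilated_conv(x, w_past, w_now, dilation):
    """Kernel-size-2 causal convolution on a 1-D signal.

    y[t] = w_past * x[t - dilation] + w_now * x[t], with zeros before the start,
    so y[t] never depends on any later input.
    """
    padded = np.concatenate([np.zeros(dilation), x])
    return w_past * padded[:-dilation] + w_now * x

def wavenet_stack(x, dilations):
    """Apply one causal layer per dilation; every layer handles all time steps at once."""
    h = x
    for d in dilations:
        h = causal_dilated_conv(h, 1.0, 1.0, d)
    return h

def receptive_field(dilations):
    """Number of input steps visible to one output of stacked kernel-size-2 layers."""
    return 1 + sum(dilations)

dilations = [1, 2, 4, 8]                 # assumed doubling schedule
x = np.arange(16, dtype=float)
h = wavenet_stack(x, dilations)
print(receptive_field(dilations))        # 16
```

Each layer is a single vectorized operation over the whole sequence, which is the source of the parallel-training advantage over a recurrent update.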

3 Stochastic WaveNet

In this section, we introduce a sequential generative model (Stochastic WaveNet), which combines stochastic latent variables with the multi-layer dilated convolution structure. We first introduce the generation process of Stochastic WaveNet, and then describe the variational inference method.

3.1 Generative Model

Similar to stochastic recurrent neural networks, we inject a stochastic latent variable into each WaveNet hidden node in the generation process, which is illustrated in Figure 2(a). More specifically, for a sequential data sample $\mathbf{x}$ with length $T$, we introduce a set of stochastic latent variables $\mathbf{z} = \{z_{t,l} \mid 1 \le t \le T,\ 1 \le l \le L\}$, where $L$ is the number of layers of the WaveNet architecture. Then the generation process can be described as,

$$p_\theta(\mathbf{x}, \mathbf{z}) = \prod_{t=1}^{T} \Big[\, p_\theta(x_t \mid z_{\le t,1:L}, x_{<t}) \prod_{l=1}^{L} p_\theta(z_{t,l} \mid z_{t,<l}, z_{<t,1:L}, x_{<t}) \Big] \qquad (3)$$

The generation process can be interpreted as follows. At each time stamp $t$, we sample the stochastic latent variables $z_{t,l}$ from a prior distribution that is conditioned on the lower-level latent variables $z_{t,<l}$ and on the historical records, including the data samples $x_{<t}$ and the latent variables $z_{<t,1:L}$. Then we sample the new data sample $x_t$ according to all sampled latent variables and historical records. Through this process, new sequential data samples are generated in a recursive way.

In Stochastic WaveNet, the prior distribution $p_\theta(z_{t,l} \mid z_{t,<l}, z_{<t,1:L}, x_{<t}) = \mathcal{N}(z_{t,l}; \mu_{t,l}, v_{t,l})$ is defined as a Gaussian distribution with a diagonal covariance matrix. The sequential and hierarchical dependencies among the latent variables are modeled by the WaveNet architecture. In order to summarize all historical information, we introduce two stochastic hidden variables $h$ and $d$, which are calculated as,

$$h_{t,l} = f_{\theta_1}(d_{t-2^{l},\,l-1},\ d_{t,\,l-1}), \qquad d_{t,l} = f_{\theta_2}(h_{t,l},\ z_{t,l}) \qquad (4)$$

where $f_{\theta_1}$ mimics the design of the dilated convolution in WaveNet, and $f_{\theta_2}$ is a fully connected layer that summarizes the hidden states and the sampled latent variable. Different from the vanilla WaveNet, the hidden states $h$ are stochastic because of the random samples $z$. We parameterize the mean and variance of the prior distributions by the hidden representations $h$, which is $\mu_{t,l} = f_{\theta_3}(h_{t,l})$ and $\log v_{t,l} = f_{\theta_4}(h_{t,l})$. Similarly, we parameterize the emission probability $p_\theta(x_t \mid z_{\le t,1:L}, x_{<t})$ as a neural network function over the hidden representations.
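The generation process described in this section can be sketched as a loop over time stamps and layers. Everything specific below is an assumption made to keep the example small and runnable: the random untrained weights, the `tanh` nonlinearities, the dilation schedule 1, 2, 4, ..., the use of the previous observation as the bottom-layer input, and the unit-variance Gaussian emission. It is a toy illustration of the sampling order, not the authors' implementation.

```python
import numpy as np

def sample_sequence(T=8, L=3, dx=2, dh=4, dz=3, seed=0):
    """Ancestral sampling: for each t, sample z bottom-up through the layers, then x_t."""
    rng = np.random.default_rng(seed)

    def mat(rows, cols):
        return 0.3 * rng.standard_normal((rows, cols))

    # Hypothetical toy parameters, one set per layer.
    params = []
    for l in range(1, L + 1):
        d_in = dx if l == 1 else dh
        params.append({
            "past": mat(dh, d_in), "now": mat(dh, d_in),  # dilated-convolution-like step
            "mu": mat(dz, dh), "logvar": mat(dz, dh),      # prior mean and log-variance heads
            "h": mat(dh, dh), "z": mat(dh, dz),            # fully connected merge of h and z
        })
    w_out = mat(dx, dh)

    x = np.zeros((T, dx))
    z = np.zeros((T, L, dz))
    # d[l][t] is the layer-l summary at time t; layer 0 holds the previous observation.
    d = [np.zeros((T, dx))] + [np.zeros((T, dh)) for _ in range(L)]

    for t in range(T):
        d[0][t] = x[t - 1] if t > 0 else np.zeros(dx)
        for l in range(1, L + 1):
            p = params[l - 1]
            dil = 2 ** (l - 1)  # assumed dilation of layer l
            past = d[l - 1][t - dil] if t - dil >= 0 else np.zeros_like(d[l - 1][t])
            h = np.tanh(p["past"] @ past + p["now"] @ d[l - 1][t])
            mu, logvar = p["mu"] @ h, p["logvar"] @ h
            z[t, l - 1] = mu + np.exp(0.5 * logvar) * rng.standard_normal(dz)
            d[l][t] = np.tanh(p["h"] @ h + p["z"] @ z[t, l - 1])
        # Emission from the top-layer summary (unit variance is an assumption).
        x[t] = w_out @ d[L][t] + rng.standard_normal(dx)
    return x, z

x, z = sample_sequence()
print(x.shape, z.shape)  # (8, 2) (8, 3, 3)
```

The prior parameters at each node depend only on hidden states that are already available, so sampling proceeds strictly forward in time and upward through the layers.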

3.2 Variational Inference for Stochastic WaveNet

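Instead of directly maximizing the log-likelihood of a sequential sample $\mathbf{x}$, we optimize its variational evidence lower bound (ELBO) (Jordan et al., 1999). Exact posterior inference of the stochastic variables of Stochastic WaveNet is intractable. Hence, we describe a variational inference method for Stochastic WaveNet that uses the reparameterization trick introduced in (Kingma and Welling, 2013). First, we write the ELBO as,

$$\log p_\theta(\mathbf{x}) \ge \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\big[\log p_\theta(\mathbf{x},\mathbf{z}) - \log q_\phi(\mathbf{z}\mid\mathbf{x})\big]$$

$$= \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\Big[\sum_{t=1}^{T}\log p_\theta(x_t\mid z_{\le t,1:L}, x_{<t}) + \sum_{t=1}^{T}\sum_{l=1}^{L}\log p_\theta(z_{t,l}\mid z_{t,<l}, z_{<t,1:L}, x_{<t}) - \log q_\phi(\mathbf{z}\mid\mathbf{x})\Big] = \mathcal{L}(\mathbf{x})$$

We can derive the second equation by substituting Eq. 3 into the first equation, and $\mathcal{L}(\mathbf{x})$ denotes the loss function for the sample $\mathbf{x}$. The first term is the reconstruction term, and the remaining two terms together form the negative KL divergence between the posterior and the prior. The remaining problem is how to define the posterior distribution $q_\phi(\mathbf{z}\mid\mathbf{x})$. In order to maximize the ELBO, we factorize the posterior as,

$$q_\phi(\mathbf{z}\mid\mathbf{x}) = \prod_{t=1}^{T}\prod_{l=1}^{L} q_\phi(z_{t,l}\mid \mathbf{x}, z_{t,<l}, z_{<t,1:L})$$

Here the posterior distribution for $z_{t,l}$ is conditioned on the stochastic latent variables sampled before it and on the entire observed data $\mathbf{x}$. By utilizing the future data $x_{\ge t}$, we can better maximize the first term in $\mathcal{L}(\mathbf{x})$, the reconstruction loss term. In contrast, the prior distribution of $z_{t,l}$ is only conditioned on $x_{<t}$, so encoding information about $x_{\ge t}$ may increase the degree of mismatch between the prior and posterior distributions, namely enlarging the KL term in the loss function.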
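The two ingredients of this objective, the reparameterized sample and the KL divergence between diagonal Gaussians, have simple closed forms. The sketch below shows them for a single latent vector together with a one-sample ELBO estimate. The function names and the `decode` callable are hypothetical placeholders; in the full model the same computation would be repeated for every time stamp and layer.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is differentiable in mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians q and p, summed over dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def gaussian_log_prob(x, mean, logvar):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return -0.5 * np.sum(np.log(2.0 * np.pi) + logvar + (x - mean) ** 2 / np.exp(logvar))

def single_sample_elbo(x, mu_q, logvar_q, mu_p, logvar_p, decode, rng):
    """One-sample estimate: reconstruction log-likelihood minus the analytic KL term."""
    z = reparameterize(mu_q, logvar_q, rng)
    mean_x, logvar_x = decode(z)  # hypothetical emission network
    return gaussian_log_prob(x, mean_x, logvar_x) - kl_diag_gaussians(
        mu_q, logvar_q, mu_p, logvar_p
    )

rng = np.random.default_rng(0)
decode = lambda z: (z.sum() * np.ones(2), np.zeros(2))  # toy stand-in for a decoder
print(single_sample_elbo(np.zeros(2), np.zeros(3), np.zeros(3),
                         np.zeros(3), np.zeros(3), decode, rng))
```

When the posterior equals the prior the KL term vanishes, which is why a posterior that encodes future data pays a price in this term.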

Exploring the dependency structure in WaveNet. However, by analyzing the dependencies among the outputs and hidden states of WaveNet, we find that the stochastic latent variables at time stamp $t$, $z_{t,1:L}$, do not influence all of the posterior outputs. So the inference network only requires a subset of the posterior outputs to maximize the reconstruction loss term in the loss function. Denote the set of outputs that are influenced by $z_{t,l}$ as $x_{f(t,l)}$. The posterior distribution can be modified as,

$$q_\phi(\mathbf{z}\mid\mathbf{x}) = \prod_{t=1}^{T}\prod_{l=1}^{L} q_\phi(z_{t,l}\mid x_{<t}, x_{f(t,l)}, z_{t,<l}, z_{<t,1:L})$$

The modified posterior distribution removes unnecessary conditioning variables, which makes the optimization more efficient. To summarize the information from these posterior outputs, we design a reversed WaveNet architecture to compute a hidden feature, illustrated in Figure 2(b), which is formulated as,


where we define that , and and is the dilated convolution layer, whose structure is a reverse version of in Eq.3. Finally, we inference the posterior distribution by and , which is and . Here, we reuse the stochastic hidden states in the generative model in order to compress the number of the model parameters.
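The idea that a latent variable only reaches a subset of the outputs can be checked by walking the dilated connections upward from a given node. The sketch below does this under stated assumptions: layer `m` is taken to read the layer below at offsets 0 and `2 ** (m - 1)`, outputs are identified with top-layer time stamps, and only direct hidden-state connections are followed (the feedback of sampled outputs into later inputs is ignored). The function name and indexing are illustrative and may differ from the paper's convention.

```python
def influenced_outputs(t, l, num_layers, seq_len):
    """Time stamps of top-layer nodes reachable from the node at time t, layer l."""
    times = {t}
    for upper in range(l + 1, num_layers + 1):
        dilation = 2 ** (upper - 1)  # assumed dilation of layer `upper`
        # A node at time s feeds the next layer at time s and at time s + dilation.
        times |= {s + dilation for s in times}
    return sorted(s for s in times if s < seq_len)

print(influenced_outputs(0, 1, 3, 16))  # [0, 2, 4, 6]
print(influenced_outputs(0, 3, 3, 16))  # [0]
```

Under these assumptions a node at layer `l` reaches at most `2 ** (num_layers - l)` outputs, so higher layers need less future context, which is what a reversed, top-down network can exploit.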

KL Annealing Trick. It is well known that deep neural networks with multiple layers of stochastic latent variables are difficult to train. One important reason is that the KL term in the loss function limits the capacity of the stochastic latent variables to compress the data information in the early stages of training. KL annealing is a common trick to alleviate this issue. The objective function is redefined as,

$$\mathcal{L}_\lambda(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\Big[\sum_{t=1}^{T}\log p_\theta(x_t\mid z_{\le t,1:L}, x_{<t})\Big] - \lambda\,\mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\Big[\log q_\phi(\mathbf{z}\mid\mathbf{x}) - \sum_{t=1}^{T}\sum_{l=1}^{L}\log p_\theta(z_{t,l}\mid z_{t,<l}, z_{<t,1:L}, x_{<t})\Big]$$

During the training process, $\lambda$ is annealed from 0 to 1. In previous works, researchers usually adopt a linear annealing strategy (Fraccaro et al., 2016; Goyal et al., 2017). In our experiments, we find that it still increases too fast for Stochastic WaveNet. We propose to use a cosine annealing strategy instead, namely $\lambda$ follows the function $\lambda_\eta = 1 - \cos(\eta)$, where $\eta$ scans from $0$ to $\frac{\pi}{2}$.
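To compare the two annealing strategies, the sketch below implements a linear schedule and one cosine-shaped schedule for the KL weight. The cosine form used here, `1 - cos(eta)` with `eta` moving from 0 to pi/2 over `total_steps`, is an assumption about the exact function, and the function names and step counts are illustrative.

```python
import math

def linear_kl_weight(step, total_steps):
    """KL weight that rises linearly from 0 to 1 and then stays at 1."""
    return min(1.0, step / total_steps)

def cosine_kl_weight(step, total_steps):
    """KL weight 1 - cos(eta), with eta moving from 0 to pi/2 (assumed form)."""
    eta = 0.5 * math.pi * min(1.0, step / total_steps)
    return 1.0 - math.cos(eta)

for step in (0, 25, 50, 75, 100):
    print(step, linear_kl_weight(step, 100), round(cosine_kl_weight(step, 100), 3))
```

The cosine-shaped weight stays below the linear one throughout the warm-up, so the KL penalty is applied more gently in the early stages of training.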

(a) Generative Model
(b) Inference Model
Figure 2: A toy example of Stochastic WaveNet with two layers. The left panel is the generative model, and the right panel is the inference model. Both the solid lines and the dashed lines represent neural network functions. The $h$ in the right panel is identical to the one in the left. The $z$ in the generative model are sampled from the prior distributions, and the ones in the inference model are sampled from the posterior.

4 Experiment

In this section, we evaluate the proposed Stochastic WaveNet on several benchmark datasets from various domains, including natural speech, human handwriting and human motion modeling tasks. We show that Stochastic WaveNet, or SWaveNet for short, achieves state-of-the-art results, and we visualize the generated samples for the human handwriting domain. The experiment code is publicly accessible.

Baselines. The following sequential generative models proposed in recent years are treated as the baseline methods:

  • RNN: The original recurrent neural network with the LSTM cell.

  • VRNN: The generative model with the recurrent structure proposed in (Chung et al., 2015). It first formulates the prior distribution of $z_t$ as a conditional probability given the historical data $x_{<t}$ and latent variables $z_{<t}$.

  • SRNN: Proposed in (Fraccaro et al., 2016). It augments the inference network with a backward RNN to better optimize the ELBO.

  • Z-forcing: Proposed in (Goyal et al., 2017). Its architecture is similar to SRNN, and it eases the training of the stochastic latent variables by adding an auxiliary cost which forces the model to use the stochastic latent variables to reconstruct the future data $x_{>t}$.

  • WaveNet: Proposed in (Van Den Oord et al., 2016). It produces state-of-the-art results in the speech generation task.

We evaluate different models by comparing the log-likelihood on the test set (RNN, WaveNet) or its lower bound (VRNN, SRNN, Z-forcing and our method). For a fair comparison, a multivariate Gaussian distribution with a diagonal covariance matrix is used as the output distribution for each time stamp in all experiments. The Adam optimizer (Kingma and Ba, 2014) is used for all models, and the learning rate is scheduled by cosine annealing. Following the experiment setting in (Fraccaro et al., 2016), we use one sample to approximate the variational evidence lower bound in order to reduce the computation cost.

4.1 Natural Speech Modeling

In the natural speech modeling task, we train the model to fit the log-likelihood function of the raw audio signals, following the experiment setting in (Fraccaro et al., 2016; Goyal et al., 2017). The raw signals, which correspond to real-valued amplitudes, are represented as a sequence of 200-dimensional frames. Each frame consists of 200 consecutive samples. The preprocessing and dataset segmentation are identical to (Fraccaro et al., 2016; Goyal et al., 2017). We evaluate the proposed model on the following benchmark datasets:

  • Blizzard (Prahallad et al., 2013): The Blizzard Challenge 2013, which is a text-to-speech dataset containing 300 hours of English from a single female speaker.

  • TIMIT: The TIMIT raw audio dataset, which contains 6,300 English sentences read by 630 speakers.

For the Blizzard dataset, we report the average log-likelihood over the half-second segments of the test set. For the TIMIT dataset, we report the average log-likelihood over each sequence of the test set, following the setting in (Fraccaro et al., 2016; Goyal et al., 2017). In this task, we use a 5-layer SWaveNet architecture with 1024 hidden dimensions for Blizzard and 512 for TIMIT. The dimension of the stochastic latent variables is 100 for both datasets.
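The framing of raw audio into 200-dimensional frames described above amounts to a reshape. The sketch below shows one way to do it; dropping the trailing samples that do not fill a whole frame is an assumption, and the function name is illustrative rather than taken from the referenced preprocessing.

```python
import numpy as np

def frame_signal(signal, frame_size=200):
    """Split a 1-D signal into consecutive, non-overlapping frames of frame_size samples."""
    n_frames = len(signal) // frame_size
    # Assumption: samples that do not fill a whole frame are discarded.
    return signal[: n_frames * frame_size].reshape(n_frames, frame_size)

audio = np.arange(450, dtype=float)   # stand-in for a raw waveform
frames = frame_signal(audio)
print(frames.shape)                   # (2, 200)
```

Each row then plays the role of one time stamp of the sequence model, so a sequence of `T` frames covers `200 * T` raw samples.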

Method                  Blizzard   TIMIT
RNN                     7413       26643
VRNN
SRNN
Z-forcing(+kla)
Z-forcing(+kla,aux)*

WaveNet                 -5777      26074
SWaveNet

Table 1: Test set log-likelihoods on the natural speech modeling task. The first group contains the RNN-based models, and the second group contains the WaveNet-based models. Best results are highlighted in bold. * denotes that the training objective is equipped with an auxiliary term which the other methods do not have. For SWaveNet, we report the mean (standard deviation) over 10 different runs.

The experiment results are shown in Table 1. The proposed model produces the best result on both datasets. Since the performance gap is not large, we also report the variance of the proposed model's performance by rerunning the model with 10 random seeds, which shows that the performance is consistent. Compared with the WaveNet model, which has no stochastic hidden states, SWaveNet gets a significant performance boost. At the same time, SWaveNet still enjoys the advantage of parallel training compared with RNN-based stochastic models. One common concern about SWaveNet is that it may require a larger total dimension of stochastic latent variables than RNN-based models due to its multi-layer structure. However, the total dimension of the stochastic latent variables for one time stamp of SWaveNet is 500, which is twice the number used in the SRNN and Z-forcing papers (Fraccaro et al., 2016; Goyal et al., 2017). We will further discuss the relationship between the number of stochastic layers and the model performance in Section 4.3.

(a) Ground Truth (b) RNN (c) VRNN (d) SWaveNet
Figure 3: Generated handwriting sample: (a) are the samples from the ground truth data. (b) (c), (d) are from RNN, VRNN and SWaveNet, respectively. Each line is one handwriting sample.

4.2 Handwriting and Human Motion Generation

Next, we evaluate the proposed model by visualizing samples generated from the trained model. The domain we choose is human handwriting, whose writing tracks are described by a sequence of sample points. The following dataset is used to train the generative model:

IAM-OnDB (Liwicki and Bunke, 2005): This human handwriting dataset contains 13,040 handwriting lines written by 500 writers. The writing trajectories are represented as a sequence of 3-dimensional frames. Each frame is composed of two real-valued numbers, which are the coordinates of the sample point, and a binary number indicating whether the pen is touching the paper. The data preprocessing and division are the same as in (Graves, 2013; Chung et al., 2015).

Method IAM-OnDB
RNN (Chung et al., 2015) 1016
VRNN (Chung et al., 2015)
WaveNet 1021
Table 2: Log-likelihood results on the IAM-OnDB dataset. The best result is highlighted in bold.

The quantitative results are reported in Table 2. SWaveNet achieves a result similar to the best one, and still shows a significant improvement over the vanilla WaveNet architecture. In Figure 3, we plot the ground truth samples and samples randomly generated from the different models. Compared with RNN and VRNN, SWaveNet produces clearer results. It is easy to distinguish the boundaries of the characters, and we can observe that more of them resemble English characters, such as “is” in the fourth line and “her” in the last line.

4.3 Influence of Stochastic Latent Variables

The most prominent distinction between SWaveNet and RNN-based stochastic neural networks is that SWaveNet uses dilated convolution layers to model multi-layer stochastic latent variables, rather than the single layer of latent variables in the RNN models. Here, we perform an empirical study of the number of stochastic layers in the SWaveNet model to demonstrate the effectiveness of the multi-layer design. The experiment is designed as follows. First, we retain the total number of layers and only change the number of stochastic layers, namely the layers that contain stochastic latent variables. More specifically, for a SWaveNet with $L$ layers and $S$ stochastic layers, $S \le L$, we eliminate the stochastic latent variables in the bottom part, which is $z_{t,l}$ with $l \le L - S$ in Eq. 4. Then for each time stamp, when the model has $D$-dimensional stochastic variables in total, each layer would have $D/S$-dimensional stochastic variables. In this experiment, we set .
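The allocation rule in this experiment, a fixed total latent dimension split evenly over the stochastic layers with the bottom layers made deterministic, can be written in a few lines. The function name is hypothetical, and the numbers in the usage example (5 layers, total dimension 500) are borrowed from the speech experiments purely for illustration; the total used in this ablation is not stated in the text above.

```python
def stochastic_layer_dims(total_layers, stochastic_layers, total_latent_dim):
    """Latent size of each layer, bottom to top, when only the top layers are stochastic."""
    if not 1 <= stochastic_layers <= total_layers:
        raise ValueError("stochastic_layers must be between 1 and total_layers")
    if total_latent_dim % stochastic_layers != 0:
        raise ValueError("total_latent_dim must divide evenly across stochastic layers")
    per_layer = total_latent_dim // stochastic_layers
    deterministic = total_layers - stochastic_layers
    # Bottom layers carry no latent variables; the remaining layers share the budget.
    return [0] * deterministic + [per_layer] * stochastic_layers

print(stochastic_layer_dims(5, 2, 500))  # [0, 0, 0, 250, 250]
print(stochastic_layer_dims(5, 5, 500))  # [100, 100, 100, 100, 100]
```

Because the total is held fixed, adding stochastic layers shrinks the latent size of each one, which is the trade-off examined in this section.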

We plot the experiment results in Figure 4. From the plots, we find that SWaveNet achieves better performance with multiple stochastic layers. This demonstrates that it is helpful to encode the stochastic latent variables with a hierarchical structure. In the experiments on Blizzard and IAM-OnDB, we also observe that the performance decreases when the number of stochastic layers becomes large enough. The reason is that, with too many stochastic layers, each layer has too few latent variables to memorize valuable information at its level of the hierarchy.

We also study how the model performance is influenced by the number of stochastic latent variables. As in the previous experiment, we only tune the total number of stochastic latent variables and keep the remaining settings unchanged, using 4 stochastic layers. The results are plotted in Figure 5. They demonstrate that Stochastic WaveNet benefits from even a small number of stochastic latent variables.

(a) Blizzard
(c) IAM-OnDB
Figure 4: The influence of the number of stochastic layers of SWaveNet.
(a) Blizzard
(c) IAM-OnDB
Figure 5: The influence of the number of stochastic latent variables of SWaveNet.

5 Conclusion

In this paper, we present a novel generative latent variable model for sequential data, named Stochastic WaveNet, which injects stochastic latent variables into the hidden states of WaveNet. A new inference network structure is designed based on the characteristics of the WaveNet architecture. Empirical results show state-of-the-art performance in various domains by leveraging the additional stochastic latent variables. At the same time, the training process of Stochastic WaveNet is greatly accelerated by parallel computation compared with RNN-based models. For future work, a potential research direction is to adapt the advanced training strategies (Goyal et al., 2017; Shabanian et al., 2017) designed for sequential stochastic neural networks to Stochastic WaveNet.