I. Introduction
Sequential data modeling, especially on high-dimensional data such as speech and music, has long been a challenge in machine learning. Real-world applications often require the sequential model to learn the true data distribution and generate data points with variation, i.e., a generative sequential model. Historically, Dynamic Bayesian Networks (DBNs) such as the Hidden Markov Model (HMM) have been widely studied and used for sequential data modeling. Training these models does not require optimizing model parameters in a high-dimensional space. As available computing power has increased, there has been a resurgence of interest in using recurrent neural networks (RNNs) for modeling sequential data. The original RNN has displaced DBNs in sequential data modeling. However, the original RNN has a deterministic state transition structure, in contrast to the random-variable hidden states of DBNs. As real-world sequential data often come with random transitions between the adjacent underlying states of the observations, introducing more randomness into RNNs has been widely studied. There are different ways of introducing extra randomness into the original RNN structure. Latent variables are hidden variables in the sequential model with a certain probability distribution. Introducing latent variables into the original RNN has proven to be an efficient way to add randomness and improve the RNN's ability to learn the distribution of sequential data.
Variational Recurrent Neural Network (VRNN) is a sequential latent variable model that includes latent states in the transition structure of the deterministic RNN [chung2015recurrent]. Due to the random variables introduced into the RNN, training the network with gradient descent on the log-likelihood objective becomes difficult, as the objective function is intractable. Estimators based on different variational bounds for the sequential latent variable model have been proposed and studied. The Evidence Lower Bound (ELBO) has produced state-of-the-art results in maximum likelihood estimation (MLE) for latent variable models with a sequential structure. Importance Weighted Autoencoders (IWAE) and filtering variational objectives (FIVOs) are extensions of ELBO, defined by a particle filter's estimator of the marginal log-likelihood on sequential observations.
ELBO originates from the study of the variational autoencoder (VAE) [kingma2013auto], [kingma2019introduction], and was later extended to sequential latent variable models. VAE has emerged as a popular approach for learning complicated data distributions. Specifically, let x denote an observed random variable in a high-dimensional space, and let z denote a latent random variable involved in the process of generating the observations. VAE maximizes a lower bound, the ELBO, of the marginal log-likelihood:
log p(x) ≥ −D_KL( q(z|x) ‖ p(z) ) + E_{q(z|x)}[ log p(x|z) ]   (1)
Here q(z|x) is a variational approximation of the posterior p(z|x). We can interpret this lower bound as reconstruction plus regularization. The second term of the bound is the reconstruction of x under the approximate posterior distribution over the latent variable z, and the first term, a Kullback-Leibler (KL) divergence, regularizes the reconstructed distribution by imposing the prior distribution p(z) on the inferred posterior q(z|x).
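As a concrete illustration, the bound in (1) can be estimated with a single reparameterized sample when q(z|x) is a diagonal Gaussian and the prior is standard normal. The sketch below is a minimal hypothetical example (a unit-variance Gaussian decoder is assumed; all names are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def elbo_estimate(x, mu_q, logvar_q, decode):
    """Single-sample Monte Carlo estimate of the bound in (1):
    E_q[log p(x|z)] - KL(q(z|x) || p(z)), with a unit-variance Gaussian decoder."""
    eps = rng.standard_normal(mu_q.shape)
    z = mu_q + np.exp(0.5 * logvar_q) * eps      # reparameterized sample z ~ q(z|x)
    mu_x = decode(z)                             # decoder returns the mean of p(x|z)
    log_px_z = -0.5 * np.sum((x - mu_x) ** 2 + np.log(2 * np.pi))
    return log_px_z - gaussian_kl(mu_q, logvar_q)

# toy usage with a hypothetical identity decoder
x = rng.normal(size=4)
bound = elbo_estimate(x, mu_q=np.zeros(4), logvar_q=np.zeros(4), decode=lambda z: z)
```

Note that when the approximate posterior equals the prior, the KL term is exactly zero and only the reconstruction term remains.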
ELBO is a variational objective taking a variational posterior as an argument. In [maddison2017filtering], the important question of whether variational bounds like ELBO recover the marginal log-likelihood at their optimum is studied. It shows that gradient descent optimization on such variational bounds cannot recover the marginal log-likelihood of the data observations at the optimum, and the sharpness of ELBO is not guaranteed by existing methods. With these considerations, the performance of state-of-the-art latent variable models can be improved by a better regularization method.
ELBO has been extensively used in sequential latent variable models such as VRNN by introducing a factorization of the sequence distributions. Our work focuses on improving the regularization of the reconstructed sequential data distribution by introducing adversarial neural networks into the model. We propose a model called Adversarial Regularized Variational RNN (AVRNN) that achieves the optimum at its optimal posterior approximation, which has not been achieved by existing methods. For experiments, we train the AVRNN model on sequential speech data and show superior performance.
II. Background
II-A Sequence Modeling with Recurrent Neural Networks
RNN is a family of neural network structures widely used in modeling sequential data. It has a basic cell unit that consumes the input sequence recursively. Each data point x_t is processed with the same set of trainable cell parameters while the network maintains an internal hidden state h_t. At each timestep t, the RNN reads x_t, updates its hidden state h_t, and produces an output based on the new hidden state. h_t is updated by
h_t = f_θ( h_{t−1}, x_t )   (2)
where f_θ is a deterministic nonlinear transition function represented by the RNN cell unit, and θ is the parameter set of the cell unit. State-of-the-art RNN cells such as long short-term memory (LSTM) and the gated recurrent unit (GRU) are implemented with gated activation functions [hochreiter1997long], [cho2014learning]. An output function maps the RNN hidden state to a probability distribution over the output data:

p(x_t | x_{<t}) = g_τ( h_{t−1} )   (3)
where g_τ is the output function, represented by the output gate in the RNN cell, which returns the parameters of the output data distribution conditioned on h_{t−1}. The joint sequence probability distribution is factorized into a product of these conditional probabilities:
p(x_1, …, x_T) = ∏_{t=1}^{T} p(x_t | x_{<t})   (4)
The choice of output distribution model depends on the observations. A Gaussian mixture model (GMM) is a common choice for modeling high-dimensional sequences. The internal transition of the original RNN model is entirely deterministic, and the variability and randomness of the observed sequences are modeled only by the conditional output probability density.
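A minimal numerical sketch of this deterministic factorization, assuming a tanh cell and a unit-variance Gaussian output density (all parameter names and sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h = 2, 8
# hypothetical cell parameters theta = (W_x, W_h, b) and output map (W_o, b_o)
W_x, W_h, b = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
W_o, b_o = rng.normal(size=(d_x, d_h)), np.zeros(d_x)

def cell(h, x):
    """Deterministic transition h_t = f_theta(h_{t-1}, x_t), cf. (2)."""
    return np.tanh(W_x @ x + W_h @ h + b)

def seq_log_likelihood(xs):
    """log p(x_1..x_T) = sum_t log p(x_t | x_<t), cf. (3)-(4), with a
    unit-variance Gaussian output density parameterized from h_{t-1}."""
    h, total = np.zeros(d_h), 0.0
    for x in xs:
        mu = W_o @ h + b_o                                  # output parameters g(h_{t-1})
        total += -0.5 * np.sum((x - mu) ** 2 + np.log(2 * np.pi))
        h = cell(h, x)                                      # deterministic state update
    return total

xs = rng.normal(size=(5, d_x))
ll = seq_log_likelihood(xs)
```

All randomness here lives in the output density; the hidden-state trajectory is a deterministic function of the inputs, which is exactly the limitation the latent variable models below address.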
The need for introducing more variability into original RNN models has been previously noted. Sequential latent variable models (SLVMs) based on RNN, such as STORN and VRNN, have been shown to produce better performance for modeling highly variable sequential data such as speech and music [chung2015recurrent], [bayer2014learning]. These RNN-based SLVMs introduce latent variables into the transition structure. At each timestep t, the transition function computes the next hidden state based on both the previous hidden state h_{t−1} and the latent variable z_t. STORN and VRNN generate the latent variable z_t in different ways: STORN models z_t as a sequence of independent random variables, while VRNN makes the prior distribution of z_t depend on all preceding inputs via the previous RNN hidden state h_{t−1}. VRNN can be interpreted as containing a VAE at every timestep. The generative model of VRNN is factorized as:
p(x_{≤T}, z_{≤T}) = ∏_{t=1}^{T} p(x_t | z_{≤t}, x_{<t}) p(z_t | x_{<t}, z_{<t})   (5)
The inference of the approximate posterior in VRNN is factorized as:
q(z_{≤T} | x_{≤T}) = ∏_{t=1}^{T} q(z_t | x_{≤t}, z_{<t})   (6)
II-B Evidence Lower Bound in Sequential Models
The approach to training RNN models with latent variables is inspired by, and similar to, that of the standard VAE. At each timestep, a VRNN model performs several operations:

1) Transition: compute the prior distribution of z_t conditioned on the previous hidden state h_{t−1}.

2) Emission: generate x_t from the generating distribution conditioned on both z_t and h_{t−1}.

3) Update: run the deterministic RNN for one step to update the hidden state, taking the previous hidden state h_{t−1}, the latent state z_t, and the observation x_t.

4) Proposal: run inference of the approximate posterior distribution of z_t, using the mean from the prior and taking x_t and h_{t−1} as input.
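The four operations above can be sketched in a single function, assuming diagonal Gaussian transition and proposal distributions with simple linear parameter maps (all shapes, names, and parameterizations are hypothetical; a real implementation would use learned deep networks):

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_z, d_h = 2, 2, 6

def linear(d_out, d_in):  # hypothetical tiny parameter factory
    return rng.normal(scale=0.1, size=(d_out, d_in))

W_tr, W_ps = linear(2 * d_z, d_h), linear(2 * d_z, d_h + d_x)
W_em = linear(d_x, d_h + d_z)
W_hh, W_hz, W_hx = linear(d_h, d_h), linear(d_h, d_z), linear(d_h, d_x)

def vrnn_step(h, x):
    """One VRNN-style timestep: transition, proposal, emission, RNN update."""
    # 1) transition: prior p(z_t | h_{t-1}) as a diagonal Gaussian
    mu_p, logvar_p = np.split(W_tr @ h, 2)
    # 4) proposal: approximate posterior q(z_t | x_t, h_{t-1})
    mu_q, logvar_q = np.split(W_ps @ np.concatenate([h, x]), 2)
    z = mu_q + np.exp(0.5 * logvar_q) * rng.standard_normal(d_z)   # reparameterized
    # 2) emission: parameters of p(x_t | z_t, h_{t-1})
    mu_x = W_em @ np.concatenate([h, z])
    # 3) deterministic RNN update h_t = f(h_{t-1}, z_t, x_t)
    h_next = np.tanh(W_hh @ h + W_hz @ z + W_hx @ x)
    return h_next, (mu_p, logvar_p), (mu_q, logvar_q), mu_x

h, x = np.zeros(d_h), rng.normal(size=d_x)
h1, prior, post, mu_x = vrnn_step(h, x)
```

During training the posterior sample z feeds both the emission and the state update; at generation time the prior would be sampled instead.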
Just like in VAE, the generating (emission) and inference (proposal) networks are trained jointly by maximizing the variational lower bound with respect to their parameters [chung2015recurrent]. Applying concavity and Jensen's inequality to the joint log probability of the whole data sequence, we have:
log p(x_{≤T}) = log E_{q(z_{≤T} | x_{≤T})} [ p(x_{≤T}, z_{≤T}) / q(z_{≤T} | x_{≤T}) ]   (7)

≥ E_{q(z_{≤T} | x_{≤T})} [ log p(x_{≤T}, z_{≤T}) − log q(z_{≤T} | x_{≤T}) ]   (8)

= E_{q(z_{≤T} | x_{≤T})} [ ∑_{t=1}^{T} ( log p(x_t | z_{≤t}, x_{<t}) + log p(z_t | x_{<t}, z_{<t}) − log q(z_t | x_{≤t}, z_{<t}) ) ]   (9)
The ELBO in the variational RNN then becomes the factorized variational lower bound:
L_ELBO = E_{q(z_{≤T} | x_{≤T})} [ ∑_{t=1}^{T} ( log p(x_t | z_{≤t}, x_{<t}) − D_KL( q(z_t | x_{≤t}, z_{<t}) ‖ p(z_t | x_{<t}, z_{<t}) ) ) ]   (10)
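When both the per-timestep proposal and prior are diagonal Gaussians, as is common in VRNN-style models, each KL term in (10) has a closed form. The helper below is a hypothetical sketch, not the paper's implementation:

```python
import numpy as np

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, diag e^{logvar_q}) || N(mu_p, diag e^{logvar_p}) ),
    the per-timestep regularization term in the factorized bound (10)."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

# identical proposal and prior give exactly zero regularization
kl_zero = diag_gaussian_kl(np.zeros(2), np.zeros(2), np.zeros(2), np.zeros(2))
```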
Similarly, this bound has a reconstruction term and a regularization term in the form of a Kullback-Leibler divergence [kullback1997information].

III. Related Work
As aforementioned, the early work on using latent variables in generative models is the VAE [kingma2013auto]. This approach has been successful in solving unsupervised and semi-supervised learning problems [walker2016uncertain], [abbasnejad2017infinite]. Different encoder and decoder neural networks for the VAE have been studied to improve performance on these problems [pu2016variational]. Related works on VAE show its limitations in approximating posterior distributions. Reference [burda2015importance] uses multiple samples to approximate the posterior, giving the model increased flexibility and improving the capability of VAE in modeling complex distributions. In [sonderby2016ladder], a data-dependent approximate likelihood is studied to correct the generative distribution.

RNN has been widely used in sequential data modeling, and the latent variable approach has been extended to sequential generative models with RNN as the base model [chung2015recurrent, maddison2017filtering, fraccaro2016sequential].
The latent variables bring additional variance and randomness to the RNN and make the model more expressive. Different ways of using RNNs for structured variational approximation in sequential latent variable models are studied in [krishnan2017structured]; the differences mainly concern the dependency structure of the conditional posterior distributions.

After GAN was introduced for training generative models, the GAN framework has been adopted and improved in related research [goodfellow2014generative]. To use GAN in sequential model training, the generator generally needs to be a conditional distribution model. In [mirza2014conditional], a conditional version of GAN is proposed, with conditions in both the generator and the discriminator. In our approach, the generator models in the adversarial training are conditional distributions depending on the observations and latent variables. Other works in the GAN area combine adversarial training with Convolutional Neural Networks (CNN), demonstrating its applicability to image data representations [denton2015deep], [radford2015unsupervised].
Adversarial training of VAEs is studied in related works such as [makhzani2015adversarial], [mescheder2017adversarial], which combine VAE and GAN. The model proposed in [makhzani2015adversarial] is called the Adversarial Autoencoder (AAE); it performs variational inference by matching the aggregated posterior of the autoencoder's hidden code vector to an arbitrary prior distribution. However, these approaches are not derived for sequential latent variable models.
Our contributions to this research area are:

1) We propose a novel approach to regularizing sequential latent variable models with adversarial training.

2) We prove that our regularization approach achieves the theoretical optimum when training a sequential latent variable model, improving the model sharpness compared with using ELBO as the training objective.

3) We prove the equivalence of the reconstruction loss and the Evidence Lower Bound when the adversarial training achieves its optimum.

4) Our theoretical analysis proves that when the discriminator is at its optimum, the adversarial training objective is the Earth-Mover distance; this makes the training smooth and stable.

5) We propose a novel approach with separated optimization steps for the autoencoder and latent distribution models, which gives a clear track of the factorization of the posterior distribution.
IV. Adversarial Regularized Variational RNN (AVRNN)
In this section, we introduce AVRNN, a new sequential latent variable model regularized by adversarial neural networks. AVRNN keeps the flexibility of the original VRNN for modeling highly nonlinear sequential dynamics, while providing a smoother and sharper distance measure for regularizing the latent prior and posterior distributions [weng2019gan]. In our proposed model, optimization of the variational bound is split into separate training steps: reconstruction and regularization. The regularization training is further split into a discriminator training step and an adversarial training step. This improves training stability and the posterior approximation in the inference network [fraccaro2016sequential].
Recall that in the sequential latent variable model, we have observations x_1, …, x_T. The AVRNN model takes one observation x_t at a time as input and updates its hidden state at each timestep. The output of each timestep is a probability distribution conditioned on a latent variable and the current hidden state. In AVRNN, a discriminator neural network is added to the structure to play a minimax game with the inference network. The discriminator takes samples of z_t from both the proposal model and the transition model. Note that in these two models, past observations and latent variables are connected indirectly to the current z_t through the conditional probability model based on the hidden state h_{t−1}. Our proposed model is illustrated in Fig. 1. The discriminator outputs a high score when it believes the sample comes from the transition model's prior distribution of z_t, and a low score when it believes the sample comes from the proposal model's approximate posterior distribution.
The transition model (TR) is a conditional Gaussian distribution parameterized by neural networks; TR computes the prior distribution p(z_t | h_{t−1}). The emission model (EM) is a conditional Bernoulli distribution, also parameterized by neural networks; EM computes the output target distribution p(x_t | z_t, h_{t−1}), from which the output sample is drawn. The proposal model (PS) is a conditional Gaussian distribution; PS computes the approximate posterior q(z_t | x_t, h_{t−1}) based on the observation and the previous hidden state.

The general objective of training this sequential latent variable model is to learn the data distribution from the observed training data. The reconstruction phase of training AVRNN maximizes the reconstructed log-likelihood of x given the approximate posterior of z. The factorized reconstruction objective can be expressed as:
max E_{q(z_{≤T} | x_{≤T})} [ ∑_{t=1}^{T} log p(x_t | z_{≤t}, x_{<t}) ]   (11)
The regularization phase is a minimax game that imposes a regularization on the approximate posterior of z_t with a prior distribution conditioned on the previous data observations and latent variables through h_{t−1}. The regularization phase of the training can be expressed as:
min_{TR, PS} max_{D} ( E_{z ∼ p(z_t | h_{t−1})}[ D(z) ] − E_{z ∼ q(z_t | x_t, h_{t−1})}[ D(z) ] )   (12)
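A toy one-dimensional sketch of this minimax regularization, assuming a WGAN-style linear critic with weight clipping and plain gradient steps (all names and constants are hypothetical). The critic step and the adversarial step alternate as described in the text:

```python
import numpy as np

rng = np.random.default_rng(3)
clip, lr, n_steps, n = 0.1, 0.05, 800, 256
w = 0.0        # critic parameter (hypothetical 1-D linear critic D(z) = w * z)
mu_q = 2.0     # proposal mean, to be pulled toward the prior mean 0.0

for _ in range(n_steps):
    # discriminator (critic) step: ascend E_p[D(z)] - E_q[D(z)], then clip
    # the weight to keep the critic (approximately) 1-Lipschitz
    zp = rng.standard_normal(n)             # samples from the prior (transition) model
    zq = mu_q + rng.standard_normal(n)      # samples from the posterior (proposal) model
    grad_w = zp.mean() - zq.mean()          # d/dw of the critic objective
    w = float(np.clip(w + lr * grad_w, -clip, clip))
    # adversarial step: the proposal descends the same objective,
    # whose gradient with respect to mu_q is -w
    mu_q += lr * w
```

Here the proposal is a unit-variance Gaussian whose mean is the only trainable quantity; the adversarial step drives it toward the prior N(0, 1), i.e., toward zero EM distance.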
Algorithm 1 and Algorithm 2 describe our proposed learning algorithm. In the adversarial training part, we use Wasserstein GAN (WGAN) instead of the original GAN method [arjovsky2017wasserstein]. We use the RMSProp optimizer as recommended for WGAN [tieleman2012rmsprop]. f is the deterministic function of the RNN cell, and θ is the set of trainable parameters of its neural network. TR, EM, and PS are highly flexible functions parameterized by neural networks; these three functions compute the parameters of the conditional probability distributions in 'Transition', 'Emission', and 'Proposal', and each has its own set of trainable parameters. D is the discriminator model, represented by a neural network that computes the probability that a sample of z_t comes from the prior distribution given by TR (a positive sample), rather than from the posterior distribution given by PS, with its own set of trainable parameters.

V. Theoretical Analysis
In the following analysis, we use the same notation as in Algorithm 1 and Algorithm 2. We denote the ELBO for VRNN in (10) as L_ELBO.
Theorem 1
Given the assumptions:

1) TR, PS, and D have enough capacity;

2) D is allowed to achieve its optimum given TR and PS;

when the adversarial discrimination loss achieves 0, maximizing the reconstruction objective (11) is equivalent to maximizing the ELBO (10).
Proof:
Since D has enough capacity, when D achieves its optimum at each discriminator training step, the discriminator loss is at its minimum. With a large number of training data samples, according to the law of large numbers, we have:

max_{‖D‖_L ≤ 1} ( E_{z ∼ p(z_t | h_{t−1})}[ D(z) ] − E_{z ∼ q(z_t | x_t, h_{t−1})}[ D(z) ] ) = W(p, q)   (13)
where W(p, q) is the Wasserstein distance, i.e., the EM (Earth-Mover) distance. This means that before each adversarial training step starts, the negative discriminator loss becomes the EM distance.
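For completeness, the identification of the optimal critic loss with the EM distance rests on the Kantorovich-Rubinstein duality used in [arjovsky2017wasserstein]:

```latex
W(p, q) \;=\; \sup_{\lVert D \rVert_L \le 1} \; \mathbb{E}_{z \sim p}[D(z)] \;-\; \mathbb{E}_{z \sim q}[D(z)]
```

where the supremum is over all 1-Lipschitz functions D, approximated in practice by weight clipping.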
Since TR and PS have enough capacity, when the adversarial discrimination loss achieves 0 at the adversarial training step, we have:

max_{‖D‖_L ≤ 1} ( E_{z ∼ p(z_t | h_{t−1})}[ D(z) ] − E_{z ∼ q(z_t | x_t, h_{t−1})}[ D(z) ] ) = 0   (14)

W( p(z_t | h_{t−1}), q(z_t | x_t, h_{t−1}) ) = 0   (15)
According to [arjovsky2017wasserstein], when the EM distance is 0, the prior and posterior distributions coincide:

q(z_t | x_t, h_{t−1}) = p(z_t | h_{t−1})   (16)
We have the ELBO in unfactorized form on the right side of (8):

L_ELBO = E_{q(z_{≤T} | x_{≤T})} [ log p(x_{≤T}, z_{≤T}) − log q(z_{≤T} | x_{≤T}) ]   (17)
By (16), the factorized posterior equals the factorized prior at every timestep, so the prior and proposal terms cancel. Then we have:

L_ELBO = E_{q(z_{≤T} | x_{≤T})} [ log p(x_{≤T} | z_{≤T}) ]   (18)
As the reconstruction loss L_recon is the negative log-likelihood of the joint probability of the x_t terms, we have:

L_ELBO = −L_recon   (19)
Theorem 1 essentially says that when the adversarial regularization loss is minimized to 0 through updates to the parameters of TR and PS, the reconstruction phase training in Algorithm 1 is equivalent to optimizing the ELBO.
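The collapse of the bound onto the reconstruction term can be checked numerically: when the proposal matches the prior at every timestep, every KL term vanishes and the ELBO equals the summed reconstruction term. The per-timestep values below are hypothetical illustrations:

```python
import numpy as np

rng = np.random.default_rng(4)

def diag_kl(mu_q, lv_q, mu_p, lv_p):
    """Closed-form KL between diagonal Gaussians, as in the per-timestep terms of (10)."""
    return 0.5 * np.sum(lv_p - lv_q + (np.exp(lv_q) + (mu_q - mu_p) ** 2) / np.exp(lv_p) - 1.0)

# hypothetical per-timestep quantities for a 3-step sequence
recon = -np.array([1.7, 2.1, 1.4])          # log p(x_t | z_<=t, x_<t) terms
mu, lv = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))

# posterior identical to prior at every timestep => every KL term is zero
kls = [diag_kl(mu[t], lv[t], mu[t], lv[t]) for t in range(3)]
elbo = recon.sum() - sum(kls)               # ELBO collapses to the reconstruction term
```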
VI. Experiments
We use the TIMIT speech dataset in our experiments [garofolo1993darpa]. For both the transition and emission models, we use conditional normal distribution models; the sigma and mu of these conditional normal distributions are computed by MLP neural networks. For the proposal model, we use a normal approximate posterior distribution, similarly computed by MLP neural networks. The AVRNN model includes a basic RNN with trainable parameters in the RNN cell. The inputs to the RNN are the encoded latent variable of the previous step and the encoded observed data point of the current step. We use fully connected neural networks for the data encoder and the latent encoder in the AVRNN model. The data encoder accepts the data points and encodes them into the desired dimensions before they are used as input to the RNN cell. Similarly, the latent encoder accepts z_t and encodes it before it is used as input to the RNN.

For the discriminator D, one LSTM layer and two fully connected layers (FNET) are stacked together [hochreiter1997long]. The LSTM state size and the number of nodes in the FNET layers are tuned as hyperparameters.
The experimental results are shown in Fig. 2 and Fig. 3; these numerical results validate Theorem 1. Fig. 2 shows the decreasing reconstruction loss and the increasing ELBO of the observed data points during the training process. The loss and bound results are summarized periodically during training and averaged over the length of the training sample sequences. The largest log-likelihood per timestep (ELBO per timestep) is achieved during the training. Fig. 3 shows the converged discriminator loss and adversarial loss. The adversarial loss is the opposite of the discriminator loss and is the EM distance between the prior and posterior distributions of z_t.
VII. Discussion
In the AAE model, the prior of the latent variable is a known GMM distribution. In AVRNN, the prior of z is a conditional distribution parameterized by the transition model, i.e., the TR neural network in Fig. 1. The conditional normal prior represented by TR gives AVRNN the capability to model the dependency between adjacent latent variables. The objective of the adversarial training in both AAE and AVRNN is to minimize the distance between the prior and posterior distributions of z. In the AAE model, this distance is measured with the Jensen-Shannon divergence, as in the original GAN model [goodfellow2014generative]. With this distance measure, it is hard to achieve a Nash equilibrium, and the training process is unstable due to vanishing gradients [salimans2016improved]. With our approach, the AVRNN model uses the Wasserstein distance to achieve better training stability. Furthermore, our approach optimizes the trainable parameters in both the factorized prior and posterior distribution models, i.e., TR and PS; this adapts to the changing prior, in contrast to the fixed known prior in AAE. The experimental results show that this approach achieves convergence during training, while the approach used in AAE cannot converge using only the proposal loss.
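The gradient-vanishing contrast can be seen on a toy discrete example (a sketch, not the paper's setup): for distributions with disjoint support, the Jensen-Shannon divergence saturates at log 2 regardless of how far apart they are, while the Wasserstein distance still varies with the separation and thus provides a useful training signal:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(xs, p, q):
    """W1 between discrete distributions on the same sorted 1-D support grid."""
    cdf_gap = np.cumsum(p - q)[:-1]
    return np.sum(np.abs(cdf_gap) * np.diff(xs))

xs = np.array([0.0, 1.0, 2.0, 3.0])
p  = np.array([1.0, 0.0, 0.0, 0.0])    # point mass at 0
qa = np.array([0.0, 0.0, 0.0, 1.0])    # point mass at 3 (far from p)
qb = np.array([0.0, 1.0, 0.0, 0.0])    # point mass at 1 (close to p)

jsa, jsb = js_divergence(p, qa), js_divergence(p, qb)       # both equal log 2
wa, wb = wasserstein_1d(xs, p, qa), wasserstein_1d(xs, p, qb)  # 3.0 vs 1.0
```

Both JS values saturate at log 2, so a JS-based discriminator gives no gradient toward closing the gap, whereas the W1 values still rank the two cases correctly.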
VIII. Conclusions
We propose a novel training approach that addresses problems in the regularization of sequential latent variable models. Our approach uses adversarial training to regularize the latent variable distributions. Compared with the state of the art, our approach has the following advantages:

1) It achieves the optimum in the training algorithms and provides better model training robustness.

2) It improves the posterior approximation and keeps a clear track of the factorization of the posterior distribution.

3) The symmetric EM distance used in our approach provides a smooth measure of the distance between the prior and posterior latent variable distributions, which gives better training stability.