1 Introduction
The variational autoencoder (VAE) [1] is a popular deep probabilistic model in which a top-down probabilistic decoder is paired with a bottom-up probabilistic encoder designed to approximate posterior inference over continuous latent variables. When the VAE is applied to a particular task, the first step is to assign an appropriate probability distribution to each dimension of the input data. For example, we assign a multivariate Gaussian distribution to real-valued attributes, or an independent Bernoulli distribution to each binary attribute of the input data. This step is critical but cumbersome when training a VAE on a dataset in practical applications.
The Gaussian VAE assumes that the probabilistic decoder is a multivariate Gaussian distribution with a diagonal covariance matrix. Under this assumption, it was observed that the Gaussian VAE did not succeed in capturing a useful representation of the real-valued MNIST dataset. This forces us to binarize the data and then train a Bernoulli VAE on it. In other words, the independent multivariate Gaussian assumption on the decoder limits the capability of the Gaussian VAE to capture the data manifold. To address this, we extend the Gaussian VAE, approximating the local covariance matrix of the decoder by an outer product of the principal direction at a position determined by a sample drawn from a Gaussian distribution. A closely related work is the non-local estimation of manifold structure [2], where the local manifold structure is determined by modeling the linear tangent space as a nonlinear function (for instance, an MLP) of the data, which can model smoothly changing tangent spaces. While that work chooses the membership of the tangent space with an NN, our model chooses it by setting the local dependency as a function of latent variables, which are efficiently inferred by the recognition network. We show that this model, referred to as VAEROC, captures the data manifold better than the standard Gaussian VAE, where an independent multivariate Gaussian is used to model the decoder.
We extend this idea to model the tangent space on mixed-variate data using copulas. A copula is a way of modeling a multivariate distribution in a bottom-up manner. Specifically, we can build the model in two steps: (1) modeling the marginal distribution for each input dimension; (2) modeling the dependency structure among them with a copula [3]. Because each random variable can be transformed to a uniform random variable on [0, 1] with the corresponding cumulative distribution function (CDF), we can handle dependency among mixed-variate data easily using copulas. There are three points we must mention before explaining how we model the tangent space in mixed-variate data: (1) modeling mixed-variate data using copulas, (2) parsimonious parameterization using copulas, (3) mixture models in copulas.
First, in earlier work, Everitt introduced a latent Gaussian variable for each categorical variable and modeled them jointly with the observed continuous variables in the form of a multivariate normal distribution [4]. Hoff proposed the extended rank likelihood for mixed-variate data, which ignores the form of the margins by calculating the likelihood of a set of projected rank-preserving data on the unit hypercube [5]. Continuous extension (CE) is a way of incorporating a discrete variable into a copula by adding a jittered uniform variable to the discrete variable [6]. In GCVAE, we follow the idea of simulated likelihood, marginalizing out the jittered uniform variable with a single sample [7]. GCVAE cannot use the extended rank likelihood because it models the copula as well as the marginal distributions.
Second, for parsimonious parameterization of the dependency in a copula, copula Gaussian graphical models (CGGM) introduce sparsity in the precision matrix to reflect conditional independence across input dimensions [8]. Murray proposed a latent factor model for normal scores, which assumes structure in the correlation matrix of the Gaussian copula [9]. GCVAE shares the idea of the Bayesian Gaussian copula factor model but introduces a simple correlation structure across input dimensions by normalizing a rank-one covariance matrix that consists of the outer product of a principal direction vector plus isotropic noise.
Finally, the Gaussian copula can be extended to a finite mixture model in which each mixture component captures a subset of the data showing similar dependency among dimensions [10]. GCVAE extends this idea to an infinite mixture model, each component of which is indexed by a vector in the latent space. As the latent vector changes smoothly, so do the membership of each mixture component (the marginal distributions change smoothly) and the local dependency across input dimensions within the component (the parsimonious copula parameters change smoothly).
In this paper we address two issues on the VAE: (1) better representation learning by capturing local principal directions at different positions which vary along the data manifold; (2) managing mixed continuous and discrete data in practice.
2 Background
In this section, we briefly review the variational autoencoder [1] and copulas [3] on which we base our development of Gaussian copula variational autoencoder.
2.1 Variational Autoencoders
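The VAE is a probabilistic model for a generative process of observed variables $x \in \mathbb{R}^D$ with continuous latent variables $z \in \mathbb{R}^K$, which pairs a top-down probabilistic decoder $p_\theta(x \mid z)$ with a bottom-up probabilistic encoder $q_\phi(z \mid x)$ that approximates the posterior inference over latent variables $z$. In the case where both decoder and encoder are multivariate Gaussians with diagonal covariance matrices, the decoder of the Gaussian VAE is described by
$$p_\theta(z) = \mathcal{N}(z \mid 0, I), \qquad p_\theta(x \mid z) = \mathcal{N}(x \mid \mu, \Lambda), \tag{1}$$
where the mean vector $\mu$ and the diagonal covariance matrix $\Lambda = \mathrm{diag}(\sigma^2)$ with $\sigma^2 = [\sigma_1^2, \ldots, \sigma_D^2]^\top$ are computed by MLPs:
$$\mu = W_\mu h + b_\mu, \qquad \log \sigma^2 = W_\sigma h + b_\sigma, \qquad h = \tanh(W_h z + b_h). \tag{2}$$
This is an example of an MLP with a single hidden layer. Certainly, multiple hidden layers can be considered. In order to emphasize the dependency of $\mu$ and $\sigma^2$ on the latent variable $z$, we also use $\mu(z)$ and $\sigma^2(z)$. The set of parameters, $\theta$, for the decoder, is given by $\theta = \{W_\mu, W_\sigma, W_h, b_\mu, b_\sigma, b_h\}$.
The encoder of the Gaussian VAE is described by
$$q_\phi(z \mid x) = \mathcal{N}(z \mid \eta, \mathrm{diag}(\tau^2)), \tag{3}$$
where the parameters $\eta$ and $\tau^2 = [\tau_1^2, \ldots, \tau_K^2]^\top$ are also computed by MLPs:
$$\eta = V_\eta y + c_\eta, \qquad \log \tau^2 = V_\tau y + c_\tau, \qquad y = \tanh(V_y x + c_y). \tag{4}$$
The set of parameters, $\phi$, for the encoder, is given by $\phi = \{V_\eta, V_\tau, V_y, c_\eta, c_\tau, c_y\}$.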
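Suppose that we are given a training dataset $X = \{x^{(1)}, \ldots, x^{(N)}\}$ consisting of $N$ i.i.d. observations, each of which is a $D$-dimensional random vector. The marginal log-likelihood of $X$ is a sum over the marginal log-likelihoods of individual observations:
$$\log p_\theta(X) = \sum_{n=1}^{N} \log p_\theta(x^{(n)}),$$
where the single factor of the marginal log-likelihood, $\log p_\theta(x^{(n)})$, is described by its variational lower bound:
$$\log p_\theta(x^{(n)}) \ge \mathcal{L}(\theta, \phi; x^{(n)}).$$
The variational lower bound, $\mathcal{L}(\theta, \phi; x^{(n)})$, on the marginal log-likelihood of observation $x^{(n)}$ is given by
$$\mathcal{L}(\theta, \phi; x^{(n)}) = \mathbb{E}_{q_\phi(z \mid x^{(n)})}\big[\log p_\theta(x^{(n)} \mid z)\big] - \mathrm{KL}\big[q_\phi(z \mid x^{(n)}) \,\|\, p_\theta(z)\big], \tag{5}$$
where $\mathrm{KL}[q \,\|\, p]$ is the KL-divergence between the distributions $q$ and $p$. The second term in (5) is analytically computed and the first term is calculated by the stochastic gradient variational Bayes (SGVB), where the Monte Carlo estimates are performed with the reparameterization trick:
$$\mathbb{E}_{q_\phi(z \mid x^{(n)})}\big[\log p_\theta(x^{(n)} \mid z)\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(n)} \mid z^{(n,l)}),$$
where $z^{(n,l)} = \eta^{(n)} + \tau^{(n)} \odot \epsilon^{(l)}$ ($\odot$ represents the elementwise product, and $\eta^{(n)}$, $\tau^{(n)}$ are the encoder mean and standard deviation for $x^{(n)}$) and $\epsilon^{(l)} \sim \mathcal{N}(0, I)$. A single sample is often sufficient to form this Monte Carlo estimate in practice; thus, in this paper we simply use
$$\mathbb{E}_{q_\phi(z \mid x^{(n)})}\big[\log p_\theta(x^{(n)} \mid z)\big] \approx \log p_\theta(x^{(n)} \mid z^{(n)}),$$
where $z^{(n)} = \eta^{(n)} + \tau^{(n)} \odot \epsilon$ denotes a single sample with abuse of notation.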
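The single-sample estimate above can be illustrated with a minimal sketch in plain Python. It assumes a diagonal Gaussian decoder and encoder as in this section; the function names and the `toy_decoder` stand-in for the decoder MLP are hypothetical and not part of the original model code.

```python
import math
import random

def reparameterize(eta, tau, rng):
    """Return z = eta + tau * eps (elementwise) with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(eta, tau)]

def kl_to_standard_normal(eta, tau):
    """Closed-form KL[N(eta, diag(tau^2)) || N(0, I)]."""
    return -0.5 * sum(1.0 + math.log(s * s) - m * m - s * s
                      for m, s in zip(eta, tau))

def diag_gaussian_logpdf(x, mu, sigma2):
    """log N(x | mu, diag(sigma2))."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v)
               for xi, mi, v in zip(x, mu, sigma2))

def single_sample_bound(x, eta, tau, decoder, rng):
    """One-sample estimate of the bound: log p(x | z) - KL[q || p]."""
    z = reparameterize(eta, tau, rng)
    mu, sigma2 = decoder(z)
    return diag_gaussian_logpdf(x, mu, sigma2) - kl_to_standard_normal(eta, tau)

def toy_decoder(z):
    """Hypothetical stand-in for the decoder MLP: identity mean, unit variance."""
    return list(z), [1.0] * len(z)

rng = random.Random(0)
bound = single_sample_bound([0.5, -1.0], [0.4, -0.8], [0.3, 0.2], toy_decoder, rng)
```

Because the noise enters only through `eps`, the sample `z` is a deterministic function of the encoder outputs, which is what allows gradients to flow through the sampling step in an automatic-differentiation framework.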
2.2 Copulas
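A $D$-dimensional copula $C$ is a distribution function on the unit cube $[0,1]^D$ with each univariate marginal distribution being uniform on $[0,1]$. A classical result of Sklar [11] relates a joint cumulative distribution function (CDF), $F(x_1, \ldots, x_D)$, of $D$ random variables $x_1, \ldots, x_D$, to a copula function $C$ via univariate marginal CDFs, $F_i$, i.e.,
$$F(x_1, \ldots, x_D) = C\big(F_1(x_1), \ldots, F_D(x_D)\big). \tag{6}$$
If all $F_i$ are continuous, then the copula satisfying (6) is unique and is given by
$$C(u_1, \ldots, u_D) = F\big(F_1^{-1}(u_1), \ldots, F_D^{-1}(u_D)\big), \tag{7}$$
for $u_i \in [0,1]$, $i = 1, \ldots, D$, where $F_i^{-1}(u_i) = \inf\{x_i \mid F_i(x_i) \ge u_i\}$. Otherwise it is uniquely determined on $\mathrm{Ran}(F_1) \times \cdots \times \mathrm{Ran}(F_D)$, where $\mathrm{Ran}(F_i)$ is the range of $F_i$. The relation in (6) can also be represented in terms of the joint probability density function (PDF), $p(x_1, \ldots, x_D)$, when the $F_i$'s are continuous:
$$p(x_1, \ldots, x_D) = c\big(F_1(x_1), \ldots, F_D(x_D)\big) \prod_{i=1}^{D} p_i(x_i), \tag{8}$$
where $c(u_1, \ldots, u_D) = \partial^D C(u_1, \ldots, u_D) / \partial u_1 \cdots \partial u_D$ is the copula density and $p_i(x_i)$ are the marginal PDFs.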
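The Gaussian copula with covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$, that we consider in this paper, is given by
$$C_\Phi(u_1, \ldots, u_D) = \Phi_\Sigma\big(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_D) \mid \Sigma\big),$$
where $\Phi_\Sigma(\cdot \mid \Sigma)$ is the $D$-dimensional Gaussian CDF with covariance matrix $\Sigma$ with diagonal entries being equal to one and $\Phi(\cdot)$ is the univariate standard Gaussian CDF. The Gaussian copula density is given by
$$c_\Phi(u_1, \ldots, u_D) = |\Sigma|^{-1/2} \exp\Big\{ -\tfrac{1}{2}\, q^\top (\Sigma^{-1} - I)\, q \Big\},$$
where $q = [q_1, \ldots, q_D]^\top$ with normal scores $q_i = \Phi^{-1}(u_i)$ for $i = 1, \ldots, D$. Invoking (8) with this Gaussian copula density and $u_i = F_i(x_i)$, the joint density function is written as:
$$p(x_1, \ldots, x_D) = |\Sigma|^{-1/2} \exp\Big\{ -\tfrac{1}{2}\, q^\top (\Sigma^{-1} - I)\, q \Big\} \prod_{i=1}^{D} p_i(x_i). \tag{9}$$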
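To make the Gaussian copula construction concrete, the following is a minimal two-dimensional sketch using only the Python standard library. It assumes a correlation matrix with a single off-diagonal entry `rho` and Gaussian margins for the demonstration; the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # cdf is Phi, inv_cdf is Phi^{-1}

def gaussian_copula_density_2d(u1, u2, rho):
    """Gaussian copula density for Sigma = [[1, rho], [rho, 1]]."""
    q1, q2 = STD_NORMAL.inv_cdf(u1), STD_NORMAL.inv_cdf(u2)  # normal scores
    det = 1.0 - rho * rho
    # q^T (Sigma^{-1} - I) q written out for the 2x2 case
    quad = (rho * rho * (q1 * q1 + q2 * q2) - 2.0 * rho * q1 * q2) / det
    return math.exp(-0.5 * quad) / math.sqrt(det)

def joint_density_2d(x1, x2, margin1, margin2, rho):
    """Copula density at the marginal CDF values times the marginal PDFs."""
    c = gaussian_copula_density_2d(margin1.cdf(x1), margin2.cdf(x2), rho)
    return c * margin1.pdf(x1) * margin2.pdf(x2)

margin = NormalDist(0.0, 1.0)
density = joint_density_2d(0.3, -0.7, margin, margin, 0.5)
```

With standard Gaussian margins the result coincides with the ordinary bivariate normal density, which gives a quick sanity check; with `rho = 0` the copula density is 1 and the joint density factorizes into the margins.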
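If the $x_i$'s are discrete, the copula is uniquely determined only on $\mathrm{Ran}(F_1) \times \cdots \times \mathrm{Ran}(F_D)$. In such a case, the joint probability mass function (PMF) of $x_1, \ldots, x_D$ is given by
$$p(x_1, \ldots, x_D) = \sum_{j_1=1}^{2} \cdots \sum_{j_D=1}^{2} (-1)^{j_1 + \cdots + j_D}\, C_\Phi\big(u_{1,j_1}, \ldots, u_{D,j_D}\big), \tag{10}$$
where $u_{i,1} = F_i(x_i^-)$, the limit of $F_i$ at $x_i$ from the left, and $u_{i,2} = F_i(x_i)$. The PMF requires the evaluation of $2^D$ terms, which is not manageable even for a moderate value of $D$. A continuous extension (CE) of discrete random variables [6, 7] avoids the $D$-fold summation in (10), associating a continuous random variable
$$x_i^* = x_i - u_i$$
with the integer-valued $x_i$, where $u_i$ is uniform on $[0,1]$ and is independent of $x_i$ as well as of $u_j$ for $j \ne i$. The continuous random variables $x_i^*$ produced by jittering $x_i$ have the CDF and PDF given by
$$F_i^*(\xi) = F_i([\xi]) + (\xi - [\xi])\, P(x_i = [\xi + 1]), \qquad f_i^*(\xi) = P(x_i = [\xi + 1]),$$
where $[\xi]$ represents the nearest integer less than or equal to $\xi$. The joint PDF for the jittered variables $x_i^*$ is determined by substituting $F_i^*$ and $f_i^*$ into (9). Then, averaging this joint PDF over the jitters $u = [u_1, \ldots, u_D]^\top$ leads to the joint PMF for $x_1, \ldots, x_D$:
$$p(x_1, \ldots, x_D) = \mathbb{E}_{u}\Big[ |\Sigma|^{-1/2} \exp\Big\{ -\tfrac{1}{2}\, q^{*\top} (\Sigma^{-1} - I)\, q^* \Big\} \prod_{i=1}^{D} f_i^*(x_i - u_i) \Big],$$
where $q^* = \big[\Phi^{-1}(F_1^*(x_1 - u_1)), \ldots, \Phi^{-1}(F_D^*(x_D - u_D))\big]^\top$.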
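The jittering step can be sketched as follows. This is an illustrative sketch, assuming categories are coded as integers 1..K and that the jittered variable is the category minus a uniform jitter, so that its CDF value lies between F(x - 1) and F(x); the function names are hypothetical.

```python
import random

def jittered_cdf_value(x, pmf, u):
    """CDF of the jittered variable at x - u: F(x - 1) + (1 - u) * P(X = x)."""
    lower = sum(pmf[:x - 1])          # F(x - 1), with categories coded 1..K
    return lower + (1.0 - u) * pmf[x - 1]

def sample_jittered(pmf, n, rng):
    """Draw categories from pmf, jitter each one, and return the CDF values."""
    categories = range(1, len(pmf) + 1)
    xs = rng.choices(categories, weights=pmf, k=n)
    return [jittered_cdf_value(x, pmf, rng.random()) for x in xs]

pmf = [0.2, 0.5, 0.3]
values = sample_jittered(pmf, 20000, random.Random(0))
```

The returned values are approximately uniform on [0,1], so they can be mapped to normal scores and plugged into a Gaussian copula density. In practice the values should be kept strictly inside (0,1) before applying the inverse Gaussian CDF.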
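Given a set of $N$ data points $\{x^{(n)}\}_{n=1}^{N}$, in the case of the Gaussian copula, the log-likelihood is given by invoking (8) and (9):
$$\mathcal{L}(\vartheta, \Sigma) = \sum_{n=1}^{N} \Big[ \log c_\Phi\big(F_1(x_1^{(n)}), \ldots, F_D(x_D^{(n)})\big) + \sum_{i=1}^{D} \log p_i(x_i^{(n)}) \Big].$$
Denote by $\vartheta_i$ the parameters that specify the marginal PDF $p_i$. Then the parameters $\vartheta = \{\vartheta_i\}$ and $\Sigma$ appearing in the Gaussian copula density are estimated by the two-step method, known as inference for margins [12]:
$$\hat{\vartheta}_i = \arg\max_{\vartheta_i} \sum_{n=1}^{N} \log p_i(x_i^{(n)}; \vartheta_i), \qquad \hat{\Sigma} = \arg\max_{\Sigma} \sum_{n=1}^{N} \log c_\Phi\big(F_1(x_1^{(n)}; \hat{\vartheta}_1), \ldots, F_D(x_D^{(n)}; \hat{\vartheta}_D); \Sigma\big).$$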
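The two-step procedure can be sketched for a two-dimensional example. The sketch below assumes one Gaussian margin and one exponential margin, and it uses the sample correlation of the normal scores in the second step, which is a common simplification of the copula maximum-likelihood step rather than the exact estimator. All names and the synthetic data are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

STD_NORMAL = NormalDist()

def fit_margins(x1, x2):
    """Step 1: maximum-likelihood fit of each margin on its own."""
    gauss = NormalDist(fmean(x1), pstdev(x1))   # Gaussian margin
    rate = 1.0 / fmean(x2)                      # exponential margin
    return gauss, rate

def normal_scores(x1, x2, gauss, rate, eps=1e-12):
    """Map each margin through its fitted CDF, then through Phi^{-1}."""
    def clip(u):
        return min(max(u, eps), 1.0 - eps)
    q1 = [STD_NORMAL.inv_cdf(clip(gauss.cdf(v))) for v in x1]
    q2 = [STD_NORMAL.inv_cdf(clip(1.0 - math.exp(-rate * v))) for v in x2]
    return q1, q2

def fit_copula_correlation(q1, q2):
    """Step 2: correlation of the normal scores, with the margins held fixed."""
    m1, m2 = fmean(q1), fmean(q2)
    cov = fmean([(a - m1) * (b - m2) for a, b in zip(q1, q2)])
    return cov / (pstdev(q1) * pstdev(q2))

# Synthetic data drawn from a Gaussian copula with correlation 0.6.
rng = random.Random(0)
rho_true, x1, x2 = 0.6, [], []
for _ in range(5000):
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho_true * z1 + math.sqrt(1.0 - rho_true ** 2) * rng.gauss(0.0, 1.0)
    x1.append(2.0 + 3.0 * z1)                             # Gaussian margin
    x2.append(-math.log(1.0 - STD_NORMAL.cdf(z2)) / 0.5)  # exponential, rate 0.5
gauss, rate = fit_margins(x1, x2)
rho_hat = fit_copula_correlation(*normal_scores(x1, x2, gauss, rate))
```

Fitting the margins first keeps each optimization low-dimensional, at the cost of ignoring the effect of the dependency parameters on the marginal estimates.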
3 VAE for Manifold Tangent Learning
We describe our model VAEROC from the perspective of manifold learning, making a simple modification to the Gaussian VAE described in Section 2.1. Note that VAEROC is limited to continuous data only; its extension to mixed data is presented in Section 4.
As in the VAE, VAEROC consists of a pair of probabilistic decoder and encoder. The probabilistic encoder in VAEROC is the same as in (3) and (4), but the decoder is slightly different from that of the VAE, as described below in detail. In order to find the local principal direction at a specific location $\mu(z)$, we use the following model for the probabilistic decoder:
$$p_\theta(z) = \mathcal{N}(z \mid 0, I), \qquad p_\theta(x \mid z) = \mathcal{N}\big(x \mid \mu,\; \omega^2 I + a a^\top\big), \tag{11}$$
where the local covariance matrix is of the form $\omega^2 I + a a^\top$ and each of $\mu$, $\omega$, $a$ is parameterized by an individual MLP which takes $z$ as input. For instance,
$$\mu = W_\mu h + b_\mu, \qquad \log \omega^2 = w_\omega^\top h + b_\omega, \qquad a = W_a h + b_a, \qquad h = \tanh(W_h z + b_h). \tag{12}$$
In fact, VAEROC can be viewed as an infinite mixture of probabilistic principal component analyzers. Introducing a dummy Gaussian latent variable $s \sim \mathcal{N}(0, 1)$, the probability distribution over $x$ given $s$ and $z$ is given by
$$p(x \mid s, z) = \mathcal{N}\big(x \mid a s + \mu,\; \omega^2 I\big),$$
where the mean $\mu$, the principal direction $a$, and the isotropic noise variance $\omega^2$ are all functions of $z$.
The latent variable $z$ can be viewed as an indicator variable involving a local region to be approximated by a tangent plane. However, unlike in the standard finite mixture model, it is not a discrete variable. The variable $z$ is drawn from the standard multivariate Gaussian distribution and determines a location where we approximate the local manifold structure by the local principal direction $a(z)$. Apart from this subtle difference, the model can be viewed as a mixture of probabilistic principal component analyzers. Marginalizing out $s$ yields the distribution for the decoder of VAEROC, given in (11).
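The marginalization over the dummy latent variable can be checked numerically. The sketch below assumes, for a fixed latent vector, a mean `mu`, a principal direction `a`, and an isotropic noise scale `omega` (hypothetical names and values), and compares the empirical covariance of samples with the rank-one-plus-isotropic form.

```python
import random

def sample_x(mu, a, omega, rng):
    """Draw x = mu + a * s + omega * eps with scalar s ~ N(0, 1), eps ~ N(0, I)."""
    s = rng.gauss(0.0, 1.0)
    return [m + ai * s + omega * rng.gauss(0.0, 1.0) for m, ai in zip(mu, a)]

def empirical_covariance(samples):
    """Plain (biased) sample covariance matrix."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[i] for x in samples) / n for i in range(d)]
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in samples) / n
             for j in range(d)] for i in range(d)]

def rank_one_covariance(a, omega):
    """Covariance after marginalizing s: omega^2 * I + a a^T."""
    d = len(a)
    return [[a[i] * a[j] + (omega ** 2 if i == j else 0.0) for j in range(d)]
            for i in range(d)]

rng = random.Random(0)
mu, a, omega = [0.0, 1.0], [1.0, -0.5], 0.3
samples = [sample_x(mu, a, omega, rng) for _ in range(20000)]
emp, target = empirical_covariance(samples), rank_one_covariance(a, omega)
```

The two matrices agree up to Monte Carlo error, which illustrates why a single direction vector per latent location is enough to encode a local principal direction.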
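Given a training dataset $X = \{x^{(n)}\}_{n=1}^{N}$ consisting of $N$ i.i.d. observations, each of which is a $D$-dimensional random vector, the variational lower bound on the marginal log-likelihood of $x^{(n)}$ is given by
$$\mathcal{L}(\theta, \phi; x^{(n)}) = \mathbb{E}_{q_\phi(z \mid x^{(n)})}\big[\log p_\theta(x^{(n)} \mid z)\big] - \mathrm{KL}\big[q_\phi(z \mid x^{(n)}) \,\|\, p_\theta(z)\big], \tag{13}$$
where the decoder $p_\theta(x \mid z)$ is given in (11) and the encoder $q_\phi(z \mid x)$ is described in (3) and (4). The second term in (13) can be analytically calculated as:
$$-\mathrm{KL}\big[q_\phi(z \mid x^{(n)}) \,\|\, p_\theta(z)\big] = \frac{1}{2} \sum_{k=1}^{K} \Big( 1 + \log (\tau_k^{(n)})^2 - (\eta_k^{(n)})^2 - (\tau_k^{(n)})^2 \Big),$$
where the superscript $(n)$ is used for $\eta$ and $\tau$ (the encoder mean and standard deviation) to reflect their dependence on $x^{(n)}$. The first term in (13) is calculated by the stochastic gradient variational Bayes (SGVB), where the Monte Carlo estimates are performed with the reparameterization trick [1]:
$$\mathbb{E}_{q_\phi(z \mid x^{(n)})}\big[\log p_\theta(x^{(n)} \mid z)\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(n)} \mid z^{(n,l)}),$$
where $z^{(n,l)} = \eta^{(n)} + \tau^{(n)} \odot \epsilon^{(l)}$ ($\odot$ represents the elementwise product) and $\epsilon^{(l)} \sim \mathcal{N}(0, I)$. A single sample is often sufficient to form this Monte Carlo estimate in practice; thus, in this paper we simply use
$$\mathbb{E}_{q_\phi(z \mid x^{(n)})}\big[\log p_\theta(x^{(n)} \mid z)\big] \approx \log p_\theta(x^{(n)} \mid z^{(n)}),$$
where $z^{(n)} = \eta^{(n)} + \tau^{(n)} \odot \epsilon$.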
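Denote by $X_M = \{x^{(m)}\}_{m=1}^{M}$ a minibatch of size $M$, which consists of $M$ randomly drawn samples from the full dataset $X$ with $N$ points. Then an estimator of the variational lower bound is constructed by
$$\mathcal{L}(\theta, \phi; X) \approx \frac{N}{M} \sum_{m=1}^{M} \widetilde{\mathcal{L}}(\theta, \phi; x^{(m)}), \tag{14}$$
where
$$\widetilde{\mathcal{L}}(\theta, \phi; x^{(m)}) = -\frac{D}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_m| - \frac{1}{2}\, r_m^\top \Sigma_m^{-1} r_m + \frac{1}{2} \sum_{k=1}^{K} \Big( 1 + \log (\tau_k^{(m)})^2 - (\eta_k^{(m)})^2 - (\tau_k^{(m)})^2 \Big),$$
where $r_m = x^{(m)} - \mu_m$ and $\Sigma_m = \omega_m^2 I + a_m a_m^\top$, with $\mu_m$, $\omega_m$, and $a_m$ evaluated at the sample $z^{(m)}$. Applying the Sherman-Morrison formula to $\Sigma_m^{-1}$ leads to
$$\widetilde{\mathcal{L}}(\theta, \phi; x^{(m)}) = -\frac{D}{2} \log 2\pi - \frac{1}{2} \Big[ D \log \omega_m^2 + \log\Big( 1 + \frac{a_m^\top a_m}{\omega_m^2} \Big) \Big] - \frac{1}{2 \omega_m^2} \Big[ r_m^\top r_m - \frac{(a_m^\top r_m)^2}{\omega_m^2 + a_m^\top a_m} \Big] + \frac{1}{2} \sum_{k=1}^{K} \Big( 1 + \log (\tau_k^{(m)})^2 - (\eta_k^{(m)})^2 - (\tau_k^{(m)})^2 \Big). \tag{15}$$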
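The benefit of the Sherman-Morrison step is that the Gaussian log-density with a rank-one-plus-isotropic covariance can be evaluated in time linear in the dimension, without forming or inverting the covariance matrix. A minimal sketch follows, assuming the covariance is `omega2 * I + a a^T`; the function name is hypothetical.

```python
import math

def rank_one_gaussian_logpdf(x, mu, a, omega2):
    """log N(x | mu, omega2 * I + a a^T) using only inner products."""
    d = len(x)
    r = [xi - mi for xi, mi in zip(x, mu)]           # residual x - mu
    aa = sum(ai * ai for ai in a)                    # a^T a
    ar = sum(ai * ri for ai, ri in zip(a, r))        # a^T r
    rr = sum(ri * ri for ri in r)                    # r^T r
    # Matrix determinant lemma: |Sigma| = omega2^d * (1 + a^T a / omega2)
    logdet = d * math.log(omega2) + math.log1p(aa / omega2)
    # Sherman-Morrison: Sigma^{-1} = (I - a a^T / (omega2 + a^T a)) / omega2
    quad = (rr - ar * ar / (omega2 + aa)) / omega2
    return -0.5 * (d * math.log(2.0 * math.pi) + logdet + quad)

value = rank_one_gaussian_logpdf([0.3, -0.7], [0.1, 0.2], [1.0, -0.5], 0.09)
```

When the direction vector is zero the expression reduces to an isotropic Gaussian log-density, which matches the diagonal-covariance case with equal variances.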
Along with the maximization of the variational lower bound, we consider two regularization terms, each of which is explained below:

Locality Regularization: The first regularization is the $\ell_p$ regularization applied to the local principal direction $a$ to enforce a bound on its length. It was observed in our experiments that both the $\ell_1$ and $\ell_2$ norms worked well, where the $\ell_1$ norm gave a sparser local principal direction.

Rank Regularization: Denote by $p_{\mathrm{data}}(x)$ the data distribution. The aggregated posterior $q_\phi(z) = \int q_\phi(z \mid x)\, p_{\mathrm{data}}(x)\, dx$ on latent variables $z$ is expected to match the Gaussian prior distribution $p_\theta(z)$. The adversarial autoencoder [13] is a recent example which incorporates this regularization. In this paper, we take a different approach, which is close to the idea used in learning neural networks to estimate the distribution function [14]. Note that the random variable $\Phi(z_k)$ (where $\Phi$ denotes the standard Gaussian CDF) is uniform on [0,1] when $z_k$ follows the standard Gaussian distribution. Thus, we consider a penalty function that compares $\Phi(z_k^{(m)})$ with $u_k^{(m)}$, where $u_k^{(m)}$ are randomly drawn samples from the uniform distribution on [0,1] and $z^{(m)}$ is the latent sample drawn from the encoder for the $m$-th data point in the minibatch. Samples $\Phi(z_k^{(m)})$ and $u_k^{(m)}$ are sorted in ascending order to relate $\Phi(z_k^{(m)})$ to $u_k^{(m)}$.
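The sorting-based comparison can be sketched for a single latent dimension. The exact form of the penalty is not recoverable from the text here, so the sketch assumes a mean squared difference between the sorted Gaussian-CDF values of the latent samples and sorted uniform samples; both that choice and the function name are assumptions made for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def rank_penalty(z_samples, u_samples):
    """Mean squared gap between sorted Phi(z) values and sorted uniform draws.

    The squared difference is an assumed choice for this illustration.
    """
    phi = sorted(STD_NORMAL.cdf(z) for z in z_samples)
    u = sorted(u_samples)
    return sum((p - v) ** 2 for p, v in zip(phi, u)) / len(phi)

rng = random.Random(0)
n = 2000
u = [rng.random() for _ in range(n)]
matched = rank_penalty([rng.gauss(0.0, 1.0) for _ in range(n)], u)   # prior-like
shifted = rank_penalty([rng.gauss(2.0, 1.0) for _ in range(n)], u)   # off-prior
```

Sorting pairs the k-th smallest CDF value with the k-th smallest uniform draw, so the penalty is small when the latent samples look standard Gaussian and grows when they drift away from the prior.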
Thus our regularized variational lowerbound is given by
(16)  
With a random minibatch of $M$ data points, the regularized variational lower bound in (16) is maximized to determine the parameters $\theta$ and $\phi$. This training is repeated with different minibatches until the parameters converge.
4 Gaussian Copula VAE
In this section we present a Gaussian copula variational autoencoder (GCVAE), where we extend the VAEROC (described in the previous section), developed for continuous data, to handle mixed continuous and discrete data, employing the Gaussian copula explained in Section 2.2.
Suppose we are given a set of $N$ data points, each of which is a $D$-dimensional vector, i.e., $X = \{x^{(1)}, \ldots, x^{(N)}\}$. We use superscripts $c$ and $d$ to indicate variables associated with continuous and discrete data, respectively, i.e., $x = [x^c; x^d]$, where the semicolon is used to represent stacking vectors in a column. Denote by $x^c$ the vector which collects continuous attributes and by $x^d$ the vector containing discrete attributes. The $i$th entry of $x^c$ is represented by $x_i^c$.
In principle, GCVAE allows for various continuous or discrete distributions together. In this paper, however, we assume Gaussian distributions for continuous variables and categorical distributions for discrete variables. That is,
where and is the indicator function which yields 1 when the input argument is true and otherwise 0.
In GCVAE, we use the Gaussian copula to model the probabilistic decoder , given by
where the expectation is approximated using a single sample drawn from the uniform distribution on [0,1] and the Gaussian copula density (see Appendix A for proof) is given by
where
(17)  
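and the covariance matrix in the Gaussian copula is of the form $\Sigma = \Delta^{-1/2} (\omega^2 I + a a^\top) \Delta^{-1/2}$. Diagonal entries of $\omega^2 I + a a^\top$ are denoted by $\delta_i = \omega^2 + a_i^2$ and $\Delta$ represents a diagonal matrix with diagonal entries being $\delta_i$. When $a = 0$, the Gaussian copula density is equal to 1, i.e., GCVAE is identical to VAE.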
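The normalization of the rank-one covariance into a valid copula correlation matrix can be sketched as below. It assumes the unnormalized covariance is an isotropic term plus the outer product of a principal direction vector, and that it is rescaled by its own diagonal so that the result has unit diagonal; the function name and argument names are hypothetical.

```python
import math

def copula_correlation(a, omega2):
    """Rescale omega2 * I + a a^T to unit diagonal (a correlation matrix)."""
    d = len(a)
    diag = [omega2 + ai * ai for ai in a]            # diagonal of the covariance
    return [[(a[i] * a[j] + (omega2 if i == j else 0.0))
             / math.sqrt(diag[i] * diag[j])
             for j in range(d)] for i in range(d)]

corr = copula_correlation([1.0, 1.0, 0.0], 1.0)
identity = copula_correlation([0.0, 0.0], 0.5)
```

In the example, the first two dimensions share a direction and receive correlation 0.5, while the third is uncorrelated with both. With a zero direction vector the matrix is the identity, so the copula density equals 1 and the dimensions are modeled as independent given the latent vector.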
As in the VAEROC, each of the decoder quantities is parameterized by an individual MLP which takes $z$ as input. For instance,
The probabilistic encoder is parameterized, as in VAE or VAEROC, as described in (3) and (4). As in VAEROC, GCVAE is learned by maximizing the regularized variational lower bound (16), where the lower bound is given by
(18)  