# Gaussian Copula Variational Autoencoders for Mixed Data

The variational autoencoder (VAE) is a generative model with continuous latent variables in which a pair of probabilistic encoder (bottom-up) and decoder (top-down) is jointly learned by stochastic gradient variational Bayes. We first elaborate the Gaussian VAE, approximating the local covariance matrix of the decoder as an outer product of the principal direction at a position determined by a sample drawn from a Gaussian distribution. We show that this model, referred to as VAE-ROC, better captures the data manifold than the standard Gaussian VAE, where an independent multivariate Gaussian is used to model the decoder. We then extend the VAE-ROC to handle mixed categorical and continuous data. To this end, we employ a Gaussian copula to model the local dependency in mixed categorical and continuous data, leading to the Gaussian copula variational autoencoder (GCVAE). As in the VAE-ROC, we use a rank-one approximation for the covariance in the Gaussian copula to capture the local dependency structure in the mixed data. Experiments on various datasets demonstrate the useful behaviour of the VAE-ROC and GCVAE, compared with the standard VAE.

## 1 Introduction

The variational autoencoder (VAE) [1] is a deep probabilistic model of great popularity, in which a top-down probabilistic decoder is paired with a bottom-up probabilistic encoder designed to approximate the posterior inference over continuous latent variables. When the VAE is applied to a task in practice, the first step is to assign an appropriate probability distribution to each dimension of the input data. For example, we assign a multivariate Gaussian distribution to real-valued attributes or an independent Bernoulli distribution to each binary attribute. This is a critical but cumbersome step in successfully training a VAE on a dataset in practical applications.

The Gaussian VAE assumes that the probabilistic decoder is a multivariate Gaussian distribution with diagonal covariance matrix. In this case, it was observed that the Gaussian VAE was not successful in capturing a fruitful representation of the real-valued MNIST dataset, which forces us to binarize the data and train a Bernoulli VAE on it instead. In other words, the independent multivariate Gaussian assumption on the decoder limits the Gaussian VAE's capability to capture the data manifold. To this end, we extend the Gaussian VAE, approximating the local covariance matrix of the decoder as an outer product of the principal direction at a position determined by a sample drawn from a Gaussian distribution. A closely-related work is the nonlocal estimation of manifold structure [2], where the local manifold structure is determined by modelling the linear tangent space as a nonlinear function (for instance, an MLP) of data, which can model smoothly changing tangent spaces. While that work chooses the membership of a tangent space with k-nearest neighbours, our model chooses its membership by setting the local dependency as a function of latent variables, which is efficiently inferred by the recognition network. We show that this model, referred to as VAE-ROC, better captures the data manifold than the standard Gaussian VAE, where an independent multivariate Gaussian is used to model the decoder.

We extend this idea to model the tangent space of mixed variate data using copulas. A copula is a way of modelling a multivariate distribution in a bottom-up fashion. Specifically, we can model it in two steps: (1) model the marginal distribution of each input dimension; (2) model the dependency structure among them with a copula [3]. Because each random variable can be transformed into a uniform random variable on [0, 1] with its cumulative distribution function (CDF), copulas let us handle dependency among mixed variate data easily. There are three points to mention before explaining how we model the tangent space of mixed variate data: (1) modelling mixed variates using copulas, (2) parsimonious parameterization using copulas, (3) mixture models with copulas.

First, in earlier days, Everitt introduced a latent Gaussian variable for each categorical variable and modelled them together with the observed continuous variables as a multivariate normal distribution [4]. Hoff proposed the extended rank likelihood for mixed variate data, which ignores the form of the margins by computing the likelihood of a set of rank-preserving projections of the data onto the unit hypercube [5]. The continuous extension (CE) incorporates a discrete variable into a copula by attaching a jittered uniform variable to the discrete variable [6]. In GCVAE, we follow the idea of simulated likelihood, marginalizing out the jittered uniform variable with one sample [7]. GCVAE cannot use the extended rank likelihood because it models the marginal distributions as well as the copula.

Second, for parsimonious parameterization of the dependency in a copula, Copula Gaussian Graphical Models (CGGM) introduce sparsity in the precision matrix to reflect conditional independence across input dimensions [8]. Murray proposed a latent factor model for normal scores, which assumes structure in the correlation matrix of the Gaussian copula [9]. GCVAE shares the idea of the Bayesian Gaussian copula factor model but introduces a simple correlation structure across input dimensions by normalizing a rank-one covariance matrix, formed as the outer product of the principal direction vector plus isotropic noise.

Finally, the Gaussian copula can be extended to a finite mixture model in which each mixture component takes a subset of data showing similar dependency among dimensions [10]. GCVAE extends this idea to an infinite mixture model, each component of which is indexed by a vector in the latent space. As the latent variable changes smoothly, so does the membership of each mixture component: the marginal distributions change smoothly, as does the local dependency across input dimensions within the component (the parsimonious copula parameters change smoothly).

In this paper we address two issues in the VAE: (1) better representation learning by capturing local principal directions at different positions, which vary along the data manifold; (2) handling mixed continuous and discrete data in practice.

## 2 Background

In this section, we briefly review the variational autoencoder [1] and copulas [3], on which we base our development of the Gaussian copula variational autoencoder.

### 2.1 Variational Autoencoders

The VAE is a probabilistic model for a generative process of observed variables $\mathbf{x}$ with continuous latent variables $\mathbf{z}$, which pairs a top-down probabilistic decoder $p_\theta(\mathbf{x}|\mathbf{z})$ with a bottom-up probabilistic encoder $q_\phi(\mathbf{z}|\mathbf{x})$ that approximates the posterior inference over the latent variables. In the case where both decoder and encoder are multivariate Gaussians with diagonal covariance matrices, the decoder of the Gaussian VAE is described by

$$p_\theta(\mathbf{z}) = \mathcal{N}(\mathbf{z}\,|\,\mathbf{0}, \mathbf{I}), \qquad p_\theta(\mathbf{x}\,|\,\mathbf{z}) = \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}, \boldsymbol{\Lambda}_\sigma), \tag{1}$$

where the mean vector $\boldsymbol{\mu}$ and the diagonal covariance matrix $\boldsymbol{\Lambda}_\sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2)$ are computed by MLPs:

$$\boldsymbol{\mu} = \mathbf{W}_\mu \mathbf{h} + \mathbf{b}_\mu, \qquad \log \boldsymbol{\sigma}^2 = \mathbf{W}_\sigma \mathbf{h} + \mathbf{b}_\sigma, \qquad \mathbf{h} = \tanh(\mathbf{W}_h \mathbf{z} + \mathbf{b}_h). \tag{2}$$

This is an example of an MLP with a single hidden layer; multiple hidden layers can certainly be considered. In order to emphasize the dependency of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ on the latent variable $\mathbf{z}$, we also write $\boldsymbol{\mu}(\mathbf{z})$ and $\boldsymbol{\sigma}(\mathbf{z})$. The set of parameters $\theta$ for the decoder is given by

$$\theta = \{\mathbf{W}_\mu, \mathbf{W}_\sigma, \mathbf{W}_h, \mathbf{b}_\mu, \mathbf{b}_\sigma, \mathbf{b}_h\}.$$

The encoder of the Gaussian VAE is described by

$$q_\phi(\mathbf{z}\,|\,\mathbf{x}) = \mathcal{N}(\mathbf{z}\,|\,\boldsymbol{\eta}, \boldsymbol{\Lambda}_\tau), \tag{3}$$

where the parameters $\boldsymbol{\eta}$ and $\boldsymbol{\Lambda}_\tau = \mathrm{diag}(\tau_1^2, \ldots, \tau_K^2)$ are also computed by MLPs:

$$\boldsymbol{\eta} = \mathbf{V}_\eta \mathbf{y} + \mathbf{b}_\eta, \qquad \log \boldsymbol{\tau}^2 = \mathbf{V}_\tau \mathbf{y} + \mathbf{b}_\tau, \qquad \mathbf{y} = \tanh(\mathbf{V}_y \mathbf{x} + \mathbf{b}_y). \tag{4}$$

The set of parameters $\phi$ for the encoder is given by

$$\phi = \{\mathbf{V}_\eta, \mathbf{V}_\tau, \mathbf{V}_y, \mathbf{b}_\eta, \mathbf{b}_\tau, \mathbf{b}_y\}.$$
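To make (2) and (4) concrete, here is a minimal NumPy sketch of the decoder and encoder maps; the layer sizes and random weights are arbitrary placeholders for illustration, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, H = 5, 2, 8   # data dim, latent dim, hidden units (arbitrary)

# Decoder parameters theta = {W_mu, W_sigma, W_h, b_mu, b_sigma, b_h}.
W_h, b_h = rng.normal(size=(H, K)), np.zeros(H)
W_mu, b_mu = rng.normal(size=(D, H)), np.zeros(D)
W_sig, b_sig = rng.normal(size=(D, H)), np.zeros(D)

def decode(z):
    """Compute mu(z) and sigma^2(z) as in Eq. (2)."""
    h = np.tanh(W_h @ z + b_h)
    mu = W_mu @ h + b_mu
    log_sig2 = W_sig @ h + b_sig
    return mu, np.exp(log_sig2)

# Encoder parameters phi = {V_eta, V_tau, V_y, b_eta, b_tau, b_y}.
V_y, b_y = rng.normal(size=(H, D)), np.zeros(H)
V_eta, b_eta = rng.normal(size=(K, H)), np.zeros(K)
V_tau, b_tau = rng.normal(size=(K, H)), np.zeros(K)

def encode(x):
    """Compute eta(x) and tau^2(x) as in Eq. (4)."""
    y = np.tanh(V_y @ x + b_y)
    eta = V_eta @ y + b_eta
    log_tau2 = V_tau @ y + b_tau
    return eta, np.exp(log_tau2)

z = rng.normal(size=K)      # z ~ N(0, I)
mu, sig2 = decode(z)
eta, tau2 = encode(mu)      # encode a point in data space
```

Parameterizing the log-variances keeps the predicted variances strictly positive without any constrained optimization.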

Suppose that we are given a training dataset $\mathbf{X} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$ consisting of $N$ i.i.d. observations, each of which is a $D$-dimensional random vector. The marginal log-likelihood of $\mathbf{X}$ is a sum over the marginal log-likelihoods of individual observations:

$$\log p_\theta(\mathbf{X}) = \sum_{n=1}^{N} \log p_\theta(\mathbf{x}^{(n)}),$$

where each factor $\log p_\theta(\mathbf{x}^{(n)})$ is bounded from below by its variational lower bound:

$$\log p_\theta(\mathbf{x}^{(n)}) = \log \int p_\theta(\mathbf{x}^{(n)}, \mathbf{z}^{(n)})\, d\mathbf{z}^{(n)} \;\geq\; \int q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)}) \log \left[ \frac{p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n)})\, p_\theta(\mathbf{z}^{(n)})}{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})} \right] d\mathbf{z}^{(n)} = \mathcal{F}(\theta, \phi; \mathbf{x}^{(n)}).$$

The variational lower bound $\mathcal{F}(\theta, \phi; \mathbf{x}^{(n)})$ on the marginal log-likelihood of observation $\mathbf{x}^{(n)}$ is given by

$$\mathcal{F}(\theta, \phi; \mathbf{x}^{(n)}) = \mathbb{E}_{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})}\big[\log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n)})\big] - \mathrm{KL}\big[ q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)}) \,\|\, p_\theta(\mathbf{z}^{(n)}) \big], \tag{5}$$

where $\mathrm{KL}[q\|p]$ is the KL divergence between the distributions $q_\phi(\mathbf{z}|\mathbf{x})$ and $p_\theta(\mathbf{z})$. The second term in (5) is computed analytically, and the first term is estimated by stochastic gradient variational Bayes (SGVB), where the Monte Carlo estimates are performed with the reparameterization trick:

$$\mathbb{E}_{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})}\big[\log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n)})\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n,l)}),$$

where $\mathbf{z}^{(n,l)} = \boldsymbol{\eta}^{(n)} + \boldsymbol{\tau}^{(n)} \odot \boldsymbol{\epsilon}^{(l)}$ ($\odot$ represents the elementwise product) and $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. A single sample is often sufficient to form this Monte Carlo estimate in practice; thus, in this paper we simply use

$$\mathbb{E}_{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})}\big[\log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n)})\big] \approx \log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n)}),$$

where, with a slight abuse of notation, $\mathbf{z}^{(n)}$ denotes a single sample drawn from $q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})$.
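A minimal NumPy sketch of this single-sample estimate follows; the decoder here is a placeholder affine map (not the paper's MLP), and the encoder outputs are fixed values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 4, 2

# Encoder outputs for one observation x^(n) (placeholder values).
eta = np.array([0.3, -0.1])
tau = np.array([0.5, 0.8])
x = rng.normal(size=D)

# Reparameterization trick: z = eta + tau * eps with eps ~ N(0, I).
eps = rng.normal(size=K)
z = eta + tau * eps

# Placeholder decoder: fixed affine map from z to (mu, sigma^2).
W = rng.normal(size=(D, K))
mu = W @ z
sig2 = np.ones(D)

# Single-sample Monte Carlo estimate of E_q[log p(x|z)].
log_lik = -0.5 * np.sum(np.log(2 * np.pi * sig2) + (x - mu) ** 2 / sig2)

# Analytic KL[q(z|x) || N(0, I)] for diagonal Gaussians, as in Eq. (5).
kl = -0.5 * np.sum(1 + 2 * np.log(tau) - tau ** 2 - eta ** 2)

elbo_estimate = log_lik - kl
```

Because `z` is a deterministic function of `eta`, `tau`, and the exogenous noise `eps`, gradients with respect to the encoder parameters pass through the sample, which is the whole point of the reparameterization.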

### 2.2 Copulas

A $D$-dimensional copula is a distribution function on the unit cube $[0,1]^D$ whose univariate marginal distributions are all uniform on $[0,1]$. A classical result of Sklar [11] relates a joint cumulative distribution function (CDF) $F$ of random variables $x_1, \ldots, x_D$ to a copula function $C$ via the univariate marginal CDFs $F_1, \ldots, F_D$, i.e.,

$$F(x_1, \ldots, x_D) = C(F_1(x_1), \ldots, F_D(x_D)). \tag{6}$$

If all $F_i$ are continuous, then the copula satisfying (6) is unique and is given by

$$C(u_1, \ldots, u_D) = F(F_1^{-1}(u_1), \ldots, F_D^{-1}(u_D)), \tag{7}$$

for $(u_1, \ldots, u_D) \in [0,1]^D$, where $F_i^{-1}(u_i) = \inf\{x : F_i(x) \geq u_i\}$. Otherwise the copula is uniquely determined only on $\mathrm{Ran}(F_1) \times \cdots \times \mathrm{Ran}(F_D)$, where $\mathrm{Ran}(F_i)$ is the range of $F_i$. The relation in (6) can also be represented in terms of the joint probability density function (PDF) when the $x_i$'s are continuous:

$$p(x_1, \ldots, x_D) = c(F_1(x_1), \ldots, F_D(x_D)) \prod_{i=1}^{D} p_i(x_i), \tag{8}$$

where $c$ is the copula density and the $p_i$ are the marginal PDFs.

The Gaussian copula with covariance matrix $\boldsymbol{\Sigma}$, which we consider in this paper, is given by

$$C_\Phi(u_1, \ldots, u_D) = \Phi_\Sigma(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_D) \,|\, \boldsymbol{\Sigma}),$$

where $\Phi_\Sigma$ is the $D$-dimensional Gaussian CDF with covariance matrix $\boldsymbol{\Sigma}$ whose diagonal entries equal one, and $\Phi$ is the univariate standard Gaussian CDF. The Gaussian copula density is given by

$$c_\Phi(u_1, \ldots, u_D) = \frac{\partial^D C_\Phi(u_1, \ldots, u_D)}{\partial u_1 \cdots \partial u_D} = |\boldsymbol{\Sigma}|^{-\frac{1}{2}} \exp\Big\{ -\frac{1}{2} \mathbf{q}^\top (\boldsymbol{\Sigma}^{-1} - \mathbf{I}) \mathbf{q} \Big\},$$

where $\mathbf{q} = [q_1, \ldots, q_D]^\top$ with normal scores $q_i = \Phi^{-1}(u_i)$ for $i = 1, \ldots, D$. Invoking (8) with this Gaussian copula density, the joint density function is written as

$$p(\mathbf{x}) = |\boldsymbol{\Sigma}|^{-\frac{1}{2}} \exp\Big\{ -\frac{1}{2} \mathbf{q}^\top (\boldsymbol{\Sigma}^{-1} - \mathbf{I}) \mathbf{q} \Big\} \prod_{i=1}^{D} p_i(x_i). \tag{9}$$
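As a sanity check on (9): when the marginals are themselves standard Gaussian, the copula factor times the product of marginal densities must recover the joint multivariate Gaussian density. A small sketch (the correlation value is arbitrary):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Correlation matrix Sigma with unit diagonal (2-D example).
rho = 0.6
Sigma = np.array([[1.0, rho], [rho, 1.0]])

x = np.array([0.7, -0.4])

# Normal scores q_i = Phi^{-1}(F_i(x_i)); with standard Gaussian
# marginals F_i = Phi, so q equals x (up to floating point).
q = norm.ppf(norm.cdf(x))

# Gaussian copula density, the copula factor in Eq. (9).
Sinv = np.linalg.inv(Sigma)
c = np.linalg.det(Sigma) ** -0.5 * np.exp(-0.5 * q @ (Sinv - np.eye(2)) @ q)

# Joint density as copula density times the product of marginal PDFs.
p_copula = c * np.prod(norm.pdf(x))

# Direct multivariate Gaussian density for comparison.
p_direct = multivariate_normal(mean=np.zeros(2), cov=Sigma).pdf(x)
```

The two values agree to machine precision, which is exactly the content of Sklar's decomposition for this special case.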

If the $x_i$'s are discrete, the copula is uniquely determined only on the ranges of the $F_i$. In such a case, the joint probability mass function (PMF) of $x_1, \ldots, x_D$ is given by

$$p(x_1, \ldots, x_D) = \sum_{j_1=1}^{2} \cdots \sum_{j_D=1}^{2} (-1)^{j_1 + \cdots + j_D}\, \Phi_\Sigma\big(\Phi^{-1}(u_{1,j_1}), \ldots, \Phi^{-1}(u_{D,j_D})\big), \tag{10}$$

where $u_{i,1} = F_i(x_i^-)$, the limit of $F_i$ at $x_i$ from the left, and $u_{i,2} = F_i(x_i)$. The PMF requires the evaluation of $2^D$ terms, which is not manageable even for a moderate value of $D$. A continuous extension (CE) of discrete random variables [6, 7] avoids the $2^D$-fold summation in (10) by associating a continuous random variable $x_i^* = x_i - v_i$ with the integer-valued $x_i$, where $v_i$ is uniform on $[0,1]$ and is independent of $x_i$ as well as of $v_j$ for $j \neq i$. The continuous random variables produced by jittering have the CDF and PDF given by

$$F_i^*(\xi) = F_i([\xi]) + (\xi - [\xi])\, P(x_i = [\xi + 1]), \qquad p_i^*(\xi) = P(x_i = [\xi + 1]),$$

where $[\xi]$ represents the greatest integer less than or equal to $\xi$. The joint PDF for the jittered variables is obtained by substituting $F_i^*$ and $p_i^*$ into (9). Averaging this joint PDF over the jitters then leads to the joint PMF for $x_1, \ldots, x_D$:

$$p(x_1, \ldots, x_D) = \mathbb{E}_{\mathbf{v}}\Big[ |\boldsymbol{\Sigma}|^{-\frac{1}{2}} \exp\Big\{ -\frac{1}{2} \mathbf{q}^{*\top} (\boldsymbol{\Sigma}^{-1} - \mathbf{I}) \mathbf{q}^* \Big\} \prod_{i=1}^{D} p_i^*(x_i - v_i) \Big],$$

where

$$\mathbf{q}^* = \big[ \Phi^{-1}(F_1^*(x_1 - v_1)), \ldots, \Phi^{-1}(F_D^*(x_D - v_D)) \big]^\top.$$
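The jittered CDF and PDF above can be sketched for a single categorical variable on {1, 2, 3}; the probabilities and the jitter value are arbitrary placeholders:

```python
import numpy as np

# PMF of a discrete variable on support {1, 2, 3}.
support = np.array([1, 2, 3])
pmf = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(pmf)

def F(k):
    """CDF of the discrete variable at k."""
    return cdf[support <= k].max(initial=0.0)

def F_star(xi):
    """Jittered CDF: F*(xi) = F([xi]) + (xi - [xi]) P(x = [xi]+1)."""
    k = np.floor(xi)
    p_next = pmf[support == k + 1].sum()
    return F(k) + (xi - k) * p_next

def p_star(xi):
    """Jittered PDF: p*(xi) = P(x = [xi]+1)."""
    return pmf[support == np.floor(xi) + 1].sum()

# Jitter x = 2 with one draw of v ~ Uniform[0,1) (fixed for illustration).
x = 2
v = 0.37
xi = x - v   # xi lies in (1, 2)
```

On the interval (1, 2) the jittered density is constant at P(x = 2), and F* interpolates linearly between F(1) and F(2), so integrating the jittered density over that interval recovers the original probability mass.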

Given a set of data points $\{\mathbf{x}^{(n)}\}_{n=1}^{N}$, in the case of the Gaussian copula the log-likelihood follows from (8) and (9):

$$\sum_{n=1}^{N} \log p(x_1^{(n)}, \ldots, x_D^{(n)}) = -\frac{N}{2} \log |\boldsymbol{\Sigma}| + \frac{1}{2} \sum_{n=1}^{N} \mathbf{q}^{(n)\top} (\mathbf{I} - \boldsymbol{\Sigma}^{-1}) \mathbf{q}^{(n)} + \sum_{n=1}^{N} \sum_{i=1}^{D} \log p_i(x_i^{(n)}).$$

Denote by $\vartheta_i$ the parameters specifying the marginal PDF $p_i$. The parameters $\vartheta_i$ and $\boldsymbol{\Sigma}$ appearing in the Gaussian copula density are then estimated by the two-step method known as inference for margins [12]:

$$\hat{\vartheta}_i = \operatorname*{argmax}_{\vartheta_i} \sum_{n=1}^{N} \log p_i(x_i^{(n)}; \vartheta_i), \qquad \hat{\boldsymbol{\Sigma}} = \operatorname*{argmax}_{\boldsymbol{\Sigma}} \sum_{n=1}^{N} \log c_\Phi\big(F_1(x_1^{(n)}; \hat{\vartheta}_1), \ldots, F_D(x_D^{(n)}; \hat{\vartheta}_D)\big).$$
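The two-step procedure can be sketched for Gaussian marginals, where step one reduces to sample means and standard deviations and step two estimates the copula correlation from the empirical correlation of the normal scores (a simple moment-based stand-in for the full copula maximization):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, D = 2000, 3

# Synthetic correlated data (the mixing matrix is arbitrary).
L = np.array([[1.0, 0.0, 0.0],
              [0.7, 0.7, 0.0],
              [0.3, 0.3, 0.9]])
X = rng.normal(size=(N, D)) @ L.T

# Step 1: fit each marginal by maximum likelihood (Gaussian case:
# sample mean and standard deviation per dimension).
mu_hat = X.mean(axis=0)
sd_hat = X.std(axis=0)

# Step 2: normal scores q_i = Phi^{-1}(F_i(x_i; theta_hat_i)), then
# estimate the copula correlation matrix from the scores.
U = norm.cdf((X - mu_hat) / sd_hat)
Q = norm.ppf(np.clip(U, 1e-9, 1 - 1e-9))
Sigma_hat = np.corrcoef(Q, rowvar=False)
```

The clipping guards the inverse CDF against 0 or 1 values that would map to infinity; the estimated off-diagonal entries approach the true correlations as N grows.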

## 3 VAE for Manifold Tangent Learning

We describe our model, VAE-ROC, from the perspective of manifold learning, making a simple modification to the Gaussian VAE described in Section 2.1. Note that VAE-ROC is limited to continuous data; its extension to mixed data is presented in Section 4.

As in the VAE, VAE-ROC consists of a pair of probabilistic decoder and encoder. The probabilistic encoder in VAE-ROC is the same as in (3) and (4), but the decoder differs slightly from that of the VAE, as described below in detail. In order to find the local principal direction at a specific location determined by $\mathbf{z}$, we use the following model for the probabilistic decoder:

$$p_\theta(\mathbf{x}\,|\,\mathbf{z}) = \mathcal{N}(\boldsymbol{\mu}, \omega \mathbf{I} + \mathbf{a}\mathbf{a}^\top), \tag{11}$$

where the local covariance matrix is of the form $\omega \mathbf{I} + \mathbf{a}\mathbf{a}^\top$ and each of $\boldsymbol{\mu}$, $\omega$, and $\mathbf{a}$ is parameterized by an individual MLP which takes $\mathbf{z}$ as input. For instance,

$$p_\theta(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad \boldsymbol{\mu} = \mathbf{W}_\mu \mathbf{h} + \mathbf{b}_\mu, \quad \log \omega = \mathbf{w}_\omega^\top \mathbf{h} + b_\omega, \quad \mathbf{a} = \mathbf{W}_a \mathbf{h} + \mathbf{b}_a, \quad \mathbf{h} = \tanh(\mathbf{W}_h \mathbf{z} + \mathbf{b}_h). \tag{12}$$

In fact, VAE-ROC can be viewed as an infinite mixture of probabilistic principal component analyzers. Introducing a dummy Gaussian latent variable $s \sim \mathcal{N}(0, 1)$, the probability distribution over $\mathbf{x}$ given $s$ and $\mathbf{z}$ is

$$p(\mathbf{x}\,|\,s, \mathbf{z}) = \mathcal{N}(\mathbf{a}s + \boldsymbol{\mu}, \omega \mathbf{I}).$$

The latent variable $\mathbf{z}$ can be viewed as an indicator variable selecting a local region to be approximated by a tangent plane. Unlike in a standard finite mixture model, however, it is not discrete: $\mathbf{z}$ is drawn from the standard multivariate Gaussian distribution and determines the location at which we approximate the local manifold structure by the local principal direction $\mathbf{a}(\mathbf{z})$. Apart from this subtle difference, the model can be viewed as a mixture of probabilistic principal component analyzers. Marginalizing out $s$ yields the distribution for the decoder of VAE-ROC given in (11).
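The hierarchical sampling view can be sketched directly: map z through a small MLP to (μ, ω, a), draw the dummy scalar s, then draw x from N(as + μ, ωI). The network weights below are random placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
D, K, H = 6, 2, 8   # data, latent, hidden dims (arbitrary)

W_h = rng.normal(size=(H, K)); b_h = np.zeros(H)
W_mu = rng.normal(size=(D, H)); b_mu = np.zeros(D)
W_a = rng.normal(size=(D, H)); b_a = np.zeros(D)
w_om = rng.normal(size=H); b_om = 0.0

def decoder_params(z):
    """Map z to (mu, omega, a) as in Eq. (12)."""
    h = np.tanh(W_h @ z + b_h)
    mu = W_mu @ h + b_mu
    a = W_a @ h + b_a
    omega = np.exp(w_om @ h + b_om)   # log-parameterized, so omega > 0
    return mu, omega, a

z = rng.normal(size=K)
mu, omega, a = decoder_params(z)

# Hierarchical sampling: s ~ N(0,1), then x | s, z ~ N(a s + mu, omega I).
s = rng.normal()
x = a * s + mu + np.sqrt(omega) * rng.normal(size=D)

# Marginalizing s gives the rank-one covariance of Eq. (11).
cov = omega * np.eye(D) + np.outer(a, a)
```

The covariance has eigenvalue ω with multiplicity D − 1 and ω + ‖a‖² along a, so samples spread preferentially along the local principal direction.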

Given a training dataset $\mathbf{X} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$ consisting of $N$ i.i.d. observations, each a $D$-dimensional random vector, the variational lower bound on the marginal log-likelihood of $\mathbf{x}^{(n)}$ is given by

$$\mathcal{F}(\theta, \phi; \mathbf{x}^{(n)}) = \mathbb{E}_{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})}\big[\log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n)})\big] - \mathrm{KL}\big[ q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)}) \,\|\, p_\theta(\mathbf{z}^{(n)}) \big], \tag{13}$$

where the decoder is given in (11) and the encoder is described in (3) and (4). The second term in (13) can be calculated analytically:

$$\mathrm{KL}\big[ q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)}) \,\|\, p_\theta(\mathbf{z}^{(n)}) \big] = -\frac{1}{2} \sum_{k=1}^{K} \Big( 1 + 2 \log \tau_k^{(n)} - (\tau_k^{(n)})^2 - (\eta_k^{(n)})^2 \Big),$$

where the superscript $(n)$ on $\tau_k$ and $\eta_k$ reflects their dependence on $\mathbf{x}^{(n)}$. The first term in (13) is estimated by stochastic gradient variational Bayes (SGVB), with Monte Carlo estimates performed via the reparameterization trick [1]:

$$\mathbb{E}_{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})}\big[\log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n)})\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n,l)}),$$

where $\mathbf{z}^{(n,l)} = \boldsymbol{\eta}^{(n)} + \boldsymbol{\tau}^{(n)} \odot \boldsymbol{\epsilon}^{(l)}$ ($\odot$ represents the elementwise product) and $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. A single sample is often sufficient to form this Monte Carlo estimate in practice; thus, in this paper we simply use

$$\mathbb{E}_{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})}\big[\log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n)})\big] \approx \log p_\theta(\mathbf{x}^{(n)}|\mathbf{z}^{(n)}),$$

where $\mathbf{z}^{(n)} = \boldsymbol{\eta}^{(n)} + \boldsymbol{\tau}^{(n)} \odot \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

Denote by $\{\mathbf{x}^{(m)}\}_{m=1}^{M}$ a minibatch of size $M$, consisting of samples drawn at random from the full dataset of $N$ points. An estimator of the variational lower bound is then constructed as

$$\mathcal{F}(\theta, \phi; \mathbf{X}) \approx \frac{N}{M} \sum_{m=1}^{M} \mathcal{F}(\theta, \phi; \mathbf{x}^{(m)}), \tag{14}$$

where

$$\mathcal{F}(\theta, \phi; \mathbf{x}^{(m)}) = -\frac{D}{2} \log 2\pi - \frac{1}{2} \log \big| \omega^{(m)} \mathbf{I} + \mathbf{a}^{(m)} \mathbf{a}^{(m)\top} \big| - \frac{1}{2} \tilde{\mathbf{x}}^{(m)\top} \big( \omega^{(m)} \mathbf{I} + \mathbf{a}^{(m)} \mathbf{a}^{(m)\top} \big)^{-1} \tilde{\mathbf{x}}^{(m)} + \frac{1}{2} \sum_{k=1}^{K} \Big( 1 + 2 \log \tau_k^{(m)} - (\tau_k^{(m)})^2 - (\eta_k^{(m)})^2 \Big),$$

with $\tilde{\mathbf{x}}^{(m)} = \mathbf{x}^{(m)} - \boldsymbol{\mu}^{(m)}$. Applying the Sherman-Morrison formula to $\big( \omega^{(m)} \mathbf{I} + \mathbf{a}^{(m)} \mathbf{a}^{(m)\top} \big)^{-1}$ leads to

$$\mathcal{F}(\theta, \phi; \mathbf{x}^{(m)}) = -\frac{D}{2} \log 2\pi - \frac{D}{2} \log \omega^{(m)} - \frac{1}{2} \log\Big( 1 + \frac{\mathbf{a}^{(m)\top} \mathbf{a}^{(m)}}{\omega^{(m)}} \Big) - \frac{1}{2} \tilde{\mathbf{x}}^{(m)\top} \Big( \frac{1}{\omega^{(m)}} \mathbf{I} - \frac{\mathbf{a}^{(m)} \mathbf{a}^{(m)\top}}{\omega^{(m)2} + \omega^{(m)} \mathbf{a}^{(m)\top} \mathbf{a}^{(m)}} \Big) \tilde{\mathbf{x}}^{(m)} + \frac{1}{2} \sum_{k=1}^{K} \Big( 1 + 2 \log \tau_k^{(m)} - (\tau_k^{(m)})^2 - (\eta_k^{(m)})^2 \Big). \tag{15}$$

In addition to maximizing the variational lower bound, we consider two regularization terms, each of which is explained below:

• Locality Regularization: The first regularization is applied to the local principal direction $\mathbf{a}^{(m)}$ to enforce a bound on its length. We observed in our experiments that both the $\ell_1$ and $\ell_2$ norms worked well, with the $\ell_1$ norm giving a sparser local principal direction.

• Rank Regularization: Denote by $p_{\mathrm{data}}(\mathbf{x})$ the data distribution. The aggregated posterior over latent variables is expected to match the Gaussian prior distribution $p_\theta(\mathbf{z})$. The adversarial autoencoder [13] is a recent example which incorporates this kind of regularization. In this paper, we take a different approach, close to the idea used in learning neural networks to estimate the distribution function [14]. Note that the random variable $\Phi_I(\mathbf{z})$ (where $\Phi_I$ denotes the standard Gaussian CDF) is uniform on $[0,1]$. Thus, we consider the following penalty function:

$$\sum_{m=1}^{M} \frac{1}{2} \big( u^{(m)} - \Phi_I(\mathbf{z}^{(m)}) \big)^2,$$

where the $u^{(m)}$ are samples drawn at random from the uniform distribution on $[0,1]$ and $\mathbf{z}^{(m)} \sim q_\phi(\mathbf{z}|\mathbf{x}^{(m)})$. The samples $u^{(m)}$ and $\Phi_I(\mathbf{z}^{(m)})$ are each sorted in ascending order to relate $u^{(m)}$ to $\Phi_I(\mathbf{z}^{(m)})$.
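The penalty above can be sketched as follows: draw uniforms, push the latent samples through the standard Gaussian CDF, sort both sequences, and sum the squared gaps. The one-dimensional latent below is deliberately mismatched with the prior to show the penalty responding; all values are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
M = 256

# Latent samples from the recognition network (placeholder: shifted and
# scaled, so deliberately mismatched with the N(0,1) prior).
z = 0.5 + 1.2 * rng.normal(size=M)

# u^(m) ~ Uniform[0,1]; Phi(z^(m)) would also be uniform if z matched the prior.
u = rng.uniform(size=M)
phi_z = norm.cdf(z)

# Sort both sequences in ascending order to pair ranks, then penalize gaps.
penalty = 0.5 * np.sum((np.sort(u) - np.sort(phi_z)) ** 2)

# For reference: the same penalty when z is drawn from the prior itself.
phi_prior = norm.cdf(rng.normal(size=M))
penalty_prior = 0.5 * np.sum((np.sort(u) - np.sort(phi_prior)) ** 2)
```

Sorting pairs empirical quantiles, so the penalty behaves like a squared distance between the empirical distribution of Φ(z) and the uniform distribution; it stays near its sampling-noise floor when the aggregated posterior matches the prior and grows when it does not.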

Thus our regularized variational lower bound is given by

$$\tilde{\mathcal{F}}(\theta, \phi; \mathbf{X}) = \frac{N}{M} \sum_{m=1}^{M} \Big( \mathcal{F}(\theta, \phi; \mathbf{x}^{(m)}) - \lambda_a \big\| \mathbf{a}^{(m)} \big\|_p^2 \Big) - \frac{N \lambda_r}{2M} \sum_{m=1}^{M} \big( u^{(m)} - \Phi_I(\mathbf{z}^{(m)}) \big)^2. \tag{16}$$

With a random minibatch of $M$ datapoints, the regularized variational lower bound in (16) is maximized to determine the parameters $\theta$ and $\phi$. This training is repeated with different minibatches until the parameters converge.

## 4 Gaussian Copula VAE

In this section we present the Gaussian copula variational autoencoder (GCVAE), which extends the VAE-ROC (described in the previous section) from continuous data to mixed continuous and discrete data, employing the Gaussian copula explained in Section 2.2.

Suppose we are given a set of data points, each of which is a $D$-dimensional vector, i.e., $\mathbf{x} = [\mathbf{x}^c; \mathbf{x}^s]$, where the semicolon represents stacking vectors in a column and the superscripts $c$ and $s$ indicate variables associated with continuous and discrete data, respectively. Denote by $\mathbf{x}^c$ the vector which collects the $d_c$ continuous attributes and by $\mathbf{x}^s$ the vector containing the $d_s$ discrete attributes, so that $D = d_c + d_s$. The $i$th entry of $\mathbf{x}^s$ is represented by $x_i^s \in \{1, \ldots, J\}$.

In principle, GCVAE allows various continuous and discrete distributions to be combined. In this paper, however, we assume Gaussian distributions for continuous variables and categorical distributions for discrete variables. That is,

$$p_i(x_i^c) = \mathcal{N}(\mu_i, \sigma_i^2), \qquad p_i(x_i^s) = \prod_{j=1}^{J} \beta_{i,j}^{\mathbb{I}(x_i^s = j)},$$

where $\sum_{j=1}^{J} \beta_{i,j} = 1$ and $\mathbb{I}(\cdot)$ is the indicator function, which yields 1 when its argument is true and 0 otherwise.
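A minimal sketch of these marginal terms in log-space (all parameter values are placeholders chosen for illustration):

```python
import numpy as np

# Mixed marginals: Gaussian for continuous, categorical for discrete.
mu, sig2 = np.array([0.0, 1.0]), np.array([1.0, 0.25])
beta = np.array([[0.2, 0.5, 0.3],     # categories for x^s_1
                 [0.6, 0.3, 0.1]])    # categories for x^s_2

x_c = np.array([0.5, 1.2])
x_s = np.array([2, 1])                # category labels in {1, ..., J}

# log p_i(x^c_i) = log N(x^c_i | mu_i, sigma_i^2)
log_p_cont = -0.5 * (np.log(2 * np.pi * sig2) + (x_c - mu) ** 2 / sig2)

# log p_i(x^s_i) = sum_j I(x^s_i = j) log beta_{i,j}
# (the indicator picks out exactly one beta per discrete dimension)
log_p_disc = np.log(beta[np.arange(2), x_s - 1])

log_marginals = log_p_cont.sum() + log_p_disc.sum()
```

These are exactly the terms the copula density multiplies in the decoder; the copula contributes the dependency, and the marginals contribute the per-dimension fit.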

In GCVAE, we use the Gaussian copula to model the probabilistic decoder $p_\theta(\mathbf{x}|\mathbf{z})$, given by

$$p_\theta(\mathbf{x}\,|\,\mathbf{z}) = p_\theta(\mathbf{x}^c, \mathbf{x}^s\,|\,\mathbf{z}) = \mathbb{E}_{\mathbf{v}}\Big[ c_\Psi(\cdot) \prod_{i=1}^{d_c} p_i(x_i^c) \prod_{i=1}^{d_s} p_i^*(x_i^s - v_i) \Big] \approx c_\Psi(\cdot) \prod_{i=1}^{d_c} p_i(x_i^c) \prod_{i=1}^{d_s} p_i^*(x_i^s - v_i) = c_\Psi(\cdot) \prod_{i=1}^{d_c} \mathcal{N}(\mu_i, \sigma_i^2) \prod_{i=1}^{d_s} \prod_{j=1}^{J} \beta_{i,j}^{\mathbb{I}(x_i^s = j)},$$

where the expectation is approximated using a single sample $\mathbf{v}$ drawn from the uniform distribution on $[0,1]^{d_s}$, and the Gaussian copula density (see Appendix A for the derivation) is given by

$$c_\Psi(\cdot) = c_\Psi\big( F_1(x_1^c), \ldots, F_{d_c}(x_{d_c}^c),\; F_1^*(x_1^s - v_1), \ldots, F_{d_s}^*(x_{d_s}^s - v_{d_s}) \big) = \Big( \prod_{i=1}^{d_c + d_s} \psi_i \Big) \big| \omega \mathbf{I} + \mathbf{a}\mathbf{a}^\top \big|^{-\frac{1}{2}} \exp\Big\{ -\frac{1}{2} \tilde{\mathbf{q}}^\top \big[ (\omega \mathbf{I} + \mathbf{a}\mathbf{a}^\top)^{-1} - \mathbf{S}^{-1} \big] \tilde{\mathbf{q}} \Big\},$$

where

$$\tilde{\mathbf{q}}^\top = \big[ \psi_1 \Phi^{-1}(F_1(x_1^c)), \ldots, \psi_{d_c} \Phi^{-1}(F_{d_c}(x_{d_c}^c)),\; \psi_{d_c+1} \Phi^{-1}(F_1^*(x_1^s - v_1)), \ldots, \psi_{d_c+d_s} \Phi^{-1}(F_{d_s}^*(x_{d_s}^s - v_{d_s})) \big], \tag{17}$$

and the covariance matrix in the Gaussian copula is of the form $\omega \mathbf{I} + \mathbf{a}\mathbf{a}^\top$, normalized to a correlation matrix. The diagonal entries of $\omega \mathbf{I} + \mathbf{a}\mathbf{a}^\top$ are denoted by $\psi_i^2$, and $\mathbf{S}$ represents the diagonal matrix with diagonal entries $\psi_1^2, \ldots, \psi_{d_c+d_s}^2$. When $\mathbf{a} = \mathbf{0}$, the Gaussian copula density is equal to 1, i.e., GCVAE is identical to the VAE.
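The normalization to a correlation matrix, and the claim that the copula density collapses to 1 when a = 0, can be checked numerically; the log-density below is written in terms of the unnormalized scores q̃, consistent with the terms appearing in the lower bound (18), and all values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
D = 5
omega = 0.4
a = rng.normal(size=D)

def copula_log_density(q_tilde, omega, a):
    """log c_Psi in terms of unnormalized scores q_tilde, with
    C = omega*I + a a^T and S = diag(C) = diag(psi_1^2, ..., psi_D^2)."""
    D = len(a)
    C = omega * np.eye(D) + np.outer(a, a)
    S_inv = np.diag(1.0 / np.diag(C))
    psi = np.sqrt(np.diag(C))
    return (np.sum(np.log(psi))
            - 0.5 * np.linalg.slogdet(C)[1]
            - 0.5 * q_tilde @ (np.linalg.inv(C) - S_inv) @ q_tilde)

q_tilde = rng.normal(size=D)
log_c = copula_log_density(q_tilde, omega, a)

# With a = 0 the covariance is isotropic, S = C, every term cancels, and
# the copula density is exactly 1 (log-density 0): GCVAE reduces to the VAE.
log_c_zero = copula_log_density(q_tilde, omega, np.zeros(D))
```

The cancellation at a = 0 happens term by term: the ψ product equals the square root of the determinant, and the quadratic form vanishes because C⁻¹ = S⁻¹.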

As in the VAE-ROC, each of $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, $\beta_{i,j}$, $\omega$, and $\mathbf{a}$ is parameterized by an individual MLP which takes $\mathbf{z}$ as input. For instance,

$$p_\theta(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \boldsymbol{\mu} = \mathbf{W}_\mu \mathbf{h} + \mathbf{b}_\mu, \qquad \log \boldsymbol{\sigma}^2 = \mathbf{W}_\sigma \mathbf{h} + \mathbf{b}_\sigma,$$

$$\beta_{i,j} = \frac{\exp\{\mathbf{w}_{\beta,i,j}^\top \mathbf{h} + b_{\beta,i,j}\}}{\sum_{j'=1}^{J} \exp\{\mathbf{w}_{\beta,i,j'}^\top \mathbf{h} + b_{\beta,i,j'}\}}, \qquad \log \omega = \mathbf{w}_\omega^\top \mathbf{h} + b_\omega, \qquad \mathbf{a} = \mathbf{W}_a \mathbf{h} + \mathbf{b}_a, \qquad \mathbf{h} = \tanh(\mathbf{W}_h \mathbf{z} + \mathbf{b}_h).$$

The probabilistic encoder is parameterized as in the VAE or VAE-ROC, as described in (3) and (4). As in the VAE-ROC, GCVAE is learned by maximizing the regularized variational lower bound (16), where the lower bound is given by

$$\mathcal{F}(\theta, \phi; \mathbf{x}^{(m)}, \tilde{\mathbf{q}}^{(m)}) = \sum_{i=1}^{d_c + d_s} \log \psi_i - \frac{1}{2} \log \big| \omega^{(m)} \mathbf{I} + \mathbf{a}^{(m)} \mathbf{a}^{(m)\top} \big| - \frac{1}{2} \tilde{\mathbf{q}}^{(m)\top} \big[ \big( \omega^{(m)} \mathbf{I} + \mathbf{a}^{(m)} \mathbf{a}^{(m)\top} \big)^{-1} - \mathbf{S}^{-1} \big] \tilde{\mathbf{q}}^{(m)} + \sum_{i=1}^{d_s} \sum_{j=1}^{J} \mathbb{I}\big( x_i^{s,(m)} = j \big) \log \beta_{i,j}^{(m)} + \sum_{i=1}^{d_c} \log \mathcal{N}\big( x_i^{c,(m)} \,\big|\, \mu_i^{(m)}, \sigma_i^{2,(m)} \big) + \frac{1}{2} \sum_{k=1}^{K} \Big( 1 + 2 \log \tau_k^{(m)} - (\tau_k^{(m)})^2 - (\eta_k^{(m)})^2 \Big). \tag{18}$$