 # Variational Gaussian Copula Inference

We utilize copulas to constitute a unified framework for constructing and optimizing variational proposals in hierarchical Bayesian models. For models with continuous and non-Gaussian hidden variables, we propose a semiparametric and automated variational Gaussian copula approach, in which the parametric Gaussian copula family is able to preserve multivariate posterior dependence, and the nonparametric transformations based on Bernstein polynomials provide ample flexibility in characterizing the univariate marginal posteriors.

## Code Repositories

### VariationalGaussianCopula

Matlab Code for Variational Gaussian Copula Inference

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A crucial component of Bayesian inference is approximating the posterior distribution, which represents the current state of knowledge about the latent variables

after data have been observed. When intractable integrals are involved, variational inference methods find an approximation to the posterior distribution by minimizing the Kullback-Leibler (KL) divergence , providing a lower bound for the marginal likelihood.

To make inference tractable, mean-field variational Bayes (MFVB) methods (Jordan et al., 1999; Wainwright and Jordan, 2008) assume is factorized over a certain partition of the latent variables , , with marginal densities in free-form and correlations between partitions neglected. The structured mean-field approaches (Saul and Jordan, 1996; Hoffman and Blei, 2015) preserve partial correlations and apply only to models with readily identified substructures. The variational Gaussian (VG) approximation (Barber and Bishop, 1998; Opper and Archambeau, 2009) allows incorporation of correlations by postulating a multivariate Gaussian parametric form

. The VG approximation, with continuous margins of real variables, are not suitable for variables that are inherently positive or constrained, skewed, or heavy tailed. For multi-modal posteriors, a mixture of MFVB

(Jaakkola and Jordan, 1998) or a mixture of uniformly-weighted Gaussians (Gershman et al., 2012) may be employed, which usually requires a further lower bound on the average over the logarithm of the mixture distribution.

To address the limitations of current variational methods in failing to simultaneously characterize the posterior dependencies among latent variables while allowing skewness, multimodality, and other characteristics, we propose a new variational copula framework. Our approach decouples the overall inference task into two subtasks: (i) inference of the copula function, which captures the multivariate posterior dependencies; (ii) inference of a set of univariate margins, which are allowed to take essentially any form. Motivated by the work on automated (black-box) variational inference (Ranganath et al., 2014; Mnih and Gregor, 2014; Titsias and Lázaro-Gredilla, 2014; Nguyen and Bonilla, 2014; Kingma and Welling, 2014), we present a stochastic optimization algorithm for generic hierarchical Bayesian models with continuous variables, which (i) requires minimal model-specific derivations, (ii) reproduces peculiarities of the true marginal posteriors, and (iii) identifies interpretable dependency structure among latent variables.

Using copulas to improve approximate Bayesian inference is a natural idea that has also been explored recently in other contexts (Li et al., 2015; Ferkingstad and Rue, 2015). Independently from our work, Tran et al. (2015) presented a copula augmented variational method with fixed-form marginals, and utilizes regular vines to decompose the multivariate dependency structure into bivariate copulas and a nest of trees. Our method provides complementary perspectives on nonparametric treatment of univariate marginals.

## 2 Variational Copula Inference Framework

Sklar’s theorem (Sklar, 1959)

ensures that any multivariate joint distribution

can be written in terms of univariate marginal distributions , and a copula which describes the dependence structures between variables, such that

 Q(x1,…,xp)=C[F1(x1),…,Fp(xp)]. (1)

Conversely, if is a copula and are distribution functions, then the function defined by (1) is a dimensional joint distribution function with marginal distributions , owing to the marginally closed property (Song, 2000). Assuming has

-order partial derivatives, the joint probability density function (PDF) is

, where is the PDF of the

th variable and it is related to the corresponding cumulative distribution function (CDF) by

, is the copula density with parameter .

Sklar’s theorem allows separation of the marginal distributions from the dependence structure, which is appropriately expressed in the copula function . As a modeling tool, the specified copula function and margins can be directly fitted to the observed data (Liu et al., 2009; Wauthier and Jordan, 2010; Lopez-Paz et al., 2013)

with their parameters optimized via Bayesian or maximum likelihood estimators (see

Smith (2013) and the references therein). In contrast, our goal is to use a copula as an inference engine

for full posterior approximation. All the unknowns (variables/parameters) in the user-specified hierarchical model are encapsulated into a vector

, and the optimal variational approximation to the true posterior is found under the Sklar’s representation. This approach provides users with full modeling freedom and does not require conditional conjugacy between latent variables; thus the approach is applicable to general models. Within some tractable copula family , and assuming and to be differentiable, we construct the variational proposal as , where , such that the approximation satisfies

 q⋆C(x) =\operatornamewithlimitsargminqC(x)KL{qC(x)||p(x|y)} (2) =\operatornamewithlimitsargminqC(x)KL{qC(x)||p(x)}−EqC(x)[lnp(y|x)],

where is the likelihood and is the prior. Letting the true posterior in Sklar’s representation be , where , and are the true underlying copula density and marginal posterior densities, respectively, the KL divergence decomposes into additive terms (derivations are provided in Supplementary Material),

 KL{qC(x)||p(x|y)} =KL{c[F(x)]||c⋆[F⋆(x)]} +∑jKL{fj(xj)||f⋆j(xj)}. (3)

Classical methods, such as MFVB and the VG approximation are special cases of the proposed VC inference framework. We next compare their KL divergence under Sklar’s representation and offer a reinterpretation of them under the proposed framework.

### 2.1 Special Case 1: Mean-field VB

The mean-field proposal corresponds to the independence copula with free-form marginal densities . Given we have . If MFVB is not fully factorized, i.e. , the independence copula is the only copula satisfying the marginal closed property, according to the impossibility theorem (Nelsen, 2007). MFVB assumes an independence copula and only optimizes the free-form margins,

 KL{qVB(x)||p(x|y)} =KL{cΠ[F(x)]||c⋆[F⋆(x)]} +∑jKL{fj(xj)||f⋆j(xj)}. (4)

The lowest achievable KL divergence in MFVB is , which is achieved when the true posterior marginals are found, i.e. , in which case the overall KL divergence is reduced to the KL divergence between the independence copula and the true copula. As is shown in (2.1), the objective function contains two terms, both involving marginal CDFs . Since in general , the optimal minimizing the first term will not be equal to . Therefore, minimizing (2.1

) will not lead to the correct marginals and this partially explains the reason why MFVB usually cannot find the true marginal posteriors in practice (e.g., variances can be severely underestimated

### 2.2 Special Case 2: VG Approximation

In fixed-form variational Bayes (Honkela et al., 2010), such as VG approximation, the multivariate Gaussian proposal can be written as . VG not only assumes the true copula function is a Gaussian copula (Song, 2000) with parameter , , but is also restricted to univariate Gaussian marginal densities ,

 KL{qVG(x)||p(x|y)} =KL{cG[Φ(x)]||c⋆[F⋆(x)]} +∑jKL{ϕj(xj)||f⋆j(xj)}. (5)

We can see in (2.2) that if the margins are misspecified, even if the true underlying copula is a Gaussian copula, , there could still be a discrepancy between margins, and is not zero.

Concerning analytical tractability and simplicity, in the sequel we concentrate on variational Gaussian copula (VGC) proposals constructed via Gaussian copula with continuous margins, i.e. , where . Our VGC method extends MFVB and VG, and improves upon both by allowing simultaneous updates of the Gaussian copula parameter and the adaptation of marginal densities . First, the univariate margins in VGC is not restricted to be Gaussian. Second, the Gaussian copula in VGC is more resistant to local optima than the independence copula assumed in MFVB and alleviates its variance underestimation pitfall, as is demonstrated in Section 6.3.

## 3 Variational Gaussian Copula Approximation

A Gaussian copula function with correlation matrix is defined as where is a shorthand notation of the CDF of , and is the CDF of . The Gaussian copula density is

 cG(u1,…,up|Υ) =1√|Υ|exp{−zT(Υ−1−Ip)z2},

where .

In the proposed VGC approximation, the variational proposal is constructed as a product of Gaussian copula density and continuous marginal densities. The evidence lower bound (ELBO) of VGC approximation is

 LC[qVGC(x)] =∫[cG[F(x)]×p∏j=1fj(xj)]lnp(y,x)dx +H[cG(u)]+p∑j=1H[fj(xj)], (6)

where , .

However, directly optimizing the ELBO in (3) w.r.t. the Gaussian copula parameter and the univariate marginals often leads to a non-trivial variational calculus problem. For computational convenience, we present several equivalent proposal constructions based on Jacobian transformation and reparameterization.

### 3.1 Equivalent Variational Proposals

We incorporate auxiliary variables by exploiting the latent variable representation of the Gaussian copula: , , . Letting be bijective monotonic non-decreasing functions, , , the Jacobian transformation gives

 qVGC(x) =∫[p∏j=1δ(xj−gj(zj))]qG(z;0,Υ)dz (7) =qG(g−1(x);0,Υ)[p∏j=1ddxjg−1j(xj)],

where is the Dirac delta function.

It is inconvenient to directly optimize the correlation matrix of interest, since is a positive semi-definite matrix with ones on the diagonal and off-diagonal elements between . We adopt the parameter expansion (PX) technique (Liu et al., 1998; Liu and Wu, 1999), which has been applied in accelerating variational Bayes (Qi and Jaakkola, 2006) and the sampling of correlation matrix (Talhouk et al., 2012). Further considering , , , , thus , where are also bijective monotonic non-decreasing functions, the variational proposal is further written as

 qVGC(x) =∫[p∏j=1δ(xj−hj(˜zj))]qG(˜z;μ,Σ)d˜z (8) =qG(h−1(x);μ,Σ)[p∏j=1ddxjh−1j(xj)].

Given the transformations , can be further reparameterized by the Cholesky decomposition (Challis and Barber, 2013; Titsias and Lázaro-Gredilla, 2014), where is a square lower triangular matrix. Table 1 summarizes four translatable representations of variational proposals.

### 3.2 VGC with Fixed-form Margins

The ELBO under Sklar’s representation (3) is therefore translated into the Jacobian representation

 LC[qVGC(x)]=EN(˜z;μ,Σ)[ℓs(˜z)−lnqG(˜z)], (9) ℓs(˜z,h)=lnp(y,h(˜z))+p∑j=1lnh′j(˜zj). (10)

The monotonic transformations can be specified according to the desired parametric form of marginal posterior, if the inverse CDF is tractable. For example, the multivariate log-normal posterior can be constructed via a Gaussian copula with log-normal (LN) margins,

 qVGC-LN(x)=CG(u|Υ)p∏j=1LN(xj;μj,σ2j). (11)

This also corresponds to imposing exponential transform on Gaussian variables, , . In this case, controls the location and dispersion of the marginal density; does not have any additional parameters to control the shape and takes a simple form. VGC-LN is further discussed in Section 6.2 and Section 6.3.

Given the copula function , we only need to find

one-dimensional margins. However, without knowing characteristics of the latent variables, specifying appropriate parametric form for margins is a difficult task in general cases. First, the marginals might exhibit multi-modality, high skewness or kurtosis, which are troublesome for particular parametric marginals to capture. Second, a tractable inverse CDF with optimizable arguments/parameters, as required here, are available only in a handful of cases. Instead of using some arbitrary parametric form, we construct bijective transform functions via kernel mixtures, which lead to highly flexible (ideally free-form) marginal proposals.

## 4 Bernstein Polynomials based Monotone Transformations

The marginal densities in VGC can be recovered through Jacobian transformation,

 fj(xj) =qG(h−1j(xj);μj,σ22)ddxjh−1j(xj) (12) =qG(h−1j(xj);μj,σ22)1h′j(h−1j(xj)), (13)

where the term is interpreted as a marginal-correction term. To guarantee analytical tractability, we require to be (i) bijective; (ii) monotonic non-decreasing; (iii) having unbounded/constrained range; (iv) differentiable with respect to both its argument and parameters; and (v) sufficiently flexible. We propose a class of continuous and smooth transformations constructed via kernel mixtures that automatically have these desirable properties.

### 4.1 Continuous Margins Constructed via Bernstein Polynomials

The Bernstein polynomials (BPs) have a uniform convergence property for continuous functions on unit interval and have been used for nonparametric density estimation (Petrone, 1999). It seems more natural to use kernel mixtures directly as the variational proposal. However, the difficulty lies in tackling the term involving the inverse CDF of mixtures (not analytical) and the need of a further lower bound on the entropy of mixtures. In this paper, we overcome this issue by using a sandwich-type construction of the transform 111The index on is temporarily omitted for simplicity, and is added back when necessary. which maps from to some target range building upon BP,

 h(˜z) =Ψ−1[B(Φ(˜z);k,ω)], (14) B(u;k,ω) =k∑r=1ωr,kIu(r,k−r+1), (15)

where is the regularized incomplete beta function. is the standard normal CDF mapping from to , and

is some predefined tractable inverse CDF with fixed parameters; for example, the inverse CDF of the exponential distribution helps map from

to for positive variables. relocates the probability mass on the unit interval . The degree is an unknown smoothing parameter, and is the unknown mixture weights on the probability simplex . The proposed sandwich-type transformation avoids the difficulty of specifying any particular types of marginals, while still leads to tractable derivations presented in Section 5.

### 4.2 Variational Inverse Transform

Considering a 1-d variational approximation problem ( is a scalar, the true posterior is known up to the normalizing constant), fix , thus , we can learn the monotonic transformation

on the base uniform distribution

by solving a variational problem,

i.e., if we generate , then . is closest to the true distribution with the minimum KL divergence. This can be interpreted as the variational counterpart of the inverse transform sampling (Devroye, 1986), termed as variational inverse transform (VIT). Our BP-based construction is one appropriate parameterization scheme for the inverse probability transformation . VIT-BP offers two clear advantages. First, as opposed to fixed-form variational Bayes, it does not require any specification of parametric form for . Second, the difficult task of calculating the general inverse CDFs is lessened to the much easier task of calculating the predefined tractable inverse CDF . Some choices of include CDF of for variables in , for truncated variables in .

To be consistent with VIT, we shall set in (14) to be , instead of , such that is always uniformly distributed. Ideally, BP itself suffices to represent arbitrary continuous distribution function on the unit interval. However, it might require a higher order . As is demonstrated in Section 6.1, this requirement can be alleviated by incorporating auxiliary parameters in VGC-BP, which potentially help in changing location and dispersion of the probability mass.

## 5 Stochastic VGC

The derivations of deterministic VGC updates are highly model-dependent. First, due to the cross terms often involved in the log likelihood/prior, the corresponding Gaussian expectations and their derivatives may not be analytically tractable. Second, owing to the non-convex nature of many problems, only locally optimal solutions can be guaranteed. In contrast, stochastic implementation of VGC only requires the evaluation of the log-likelihood and log-prior along with their derivatives, eliminating most model-specific derivations, and it provides a chance of escaping local optima by introducing randomness in gradients.

### 5.1 Coordinate transformations

Applying the coordinate transformations222If necessary, the Gaussian copula can be replaced with other appropriate parametric forms. The coordinate transformation supports many other distributions as well, for example, those described in Appendix C.2. of Rezende et al. (2014). of stochastic updates, , , introduced in (Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014), the gradient of the ELBO w.r.t. variational parameter can be written as

 ∇μLC =EqG(˜z)[∇˜zℓs(˜z,h)−∇˜zlnqG(˜z)], (16) ∇CLC =EqG(˜z)[∇˜z(ℓs(˜z,h)−∇˜zlnqG(˜z))ϵT], (17)

 ∇˜zjℓs(˜z) =∇˜zjlnp(y,h(˜z))+∇˜zjlnh′j(˜zj) (18) =\definecolor[named]pgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0∂lnp(y,x)∂xjh′j(˜zj)+∇˜zjlnh′j(˜zj).

According to the chain rule, the first derivative of

w.r.t is,

 h′(˜z) =dΨ−1[B(Φ(˜z);k,ω)]dB(Φ(˜z);k,ω)dB(Φ(˜z);k,ω)dΦ(˜z)dΦ(˜z)d˜z =b(Φ(˜z);k,ω)ϕ(˜z)ψ(h(˜z)), (19)

where , is the beta density . Therefore, and all take analytical expressions, where

 h′′j(˜zj) =[ρ′1(˜zj)ρ2(˜zj)ρ3(˜zj)+ρ1(˜zj)ρ′2(˜zj)ρ3(˜zj) (20) −ρ1(˜zj)ρ2(˜zj)ρ′3(˜zj)]/[ρ3(˜zj)]2,

where , , , , , , , is the PDF of , and are the predefined PDF and its derivative respectively. Defining , the derivative is written as a combination of two polynomials of lower degree .

In stochastic optimization, the gradients expressed in terms of expectations are approximated using Monte Carlo integration with finite samples. The gradients contain expectations on additive terms. Note that Rezende et al. (2014) and Titsias and Lázaro-Gredilla (2014) ignore the stochasticity in the entropy term and assume and . This creates an inconsistency as we only take finite samples in approximating , and perhaps surprisingly, this also results in an increase of the gradient variance and the sensitivity to the learning rates. Our method is inherently more stable, as the difference between the gradients, , , tends to zero when the convergent point is approached. In contrast, the gradients in previous method diffuses with a constant variance even around the global maximum. This phenomenon is illustrated in Section 6.2.

The alternative log derivative approach are also applicable to VGC inference and other types of copulas, see Paisley et al. (2012); Mnih and Gregor (2014); Rezende et al. (2014) for references. We leave this exploration open for future investigation.

### 5.2 Update the BP Weights

Under a given computational budget, we prefer a higher degree , as there is no over-fitting issue in this variational density approximation task. Given , the basis functions are completely known, depending only on index . The only parameter left to be optimized in the Bernstein polynomials is the mixture weights. Therefore, this construction is relatively simpler than Gaussian mixture proposals (Gershman et al., 2012; Nguyen and Bonilla, 2014). Assuming permissibility of interchange of integration and differentiation holds, we have , with the stochastic gradients

 ∇ω(j)ℓs(˜z,h,y)=∇ω(j)lnp(y,h(˜z))+∇ω(j)lnh′j(˜zj) (21) =∂lnp(y,x)∂xj[∂hj(˜zj)∂ω(j)r,k]r=1:k+[∂lnh′j(˜zj)∂ω(j)r,k]r=1:k,

where

 ∂hj(˜zj)∂ω(j)r,k =∂Ψ−1[B(uj;k,ω(j))]∂ω(j)r,k=Iuj(r,k−r+1)ψ(hj(˜zj)),
 ∂lnh′j(˜zj)/∂ω(j)r,k =β(uj;r,k−r+1)/b(uj;k,ω(j)) (22) −ψ′(hj(˜zj)){ψ(hj(˜zj))}2Iuj(r,k−r+1).

The gradients w.r.t turn into expectation straightforwardly, to enable stochastic optimization of the ELBO. To satisfy the constraints of on the probability simplex, we apply the gradient projection operation introduced in Duchi et al. (2008) with complexity . The above derivations related to BPs together with those in Section 5.1 are all analytic and model-independent. The only two model-specific terms are and . The stochastic optimization algorithm is summarized in Algorithm 1, with little computational overhead added relative to stochastic VG. The stability and efficiency of the stochastic optimization algorithm can be further improved by embedding adaptive subroutines (Duchi et al., 2011) and considering second-order optimization method (Fan et al., 2015).

## 6 Experiments

We use Gaussian copulas with fixed/free-form margins as automated inference engines for posterior approximation in generic hierarchical Bayesian models. We evaluate the peculiarities reproduced in the univariate margins and the posterior dependence captured broadly across latent variables. This is done by comparing VGC methods to the ground truth and other baseline methods such as MCMC, MFVB, and VG (see Supplementary Material for detailed derivations). Matlab code for VGC is available from the GitHub repository: https://github.com/shaobohan/VariationalGaussianCopula

### 6.1 Flexible Margins

We first assess the marginal approximation accuracy of our BP-based constructions in Section 4.2, i.e., via d variational optimization, where in VIT-BP, and in VGC-BP. For fixed BP order , the shape of is adjusted solely by updating , according to the variational rule. In VGC-BP, the additional marginal parameters also contribute in changing location and dispersion of . Examining Figure 1, VGC-BP produces more accurate densities than VIT-BP under the same order . Hereafter, the predefined for real variables, positive real variable, and truncated [0,1] variables are chosen to be the CDF of , and , respectively.

### 6.2 Bivariate Log-Normal

The bivariate log-normal PDF (Aitchison and Brown, 1957) is given by

 p(x1,x2) =exp(−ζ/2)/[2πx1x2σ1σ2√1−ρ2], (23) ζ =11−ρ2[α21(x1)−2ρα1(x1)α2(x2)+α22(x2)],

where , , .

We construct a bivariate Gaussian copula with (i) Log-normal margins (VGC-LN) and (ii) BP-based margins (VGC-BP). We set and , or (first and second row in Figure 2). Both VGC-LN and VGC-BP methods presume the correct form of the underlying copula (bivariate Gaussian) and learn the copula parameters . VGC-LN further assumes exactly the true form of the univariate margins (log-normal) while VGC-BP is without any particular assumptions on parametric form of margins. Figure 2 shows that VGC-BP find as accurate joint posteriors as VGC-LN, even though the former assumes less knowledge about the true margins.