# High-dimensional copula variational approximation through transformation

Variational methods are attractive for computing Bayesian inference for highly parametrized models and large datasets where exact inference is impractical. They approximate a target distribution - either the posterior or an augmented posterior - using a simpler distribution that is selected to balance accuracy with computational feasibility. Here we approximate an element-wise parametric transformation of the target distribution as multivariate Gaussian or skew-normal. Approximations of this kind are implicit copula models for the original parameters, with a Gaussian or skew-normal copula function and flexible parametric margins. A key observation is that their adoption can improve the accuracy of variational inference in high dimensions at limited or no additional computational cost. We consider the Yeo-Johnson and G&H transformations, along with sparse factor structures for the scale matrix of the Gaussian or skew-normal. We also show how to implement efficient reparametrization gradient methods for these copula-based approximations. The efficacy of the approach is illustrated by computing posterior inference for three different models using six real datasets. In each case, we show that our proposed copula model distributions are more accurate variational approximations than Gaussian or skew-normal distributions, but at only a minor or no increase in computational cost.


## 1 Introduction and Literature Review

Variational methods are an increasingly popular tool for computing posterior inferences for highly parametrized models and large datasets; see Ormerod and Wand (2010) and Blei et al. (2017) for overviews. Unlike conventional Monte Carlo methods, which are able in principle to estimate quantities of interest with any desired precision, variational methods are inherently approximate. However, they are often substantially faster, and can be used to estimate models where exact inference is impractical. The selection of an approximation that balances accuracy with computational viability is key to the success of variational inference.

In this paper we suggest a general approach to variational inference for a high-dimensional target distribution using Gaussian or skew-normal copula-based approximations. They are formed by using Gaussian or skew-normal distributions for an element-wise parametric transformation of the target. Parsimonious factor parametrizations of the scale matrix of these distributions are used to enable feasible computation. For the transformations, we consider the Yeo-Johnson (Yeo and Johnson, 2000) and G&H (Tukey, 1977) families. These allow for skewness and more complex features in the marginal densities of the copula model, without requiring a large number of additional variational parameters, something that is important for maintaining computational efficiency in high dimensions. We also show how efficient reparametrization gradient methods can be used for such implicit copula models, including for the skew-normal by making use of its latent Gaussian structure. We show in a number of examples that our Gaussian and skew-normal copula models are more accurate approximations than the corresponding Gaussian and skew-normal distributions. Importantly, this increase in accuracy usually comes at only a minor increase in computational time, while in some instances the copula models are actually faster to calibrate.

Variational inference methods for Bayesian computation approximate a target posterior or augmented posterior distribution using another distribution which is more tractable. The form of the approximation is commonly derived either from an assumed factorization of the density, or from the adoption of some convenient parametric family. In the current work, we consider parametric families of approximations. For parametric variational approaches, a multivariate Gaussian family is the most common choice. Important early work on Gaussian approximations can be found in Opper and Archambeau (2009), who considered models having a Gaussian prior and factorizing likelihood, and showed that in this class of models the number of variational parameters does not proliferate with increasing dimension. Challis and Barber (2013) discussed Gaussian approximations for models where the posterior could be expressed in a certain form, and showed an equivalence between local variational methods and Kullback-Leibler divergence minimization methods in their setup. They also considered various parametrizations of the covariance matrix based on the Cholesky factor for the optimization. More recent work on Gaussian approximations has focused on stochastic gradient methods, which largely remove any restriction on the kind of models to which the methodology applies. Key references here are Kingma and Welling (2014) and Rezende et al. (2014), who introduced efficient variance reduction methods for stochastic gradient estimation in the variational optimization. These methods will be discussed further later. Some similar ideas were developed independently at about the same time in Titsias and Lázaro-Gredilla (2014) and Salimans et al. (2013). The latter authors also consider methods for Gaussian approximation able to use second derivative information from the log posterior, as well as methods for forming non-Gaussian approximations by making use of hierarchical structures or mixtures of normal approximations. Kucukelbir et al. (2017) consider an automatic differentiation approach to Gaussian variational approximation which considers both diagonal and dense Cholesky parametrizations of the covariance matrix and the use of fixed marginal transformations of parameters. Their approach is implemented in the statistical package Stan (Carpenter et al., 2017).

A key difficulty with Gaussian approximations is that the number of covariance parameters increases quadratically with the number of model parameters, making Gaussian variational approximation impractical in high dimensions unless more parsimonious parametrizations of the covariance matrix are adopted. While assuming a diagonal covariance matrix is one possibility, this removes the ability to represent any posterior dependence. Work on structured approximations for covariance matrices in Gaussian approximation applicable to high-dimensional problems includes the work of Challis and Barber (2013) mentioned above, and Tan and Nott (2018), who parametrize the covariance matrix in terms of a sparse Cholesky factor of the precision matrix. Related methods for time series models are developed in Archer et al. (2016). Miller et al. (2016) and Ong et al. (2018) consider factor parametrizations of covariance matrices, with the former authors also considering mixture approximations, with Gaussian component covariance matrices having the factor structure. Earlier approaches which used a one factor approximation to the covariance or precision matrix were considered by Seeger (2000) and Rezende et al. (2014). Quiroz et al. (2018) consider combining factor parametrizations for state reduction with sparse precision Cholesky factors for capturing dynamic dependence structure in high-dimensional state space models. Guo et al. (2016) consider "variational boosting" mixture approximations similar to those of Miller et al. (2016), although they use different approaches to the specification of mixture components and to the optimization.

The references above relate to different approaches to variational inference based on Gaussian or mixture of Gaussians approximations. However, there is also a large literature on other approaches to developing flexible variational families. Most pertinent to the present work are methods based on copulas. Tran et al. (2015) use vine copulas, but these can be slow to evaluate in high dimensions, and selection of the appropriate vine structure and component pair-copulas is difficult in general. Han et al. (2016) also employ element-wise transformations to construct a Gaussian copula model, and their work is most closely related to ours. They consider dense Cholesky factor parametrizations for the covariance matrix in the copula, and employ approximations to the posterior marginals based on flexible Bernstein polynomial transformations. Our work differs from theirs in the focus on approximations that can be calibrated in high dimensions. In particular, we use parsimonious factor parametrizations for the copula scale matrix which are feasible to implement for a high-dimensional model parameter vector, as well as parametric transformations which are computationally efficient and do not employ too many variational parameters. We also go beyond Gaussian copula approximations by investigating skew-normal copulas. Skew-normal variational families are considered in Ormerod (2011), who considers application to models with a structure where the lower bound can be computed using one-dimensional quadrature. However, Ormerod (2011) does not consider skew-normal copulas.

Apart from copulas, there are many other ways to specify rich variational families. These include normalizing flows (Rezende and Mohamed, 2015), Stein variational gradient descent (Liu and Wang, 2016), real-valued non-volume preserving transformations (Dinh et al., 2016), methods based on transport maps (Spantini et al., 2018), implicit variational approximations where the variational family is specified through a generative process without a closed form density (Huszár, 2017) and hierarchical variational models (Ranganath et al., 2016). Some of these approaches attain their flexibility through using compositions of transformations of an initial density, but they do not fit into the copula framework discussed here.

The rest of the paper is organized as follows. Section 2 gives a brief introduction to variational inference methods, followed by a general description of our proposed implicit copula approach. Sections 3 and 4 consider Gaussian copula and skew-normal copula approximations, respectively. They illustrate our approach in six challenging examples, where the copula approximations are more accurate than the corresponding Gaussian or skew-normal approximations, but at limited or no additional computational cost. Section 5 gives some concluding discussion and directions for future work.

## 2 Variational Inference

In this section we first provide a short overview of variational inference. We then outline the implicit copulas formed through transformation that we employ as variational approximations.

### 2.1 Approximate Bayesian inference

We consider Bayesian inference for data $y$ with density $p(y|\theta)$, where $\theta = (\theta_1,\ldots,\theta_m)^\top$ is either a parameter vector, or a parameter vector augmented with some additional latent variables. The prior and posterior densities are denoted by $p(\theta)$ and $p(\theta|y)$, respectively. We will consider variational inference methods, in which a member $q_\lambda(\theta)$ of some parametric family of densities is used to approximate $p(\theta|y)$, where $\lambda$ is a vector of variational parameters. For example, for a multivariate normal family $\lambda$ would consist of the distinct elements of the mean vector and covariance matrix. Approximate Bayesian inference is then formulated as an optimization problem, where a measure of divergence between $q_\lambda(\theta)$ and $p(\theta|y)$ is minimized with respect to $\lambda$. The Kullback-Leibler divergence

$$\mathrm{KL}\big(p(\theta|y)\,\|\,q_\lambda(\theta)\big) = \int \log\frac{q_\lambda(\theta)}{p(\theta|y)}\, q_\lambda(\theta)\,\mathrm{d}\theta,$$

is typically used, and we employ it here. If $p(y) = \int p(\theta)p(y|\theta)\,\mathrm{d}\theta$ denotes the marginal likelihood, then it is easily shown (see, for example, Ormerod and Wand (2010)) that

$$\mathrm{KL}\big(p(\theta|y)\,\|\,q_\lambda(\theta)\big) = \log p(y) - \int \log\frac{p(\theta)p(y|\theta)}{q_\lambda(\theta)}\, q_\lambda(\theta)\,\mathrm{d}\theta = \log p(y) - \mathcal{L}(\lambda), \tag{1}$$

where $\mathcal{L}(\lambda)$ is called the variational lower bound. Because $\log p(y)$ does not depend on $\lambda$, minimization of the Kullback-Leibler divergence above with respect to $\lambda$ is equivalent to maximizing the variational lower bound $\mathcal{L}(\lambda)$.

The lower bound takes the form of an intractable integral, so it seems challenging to optimize. However, writing $g(\theta) = p(\theta)p(y|\theta)$ for the numerator of the posterior density, notice from (1) that it can be written as an expectation with respect to $q_\lambda$ as

$$\mathcal{L}(\lambda) = E_{q_\lambda}\big[\log g(\theta) - \log q_\lambda(\theta)\big], \tag{2}$$

which allows easy application of stochastic gradient ascent (SGA) methods (Robbins and Monro, 1951; Bottou, 2010). In SGA we start from an initial value $\lambda^{(1)}$ and update it recursively as

$$\lambda^{(i+1)} = \lambda^{(i)} + \rho_i \circ \widehat{\nabla_\lambda \mathcal{L}}\big(\lambda^{(i)}\big), \quad \text{for } i = 1, 2, \ldots,$$

where $\rho_i$ is a vector of step sizes, '$\circ$' denotes the element-wise product of two vectors, and $\widehat{\nabla_\lambda \mathcal{L}}(\lambda^{(i)})$ is an unbiased estimate of the gradient of $\mathcal{L}(\lambda)$ at $\lambda = \lambda^{(i)}$. For appropriate step size choices this recursion will converge to a local mode of $\mathcal{L}(\lambda)$. Adaptive step size choices are often used in practice, and we use the ADADELTA method of Zeiler (2012).
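As a concrete sketch, the SGA recursion with ADADELTA step sizes can be written as follows. This is a minimal illustration on an invented quadratic lower bound with noisy gradients; the function names, the toy target, and all tuning constants are our own assumptions rather than part of the paper.

```python
import numpy as np

def adadelta_sga(grad_estimate, lam0, n_steps=2000, rho=0.95, eps=1e-6):
    """Stochastic gradient ASCENT with ADADELTA step sizes (Zeiler, 2012).

    grad_estimate(lam) returns an unbiased estimate of the lower bound
    gradient; the step size vector adapts per coordinate."""
    lam = np.array(lam0, dtype=float)
    Eg2 = np.zeros_like(lam)  # running average of squared gradients
    Ed2 = np.zeros_like(lam)  # running average of squared updates
    for _ in range(n_steps):
        g = grad_estimate(lam)
        Eg2 = rho * Eg2 + (1.0 - rho) * g**2
        step = np.sqrt(Ed2 + eps) / np.sqrt(Eg2 + eps) * g  # element-wise step
        Ed2 = rho * Ed2 + (1.0 - rho) * step**2
        lam = lam + step  # ascent, not descent
    return lam

# Toy illustration: maximize L(lam) = -0.5 * ||lam - 3||^2, whose exact
# gradient is (3 - lam), observed here with additive noise.
rng = np.random.default_rng(0)
noisy_grad = lambda lam: (3.0 - lam) + 0.1 * rng.standard_normal(lam.shape)
lam_hat = adadelta_sga(noisy_grad, np.zeros(2))
```

Here the adaptive ratio $\sqrt{E[\Delta^2]+\epsilon}/\sqrt{E[g^2]+\epsilon}$ plays the role of $\rho_i$, so no global learning rate needs to be tuned, which is one reason adaptive schemes of this kind are popular for variational optimization.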

To implement SGA, unbiased estimates of the gradient of the lower bound are required. These can be obtained directly by differentiating (2), and evaluating the expectation in a Monte Carlo fashion by simulating from $q_\lambda$. However, variance reduction methods for the gradient estimation are often also important for fast convergence and stability. One of the most useful is the 'reparametrization trick' (Kingma and Welling, 2014; Rezende et al., 2014). In this approach, it is assumed that an iterate $\theta$ can be generated from $q_\lambda$ by first drawing $\varepsilon$ from a density $f_\varepsilon$ which does not depend on $\lambda$, and then transforming $\varepsilon$ by a deterministic function $\theta = h(\varepsilon,\lambda)$ of $\varepsilon$ and $\lambda$. From (2), the lower bound can be written as the following expectation with respect to $f_\varepsilon$:

$$\mathcal{L}(\lambda) = E_{f_\varepsilon}\big[\log g(h(\varepsilon,\lambda)) - \log q_\lambda(h(\varepsilon,\lambda))\big]. \tag{3}$$

Differentiating under the integral sign gives

$$\nabla_\lambda \mathcal{L}(\lambda) = E_{f_\varepsilon}\Big[\nabla_\lambda\big\{\log g(h(\varepsilon,\lambda)) - \log q_\lambda(h(\varepsilon,\lambda))\big\}\Big], \tag{4}$$

and approximating the expression (4) by Monte Carlo using one or more random draws from $f_\varepsilon$ gives an unbiased estimate of $\nabla_\lambda\mathcal{L}(\lambda)$. An intuitive reason for the success of the reparametrization trick is that it allows gradient information from the log-posterior to be used, by moving the variational parameters inside $g(\cdot)$ in (3). Xu et al. (2018) show how the trick reduces the variance of the gradient estimates when $q_\lambda$ is a Gaussian with diagonal covariance matrix (the so-called 'mean field' Gaussian approximation). We employ the reparametrization trick, and specify a function $h$ for a skew-normal copula in Section 4.
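To make the estimator concrete, the following sketch computes a reparametrization gradient estimate for a mean field Gaussian approximation of an invented Gaussian target. Here $h(\varepsilon,\lambda) = \mu + \sigma \circ \varepsilon$, and because the Gaussian entropy is available in closed form, only the gradient of $\log g$ is needed; the target `grad_log_g` and all constants are illustrative assumptions, not the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented target: log g(theta) is the kernel of N(m0, diag(s0^2)).
m0 = np.array([1.0, -2.0])
s0 = np.array([0.5, 1.5])

def grad_log_g(theta):
    return -(theta - m0) / s0**2

def reparam_gradient(mu, sigma, n_draws=5000):
    """Reparametrization gradient of the lower bound for a mean field
    Gaussian q_lambda, using theta = h(eps, lambda) = mu + sigma * eps and
    the closed-form entropy term sum(log sigma) + const."""
    eps = rng.standard_normal((n_draws, mu.size))
    theta = mu + sigma * eps               # h(eps, lambda)
    g = grad_log_g(theta)                  # log-posterior gradient information
    grad_mu = g.mean(axis=0)
    grad_sigma = (g * eps).mean(axis=0) + 1.0 / sigma  # + entropy gradient
    return grad_mu, grad_sigma

# At the exact optimum (mu, sigma) = (m0, s0) both gradients are near zero.
gm, gs = reparam_gradient(m0.copy(), s0.copy())
```

Because the gradient of the log posterior enters the estimator directly, the Monte Carlo variance is small near the optimum, which is the intuition given above for the success of the trick.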

### 2.2 Variational approximations through transformations

Let $\{t_\gamma\}$ be a family of one-to-one transformations onto the real line with parameter vector $\gamma$. To construct our variational approximation, we transform each parameter as $\psi_i = t_{\gamma_i}(\theta_i)$, and adopt a known distribution function $F$, with vector of parameters $\pi$, for $\psi = (\psi_1,\ldots,\psi_m)^\top$. For example, if $F$ is the distribution function of a multivariate normal, then $\pi$ consists of the distinct elements of the mean vector $\mu_\psi$ and covariance matrix $\Sigma_\psi$. If $p(\psi;\pi)$ is the density of $\psi$, then the density of the approximation can be recovered by computing the Jacobian of the element-wise transformation from $\theta$ to $\psi$, so that

$$q_\lambda(\theta) = p(\psi;\pi)\prod_{i=1}^{m} t'_{\gamma_i}(\theta_i), \tag{5}$$

where $\psi = (t_{\gamma_1}(\theta_1),\ldots,t_{\gamma_m}(\theta_m))^\top$, and the variational parameters are $\lambda = (\pi^\top, \gamma_1^\top,\ldots,\gamma_m^\top)^\top$. Moreover, if $\psi$ has known marginal distribution functions $F_i(\psi_i;\pi_i)$ and densities $p_i(\psi_i;\pi_i)$, for $i = 1,\ldots,m$, then the marginal densities of the approximation are

$$q_{\lambda_i}(\theta_i) = p_i(\psi_i;\pi_i)\, t'_{\gamma_i}(\theta_i), \quad \text{for } i = 1,\ldots,m, \tag{6}$$

with $\pi_i$ a sub-vector of $\pi$.
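To illustrate (5) and (6) in one dimension, the sketch below takes $p_i$ Gaussian and $t_{\gamma_i}$ the Yeo-Johnson transformation introduced in Section 2.3, and checks numerically that the resulting marginal density integrates to one; the parameter values are arbitrary.

```python
import numpy as np

def t_yj(theta, gamma):
    """Yeo-Johnson transformation, for 0 < gamma < 2."""
    theta = np.asarray(theta, dtype=float)
    out = np.empty_like(theta)
    pos = theta >= 0
    out[pos] = ((theta[pos] + 1.0)**gamma - 1.0) / gamma
    out[~pos] = -((-theta[~pos] + 1.0)**(2.0 - gamma) - 1.0) / (2.0 - gamma)
    return out

def t_yj_deriv(theta, gamma):
    """Derivative t'_gamma(theta): the Jacobian term in (5) and (6)."""
    theta = np.asarray(theta, dtype=float)
    out = np.empty_like(theta)
    pos = theta >= 0
    out[pos] = (theta[pos] + 1.0)**(gamma - 1.0)
    out[~pos] = (-theta[~pos] + 1.0)**(1.0 - gamma)
    return out

def q_marginal(theta, mu, sigma, gamma):
    """Marginal density (6): Gaussian p_i evaluated at psi = t_gamma(theta),
    multiplied by the Jacobian of the transformation."""
    psi = t_yj(theta, gamma)
    p_i = np.exp(-0.5 * ((psi - mu) / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))
    return p_i * t_yj_deriv(theta, gamma)

# Because t_gamma maps the real line onto itself monotonically, the density
# integrates to one; check by simple quadrature on a wide grid.
grid = np.linspace(-30.0, 30.0, 200001)
dens = q_marginal(grid, mu=0.3, sigma=1.0, gamma=1.4)
area = float(np.sum(dens) * (grid[1] - grid[0]))
```

Skewness in the margin is produced entirely by $\gamma \neq 1$; at $\gamma = 1$ the transformation is the identity and the margin is exactly Gaussian.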

The density at (5) can also be represented using its copula decomposition as follows. If $F(\psi;\pi)$ is the distribution function of $\psi$, then

$$q_\lambda(\theta) = c(u;\tilde\pi)\prod_{i=1}^{m} q_{\lambda_i}(\theta_i), \tag{7}$$

where $u = (u_1,\ldots,u_m)^\top$ with $u_i = F_i(\psi_i;\pi_i)$, and $c$ is an $m$-dimensional copula density with parameter vector $\tilde\pi$. In much of the existing copula modeling literature, a parametric copula is selected for $c$. When this is combined with pre-specified margins, it results in a flexible distributional form for $q_\lambda(\theta)$; for example, in the variational inference literature Tran et al. (2015) use a vine copula. However, here the copula is instead derived directly from (5) and (6) by inverting Sklar's theorem, with copula density

$$c(u;\tilde\pi) = \frac{p(\psi;\pi)}{\prod_{i=1}^{m} p_i(\psi_i;\pi_i)} = \frac{p\big((F_1^{-1}(u_1;\pi_1),\ldots,F_m^{-1}(u_m;\pi_m))^\top;\pi\big)}{\prod_{i=1}^{m} p_i\big(F_i^{-1}(u_i;\pi_i);\pi_i\big)},$$

and copula function

$$C(u;\tilde\pi) = F\big(F_1^{-1}(u_1;\pi_1),\ldots,F_m^{-1}(u_m;\pi_m);\pi\big),$$

determined by the choice of $F$. Such a copula is called an 'inversion copula' (Nelsen, 2006, pp. 51–52) or an 'implicit copula' (McNeil et al., 2005). In general, the copula parameters $\tilde\pi$ are given by $\pi$, but with additional constraints imposed to ensure they are identifiable in the copula; see Smith and Maneesoonthorn (2018) for examples. However, here the elements of $\pi$ are also parameters of the margins at (6), and this identifies $\pi$ in $q_\lambda(\theta)$ without any additional constraints.

The most popular choice for $p(\psi;\pi)$ is a Gaussian distribution, resulting in the Gaussian copula (Song, 2000). More recently, there has been growing interest in selecting other distributions, such as the skew-t distribution (Demarta and McNeil, 2005; Smith et al., 2012) or those arising from state space models (Smith and Maneesoonthorn, 2018). These can produce distributional families for $q_\lambda(\theta)$ that are more flexible in their dependence structures. Later, we will illustrate our approach with sparse Gaussian and skew-normal distributions for $p(\psi;\pi)$, but note that other parametric distributions can also be used.

We observe that the expression at (5) is much easier to employ in variational inference than that at (7), for three reasons. First, as mentioned above, the constraints on $\tilde\pi$ required to identify the copula do not need to be elucidated, as $\pi$ is fully identified in (5). Second, evaluating (7) requires repeated computation of the vector $u$ (which in general involves numerical integration), whereas evaluating (5) does not. Third, optimizing the lower bound with respect to $\tilde\pi$ proves more difficult than optimizing with respect to the unconstrained $\pi$; an observation made previously by Han et al. (2016) for Gaussian copula variational approximation.

### 2.3 Two transformations

Key to the success of our approach is the choice of an appropriate family of transformations $\{t_\gamma\}$. For this we consider two choices that have proven successful in transforming data to normality or symmetry. The first is the one parameter transformation of Yeo and Johnson (2000) (YJ hereafter), which extends the Box-Cox transformation to the entire real line. It is given here as

$$t_\gamma(\theta) = \begin{cases} -\dfrac{(-\theta+1)^{2-\gamma} - 1}{2-\gamma} & \text{if } \theta < 0, \\[8pt] \dfrac{(\theta+1)^{\gamma} - 1}{\gamma} & \text{if } \theta \geq 0, \end{cases} \qquad \text{for } 0 < \gamma < 2.$$

The second is the two parameter (monotonic) G&H transformation of Tukey (1977), an overview of which can be found in Headrick et al. (2008), and is

$$t_\gamma(\theta) = \begin{cases} \dfrac{\exp(g\theta) - 1}{g}\,\exp(h\theta^2/2) & \text{if } g \neq 0, \\[8pt] \theta\exp(h\theta^2/2) & \text{if } g = 0, \end{cases} \qquad \text{for } \gamma = (g,h)^\top \text{ with } h > 0.$$

For both transformations $t_\gamma : \mathbb{R} \to \mathbb{R}$, so that if a parameter is constrained we first transform it to the real line; for example, a scale or variance parameter is replaced by its logarithm. These transformations, their inverses, and their derivatives with respect to the model and variational parameters (all of which are required to implement the SGA algorithm) are fast to compute, with expressions given in Table 1.
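As an illustration of the G&H family, the following minimal sketch (our own, with arbitrary parameter values) evaluates the transformation and one form of its derivative obtained from the product rule, and verifies monotonicity numerically for $h > 0$.

```python
import numpy as np

def t_gh(theta, g, h):
    """Tukey G&H transformation; monotone from the real line onto itself
    when h > 0."""
    theta = np.asarray(theta, dtype=float)
    base = theta if g == 0.0 else (np.exp(g * theta) - 1.0) / g
    return base * np.exp(0.5 * h * theta**2)

def t_gh_deriv(theta, g, h):
    """Derivative for g != 0, from the product rule:
    exp(h theta^2 / 2) * (exp(g theta) + h theta (exp(g theta) - 1) / g)."""
    theta = np.asarray(theta, dtype=float)
    egt = np.exp(g * theta)
    return np.exp(0.5 * h * theta**2) * (egt + h * theta * (egt - 1.0) / g)

grid = np.linspace(-3.0, 3.0, 61)
vals = t_gh(grid, 0.3, 0.1)       # skewness parameter g, tail parameter h
derivs = t_gh_deriv(grid, 0.3, 0.1)
```

The parameter $g$ controls skewness and $h$ the heaviness of the tails. Unlike the YJ transformation, inverting $t_\gamma$ here requires one-dimensional root-finding, which may partly explain why the G&H copulas prove slower to calibrate in the examples of Section 4.2.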

## 3 Gaussian Copula Variational Approximation

### 3.1 Gaussian copula factor specification

The simplest implicit copula is the Gaussian copula, where $p(\psi;\pi)$ is a multivariate normal density with mean $\mu_\psi$ and covariance matrix $\Sigma_\psi$. In constructing a Gaussian copula, it is usual to also set $\mu_\psi = 0$ and $\Sigma_\psi$ equal to a correlation matrix, because these parameters are unidentified in the Gaussian copula function; for example, see the discussion in Song (2000). However, these parameters are fully identified in the density at (5) because they are also parameters of its margins at (6). To illustrate, Figure 1 plots the marginal density $q_{\lambda_i}(\theta_i)$ for the YJ transformation, showing that this density can capture both positive and negative skew. Moreover, the direction and level of skew can differ in each margin, depending on the elements of $\gamma$, making $q_\lambda(\theta)$ a substantially more flexible approximation than a Gaussian.

When $\theta$ is of high dimension $m$, we follow Ong et al. (2018) and adopt a factor structure for $\Sigma_\psi$ as follows. Let $B$ be an $m \times p$ matrix with $p \ll m$. For identifiability reasons it is assumed that the upper triangle of $B$ is zero. Let $d = (d_1,\ldots,d_m)^\top$ be a vector of parameters, and denote by $D$ the diagonal matrix with entries $d_1,\ldots,d_m$. We assume that

$$\Sigma_\psi = BB^\top + D^2, \tag{8}$$

so that the number of parameters in $\Sigma_\psi$ grows only linearly with $m$ if $p$ is kept fixed. We note that this copula is equivalent to the Gaussian factor copula suggested by Murray et al. (2013) and Oh and Patton (2017) to model data, although these authors do not use it as a variational approximation. The Gaussian random vector $\psi$ has the generative representation $\psi = \mu_\psi + Bz + d \circ \varepsilon$, where $z \sim N(0, I_p)$ and $\varepsilon \sim N(0, I_m)$. By setting $\varepsilon' = (z^\top, \varepsilon^\top)^\top$ and $h(\varepsilon', \lambda) = \mu_\psi + Bz + d \circ \varepsilon$, the closed form reparametrization gradients for a Gaussian variational approximation with factor covariance structure given in Ong et al. (2018) can be used directly, with variational parameters $\lambda = (\mu_\psi^\top, \mathrm{vech}(B)^\top, d^\top, \gamma^\top)^\top$. Here, the 'vech' operator denotes the half-vectorization of a rectangular matrix, which for an $m \times p$ matrix with $p \leq m$ stacks the elements on and below the main diagonal of each column into a single vector.
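The generative representation underlying the factor structure can be sketched as follows; the dimensions, seed, and parameter values are invented for illustration, and a long run of draws is used only to verify that the implied covariance matches (8).

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 8, 2  # m model parameters, p << m factors

# Factor scale matrix Sigma_psi = B B^T + D^2, with the upper triangle
# of B set to zero for identifiability.
B = np.tril(rng.normal(size=(m, p)))
d = 0.5 + rng.random(m)
mu = rng.normal(size=m)

def draw_psi(n):
    """Reparametrized draws psi = mu + B z + d * eps, with z ~ N(0, I_p)
    and eps ~ N(0, I_m), as used for the reparametrization trick."""
    z = rng.standard_normal((n, p))
    eps = rng.standard_normal((n, m))
    return mu + z @ B.T + d * eps

psi = draw_psi(200000)
Sigma_psi = B @ B.T + np.diag(d**2)
```

Only roughly $mp + 2m$ variational parameters ($\mu_\psi$, $B$ and $d$) are needed for the Gaussian component, instead of the $m + m(m+1)/2$ required by a dense covariance matrix.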

### 3.2 Application: ordinal time series copula model

#### 3.2.1 The model and extended likelihood

To illustrate our proposed variational approximation, we use it to estimate a model with a complex augmented posterior, where its greater flexibility may increase the accuracy of inference. We consider the copula time series model for an ordinal-valued random vector $y = (y_1,\ldots,y_T)^\top$ proposed by Loaiza-Maya and Smith (2018). These authors use a $T$-dimensional parsimonious copula with density $c_{DV}(v;\eta)$, where $v = (v_1,\ldots,v_T)^\top$, to capture serial dependence in $y$ (this copula is not to be confused with the copula used for the variational approximation). The time series is assumed to be stationary with marginal distribution function $G$, which is estimated non-parametrically in an initial step using the empirical distribution function.

The time series copula employed is a parsimonious drawable vine (D-vine) copula of Markov order $p$, as given in Smith (2015), and defined as follows. Let $\{v_t\}$ be a stochastic process with $v_t$ marginally uniform on $(0,1)$. For $1 \leq k < t$, denote by $v_{t|t-k+1}$ the conditional distribution function of $v_t$ given $(v_{t-k+1},\ldots,v_{t-1})$, evaluated at $v_t$, and by $v_{t-k|t-1}$ the conditional distribution function of $v_{t-k}$ given the same values, evaluated at $v_{t-k}$; then the D-vine copula density is the product

$$c_{DV}(v;\eta) = \prod_{t=2}^{T}\;\prod_{k=1}^{\min(t-1,\,p)} c^{\mathrm{MIX}}_{k}\big(v_{t-k|t-1},\, v_{t|t-k+1};\, \eta_k\big), \tag{9}$$

of bivariate copula densities called 'pair-copulas' (Aas et al., 2009), each with individual parameter vector $\eta_k$. The D-vine copula therefore has parameter vector $\eta = (\eta_1^\top,\ldots,\eta_p^\top)^\top$, which is parsimonious because its dimension is not a function of $T$. To capture the heteroskedasticity that exists in most ordinal-valued time series, Loaiza-Maya and Smith (2018) use a five parameter mixture copula for each $c^{\mathrm{MIX}}_k$, which we also use here and which is outlined in Part A of the Online Appendix, leading to a total of $5p$ model parameters in $\eta$. Given $v$, the arguments of the pair-copulas in (9) are computed using the recursive Algorithm 1 in Smith (2015).

It is widely known (Song, 2000; Genest and Nešlehová, 2007) that the mass function of this discrete-margined copula model is computationally intractable, so we use the extended likelihood of Smith and Khaled (2012) instead. This employs the latent vector $v = (v_1,\ldots,v_T)^\top$, such that the joint density/mass function of $(y, v)$ is

$$p(y, v\,|\,\eta) = c_{DV}(v;\eta)\prod_{t=1}^{T} I(a_t \leq v_t < b_t), \tag{10}$$

where $a_t = G(y_t - 1)$, $b_t = G(y_t)$, and the indicator function $I(X) = 1$ if $X$ is true, and zero otherwise. It is straightforward to show that the margin in $y$ of (10) is the required mass function $p(y|\eta)$. Evaluating the extended likelihood at (10) avoids the computational burden of evaluating $p(y|\eta)$ directly.
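A small sketch of how the constraint bounds in (10) can be computed for a count series, assuming $a_t = G(y_t - 1)$ and $b_t = G(y_t)$, with $G$ an empirical distribution function rescaled by $1/(T+1)$ so that the bounds lie strictly inside the unit interval; the series and all names are invented for illustration.

```python
import numpy as np

# Invented count series for illustration.
y = np.array([3, 1, 0, 2, 2, 4, 1, 3, 2, 0])
T = len(y)

def G_hat(s):
    """Empirical distribution function, rescaled by 1/(T + 1) so that the
    upper bounds b_t stay strictly below one."""
    return np.sum(y <= s) / (T + 1.0)

# Bounds of the constraint a_t <= v_t < b_t in the extended likelihood.
a = np.array([G_hat(v - 1) for v in y])
b = np.array([G_hat(v) for v in y])

# Any v with components inside the bounds contributes c_DV(v; eta) to (10);
# for example, the interval midpoints are always feasible.
v_mid = 0.5 * (a + b)
```

Each distinct observed count occupies a non-degenerate interval $[a_t, b_t)$, which is what makes data augmentation over $v$ well defined.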

#### 3.2.2 The variational approximation

We follow Loaiza-Maya and Smith (2018) and estimate the model by setting $\theta = (v^\top, \eta^\top)^\top$ and approximating the augmented posterior $p(v, \eta\,|\,y) \propto p(y, v\,|\,\eta)p(\eta)$, which uses the extended likelihood and a proper uniform prior $p(\eta)$. The target distribution therefore has dimension $T$ plus the dimension of $\eta$. These authors use a variational approximation which assumes independence between $v$ and $\eta$, with a Gaussian distribution with a factor covariance structure for $\eta$. However, because each $v_t$ is constrained, it is first transformed to the real line as $\tilde v_t$, and independent normals are used as approximations for the $\tilde v_t$.

Loaiza-Maya and Smith (2018) label this approximation 'VA2', and we extend it as follows. For $\eta$ we use a Gaussian copula formed through the YJ transformation with a factor structure, so that the corresponding variational parameters comprise the unique elements of the factor decomposition plus the YJ transformation parameters. For each $\tilde v_t$ we use a normal approximation after a YJ transformation, so that the corresponding variational parameters comprise the means and variances of the normals, plus the YJ transformation parameters. The full set of variational parameters is denoted $\lambda$. They are calibrated using Algorithm 1 of Loaiza-Maya and Smith (2018), which employs SGA with control variates and the analytical gradient $\nabla_\lambda \mathcal{L}(\lambda)$; the latter is given in Appendix A for the copula approximation outlined here.

#### 3.2.3 Empirical illustration: monthly counts of attempted murder

We fit the time series model in Section 3.2.1 to monthly counts of Attempted Murder in New South Wales, Australia. Plots of the time series and the empirical distribution function used for the margin $G$ can be found in Loaiza-Maya and Smith (2018, Fig. 1). The parsimonious D-vine at (9) has a low Markov order, but the target density is complex and high-dimensional. We fit three parsimonious variational approximations: (i) the Gaussian copula outlined above with a factor scale matrix; (ii) a Gaussian distribution with a factor covariance matrix; and (iii) a fully mean field Gaussian. Note that (ii) is equivalent to our copula approximation but with all YJ parameters set to one (i.e. an identity transformation), as is (iii), but with the additional constraint that the covariance matrix is diagonal. Figure 2 plots lower bound values against step number for all three methods using the same SGA algorithm, and the copula approximation clearly dominates.

To assess the accuracy of the three variational approximations, we also estimate the posterior using the (very slow, but exact) data augmentation MCMC method of Smith and Khaled (2012). Figure 3 depicts the accuracy of the first three marginal posterior moments of the variational approximations. The panels provide scatterplots of the true moments against their approximations, with a blue scatter for the proposed copula approximation, and a red scatter for the Gaussian approximation. The left-hand panels give results for the copula parameters $\eta$, and the right-hand panels for the latent variables $v$. More accurate variational approximations result in scatters that lie closer to the 45 degree line, and we make two observations. First, panels (e,f) show that the true posteriors are skewed, and that the copula approximation does a very good job of estimating the skew. Second, panel (c) reveals that by capturing the third moment in the augmented vector $v$, the posterior standard deviation of $\eta$ is also estimated more accurately. Figure 4 compares the marginal densities for the four parameters which exhibit the most skew, and the tails are more accurately estimated using the copula approximation.

## 4 Skew-Normal Copula Approximation

### 4.1 Copula specification

An alternative implicit copula that we consider is based on the skew-normal distribution of Azzalini and Dalla Valle (1996) and Azzalini and Capitanio (2003). In this case, the transformed parameters $\psi$ are assumed to have the joint density

$$p(\psi;\pi) = 2\,\phi_m(\psi;\mu_\psi,\Sigma_\psi)\,\Phi_1\!\big(\alpha^\top S_\psi^{-1/2}(\psi - \mu_\psi)\big), \tag{11}$$

where $\phi_m(\psi;\mu_\psi,\Sigma_\psi)$ denotes the $m$-dimensional multivariate normal density with mean $\mu_\psi$ and covariance matrix $\Sigma_\psi$, $\Phi_1$ is the distribution function of a univariate standard normal, $S_\psi$ is the diagonal matrix with entries $\sigma^2_{\psi,1},\ldots,\sigma^2_{\psi,m}$, and $\sigma^2_{\psi,i}$ is the $i$th diagonal element of $\Sigma_\psi$. The parameters $\alpha = (\alpha_1,\ldots,\alpha_m)^\top$ determine the level of skew in the marginals of $\psi$, and when $\alpha = 0$ the distribution reduces to a Gaussian. As noted in Section 2.2, the parameters $\mu_\psi$, $\Sigma_\psi$ and $\alpha$ are fully identified in the parametrization of $q_\lambda(\theta)$ via both the implicit copula and margins.

Demarta and McNeil (2005), Smith et al. (2012) and Yoshiba (2018) show that implicit copulas constructed from skew-elliptical distributions are more flexible than elliptical copulas because they allow for asymmetric dependence. (This is not to be confused with asymmetry of the marginal distributions at (6).) Here, we focus on the skew-normal copula because it is typically faster and easier to calibrate than the skew-t copula. When $\alpha \neq 0$ it captures asymmetric dependence, making it more flexible than the Gaussian copula considered in Section 3, although the same factor structure discussed in Section 3.1 is adopted for the scale matrix $\Sigma_\psi$. Therefore, the approximation to the target has variational parameters $\lambda = (\mu_\psi^\top, \mathrm{vech}(B)^\top, d^\top, \alpha^\top, \gamma^\top)^\top$, where $B$ and $d$ are as defined in Section 3.1.

In our empirical examples, we employ the reparametrization trick to reduce the variance of the gradient estimate. This uses a simple generative representation of $\psi$ in terms of standardized random components. Using the properties of the skew-normal distribution (Azzalini and Dalla Valle, 1996), a generative representation for $\psi$ can be derived, the details of which are given in Part B of the Online Appendix: if $z_0 \sim N(0,1)$, $z \sim N(0, I_p)$ and $\varepsilon \sim N(0, I_m)$ are drawn independently, then a deterministic function $\psi = h(\varepsilon',\lambda)$ of $\varepsilon' = (z_0, z^\top, \varepsilon^\top)^\top$ is distributed skew-normal with density at (11). Setting $f_{\varepsilon'}$ to the $N(0, I_{m+p+1})$ density, the gradient at (4) can be evaluated by first drawing $\varepsilon'$ from this distribution, and then computing the derivatives analytically; see Appendix B for details.
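The latent Gaussian structure is easiest to see in one dimension. The sketch below uses the standard additive representation of a univariate skew-normal variate (with $\delta = \alpha/\sqrt{1+\alpha^2}$), which is the same idea exploited by the multivariate representation above; the parameter values and moment check are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_skew_normal(mu, sigma, alpha, n):
    """Additive latent Gaussian representation of the univariate skew-normal:
    with delta = alpha / sqrt(1 + alpha^2) and independent w0, w1 ~ N(0, 1),
    psi = mu + sigma * (delta * |w0| + sqrt(1 - delta^2) * w1)
    has density 2 phi(z) Phi(alpha z) / sigma, where z = (psi - mu) / sigma."""
    delta = alpha / np.sqrt(1.0 + alpha**2)
    w0 = rng.standard_normal(n)  # latent variable generating the skew
    w1 = rng.standard_normal(n)
    return mu + sigma * (delta * np.abs(w0) + np.sqrt(1.0 - delta**2) * w1)

psi = draw_skew_normal(mu=1.0, sigma=2.0, alpha=4.0, n=400000)

# Known first moment of the skew-normal: mu + sigma * delta * sqrt(2 / pi).
delta = 4.0 / np.sqrt(17.0)
theory_mean = 1.0 + 2.0 * delta * np.sqrt(2.0 / np.pi)
```

Because $\psi$ is a deterministic transformation of standard Gaussian draws, the reparametrization gradient at (4) can be computed just as in the Gaussian case, at the cost of one extra latent variable.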

### 4.2 Examples

To illustrate the use of a skew-normal copula as a variational approximation, we employ it to approximate the posteriors of several logistic regressions examined previously in Ong et al. (2018).

#### 4.2.1 Mixed logistic regression

The first example uses the polypharmacy longitudinal data in Hosmer et al. (2013), which features data on 500 subjects over 7 years. The logistic regression is specified fully in Ong et al. (2018), and it includes 8 fixed effects (including an intercept), plus one subject-based random effect. The following approximations are fitted to the augmented posterior of $\theta$, which comprises the random effect variance parameter (transformed to the real line), the 8 fixed effect coefficients, and the 500 random effect values:

• A1 Mean Field Normal: independent univariate Gaussians

• A2 Mean Field YJ Transform: independent univariate distributions with densities at (6), where each $p_i$ is a Gaussian density and each $t_{\gamma_i}$ is a YJ transformation

• A3 Gaussian: as in Ong et al. (2018)

• A4 Skew-Normal: as at (11)

• A5 Gaussian Copula: as outlined in Section 3.1, with a YJ transformation

• A6 Skew-Normal Copula: as outlined in Section 4.1, where $t_\gamma$ is a YJ transformation

• A7 Gaussian Copula: as outlined in Section 3.1, with a G&H transformation

• A8 Skew-Normal Copula: as outlined in Section 4.1, where $t_\gamma$ is a G&H transformation

In approximations A3–A8, a factor structure is used for the covariance matrix (A3), the scale matrix (A4), or the scale matrix of the copula (A5–A8). For each approximation, Table 2 lists the number of variational parameters, the average lower bound value over the last 1000 steps of the SGA algorithm, and the time to complete 1000 steps using MATLAB on a standard laptop. Comparing the lower bound values for A2 and A1 shows that allowing for asymmetry in the margins improves the approximation markedly, although using the skew-normal A4 is not as effective. The most accurate approximations are the Gaussian copulas A5 and A7. However, it proves much faster to calibrate the copulas constructed through the YJ, rather than the G&H, transformation. The time to complete 1000 SGA steps for A5 and A6 is only 29.8% and 56.6% longer than for A1, making them computationally attractive choices.

To judge the approximation accuracy, the exact augmented posterior is computed using MCMC with data augmentation. Figure 5 plots the first three posterior moments of the approximations (vertical axes) against their true values (horizontal axes). Results are given for approximations A3 (panels a,e,i), A4 (panels b,f,j), A5 (panels c,g,k) and A6 (panels d,h,l). All four identify the means well, but the striking result is that the two copula approximations capture the (Pearson) skew coefficients remarkably well in panels (k,l). By doing so, the estimates of the second moment in panels (g,h) are also improved. Figure 6 illustrates further by plotting the exact posterior densities for the 9 model parameters (excluding the random effects), along with those of approximations A1, A3 and A5. As is well known, ignoring the dependence between parameters greatly understates the posterior standard deviations. However, adopting the Gaussian copula A5 improves the density estimates compared to a Gaussian, particularly for the parameter in panel (i). The latter is likely due to the skew in the posteriors of many random effect values, which is captured by the copula.
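For reference, the kind of augmented log posterior $\log g(\theta)$ targeted in this example can be sketched as below, assuming a logit link and one random intercept per subject. The priors and all names here are illustrative assumptions, not the exact specification of Ong et al. (2018).

```python
import numpy as np

def log_g(beta, b, log_sigma, y, X, subj):
    """Hypothetical augmented log posterior for a mixed logistic regression:
    fixed effects beta, one random intercept b[i] per subject, and the log
    random effect standard deviation log_sigma, all approximated jointly.
    The N(0, 100) prior on beta is an illustrative assumption, and the
    prior on log_sigma is left flat for brevity."""
    sigma = np.exp(log_sigma)
    eta = X @ beta + b[subj]                           # linear predictor
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))   # Bernoulli log likelihood
    logprior_b = -0.5 * np.sum(b ** 2) / sigma ** 2 - b.size * np.log(sigma)
    logprior_beta = -0.5 * np.sum(beta ** 2) / 100.0
    return loglik + logprior_b + logprior_beta
```

Because the 500 random effect values enter `log_g` directly, the variational approximation is over all 509 elements of the augmented parameter vector, which is why the factor and copula structures above matter for tractability.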

#### 4.2.2 Logistic regression

To illustrate the trade-off between speed and approximation accuracy, we consider the Spam, Krkp, Ionosphere and Mushroom test datasets considered in Ong et al. (2018). These are used to fit logistic regressions with 104, 111, 37 and 95 covariates, respectively. We use the same prior on the linear coefficients of the covariates as these authors, and fit approximations A3–A8 using the same number of factors throughout. Table 3 reports the average lower bounds over the last 1000 steps. By this metric, either the skew-normal (A4) or the skew-normal copula with a YJ transformation (A6) is the most accurate, followed by the skew-normal copula with a G&H transformation (A8). Figure 7 compares the calibration speed of approximations A3–A6 by plotting the lower bound against the time to implement the SGA algorithm (in MATLAB on a DELL workstation). This shows that including the YJ transformation can increase the calibration speed in these examples, which can also be an important consideration when using variational inference in big data problems. Approximations A7 and A8 are much slower to calibrate (by over ten times), and are not included in Figure 7 because their calibration times are on a different scale.
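The average lower bound used as the accuracy metric above is a Monte Carlo average of noisy single-draw ELBO estimates. A generic sketch follows; the function names are illustrative, and `log_g` need only be known up to a normalizing constant.

```python
import numpy as np

def elbo_estimate(log_g, log_q, sample_q, rng, n_draws=1):
    """Noisy unbiased estimate of L(lambda) = E_q[log g(theta) - log q(theta)],
    the variational lower bound used as the accuracy metric in the text."""
    draws = [sample_q(rng) for _ in range(n_draws)]
    return float(np.mean([log_g(th) - log_q(th) for th in draws]))
```

In the experiments the reported value is the average of such single-draw estimates over the final 1000 SGA steps, i.e. something like `np.mean(trace[-1000:])`, which smooths out the Monte Carlo noise.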

## 5 Discussion

In this paper we show how to employ copula model approximations in variational inference using element-wise transformations of the parameter space. This type of copula is called an 'implicit copula', and is obtained from the choice of multivariate distribution for the transformed parameters. We suggest using parametric transformations that are effective in regression modelling, and illustrate with the power transformation of Yeo and Johnson (2000) and the G&H transformation of Tukey (1977). The implied margins of such transformations are available in closed form, and depend on both the transformation selected and the margins of the multivariate distribution. While, in principle, any multivariate distribution can be selected, elliptical and skew-elliptical (Genton, 2004) distributions are good choices for two reasons. First, they give rise to implicit copulas that have previously been shown to be effective; see, for example, Fang et al. (2002), Demarta and McNeil (2005) and Smith et al. (2012). Second, by employing a factor decomposition for the scale matrix, the number of copula parameters increases only linearly with the dimension of the target distribution.

The approximation provides a balance between computational viability and accuracy. We illustrate here using Gaussian and skew-normal copulas of high dimension, although even higher dimensions can also be considered. Our empirical work shows that the Yeo-Johnson transformation is particularly effective and is quickly calibrated using SGA; in most cases, faster than calibrating the elliptical or skew-elliptical distribution directly on the parameter vector. Defining the copula approximation through element-wise transformations simplifies the computations required to implement variational inference, compared to selecting a copula function (such as a vine copula; Tran et al., 2015) and marginals separately. Han et al. (2016) make a similar observation for a Gaussian copula, and we show that this applies generally to all implicit copulas. Another important observation is that the constraints usually employed to identify the parameters of the implicit copula are not required here, because the parameters are identified through the margins.
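The benefit of the factor decomposition can be made concrete: for an m-dimensional transformed parameter with p factors, the scale matrix Σ = BB⊤ + D² has O(m) free parameters, and draws can be generated without ever forming the m × m matrix. A minimal sketch, with all values illustrative:

```python
import numpy as np

def sample_factor_gaussian(mu, B, d, rng):
    """Draw psi ~ N(mu, B B^T + D^2) with D = diag(d), using the factor
    structure directly so the m x m scale matrix is never formed."""
    m, p = B.shape
    z = rng.standard_normal(p)      # factor scores
    eps = rng.standard_normal(m)    # idiosyncratic noise
    return mu + B @ z + d * eps

def n_scale_params(m, p):
    """Free elements of (B, d): B is m x p with zero upper triangle."""
    return m * p - p * (p - 1) // 2 + m
```

For example, with m = 500 and p = 5 the scale matrix has `n_scale_params(500, 5) = 2990` free parameters, against 125,250 for an unrestricted covariance matrix of the same dimension.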

Last, we comment on possible extensions to our work. One interesting direction is to consider other flexible multivariate models for constructing the implicit copula. Truncated Gaussian graphical models (Su et al., 2016) are a natural candidate, since they include the skew-normal distribution as a special case and, like the skew-normal, have a latent Gaussian structure that may be amenable to reparametrization methods for gradient estimation in the optimization. It would also be interesting to implement our copula approximations in other challenging settings, such as when some of the parameters are discrete, or in likelihood-free inference applications. There, gradient estimation for the optimization becomes more challenging, because straightforward reparametrization techniques do not immediately apply.

## Appendix A

This appendix derives the gradient needed to implement the example in Section 3.2.1. In this example, $\theta=(\eta^\top,v^\top)^\top$, where $\eta$ denotes the model parameters and $v$ the vector of auxiliary variables. The approximation to the augmented posterior of $\theta$ is

$$q_\lambda(\theta)=q_{\lambda_a}(\eta)\,q_{\lambda_b}(v)=p_a(\psi_a;\pi_a)\,p_b(\psi_b;\pi_b)\left(\prod_{i=1}^{m}t'_{\gamma_{a,i}}(\eta_i)\right)\left(\prod_{t=1}^{T}t'_{\gamma_{b,t}}(\tilde v_t)\,\frac{d\tilde v_t}{dv_t}\right)$$

with $\psi_a=(\psi_{a,1},\ldots,\psi_{a,m})^\top$, $\psi_{a,i}=t_{\gamma_{a,i}}(\eta_i)$, $\psi_b=(\psi_{b,1},\ldots,\psi_{b,T})^\top$ and $\psi_{b,t}=t_{\gamma_{b,t}}(\tilde v_t)$, where $\pi_a$ and $\pi_b$ are the parameters of the densities $p_a$ and $p_b$. It then follows that

$$\log q_{\lambda_a}(\eta)=\log p_a(\psi_a;\pi_a)+\sum_{i=1}^{m}\log t'_{\gamma_{a,i}}(\eta_i).$$

For $\eta$ we use a Gaussian copula, so that $p_a(\psi_a;\pi_a)=\phi_m(\psi_a;\mu,BB^\top+D^2)$ with $D=\mathrm{diag}(d)$ and $\pi_a=(\mu^\top,\mathrm{vech}(B)^\top,d^\top)^\top$. From here we can show that the elements of the gradient $\nabla_{\lambda_a}\log q_{\lambda_a}(\eta)$ are

$$\begin{aligned}
\nabla_{\mu}\log q_{\lambda_a}(\eta)&=(BB^\top+D^2)^{-1}(\psi_a-\mu)\\
\nabla_{b}\log q_{\lambda_a}(\eta)&=\mathrm{vech}\!\left(-(BB^\top+D^2)^{-1}B+(BB^\top+D^2)^{-1}(\psi_a-\mu)(\psi_a-\mu)^\top(BB^\top+D^2)^{-1}B\right)\\
\nabla_{d}\log q_{\lambda_a}(\eta)&=\mathrm{diag}\!\left(-(BB^\top+D^2)^{-1}D+(BB^\top+D^2)^{-1}(\psi_a-\mu)(\psi_a-\mu)^\top(BB^\top+D^2)^{-1}D\right)\\
\nabla_{\gamma_a}\log q_{\lambda_a}(\eta)&=-\frac{\partial t_{\gamma_a}(\eta)}{\partial\gamma_a}(BB^\top+D^2)^{-1}(\psi_a-\mu)+\left(\frac{\partial t'_{\gamma_{a,1}}(\eta_1)}{\partial\gamma_{a,1}}\frac{1}{t'_{\gamma_{a,1}}(\eta_1)},\ldots,\frac{\partial t'_{\gamma_{a,m}}(\eta_m)}{\partial\gamma_{a,m}}\frac{1}{t'_{\gamma_{a,m}}(\eta_m)}\right)^\top
\end{aligned}$$

with $\frac{\partial t_{\gamma_a}(\eta)}{\partial\gamma_a}=\mathrm{diag}\left(\frac{\partial t_{\gamma_{a,1}}(\eta_1)}{\partial\gamma_{a,1}},\ldots,\frac{\partial t_{\gamma_{a,m}}(\eta_m)}{\partial\gamma_{a,m}}\right)$.
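The expression for $\nabla_\mu\log q_{\lambda_a}(\eta)$ above is the usual multivariate Gaussian score evaluated under the factor scale matrix. It can be verified numerically with a sketch such as the following, where all values are illustrative:

```python
import numpy as np

def log_phi(psi, mu, B, d):
    """Gaussian log density with factor covariance Sigma = B B^T + diag(d)^2."""
    Sigma = B @ B.T + np.diag(d ** 2)
    r = psi - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))

def grad_mu(psi, mu, B, d):
    """Analytic gradient (B B^T + D^2)^{-1} (psi_a - mu) given above."""
    Sigma = B @ B.T + np.diag(d ** 2)
    return np.linalg.solve(Sigma, psi - mu)
```

A central finite difference of `log_phi` in each coordinate of `mu` should match `grad_mu` to high precision, which is a useful sanity check before running SGA.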

For $v$ we assume an independent Gaussian approximation for $\psi_b$, where $\psi_{b,t}\sim N(\zeta_t,\omega_t^2)$ with $\omega_t=\exp(c_t)$. The implied approximation for $v$ is

$$\log q_{\lambda_b}(v)=\sum_{t=1}^{T}\left(\frac{1}{2}\tilde v_t^2-c_t-\frac{(\psi_{b,t}-\zeta_t)^2}{2\exp(2c_t)}-\log(b_t-a_t)+\log t'_{\gamma_{b,t}}(\tilde v_t)\right).$$

The gradient needed for the optimization is $\nabla_{\lambda_b}\log q_{\lambda_b}(v)$, with elements

$$\begin{aligned}
\nabla_{\zeta}\log q_{\lambda_b}(v)&=\left(\frac{\psi_{b,1}-\zeta_1}{\omega_1^2},\ldots,\frac{\psi_{b,T}-\zeta_T}{\omega_T^2}\right)^\top\\
\nabla_{c}\log q_{\lambda_b}(v)&=\left(\frac{(\psi_{b,1}-\zeta_1)^2}{\omega_1^2}-1,\ldots,\frac{(\psi_{b,T}-\zeta_T)^2}{\omega_T^2}-1\right)^\top\\
\nabla_{\gamma_b}\log q_{\lambda_b}(v)&=-\frac{\partial t_{\gamma_b}(\tilde v)}{\partial\gamma_b}\left(\frac{\psi_{b,1}-\zeta_1}{\omega_1^2},\ldots,\frac{\psi_{b,T}-\zeta_T}{\omega_T^2}\right)^\top+\left(\frac{\partial t'_{\gamma_{b,1}}(\tilde v_1)}{\partial\gamma_{b,1}}\frac{1}{t'_{\gamma_{b,1}}(\tilde v_1)},\ldots,\frac{\partial t'_{\gamma_{b,T}}(\tilde v_T)}{\partial\gamma_{b,T}}\frac{1}{t'_{\gamma_{b,T}}(\tilde v_T)}\right)^\top
\end{aligned}$$

with $\frac{\partial t_{\gamma_b}(\tilde v)}{\partial\gamma_b}=\mathrm{diag}\left(\frac{\partial t_{\gamma_{b,1}}(\tilde v_1)}{\partial\gamma_{b,1}},\ldots,\frac{\partial t_{\gamma_{b,T}}(\tilde v_T)}{\partial\gamma_{b,T}}\right)$.
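Each summand of $\log q_{\lambda_b}(v)$ depends on $(\zeta_t, c_t)$ only through the Gaussian term $-c_t-(\psi_{b,t}-\zeta_t)^2/(2\exp(2c_t))$, so the $\zeta$ and $c$ elements above can be checked by differentiating that term directly. A sketch with illustrative values:

```python
import numpy as np

def log_term(psi, zeta, c):
    """(zeta, c)-dependent part of one summand of log q_{lambda_b}(v)."""
    return -c - (psi - zeta) ** 2 / (2.0 * np.exp(2.0 * c))

def grad_zeta(psi, zeta, c):
    """zeta element of the gradient: (psi_t - zeta_t) / omega_t^2."""
    return (psi - zeta) / np.exp(2.0 * c)

def grad_c(psi, zeta, c):
    """c element of the gradient: (psi_t - zeta_t)^2 / omega_t^2 - 1."""
    return (psi - zeta) ** 2 / np.exp(2.0 * c) - 1.0
```

Central finite differences of `log_term` in `zeta` and `c` reproduce `grad_zeta` and `grad_c`, confirming the two closed-form expressions above element by element.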

## Appendix B

This appendix provides details on the implementation of variational inference using the skew-normal approximation proposed in Section 4. Notice that by multiplying (11) by the Jacobian of the transformation from $\theta$ to $\psi$, the approximating density is

$$q_\lambda(\theta)=2\,\phi_m(\psi;\mu_\psi,\Sigma_\psi)\,\Phi_1\!\left(\alpha_\psi^\top S_\psi^{-1/2}(\psi-\mu_\psi)\right)\prod_{i=1}^{m}t'_{\gamma_i}(\theta_i),
$$

where $\psi=t_\gamma(\theta)=(t_{\gamma_1}(\theta_1),\ldots,t_{\gamma_m}(\theta_m))^\top$ and $S_\psi$ is the diagonal matrix containing the diagonal elements of $\Sigma_\psi$. The complete vector of variational parameters for this approximation is $\lambda=(\mu_\psi^\top,\mathrm{vech}(B)^\top,d^\top,\alpha_\psi^\top,\gamma^\top)^\top$, where $\mathrm{vech}(B)$ is the vectorization of $B$ omitting the zero upper triangular elements. As discussed in Section 2, to implement SGA using the reparameterization trick, the gradient

$$\begin{aligned}
\nabla_\lambda\mathcal{L}(\lambda)&=E_{f_\varepsilon}\left[\nabla_\lambda\left(\log g(h(\varepsilon,\lambda))-\log q_\lambda(h(\varepsilon,\lambda))\right)\right]\\
&=E_{f_\varepsilon}\left[\left\{\frac{dh(\varepsilon,\lambda)}{d\lambda}\right\}^\top\left(\nabla_\theta\log g(h(\varepsilon,\lambda))-\nabla_\theta\log q_\lambda(h(\varepsilon,\lambda))\right)\right],\qquad(12)
\end{aligned}$$

needs to be approximated. This is undertaken by drawing an iterate of $\varepsilon$ from its distribution $f_\varepsilon$, and then computing the derivatives inside (12) analytically. Below, we write $h(\varepsilon,\lambda)$ as $\theta(\varepsilon,\lambda)$ for clarity. To derive the derivatives, note that the gradient $\nabla_\lambda\mathcal{L}(\lambda)$ can be broken up into sub-vectors,

where

$$\begin{aligned}
\nabla_{\mu_\psi}\mathcal{L}(\lambda)&=\left\{\frac{d\theta(\varepsilon,\lambda)}{d\mu_\psi}\right\}^\top\left(\nabla_\theta\log g(\theta)-\nabla_\theta\log q_\lambda(\theta)\right)\\
\nabla_{\alpha_\psi}\mathcal{L}(\lambda)&=\left\{\frac{d\theta(\varepsilon,\lambda)}{d\alpha_\psi}\right\}^\top\left(\nabla_\theta\log g(\theta)-\nabla_\theta\log q_\lambda(\theta)\right)
\end{aligned}$$
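The structure of (12) can be illustrated in one dimension with a Gaussian approximation, taking $h(\varepsilon,\lambda)=\mu+\sigma\varepsilon$ and $\lambda=(\mu,\log\sigma)$; all names and the toy target below are illustrative. Note that (12) uses only the path derivative $\nabla_\theta(\log g-\log q_\lambda)$, which is identically zero whenever $q_\lambda$ matches $g$ exactly.

```python
import numpy as np

def reparam_grad(mu, log_sigma, grad_log_g, rng, n_draws=100):
    """Monte Carlo estimate of the gradient (12) for q = N(mu, sigma^2),
    using the reparameterization theta = h(eps, lambda) = mu + sigma * eps."""
    sigma = np.exp(log_sigma)
    g_mu = g_ls = 0.0
    for _ in range(n_draws):
        eps = rng.standard_normal()
        theta = mu + sigma * eps
        # path derivative: grad_theta(log g(theta) - log q_lambda(theta))
        score = grad_log_g(theta) + (theta - mu) / sigma ** 2
        g_mu += score                  # d theta / d mu = 1
        g_ls += score * sigma * eps    # d theta / d log_sigma = sigma * eps
    return g_mu / n_draws, g_ls / n_draws
```

For the toy target $\log g(\theta)=-\tfrac{1}{2}(\theta-2)^2$, the estimator returns exactly zero at the optimum $(\mu,\log\sigma)=(2,0)$, reflecting the vanishing integrand noted above; away from the optimum it recovers the true gradient up to Monte Carlo noise.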