 # Bayesian Inference for Regression Copulas

We propose a new semi-parametric distributional regression smoother for continuous data, which is based on a copula decomposition of the joint distribution of the vector of response values. The copula is high-dimensional and constructed by inversion of a pseudo regression, where the conditional mean and variance are non-parametric functions of the covariates modeled using Bayesian splines. By integrating out the spline coefficients, we derive an implicit copula that captures dependence as a smooth non-parametric function of the covariates, which we call a regression copula. We derive some of its properties, and show that the entire distribution of the response from the copula model, including its mean and variance, is also a smooth non-parametric function of the covariates. Even though the implicit copula cannot be expressed in closed form, we estimate it efficiently using both Hamiltonian Monte Carlo and variational Bayes methods. Using four real data examples, we illustrate the efficacy of these estimators, and show the properties and advantages of the copula model, for implicit copulas of dimension up to 40,981. The approach produces predictive densities of the response that are locally adaptive with respect to the covariates, and are more accurate than those from benchmark methods in every case.


## 1 Introduction

Non- or semi-parametric regression methods typically estimate only the mean of a response variable as an unknown smooth function of a set of covariates. Yet in many applications, other features of the response distribution, such as higher order moments and tail probabilities, also vary with the covariates. For example, Yau and Kohn (2003), Lázaro-Gredilla and Titsias (2011) and others consider heteroscedastic regressions, where both the mean and variance of the response are unknown smooth functions of the covariates. More recently, Rigby and Stasinopoulos (2005), Klein et al. (2015), Wood et al. (2016) and others make all the parameters of a parametric response distribution unknown smooth functions of the covariates. However, these approaches all assume a specific parametric distribution for the response, conditional on the functions. In this paper, we propose a novel class of semi-parametric distributional regression models that avoids such an assumption. It uses a copula decomposition of the joint distribution of a vector of values from a single response variable. To do so, we employ a new flexible copula with a dependence structure that is an unknown smooth function of the covariate values, and model the marginal distribution of the response variable non-parametrically. The distributional regression is therefore non-parametric in two ways: in a distributional sense with respect to the margin of the response, and in a functional sense with respect to the covariates via the copula. It allows the entire distribution of the response, including higher order moments and quantiles, to be a smooth unknown function of the covariates.

Copula models (McNeil et al., 2005; Nelsen, 2006) are common in empirical analysis because the marginal distributions can be modeled arbitrarily and separately from the dependence structure, which is captured using a copula function. In this paper, the copula has dimension equal to the length of the vector of response values, so that its dimension can be very high – over 40,000 in our empirical work. Few existing copulas can be used in such a circumstance, although copulas constructed by the inversion of a parametric distribution (Nelsen, 2006, Section 3.1) can. Such copulas are called either ‘inversion’ or ‘implicit’ copulas, and those constructed by the inversion of Gaussian (Song, 2000) and t (Demarta and McNeil, 2005) distributions are popular. More flexible implicit copulas can be constructed by inverting the distribution of values of one or more response variables from parametric statistical models. We label these response variables ‘pseudo-responses’ because they are not observed directly. Examples include implicit copulas constructed from factor models (Murray et al., 2013; Oh and Patton, 2017), vector autoregressions (Biller, 2009; Smith and Vahey, 2016), nonlinear state space models (Smith and Maneesoonthorn, 2018), Gaussian processes (Wauthier and Jordan, 2010; Wilson and Ghahramani, 2010) and regularized regression (Klein and Smith, 2018). These implicit copulas reproduce the dependence structure of the pseudo-response variable, but combining them with arbitrary margins in a copula model produces a more flexible model that allows for a much wider range of data distributions.

In this paper we show how to construct an implicit copula from a heteroscedastic semi-parametric regression. Both the mean and variance of the pseudo-response are unknown smooth functions of covariates, each modeled using additive Bayesian P-splines. Because implicit copulas do not retain any information about the marginal (i.e., unconditional on the covariates) location and scale of the pseudo-response, we normalize the pseudo-response to have zero mean and unit variance marginally. By integrating out the basis coefficients of the splines, we derive a copula that is a smooth function of the covariate values and P-spline smoothing parameters only. We call this a ‘regression copula’, because when used in a copula model for the vector of response values, it captures the effect of the covariates. The P-spline smoothing parameters become the copula parameters, and these require estimation.

There are two main challenges when estimating the copula parameters using likelihood-based methods: (i) the copula function and density are unavailable in closed form, and (ii) the copula has dimension equal to the sample size, which can be very high. We outline two approaches to overcome these challenges. The first is a Markov chain Monte Carlo (MCMC) sampler with a Hamiltonian Monte Carlo step (Duane et al., 1987; Neal, 2011; Betancourt, 2017), which evaluates the posterior distribution exactly. Convergence is ensured by combining the leapfrog integrator with the dual averaging approach of Hoffman and Gelman (2014). The second is a variational Bayes (VB) estimator (Jordan et al., 1999; Ormerod and Wand, 2010) to compute approximate posterior inference quickly when the sample size, and hence the copula dimension, is high. The VB estimator is based on a Gaussian approximation (Opper and Archambeau, 2009) with a sparse factor representation of the covariance matrix (Miller et al., 2016). Following Ong et al. (2018), we maximize the variational lower bound using a stochastic gradient ascent algorithm (Honkela et al., 2010; Salimans and Knowles, 2013) with gradient estimates computed efficiently via the ‘re-parameterization trick’ (Kingma and Welling, 2014).

We derive key properties of the regression copula, including popular dependence metrics, and show that the independence copula is a limiting case. Given a dataset, the entire (Bayesian posterior) predictive distribution of the observed response can be computed from the copula model. This distribution is a smooth function of the covariates, and its first and second moments are estimates of the regression and variance functions for the observed response variable. Interestingly, the inclusion of the heteroscedastic term for the pseudo-response creates a regression copula with a much more flexible dependence structure. This results in predictive density and regression mean and variance function estimates that are highly ‘locally adaptive’ (Brockmann et al., 1993) with respect to the covariates; something that is difficult to achieve in alternative approaches to distributional regression.

We demonstrate the features of the proposed regression copula, and the advantage of the copula model approach, using four real datasets examined previously in the literature. Each dataset has a non-Gaussian response, the margin of which we estimate non-parametrically. There are between 3,082 and 40,981 observations, so that the regression copulas are very high-dimensional. The estimated regression and variance functions are strongly and nonlinearly related to the covariates. Their estimates using the exact and approximate posteriors prove very similar, yet the latter are faster to evaluate using VB. The inclusion of a heteroscedastic term for the pseudo-response allows for a much richer dependence structure in the regression copula, compared to that of the implicit copula of a homoscedastic regression (which is a Gaussian copula). Using the predictive distributions, we show that in all examples the proposed copula model is more accurate than a P-spline regression with Gaussian disturbances, a heteroscedastic P-spline regression, and the most likely transformation regression estimator of Hothorn et al. (2017).

Finally, we note here that copulas have been used extensively in multivariate flexible regression frameworks; for examples, see Pitt et al. (2006), Song et al. (2009), Fan et al. (2016) and references therein. However, our approach is very different in two ways. First, previous approaches use a low-dimensional copula to capture the dependence between multiple response variables, whereas we use the copula to capture the dependence between different observations on a single response variable. Second, most previous methods employ Gaussian or vine copulas with densities that can be expressed in closed form. In contrast, while our copula does not have a closed-form density, it is much more flexible; something we illustrate in our empirical work.

The paper is structured as follows. Section 2 shows how to construct a distributional regression model using a regression copula and an arbitrary margin. Our new implicit regression copula is outlined in Section 2.2, along with some of its properties. Section 3 outlines exact and approximate Bayesian posterior estimators, including distribution and functional prediction. Section 4 discusses the real data examples and comparison with benchmark regression alternatives, Section 5 outlines how to extend the approach to multiple covariates, and Section 6 concludes.

## 2 Distributional Regression using Copulas

In this section we first introduce the copula model used for distributional regression. Then we outline our proposed implicit copula, along with some of its key properties.

### 2.1 Copula model

Consider $N$ realizations $\bm{y}^{(N)} = (y_1,\ldots,y_N)'$ of a continuous-valued response, with corresponding covariate values $\tilde{\bm{x}}^{(N)} = \{\tilde{\bm{x}}_1,\ldots,\tilde{\bm{x}}_N\}$. Following Sklar (1959), the joint density of $\bm{y}^{(N)}$ can always be written as

$$p(\bm{y}^{(N)}\mid \tilde{\bm{x}}^{(N)}) = c^{\dagger}\!\left(F(y_1\mid \tilde{\bm{x}}_1), \ldots, F(y_N\mid \tilde{\bm{x}}_N)\mid \tilde{\bm{x}}^{(N)}\right)\prod_{i=1}^{N} p(y_i\mid \tilde{\bm{x}}_i), \quad \text{for } N \geq 2.$$

Here, $c^{\dagger}$ is an $N$-dimensional copula density and $F(y_i\mid \tilde{\bm{x}}_i)$ is the distribution function of $y_i\mid \tilde{\bm{x}}_i$; both are unknown. In this paper we model this joint distribution, also conditional on copula parameters $\bm{\theta}$, as

$$p(\bm{y}^{(N)}\mid \tilde{\bm{x}}^{(N)}, \bm{\theta}) = c_H\!\left(F_Y(y_1), \ldots, F_Y(y_N)\mid \tilde{\bm{x}}^{(N)}, \bm{\theta}\right)\prod_{i=1}^{N} p_Y(y_i). \qquad (1)$$

The distribution of each $y_i$ is assumed to be invariant with respect to $\tilde{\bm{x}}_i$, with density $p_Y$ and distribution function $F_Y$. However, the impact of the covariate values on $\bm{y}^{(N)}$ is captured through the copula with density $c_H(\bm{u}^{(N)}\mid \tilde{\bm{x}}^{(N)}, \bm{\theta})$, where $u_i = F_Y(y_i)$ and $\bm{u}^{(N)} = (u_1,\ldots,u_N)'$. We call this a ‘regression copula’ because it is a function of $\tilde{\bm{x}}^{(N)}$. The copula parameters $\bm{\theta}$ do not vary with the dimension $N$. For $c_H$ we use the implicit copula proposed in the section below, and a major aim of this paper is to show that by doing so Eq. (1) provides a very flexible, but tractable, approach to distributional regression.

Before specifying $c_H$, we stress that even though the margin $F_Y$ is assumed invariant with respect to the covariates, the response is still affected by them in this regression copula model as follows. Consider a sample of size $N = n$ with responses $\bm{y} = (y_1,\ldots,y_n)'$, covariate values $\tilde{\bm{x}}^{(n)}$ and $\bm{u} = (F_Y(y_1),\ldots,F_Y(y_n))'$. Then a new response $y_{n+1}$ with corresponding covariate values $\tilde{\bm{x}}_{n+1}$ has predictive density

$$p(y_{n+1}\mid \tilde{\bm{x}}^{(n+1)}, \bm{\theta}) = \int p(\bm{y}, y_{n+1}\mid \tilde{\bm{x}}^{(n+1)}, \bm{\theta})\, d\bm{y} = \int c_H(\bm{u}^{(n+1)}\mid \tilde{\bm{x}}^{(n+1)}, \bm{\theta})\, d\bm{u}\; p_Y(y_{n+1}). \qquad (2)$$

This density is a function of all the covariate values $\tilde{\bm{x}}^{(n+1)}$, which includes $\tilde{\bm{x}}_{n+1}$. Moreover, integrating over the posterior of $\bm{\theta}$ gives the posterior predictive density of $y_{n+1}$ from the regression model as

$$p(y_{n+1}\mid \tilde{\bm{x}}^{(n+1)}, \bm{y}) = \int p(y_{n+1}\mid \tilde{\bm{x}}^{(n+1)}, \bm{\theta})\, p(\bm{\theta}\mid \bm{y})\, d\bm{\theta}. \qquad (3)$$

Eq. (3) forms the basis for our distributional regression predictions as a function of the covariates, and its first two moments are estimates of the regression mean and variance functions. In Section 3 we show how to compute Eq. (2) and Eq. (3) efficiently for our proposed implicit copula with density $c_H$.

### 2.2 Implicit Regression Copula

The key to the success of our approach is the specification of a regression copula with density $c_H$. We derive this as an implicit copula from a semi-parametric heteroscedastic regression model for a pseudo-response. To do so, we first outline the regression model, and then construct its copula with only the basis coefficients of the mean function integrated out; this copula is Gaussian. Next, we derive the implicit copula with the basis coefficients of the variance function also integrated out, representing it as an integral of the Gaussian copula. Last, we show that this representation is computationally efficient, and derive some key properties of the copula.

#### 2.2.1 Pseudo-Response Regression Model

Consider a regression model for a pseudo-response $\tilde{Z}_i$, with covariates $x_i$ and $w_i$, given by

$$\tilde{Z}_i = \tilde{m}(x_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2\sigma_i^2), \quad \sigma_i^2 = \exp(g(w_i)). \qquad (4)$$

This is a heteroscedastic semi-parametric regression model, where the first and second moments are smooth unknown functions $\tilde{m}$ and $g$ of the two covariates. To simplify the exposition, we only consider scalar covariates $x_i$ and $w_i$ here, although the approach is extended to multiple covariates in Section 5. We follow the P-spline literature (Eilers and Marx, 1996) and model $\tilde{m}$ and $g$ as linear combinations of B-spline basis functions, such that $\tilde{m}(x_i) = \bm{b}(x_i)'\bm{\beta}$ and $g(w_i) = \bm{v}(w_i)'\bm{\alpha}$. With these approximations, the regression model is usually called semi-parametric.
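To make the basis expansion concrete, the sketch below builds a B-spline design matrix by the Cox-de Boor recursion. This is a minimal illustration: the number of basis functions, the cubic degree and the equally spaced clamped knots are assumptions for exposition, not the settings used in the paper.

```python
import numpy as np

def bspline_basis(x, n_basis=10, degree=3):
    """Evaluate a B-spline design matrix via the Cox-de Boor recursion.

    Returns a (len(x), n_basis) matrix whose i-th row is b(x_i)'.
    """
    x = np.asarray(x, dtype=float)
    n_knots = n_basis + degree + 1
    inner = np.linspace(x.min(), x.max(), n_knots - 2 * degree)
    # Repeat the boundary knots so the basis is clamped on [min(x), max(x)].
    knots = np.r_[[inner[0]] * degree, inner, [inner[-1]] * degree]
    # Degree-0 basis: indicator functions of the knot intervals.
    B = ((x[:, None] >= knots[None, :-1]) &
         (x[:, None] < knots[None, 1:])).astype(float)
    # Assign x == max(x) to the last non-degenerate interval.
    B[x == knots[-1], np.searchsorted(knots, knots[-1], side='left') - 1] = 1.0
    for d in range(1, degree + 1):
        B_new = np.zeros((len(x), len(knots) - d - 1))
        for j in range(B_new.shape[1]):
            denom1 = knots[j + d] - knots[j]
            denom2 = knots[j + d + 1] - knots[j + 1]
            left = (x - knots[j]) / denom1 * B[:, j] if denom1 > 0 else 0.0
            right = (knots[j + d + 1] - x) / denom2 * B[:, j + 1] if denom2 > 0 else 0.0
            B_new[:, j] = left + right
        B = B_new
    return B
```

The rows of the returned matrix form a partition of unity, which is the usual sanity check for a clamped B-spline basis.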

For pseudo-response values $\tilde{\bm{Z}} = (\tilde{Z}_1,\ldots,\tilde{Z}_n)'$, the regression can be written as the linear model

$$\tilde{\bm{Z}} = B\bm{\beta} + \bm{\varepsilon}, \quad \bm{\varepsilon} = (\varepsilon_1,\ldots,\varepsilon_n)' \sim N(0, \sigma^2\Sigma), \qquad (5)$$
$$\Sigma = \mathrm{diag}(\sigma_1^2,\ldots,\sigma_n^2), \quad \sigma_i^2 = \exp(\bm{v}_i'\bm{\alpha}), \ \text{for } i=1,\ldots,n,$$

where $\bm{\beta}$ and $\bm{\alpha}$ are the basis coefficient vectors, and the design matrices $B$ and $V$ have $i$th rows $\bm{b}_i' = \bm{b}(x_i)'$ and $\bm{v}_i' = \bm{v}(w_i)'$, respectively. To produce smooth and efficient function estimates it is usual to regularize the basis coefficient vectors $\bm{\beta}$ and $\bm{\alpha}$. In a Bayesian context, this corresponds to adopting the conditionally Gaussian priors

$$\bm{\beta}\mid \bm{\theta}_\beta, \bm{\alpha}, \sigma^2 \sim N\!\left(0, \sigma^2 P(\bm{\theta}_\beta)^{-1}\right), \qquad (6)$$
$$\bm{\alpha}\mid \bm{\theta}_\alpha \sim N\!\left(0, P(\bm{\theta}_\alpha)^{-1}\right),$$

with smoothing (or ‘hyper’) parameters $\bm{\theta}_\beta$ and $\bm{\theta}_\alpha$. In the P-spline literature the precision matrix $P(\cdot)$ is usually symmetric and banded (Lang and Brezger, 2004). We use the precision matrix of a stationary AR(2) model, parameterized in terms of its disturbance variance and two partial autocorrelations; for example, see Barndorff-Nielsen and Schou (1973). Thus, $P(\cdot)$ is of full rank, and $\bm{\theta}_\beta$, $\bm{\theta}_\alpha$ are the smoothing parameters for the two functions.

Note here that while it is popular to use a regularization prior constructed from random walk priors (Fahrmeir et al., 2013, pp. 433-448), the resulting precision matrix is of reduced rank. In this circumstance, the distribution of the pseudo-response with the basis coefficients integrated out is improper, and it does not have a proper copula density, so we do not employ such a prior. Moreover, in our empirical work we found that a stationary AR(2) prior provides more accurate function estimates than an AR(1) prior.
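As an illustration of this prior, the sketch below constructs the precision matrix of a finite segment of a stationary AR(2) process from its disturbance variance and two partial autocorrelations, by building the Toeplitz covariance from the Yule-Walker recursions and inverting it. The parameter values, the helper name and the direct matrix inversion are illustrative only; a production implementation would exploit the bandedness of the result.

```python
import numpy as np

def ar2_precision(p1, p2, sigma2_e, k):
    """Precision matrix of k consecutive values from a stationary AR(2)
    process, parameterized by its two partial autocorrelations (p1, p2)
    and disturbance variance sigma2_e (cf. Barndorff-Nielsen and Schou, 1973)."""
    # Durbin-Levinson map from partial autocorrelations to AR coefficients.
    phi1, phi2 = p1 * (1.0 - p2), p2
    # Autocorrelations from the Yule-Walker recursions.
    rho = np.empty(k)
    rho[0] = 1.0
    rho[1] = phi1 / (1.0 - phi2)
    for j in range(2, k):
        rho[j] = phi1 * rho[j - 1] + phi2 * rho[j - 2]
    # Stationary marginal variance: gamma0 = sigma2_e / prod_j (1 - pi_j^2).
    gamma0 = sigma2_e / ((1.0 - p1 ** 2) * (1.0 - p2 ** 2))
    Gamma = gamma0 * rho[np.abs(np.arange(k)[:, None] - np.arange(k)[None, :])]
    return np.linalg.inv(Gamma)
```

Because a stationary Gaussian AR(2) is Markov of order two, the resulting precision matrix is full rank and banded with bandwidth two, which is the property exploited in the text.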

#### 2.2.2 Copula Construction

In this paper we extract two copulas from the regression model defined at Eqs. (4)–(6). They are called ‘implicit’ (McNeil et al., 2005, p.190) or ‘inversion’ (Nelsen, 2006, p.51) copulas because they are constructed by inverting Sklar’s theorem. The copulas are $n$-dimensional with dependence structures that are (smooth) functions of the covariate values $\bm{x} = (x_1,\ldots,x_n)'$ and $\bm{w} = (w_1,\ldots,w_n)'$, so that we also call them ‘regression copulas’ in this paper.

The first copula derived is the implicit copula of the distribution of $\tilde{\bm{Z}}\mid \bm{x}, \bm{w}, \sigma^2, \bm{\alpha}, \bm{\theta}_\beta, \bm{\theta}_\alpha$, which we label $C_1$. To construct $C_1$, note that the prior for $\bm{\beta}$ at Eq. (6) is conjugate, and $\bm{\beta}$ can be integrated out of the distribution for $\tilde{\bm{Z}}$ analytically, giving

$$\tilde{\bm{Z}}\mid \bm{x}, \bm{w}, \sigma^2, \bm{\alpha}, \bm{\theta}_\beta, \bm{\theta}_\alpha \sim N\!\left(0, \sigma^2\left[\Sigma^{-1} - \Sigma^{-1}B\Omega B'\Sigma^{-1}\right]^{-1}\right), \qquad (7)$$

where $\Omega = \left(B'\Sigma^{-1}B + P(\bm{\theta}_\beta)\right)^{-1}$. The variance can be further simplified by applying the Woodbury formula to give

$$\left[\Sigma^{-1} - \Sigma^{-1}B\Omega B'\Sigma^{-1}\right]^{-1} = \Sigma + BP(\bm{\theta}_\beta)^{-1}B'.$$
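This simplification is easy to verify numerically. The sketch below checks the Woodbury identity on a randomly generated diagonal Sigma, design matrix B and full-rank prior precision P; all dimensions and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
Sigma = np.diag(rng.uniform(0.5, 2.0, n))      # diagonal, positive variance matrix
B = rng.standard_normal((n, p))                # basis design matrix
A = rng.standard_normal((p, p))
P = A @ A.T + np.eye(p)                        # full-rank prior precision

Sig_inv = np.linalg.inv(Sigma)
Omega = np.linalg.inv(B.T @ Sig_inv @ B + P)   # Omega = (B' Sigma^-1 B + P)^-1
lhs = np.linalg.inv(Sig_inv - Sig_inv @ B @ Omega @ B.T @ Sig_inv)
rhs = Sigma + B @ np.linalg.inv(P) @ B.T       # Woodbury simplification
```

Note that the right-hand side requires no inversion of an n-by-n matrix, which is what makes the simplification useful when n is large.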

It is straightforward to show that the copula of a normal distribution is the widely employed Gaussian copula (Song, 2000). It is obtained by standardizing the marginal means to zero and the variances to one. The marginal variance of $\tilde{Z}_i$ at Eq. (7) is $\sigma^2 s_i^2$, with $s_i^2 = \left(\Sigma + BP(\bm{\theta}_\beta)^{-1}B'\right)_{ii}$, so that we normalize by the diagonal matrix $S = \mathrm{diag}(s_1,\ldots,s_n)^{-1}$ to get $\bm{Z} = \frac{1}{\sigma}S\tilde{\bm{Z}}$. With this, the regression at Eq. (4) can be written as

$$Z_i = m(x_i, w_i) + \frac{\varepsilon_i}{s_i\sigma}, \quad \varepsilon_i \sim N(0, \sigma^2\sigma_i^2), \qquad (8)$$

where $m(x_i, w_i) = \tilde{m}(x_i)/(s_i\sigma)$, and both $m$ and $s_i$ are bivariate functions of $x_i$ and $w_i$.

Denoting the normalized vector as $\bm{Z} = (Z_1,\ldots,Z_n)'$ for conciseness, its distribution with $\bm{\beta}$ integrated out is

$$\bm{Z}\mid \bm{x}, \bm{w}, \sigma^2, \bm{\alpha}, \bm{\theta}_\beta, \bm{\theta}_\alpha \sim N(0, R), \quad \text{with } R \equiv R(\bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta) = S\left(\Sigma + BP(\bm{\theta}_\beta)^{-1}B'\right)S, \qquad (9)$$

and $N(0,1)$ margins for all elements $Z_i$. It is straightforward to show (Song, 2000) that the random vectors $\bm{Z}$ and $\tilde{\bm{Z}}$ (conditional on $\bm{\alpha}$) have the same Gaussian copula function

$$C_1(\bm{u}\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta) = \Phi_n\!\left(\Phi_1^{-1}(u_1), \ldots, \Phi_1^{-1}(u_n); 0, R\right),$$

where $\bm{u} = (u_1,\ldots,u_n)'$, and $\Phi_1$ and $\Phi_n(\cdot; 0, R)$ are the distribution functions of the $N(0,1)$ and $N(0,R)$ distributions, respectively. This is a regression copula because $R$ is a smooth function of $\bm{x}$ and $\bm{w}$.

We make a number of observations on $C_1$. First, the overall scale parameter $\sigma^2$ does not feature in the expression for $R$, and is unidentified in the copula, so that we set $\sigma^2 = 1$ throughout the paper. Second, if the density of the distribution at Eq. (9) is denoted as $p_Z(\bm{z}\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta)$, with marginal densities $p_{Z_i}$ for $i=1,\ldots,n$, then the copula density is

$$c_1(\bm{u}\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta) = \frac{p_Z(\bm{z}\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta)}{\prod_{i=1}^n p_{Z_i}(z_i\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta)} = \frac{\phi_n(\bm{z}; 0, R)}{\prod_{i=1}^n \phi_1(z_i)}, \qquad (10)$$

where $z_i = \Phi_1^{-1}(u_i)$, $\bm{z} = (z_1,\ldots,z_n)'$, and $\phi_1$ and $\phi_n(\cdot; 0, R)$ are the densities of the $N(0,1)$ and $N(0,R)$ distributions, respectively. Third, if a non-conjugate prior is used for $\bm{\beta}$, then $C_1$ is not a Gaussian copula (something we do not consider in this paper). Last, because $R$ is a function of $\bm{\alpha}$, so is the dependence structure of $C_1$. In the homoscedastic case where all $\sigma_i^2$ are equal, $C_1$ corresponds to the copula of a homoscedastic regression, similar to that discussed by Klein and Smith (2018).
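For concreteness, the Gaussian copula density at Eq. (10) can be evaluated on the log scale as follows; this is a minimal sketch and the function name is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def log_c1(u, R):
    """Log Gaussian copula density of Eq. (10):
    log phi_n(z; 0, R) - sum_i log phi_1(z_i), with z_i = Phi_1^{-1}(u_i)."""
    z = norm.ppf(u)
    return (multivariate_normal(mean=np.zeros(len(u)), cov=R).logpdf(z)
            - norm.logpdf(z).sum())
```

When R is the identity matrix the log density is zero for every u, recovering the independence copula, which provides a simple check.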

The second regression copula derived is the implicit copula of the pseudo-response distribution with both $\bm{\beta}$ and $\bm{\alpha}$ integrated out. We label this copula $C_H$ (for ‘heteroscedastic regression copula’), and it is this copula with density $c_H$ that is used to model the observed data at Eq. (1).

###### Theorem 1 (Definition of $C_H$ and $c_H$).

If $\tilde{\bm{Z}}$ follows the model for the pseudo-response at Eqs. (4)–(6), $\bm{Z}$ is the normalized response at Eq. (8), $\tilde{\bm{x}} = \{\bm{x}, \bm{w}\}$ are the covariate values, and $\bm{\theta} = \{\bm{\theta}_\beta, \bm{\theta}_\alpha\}$, then the implicit copula of the distribution of $\bm{Z}\mid \tilde{\bm{x}}, \bm{\theta}$ has density

$$c_H(\bm{u}\mid \tilde{\bm{x}}, \bm{\theta}) = \int c_1(\bm{u}\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta)\, p(\bm{\alpha}\mid \bm{\theta}_\alpha)\, d\bm{\alpha} = \frac{\int \phi_n(\bm{z}; 0, R)\, p(\bm{\alpha}\mid \bm{\theta}_\alpha)\, d\bm{\alpha}}{\prod_{i=1}^n \phi_1(z_i)},$$

and copula function

$$C_H(\bm{u}\mid \tilde{\bm{x}}, \bm{\theta}) = \int C_1(\bm{u}\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta)\, p(\bm{\alpha}\mid \bm{\theta}_\alpha)\, d\bm{\alpha} = \int \Phi_n(\bm{z}; 0, R)\, p(\bm{\alpha}\mid \bm{\theta}_\alpha)\, d\bm{\alpha},$$

where $\bm{z} = (z_1,\ldots,z_n)'$, the marginal distribution function of each $Z_i$ is $\Phi_1$, so that $z_i = \Phi_1^{-1}(u_i)$.
Proof: See Appendix A.

We make three observations on $C_H$ defined in Theorem 1. First, integration over $\bm{\alpha}$ is required to compute $c_H$ and $C_H$. In Section 3 we show how to do this integration exactly using Hamiltonian Monte Carlo (HMC), and approximately using variational Bayes (VB) methods, when computing posterior inference. Second, the smoothing parameters $\bm{\theta}$ of the splines of the regression model for the pseudo-response are the dependence parameters of the regression copula $C_H$. Last, it is much simpler to construct the implicit copula of $\bm{Z}$, rather than $\tilde{\bm{Z}}$, here. This is because constructing the latter copula would involve evaluating (and inverting) the marginal distribution functions

$$\tilde{F}_i(\tilde{z}_i\mid \tilde{\bm{x}}, \bm{\theta}) = \int \Phi_1(\tilde{z}_i/s_i)\, p(\bm{\alpha}\mid \bm{\theta}_\alpha)\, d\bm{\alpha}, \quad i=1,\ldots,n.$$

Each of these involves computing a $\dim(\bm{\alpha})$-dimensional integral using numerical methods. In contrast, the margin of each element of $\bm{Z}$ is simply standard normal, which greatly simplifies evaluation of $c_H$.
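The integral defining the copula density can be approximated by simple Monte Carlo: draws of the variance-function coefficients from their prior are mapped to correlation matrices via Eq. (9), and the Gaussian densities are averaged over the draws. In the sketch below all design matrices, prior covariances and dimensions are toy stand-ins for illustration only.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
n, pb, pa = 4, 3, 2
B = rng.standard_normal((n, pb))      # mean-function design matrix (toy)
V = rng.standard_normal((n, pa))      # variance-function design matrix (toy)
P_beta_inv = np.eye(pb)               # prior covariance P(theta_beta)^{-1} (toy)
P_alpha_inv = 0.25 * np.eye(pa)       # prior covariance P(theta_alpha)^{-1} (toy)

u = rng.uniform(0.05, 0.95, n)
z = norm.ppf(u)

def corr_R(alpha):
    """R = S (Sigma + B P^{-1} B') S of Eq. (9), with S = diag(s_1,...,s_n)^{-1}."""
    Sigma = np.diag(np.exp(V @ alpha))
    C = Sigma + B @ P_beta_inv @ B.T
    s = np.sqrt(np.diag(C))
    return C / np.outer(s, s)

# Average phi_n(z; 0, R(alpha)) over draws alpha ~ N(0, P(theta_alpha)^{-1}),
# then divide by the product of standard normal margins.
M = 2000
logs = []
for _ in range(M):
    alpha = rng.multivariate_normal(np.zeros(pa), P_alpha_inv)
    logs.append(multivariate_normal(np.zeros(n), corr_R(alpha)).logpdf(z))
c_H_hat = np.exp(logs).mean() / np.exp(norm.logpdf(z).sum())
```

In practice the paper integrates over the posterior, not the prior, of these coefficients; the prior is used here only to keep the sketch self-contained.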

### 2.3 Properties of $C_H$

Here, we state some properties of the regression copula $C_H$. First, the independence copula is a limiting case of this copula, as outlined in Theorem 2 below:

###### Theorem 2.

Let $\Pi$ be the independence copula function (Nelsen, 2006, p.11), and $\gamma_\beta$ be the marginal variance of the AR(2) prior for the elements of $\bm{\beta}$ at Eq. (6); then

$$\lim_{\gamma_\beta \to 0} C_H(\bm{u}\mid \tilde{\bm{x}}, \bm{\theta}) = \Pi(\bm{u}).$$

Proof: See Appendix A.

Note that for any given values of the prior partial correlations, $\gamma_\beta \to 0$ if and only if the prior disturbance variance tends to zero, so that $\gamma_\beta$ can be viewed as the copula parameter that determines the overall level of dependence in $C_H$.

Below we give expressions for some common dependence metrics of the bivariate sub-copula of $C_H$ in elements $(i,j)$. The derivations are given in Appendix A.

• For $0 < q < 1$, the lower and upper quantile dependences are

$$\lambda^L_{ij}(q\mid \tilde{\bm{x}}, \bm{\theta}) \equiv \Pr(U_i < q\mid U_j < q) = \int \lambda^L_{1,ij}(q\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta)\, p(\bm{\alpha}\mid \bm{\theta}_\alpha)\, d\bm{\alpha},$$
$$\lambda^U_{ij}(q\mid \tilde{\bm{x}}, \bm{\theta}) \equiv \Pr(U_i > q\mid U_j > q) = \int \lambda^U_{1,ij}(q\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\theta}_\beta)\, p(\bm{\alpha}\mid \bm{\theta}_\alpha)\, d\bm{\alpha},$$

where $\lambda^L_{1,ij}$ and $\lambda^U_{1,ij}$ are the lower and upper pairwise quantile dependences of a bivariate Gaussian copula with correlation parameter given by the $(i,j)$th element $r_{ij}$ of $R$ in Eq. (9).

• The lower and upper extremal tail dependences are

$$\lambda^L_{ij} = \lim_{q\downarrow 0}\lambda^L_{ij}(q\mid \tilde{\bm{x}}, \bm{\theta}) = 0, \quad \text{and} \quad \lambda^U_{ij} = \lim_{q\uparrow 1}\lambda^U_{ij}(q\mid \tilde{\bm{x}}, \bm{\theta}) = 0.$$
• Spearman’s rho is

$$\rho^S_{ij}(\tilde{\bm{x}}, \bm{\theta}) = \frac{6}{\pi}\int \arcsin(r_{ij}/2)\, p(\bm{\alpha}\mid \bm{\theta}_\alpha)\, d\bm{\alpha},$$

where $r_{ij}$ is as defined above and is a function of $\bm{\alpha}$.

• Kendall’s tau is

$$\tau^K_{ij}(\tilde{\bm{x}}, \bm{\theta}) = \frac{2}{\pi}\int \arcsin(r_{ij})\, p(\bm{\alpha}\mid \bm{\theta}_\alpha)\, d\bm{\alpha}.$$

The dependence metrics at (ii)-(iv) above are functions of the copula parameters $\bm{\theta}$, and also of all the covariate values $\tilde{\bm{x}}$, rather than just those corresponding to the $i$th and $j$th observations. (We return to this feature in Section 4, where we show it corresponds to allowing local adaptivity in the distributional estimates from the copula model.) The dependence metrics are computed with respect to the posterior of $\bm{\alpha}$ for the examples in Section 4.
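Given draws of the variance-function coefficients, and hence of the pairwise correlation $r_{ij}$, the Spearman and Kendall expressions above reduce to averaging the usual Gaussian copula formulas over the draws. A minimal sketch with illustrative function names:

```python
import numpy as np

def kendall_tau(r_draws):
    """Kendall's tau for a pair (i,j): average (2/pi) arcsin(r_ij)
    over draws of r_ij = R(alpha)_ij."""
    return np.mean(2.0 / np.pi * np.arcsin(r_draws))

def spearman_rho(r_draws):
    """Spearman's rho: average (6/pi) arcsin(r_ij / 2) over the same draws."""
    return np.mean(6.0 / np.pi * np.arcsin(r_draws / 2.0))
```

With degenerate draws all equal to a single correlation, these collapse to the familiar closed-form values for the bivariate Gaussian copula.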

## 3 Estimation

Estimation of the copula model at Eq. (1) requires estimation of both the marginal $F_Y$ and the copula parameters $\bm{\theta}$. In the copula literature, it is popular to use two-stage estimators, where $F_Y$ is estimated first, followed by $\bm{\theta}$, as they are much simpler to implement, and only involve a minor loss of efficiency (Joe, 2005). For $F_Y$ we use the adaptive kernel density estimator (labelled ‘KDE’) of Shimazaki and Shinomoto (2010) and a Dirichlet process mixture estimator (Neal, 2000) (labelled ‘DP’). For the latter, when estimating $\bm{\theta}$ using MCMC, uncertainty with respect to the estimate of $F_Y$ can also be integrated out by following Grazian and Liseo (2017) and using the draws of $F_Y$ at each sweep, instead of conditioning on its posterior point estimate. In our empirical work we show this has only a very minor effect on the copula model and distributional estimates.

### 3.1 Likelihood

Estimation of $\bm{\theta}$ based on Eq. (1) with $N = n$ observations is difficult because $c_H$ at Theorem 1 is expressed as an integral over $\bm{\alpha}$. Nevertheless, the likelihood can still be evaluated by expressing it conditional on the spline coefficients $\bm{\beta}$ and $\bm{\alpha}$, and then integrating them out using Bayesian methods, which is the approach we employ. The Jacobian of the transformation from $\bm{z}$ to $\bm{y}$ is $J_{Z\to Y} = \prod_{i=1}^n p_Y(y_i)/\phi_1(z_i)$, and through a change of variables and Eq. (8), the conditional likelihood is

$$p(\bm{y}\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\beta}, \bm{\theta}_\beta, \bm{\theta}_\alpha) = p(\bm{z}\mid \bm{x}, \bm{w}, \bm{\alpha}, \bm{\beta}, \bm{\theta}_\beta, \bm{\theta}_\alpha)\, J_{Z\to Y} = \phi_n(\bm{z}; SB\bm{\beta}, S\Sigma S)\prod_{i=1}^n \frac{p_Y(y_i)}{\phi_1(z_i)}, \qquad (11)$$

which can be evaluated in $O(n)$ operations because $S$ and $\Sigma$ are diagonal matrices. Below we show how to evaluate the posterior of $\bm{\theta}$ exactly by generating $\bm{\alpha}$ using a Hamiltonian Monte Carlo (HMC) step within an MCMC scheme. To allow for estimation for large values of $n$, we also develop a variational Bayes (VB) estimator for approximate inference requiring much less computational cost. Both approaches estimate the posterior of the parameters augmented with the spline coefficients, denoted as $\bm{\vartheta} = \{\bm{\beta}, \bm{\alpha}, \bm{\theta}_\beta, \bm{\theta}_\alpha\}$.
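The O(n) evaluation follows because, with diagonal S and Sigma, the multivariate normal in Eq. (11) factorizes into univariate terms. A minimal sketch (the function and argument names are illustrative; the margin's log density and cdf are supplied by the caller):

```python
import numpy as np
from scipy.stats import norm

def cond_loglik(y, B, beta, sigma2, s2, log_pY, F_Y):
    """O(n) evaluation of the conditional log-likelihood at Eq. (11).

    sigma2, s2 : length-n arrays holding sigma_i^2 and s_i^2, i.e. the
                 diagonals defining Sigma and S = diag(s_1,...,s_n)^{-1};
    log_pY, F_Y: log density and cdf of the invariant margin of y.
    """
    z = norm.ppf(F_Y(y))                   # z_i = Phi_1^{-1}(F_Y(y_i))
    s = np.sqrt(s2)
    mean = (B @ beta) / s                  # i-th element of S B beta
    sd = np.sqrt(sigma2) / s               # sqrt of i-th diagonal of S Sigma S
    return (norm.logpdf(z, loc=mean, scale=sd).sum()
            + log_pY(y).sum() - norm.logpdf(z).sum())
```

As a check, with a standard normal margin, zero coefficients and unit variances, the expression collapses to the usual Gaussian log-likelihood of the data.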

### 3.2 Exact Estimation

An MCMC sampler is used to compute the augmented posterior. Each scalar element of $\bm{\theta}$ (or a monotonic transformation of it) is generated using a normal approximation based on exact analytical derivatives of the logarithm of its conditional posterior. Through experimentation, we found that generating transformations of the partial correlations of the AR(2) priors at Eq. (6), as opposed to the autoregressive parameters, improves the convergence, stability and efficiency of the sampler. The coefficients $\bm{\beta}$ are generated from a multivariate normal. Details on the sampler and these steps are given in the Web Appendix.

The most challenging aspect of this sampler is generating from the conditional posterior of $\bm{\alpha}$. Gaussian or random walk proposals result in prohibitively poor mixing of the Markov chain, so that an HMC (Duane et al., 1987; Neal, 2011; Betancourt, 2017) step is employed instead. This augments $\bm{\alpha}$ with momentum variables, and draws from an extended target distribution that is proportional to the exponential of the Hamiltonian function. Hamiltonian dynamics specify how this function evolves, and their volume-conserving property results in high acceptance rates of the proposed iterates.

We use a variant of the leapfrog integrator of Neal (2011), which employs the logarithm of the target density

$$l_\alpha \equiv \log p\!\left(\bm{\alpha}\mid \bm{x}, \bm{z}, \{\bm{\vartheta}\setminus\bm{\alpha}\}\right) \propto -\frac{1}{2}\sum_{i=1}^n\left(\log(\sigma_i^2) - \log(s_i^2)\right) - \frac{1}{2}\left(\bm{z}'(S\Sigma S)^{-1}\bm{z} - 2\bm{\beta}'B'\Sigma^{-1}S^{-1}\bm{z} + \bm{\beta}'B'\Sigma^{-1}B\bm{\beta}\right) - \frac{1}{2}\bm{\alpha}'P(\bm{\theta}_\alpha)\bm{\alpha},$$

$$\nabla_\alpha l_\alpha = -P(\bm{\theta}_\alpha)\bm{\alpha} - \frac{1}{2}V'\!\left[\left(\frac{\partial s_1^2}{\partial \eta_{\alpha,1}}s_1^{-2},\ldots,\frac{\partial s_n^2}{\partial \eta_{\alpha,n}}s_n^{-2}\right)' + \left(\frac{\partial \sigma_1^2}{\partial \eta_{\alpha,1}}\sigma_1^{-2},\ldots,\frac{\partial \sigma_n^2}{\partial \eta_{\alpha,n}}\sigma_n^{-2}\right)'\right]$$
$$\quad + \frac{1}{2}V'\!\left[(B\bm{\beta})\circ(B\bm{\beta})\circ\left(\frac{1}{\sigma_1^2},\ldots,\frac{1}{\sigma_n^2}\right)' - \left(\frac{\partial \kappa_{1,1}^2}{\partial \eta_{\alpha,1}}z_1^2,\ldots,\frac{\partial \kappa_{1,n}^2}{\partial \eta_{\alpha,n}}z_n^2\right)'\right] + V'\!\left[\bm{z}\circ\left(\frac{\partial \kappa_{2,1}^2}{\partial \eta_{\alpha,1}},\ldots,\frac{\partial \kappa_{2,n}^2}{\partial \eta_{\alpha,n}}\right)'\circ(B\bm{\beta})\right],$$

where $\circ$ denotes the Hadamard product and $\bm{\eta}_\alpha = V\bm{\alpha}$; closed-form expressions for the terms $\kappa_{1,i}^2$, $\kappa_{2,i}^2$ and the derivatives above are given in the Web Appendix. The step size $\epsilon$ and the number of leapfrog steps $L$ in each sweep are set using the dual averaging approach of Hoffman and Gelman (2014) as follows. A trajectory length $\epsilon L$ is obtained by preliminary runs of the sampler for our real data applications with a small $\epsilon$ (to ensure a small discretization error) and a large $L$ (to move far). The dual averaging algorithm uses this trajectory length and adaptively changes $\epsilon$ during iterations of the complete sampler, in order to achieve a desired acceptance rate; see Hoffman and Gelman (2014) for recommendations on its choice. A reasonable starting value for $\epsilon$ is determined by Algorithm 4 of Hoffman and Gelman (2014). At each sweep of the sampler, the dual averaging update is given in Algorithm 1.
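A generic version of the leapfrog update and the resulting HMC step can be sketched as follows. The target log density and its gradient are passed in as functions, standing in for $l_\alpha$ and $\nabla_\alpha l_\alpha$ above; the dual averaging adaptation of the step size is omitted, and all names and settings are illustrative.

```python
import numpy as np

def leapfrog(grad_log_p, alpha, mom, eps, n_steps):
    """Leapfrog integrator for an HMC step: volume-preserving dynamics
    for position alpha and momentum mom under the target log density."""
    mom = mom + 0.5 * eps * grad_log_p(alpha)     # initial half step for momentum
    for _ in range(n_steps - 1):
        alpha = alpha + eps * mom                 # full step for position
        mom = mom + eps * grad_log_p(alpha)       # full step for momentum
    alpha = alpha + eps * mom
    mom = mom + 0.5 * eps * grad_log_p(alpha)     # final half step for momentum
    return alpha, mom

def hmc_step(log_p, grad_log_p, alpha, eps, n_steps, rng):
    """One HMC update with standard normal momenta and a Metropolis accept."""
    mom0 = rng.standard_normal(alpha.shape)
    prop, mom = leapfrog(grad_log_p, alpha.copy(), mom0.copy(), eps, n_steps)
    log_acc = (log_p(prop) - 0.5 * mom @ mom) - (log_p(alpha) - 0.5 * mom0 @ mom0)
    return (prop, True) if np.log(rng.uniform()) < log_acc else (alpha, False)
```

Because the leapfrog map is volume-preserving and reversible, the Metropolis correction only needs the change in the Hamiltonian, which is why acceptance rates stay high for moderate step sizes.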

In our empirical work, a burn-in of 40,000 iterates was employed, after which the Monte Carlo sample was collected. These are very conservative values, in that much smaller samples result in similar posterior estimates.

### 3.3 Approximate Estimation

The VB estimator approximates the augmented posterior $p(\bm{\vartheta}\mid \bm{y}) \propto h(\bm{\vartheta})$ with a tractable density $q_\lambda(\bm{\vartheta})$. Here, $h(\bm{\vartheta}) = p(\bm{y}\mid \bm{\vartheta})p(\bm{\vartheta})$, with $p(\bm{y}\mid \bm{\vartheta})$ the conditional likelihood at Eq. (11), and $\bm{\lambda}$ is a vector of ‘variational parameters’ which are calibrated by minimizing the Kullback-Leibler divergence between $q_\lambda(\bm{\vartheta})$ and $p(\bm{\vartheta}\mid \bm{y})$. It is straightforward to show (see, for example, Ormerod and Wand (2010)) that this is equivalent to maximizing the so-called variational lower bound

$$\mathcal{L}(\bm{\lambda}) = \int q_\lambda(\bm{\vartheta})\log\left\{\frac{p(\bm{y}, \bm{\vartheta})}{q_\lambda(\bm{\vartheta})}\right\} d\bm{\vartheta} = E_q\!\left\{\log(h(\bm{\vartheta})) - \log(q_\lambda(\bm{\vartheta}))\right\}, \qquad (12)$$

with respect to $\bm{\lambda}$. The expectation in Eq. (12) is with respect to the variational approximation (VA) with density $q_\lambda$, and cannot be computed in closed form. Therefore, a stochastic gradient ascent (SGA) algorithm (Honkela et al., 2010; Salimans and Knowles, 2013) is used to maximize $\mathcal{L}(\bm{\lambda})$. This employs an unbiased estimate $\widehat{\nabla_\lambda \mathcal{L}(\bm{\lambda})}$ of the gradient of $\mathcal{L}$ to compute the update $\bm{\lambda}^{(t+1)} = \bm{\lambda}^{(t)} + \bm{\rho}^{(t)} \circ \widehat{\nabla_\lambda \mathcal{L}(\bm{\lambda}^{(t)})}$ recursively. If $\{\bm{\rho}^{(t)}\}$ is a sequence of vector-valued learning rates that fulfil the Robbins-Monro conditions (Robert, 1995), then the sequence converges to a local optimum (Bottou, 2010). The learning rates are set adaptively using the ADADELTA method of Zeiler (2012).

For the SGA algorithm to be efficient, the gradient estimate should exhibit low variance. To achieve this, we use the so-called ‘re-parameterization trick’ (Kingma and Welling, 2014; Rezende et al., 2014). This expresses $\bm{\vartheta} = a(\bm{\zeta}, \bm{\lambda})$ as a function of another random variate $\bm{\zeta}$ that has a density $p_\zeta$ that does not depend on $\bm{\lambda}$. In this case, the lower bound is

$$\mathcal{L}(\bm{\lambda}) = E_{p_\zeta}\!\left\{\log h(a(\bm{\zeta}, \bm{\lambda})) - \log q_\lambda(a(\bm{\zeta}, \bm{\lambda}))\right\}, \qquad (13)$$

where $E_{p_\zeta}$ denotes an expectation with respect to $p_\zeta$. Note that the variational parameters $\bm{\lambda}$ appear inside the function $a$, and when differentiating Eq. (13) with respect to $\bm{\lambda}$, information from the target posterior density is used, whereas it is not when differentiating Eq. (12).
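A minimal sketch of the re-parameterization trick for a univariate Gaussian approximation $q_\lambda = N(\mu, e^{2\omega})$, with $\vartheta = a(\zeta, \lambda) = \mu + e^{\omega}\zeta$ and $\zeta \sim N(0,1)$. The toy target $\log h(\vartheta) = -\vartheta^2/2$ is an assumption for illustration, and the entropy term is handled analytically here, a common variant.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_h(theta):
    """Toy target: log h = -0.5 theta^2, i.e. an N(0,1) kernel."""
    return -0.5 * theta ** 2

mu, omega = 0.5, np.log(0.8)             # variational parameters lambda
M = 200_000
zeta = rng.standard_normal(M)            # zeta ~ p_zeta, free of lambda
theta = mu + np.exp(omega) * zeta        # theta = a(zeta, lambda)

# da/dmu = 1 and da/domega = exp(omega) * zeta; the entropy of q is handled
# analytically, contributing 0 to the mu-gradient and 1 to the omega-gradient.
grad_log_h = -theta                      # gradient of log h at a(zeta, lambda)
grad_mu = np.mean(grad_log_h * 1.0)
grad_omega = np.mean(grad_log_h * np.exp(omega) * zeta) + 1.0
```

For this toy problem the exact gradients are available (grad of the bound is $-\mu$ with respect to $\mu$, and $1 - e^{2\omega}$ with respect to $\omega$), so the Monte Carlo estimates can be checked directly.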

Differentiating under the expectation in Eq. (13) gives

$$\nabla_\lambda \mathcal{L}(\bm{\lambda}) = E_{p_\zeta}\!\left(\frac{\partial a(\bm{\zeta}, \bm{\lambda})'}{\partial \bm{\lambda}}\,\nabla_\vartheta\!\left\{\log h(a(\bm{\zeta}, \bm{\lambda})) - \log q_\lambda(a(\bm{\zeta}, \bm{\lambda}))\right\} - \nabla_\lambda \log q_\lambda(a(\bm{\zeta}, \bm{\lambda}))\right),$$

where an unbiased estimate of the expectation can be computed by simulating from $p_\zeta$. To further reduce the variance of the gradient estimate we follow a suggestion by Roeder and Wu (2017). These authors observe that if the variational density is exact (i.e. $q_\lambda(\bm{\vartheta}) = p(\bm{\vartheta}\mid \bm{y})$) it can be shown that $E_{p_\zeta}\!\left(\nabla_\lambda \log q_\lambda(a(\bm{\zeta}, \bm{\lambda}))\right) = 0$. This implies that if $q_\lambda$ is a good approximation to the posterior, then

$$\nabla_\lambda \mathcal{L}(\bm{\lambda}) \approx E_{p_\zeta}\!\left(\frac{\partial a(\bm{\zeta}, \bm{\lambda})'}{\partial \bm{\lambda}}\,\nabla_\vartheta\!\left\{\log h(a(\bm{\zeta}, \bm{\lambda})) - \log q_\lambda(a(\bm{\zeta}, \bm{\lambda}))\right\}\right).$$