# Sparse online variational Bayesian regression

This work considers variational Bayesian inference as an inexpensive and scalable alternative to a fully Bayesian approach in the context of sparsity-promoting priors. In particular, the priors considered arise from scale mixtures of Normal distributions with a generalized inverse Gaussian mixing distribution. This includes the variational Bayesian LASSO as an inexpensive and scalable alternative to the Bayesian LASSO introduced in [56]. It also includes priors which more strongly promote sparsity. For linear models the method requires only the iterative solution of deterministic least squares problems. Furthermore, for n→∞ data points and p unknown covariates the method can be implemented exactly online with a cost of O(p^3) in computation and O(p^2) in memory. For large p an approximation is able to achieve promising results for a cost of O(p) in both computation and memory. Strategies for hyper-parameter tuning are also considered. The method is implemented for real and simulated data. It is shown that the performance in terms of variable selection and uncertainty quantification of the variational Bayesian LASSO can be comparable to the Bayesian LASSO for problems which are tractable with that method, and for a fraction of the cost. The present method comfortably handles n = p = 131,073 on a laptop in minutes, and n = 10^5, p = 10^6 overnight.


## 1 Introduction

Regression is a quintessential and ubiquitous task of machine learning. The simplest method one can use to solve regression tasks is a linear model with Gaussian noise and prior [11]. The most attractive feature of linear Gaussian models is analytical tractability, from both frequentist and Bayesian viewpoints. However, once one employs basis function expansions, they also become quite flexible. There are numerous methods in which linear models can be embedded, such as total least squares [75] and mixtures of regressions [45], for example. Sparsity-promoting priors have proven to be very useful for identifying useful features and avoiding overfitting, perhaps most notably the Lasso [68] and its incarnation as total variation (TV) regularization in imaging [71, 66]. However, as soon as a non-Gaussian prior is introduced, analytical tractability is lost and, in particular, the Bayesian solution becomes very expensive [56], requiring Markov chain Monte Carlo (MCMC) methods [60, 36]. Furthermore, sparsity-promoting priors are not differentiable, which prevents the use of simple Gaussian approximations such as the Laplace approximation [7].

Here we introduce variational approximations of the posterior associated to sparsity priors given by scale mixtures of Normals, with a generalized inverse Gaussian mixing distribution. This includes the variational Bayesian Lasso (VBL), a principled Gaussian approximation to the Bayesian Lasso presented in [56] (see Remark 2.4 for discussion of some recent related work which we discovered after writing this paper). It is derived through a variational Bayesian expectation maximization (VBEM) method [6, 4], which requires only the solution of (unconstrained) linear systems, and provides approximations of the mean and covariance of the target. Additionally, in parallel we perform classical expectation maximization (EM) [24] to obtain the maximum a posteriori (MAP) estimator.

The approach presented here provides fast uncertainty quantification (UQ) in the context of sparse regression and linear inverse problems. The key points are:

• Fast and inexpensive variational Bayesian solution with sparsity-promoting priors arising as scale mixtures of Normals, such as the Laplace prior underlying the Lasso, as well as heavier-tailed alternatives which promote sparsity more strongly. The Bayesian formulation provides UQ in the estimate, which is becoming increasingly important in science and engineering applications [64, 34].

• Online Bayesian solution means infinite data can be learned from with an O(p²) memory cost and O(p³) compute cost per update, where p is the width of the design matrix (see below). This trivially reduces to the standard Kalman filter [44] for the case of a Gaussian prior, and is hence exact (the compute cost also reduces to O(p²) in this case). Otherwise it provides the exact MAP estimate (for convex prior) and a variational approximation to the posterior, with little more than a pair of parallel Kalman filters. We note the results presented are constrained to the case of regression, but will be extended to non-trivial dynamics in future work.

• Adaptive online learning of hyper-parameters via EM strategies provides improved accuracy at a marginal additional cost.

• A further “low rank + diagonal” approximation can provide a reduction in cost to O(p) in case p is prohibitively large, in a similar spirit to the ensemble Kalman filter (EnKF) [29].
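To make the Kalman filter bullet concrete, the following is a minimal sketch (not the paper's code; all names, dimensions, and hyper-parameter values are illustrative) of online assimilation under a Gaussian prior, where each rank-one update costs O(p²) and reproduces the batch posterior mean exactly:

```python
import numpy as np

# Sketch: online assimilation of scalar observations y_i = x_i^T beta + eps_i
# with Gaussian prior N(m0, C0) and noise variance g2. Each update is a
# standard Kalman filter step, costing O(p^2).

def kalman_update(m, C, x, y, g2):
    """One rank-one Kalman update for a single observation (x, y)."""
    s = g2 + x @ C @ x               # scalar innovation variance
    k = C @ x / s                    # Kalman gain, shape (p,)
    m_new = m + k * (y - x @ m)      # posterior mean
    C_new = C - np.outer(k, x @ C)   # posterior covariance
    return m_new, C_new

rng = np.random.default_rng(0)
p, n, g2 = 3, 50, 0.5
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))
Y = X @ beta + np.sqrt(g2) * rng.normal(size=n)

m, C = np.zeros(p), np.eye(p)        # prior m0 = 0, C0 = I
for i in range(n):
    m, C = kalman_update(m, C, X[i], Y[i], g2)

# Batch posterior mean for comparison: (X^T X / g2 + I)^{-1} X^T Y / g2.
m_batch = np.linalg.solve(X.T @ X / g2 + np.eye(p), X.T @ Y / g2)
assert np.allclose(m, m_batch)
```

The exactness of the recursion in the Gaussian case is what the non-Gaussian extensions below build upon.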

The rest of the paper is organized as follows. The basic model is introduced in Section 2: after introducing linear models in 2.1, sparsity priors in 2.2, Bayesian formulation in 2.3, EM in 2.4, and VBEM in 2.5, this culminates in our Gaussian approximation of sparsity-promoting posteriors and VBL in 2.6. Further enhancements to the basic model are introduced in Section 3. In particular, the online version is introduced in 3.1, and hyper-parameter tuning is considered in 3.2. Numerical results are presented in Section 4, including a small basic dataset relating to diabetes in 4.1, and a more computationally intensive total variation (TV) denoising/deblurring example in 4.2.

## 2 The basic model

This section introduces some background and the basic model.

### 2.1 Linear models

Let x_i ∈ ℝ^p denote the covariates and y_i ∈ ℝ the response (scalar for simplicity), and let X_n = [x_1, …, x_n]^T ∈ ℝ^{n×p} and Y_n = (y_1, …, y_n)^T. Consider the following statistical model for Bayesian linear regression

$$ y_i = x_i^T \beta + \epsilon_i, \qquad \epsilon_i \perp \beta. \qquad (1) $$

Online inference of β corresponds to the case n → ∞. If ϵ_i ∼ N(0, γ²) and β ∼ N(m_0, C_0), possibly depending on some parameters which are suppressed in the notation, then conditioned on (X_n, Y_n) the posterior distribution β | X_n, Y_n ∼ N(m_n, C_n) is given in closed form as follows [11]

$$
\begin{aligned}
m_n &= \Big(\tfrac{1}{\gamma^2} X_n^T X_n + C_0^{-1}\Big)^{-1}\Big(\tfrac{1}{\gamma^2} X_n^T Y_n + C_0^{-1} m_0\Big) \qquad (2)\\
    &= m_0 + C_0 X_n^T\big(\gamma^2 I_n + X_n C_0 X_n^T\big)^{-1}(Y_n - X_n m_0),\\
C_n &= \Big(\tfrac{1}{\gamma^2} X_n^T X_n + C_0^{-1}\Big)^{-1} = C_0 - C_0 X_n^T\big(\gamma^2 I_n + X_n C_0 X_n^T\big)^{-1} X_n C_0. \qquad (3)
\end{aligned}
$$

The mean m_n minimizes the quadratic objective function

$$ \frac{1}{2\gamma^2}\,|X_n\beta - Y_n|^2 + \frac{1}{2}\,\big|C_0^{-1/2}(\beta - m_0)\big|^2. \qquad (4) $$

By choosing the appropriate version above (the first form inverts a p × p matrix, the second an n × n matrix), the computation cost is O(min(n, p)³) and the memory cost is O(np). In case p ≫ n, and in particular if n → ∞, then the Kalman filter [44] provides the exact solution online with a memory and (per iteration) computation cost of O(p²). See Appendix D.
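The algebraic equivalence between the two forms of (2)-(3) can be verified directly. The following sketch (illustrative dimensions and data, not from the paper) computes the posterior both ways and checks they agree:

```python
import numpy as np

# Sketch verifying the two algebraically equivalent forms of (2)-(3):
# the "primal" p x p form and the "dual" n x n (Woodbury) form.
# One chooses whichever of n, p is smaller in practice.

rng = np.random.default_rng(1)
n, p, g2 = 8, 5, 0.25
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)
m0 = rng.normal(size=p)
A = rng.normal(size=(p, p))
C0 = A @ A.T + np.eye(p)            # any SPD prior covariance

# Primal form: invert a p x p matrix.
C0inv = np.linalg.inv(C0)
P = X.T @ X / g2 + C0inv
m_primal = np.linalg.solve(P, X.T @ Y / g2 + C0inv @ m0)
C_primal = np.linalg.inv(P)

# Dual form: invert an n x n matrix (Woodbury identity).
S = g2 * np.eye(n) + X @ C0 @ X.T
m_dual = m0 + C0 @ X.T @ np.linalg.solve(S, Y - X @ m0)
C_dual = C0 - C0 @ X.T @ np.linalg.solve(S, X @ C0)

assert np.allclose(m_primal, m_dual)
assert np.allclose(C_primal, C_dual)
```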

A useful observation in what follows is that equation (2) can be rewritten recursively as

$$
\begin{aligned}
m_n &= \Big(\tfrac{1}{\gamma^2}\big(x_n x_n^T + X_{n-1}^T X_{n-1}\big) + C_0^{-1}\Big)^{-1}\Big(\tfrac{1}{\gamma^2}\big(x_n y_n + X_{n-1}^T X_{n-1} m_{n-1}\big) + C_0^{-1} m_{n-1}\Big) \qquad (5)\\
    &= m_{n-1} + C_0 X_n^T\big(\gamma^2 I_n + X_n C_0 X_n^T\big)^{-1}\big(\hat{Y}_n - X_n m_{n-1}\big),
\end{aligned}
$$

where Ŷ_n denotes the correspondingly modified observation vector. The above is presented for the assimilation of one observation at a time, but it is easily modified to update a batch of observations at a time, which will be considered later. This observation is obviously not useful by itself, as it incurs a cost of O(p³) per iteration, whereas the Kalman filter delivers O(p²) updates in primal/covariance form (see (55)). However, in what follows, the precision will be required. The latter representation will facilitate an approximate update which relies on recursive “low-rank + diagonal” approximations of the precision, in similar spirit to the ensemble Kalman filter [15].

Linear models of the form (1) are quite flexible, once one considers basis function expansions. In other words, one takes x_i = φ(z_i) for data z_i and some functions φ = (φ_1, …, φ_p), which can be a subset of polynomials [34, 20, 38], wavelets and other “∗-lets” [19, 14], random feature models [59], or any number of other choices. The book [7] provides a concise and easy to read summary for regression applications. In fact, there are complete bases for many function spaces. For example, the Fourier series forms a complete basis for square-integrable periodic functions [33], and a subset of such features can therefore be used to construct a convergent approximation. In fact, since radial basis function kernel operators are diagonalized by the Fourier basis, the expectation of the product of two such features with a random frequency is equal to the kernel evaluation. Monte Carlo approximation of such expectations is the basis of the random feature models [59].
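The random feature construction can be illustrated in a few lines. The sketch below (an assumption-laden toy, not the paper's code) uses the standard random Fourier feature recipe for the RBF kernel k(x, y) = exp(−|x − y|²/2), whose frequency distribution is N(0, I):

```python
import numpy as np

# Sketch of the random feature idea [59]: for the RBF kernel
# k(x, y) = exp(-|x - y|^2 / 2), features z(x) = sqrt(2/D) cos(W x + b)
# with random frequencies W ~ N(0, I) and phases b ~ Unif(0, 2*pi)
# satisfy E[z(x)^T z(y)] = k(x, y).

rng = np.random.default_rng(2)
d, D = 2, 200_000                        # input dim, number of features
W = rng.normal(size=(D, d))              # random frequencies
b = rng.uniform(0, 2 * np.pi, size=D)    # random phases

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = np.array([0.3, -0.1]), np.array([-0.2, 0.4])
approx = z(x) @ z(y)                     # Monte Carlo kernel estimate
exact = np.exp(-np.sum((x - y) ** 2) / 2)
assert abs(approx - exact) < 0.02        # O(1/sqrt(D)) Monte Carlo error
```

A linear model in the features z(x) then inherits the flexibility of the corresponding kernel method.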

An issue is how many terms to include and, perhaps more importantly, how to select a subset of the important terms from a sufficiently rich set, without incurring a loss in accuracy due to overfitting, as can occur with too much flexibility. The latter issue is often referred to as the “bias-variance tradeoff”: a model which is too flexible (negligible bias) may be too strongly influenced by a particular data set, hence incurring a large variance over a range of different data sets [32]. This well-known issue can be dealt with by beginning with a sufficiently rich class of approximating functions (e.g. a large enough basis) and then introducing prior assumptions, or regularization, in order to let the model select the most parsimonious representation [69, 40, 68, 54, 53, 67].

### 2.2 Sparsity priors

In the context of prior selection, the Gaussian assumption is often considered too restrictive. In particular, it became very popular at the end of the last and the beginning of this millennium to utilize a sparsity-promoting prior. In this case the Gaussian, whose quadratic log-density gives rise to the second (Tikhonov regularization) term in equation (4), is replaced with another density of the form p(β) ∝ exp(−λ‖β‖_q^q), where ‖β‖_q^q = Σ_{j=1}^p |β_j|^q, for q ∈ (0, 2), most notably q = 1 [68, 19, 25, 56]. This is often extended to the case q = 0, which corresponds to counting measure on the non-zero elements of β. As we will see, more general sparsity-promoting priors are also possible. Note that a known scale can always be absorbed by redefining λ and β accordingly; therefore, we assume without loss of too much generality that the scales are normalized in this way. This will be discussed further in the examples.

There is a computational burden to performing inference with these more exotic priors, even when we abandon UQ and settle for a MAP estimator. In the best case of q = 1 we have a convex optimization problem, which can be solved efficiently by a number of modern methods, such as iterative soft thresholding [22, 13] and the alternating direction method of multipliers [31, 12]. These iterative methods incur the same cost as (2) at each iteration.

Considering the full Bayesian posterior, the situation is even more daunting. Indeed, once one adopts such a sparsity prior, the posterior is no longer characterized in closed form with finitely many parameters, as it is in the Gaussian case (in the finite/discrete/parametric setting). Laplace approximation [7] requires the gradient and Hessian of the log-posterior, which may not exist. Computationally-intensive methods such as Markov chain Monte Carlo (MCMC) [60] are required for consistent estimation, as proposed for the Bayesian Lasso in [56]. There has been a lot of activity in this direction in the past 10 years – see, e.g. [49, 57, 58, 50, 16, 72, 9]. Here we propose to employ a variational approach to recover the best Gaussian approximation to the sparsity-promoting posterior, in a sense to be defined precisely below. It is worth mentioning that the Bayesian formulation allows for more general priors than the continuous, or “shrinkage”, priors considered here. In recent years “spike and slab” priors [51] have become very popular, consisting of a mixture distribution with a Dirac mass at 0 (spike) and a continuous distribution such as the ones considered here (slab). See [5] for a recent review, focused on the spike and slab LASSO.

### 2.3 Bayesian formulation of sparsity priors

Assume that the prior consists of p independent random variables β_j, each with Laplace distribution L(β_j; λ) = (λ/2) exp(−λ|β_j|). The MAP estimator associated to this model corresponds to L1-regularized regression, or LASSO [68]. It is well-known that the Laplace distribution can be expressed as a scale mixture of zero-mean Gaussians [2, 30, 61], where the mixture distribution follows a generalized inverse Gaussian law

$$ \mathrm{GIG}(\theta;\nu,\delta,\lambda) = \frac{(\lambda/\delta)^\nu}{2 K_\nu(\delta\lambda)}\,\theta^{\nu-1}\exp\Big(-\tfrac{1}{2}\big(\delta^2/\theta + \lambda^2\theta\big)\Big) \qquad (6) $$

with ν = 1 and δ = 0, i.e.

$$ L(\beta_j;\lambda) = \int_{\mathbb{R}_+} N(\beta_j; 0, \theta_j)\,\mathrm{GIG}(\theta_j; 1, 0, \lambda)\,d\theta_j. \qquad (7) $$
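The identity (7) is easy to check by Monte Carlo, since GIG(1, 0, λ) is an Exponential distribution with rate λ²/2, and the Laplace(λ) distribution has variance 2/λ². The following sketch (illustrative, with an arbitrary λ) performs this check:

```python
import numpy as np

# Sketch checking (7) by Monte Carlo: mixing N(0, theta) over
# theta ~ GIG(1, 0, lam) -- an Exponential with rate lam^2/2, i.e.
# mean 2/lam^2 -- reproduces a Laplace(lam) marginal, variance 2/lam^2.

rng = np.random.default_rng(3)
lam, N = 1.5, 2_000_000
theta = rng.exponential(scale=2.0 / lam**2, size=N)   # GIG(1, 0, lam) draws
beta = rng.normal(size=N) * np.sqrt(theta)            # scale-mixture draws
beta_direct = rng.laplace(scale=1.0 / lam, size=N)    # direct Laplace draws

assert abs(beta.var() - 2.0 / lam**2) < 0.01
assert abs(beta.var() - beta_direct.var()) < 0.02
```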

We consider priors defined by general scale mixtures of Normals with mixing distribution given by (6). This includes other sparsity-promoting distributions such as the Normal inverse Gaussian and Normal Gamma [70, 18, 37, 17]. Furthermore, the conditionals of the joint posterior are known exactly

$$
\begin{aligned}
\beta \mid \theta, X_n, Y_n &\sim N(\beta; m_n^\theta, C_n^\theta), \qquad (8)\\
\theta_j \mid \beta_j, X_n, Y_n &\sim \mathrm{GIG}\Big(\theta_j;\, \nu - \tfrac{1}{2},\, \sqrt{\delta^2 + \beta_j^2},\, \lambda\Big), \qquad (9)
\end{aligned}
$$

where m_n^θ, C_n^θ are given by (2) and (3) with C_0 = D(θ), the diagonal matrix with θ on the diagonal, and m_0 = 0. Therefore, Gibbs sampling can be used to sample from the monolithic posterior [63, 56]. Sequential Monte Carlo methods [26, 52] have also been designed to sample from the full posterior sequentially in [63, 17], and a sequential expectation maximization (EM) [24] method has been used to approximate MAP estimates of β in [18, 17].

###### Remark 2.1

One may also retain an original Gaussian prior covariance C_0, so that the effective prior precision becomes D(1/θ) + C_0^{-1}. Marginally, in the case ν = 1 and δ = 0, this would yield an elastic net prior for β [61].

### 2.4 Expectation maximization

For the next sections we suppress dependencies in the notation where convenient. Suppose we want to maximize

$$ \log P(Y,\beta) = \log \int P(Y,\beta,\theta)\,d\theta \;\ge\; \int \log\Big(\frac{P(Y,\beta,\theta)}{q(\theta)}\Big)\,q(\theta)\,d\theta, \qquad (10) $$

where the inequality arises, for any probability density q, from an application of Jensen’s inequality. The expectation maximization (EM) algorithm [24] proceeds as follows. Define

$$ Q(\beta \mid \beta^t) = \int \log\Big(\frac{P(Y,\beta,\theta)}{P(\theta \mid \beta^t, Y)}\Big)\,P(\theta \mid \beta^t, Y)\,d\theta, \qquad (11) $$

and let β^{t+1} = argmax_β Q(β | β^t).

In our context this entails iteratively computing

$$ Q(\beta \mid \beta^t) = -\frac{1}{2}\beta^T D(1/\theta^{t+1})\beta - \frac{1}{2\gamma^2}\,|Y_n - X_n\beta|_2^2 + K(\beta^t, X_n, Y_n), \qquad (12) $$

where we recall that D(1/θ^{t+1}) is the diagonal matrix with 1/θ^{t+1} on the diagonal. For example, in the case ν = 1, 1/θ^{t+1} is defined element-wise as

$$ 1/\theta^{t+1}_j := \mathbb{E}[1/\theta_j \mid \beta^t, X_n, Y_n] = \lambda\big((\beta^t_j)^2 + \delta^2\big)^{-1/2}. \qquad (13) $$

The calculation of (12) is given in Appendix A, along with a slightly lengthier explanation of EM. Note we have assumed ν = 1 but allowed δ > 0 in the hyperprior (6), which relaxes the marginal Laplace identity (7). The general form of (13) is given in equation (49), and the present case in (39). One then has the iteration

$$ \mu_n^{t+1} = \Big(X_n^T X_n + \gamma^2\lambda\, D\big(((\mu_n^t)^2 + \delta^2)^{-1/2}\big)\Big)^{-1} X_n^T Y_n. \qquad (14) $$

These analytical calculations have been shown and used before in several works, including [30, 18, 37]. From this, one obtains the MAP estimator μ_n^* at convergence.
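The iteration (14) can be sketched in a few lines. The example below is a toy illustration under assumed dimensions and hyper-parameter values (not the paper's implementation), recovering a sparse signal from noisy linear measurements:

```python
import numpy as np

# Sketch of the EM iteration (14) for the MAP (Lasso-type) estimator:
# each step is a reweighted ridge regression; delta > 0 smooths |.| near 0.

rng = np.random.default_rng(4)
n, p, g2, lam, delta = 100, 10, 0.1, 5.0, 1e-6
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                 # sparse ground truth
X = rng.normal(size=(n, p))
Y = X @ beta_true + np.sqrt(g2) * rng.normal(size=n)

mu = np.zeros(p)
for _ in range(200):
    w = lam / np.sqrt(mu**2 + delta**2)          # E[1/theta_j | mu], cf. (13)
    mu = np.linalg.solve(X.T @ X + g2 * np.diag(w), X.T @ Y)  # step (14)

# Zero coordinates of the truth are heavily shrunk; signal is recovered.
assert np.all(np.abs(mu[3:]) < 0.2)
assert np.allclose(mu[:3], beta_true[:3], atol=0.2)
```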

###### Remark 2.2

An iteratively reweighted least squares (IRLS) algorithm [41] can be derived in order to approximate ℓ_q regularization with q < 1 by a sequence of ℓ_2 problems, based on the following observation

$$ |\beta|_q^q = \sum_{i=1}^p |\beta_i|^q = \sum_{i=1}^p |\beta_i|^{q-2}\beta_i^2. $$

The resulting iteration is exactly as in (14), where δ is interpreted as a regularization parameter. It can be shown under appropriate assumptions that the iterates converge as t → ∞ to a sparse solution if such a solution exists, and convergence is linear (exponentially fast) for q sufficiently close to 1 [23].

### 2.5 Variational Bayesian Expectation maximization

Here we propose to use the variational Bayesian expectation maximization (VBEM) algorithm, introduced in the context of graphical models in [4, 6]. We show how it works elegantly in our context to provide a Gaussian approximation to problems with sparsity priors, which is optimal in a certain sense. Suppose we return to (10), and this time multiply and divide by some density q(β) and integrate over β as well. Then we have the evidence lower bound

$$ \log P(Y) = \log \int P(Y,\beta,\theta)\,d\theta\,d\beta \;\ge\; \int \log\Big(\frac{P(Y,\beta,\theta)}{q(\theta)q(\beta)}\Big)\,q(\theta)q(\beta)\,d\theta\,d\beta. \qquad (15) $$

Maximizing this bound with respect to the densities coincides with minimizing the KL divergence between the variational approximation and the joint posterior, i.e.

$$
\begin{aligned}
\log P(Y) - \int \log\Big(\frac{P(Y,\beta,\theta)}{q(\theta)q(\beta)}\Big)\, q(\theta)q(\beta)\,d\theta\,d\beta
&= -\int \log\Big(\frac{P(\beta,\theta \mid Y)}{q(\theta)q(\beta)}\Big)\, q(\theta)q(\beta)\,d\theta\,d\beta \qquad (16)\\
&=: \mathrm{KL}\big[\,q(\theta)q(\beta)\,\big\|\,P(\beta,\theta \mid Y)\,\big].
\end{aligned}
$$

The objective functions for each of q(θ) and q(β) given the other are convex and can be minimized exactly, as observed in [4, 6], leading to the iterative algorithm

$$
\begin{aligned}
q^{t+1}(\theta) &\propto \exp\Big(\int \log P(Y,\beta,\theta)\, q^t(\beta)\,d\beta\Big), \qquad (17)\\
q^{t+1}(\beta) &\propto \exp\Big(\int \log P(Y,\beta,\theta)\, q^{t+1}(\theta)\,d\theta\Big).
\end{aligned}
$$

Furthermore, following from convexity of the intermediate targets, this gives a descent direction for (16). Observe that constraining q(θ) to the form δ_{θ^*}(θ), where δ_{θ^*} is the Dirac delta function and θ^* is the point of maximum probability above, yields the original EM algorithm. Also, observe that (17) may itself be intractable in general, although it is shown in [6] that it simplifies somewhat for conjugate exponential models and may be analytically soluble. Fortunately, the present situation is the best case, where it is analytically soluble. Notice that the objective function (16) corresponds to an independence assumption between θ and β; however, from (17) it is clear that the solution solves a coupled system, and in fact probabilistic dependence is replaced with a deterministic dependence on each other’s summary statistics, as noted in [6].

### 2.6 Gaussian approximation to a sparsity promoting posterior

Following instead the variational Bayesian approach of Sec. 2.5 gives (17), i.e.

$$
\begin{aligned}
q^{t+1}(\theta) &\propto \exp\Big(-\frac{1}{2}\sum_{j=1}^p \mathbb{E}^t[\beta_j^2]/\theta_j\Big)\,P(\theta \mid \lambda), \qquad (19)\\
q^{t+1}(\beta) &\propto \exp\Big(-\frac{1}{2}\sum_{j=1}^p \beta_j^2\,\mathbb{E}^{t+1}[1/\theta_j] - \frac{1}{2\gamma^2}\,|Y_n - X_n\beta|_2^2\Big),
\end{aligned}
$$

where E^t is used to (degenerately) denote expectation with respect to the iteration-t intermediate variational distribution, q^t(β) or q^t(θ), according to its argument. The first equation looks similar to the EM algorithm, however with the important difference

$$ q^{t+1}(\theta_j) = \mathrm{GIG}\Big(\theta_j;\, \nu - \tfrac{1}{2},\, \sqrt{\delta^2 + C^t_{n,jj} + (m^t_{n,j})^2},\, \lambda\Big), \qquad (20) $$

where m_n^t, C_n^t are the mean and covariance of q^t(β) (note the appearance of the full second moment C^t_{n,jj} + (m^t_{n,j})^2 instead of just (m^t_{n,j})^2).

In turn this means that for the case ν = 1 we have

$$ 1/\theta^{t+1}_{n,j} := \mathbb{E}^{t+1}[1/\theta_j] = \lambda\big(C^t_{n,jj} + (m^t_{n,j})^2 + \delta^2\big)^{-1/2}. \qquad (21) $$

The update equations are (e.g. in primal/monolithic form (2))

$$
\begin{aligned}
m_n^{t+1} &= C_n^{t+1}\Big(\frac{1}{\gamma^2} X_n^T Y_n\Big), \qquad (22)\\
C_n^{t+1} &= \Big(\frac{1}{\gamma^2} X_n^T X_n + D(1/\theta_n^{t+1})\Big)^{-1}. \qquad (23)
\end{aligned}
$$

Here we can explicitly observe the deterministic dependence between the marginally optimal q(θ) and q(β) distributions via each other’s summary statistics. The general form of the update (21) is given in (49), and the present case in (39).

Note that this algorithm runs for approximately the same cost as the former, and provides a Gaussian approximation of the posterior, while the former provides an approximation of the MAP estimator. In the case ν = 1 we refer to this pair of algorithms as the variational Bayesian LASSO (VBL). All of the above is comprised in Algorithm 1. It is well-known that the mean under sparsity priors, such as TV-regularization, may not in fact promote sparsity [47]. Indeed, even the Bayesian LASSO point estimator [56], which returns the median instead of the mean, is not as sparse as the standard LASSO, i.e. its MAP estimator. In the context of UQ, one is more concerned with uncertainty measures, such as confidence intervals. For example, one may consider the sparse solution to be reasonable if, for an index j at which the sparse point estimate vanishes, the origin is within the confidence interval of the Bayesian marginal for that coordinate. In general, one may flag as unusual any circumstance where this fails. See Figure 1 for an illustration of the VBL applied to a simple example (“exact” values are calculated with numerical quadrature on a bounded domain with a fine grid in each direction).
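The VBEM loop (21)-(23) is only a few lines long in practice. The sketch below is a toy illustration under assumed dimensions and hyper-parameter values (not the paper's Algorithm 1), showing that, unlike EM, it returns a covariance alongside the mean:

```python
import numpy as np

# Sketch of the VBEM iteration (21)-(23) for nu = 1: alternate closed-form
# updates of E[1/theta_j] and of the Gaussian q(beta) = N(m, C).
# The diagonal of C enters the reweighting, unlike plain EM.

rng = np.random.default_rng(5)
n, p, g2, lam, delta = 100, 10, 0.1, 5.0, 1e-6
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                 # sparse ground truth
X = rng.normal(size=(n, p))
Y = X @ beta_true + np.sqrt(g2) * rng.normal(size=n)

m, C = np.zeros(p), np.eye(p)
for _ in range(200):
    inv_theta = lam / np.sqrt(np.diag(C) + m**2 + delta**2)  # update (21)
    C = np.linalg.inv(X.T @ X / g2 + np.diag(inv_theta))     # update (23)
    m = C @ (X.T @ Y) / g2                                   # update (22)

assert np.allclose(m[:3], beta_true[:3], atol=0.2)  # signal recovered
assert np.all(np.abs(m[3:]) < 0.2)                  # zeros shrunk
assert np.all(np.diag(C) > 0)                       # valid marginal variances
```

The marginal standard deviations from the diagonal of C are what supply the confidence intervals discussed above.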

###### Remark 2.3

Step (3) of Algorithm 1 requires a stopping criterion. It is sensible, for example, to let this be determined by the data misfit or by the change in successive iterates. In the first case, a good choice may be a tolerance commensurate with the noise level. We will see that the algorithm returns a good estimator very quickly, but may take a long time to converge, and may even return a worse estimator at convergence. Therefore, a maximum number of iterations is often also employed as an alternative in practice.

###### Remark 2.4

The VBEM has come to be known more generally as mean-field variational Bayes. See [10] for a recent review, where the quite natural name coordinate ascent variational inference is given to the iterations (19). After completing this work, we made several discoveries. The special case referred to as automatic relevance determination [54] has been introduced already, independently, in [8] and [27]. Also, the work [3] considers variational inference in the context of the stable mixing distribution. This distribution cannot be explicitly represented in general, but it yields marginal priors of the form p(β) ∝ exp(−λ‖β‖_q^q) for q ∈ (0, 2), and leads to EM updates of the IRLS form, as in Remark 2.2. This is referred to as Bayesian bridge regression, and clearly includes the Bayesian LASSO (q = 1, i.e. the Laplace prior) as a special case. The focus of that work is on the resulting point estimator for very small q, however, and the VBL case is not considered.

## 3 Enhanced model

This section focuses on enhancing the model by enabling sequential/online inference and hyper-parameter optimization.

### 3.1 Online Gaussian approximation to a sparsity posterior

In the following we focus the description on the VBEM formulation (22), (23), but note that analogous equations hold for (14). There are two distinct scenarios to consider here. First we will consider the case of moderate n and p, where the online method reproduces EM and VBEM exactly at a cost of O(p³) per iteration. The second case we will consider is that of very large p, and possibly p ≫ n, where it is necessary to impose an approximation in order to control the cost by O(p). The approximate approach is also amenable to n → ∞.

#### 3.1.1 Small/moderate p exact method

Suppose we are assimilating batches of size M, and denote

$$ \bar{X}_n = X_{nM}, \quad \bar{Y}_n = Y_{nM}, \quad \tilde{X}_n = [x_{(n-1)M+1}, \dots, x_{nM}]^T, \quad \tilde{Y}_n = [y_{(n-1)M+1}, \dots, y_{nM}]^T, $$

so that e.g. \bar{X}_n = [\tilde{X}_1^T, \dots, \tilde{X}_n^T]^T. Sequential batches are presented, but this may not always be a sensible choice and some permutation of the indices may make sense; see Remark 3.2. We can compute batch updates of the required matrices in (2) exactly with a total of at most O(Mp²) operations:

$$ \bar{X}_n^T\bar{X}_n = \bar{X}_{n-1}^T\bar{X}_{n-1} + \tilde{X}_n^T\tilde{X}_n, \quad \bar{X}_n^T\bar{Y}_n = \bar{X}_{n-1}^T\bar{Y}_{n-1} + \tilde{X}_n^T\tilde{Y}_n, \quad \bar{Y}_n^T\bar{Y}_n = \bar{Y}_{n-1}^T\bar{Y}_{n-1} + \tilde{Y}_n^T\tilde{Y}_n. $$

The cost may be smaller if the intermediate quantities are sparse (many zeros) or low-rank. We then have the precision directly, as in (23), and can proceed with iterating (22) and (23). In the worst-case scenario, the inversion required to compute (22) will incur a cost of O(p³), so one would aim to take M large enough to amortize this. We reiterate that the focus here is the case of small or moderate p. For large p, which will be discussed now, \bar{X}_n^T\bar{X}_n must be sparse or low-rank. In that case, inversion can be done approximately with a cost of as little as O(p). In case O(p²) is prohibitive for computation or memory, then online computation and storage of the component matrices must be controlled as well. This is discussed further in the following section.
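The exact online accumulation above amounts to running sums of Gram-type matrices, so the full design never needs to be held in memory. A minimal sketch (illustrative sizes, not the paper's code):

```python
import numpy as np

# Sketch of the exact online accumulation in 3.1.1: the sufficient
# statistics X^T X and X^T Y are updated batch by batch.

rng = np.random.default_rng(6)
p, M, n_batches = 5, 20, 7
XtX = np.zeros((p, p))
XtY = np.zeros(p)
X_all, Y_all = [], []            # kept here only to check the result

for _ in range(n_batches):
    Xb = rng.normal(size=(M, p))
    Yb = rng.normal(size=M)
    XtX += Xb.T @ Xb             # O(M p^2) per batch
    XtY += Xb.T @ Yb             # O(M p)  per batch
    X_all.append(Xb)
    Y_all.append(Yb)

X = np.vstack(X_all)
Y = np.concatenate(Y_all)
assert np.allclose(XtX, X.T @ X)   # identical to the offline statistics
assert np.allclose(XtY, X.T @ Y)
```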

#### 3.1.2 Large p approximate method

In the case of large p and/or n, the problem is different. It is preferable to directly confront the monolithic problem (2) if possible, for example in case n is fixed and X_n is sufficiently sparse and/or low-rank to allow the direct use of an iterative Krylov-type solver [39, 62]. On the other hand, if this cannot be handled directly, then a sequential/online strategy can be adopted as follows. Suppose we have θ^*_{n-1} and \hat{X}_{n-1} such that

$$ (C^*_{n-1})^{-1} \approx \frac{1}{\gamma^2}\hat{X}_{n-1}^T\hat{X}_{n-1} + D(1/\theta^*_{n-1}). $$

The idea is to preserve a “rank-M + diagonal” type approximation, while effectively passing information forward in an online fashion. Recall equation (5), and define X_n = [\hat{X}_{n-1}^T, \tilde{X}_n^T]^T, with \hat{Y}_n the correspondingly augmented observation vector. The update which replaces (22) and (23) is given by

$$
\begin{aligned}
m_n^{t+1} &= m_{n-1} + D(\theta_n^{t+1}) X_n^T\big(\gamma^2 I_{2M} + X_n D(\theta_n^{t+1}) X_n^T\big)^{-1}\big(\hat{Y}_n - X_n m_{n-1}\big), \qquad (29)\\
C_n^{t+1} &= \Big(I_p - D(\theta_n^{t+1}) X_n^T\big(\gamma^2 I_{2M} + X_n D(\theta_n^{t+1}) X_n^T\big)^{-1} X_n\Big) D(\theta_n^{t+1}). \qquad (30)
\end{aligned}
$$

Finally we need \hat{X}_n such that \hat{X}_n^T \hat{X}_n \approx X_n^T X_n in order to proceed to the next iteration. This is achieved by (i) computing a reduced rank-M eigendecomposition X_n^T X_n \approx V \Lambda V^T, with Λ diagonal and V orthogonal, and (ii) defining \hat{X}_n = \Lambda^{1/2} V^T. This approximation in principle costs O(M²p), but with a memory cost of only O(Mp). All the steps above are summarized in Algorithm 2.
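The rank-M compression step can be sketched directly with a dense eigendecomposition (a toy illustration with assumed sizes; a large-p implementation would use an iterative eigensolver instead):

```python
import numpy as np

# Sketch of the "rank-M + diagonal" compression: replace the Gram part of
# the precision by its best rank-M approximation, stored via a factor
# Xhat with only M rows (O(M p) memory instead of O(p^2)).

rng = np.random.default_rng(7)
p, M = 50, 10
X = rng.normal(size=(2 * M, p))     # current stacked design, 2M rows
G = X.T @ X                         # Gram matrix, rank <= 2M

# Reduced rank-M eigendecomposition G ~ V L V^T, then Xhat = L^{1/2} V^T.
vals, vecs = np.linalg.eigh(G)      # ascending eigenvalues
order = np.argsort(vals)[::-1]      # descending
idx = order[:M]                     # M largest eigenpairs
Xhat = np.sqrt(vals[idx])[:, None] * vecs[:, idx].T   # (M, p) factor

G_approx = Xhat.T @ Xhat
# By Eckart-Young, the spectral-norm error of the best rank-M
# approximation equals the first discarded eigenvalue.
err = np.linalg.norm(G - G_approx, 2)
assert Xhat.shape == (M, p)
assert err <= vals[order[M]] + 1e-8
```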

###### Remark 3.1 (EnKF)

We note the similarity between the above procedure and a (square-root) EnKF [48] for the solution of (55) and (56), which proceeds with a low-rank (or low-rank plus diagonal) approximation of the covariance. In our scenario, the above framework is more natural and is expected to provide a better approximation. This is described further in Appendix D.

###### Remark 3.2 (Batching strategy)

In the offline scenario, where the data size is fixed and the sequential method is employed, the choice of batches is important. If the inputs are i.i.d. random samples then sequential batching is sensible, i.e. \tilde{X}_1 = [x_1, \dots, x_M]^T, \tilde{X}_2 = [x_{M+1}, \dots, x_{2M}]^T, etc. Otherwise, if there is structure in the inputs (e.g. in the context of inverse problems), then it makes more sense to use random sampling without replacement, or evenly spread-out batches (e.g. strided subsampling of the indices).

#### 3.1.3 Further discussion

Note that in practice one would hope that after some iterations θ_n^t will not be changing very much with the iterative re-weighting, and few inner updates will be required, if any. If the model is stationary, then one may also not need further updates at all. There are some other modifications which could be made along the way to further improve efficiency, such as thresholding, i.e. setting to zero those entries of θ which fall below some small ε > 0 (and similarly for β). Suppose that θ has essentially converged, and only a subset of the parameters are non-zero. We can then discard the zero-valued parameters, thereby either vastly speeding up the algorithm or making way for the inclusion of new parameters. Similar things have been done before; see e.g. [74].

All of the present technology is well-suited to an online scenario, where one assumes a fixed static problem but data arrives sequentially in time, and may continue indefinitely. If this model is meant to emulate a computer simulation, for example one which is called by another computer simulation for a particular value of the inputs, as in the common case of coupled multi-physics systems and whole device modelling, then one can decide whether the emulator is suitable for a given query input, for example by evaluating the uncertainty under the current approximation. If this is below a certain level then the model returns its prediction (and possibly also its uncertainty, if the requesting program is capable of handling uncertainty); otherwise the emulator requests a label from the computer simulation and is updated accordingly, as above. It may also be of interest in an offline scenario to build up a database of labeled data and revise the emulator as this is done. Such a task is called experimental design, and greedy or myopic sequential experimental design can be posed elegantly within the sequential framework above.

### 3.2 Learning hyperparameters

Here it will be assumed that the regularization parameter δ is fixed at some small value. However, the hyper-parameters λ and γ² still need to be learned. We will consider several approaches.

The first approach is only applicable to VBEM and is a version of an iterated nested VBEM algorithm, whereby for each outer iteration τ the inner VBEM algorithm is as in (19) and Algorithm 1, and yields

$$ q^{t,\tau}(\beta \mid (\lambda,\gamma^2)^\tau) \to q^{*,\tau}(\beta \mid (\lambda,\gamma^2)^\tau) = N(m_n^{*,\tau}, C_n^{*,\tau}), $$

which is then itself used in an outer standard EM on (λ, γ²) to find

$$ (\lambda,\gamma^2)^{\tau+1} = \operatorname*{argmax}_{\lambda,\gamma^2} \int \log q^{*,\tau}(\beta, Y \mid (\lambda,\gamma^2))\; q^{*,\tau}(\beta \mid Y, (\lambda,\gamma^2)^\tau)\,d\beta. $$

The details of how this is done will be described in Section 3.2.2 below. The procedure is iterated until convergence. The second approach is a variant of the first, in which single steps of the outer and inner algorithms are executed alternately:

$$ (\lambda,\gamma^2)^{t+1} = \operatorname*{argmax}_{\lambda,\gamma^2} \int \log q^{t}(\beta, Y \mid (\lambda,\gamma^2))\; q^{t}(\beta \mid Y, (\lambda,\gamma^2)^t)\,d\beta. \qquad (36) $$

The third approach is applicable to both EM (i.e. the problem for the MAP estimator) and VBEM (i.e. the problem for the mean and covariance) independently; however, there is a philosophical disconnect if they return different optimal hyper-parameters, since then our MAP estimate and our moment estimates correspond to different posteriors. In this version we augment the variational distribution with an additional factor q(λ, γ²), which must be learned in an additional step after (19). Furthermore, we will constrain this distribution to a form in which the distribution on the parameters is Dirac at the maximizer, i.e.

 ((γ2)