1 Introduction
Regression is a quintessential and ubiquitous task of machine learning. The simplest method one can use to solve regression tasks is a linear model with Gaussian noise and prior
[11]. The most attractive feature of linear Gaussian models is analytical tractability, from both frequentist and Bayesian viewpoints. However, once one employs basis function expansions, such models also become quite flexible. There are numerous methods in which linear models can be embedded, such as total least squares [75] and mixtures of regressions [45], for example. Sparsity-promoting priors have proven to be very useful for identifying useful features and avoiding overfitting, perhaps most notably the Lasso [68] and its incarnation as total variation (TV) regularization in imaging [71, 66]. However, as soon as a non-Gaussian prior is introduced, analytical tractability is lost and, in particular, the Bayesian solution becomes very expensive [56], requiring Markov chain Monte Carlo (MCMC) methods
[60, 36]. Furthermore, sparsity-promoting priors are not differentiable, which prevents the use of simple Gaussian approximations such as the Laplace approximation [7].

Here we introduce variational approximations of the posterior associated with sparsity priors given by scale mixtures of Normals, with a generalized inverse Gaussian mixing distribution. This includes the variational Bayesian Lasso (VBL), a principled Gaussian approximation to the Bayesian Lasso presented in [56] (see Remark 2.4 for discussion of some recent related work which we discovered after writing this paper). It is derived through a variational Bayesian expectation maximization (VBEM) method [6, 4], which requires only the solution of (unconstrained) linear systems, and provides approximations of the mean and covariance of the target. Additionally, in parallel we perform classical expectation maximization (EM) [24] to obtain the maximum a posteriori (MAP) estimator.
The approach presented here provides fast uncertainty quantification (UQ) in the context of sparse regression and linear inverse problems. The key points are:

Online Bayesian solution means that data arriving indefinitely can be learned from with a fixed memory cost and compute cost per update, where is the width of the design matrix (see below). This trivially reduces to the standard Kalman filter [44] in the case of a Gaussian prior, and is hence exact (the compute cost also reduces to in this case). Otherwise it provides the exact MAP estimate (for a convex prior) and a variational approximation to the posterior, with little more than a pair of parallel Kalman filters. We note that the results presented are constrained to the case of regression, but will be extended to nontrivial dynamics in future work. 
Adaptive online learning of hyperparameters via EM strategies provides improved accuracy at a marginal additional cost.

A further “low-rank + diagonal” approximation can provide a reduction in cost to in case is prohibitively large, in a similar spirit to the ensemble Kalman filter (EnKF) [29].
The rest of the paper is organized as follows. The basic model is introduced in Section 2: after introducing linear models in 2.1, sparsity priors in 2.2, the Bayesian formulation in 2.3, EM in 2.4, and VBEM in 2.5, this culminates in our Gaussian approximation of sparsity-promoting posteriors and the VBL in 2.6. Further enhancements to the basic model are introduced in Section 3. In particular, the online version is introduced in 3.1, and hyperparameter tuning is considered in 3.2. Numerical results are presented in Section 4, including a small basic dataset relating to diabetes in 4.1, and a more computationally intensive total variation (TV) denoising/deblurring example in 4.2.
2 The basic model
In this section, some background and the basic model are introduced.
2.1 Linear models
Let ,with and (for simplicity), and let and
. Consider the following statistical model for Bayesian linear regression
(1) 
Online inference of corresponds to the case . If and , possibly depending on some parameters which are suppressed in the notation, then conditioned on , the posterior distribution is given in closed form as follows [11]
(2)  
(3) 
The mean minimizes the quadratic objective function
(4) 
By choosing the appropriate version above, the computation cost is , and the memory cost is . In case , and in particular if , then the Kalman filter [44] provides an exact online solution with a memory and (per-iteration) computation cost of . See Appendix D.
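As a concrete illustration of the closed-form formulas (2)-(3), the following sketch computes the posterior mean and covariance for an isotropic Gaussian prior. The names `A`, `sigma2`, `tau2` are illustrative placeholders, not the paper's exact (unrecoverable) notation.

```python
import numpy as np

def gaussian_posterior(A, y, sigma2, tau2):
    """Closed-form posterior for the linear-Gaussian model: likelihood
    y ~ N(A beta, sigma2*I), prior beta ~ N(0, tau2*I).  Illustrative
    notation; the paper's exact symbols were not recoverable."""
    d = A.shape[1]
    precision = A.T @ A / sigma2 + np.eye(d) / tau2   # posterior precision
    Sigma = np.linalg.inv(precision)                  # posterior covariance
    Sigma = 0.5 * (Sigma + Sigma.T)                   # enforce symmetry
    mu = Sigma @ (A.T @ y) / sigma2                   # posterior mean
    return mu, Sigma
```

The mean also minimizes the quadratic objective (4), so the gradient of the regularized least-squares objective vanishes at `mu`.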
A useful observation in what follows is that equation (2) can be rewritten recursively as
(5)  
where . The above is presented for the assimilation of one observation at a time, but it is easily modified to update a batch of observations at a time, which will be considered later. This observation is obviously not useful by itself, as it incurs a cost of per iteration, whereas the Kalman filter delivers updates in primal/covariance form (see (55)). However, in what follows, the precision will be required. The latter representation will facilitate an approximate update which relies on recursive “low-rank + diagonal” approximations of , in a similar spirit to the ensemble Kalman filter [15].
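The recursive rewriting (5) of the precision amounts to rank-one updates of two sufficient statistics; a minimal sketch, with illustrative names:

```python
import numpy as np

def precision_update(P, b, a, y_new, sigma2):
    """Assimilate one observation (a, y_new): rank-one update of the
    posterior precision P and of b = A^T y / sigma2, so that the current
    posterior mean solves P mu = b.  Illustrative names."""
    P_new = P + np.outer(a, a) / sigma2
    b_new = b + a * y_new / sigma2
    return P_new, b_new
```

After assimilating all rows, the accumulated precision and right-hand side agree with the monolithic quantities, which is what makes the online method exact in this setting.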
Linear models of the form are quite flexible, once one considers basis function expansions. In other words, for data and some functions , which can be a subset of polynomials [34, 20, 38], wavelets and other “lets” [19]
[14], random feature models [59], or any number of other choices. The book [7] provides a concise and easy-to-read summary for regression applications. In fact, there are complete bases for many function spaces. For example, if then the Fourier series forms a complete basis for [33], and a subset of such features can therefore be used to construct a convergent approximation. In fact, since radial basis function kernel operators are diagonalized by the Fourier basis, the expectation of the product of two such features with a random frequency is equal to the kernel evaluation. Monte Carlo approximation of such expectations is the basis of the random feature models [59].

An issue is how many terms to include, and, perhaps more importantly, how to select a subset of the important terms from a sufficiently rich set without incurring a loss in accuracy due to overfitting, as can occur with too much flexibility. The latter issue is often referred to as the “bias-variance tradeoff”: a model which is too flexible (negligible bias) may be too strongly influenced by a particular data set, hence incurring a large variance over a range of different data sets [32]. This well-known issue can be dealt with by beginning with a sufficiently rich class of approximating functions (e.g. a large enough basis) and then introducing prior assumptions, or regularization, in order to let the model select the most parsimonious representation [69, 40, 68, 54, 53, 67].

2.2 Sparsity priors
In the context of prior selection, the Gaussian assumption is often considered too restrictive. In particular, it became very popular at the end of the last millennium and the beginning of this one to utilize sparsity-promoting priors. In this case the Gaussian, whose quadratic density gives rise to the second (Tikhonov regularization) term in equation (4), is replaced with another density of the form , where for example , for , , and [68, 19, 25, 56]. This is often extended to the case , which corresponds to counting measure on the nonzero elements . As we will see, more general sparsity-promoting priors are also possible. Note that if there is a such that , then one can always redefine and . Therefore, we assume without too much loss of generality that . This will be discussed further in the examples.
There is a computational burden to performing inference with these more exotic priors, even in the case when we abandon UQ and settle for a MAP estimator. In the best case of we have a convex optimization problem, which can be solved efficiently by a number of modern methods, such as iterative soft thresholding [22, 13] and alternating direction method of multipliers [31, 12]. These iterative methods incur the same cost as (2) at each iteration.
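As an illustration of the first of these methods, here is a minimal iterative soft-thresholding (ISTA) sketch for the Lasso MAP problem, written in a standard parametrization that may differ from the paper's:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||.||_1 (elementwise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Iterative soft-thresholding for min 0.5*||A b - y||^2 + lam*||b||_1.
    A standard sketch, not the paper's exact formulation."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ beta - y)        # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```

Each iteration costs one multiplication by the design matrix and its transpose, i.e. the same order of cost as one evaluation of (2), consistent with the remark above.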
Considering the full Bayesian posterior, the situation is even more daunting. Indeed, once one adopts such a sparsity prior, the posterior is no longer characterized in closed form with finitely many parameters, as in the Gaussian case (in the finite/discrete/parametric case). The Laplace approximation [7] requires the derivative and Hessian of the log-posterior, which may not exist. Computationally intensive methods such as Markov chain Monte Carlo (MCMC) [60] are required for consistent estimation, as proposed for the Bayesian Lasso in [56]. There has been a lot of activity in this direction in the past 10 years; see, e.g., [49, 57, 58, 50, 16, 72, 9]. Here we propose to employ a variational approach to recover the best Gaussian approximation to the sparsity-promoting posterior, in a sense to be defined precisely below. It is worth mentioning that the Bayesian formulation allows for more general priors than the continuous, or “shrinkage”, priors considered here. In recent years “spike and slab” priors [51] have become very popular, consisting of a mixture distribution with a Dirac mass at 0 (the spike) and a continuous distribution such as the ones considered here (the slab). See [5] for a recent review, focused on the spike and slab LASSO.
2.3 Bayesian formulation of sparsity priors
Assume that the prior consists of independent random variables , each with Laplace distribution . The MAP estimator associated with this model corresponds to L1-regularized regression, or the LASSO [68]. It is well-known that the Laplace distribution can be expressed as a scale mixture of zero-mean Gaussians [2, 30, 61], where the mixing distribution follows a generalized inverse Gaussian law

(6)
with and , i.e.
(7) 
We consider priors defined by general scale mixtures of Normals with mixing distribution given by (6). This includes other sparsity promoting distributions such as Normal inverse Gaussian and Normal Gamma [70, 18, 37, 17]. Furthermore, the conditionals of the joint are known exactly
(8)  
(9) 
where are given by (2), with the diagonal matrix with on the diagonal, and . Therefore, Gibbs sampling can be used to sample from the monolithic posterior (2) [63, 56]. Sequential Monte Carlo methods [26, 52] have also been designed to sample from the full posterior sequentially in [63, 17], and a sequential expectation maximization (EM) [24] method has been used to approximate MAP estimates of in [18, 17].
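For concreteness, one sweep of such a Gibbs sampler for the Bayesian Lasso can be sketched as follows, using the standard Park-Casella conditionals; the paper's exact parametrization was not fully recoverable from the source, so the names and scalings here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(A, y, beta, sigma2, lam):
    """One sweep of the Bayesian Lasso Gibbs sampler in the Park-Casella
    parametrization beta_j | tau_j^2 ~ N(0, sigma2*tau_j^2):
    1/tau_j^2 | beta is inverse Gaussian, beta | tau^2 is Gaussian.
    A sketch from the standard conditionals, with illustrative names."""
    # sample 1/tau_j^2 | beta  (inverse Gaussian == Wald distribution)
    inv_tau2 = rng.wald(np.sqrt(lam**2 * sigma2 / beta**2), lam**2)
    # sample beta | tau^2, y from the Gaussian conditional, cf. (8)-(9)
    P = (A.T @ A + np.diag(inv_tau2)) / sigma2
    Sigma = np.linalg.inv(P)
    mu = Sigma @ (A.T @ y) / sigma2
    return rng.multivariate_normal(mu, Sigma)
```

Iterating the sweep yields a Markov chain whose stationary distribution is the joint posterior, at the cost of one linear solve per sweep.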
Remark 2.1
One may also keep an original so that and we have . This would marginally yield an elastic net prior for and : [61].
2.4 Expectation maximization
For the next sections we suppress and in the notation where convenient. Suppose we want to maximize
(10) 
where the inequality arises (for any probability density ) from an application of Jensen’s inequality, and here denotes a probability density. The expectation maximization (EM) algorithm [24] proceeds as follows. Define ,
(11) 
and let .
In our context this entails iteratively computing
(12) 
where we recall that is the diagonal matrix with on the diagonal. For example, in the case of and , is defined elementwise as
(13) 
The calculation of (12) is given in Appendix A, along with a slightly lengthier explanation of EM. Note that we have assumed but allowed in the hyperprior (6), which relaxes the marginal Laplace identity (7). The general form of (13) is given in equation (49), and the case is given in (39). One then has the iteration

(14)
These analytical calculations have been shown and used before in several works, including [30, 18, 37]. From this, one obtains the MAP estimator at convergence .
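A minimal sketch of this EM iteration in the Lasso (Laplace prior) case, where the E-step reduces to the weights lam/|beta_j| and the M-step is a reweighted ridge solve; the notation is illustrative, and the small `eps` guards against division by zero:

```python
import numpy as np

def em_lasso_map(A, y, lam, sigma2=1.0, n_iter=100, eps=1e-10):
    """EM for the Lasso MAP estimator: the E-step computes the expected
    inverse scales w_j = lam/|beta_j| (Laplace prior), the M-step solves
    the resulting reweighted ridge problem.  Illustrative sketch."""
    beta = np.linalg.lstsq(A, y, rcond=None)[0]     # initial guess
    for _ in range(n_iter):
        w = lam / (np.abs(beta) + eps)              # E-step: diagonal of D
        P = A.T @ A / sigma2 + np.diag(w)           # M-step precision
        beta = np.linalg.solve(P, A.T @ y / sigma2)
    return beta
```

By the usual majorization argument, each iteration can only decrease the MAP objective (up to the `eps` regularization).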
Remark 2.2
An iteratively reweighted least squares (IRLS) algorithm [41] can be derived in order to approximate regularization with by a sequence of problems with , based on the following observation:
The resulting iteration for is exactly as in (14), where is interpreted as a regularization parameter. It can be shown under appropriate assumptions that as , where is sparse if such a solution exists, and convergence is linear (exponentially fast) for sufficiently close to [23].
2.5 Variational Bayesian Expectation maximization
Here we propose to use the variational Bayesian expectation maximization (VBEM) algorithm, introduced in the context of graphical models in [4, 6]. We show how it works elegantly in our context to provide a Gaussian approximation to problems with sparsity priors, which is optimal in a certain sense. Suppose we return to (10), and this time multiply/divide by some density and integrate over as well. Then we have the evidence lower bound
(15) 
Conveniently, maximizing this with respect to the densities coincides with minimizing the KL divergence between this variational approximation and the joint posterior, i.e.
(16)  
The objective functions for each of and given the other are convex and can be minimized exactly, as observed in [4, 6], leading to the iterative algorithm
(17)  
Furthermore, it follows from the convexity of the intermediate targets that this gives a descent direction for (16). Observe that constraining to , where is the Dirac delta function and is the point of maximum probability above, yields the original EM algorithm. Also observe that (17) may itself be intractable in general, although it is shown in [6] that it simplifies somewhat for conjugate exponential models and may be analytically soluble. Fortunately, the present situation is the best case, where it is analytically soluble. Notice that the objective function (16) corresponds to an independence assumption between and ; however, from (17) it is clear that the solution solves a coupled system, and in fact probabilistic dependence is replaced with a deterministic dependence on each other’s summary statistics, as noted in [6].
2.6 Gaussian approximation to a sparsity promoting posterior
Following instead the variational Bayesian approach of Sec. 2.5 gives (17), i.e.
(19)  
where is used to (degenerately) denote expectation with respect to the iteration intermediate variational distribution, with respect to its argument, or . The first equation looks similar to the EM algorithm, but with the important difference
(20) 
where are the mean and covariance of (note the appearance of the variance instead of just ).
In turn this means that for the case we have
(21) 
The update equations are (e.g. in primal/monolithic form (2))
(22)  
(23) 
Here we can explicitly observe the deterministic dependence between the marginally optimal and distributions via each other’s summary statistics. The general form of the update (21) is given in (49), and the case is given in (39).
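Putting the updates together, a minimal sketch of the resulting iteration in the Lasso case might look as follows; note the posterior variance entering the weights, which is the only difference from the EM/MAP iteration. Names are illustrative:

```python
import numpy as np

def vbl(A, y, lam, sigma2=1.0, n_iter=100):
    """Variational Bayesian Lasso sketch: alternate the Gaussian factor
    q(beta) = N(mu, Sigma) with the expected inverse scales, which now
    involve sqrt(mu_j^2 + Sigma_jj) rather than |mu_j| alone.
    Illustrative notation, not the paper's exact symbols."""
    d = A.shape[1]
    mu, Sigma = np.zeros(d), np.eye(d)
    for _ in range(n_iter):
        w = lam / np.sqrt(mu**2 + np.diag(Sigma))   # E_q[1/tau_j^2]
        P = A.T @ A / sigma2 + np.diag(w)           # precision of q(beta)
        Sigma = np.linalg.inv(P)
        mu = Sigma @ (A.T @ y) / sigma2
    return mu, Sigma
```

The cost per iteration is one linear solve, the same order as the EM iteration, while the output additionally carries the Gaussian covariance used for UQ.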
Note that this algorithm runs for approximately the same cost as the former, and provides a Gaussian approximation of the posterior. The former provides an approximation of the MAP estimator . In the case we refer to this triple as the variational Bayesian LASSO (VBL). All of the above is summarized in Algorithm 1.

It is well-known that the mean of sparsity priors, such as TV-regularization, may not in fact promote sparsity [47]. Indeed, even the Bayesian LASSO point estimator [56], which returns the median instead of the mean, is not as sparse as the standard LASSO, i.e. its MAP estimator. In the context of UQ, one is more concerned with uncertainty measures, such as confidence intervals. For example, one may consider the sparse solution to be reasonable if, for an index such that , one has , i.e. the origin is within the confidence interval of the Bayesian marginal for those coordinates which are predicted to vanish. In general, one may flag as unusual any circumstances where . See Figure 1 for an illustration of the VBL applied to a simple example with (“exact” values are calculated with numerical quadrature on the domain with grid points in each direction).

Input: Design matrix , labels , parameters , initial guess , convergence criteria and distance function .

Specify functional forms and based on , as given in (49).

Set , and (all zeros).

While and ;

Compute and (arguments suppressed);

Compute
(24) (25) 
Compute
(26) (27) (28) 
t=t+1.

Output: .
Remark 2.3
Step (3) of Algorithm 1 requires a stopping criterion. It is sensible, for example, to let this be determined by the misfit or . In the first case, a good choice may be , for . We will see that the algorithm returns a good estimator very quickly, but may take a long time to converge, and may even return a worse estimator at convergence. Therefore, a maximum number of iterations is often also employed as an alternative in practice.
Remark 2.4
The VBEM has come to be known more generally as mean-field variational Bayes. See [10] for a recent review, where the quite natural name “coordinate ascent variational inference” is given to the iterations (19). After completing this work, we made several discoveries. The special case of , referred to as automatic relevance determination [54], has already been introduced, independently, in [8] and [27]. Also, the work [3] considers variational inference in the context of the stable mixing distribution. This distribution cannot be explicitly represented in general, but it yields marginal priors of the form for , and leads to EM updates of the IRLS form with , as in Remark 2.2. This is referred to as Bayesian bridge regression, and clearly includes the Bayesian LASSO (, i.e. ) as a special case. The focus of that work is on the resulting point estimator for very small , however, and the VBL case is not considered.
3 Enhanced model
This section focuses on enhancing the model by enabling sequential/online inference and hyperparameter optimization.
3.1 Online Gaussian approximation to a sparsity posterior
In the following we focus the description on the VBEM formulation (22), (23), but note that analogous equations hold for (14). There are two distinct scenarios to consider here. First we will consider the case of moderate and , or in particular , where the online method reproduces EM and VBEM exactly at a cost of per iteration. The second case we will consider is that of very large , and possibly , where it is necessary to impose an approximation in order to control the cost by . The approximate approach is also amenable to .
3.1.1 Small/moderate exact method
Suppose we are assimilating batches of size , and denote
so that e.g. . Sequential batches are presented, but this may not always be a sensible choice, and some permutation of the indices may make sense; see Remark 3.2. We can compute batch updates of the required matrices in (2) exactly with a total of at most operations:
The cost may be smaller if the intermediate quantities are sparse (many zeros) or low-rank. We have the following equation for the precision and can proceed directly with iterating (22) and (23). In the worst-case scenario, the inversion required to compute (22) will incur a cost of , so one would aim to take . We reiterate that the focus here is the case . For large , which will be discussed now, must be sparse or low-rank. In this case, inversion can be done approximately with a cost of as little as . In case is prohibitive for computation or memory, then online computation and storage of the component matrices must be controlled as well. This is discussed further in the following section.
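The batch accumulation of the required matrices can be sketched as follows; the names are illustrative, and the point is that only the small sufficient statistics, never the full design matrix, need be held in memory:

```python
import numpy as np

def accumulate_batches(batches, d):
    """Accumulate the sufficient statistics A^T A and A^T y over batches
    (A_b, y_b) of rows, so the full design matrix never needs to be held
    in memory.  Illustrative sketch."""
    AtA = np.zeros((d, d))
    Aty = np.zeros(d)
    for A_b, y_b in batches:
        AtA += A_b.T @ A_b    # batch contribution to the precision
        Aty += A_b.T @ y_b    # batch contribution to the right-hand side
    return AtA, Aty
```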
3.1.2 Large approximate method
In the case of large and/or , the problem is different. It is preferable to directly confront the monolithic problem (2) if possible, for example in the case that is fixed and is sufficiently sparse and/or low-rank to allow the direct use of an iterative Krylov-type solver [39, 62]. On the other hand, if this cannot be handled directly, then a sequential/online strategy can be adopted as follows. Suppose we have s.t. , so that
The idea is to preserve a “rank + diagonal” type approximation, while effectively passing information forward in an online fashion. Recall equation (5) and define
The update which replaces (22) and (23) is given by
(29)  
(30) 
Finally we need s.t. in order to proceed to the next iteration. This is achieved by (i) computing a reduced rank-M eigendecomposition , with diagonal and orthogonal, and (ii) defining . This approximation in principle costs , but with a memory cost of only . All the steps above are summarized in Algorithm 2.
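The rank-M compression step can be sketched as follows, under the assumption (the exact splitting was not fully recoverable from the source) that the remainder after removing the top-M eigenpairs is approximated by its diagonal:

```python
import numpy as np

def low_rank_plus_diag(P, M):
    """Compress a symmetric PSD matrix P into U U^T + diag(dvec), where
    U U^T is built from the top-M eigenpairs and dvec is the diagonal of
    the remainder.  A sketch of the idea, with illustrative names."""
    evals, evecs = np.linalg.eigh(P)
    idx = np.argsort(evals)[::-1][:M]            # top-M eigenpairs
    U = evecs[:, idx] * np.sqrt(evals[idx])      # rank-M factor
    dvec = np.diag(P - U @ U.T)                  # diagonal of the remainder
    return U, dvec
```

A convenient property of this splitting is that the diagonal of the compressed matrix matches the diagonal of the original exactly, so the marginal precisions are preserved; when P has rank at most M, the compression is exact.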
Remark 3.1 (EnKF)
We note the similarity between the above procedure and a (square-root) EnKF [48] for the solution of (55) and (56), which proceeds with a low-rank (or low-rank plus diagonal) approximation of the covariance . In our scenario, the above framework is more natural and is expected to provide a better approximation. This is described further in Appendix D.
Remark 3.2 (Batching strategy)
In the offline scenario, where the data size is fixed and the sequential method is employed, the choice of batches is important. If the inputs are i.i.d. random samples then sequential batching is sensible, i.e. , , etc. Otherwise, if there is structure in the inputs (e.g. in the context of inverse problems), then it makes more sense to use random sampling without replacement or evenly spread-out batches , , etc., where .
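The evenly spread-out batching rule can be sketched as follows, with an illustrative stride S = N // B:

```python
import numpy as np

def strided_batches(N, B):
    """Evenly spread-out batches: with stride S = N // B, batch j collects
    the indices j, j+S, j+2S, ...  This yields S batches of size B when S
    divides N.  Illustrative sketch of the batching rule."""
    S = N // B
    return [np.arange(j, N, S) for j in range(S)]
```

For structured inputs (e.g. pixels of an image), each such batch covers the index range evenly rather than concentrating on one contiguous block.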
Input: Design matrix , labels (possibly infinite and arriving online), parameters , initial guess , inner convergence criteria , distance function , batch size and rule for batching (see remark 3.2).

For

Set , , ,

While and ;

Compute and (arguments suppressed);

Compute
(31) (32) 
Compute
(33) (34) (35) 
t=t+1.


Compute rank approximation , and set .

Output: , at any time (or rank version of the latter).
3.1.3 Further discussion
Note that in practice one would hope that after some iterations will not be changing very much with the iterative reweighting, and few inner updates will be required, if any. If the model is stationary, then one may also not need to allow . There are some other modifications which could be made along the way to further improve efficiency, such as thresholding, i.e. , for small , where is the indicator function on the set , and it acts elementwise on the entries of (and similarly for ). Suppose that has essentially converged, and parameters are nonzero. We can then discard the -valued parameters, thereby either vastly speeding up the algorithm or making way for the inclusion of new parameters. Similar things have been done before; see e.g. [74].
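The thresholding/pruning step can be sketched as follows, with an illustrative small threshold `delta`:

```python
import numpy as np

def prune(mu, Sigma, delta=1e-6):
    """Discard coordinates whose mean is within delta of zero, restricting
    subsequent linear algebra to the active set.  Illustrative sketch."""
    keep = np.abs(mu) > delta
    return mu[keep], Sigma[np.ix_(keep, keep)], keep
```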
All of the present technology is well-suited to an online scenario, where one assumes a fixed static problem but data arrives sequentially in time, and may continue indefinitely. If this model is meant to emulate a computer simulation, for example one which is called by another computer simulation for a particular value of the inputs, as in the common case of coupled multiphysics systems and whole device modelling, then one can decide whether the emulator is suitable for a given query input, for example by evaluating the uncertainty under the current approximation . If this is below a certain level then the model returns (and possibly also , if the requesting program is capable of handling uncertainty); otherwise the emulator requests a label from the computer simulation and is updated accordingly, as above. It may also be of interest in an offline scenario to build up a database of labeled data and revise the emulator as this is done. Such a task is called experimental design, and greedy or myopic sequential experimental design can be posed elegantly within the sequential framework above.
3.2 Learning hyperparameters
Here it will be assumed that the regularization parameter is fixed at some small value, e.g. . However, the hyperparameters and still need to be learned. We will consider several approaches.
The first approach is only applicable to VBEM for and is a version of an iterated nested VBEM algorithm, whereby for each the inner VBEM algorithm is as in (19) and Algorithm 1, and yields
which is then itself used in an outer standard EM on to find
The details of how this is done will be described in Section 3.2.2 below. The procedure is iterated until convergence. The second approach is a variant of the first, in which single steps of the outer and inner algorithms are executed iteratively:
(36) 
The third approach is applicable to both EM (i.e. the problem for ) and VBEM (i.e. the problem for ) independently; however, there is a philosophical disconnect if they return different optimal hyperparameters, since then our MAP estimate and our moment estimates correspond to different posteriors. In this version we augment the variational distribution with an additional factor , which must be learned in an additional step after (19). Furthermore, we will constrain the distribution to the form , where the distribution on the parameters is a Dirac at the maximizer, i.e.