Large Scale Variational Inference and Experimental Design for Sparse Generalized Linear Models

Many problems of low-level computer vision and image processing, such as denoising, deconvolution, tomographic reconstruction or super-resolution, can be addressed by maximizing the posterior distribution of a sparse linear model (SLM). We show how higher-order Bayesian decision-making problems, such as optimizing image acquisition in magnetic resonance scanners, can be addressed by querying the SLM posterior covariance, unrelated to the density's mode. We propose a scalable algorithmic framework, with which SLM posteriors over full, high-resolution images can be approximated for the first time, solving a variational optimization problem which is convex iff posterior mode finding is convex. These methods successfully drive the optimization of sampling trajectories for real-world magnetic resonance imaging through Bayesian experimental design, which has not been attempted before. Our methodology provides new insight into similarities and differences between sparse reconstruction and approximate Bayesian inference, and has important implications for compressive sensing of real-world images.

Authors

• 2 publications
• 18 publications
• Reconstruction-Aware Imaging System Ranking by use of a Sparsity-Driven Numerical Observer Enabled by Variational Bayesian Inference

It is widely accepted that optimization of imaging system performance sh...
05/14/2019 ∙ by Yujia Chen, et al. ∙ 0

• SRR-Net: A Super-Resolution-Involved Reconstruction Method for High Resolution MR Imaging

Improving the image resolution and acquisition speed of magnetic resonan...
04/13/2021 ∙ by Wenqi Huang, et al. ∙ 0

• Covariance-Free Sparse Bayesian Learning

Sparse Bayesian learning (SBL) is a powerful framework for tackling the ...
05/21/2021 ∙ by Alexander Lin, et al. ∙ 0

• Stochastic Variational Bayesian Inference for a Nonlinear Forward Model

Variational Bayes (VB) has been used to facilitate the calculation of th...
07/03/2020 ∙ by Michael A. Chappell, et al. ∙ 0

• Efficient variational inference in large-scale Bayesian compressed sensing

We study linear models under heavy-tailed priors from a probabilistic vi...
07/22/2011 ∙ by George Papandreou, et al. ∙ 0

• Memory-efficient Learning for Large-scale Computational Imaging

Computational imaging systems jointly design computation and hardware to...
12/11/2019 ∙ by Michael Kellman, et al. ∙ 0

• Large Scale Variational Bayesian Inference for Structured Scale Mixture Models

Natural image statistics exhibit hierarchical dependencies across multip...
06/27/2012 ∙ by Young Jun Ko, et al. ∙ 0

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Natural images have a sparse low-level statistical signature, represented in the prior distribution of a sparse linear model (SLM). Imaging problems such as reconstruction, denoising or deconvolution can successfully be solved by maximizing its posterior density (maximum a posteriori

(MAP) estimation), a convex program for certain SLMs, for which efficient solvers are available. The success of these techniques in modern imaging practice has somewhat shrouded their limited scope as Bayesian techniques: of all the information in the posterior distribution, they make use of the density’s mode only.

Consider the problem of optimizing image acquisition, our major motivation in this work. Magnetic resonance images are reconstructed from Fourier samples. With scan time proportional to the number of samples, a central question to ask is which sampling designs of minimum size still lead to MAP reconstructions of diagnostically useful image quality? This is not a direct reconstruction problem, the focus is on improving measurement designs to better support subsequent reconstruction. Goal-directed acquisition optimization cannot sensibly be addressed by MAP estimation, yet we show how to successfully drive it by Bayesian posterior information beyond the mode. Advanced decision-making of this kind needs uncertainty quantification (posterior covariance) rather than point estimation, requiring us to step out of the sparse reconstruction scenario and approximate sparse Bayesian inference instead.

The Bayesian inference relaxation we focus on is not new [11, 23, 17], yet when it comes to problem characterization or efficient algorithms, previous inference work lags far behind standards established for MAP reconstruction. Our contributions range from theoretical characterizations over novel scalable solvers to applications not previously attempted. The inference relaxation is shown to be a convex optimization problem if and only if this holds for MAP estimation (Section 3), a property not previously established for this or any other SLM inference approximation. Moreover, we develop novel scalable double loop algorithms to solve the variational problem orders of magnitude faster than previous methods we are aware of (Section 4). These algorithms expose an important link between variational Bayesian inference and sparse MAP reconstruction, reducing the former to calling variants of the latter few times, interleaved by Gaussian covariance (or PCA) approximations (Section 4.4). By way of this reduction, the massive recent interest in MAP estimation can play a role for variational Bayesian inference just as well. To complement these similarities and clarify confusion in the literature, we discuss computational and statistical differences of sparse estimation and Bayesian inference in detail (Section 5).

The ultimate motivation for novel developments presented here is sequential Bayesian experimental design (Section 6), applied to acquisition optimization for medical imaging. We present a powerful variant of adaptive compressive sensing, which succeeds on real-world image data where theoretical proposals for non-adaptive compressive sensing [9, 6, 10] fail (Section 6.1). Among our experimental results is part of a first successful study for Bayesian sampling optimization of magnetic resonance imaging, learned and evaluated on real-world image data (Section 7.4).

An implementation of the algorithms presented here is publicly available, as part of the glm-ie toolbox (Section 4.6). It can be downloaded from mloss.org/software/view/269/.

2 Sparse Bayesian Inference. Variational Approximations

In a sparse linear model (SLM), the image of pixels is unknown, and linear measurements are given, where in many situations of practical interest.

 y=Xu+ε,ε∼N(0,σ2I), (1)

where is the design matrix, and

is Gaussian noise of variance

, implying the Gaussian likelihood . Natural images are characterized by histograms of simple filter responses (derivatives, wavelet coefficients) exhibiting super-Gaussian (or sparse) form: most coefficients are close to zero, while a small fraction have significant sizes [35, 28] (a precise definition of super-Gaussianity is given in Section 2.1). Accordingly, SLMs have super-Gaussian prior distributions . The MAP estimator can outperform maximum likelihood , when represents an image.

In this paper, we focus on priors of the form , where . The potential functions are positive and bounded. The operator may contain derivative filters or a wavelet transform. Both and have to be structured or sparse, in order for any SLM algorithm to be scalable. Laplace (or double exponential) potentials are sparsity-enforcing:

 ti(si)=e−τi|si|,τi>0. (2)

For this particular prior and the Gaussian likelihood (1), MAP estimation corresponds to a quadratic program, known as LASSO [37] for . Note that is concave. In general, if log-concavity holds for all model potentials, MAP estimation is a convex problem. Another example for sparsity potentials are Student’s t:

 ti(si)=(1+(τi/ν)s2i)−(ν+1)/2,τi,ν>0. (3)

For these, is not concave, and MAP estimation is not (in general) a convex program. Note that is also known as Lorentzian penalty function.

2.1 Variational Lower Bounds

Bayesian inference amounts to computing moments of the posterior distribution

 P(u|y)=Z−1N(y|Xu,σ2I)∏qi=1ti(si),s=Bu,Z=∫N(y|Xu,σ2I)∏qi=1ti(si)du. (4)

This is not analytically tractable in general for sparse linear models, due to two reasons coming together: is highly coupled ( is not blockdiagonal) and non-Gaussian. We focus on variational approximations here, rooted in statistical physics. The log partition function (also known as log evidence or log marginal likelihood) is the prime target for variational methods [42]. Formally, the potentials are replaced by Gaussian terms of parametrized width, the posterior by a Gaussian approximation . The width parameters are adjusted by fitting to , in what amounts to the variational optimization problem.

Our variational relaxation exploits the fact that all potentials are (strongly) super-Gaussian: there exists a such that is even, and convex and decreasing as a function of [23]. We write , in the sequel. This implies that

 ti(si)=maxγi≥0ebisi−s2i/(2γi)−hi(γi)/2,hi(γi):=maxxi≥0(−xi/γi−2gi(xi)). (5)

A super-Gaussian has tight Gaussian-form lower bounds of all widths (see Figure 1). We replace by these lower bounds in order to step from to the family of approximations , where .

To establish (5), note that the extended-value function (assigning for ) is a closed proper convex function, thus can be represented by Fenchel duality [26, Sect. 12]: , where the conjugate function is closed convex as well. For and , we have that with , which implies that for . Also, , so that for any . Therefore, for any , . Reparameterizing , we have that (note that in order to accommodate in general, we have to allow for ). Finally, .

The class of super-Gaussian potentials is large. All scale mixtures (mixtures of zero-mean Gaussians ) are super-Gaussian, and can be written in terms of the density for [23]. Both Laplace (2) and Student’s t potentials (3) are super-Gaussian, with given in Appendix A.6. Bernoulli potentials, used as binary classification likelihoods,

 ti(si)=(1+e−yiτisi)−1,yi∈{±1},τi>0 (6)

are super-Gaussian, with [17, Sect. 3.B]. While the corresponding cannot be determined analytically, this is not required in our algorithms, which can be implemented based on and its derivatives only. Finally, if the are super-Gaussian, so is the positive mixture , , because the logsumexp function [4, Sect. 3.1.5] is strictly convex on and increasing in each argument, and the log-mixture is the concatenation of logsumexp with , the latter convex and decreasing for in each component [4, Sect. 3.2.4].

A natural criterion for fitting to is obtained by plugging these bounds into the partition function of (4):

 Z≥e−h(γ)/2∫N(y|Xu,σ2I)ebTs−12sTΓ−1sdu,h(γ):=∑qi=1hi(γi), (7)

where and . The right hand side is a Gaussian integral and can be evaluated easily. The variational problem is to maximize this lower bound w.r.t. the variational parameters ( for all ), with the aim of tightening the approximation to . This criterion can be interpreted as a divergence function: if the family of all contained the true posterior , the latter would maximize the bound.

This relaxation has frequently been used before [11, 23, 17] on inference problems of moderate size. In the following, we provide results that extend its scope to large scale imaging problems of interest here. In the next section, we characterize the convexity of the underlying optimization problem precisely. In Section 4, we provide a new class of algorithms for solving this problem orders of magnitude faster than previously used techniques.

3 Convexity Properties of Variational Inference

In this section, we characterize the variational optimization problem of maximizing the right hand side of (7). We show that it is a convex problem if and only if all potentials are log-concave, which is equivalent to MAP estimation being convex for the same model.

We start by converting the lower bound (7) into a more amenable form. The Gaussian posterior has the covariance matrix

 CovQ[u|y]=A−1,A:=σ−2XTX+BTΓ−1B,Γ=diagγ. (8)

We have that

 ∫P(y|u)ebTs−12sTΓ−1sdu=|2πA−1|1/2maxuP(y|u)ebTs−12sTΓ−1s,s=Bu,

since . We find that , where

 ϕ(γ):=log|A|+h(γ)+minuR(u,γ),R:=σ−2∥y−Xu∥2+sTΓ−1s−2bTs (9)

and . The variational problem is , and the Gaussian posterior approximation is with the final parameters plugged in. We will also write , so that .

It is instructive to compare this variational inference problem with maximum a posteriori (MAP) estimation:

 minu−2logP(u|y)=minuσ−2∥y−Xu∥2−2∑ilogti(si)+C2=minu,γ⪰0h(γ)+R(u,γ)+C2, (10)

where is a constant. The difference between these problems rests on the term, present in yet absent in MAP. Ultimately, this observation is the key to the characterization provided in this section and to the scalable solvers developed in the subsequent section. Its full relevance will be clarified in Section 5.

3.1 Convexity Results

In this section, we prove that is convex if all potentials are log-concave, with this condition being necessary in general. We address each term in (9) separately.

Theorem 1

Let , be arbitrary matrices, and

 ~A(d):=σ−2XTX+BT(diagd)B,d≻0,

so that is positive definite for all .

1. Let be twice continuously differentiable functions into , so that are convex for all and . Then, is convex. Especially, is convex.

2. Let be concave functions into . Then, is concave. Especially, is concave.

3. Let be concave functions into . Then, is concave. Especially, is concave.

4. Let be the approximate posterior with covariance matrix given by (8). Then, for all :

 VarQ[si|y]=δTiBA−1BTδi≤γi.

A proof is provided in Appendix A.1. Instantiating part (1) with , we see that is convex. Other valid examples are , . For , we obtain the convexity of , generalizing the logsumexp function to matrix values. Part (2) and part (3) will be required in Section 4.2. Finally, part (4) gives a precise characterization of as sparsity parameter, regulating the variance of .

Theorem 2

The function

 γ↦ϕ(γ)−h(γ)=log|A|+minu(σ−2∥y−Xu∥2+sTΓ−1s−2bTs)

is convex for , where .

Proof. The convexity of has been shown in Theorem 1(1). is convex in , and is jointly convex, since the quadratic-over-linear function is jointly convex for [4, Sect. 3.1.5]. Therefore, is convex for [4, Sect. 3.2.5]. This completes the proof.

To put this result into context, note that

 ϕ(γ)−h(γ)=−2log∫P(y|u)ebTs−12sTΓ−1sdu,s=Bu

is the negative log partition function of a Gaussian with natural parameters : it is well known that is a concave function [42]. However, is convex for a model with super-Gaussian potentials (recall that , where is convex as dual function of ), which means that in general need not be convex or concave. The convexity of this negative log partition function w.r.t.  seems specific to the Gaussian case.

Given Theorem 2, if all are convex, the whole variational problem is convex. With the following theorem, we characterize this case precisely.

Theorem 3

Consider a model with Gaussian likelihood (1) and a prior , , so that all are strongly super-Gaussian, meaning that is even, and is strictly convex and decreasing for .

1. If is concave and twice continuously differentiable for , then is convex. On the other hand, if for some , then is not convex at some .

2. If all are concave and twice continuously differentiable for , then the variational problem is a convex optimization problem. On the other hand, if for some and , then is not convex, and there exist some , , and such that is not a convex function.

The proof is given in Appendix A.2. Our theorem provides a complete characterization of convexity for the variational inference relaxation of Section 2, which is the same as for MAP estimation. Log-concavity of all potentials is sufficient, and necessary in general, for the convexity of either. We are not aware of a comparable equivalence having been established for any other nontrivial approximate inference method for continuous variable models.

We close this section with some examples. For Laplacians (2), (see Appendix A.6). For SLMs with these potentials, MAP estimation is a convex quadratic program. Our result implies that variational inference is a convex problem as well, albeit with a differentiable criterion. Bernoulli potentials (6

) are log-concave. MAP estimation for generalized linear models with these potentials is known as penalized logistic regression, a convex problem typically solved by the iteratively reweighted least squares (IRLS) algorithm. Variational inference for this model is also a convex problem, and our algorithms introduced in Section

4 make use of IRLS as well. Finally, Student’s t potentials (3) are not log-concave, and is neither convex nor concave (see Appendix A.6). Neither MAP estimation nor variational inference are convex in this case.

Convexity of an algorithm is desirable for many reasons. No restarting is needed to avoid local minima. Typically, the result is robust to small perturbations of the data. These stability properties become all the more important in the context of sequential experimental design (see Section 6), or when Bayesian model selection111

Model selection (or hyperparameter learning) is not discussed in this paper. It can be implemented easily by maximizing the lower bound

w.r.t. hyperparameters. is used. However, the convexity of does not necessarily imply that the minimum point can be found efficiently. In the next section, we propose a class of algorithms that solve the variational problem for very large instances, by decoupling the criterion (9) in a novel way.

4 Scalable Inference Algorithms

In this section, we propose novel algorithms for solving the variational inference problem in a scalable way. Our algorithms can be used whether

is convex or not, they are guaranteed to converge to a stationary point. All efforts are reduced to well known, scalable algorithms of signal reconstruction and numerical mathematics, with little extra technology required, and no additional heuristics or step size parameters to be tuned.

We begin with the special case of log-concave potentials , such as Laplace (2) or Bernoulli (6), extending our framework to full generality in Section 4.1. The variational inference problem is convex in this case (Theorem 3). Previous algorithms for solving [11, 23] are of the coordinate descent type, minimizing w.r.t. one at a time. Unfortunately, such algorithms cannot be scaled up to imaging problems of interest here. An update of depends on the marginal posterior , whose computation requires the solution of a linear system with matrix . At the projected scale, neither nor a decomposition thereof can be maintained, systems have to be solved iteratively. Now, each of the potentials has to be visited at least once, typically several times. With , , and in the hundred thousands, it is certainly infeasible to solve linear systems. In contrast, the algorithms we develop here often converge after less than hundred systems have been solved. We could also feed and its gradient into an off-the-shelf gradient-based optimizer. However, as already noted in Section 3, is the sum of a standard penalized least squares (MAP) part and a highly coupled, computationally difficult term. The algorithms we propose take account of this decomposition, decoupling the troublesome term in inner loop standard form problems which can be solved by any of a large number of specialized algorithms not applicable to . The expensive part of has to be computed only a few times for our algorithms to converge.

We make use of a powerful idea known as double loop or concave-convex

algorithms. Special cases of such algorithms are frequently used in machine learning, computer vision, and statistics: the expectation-maximization (EM) algorithm

[8], variational mean field Bayesian inference [1], or CCCP for discrete approximate inference [48], among many others. The idea is to tangentially upper bound by decoupled functions which are much simpler to minimize than itself: algorithms iterate between refitting to and minimizing . For example, in the EM algorithm for maximizing a log marginal likelihood, these stages correspond to “E step” and “M step”: while the criterion could well be minimized directly (at the expense of one “E step” per criterion evaluation), “M step” minimizations are much easier to do.

As noted in Section 3, if the variational criterion (9) lacked the part, it would correspond to a penalized least squares MAP objective (10), and simple efficient algorithms would apply. As discussed in Section 4.4, evaluating or its gradient are computationally challenging. Crucially, this term satisfies a concavity property. As shown in Section 4.2, Fenchel duality implies that . For any fixed , the upper bound is tangentially tight, convex in , and decouples additively. If is replaced by this upper bound, the resulting objective is of the same decoupled penalized least squares form than a MAP criterion (10). This decomposition suggests a double loop algorithm for solving . In inner loop minimizations, we solve for fixed , and in interjacent outer loop updates, we refit .

The MAP estimation objective (10) and have a similar form. Specifically, recall that , where and . The inner loop problem is

 minu,γ⪰0ϕz1(u,γ)=minuσ−2∥y−Xu∥2−2∑qi=1(gi(z1,i+s2i)+bisi), (11)

where . This is a smoothed version of the MAP estimation problem, which would be obtained for . However, in our approximate inference algorithm at all times (see Section 4.2). Upon inner loop convergence to , , where . Note that in order to run the algorithm, the analytic form of need not be known. For Laplace potentials (2), the inner loop penalizer is , and .

Importantly, the inner loop problem (11) is of the same simple penalized least squares form than MAP estimation, and any of the wide range of recent efficient solvers can be plugged into our method. For example, the iteratively reweighted least squares (IRLS) algorithm [14], a variant of the Newton-Raphson method, can be used (details are given in Section 4.3). Each Newton step requires the solution of a linear system with a matrix of the same form as (8), the convergence rate of IRLS is quadratic. It follows from the derivation of (9) that once an inner loop has converged to , the minimizer is the mean of the approximate posterior for .

The rationale behind our algorithms lies in decoupling the variational criterion via a Fenchel duality upper bound, thereby matching algorithmic scheduling to the computational complexity structure of . To appreciate this point, note that in an off-the-shelf optimizer applied to , both and the gradient have to be computed frequently. In this respect, the coupling term proves by far more computationally challenging than the rest (see Section 4.4). This obvious computational difference between parts of is not exploited in standard gradient based algorithms: they require all of in each iteration, all of in every single line search step. As discussed in Section 4.4, computing to high accuracy is not feasible for models of interest here, and most off-the-shelf optimizers with fast convergence rates are very hard to run with such approximately computed criteria. In our algorithm, the critical part is recognized and decoupled, resulting inner loop problems can be solved by robust and efficient standard code, requiring a minimal effort of adaptation. Only at the start of each outer loop step, has to be refitted: (see Section 4.2), the computationally critical part of is required there. Fenchel duality bounding222 Note that Fenchel duality bounding is also used in difference-of-convex programming, a general framework to address non-convex, typically non-smooth optimization problems in a double loop fashion. In our application, is smooth in general and convex in many applications (see Section 3): our reasons for applying bound minimization are different. is used to minimize the number of these costly steps (further advantages are noted at the end of Section 4.4). Resulting double loop algorithms are simple to implement based on efficient penalized least squares reconstruction code, taking full advantage of the very well-researched state of the art for this setup.

4.1 The General Case

In this section, we generalize the double loop algorithm along two directions. First, if potentials are not log-concave, the inner loop problems (11) are not convex in general (Theorem 3), yet a simple variant can be used to remedy this defect. Second, as detailed in Section 4.2, there are different ways of decoupling , giving rise to different algorithms. In this section, we concentrate on developing these variants, their practical differences and implications thereof are elaborated on in Section 5.

If is not log-concave, then is not convex in general (Theorem 3). In this case, we can write , where is concave and nondecreasing, is convex. Such a decomposition is not unique, and has to be chosen for each at hand. With hindsight, should be chosen as small as possible (for example, if is log-concave, the case treated above), and if IRLS is to be used for inner loop minimizations (see Section 4.3), should be twice continuously differentiable. For Student’s t potentials (3), such a decomposition is given in Appendix A.6. We define , , and modify outer loop updates by applying a second Fenchel duality bounding operation: , resulting in a variant of the inner loop criterion (11). If is differentiable, the outer loop update is , otherwise any element from the subgradient can be chosen (note that , as is nondecreasing). Moreover, as shown in Section 4.2, Fenchel duality can be employed in order to bound in two different ways, one employed above, the other being , . Combining these bounds (by adding to ), we obtain

 ϕ(γ,u)≤ϕz(u,γ):=zT1(γ−1)+zT2γ−zT3(logγ)+h∪(γ)+R(u,γ)−g∗(z),

where , and collects the offsets of all Fenchel duality bounds. Note that for , and for each , either and , or and . We have that

 ϕz(u):=minγ⪰0ϕz(u,γ)=σ−2∥y−Xu∥2+2∑qi=1h∗i(si)−2bTs,h∗i(si):=12minγi≥0(z1,i+s2i)/γi+z2,iγi−z3,ilogγi+h∪,i(γi),s=Bu. (12)

Note that is convex as minimum (over ) of a jointly convex argument [4, Sect. 3.2.5]. The inner loop minimization problem is of penalized least squares form and can be solved with the same array of efficient algorithms applicable to the special case (11). An application of the second order IRLS method is detailed in Section 4.3. A schema for the full variational inference algorithm is given in Algorithm 1.

The algorithms are specialized to the through and its derivatives. The important special case of log-concave has been detailed above. For Student’s t potentials (3), a decomposition is detailed in Appendix A.6. In this case, the overall problem is not convex, yet our double loop algorithm iterates over standard-form convex inner loop problems. Finally, for log-concave and (type B bounding, Section 4.2), our algorithm can be implemented generically as detailed in Appendix A.5.

We close this section establishing some characteristics of these algorithms. First, we found it useful to initialize them with constant and/or of small size, and with . Moreover, each subsequent inner loop minimization is started with from the last round. The development of our algorithms is inspired by the sparse estimation method of [43], relationships to which are discussed in Section 5. Our algorithms are globally convergent, a stationary point of is found from any starting point (recall from Theorem 3 that for log-concave potentials, this stationary point is a global solution). This is seen as detailed in [43]. Intuitively, at the beginning of each outer loop iteration, and have the same tangent plane at , so that each inner loop minimization decreases significantly unless . Note that this convergence proof requires that outer loop updates are done exactly, this point is elaborated on at the end of Section 4.4.

Our variational inference algorithms differ from previous methods333 This comment holds for approximate inference methods. For sparse estimation, large scale algorithms are available (see Section 5). in that orders of magnitude larger models can successfully be addressed. They apply to the particular variational relaxation introduced in Section 3, whose relationship to other inference approximations is detailed in [29]. While most previous relaxations attain scalability through many factorization assumptions concerning the approximate posterior, in our method is fully coupled, sharing its conditional independence graph with the true posterior . A high-level view on our algorithms, discussed in Section 4.4, is that we replace a priori independence (factorization) assumptions by less damaging low rank approximations, tuned at runtime to the posterior shape.

4.2 Bounding log|A|

We need to upper bound by a term which is convex and decoupling in . This can be done in two different ways using Fenchel duality, giving rise to bounds with different characteristics. Details for the development here are given in Appendix A.4.

Recall our assumption that for each . If , then is concave for (Theorem 1(2) with ). Moreover, is increasing and unbounded in each component of (Theorem 4). Fenchel duality [26, Sect. 12] implies that for , thus for . Therefore, . For fixed , this is an equality for

 z1,∗=∇γ−1log|A|=^z:=(VarQ[si|y])=diag−1(BA−1BT)≻0,

and . This is called bounding type A in the sequel.

On the other hand, is concave for (Theorem 1(3) with ). Employing Fenchel duality once more, we have that , . For any fixed , equality is attained at , and at this point. This is referred to as bounding type B.

In general, type A bounding is tighter for away from zero, while type B bounding is tighter for close to zero (see Figure 2), implications of this point are discussed in Section 5. Whatever bounding type we use, refitting the corresponding upper bound to requires the computation of : all marginal variances of the Gaussian distribution . In general, computing Gaussian marginal variances is a hard numerical problem, which is discussed in more detail in Section 4.4.

4.3 The Inner Loop Optimization

The inner loop minimization problem is given by (12), its special case (11) for log-concave potentials and bounding type A is given by . This problem is of standard penalized least squares form, and a large number of recent algorithms [12, 3, 46] can be applied with little customization efforts. In this section, we provide details about how to apply the iteratively reweighted least squares (IRLS) algorithm [14], a special case of the Newton-Raphson method.

We describe a single IRLS step here, starting from . Let

denote the residual vector. If

, , then

 ∇u(ϕz/2)=σ−2XTr+BTθ,∇∇u(ϕz/2)=σ−2XTX+BT(diagρ)B.

Note that by the convexity of . The Newton search direction is

 d:=−(σ−2XTX+BT(diagρ)B)−1(σ−2XTr+BTθ).

The computation of requires us to solve a system with a matrix of the same form as , a reweighted least squares problem otherwise used to compute the means in a Gaussian model of the structure of . We solve these systems approximately by the (preconditioned) linear conjugate gradients (LCG) algorithm [13]. The cost per iteration of LCG is dominated by matrix-vector multiplications (MVMs) with , , and . A line search along can be run in negligible time. If , then , where is the gradient at . With , , and precomputed, and can be evaluated in without any further MVMs. The line search is started with . Finally, once is found, is explicitly updated as . Note that at this point, , which follows from the derivation at the beginning of Section 3.

4.4 Estimation of Gaussian Variances

Variational inference does require marginal variances of the Gaussian (see Section 4.2), which are much harder to approximate than means. In this section, we discuss a general method for (co)variance approximation. Empirically, the performance of our double loop algorithms is remarkably robust in the light of substantial overall variance approximation errors, some insights into this finding are given below.

Marginal posterior variances have to be computed in any approximate Bayesian inference method, while they are not required in typical sparse point estimation techniques (see Section 5). Our double loop algorithms reduce approximate inference to point estimation and Gaussian (co)variance approximation. Not only do they expose the latter as missing link between sparse estimation and variational inference, their main rationale is that Gaussian variances have to be computed a few times only, while off-the-shelf variational optimizers query them for every single criterion evaluation.

Marginal variance approximations have been proposed for sparsely connected Gaussian Markov random fields (MRFs), iterating over embedded spanning tree models [41] or exploiting rapid correlations decay in models with homogeneous prior [20]. In applications of interest here, is neither sparse nor has useful graphical model structure. Committing to a low-rank approximation of the covariance [20, 27]

, an optimal choice in terms of preserving variances is principal components analysis (PCA), based on the smallest eigenvalues/-vectors of

(the largest of ). The Lanczos algorithm [13] provides a scalable approximation to PCA and was employed for variance estimation in [27]. After iterations, we have an orthonormal basis