 # A Bayesian Conjugate Gradient Method

A fundamental task in numerical computation is the solution of large linear systems. The conjugate gradient method is an iterative method which offers rapid convergence to the solution, particularly when an effective preconditioner is employed. However, for more challenging systems a substantial error can be present even after many iterations have been performed. The estimates obtained in this case are of little value unless further information can be provided about the numerical error. In this paper we propose a novel statistical model for this numerical error, set in a Bayesian framework. Our approach is a strict generalisation of the conjugate gradient method, which is recovered as the posterior mean for a particular choice of prior. The estimates obtained are analysed with Krylov subspace methods and a contraction result for the posterior is presented. The method is then analysed in a simulation study as well as being applied to a challenging problem in medical imaging.


## 1 Introduction

This paper presents an iterative method for the solution of systems of linear equations of the form

 A x_* = b (1)

where A ∈ ℝ^{d×d} is an invertible matrix and b ∈ ℝ^d is a vector, each given, while x_* ∈ ℝ^d is to be determined. The principal novelty of our method, in contrast to existing approaches, is that its output is a probability distribution over vectors which reflects knowledge about x_* after expending a limited amount of computational effort. This allows the output of the method to be used in a principled anytime manner, tailored to reflect a constrained computational budget. In a special case, the mode of this distribution coincides with the estimate provided by the standard conjugate gradient method, whilst the probability mass is proven to contract onto x_* as more iterations are performed.

Challenging linear systems arise in a wide variety of applications; of these, partial differential equations (PDEs) should be emphasised, as these arise frequently throughout the applied sciences and in engineering (Evans, 2010). Finite element and finite difference discretisations of PDEs each yield large, sparse linear systems which can sometimes be highly ill-conditioned, such as in the classically ill-posed backwards heat equation (Evans, 2010). Even for linear PDEs, a detailed discretisation may be required. This can result in a linear system with billions of degrees of freedom, requiring specialised algorithms to be solved even approximately in practice (e.g. Reinarz et al., 2018). Another example arises in computation with Gaussian measures (Bogachev, 1998; Rasmussen, 2004), in which analytic covariance functions, such as the exponentiated quadratic, give rise to challenging linear systems. This has an impact in a number of related fields, such as symmetric collocation solution of PDEs (Fasshauer, 1999; Cockayne et al., 2016), numerical integration (Larkin, 1972; Briol et al., 2018) and generation of spatial random fields (Besag and Green, 1993; Parker and Fox, 2012; Schäfer et al., 2017). In the latter case, large linear systems must often be solved to sample from these fields, such as in models of tropical ocean surface winds (Wikle et al., 2001) where systems may again be billion-dimensional. Thus, it is clear that there exist many important situations in which error in the solution of a linear system cannot practically be avoided.
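As a concrete illustration of the last point, the following sketch (not from the paper; the length-scale and grid are arbitrary choices of ours) builds a dense exponentiated-quadratic Gram matrix and examines its condition number:

```python
import numpy as np

def gram_matrix(points, ell=1.0):
    """Dense exponentiated-quadratic Gram matrix for 1-d inputs."""
    diff = points[:, None] - points[None, :]
    return np.exp(-diff**2 / (2.0 * ell**2))

pts = np.linspace(0.0, 1.0, 50)
K = gram_matrix(pts)
# Even at n = 50 the Gram matrix is numerically close to singular,
# so solving K z = y accurately is genuinely challenging.
cond = np.linalg.cond(K)
```

Under these (illustrative) settings the condition number vastly exceeds what standard solvers handle comfortably, which is the phenomenon the text describes.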

### 1.1 Linear Solvers

The solution of linear systems is one of the most ubiquitous problems in numerical analysis, and Krylov subspace methods (Hestenes and Stiefel, 1952; Liesen and Strakos, 2012) are among the most successful at obtaining an approximate solution at low cost. Krylov subspace methods belong to the class of iterative methods (Saad, 2003), which construct a sequence (x_m) that approaches x_* and can be computed in an efficient manner. Iterative methods provide an alternative to direct methods (Davis, 2006; Allaire and Kaber, 2008) such as the LU or Cholesky decomposition, which generally incur higher cost, as termination of the algorithm after m < d iterations is not meaningful. In certain cases an iterative method can produce an accurate approximation to x_* with reduced computational effort and memory usage compared to a direct method.

The conjugate gradient (CG) method (Hestenes and Stiefel, 1952) is a popular iterative method, and perhaps the first instance of a Krylov subspace method. The error arising from CG can be shown to decay exponentially in the number of iterations, but convergence is slowed when the system is poorly conditioned. As a result, there is interest in solving equivalent preconditioned systems (Allaire and Kaber, 2008), either by solving P^{-1}A x_* = P^{-1}b (left-preconditioning) or A P^{-1} z = b with x_* = P^{-1}z (right-preconditioning), where the preconditioner P is chosen both so that P^{-1}A (or A P^{-1}) has a lower condition number than A itself, and so that solving systems of the form P z = c is computationally inexpensive for arbitrary right-hand-sides c. Effective preconditioning can dramatically improve convergence of CG, and of Krylov subspace methods in general, and is recommended even for well-conditioned systems, owing to how rapidly conjugacy is lost in CG when implemented numerically. One reasonably generic method for sparse systems involves approximate factorisation of the matrix, through an incomplete LU or incomplete Cholesky decomposition (e.g. Ajiz and Jennings, 1984; Saad, 1994). Other common approaches exploit the structure of the problem. For example, in numerical solution of PDEs a coarse discretisation of the system can be used to construct a preconditioner for a finer discretisation (e.g. Bramble et al., 1990). A more detailed survey of preconditioning methods can be found in many standard texts, such as Benzi (2002) and Saad (2003). However, no approach is universal, and in general careful analysis of the structure of the problem is required to determine an effective preconditioner (Saad, 2003, p. 283). At worst, constructing a good preconditioner can be as difficult as solving the linear system itself.

In situations where numerical error cannot practically be made negligible, an estimate for the error must accompany the output of any linear solver. The standard approach is to analytically bound ∥x_m − x_*∥ by some function of the residual r_m = b − A x_m, for appropriate choices of norms, and then to monitor the decay of the relative residual ∥r_m∥/∥b∥. In implementations, the algorithm is usually terminated when this quantity reaches machine precision, which can require a very large number of iterations and substantial computational effort. This often constitutes the principal bottleneck in contemporary applications. The contribution of this paper is to demonstrate how Bayesian analysis can be used to develop a richer, probabilistic description of the error incurred in estimating the solution with an iterative method. From a user's perspective, this means that solutions from the presented method can still be used in a principled way, even when only a small number of iterations can be afforded.

### 1.2 Probabilistic Numerical Methods

The concept of a probabilistic numerical method dates back to Larkin (1972). The principal idea is that problems in numerical analysis can be cast as inference problems and are therefore amenable to statistical treatment. Bayesian probabilistic numerical methods (Cockayne et al., 2017) posit a prior distribution for the unknown, in our case x_*, and condition on a finite amount of information about x_* to obtain a posterior that reflects the level of uncertainty in x_*, given the finite information obtained. In contemporary applications, it is common for several numerical methods to be composed in a pipeline to perform a complex task. For example, climate models (such as Roeckner et al., 2003) involve large systems of coupled differential equations. To simulate from these models, many approximations must be combined. Bayesian probabilistic numerical methods are of particular interest in this setting, as a probabilistic description of error can be coherently propagated through the pipeline to describe the structure of the overall error and to study the contribution of each component of the pipeline to that error (Hennig et al., 2015). As many numerical methods rely on linear solvers such as the CG method, understanding the error incurred by these solvers is critical. Other work to recently highlight the value of statistical thinking in this application area includes Calvetti et al. (2018).

In recent work, Hennig (2015) treated the problem of solving Eq. (1) as an inference problem for the matrix A^{-1}, and established correspondence with existing iterative methods by selection of different matrix-valued Gaussian priors within a Bayesian framework. This approach was explored further in Bartels and Hennig (2016). There, it was observed that the posterior distribution over the matrix in Hennig (2015) produces the same factors as the LU or Cholesky decompositions (recall that the Cholesky decomposition is a symmetric version of the LU decomposition for symmetric positive-definite matrices). Our contribution takes a fundamentally different approach, in that a prior is placed on the solution x_* rather than on the matrix A^{-1}. There are advantages to the approach of Hennig (2015), in that solution of multiple systems involving the same matrix is trivial. However, we argue that it is more intuitive to place a prior on x_* than on A^{-1}, as one might more easily reason about the solution of a system than about the elements of the inverse matrix. Furthermore, the approach of placing a prior on x_* is unchanged by any left-preconditioning of the system, while the prior of Hennig (2015) is not preconditioner-invariant.

#### Contribution

The main contributions of this paper are as follows:

• The Bayesian conjugate gradient (BayesCG) method is proposed for solution of linear systems. This is a novel probabilistic numerical method in which both the prior and posterior are defined on the solution space of the linear system, ℝ^d. We argue that placing a prior on the solution space is more intuitive than in existing probabilistic numerical methods and corresponds more directly with classical iterative methods. This makes substitution of BayesCG for existing iterative solvers simpler for practitioners.

• The specification of the prior distribution is discussed in detail. Several natural prior covariance structures are introduced, motivated by preconditioners or Krylov subspace methods. In addition, a hierarchical prior is proposed in which all parameters can be marginalised, allowing automatic adjustment of the posterior to the scale of the problem. This discussion provides some generic prior choices to make application of BayesCG more straightforward for users unfamiliar with probabilistic numerical methods.

• It is shown that, for a particular choice of prior, the posterior mode of BayesCG coincides with the output of the standard CG method. An explicit algorithm is provided whose complexity is shown to be a small constant factor larger than that of the standard CG method. Thus, BayesCG can be efficiently implemented and could be used in place of classical iterative methods with marginal increase in computational cost.

• A thorough convergence analysis for the new method is presented, with computational performance in mind. It is shown that the posterior mean lies in a particular Krylov subspace, and rates of convergence for the mean and contraction for the posterior are presented. The distributional quantification of uncertainty provided by this method is shown to be conservative in general.

The structure of the paper is as follows: In Section 2 BayesCG is presented and its inputs discussed. Its correspondence with CG is also established for a particular choice of prior. Section 3 demonstrates that the mean from BayesCG lies in a particular Krylov subspace and presents a convergence analysis of the method. In Section 4 the critical issue of prior choice is addressed. Several choices of prior covariance are discussed and a hierarchical prior is introduced to allow BayesCG to adapt to the scale of the problem. Section 5 contains implementation details, while in Section 6 the method is applied to a challenging problem in medical imaging which requires repeated solution of a linear system arising from the discretisation of a PDE. The paper concludes with a discussion in Section 7. Proofs of all theoretical results are provided in the electronic supplement.

## 2 Methods

We begin in Section 2.1 by defining a Bayesian probabilistic numerical method for the linear system in Eq. (1). In Section 2.2 a correspondence to the CG method is established. In Section 2.3 we discuss a particular choice of search directions that defines BayesCG. Throughout this paper, note that A is not required to be symmetric positive-definite, except in Section 2.2.

### 2.1 Probabilistic Linear Solver

In this section we present a general probabilistic numerical method for solving Eq. (1). The approach taken is Bayesian, so that the method is defined by the choice of prior and the information on which the prior is to be conditioned. For this work, the information about x_* is linear, and is provided by search directions s_i, i = 1, …, m, through the matrix-vector products

 y_i := (s_i^T A) x_* = s_i^T b. (2)

The matrix-vector products on the right-hand-side are assumed to be computed without error, i.e. in exact arithmetic, which implies a likelihood model in the form of a Dirac distribution:

 p(y_m | x) = δ(y_m − S_m^T A x) (3)

This section assumes the search directions are given a-priori. The specific search directions which define BayesCG will be introduced in Section 2.3.

In general the recovery of x_* from m < d pieces of information is ill-posed. The prior distribution serves to regularise the problem, in the spirit of Tikhonov (1963); Stuart (2010). Linear information is well-adapted to inference with stable distributions, such as the Gaussian or Cauchy distributions, in that the posterior distribution is available in closed-form. (Recall that, for X_1 and X_2 independent copies of a random variable X, the variable X is said to be stable if, for any constants a, b > 0, the random variable aX_1 + bX_2 has the same distribution as cX + e for some constants c > 0 and e.) Optimal estimation with linear information is also well-understood (cf. Traub et al., 1988). To proceed, let x be a random variable, which will be used to model epistemic uncertainty regarding the true solution x_*, and endow x with the prior distribution

 p(x) = N(x; x_0, Σ_0) (4)

where x_0 ∈ ℝ^d and Σ_0 ∈ ℝ^{d×d} are each assumed to be known a-priori, an assumption that will be relaxed in Section 4. It will be assumed throughout that Σ_0 is a symmetric and positive-definite matrix.

Having specified the prior and the information, there exists a unique Bayesian probabilistic numerical method which outputs the conditional distribution of x given y_m (Cockayne et al., 2017), where y_m = (y_1, …, y_m)^T satisfies y_m = S_m^T b, and S_m denotes the matrix whose columns are s_1, …, s_m. This is made clear in the following result:

###### Proposition 1 (Probabilistic Linear Solver).

Let r_0 := b − A x_0 and Λ_m := S_m^T A Σ_0 A^T S_m. Then the posterior distribution is given by

 p(x | y_m) = N(x; x_m, Σ_m)

where

 x_m = x_0 + Σ_0 A^T S_m Λ_m^{-1} S_m^T r_0 (5)
 Σ_m = Σ_0 − Σ_0 A^T S_m Λ_m^{-1} S_m^T A Σ_0. (6)

This provides a distribution on x that reflects the state of knowledge given the information contained in y_m. The mean, x_m, could be viewed as an approximation to x_* that might be provided by a numerical method. From a computational perspective, the presence of the matrix Λ_m^{-1} could be problematic, as this implies that a second linear system must be solved, albeit at the lower cost O(m^3). This could be addressed to some extent by updating Λ_m^{-1} iteratively using the Woodbury matrix inversion lemma, though this would not reduce the overall cost. However, as the search directions can be chosen arbitrarily, this motivates a choice which diagonalises Λ_m, to make the inverse trivial. This will be discussed further in Section 2.3.
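A direct, dense-linear-algebra sketch of Proposition 1 may clarify the computation (the function name `posterior` is ours, and no attempt is made here at the iterative efficiencies discussed later):

```python
import numpy as np

def posterior(A, b, x0, Sigma0, S):
    """Posterior of Proposition 1 for search directions S (d x m).

    Implements Eqs. (5)-(6) directly:
      Lambda_m = S^T A Sigma0 A^T S,
      x_m      = x0 + Sigma0 A^T S Lambda_m^{-1} S^T r0,
      Sigma_m  = Sigma0 - Sigma0 A^T S Lambda_m^{-1} S^T A Sigma0.
    """
    r0 = b - A @ x0
    W = Sigma0 @ A.T @ S            # Sigma0 A^T S, shape (d, m)
    Lam = S.T @ A @ W               # Lambda_m, shape (m, m)
    xm = x0 + W @ np.linalg.solve(Lam, S.T @ r0)
    Sigma_m = Sigma0 - W @ np.linalg.solve(Lam, W.T)
    return xm, Sigma_m

# With m = d linearly independent directions the solution is fully identified:
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d)) + d * np.eye(d)   # invertible test matrix
b = rng.standard_normal(d)
B = rng.standard_normal((d, d))
Sigma0 = B @ B.T + np.eye(d)        # arbitrary symmetric positive-definite prior
xm, Sigma_m = posterior(A, b, np.zeros(d), Sigma0, rng.standard_normal((d, d)))
```

With a full set of independent directions, x_m recovers A^{-1}b and Σ_m vanishes, illustrating the conditioning described above.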

Note that the posterior distribution is singular, in that det(Σ_m) = 0. This is natural, since what uncertainty remains in directions not yet explored is simply the restriction, in the measure-theoretic sense, of the prior to the subspace orthogonal to those directions. As a result, the posterior distribution is concentrated on a linear subspace of ℝ^d. Singularity of the posterior makes computing certain quantities difficult, such as posterior probabilities. Nevertheless, Σ_m can be decomposed using techniques such as the singular-value decomposition, so sampling from the posterior is straightforward.

Define a generic inner-product of two vectors x, x' ∈ ℝ^d by ⟨x, x'⟩_M := x^T M x', with associated norm ∥x∥_M := ⟨x, x⟩_M^{1/2}. Note that for this to define a norm, it is required that M be a symmetric positive-definite matrix. The following basic result establishes that the posterior covariance provides a meaningful connection to the error of x_m, when viewed as a point estimator:

###### Proposition 2.
 ∥x_m − x_*∥_{Σ_0^{-1}} / ∥x_0 − x_*∥_{Σ_0^{-1}} ≤ √(tr(Σ_m Σ_0^{-1}))

Thus the right-hand side provides an upper bound on the relative error of the estimator x_m in the Σ_0^{-1}-norm. This is a weak result, and tighter results for specific search directions are provided later. In addition to bounding the error in terms of the posterior covariance Σ_m, we can also compute the rate of contraction of the posterior covariance itself:

###### Proposition 3.
 tr(Σ_m Σ_0^{-1}) = d − m

The combination of Propositions 2 and 3 implies that the posterior mean x_m is consistent and, since the posterior covariance characterises the width of the posterior, Proposition 3 can be viewed as a posterior contraction result. This result is intuitive; after exploring m linearly independent search directions, x_* has been perfectly identified in an m-dimensional linear subspace of ℝ^d. Thus, after adjusting for the weighting of ℝ^d provided by the prior covariance Σ_0, it is natural that an appropriate measure of the size of the posterior should also converge at a rate that is linear in m.
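Propositions 2 and 3 can be checked numerically; the sketch below (an illustration of ours, with Σ_0 = I and arbitrary linearly independent search directions) verifies the trace identity and the error bound at every m:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
A = rng.standard_normal((d, d)) + d * np.eye(d)   # invertible test matrix
x_true = rng.standard_normal(d)
b = A @ x_true
x0 = np.zeros(d)
Sigma0 = np.eye(d)
S_full = rng.standard_normal((d, d))

def norm0(v):
    # Sigma0^{-1}-norm; Sigma0 = I here, so this is the Euclidean norm
    return np.sqrt(v @ v)

for m in range(1, d + 1):
    S = S_full[:, :m]
    W = Sigma0 @ A.T @ S                                      # Sigma0 A^T S_m
    Lam = S.T @ A @ W                                         # Lambda_m
    xm = x0 + W @ np.linalg.solve(Lam, S.T @ (b - A @ x0))    # Eq. (5)
    Sigm = Sigma0 - W @ np.linalg.solve(Lam, W.T)             # Eq. (6)
    trace = np.trace(Sigm @ np.linalg.inv(Sigma0))
    assert np.isclose(trace, d - m)                           # Proposition 3
    # Proposition 2: relative error bounded by sqrt of the trace
    assert norm0(xm - x_true) <= np.sqrt(max(trace, 0.0)) * norm0(x0 - x_true) + 1e-8
```

Note that the trace identity holds for any linearly independent directions, not only the conjugate ones introduced later.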

### 2.2 Correspondence with the Conjugate Gradient Method

In this section we examine the correspondence of the posterior mean described in Proposition 1 with the CG method. It is frequently the case that Bayesian probabilistic numerical methods have some classical numerical method as their posterior mean, due to the characterisation of the conditional mean of a probability distribution as the best estimate, in the mean-square sense, of the underlying quantity consistent with the information provided (Diaconis, 1988; Cockayne et al., 2017).

#### The Conjugate Gradient Method

A large class of iterative methods for solving linear systems defined by positive-definite matrices can be motivated by sequentially solving the following minimisation problem:

 x_m = argmin_{x ∈ K_m} ∥x − x_*∥_A

where (K_m) is a sequence of m-dimensional linear subspaces of ℝ^d. It is straightforward to show that this is equivalent to:

 x_m = argmin_{x ∈ K_m} f(x)

where f(x) := (1/2) x^T A x − x^T b is a convex quadratic functional. Let S_m denote a matrix whose columns are arbitrary linearly independent search directions s_1, …, s_m, with range(S_m) = K_m. Let x_0 denote an arbitrary starting point for the algorithm. Then x_m = x_0 + S_m c for some c ∈ ℝ^m, which can be computed by minimising f(x_0 + S_m c) over c. This yields:

 x_m = x_0 + S_m (S_m^T A S_m)^{-1} S_m^T (b − A x_0) (7)

In CG (Hestenes and Stiefel, 1952) the search directions are constructed to simplify the inversion in Eq. (7) by imposing that the search directions are A-conjugate, that is, ⟨s_i, s_j⟩_A = 0 whenever i ≠ j. A set of A-conjugate vectors is also said to be A-orthogonal, while if the vectors additionally satisfy ∥s_i∥_A = 1 for each i they are said to be A-orthonormal. For simplicity of notation, we will usually work with A-orthonormal search directions, but in most implementations of CG the normalisation step can introduce stability issues and is therefore avoided.

Supposing that such a set of A-orthonormal search directions can be found, Eq. (7) simplifies to

 x_m^CG = x_0^CG + S_m^CG (S_m^CG)^T (b − A x_0^CG) (8)

which lends itself to an iterative numerical method:

 x_m^CG = x_{m−1}^CG + s_m^CG (s_m^CG)^T (b − A x_{m−1}^CG).

Search directions are also constructed iteratively, motivated by gradient descent on the function f, whose negative gradient is the residual r(x) := b − A x. The initial un-normalised search direction is chosen to be s̃_1^CG := r_0^CG = b − A x_0^CG, so that s_1^CG = s̃_1^CG / ∥s̃_1^CG∥_A. Letting r_m^CG := b − A x_m^CG, subsequent search directions are given by

 s̃_m^CG := r_{m−1}^CG − ⟨s_{m−1}^CG, r_{m−1}^CG⟩_A s_{m−1}^CG (9)

with s_m^CG = s̃_m^CG / ∥s̃_m^CG∥_A. This construction leads to search directions which form an A-orthonormal set.
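The iteration just described can be sketched as follows (a didactic implementation of ours with explicit A-normalisation, not a production solver):

```python
import numpy as np

def cg(A, b, x0, m):
    """Didactic CG with explicitly A-normalised directions (Eqs. (8)-(9))."""
    x = x0.copy()
    r = b - A @ x
    s_prev = np.zeros_like(b)
    for _ in range(m):
        s = r - (s_prev @ (A @ r)) * s_prev   # Eq. (9): conjugate against s_{m-1}
        s = s / np.sqrt(s @ (A @ s))          # A-normalise
        x = x + s * (s @ r)                   # iterative update of Eq. (8)
        r = b - A @ x
        s_prev = s
    return x

rng = np.random.default_rng(2)
d = 6
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)       # symmetric positive-definite test matrix
x_true = rng.standard_normal(d)
b = A @ x_true
x_cg = cg(A, b, np.zeros(d), d)   # d iterations recover the solution exactly
```

In exact arithmetic the two-term recurrence suffices for full conjugacy, so d iterations reach the exact solution.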

Eq. (8) makes clear the following proposition, which shows that for a particular choice of prior the CG method is recovered as the posterior mean from Proposition 1:

###### Proposition 4.

Assume A is symmetric and positive-definite. Let Σ_0 = A^{-1} and x_0 = x_0^CG. Then, taking S_m = S_m^CG, Eq. (5) reduces to x_m = x_m^CG.

This result provides an intriguing perspective on the CG method, in that it represents the estimate produced by a rational Bayesian agent whose prior belief about x_* is modelled by N(x_0, A^{-1}). Dependence of the prior on the inaccessible matrix inverse A^{-1} is in accordance with the findings in Hennig (2015) (Theorem 2.4 and Lemma 3.4), in which an analogous result was presented. As observed in that paper, the appearance of A^{-1} in the prior covariance is not practically useful: while the matrix inverse cancels in the expression for x_m, it remains in the expression for Σ_m.
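Proposition 4 can be verified numerically; the sketch below (illustrative choices of ours throughout) runs m steps of CG to obtain A-orthonormal search directions, then evaluates the posterior mean of Eq. (5) under Σ_0 = A^{-1}:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 6, 3
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)          # symmetric positive-definite
b = rng.standard_normal(d)
x0 = np.zeros(d)

# m steps of CG, recording the A-orthonormal search directions (Eq. (9))
x, r = x0.copy(), b - A @ x0
dirs, s_prev = [], np.zeros(d)
for _ in range(m):
    s = r - (s_prev @ (A @ r)) * s_prev
    s = s / np.sqrt(s @ (A @ s))
    x = x + s * (s @ r)
    r = b - A @ x
    dirs.append(s)
    s_prev = s
S = np.column_stack(dirs)
x_cg = x

# Posterior mean of Eq. (5) with prior covariance Sigma0 = A^{-1}
Sigma0 = np.linalg.inv(A)
W = Sigma0 @ A.T @ S
x_bayes = x0 + W @ np.linalg.solve(S.T @ A @ W, S.T @ (b - A @ x0))
```

The two iterates agree to numerical precision, as Proposition 4 states.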

### 2.3 Search Directions

In this section the choice of search directions for the method in Proposition 1 will be discussed, initially by following an information-based complexity (Traub et al., 1988) argument. For efficiency purposes, a further consideration is that Λ_m should be easy to invert. This naturally suggests that search directions should be chosen to be conjugate with respect to the matrix A Σ_0 A^T, rather than A. Note that this approach does not require A to be positive-definite, as A Σ_0 A^T is positive-definite for any non-singular A. Two choices of search direction will be discussed:

#### Optimal Information

One choice is to formulate selection of S_m in a decision-theoretic framework, to obtain optimal information in the nomenclature of Cockayne et al. (2017). Abstractly, denote the probabilistic numerical method discussed above by P[b; μ, S_m], where μ ∈ P(ℝ^d) and P(ℝ^d) is the set of all distributions on ℝ^d. The map P takes a right-hand-side b, together with a prior μ and a set of search directions S_m, and outputs the posterior distribution from Proposition 1. Thus P[b; μ, S_m] is a measure and P[b; μ, S_m](dx) denotes its infinitesimal element.

For general μ, define the average risk associated with the search directions S_m to be

 R(S_m, μ) = ∬ L(x, x_*) P[A x_*; μ, S_m](dx) μ(dx_*) (10)

where L(x, x_*) represents a loss incurred when x is used to estimate x_*. This can be thought of as the performance of the probabilistic numerical method, averaged both over the class of problems described by μ and over the output of the method. Optimal information in this paper concerns selection of S_m to minimise R(S_m, μ). The following proposition characterises optimal information for the posterior in Proposition 1 in the case of a squared-error loss function and when μ = N(x_0, Σ_0). For a symmetric positive-definite matrix M, let M^{1/2} denote a square-root of M with the property that M^{T/2} M^{1/2} = M, where M^{T/2} := (M^{1/2})^T.

###### Proposition 5.

Suppose μ = N(x_0, Σ_0), and consider the squared-error loss L(x, x_*) = ∥x − x_*∥_M^2, where M is an arbitrary symmetric positive-definite matrix. Optimal information for this loss is given by

 S_m = A^{-T} M^{T/2} Φ_m

where Φ_m is the matrix whose columns are the m leading eigenvectors of M^{1/2} Σ_0 M^{T/2}, normalised such that Φ_m^T Φ_m = I.

The dependence of the optimal information on A^{-T} is problematic, except for when M = A^T A, which corresponds to measuring the performance of the algorithm through the residual, since ∥x − x_*∥_{A^T A} = ∥A x − b∥_2. While this removes dependence on the inverse matrix, finding the search directions in this case requires computing the eigenvectors of A Σ_0 A^T, the complexity of which would dominate the cost of computing the posterior in Proposition 1.
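For the case M = A^T A, where the optimal directions reduce to leading eigenvectors of A Σ_0 A^T, the following sketch compares the average risk tr(M Σ_m) of optimal against arbitrary directions (the `risk` helper is our own construction, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 8, 3
A = rng.standard_normal((d, d)) + d * np.eye(d)
Sigma0 = np.eye(d)

# With M = A^T A one may take M^{1/2} = A, so M^{1/2} Sigma0 M^{T/2} = A Sigma0 A^T
# and the optimal S_m = A^{-T} M^{T/2} Phi_m = Phi_m, the leading eigenvectors:
H = A @ Sigma0 @ A.T
evals, evecs = np.linalg.eigh((H + H.T) / 2)
S_opt = evecs[:, ::-1][:, :m]

def risk(S):
    """Average squared-error risk tr(M Sigma_m) with M = A^T A (our helper)."""
    W = Sigma0 @ A.T @ S
    Sigma_m = Sigma0 - W @ np.linalg.solve(S.T @ A @ W, W.T)
    return np.trace(A.T @ A @ Sigma_m)

S_rand = rng.standard_normal((d, m))
# Optimal information never does worse than arbitrary directions.
```

The eigendecomposition here is exactly the cost referred to in the text: it dominates the cost of forming the posterior itself.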

#### Conjugacy

A second, more practical method for obtaining search directions that diagonalise Λ_m is similar to that taken in CG. Search directions are constructed which are conjugate with respect to the matrix A Σ_0 A^T, by following a similar procedure to that described in Section 2.2.

###### Proposition 6 (Conjugate Search Directions ⟹ Iterative Method).

Assume that the search directions are A Σ_0 A^T-orthonormal. Denote r_m := b − A x_m. Then x_m in Eq. (5) simplifies to

 x_m = x_{m−1} + Σ_0 A^T s_m (s_m^T r_{m−1})

while to compute Σ_m in Eq. (6) it suffices to store only the vectors Σ_0 A^T s_j, for j = 1, …, m.

On the surface, the form of this posterior differs slightly from that in Proposition 1, in that the data are given by s_m^T r_{m−1} rather than s_m^T r_0. However, when search directions are conjugate, the two expressions are equivalent:

 s_m^T r_{m−1} = s_m^T b − s_m^T A x_{m−1} = s_m^T b − s_m^T A x_0 − s_m^T A Σ_0 A^T S_{m−1} Λ_{m−1}^{-1} S_{m−1}^T r_0 = s_m^T r_0 (11)

where the third term vanishes because s_m^T A Σ_0 A^T S_{m−1} = 0 by conjugacy.

Use of r_{m−1} reduces the amount of storage required compared to direct application of Eq. (5). It also helps with stability: while search directions can be shown to be conjugate mathematically, the accumulation of floating-point error is such that numerical conjugacy may not hold, a point discussed further in Section 5.

An approach to constructing conjugate search directions for our probabilistic linear solver is now presented, again motivated by gradient descent.

###### Proposition 7 (Bayesian Conjugate Gradient Method).

Recall the definition of the residual r_m = b − A x_m. Denote by ⟨·,·⟩ and ∥·∥ the inner-product and norm induced by A Σ_0 A^T. Let s̃_1 := r_0 and, for m > 1, let

 s̃_m := r_{m−1} − ⟨s_{m−1}, r_{m−1}⟩ s_{m−1}.

Further, assume s̃_m ≠ 0 and let s_m = s̃_m / ∥s̃_m∥. Then, for each m, the set {s_1, …, s_m} is A Σ_0 A^T-orthonormal, and as a result Λ_m = I.

This is termed a Bayesian conjugate gradient method for the same reason as in CG, as search directions are chosen to be the direction of gradient descent subject to a conjugacy requirement, albeit a different one than in standard CG. In the context of Proposition 4, note that the search directions obtained coincide with those obtained from CG when A is symmetric positive-definite and Σ_0 = A^{-1}. Thus, BayesCG is a strict generalisation of CG. Note, however, that these search directions are constructed in a data-driven manner, in that they depend on the right-hand-side b. This introduces a dependency on x_* through the relationship in Eq. (1) which is not taken into account in the conditioning procedure, and leads to conservative uncertainty assessment, as will be demonstrated in Section 6.1.
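Propositions 6 and 7 together yield the BayesCG iteration, sketched below for a nonsymmetric A with Σ_0 = I (function names are ours; the low-rank form of Σ_m follows Proposition 6):

```python
import numpy as np

def bayescg(A, b, x0, Sigma0, m):
    """Sketch of BayesCG (Propositions 6 and 7).

    Directions are A Sigma0 A^T-orthonormalised by a two-term recurrence;
    the posterior covariance is the low-rank downdate
    Sigma_m = Sigma0 - sum_j f_j f_j^T with f_j = Sigma0 A^T s_j.
    """
    x = x0.copy()
    r = b - A @ x
    s_prev = np.zeros_like(b)
    factors = []
    for _ in range(m):
        # conjugate the residual against the previous direction
        s = r - (s_prev @ (A @ (Sigma0 @ (A.T @ r)))) * s_prev
        s = s / np.sqrt(s @ (A @ (Sigma0 @ (A.T @ s))))   # A Sigma0 A^T-normalise
        f = Sigma0 @ (A.T @ s)
        x = x + f * (s @ r)          # Proposition 6
        r = b - A @ x
        factors.append(f)
        s_prev = s
    Sigma_m = Sigma0 - sum(np.outer(f, f) for f in factors)
    return x, Sigma_m

rng = np.random.default_rng(5)
d = 5
A = 0.5 * rng.standard_normal((d, d)) + d * np.eye(d)   # nonsymmetric, invertible
x_true = rng.standard_normal(d)
x, Sigma = bayescg(A, A @ x_true, np.zeros(d), np.eye(d), d)
```

After d iterations the mean recovers the solution and, consistent with Proposition 3, the trace of the posterior covariance vanishes.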

## 3 BayesCG as a Krylov Subspace Method

In this section a thorough theoretical analysis of the posterior will be presented. Fundamental to the analysis in this section is the concept of a Krylov subspace.

###### Definition 8 (Krylov Subspace).

The Krylov subspace K_m(M, v), for M ∈ ℝ^{d×d} and v ∈ ℝ^d, is defined as

 K_m(M, v) := span(v, Mv, M^2 v, …, M^m v).

For a vector w ∈ ℝ^d, the shifted Krylov subspace is defined as

 w + K_m(M, v) := {w + z : z ∈ K_m(M, v)}.

It is well-known that CG is a Krylov subspace method for symmetric positive-definite matrices (Liesen and Strakos, 2012), meaning that

 x_m^CG = argmin_{x ∈ x_0 + K_{m−1}(A, r_0)} ∥x − x_*∥_A.

It will now be shown that the posterior mean for BayesCG, presented in Proposition 6, is a Krylov subspace method. For convenience, let K_m^* := x_0 + Σ_0 A^T K_m(A Σ_0 A^T, r_0).

###### Proposition 9.

The BayesCG mean satisfies

 x_m = argmin_{x ∈ K_{m−1}^*} ∥x − x_*∥_{Σ_0^{-1}}.

This proposition gives an alternate perspective on the observation that, when A is symmetric positive-definite and Σ_0 = A^{-1}, the posterior mean from BayesCG coincides with x_m^CG: indeed, for this choice of Σ_0, K_{m−1}^* coincides with x_0 + K_{m−1}(A, r_0) and furthermore, since under this choice the norm minimised in Proposition 9 is ∥·∥_A, it is natural that the estimates x_m and x_m^CG should be identical.

Proposition 9 allows us to establish a convergence rate for the BayesCG mean which is similar to that which can be demonstrated for CG. Let κ(M) denote the condition number of a matrix M in the matrix 2-norm. Noting that κ(Σ_0 A^T A) is well-defined, as Σ_0 and A are each nonsingular, we have:

###### Proposition 10.
 ∥x_m − x_*∥_{Σ_0^{-1}} / ∥x_0 − x_*∥_{Σ_0^{-1}} ≤ 2 ((√κ(Σ_0 A^T A) − 1) / (√κ(Σ_0 A^T A) + 1))^m.

This rate is similar to the well-known convergence rate for CG, in which κ(Σ_0 A^T A) is replaced by κ(A). However, since for the default choice Σ_0 = I one has κ(A^T A) = κ(A)^2, the convergence rate for BayesCG will often be worse than that for CG, unless Σ_0 is chosen judiciously to reduce the condition number of Σ_0 A^T A. Thus it appears that there is a price to be paid when uncertainty quantification is needed. This is unsurprising, as it is generally the case that uncertainty quantification is associated with additional cost over methods for which uncertainty quantification is not provided.

Nevertheless, the rate of convergence in Proposition 10 is significantly faster than the rate obtained in Proposition 2. The reason for this is that knowledge about how the search directions were chosen has been exploited. The directions used in BayesCG are motivated by gradient descent on the function f. Thus, if gradient descent is an effective heuristic for the problem at hand, the magnitude of the error ∥x_m − x_*∥_{Σ_0^{-1}} will decrease at a geometric rate. The same cannot be said for tr(Σ_m Σ_0^{-1}), which continues to decay only linearly, as proven in Proposition 3. Thus, the posterior covariance will in general be conservative when the BayesCG search directions are used. This is verified empirically in Section 6.1.

## 4 Prior Choice

The critical issue of prior choice is now examined. In Section 4.1 selection of the prior covariance structure will be discussed. Then in Section 4.2 a hierarchical prior will be introduced to address the scale of the prior.

### 4.1 Covariance Structure

When A is symmetric positive-definite, one choice which has already been discussed is to set Σ_0 = A^{-1}, which results in a posterior mean equal to the output of CG. However, correspondence of the posterior mean with CG does not in itself justify this modelling choice from a probabilistic perspective, and moreover this choice is not practical, as access to A^{-1} would give immediate access to the solution of Eq. (1). We therefore discuss some alternatives for the choice of Σ_0.

#### Natural Prior

Taking inspiration from probabilistic numerical methods for PDEs (Cockayne et al., 2016; Owhadi, 2015), another natural choice presents itself: the object through which information about x_* is extracted is A x, so it is natural, and mathematically equivalent, to place a relatively uninformative prior on A x rather than on x itself. If A x ∼ N(A x_0, I), the implied prior model for x is x ∼ N(x_0, (A^T A)^{-1}), i.e. Σ_0 = (A^T A)^{-1}, for which A Σ_0 A^T = I. This prior is as impractical as that which aligns the posterior mean with CG, but has the attractive property that convergence is instantaneous when the search directions from Proposition 7 are used. To see this, observe that

 s_1 = r_0 / ∥r_0∥_{A Σ_0 A^T} = r_0 / ∥r_0∥_2 (since A Σ_0 A^T = I)
 ⟹ x_1 = x_0 + Σ_0 A^T s_1 (s_1^T r_0) = x_0 + (A^T A)^{-1} A^T r_0 (r_0^T r_0) / ∥r_0∥_2^2 (Proposition 6)
 = x_0 + A^{-1} (b − A x_0) = x_* (since r_0 = b − A x_0).

Thus this prior is natural, in that when using the search directions from Proposition 7, convergence occurs in one iteration.
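The one-step convergence above is easy to reproduce numerically (a sketch of ours under the natural prior Σ_0 = (A^T A)^{-1}):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 6
A = rng.standard_normal((d, d)) + d * np.eye(d)   # need not be symmetric
x_true = rng.standard_normal(d)
b = A @ x_true
x0 = np.zeros(d)

Sigma0 = np.linalg.inv(A.T @ A)      # natural prior: A Sigma0 A^T = I
r0 = b - A @ x0
s1 = r0 / np.linalg.norm(r0)         # ||r0||_{A Sigma0 A^T} = ||r0||_2 here
x1 = x0 + (Sigma0 @ A.T @ s1) * (s1 @ r0)    # Proposition 6 with m = 1
# x1 recovers the exact solution in a single iteration
```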

#### Preconditioner Prior

For systems in which a preconditioner P is available, the preconditioner can be thought of as providing an approximation to the linear operator A. Inspired by the impractical natural covariance Σ_0 = (A^T A)^{-1}, one approach proposed in this paper is to set Σ_0 = (P^T P)^{-1} when a preconditioner can be found. Since by design the action of P^{-1} can be computed efficiently, so too can the action of (P^T P)^{-1}. As mentioned in Section 1.1, the availability of a good preconditioner is problem-dependent.
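As a minimal sketch of this idea, assume a Jacobi-type preconditioner P = diag(A) (a choice made here purely for illustration); the prior covariance Σ_0 = (P^T P)^{-1} then makes A Σ_0 A^T far better conditioned than A A^T:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 20
# Matrix with a widely spread diagonal, mildly perturbed off the diagonal
A = np.diag(np.linspace(1.0, 100.0, d)) + 0.1 * rng.standard_normal((d, d))
P = np.diag(np.diag(A))              # Jacobi-type preconditioner (illustrative)
Sigma0 = np.linalg.inv(P.T @ P)      # preconditioner prior

# A Sigma0 A^T = (A P^{-1})(A P^{-1})^T, much closer to the identity than A A^T
cond_prior = np.linalg.cond(A @ Sigma0 @ A.T)
cond_plain = np.linalg.cond(A @ A.T)
```

By Proposition 10, reducing the condition number of Σ_0 A^T A in this way directly improves the convergence rate of the BayesCG mean.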

#### Krylov Subspace Prior

The analysis presented in Section 3 suggests another potential prior, in which probability mass is distributed according to an appropriate Krylov subspace K_n(M, b). Consider a random variable constructed as the linear combination

 x_K = Σ_{i=0}^n w_i M^i b (12)

where w ∼ N(0, Φ) for some positive-definite matrix Φ ∈ ℝ^{(n+1)×(n+1)}. The distribution on x_K induced by Eq. (12) is clearly Gaussian with mean 0. To determine its covariance, note that the above expression can be rewritten as x_K = K_n w, where K_n is a matrix whose columns form a basis of the Krylov subspace K_n(M, b). A convenient choice is the orthonormal basis k_i = k̃_i / ∥k̃_i∥_2 obtained by Gram–Schmidt, where

 k̃_i = M^i b − Σ_{j=0}^{i−1} (k_j^T M^i b) k_j.

Irrespective of the choice of basis, however, the covariance of x_K is given by K_n Φ K_n^T, so that x_0 + x_K ∼ N(x_0, K_n Φ K_n^T). One issue with this approach is that computation of the matrix K_n is of the same computational complexity as n iterations of BayesCG, requiring n matrix-vector products. To ensure that this cost does not dominate the procedure, it is necessary to take n ≪ d. However, in this situation the covariance K_n Φ K_n^T is rank-deficient, so it is necessary to add additional probability mass on the space orthogonal to K_n(M, b), to ensure that x_* lies in the prior support. To this end, let K_n^⊥ denote a matrix whose columns form an orthonormal basis of the orthogonal complement of K_n(M, b), and let x_K^⊥ ∼ N(0, φ K_n^⊥ (K_n^⊥)^T) for a scaling parameter φ > 0. Then, the proposed Krylov subspace prior is given by

 x (= x_0 + x_K + x_K^⊥) ∼ N(x_0, K_n Φ K_n^T + φ K_n^⊥ (K_n^⊥)^T).

Practical issues associated with this approach are now discussed:

• Choice of $M$: It seems natural to place mass on the Krylov subspace which the BayesCG iterate $x_m$ occupies, that is $K_m(\Sigma_0 A^\top A, \Sigma_0 A^\top r_0)$, but the dependence of this subspace on the prior covariance $\Sigma_0$ that is being constructed makes this choice circular. An alternative is to choose $M$ so that the projection of $x_*$ into $K_n(M, b)$ converges rapidly in $n$ to the truth. The choice $M = A$ seems natural, due to the known rapid convergence of CG, and corresponds to a prior encoding of the intuition that “CG search directions tend to work well”. Another option would be to take $M$ based on a preconditioner $P$, but this final choice was not explored.

• Selection of $\Phi$ and $\varphi$: When $M = A$, the theoretical bound for the relative error of CG, obtained from Proposition 10 in the case $\Sigma_0 = A^{-1}$, can be used to choose the elements of $\Phi$. We propose to choose $\Phi$ to be a diagonal matrix whose entries decay geometrically at the CG convergence rate, scaled by a parameter $\sigma$. Based upon Proposition 10, the ideal choices of the scale and decay rate involve the initial error $\|x_* - x_0\|_A$ and the condition number $\kappa(A)$. However, since these two quantities are not typically known a priori, estimates must in practice be used. The remaining parameter, $\varphi$, determines how much weight is given to the orthogonal component. Given the choice of $\Phi$, it is natural to choose $\varphi$ comparable to the smallest diagonal entry of $\Phi$; this encodes the user’s prior belief about how many iterations of standard CG are “typically needed”.

• Computation of $K_n^\perp$: Computation of the orthogonal complement of $K_n$ is equivalent to finding a basis of the set of all vectors $v$ for which $K_n^\top v = 0$; that is, computing the null-space of $K_n^\top$. This can be accomplished by QR decomposition. Recall that a QR decomposition produces matrices $Q \in \mathbb{R}^{d \times d}$ and $R \in \mathbb{R}^{d \times n}$ such that $K_n = QR$. The matrix $Q$ can be used to determine the null-space of $K_n^\top$. After partitioning the matrix as $Q = [Q_1 \; Q_2]$, where $Q_1$ and $Q_2$ are respectively the first $n$ and last $d - n$ columns of $Q$, it holds that $Q_2$ is an orthonormal matrix whose columns form a basis of the required null-space.
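Putting the pieces of this subsection together, a minimal sketch of assembling the Krylov subspace prior covariance might look as follows. All names are ours; the orthonormal basis is obtained from a full QR decomposition of the matrix of Krylov vectors (equivalent, in exact arithmetic, to the explicit recursion above), which also yields the orthogonal complement:

```python
import numpy as np

def krylov_prior_covariance(A, b, n, Phi_diag, phi):
    """Sigma0 = K_n Phi K_n^T + phi * Kperp Kperp^T (illustrative sketch)."""
    d = len(b)
    K = np.empty((d, n))
    K[:, 0] = b
    for i in range(1, n):
        K[:, i] = A @ K[:, i - 1]            # Krylov vectors b, Ab, ..., A^{n-1} b
    Q, _ = np.linalg.qr(K, mode="complete")  # full QR of the Krylov matrix
    Kn, Kperp = Q[:, :n], Q[:, n:]           # basis and its orthogonal complement
    return Kn @ np.diag(Phi_diag) @ Kn.T + phi * (Kperp @ Kperp.T)
```

With positive diagonal entries in `Phi_diag` and `phi > 0`, the resulting covariance is symmetric positive-definite, so the prior has full support as required.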

### 4.2 Covariance Scale

For the distributional output of BayesCG to be useful it must be well-calibrated. Loosely speaking, this means that the true solution $x_*$ should typically lie in a region where most of the posterior probability mass is situated. As such, the scale of the posterior covariance should be able to adapt to reflect the difficulty of the linear system at hand. This can be challenging, partially because the magnitude of the solution vector is a priori unknown, and partially because of the aforementioned fact that the dependence of the search directions on $x_*$ is not accounted for in BayesCG.

In this section we propose to treat the prior scale as an additional parameter to be learned; that is, we consider the prior model

$$p(x \mid \nu) = N(x_0, \nu \Sigma_0)$$

where $x_0$ and $\Sigma_0$ are as before, while $\nu > 0$. This can be viewed as a generalised version of the prior in Eq. (4), which is recovered when $\nu = 1$. In this section we consider learning $\nu$ in a hierarchical Bayesian framework, but we note that $\nu$ could also be heuristically calibrated. An example of such a heuristic procedure is outlined in LABEL:supplement:sim_results.

The approach pursued below follows a standard approach in Bayesian linear regression (Gelman et al., 2014). More generally, one could treat the entire covariance $\Sigma_0$ as unknown and perform a similar conjugate analysis with an inverse-Wishart prior, though this extension was not explored. Consider then endowing $\nu$ with Jeffreys’ (improper) reference prior:

$$p(\nu) \propto \nu^{-1}.$$

The conjugacy of this prior with the Gaussian distribution is such that the posterior marginal distributions $p(\nu \mid y_m)$ and $p(x \mid y_m)$ can be found analytically. For the following proposition, $IG(\alpha, \beta)$ denotes an inverse-gamma distribution with shape $\alpha$ and rate $\beta$, while $MVT_m$ denotes a multivariate $t$ distribution with $m$ degrees of freedom.

###### Proposition 11 (Hierarchical BayesCG).

When $\nu$ and $x$ are modelled as above, the posterior marginal for $\nu$ is given by

$$p(\nu \mid y_m) = IG\left(\frac{m}{2},\; \frac{1}{2}\, r_0^\top S_m \Lambda_m^{-1} S_m^\top r_0\right)$$

while the posterior marginal for $x$ is given by

$$p(x \mid y_m) = MVT_m\left(x_m,\; \frac{r_0^\top S_m \Lambda_m^{-1} S_m^\top r_0}{m}\, \Sigma_m\right).$$

When the search directions are $A\Sigma_0A^\top$-orthonormal, this simplifies to

$$p(\nu \mid y_m) = IG\left(\frac{m}{2},\, \frac{m}{2}\nu_m\right), \qquad p(x \mid y_m) = MVT_m(x_m,\, \nu_m \Sigma_m)$$

where $\nu_m := \|S_m^\top r_0\|_2^2 / m$.

Since $r_0$ reflects the initial error $x_* - x_0$, the quantity $\nu_m$ can be thought of as describing the difficulty of the problem. Thus, in this approach, the scale of the posterior is data-dependent.
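The simplified marginal for $\nu$ in Proposition 11 is cheap to compute from quantities already available during the iteration. A sketch (function name ours), valid under the assumption that the columns of `S` are $A\Sigma_0A^\top$-orthonormal search directions:

```python
import numpy as np

def nu_posterior(S, r0):
    """Return (shape, rate) of the IG posterior for nu: IG(m/2, (m/2) nu_m)."""
    m = S.shape[1]
    nu_m = np.sum((S.T @ r0) ** 2) / m   # nu_m = ||S^T r0||^2 / m
    return m / 2.0, (m / 2.0) * nu_m
```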

## 5 Implementation

In this section some important details of the implementation of BayesCG are discussed.

#### Numerical Breakdown of Conjugacy

In standard CG, as well as for the sequentially-computed search directions from Proposition 7, the search directions are conjugate in exact arithmetic, but the propagation of floating-point error is such that numerical conjugacy can fail to hold. This is due to the iterative way in which the method is implemented. The implication is that, while mathematical convergence is guaranteed in at most $d$ iterations, more iterations may be required in practice for the CG estimate to converge to $x_*$ and for the BayesCG posterior to contract around it. When used in practice, CG is typically run for $m \ll d$ iterations to mitigate the impact of conjugacy breakdown. This further highlights the importance of preconditioners, to ensure rapid convergence.

This phenomenon can be mitigated to some extent by the modification described in Eq. (11), which ensures that “local” conjugacy is exploited (Meurant, 2006). While the new direction $s_m$ may not be numerically conjugate to all of $s_1, \dots, s_{m-2}$, it is likely to be numerically conjugate to $s_{m-1}$ by the construction in Proposition 7. Thus, computing the quantity $s_m^\top r_{m-1}$ rather than $s_m^\top r_0$ promotes continued stability and convergence of the posterior mean $x_m$.

The impact on the posterior covariance also deserves comment. When numerical conjugacy fails, the squared norm used to normalise the search directions can be computed as a negative number, so that the normalising constant takes complex values and the posterior covariance will no longer be meaningful. The underlying issue is that when the search directions are not perfectly conjugate, the simplification exploited in Proposition 6 induces overconfidence in the resulting posterior. Since the same issue arises for search directions obtained from CG, we note that the work of Hennig (2015), which also exploits conjugacy, is likely to suffer from the same deficiency. However, a full treatment of floating-point error in the context of BayesCG is rather technical and is deferred for future work. In this direction, inspiration may be taken from Parker and Fox (2012), where a similar analysis was performed.

To circumvent these issues for the purposes of our experiments, we introduce batch-computed search directions obtained by a Gram–Schmidt orthogonalisation procedure (see Giraud et al. (2005) for a numerically stable alternative):

$$\tilde{s}^C_m := r_{m-1} - \sum_{i=1}^{m-1} \langle s^C_i, r_{m-1} \rangle_{A\Sigma_0A^\top}\, s^C_i, \qquad s^C_m := \tilde{s}^C_m \big/ \|\tilde{s}^C_m\|_{A\Sigma_0A^\top}.$$

The batch-computed search directions $s^C_m$ are mathematically identical to the sequentially-computed BayesCG search directions. However, by explicitly orthogonalising against all previous directions, the batch procedure ensures that numerical conjugacy is maintained.
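A compact sketch of BayesCG with batch-computed search directions, following the updates summarised in this paper; this is our own illustrative code, not the authors' reference implementation:

```python
import numpy as np

def bayescg(A, b, x0, Sigma0, m):
    """BayesCG with fully re-orthogonalised (batch) search directions."""
    M = A @ Sigma0 @ A.T                  # Gram matrix of the direction inner product
    x, Sigma = x0.copy(), Sigma0.copy()
    S = []                                # normalised search directions
    r = b - A @ x
    for _ in range(m):
        s = r.copy()
        for si in S:                      # explicit Gram-Schmidt against all
            s = s - (si @ (M @ s)) * si   # previous directions
        norm = np.sqrt(s @ (M @ s))
        if norm < 1e-12:                  # residual exhausted: converged
            break
        s = s / norm
        x = x + (Sigma0 @ A.T @ s) * (s @ r)
        Sigma = Sigma - Sigma0 @ A.T @ np.outer(s, s) @ A @ Sigma0
        S.append(s)
        r = b - A @ x
    return x, Sigma
```

After $d$ iterations the posterior mean recovers the exact solution and the posterior covariance vanishes, in agreement with the exact-arithmetic theory.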

#### Computational Cost

The cost of BayesCG is a constant factor higher than the cost of CG, as three, rather than one, matrix-vector multiplications are required per iteration. Thus, the overall cost is $O(md^2)$ when the search directions from Proposition 7 are used. When the batch-computed search directions are used, an additional orthogonalisation loop of overall complexity $O(m^2 d)$ must be performed. Thus, the cost of the BayesCG algorithm with batch-computed search directions is $O(md^2 + m^2 d)$. Note that each of these costs assumes that $A$ and $\Sigma_0$ are dense matrices; in the case of sparse matrices, the cost of the matrix-vector multiplications is driven by the number of nonzero entries of each matrix rather than the dimension $d$.

#### Termination Criteria

An appealing use of the posterior distribution might be to derive a probabilistic termination criterion for BayesCG. Recall from Proposition 2 that $x_m$ approaches $x_*$ at a rate bounded by the contraction of the posterior covariance, and from Proposition 3 that $\mathrm{tr}(\Sigma_m \Sigma_0^{-1}) = d - m$. To decide in practice how many iterations of BayesCG should be performed, we propose a termination criterion based upon the posterior distribution from Proposition 11:

$$\sigma_m^2 := \mathrm{tr}\big(\Sigma_m \Sigma_0^{-1}\big) \times \nu_m = (d - m)\,\nu_m.$$

Thus, termination when $\sigma_m < \epsilon$, for some user-specified tolerance $\epsilon$, might be a useful criterion. However, Proposition 2 is extremely conservative, and since Proposition 10 establishes a much faster rate of convergence for $x_m$ in the case of BayesCG search directions, this is likely to be an overcautious stopping criterion in the case of BayesCG. Furthermore, since $\nu_m$ involves a data-driven estimate of scale, the term $\sigma_m^2$ is not uniformly decreasing with $m$. As a result, in practice we advocate using a more traditional termination criterion based upon monitoring the residual; see Golub and Van Loan (2013, Section 11.3.8) for more detail. Further research is needed to establish whether the posterior distribution can provide a useful termination criterion.
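The termination statistic above can be computed cheaply from quantities already available during the iteration. A sketch (function name ours), valid under the $A\Sigma_0A^\top$-orthonormality assumption of Proposition 11:

```python
import numpy as np

def sigma2_m(d, S, r0):
    """Termination statistic sigma_m^2 = tr(Sigma_m Sigma0^{-1}) * nu_m = (d - m) * nu_m."""
    m = S.shape[1]
    nu_m = np.sum((S.T @ r0) ** 2) / m   # data-driven scale estimate
    return (d - m) * nu_m
```

An iteration loop would then stop once `np.sqrt(sigma2_m(d, S, r0))` falls below the user's tolerance.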

Full pseudocode for the BayesCG method, including the termination criterion, is presented in Algorithm 1. Two algebraic simplifications have been exploited here relative to the presentation in the main text; these are described in detail in Section LABEL:supplement:algorithm of the supplement. A Python implementation can be found at github.com/jcockayne/bcg.

## 6 Numerical Results

In this section two numerical studies are presented. First we present a simulation study in which theoretical results are verified. Second we present an application to electrical impedance tomography, a challenging medical imaging technique in which linear systems must be repeatedly solved.

### 6.1 Simulation Study

The first experiment in this section is a simulation study, the goals of which are to empirically examine the convergence properties of BayesCG and to compare the output of the algorithm against the probabilistic approach of Hennig (2015).

For our simulation study, a matrix $A$ was generated by randomly drawing its eigenvalues from an exponential distribution. A sparse, symmetric positive-definite matrix with these eigenvalues was then drawn using the MATLAB function sprandsym. The proportion of non-zero entries was taken to be . Subsequently, a vector $x_*$ was drawn from a reference distribution $\mu_{\mathrm{ref}}$ on $\mathbb{R}^d$, and $b$ was computed as $b = Ax_*$. Throughout, the reference distribution for $x_*$ was taken to be . For this experiment and . In all cases the prior mean was taken to be $x_0 = 0$. The prior covariance was alternately taken to be $\Sigma_0 = I$, $\Sigma_0 = A^{-1}$ and $\Sigma_0 = (P^\top P)^{-1}$, where $P$ was a preconditioner found by computing an incomplete Cholesky decomposition with zero fill-in. This decomposition is simply a Cholesky decomposition in which the (approximate) factor has the same sparsity structure as $A$. The preconditioner $P$ is then given by the product of the incomplete factor with its transpose. The factor can be computed at a computational cost which scales with the number of nonzero entries of $A$. Furthermore, $P$ is cheap to apply because its Cholesky factor is explicit. In addition, the Krylov subspace prior introduced in Section 4.1 has been examined. While it has been noted that computing the ideal parameters for this prior is generally impractical, for this illustrative example they have been computed directly. Additional experimental results which apply the methodology discussed in this section to higher-dimensional problems are presented in LABEL:supplement:pde.
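A dense Python analogue of this test-problem construction can be sketched as follows (MATLAB's sprandsym additionally enforces sparsity; the rate parameter, dimension and standard-normal reference draw below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 100
lam = rng.exponential(scale=1.0, size=d)       # eigenvalues (rate is illustrative)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
A = (Q * lam) @ Q.T                            # A = Q diag(lam) Q^T, SPD by construction
x_star = rng.normal(size=d)                    # illustrative reference draw for x*
b = A @ x_star                                 # right-hand side b = A x*
```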

#### Point Estimation

Figure 1: Convergence in mean of BayesCG (BCG). For several independent test problems, $x_* \sim \mu_{\mathrm{ref}}$, the error $\|x_m - x_*\|_2$ was computed. The standard CG method (top left) was compared to variants of BayesCG (right), corresponding to different prior covariances $\Sigma_0$. The search directions used for BayesCG were either computed sequentially (top right) or in batch (bottom right). For comparison, the a priori optimal search directions for BayesCG are shown in the bottom left panel.

In Figure 1 the convergence of the posterior mean from BayesCG is contrasted with that of the output of CG, for many test problems with a fixed sparse matrix $A$. As expected from the result of Proposition 9, the convergence of the BayesCG mean vector when $\Sigma_0 = I$ is slower than that of CG. In this case, the speed of convergence for BayesCG is gated by $\kappa(A^\top A) = \kappa(A)^2$, which is larger than the corresponding $\kappa(A)$ for CG. The a priori optimal search directions also appear to yield a slower rate than the BayesCG search directions, owing to the fact that they do not exploit knowledge of $b$. Similarly, as expected, the posterior mean when $\Sigma_0 = A^{-1}$ is identical to the estimate for $x_*$ obtained from CG. The fastest rate of convergence was achieved when $\Sigma_0 = (P^\top P)^{-1}$, which provides a strong motivation for using a preconditioner prior when such a preconditioner can be computed, though note that a preconditioned CG method would converge at a yet faster rate, gated by the condition number of the preconditioned system.

In the lower row of Figure 1 the convergence is shown when using batch-computed directions. Here convergence appears to be faster than when using the sequentially-computed directions, at correspondingly higher computational cost. The batch-computed directions provide an exact solution after $d$ iterations, in contrast to the sequentially-computed directions, for which numerical conjugacy may not hold.

Convergence for the Krylov subspace prior introduced in Section 4.1 is plotted in the right-hand column of Figure 1. The size $n$ of the computed subspace was fixed in advance, with the scale and decay parameters of $\Phi$ computed from $\|x_* - x_0\|_A$ and $\kappa(A)$, as these quantities are easily computable in this simplified setting. The remaining parameter, $\varphi$, was set to a small value, to ensure that low prior weight was given to the remaining subspaces. With the sequentially-computed directions, significant numerical instability is observed starting at iteration $n$. This does not occur with the batch-computed directions, where a jump in the convergence rate is seen at this iteration.

#### Posterior Covariance

In this section the full posterior output from BayesCG is evaluated. In Figure 2, the convergence of $\mathrm{tr}(\Sigma_m \Sigma_0^{-1})$ is plotted for the same set of problems just described, to numerically verify the result presented in Proposition 3. Note that while Figure 1 has its $y$-axis on a log-scale, Figure 2 uses a linear scale. It is clear that when the more informative CG or BayesCG search directions are used, the rate of contraction in the posterior mean does not transfer to the posterior covariance. In the remaining columns of the figure, $\mathrm{tr}(\Sigma_m \Sigma_0^{-1})$ appears to contract at a roughly linear rate, in contrast to the exponential rate observed for the posterior mean. This indicates that tightening the bound provided in Proposition 3 is unlikely to be possible. Furthermore, in the last two columns of Figure 2, the impact of numerical non-conjugacy is apparent, as the posterior covariance takes on negative values after sufficiently many iterations.

#### Uncertainty Quantification

We now turn to an assessment of the quality of the uncertainty quantification (UQ) being provided. The same experimental setup was used as in the previous sections; however, rather than running each variant of BayesCG to convergence, we terminated early to ensure that UQ is needed. To avoid the issue of negative covariances seen in Figure 2, the batch-computed search directions were used throughout.

First, the Gaussian version of BayesCG from Proposition 6 was evaluated. To proceed, we used the following argument: when the UQ is well-calibrated, $x_*$ could plausibly be considered as a draw from the posterior distribution $N(x_m, \Sigma_m)$. Note that $\Sigma_m$ is of rank $d - m$, but assessing uncertainty in its null space is not of interest, as in this space $x_*$ has been determined exactly. Since $\Sigma_m$ is positive semidefinite, it has the singular-value decomposition

$$\Sigma_m = U \begin{bmatrix} D & 0_{d-m,\,m} \\ 0_{m,\,d-m} & 0_{m,\,m} \end{bmatrix} U^\top$$

where $0_{i,j}$ denotes an $i \times j$ matrix of zeroes, $D \in \mathbb{R}^{(d-m)\times(d-m)}$ is diagonal and