Comments on the article "A Bayesian conjugate gradient method"

06/24/2019
T. J. Sullivan
Freie Universität Berlin

The recent article "A Bayesian conjugate gradient method" by Cockayne, Oates, Ipsen, and Girolami proposes an approximately Bayesian iterative procedure for the solution of a system of linear equations, based on the conjugate gradient method, that gives a sequence of Gaussian/normal estimates for the exact solution. The purpose of the probabilistic enrichment is that the covariance structure is intended to provide a posterior measure of uncertainty or confidence in the solution mean. This note gives some comments on the article, poses some questions, and suggests directions for further research.


1 Overview

In “A Bayesian conjugate gradient method”, Cockayne, Oates, Ipsen, and Girolami add to the recent body of work that provides probabilistic/inferential perspectives on deterministic numerical tasks and algorithms. In the present work, the authors consider a conjugate gradient (CG) method for the solution of a finite-dimensional linear system $A x^{\star} = b$ for $x^{\star} \in \mathbb{R}^{d}$, given an (assuredly invertible) symmetric matrix $A \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^{d}$, with $d \in \mathbb{N}$.

The authors derive and test an algorithm, BayesCG, that returns a sequence of normal distributions

\[
  \mu_m = \mathcal{N}(x_m, \Sigma_m) \quad \text{on } \mathbb{R}^{d}
\]

for $m = 1, \dots, d$, starting from a prior distribution $\mu_0 = \mathcal{N}(x_0, \Sigma_0)$. This sequence of normal distributions is defined using a recursive relationship similar to that defining the classical CG method, and indeed the BayesCG mean $x_m$ coincides with the output of CG upon choosing $\Sigma_0 = A^{-1}$ — this choice is closely related to what the authors call the “natural prior covariance”, $\Sigma_0 = A^{-1} A^{-\mathsf{T}}$. The distribution $\mu_m$ is intended to be an expression of posterior belief about the true solution $x^{\star}$ to the linear system under the prior belief $\mu_0$, given the first $m$ BayesCG search directions $s_1, \dots, s_m$. Like CG, BayesCG terminates in at most $d$ steps, at which point its mean is the exact solution $x^{\star}$ and it expresses complete confidence in this belief by having $\Sigma_d = 0$.
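For readers who wish to experiment, here is a minimal batch-form sketch, in Python/NumPy, of the Gaussian conditioning that underlies BayesCG as I read it: the prior $\mathcal{N}(x_0, \Sigma_0)$ is conditioned on the $m$ observations $S_m^{\mathsf{T}} A x = S_m^{\mathsf{T}} b$. This is emphatically not the authors' Algorithm 1 (it forms dense $d \times d$ matrices and conditions in a single batch), and the choice of search directions, namely an orthonormal basis of the Krylov space $K_m(A \Sigma_0 A^{\mathsf{T}}, r_0)$, as well as all variable names, are mine. With $\Sigma_0 = A^{-1}$ and $A$ symmetric positive definite, the posterior mean should reproduce the $m$-th CG iterate, in line with the coincidence noted above.

```python
# A minimal, batch-form sketch of the Gaussian conditioning that (as I read
# the paper) underlies BayesCG.  This is NOT the authors' Algorithm 1: it
# forms dense d x d matrices and conditions on all m observations at once.
# The choice of search directions and all names are mine.
import numpy as np


def batch_conditioning(A, b, x0, Sigma0, S):
    """Condition the prior N(x0, Sigma0) on the observations S^T A x = S^T b."""
    M = S.T @ A                                  # observation operator, (m, d)
    gram = M @ Sigma0 @ M.T                      # S^T A Sigma0 A^T S, (m, m)
    gain = Sigma0 @ M.T @ np.linalg.inv(gram)
    mean = x0 + gain @ (S.T @ (b - A @ x0))
    cov = Sigma0 - gain @ M @ Sigma0
    return mean, cov


def krylov_basis(B, r0, m):
    """Orthonormal basis of span{r0, B r0, ..., B^(m-1) r0} by Gram-Schmidt."""
    V = np.empty((r0.size, m))
    v = r0 / np.linalg.norm(r0)
    for j in range(m):
        V[:, j] = v
        v = B @ v
        v -= V[:, : j + 1] @ (V[:, : j + 1].T @ v)
        v /= np.linalg.norm(v)
    return V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m = 20, 5
    Q = rng.standard_normal((d, d))
    A = Q @ Q.T + d * np.eye(d)                  # symmetric positive definite
    b = rng.standard_normal(d)
    x0 = np.zeros(d)

    # With Sigma0 = A^{-1} (A SPD), the posterior mean should agree with the
    # m-th CG iterate, i.e. the Galerkin projection onto K_m(A, r0).
    Sigma0 = np.linalg.inv(A)
    r0 = b - A @ x0
    S = krylov_basis(A @ Sigma0 @ A.T, r0, m)
    mean, cov = batch_conditioning(A, b, x0, Sigma0, S)

    cg_iterate = x0 + S @ np.linalg.solve(S.T @ A @ S, S.T @ r0)
    print("||posterior mean - CG iterate|| =", np.linalg.norm(mean - cg_iterate))
    print("rank of posterior covariance    =", np.linalg.matrix_rank(cov))
```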

The field of probabilistic perspectives on numerical tasks has been enjoying a resurgence of interest in the last few years; see e.g. Oates and Sullivan (2019) for a recent survey of both historical and newer work. The authors’ contribution is a welcome addition to the canon, showing as it does how classical methods (in this case CG; cf. the treatment of Runge–Kutta methods for ODEs by Schober et al. (2014)) can be seen as point estimators of particular instances of inferential procedures. It is particularly encouraging to see contributions coming from authors with both statistical and numerical-analytical expertise, and the possibilities for generalisation and further work are most interesting.

2 Questions and directions for generalisation

The article raises a number of natural questions and directions for generalisation and further investigation, which Cockayne et al. might use their rejoinder to address.

Symmetry and generalisations.

It is interesting that, for the most part, BayesCG does not require that $A$ be symmetric, and works instead with $A \Sigma_0 A^{\mathsf{T}}$. This prompts a question for the authors to consider in their rejoinder: How does BayesCG behave in the case that $A$ is square but not invertible? Could one even show that the BayesCG posterior concentrates, within at most $d$ iterations, to a distribution centred at the minimum-norm solution (in some norm on $\mathbb{R}^{d}$) of the linear system? One motivation for this question is that, if such a result could be established, then BayesCG could likely be applied to rectangular $A \in \mathbb{R}^{d_1 \times d_2}$ and produce a sequence of normally-distributed approximate solutions to the minimum-norm least-squares problem, i.e. a probabilistically-motivated theory of Moore–Penrose pseudo-inverses. (Recall that the Moore–Penrose pseudo-inverse $A^{\dagger}$ can be characterised as the solution operator for the minimum-norm least-squares problem, i.e.,

\[
  A^{\dagger} b := \operatorname*{arg\,min}_{x \in \mathbb{R}^{d_2}} \bigl\{ \, \|x\| \;:\; x \text{ minimises } \|A x - b\| \, \bigr\}
\]

for the usual norms on $\mathbb{R}^{d_1}$ and $\mathbb{R}^{d_2}$.)
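(As a purely illustrative aside, unrelated to anything in the paper itself, the minimum-norm least-squares characterisation is easy to verify numerically: the snippet below checks that NumPy's pseudo-inverse and its minimum-norm least-squares solver agree on a rank-deficient rectangular system.)

```python
# Illustrative check, independent of the paper: the Moore-Penrose
# pseudo-inverse returns the minimum-norm least-squares solution of a
# rank-deficient rectangular system.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 5))  # 8 x 5, rank 3
b = rng.standard_normal(8)

x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimum-norm least-squares solution

print(np.allclose(x_pinv, x_lstsq))              # expected: True
print(np.linalg.norm(A @ x_pinv - b))            # least-squares residual
```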

The authors could also comment on whether they expect BayesCG to generalise easily to infinite-dimensional Hilbert spaces, analogously to CG (Fortuna, 1977, 1979; Málek and Strakoš, 2015), since experience has shown that analysis of infinite-dimensional statistical algorithms can yield powerful dimension-independent algorithms for the finite-dimensional setting (Cotter et al., 2013; Chen et al., 2018). A more esoteric direction for generalisation would be to consider fields other than $\mathbb{R}$. The generalisation of BayesCG to $\mathbb{C}$ would appear to be straightforward via a complex normal prior, but how about fields of finite characteristic?

Conditioning relations.

Could the authors use their rejoinder to provide some additional clarity about the relationship of BayesCG to exact conditioning of the prior normal distribution $\mu_0 = \mathcal{N}(x_0, \Sigma_0)$, and in particular whether BayesCG is exactly replicating the ideal conditioning step, approximating it, or doing something else entirely?

To make this question more precise, fix a weighted inner product on $\mathbb{R}^{d}$, e.g. the Euclidean, $A$-weighted, $A^{\mathsf{T}} A$-weighted, or $\Sigma_0^{-1}$-weighted inner product. With respect to this inner product, let $P_m$ be the (self-adjoint) orthogonal projection onto the Krylov space $K_m := \operatorname{span}\{ s_1, \dots, s_m \}$ and $P_m^{\perp} := I - P_m$ the (self-adjoint) orthogonal projection onto its orthogonal complement. The product measure

\[
  \nu_m := (P_m)_{\#} \delta_{x^{\star}} \otimes (P_m^{\perp})_{\#} \mu_0
\]

on $\mathbb{R}^{d} \cong \operatorname{ran}(P_m) \oplus \operatorname{ran}(P_m^{\perp})$ is also a normal distribution on $\mathbb{R}^{d}$ that expresses complete confidence about the true solution $x^{\star}$ to the linear system in the directions of the Krylov subspace and reverts to the prior in the complementary directions; $\nu_0$ is the prior $\mu_0$, and $\nu_d$ is a Dirac $\delta_{x^{\star}}$ on the truth. The obvious question is, for some choice of weighted inner product, does BayesCG basically output $\nu_m$, but in a clever way? Or is the output of BayesCG something else?
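This question is also easy to probe numerically on small examples. The sketch below is my own construction: it uses the Euclidean inner product, takes $K_m$ to be spanned by the Krylov vectors of $A \Sigma_0 A^{\mathsf{T}}$ applied to $r_0$ (my reading of the span of the first $m$ search directions), and compares the batch-conditioning posterior that I understand BayesCG to compute with the product measure $\nu_m$. It asserts nothing about the answer; it only prints the discrepancy.

```python
# A numerical probe of the question above (my construction, not from the
# paper): compare the batch-conditioning posterior with the product measure
# nu_m built from the Euclidean orthogonal projection onto the Krylov space.
import numpy as np

rng = np.random.default_rng(2)
d, m = 12, 4
A = rng.standard_normal((d, d)) + d * np.eye(d)   # invertible, nonsymmetric
x_true = rng.standard_normal(d)
b = A @ x_true
x0, Sigma0 = np.zeros(d), np.eye(d)

# Assumed Krylov space of the first m search directions: K_m(A Sigma0 A^T, r0).
r0 = b - A @ x0
B = A @ Sigma0 @ A.T
S = np.linalg.qr(np.column_stack(
    [np.linalg.matrix_power(B, j) @ r0 for j in range(m)]))[0]

# Posterior from conditioning the prior on S^T A x = S^T b (batch form).
M = S.T @ A
gain = Sigma0 @ M.T @ np.linalg.inv(M @ Sigma0 @ M.T)
post_mean = x0 + gain @ (S.T @ r0)
post_cov = Sigma0 - gain @ M @ Sigma0

# Product measure nu_m: Dirac at the truth in the Krylov directions
# (Euclidean projection P), prior in the complementary directions.
P = S @ S.T
Q = np.eye(d) - P
nu_mean = P @ x_true + Q @ x0
nu_cov = Q @ Sigma0 @ Q.T

print("difference of means      :", np.linalg.norm(post_mean - nu_mean))
print("difference of covariances:", np.linalg.norm(post_cov - nu_cov))
```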

Uncertainty quantification: cost, quality, and terms of reference.

Cockayne et al. point out that the computational cost of BayesCG is a small multiple (a factor of three) of the cost of CG, and this would indeed be a moderate price to pay for high-quality uncertainty quantification. However, based on the results presented in the paper, the UQ appears to be quite poorly calibrated. One could certainly try to overcome this shortcoming by improving the UQ. However, an alternative approach would be to choose a prior covariance structure that aggressively sparsifies and hence accelerates the linear-algebraic operations that BayesCG performs (particularly lines 7, 9, and 11 of Algorithm 1), while preserving the relatively poor UQ. Does this appear practical, in their view?

In fact, the whole question of “the UQ being well calibrated” is somewhat ill defined. Some might argue that, since $x^{\star}$ is deterministic, there is no scope for uncertainty in this setting. However, it certainly makes sense to ask — as the authors do in Section 6.1 — whether, when the problem setup is randomised, the frequentist coverage of BayesCG lines up with that implied by the randomisation of the problem. The statistic $Z$ that Cockayne et al. introduce is interesting, and already captures several scenarios in which the UQ is well calibrated and several in which it is not; I strongly encourage Cockayne et al. to address in their rejoinder the Bayesian accuracy of BayesCG, e.g. to exhaustively sample the true posterior on $x^{\star}$ given the information used by the first $m$ BayesCG iterations and see whether this empirical distribution is well approximated by the BayesCG distribution $\mu_m$.
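To make the calibration issue concrete, here is a sketch of the kind of check I have in mind, in the same batch-conditioning setting as above and with an illustrative setup of my own choosing. The truth is drawn from the prior, and the statistic $Z = (x^{\star} - x_m)^{\mathsf{T}} \Sigma_m^{\dagger} (x^{\star} - x_m)$ (my reading of the statistic from Section 6.1) is compared with the mean $d - m$ of its nominal $\chi^{2}_{d-m}$ law. For search directions chosen independently of the data, exact conditioning is perfectly calibrated in this sense; for data-dependent (Krylov) directions, as in CG-like methods, the nominal law need not hold, and that mismatch is precisely what such experiments probe.

```python
# Sketch (illustrative setup of my own) of the calibration check described
# above: draw the truth from the prior, condition on m directions, and
# compare the statistic Z = (x* - x_m)^T pinv(Sigma_m) (x* - x_m) with its
# nominal chi-squared(d - m) law.  Data-independent directions should be
# exactly calibrated; data-dependent Krylov directions need not be.
import numpy as np

rng = np.random.default_rng(3)
d, m, n_trials = 10, 4, 2000
A = rng.standard_normal((d, d)) + d * np.eye(d)
x0, Sigma0 = np.zeros(d), np.eye(d)
S_fixed = np.linalg.qr(rng.standard_normal((d, m)))[0]   # data-independent directions


def z_statistic(A, b, x0, Sigma0, S, x_star):
    """Z-statistic of the truth under the batch-conditioning posterior."""
    M = S.T @ A
    gain = Sigma0 @ M.T @ np.linalg.inv(M @ Sigma0 @ M.T)
    x_m = x0 + gain @ (S.T @ (b - A @ x0))
    Sigma_m = Sigma0 - gain @ M @ Sigma0
    err = x_star - x_m
    return err @ np.linalg.pinv(Sigma_m) @ err


Z_fixed, Z_krylov = [], []
for _ in range(n_trials):
    x_star = rng.multivariate_normal(x0, Sigma0)          # truth drawn from the prior
    b = A @ x_star
    r0 = b - A @ x0
    S_kry = np.linalg.qr(np.column_stack(
        [np.linalg.matrix_power(A @ Sigma0 @ A.T, j) @ r0 for j in range(m)]))[0]
    Z_fixed.append(z_statistic(A, b, x0, Sigma0, S_fixed, x_star))
    Z_krylov.append(z_statistic(A, b, x0, Sigma0, S_kry, x_star))

print("nominal mean (d - m)           :", d - m)
print("sample mean, fixed directions  :", np.mean(Z_fixed))
print("sample mean, Krylov directions :", np.mean(Z_krylov))
```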

As a related minor question, can BayesCG be seen as an (approximate Gaussian) filtering scheme for the given prior $\mu_0$ and the filtration associated with the data stream $(s_m^{\mathsf{T}} b)_{m \geq 1}$?

Precision formulation.

As formulated, BayesCG expresses a Gaussian belief about the solution of the linear system in terms of mean and covariance; Gaussian measures can also be expressed in terms of their mean and precision. Do the authors have a sense of whether the BayesCG algorithm can be formulated as a sequence of precision updates, or does the singularity of the covariance in the Krylov directions essentially forbid this?

One motivation for seeking a precision formulation would be to render the “natural prior” of Section 4.1 more tractable, since this prior has an easily-accessible precision while accessing its covariance involves solving the original linear problem.
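For concreteness, my reading of the Section 4.1 covariance, which the authors should correct if it is wrong, is the one that turns $A \Sigma_0 A^{\mathsf{T}}$ into the identity:

\[
  \Sigma_0 = A^{-1} A^{-\mathsf{T}},
  \qquad
  \Sigma_0^{-1} = A^{\mathsf{T}} A ,
\]

so that the precision is available through a single matrix product, whereas the covariance itself requires the action of $A^{-1}$, i.e. the solution of the very linear systems under study.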

As the authors note, their “natural prior” is closely related to one introduced by Owhadi (2015) and applied in Cockayne et al. (2017). It seems that working with images of Gaussian white noise, such as this “natural prior”, is presently producing considerable analytical and computational strides forward (Owhadi, 2017; Chen et al., 2018), and so this seems to be a topic worth further attention in the statistical community as a whole.

3 Minor comments

Rank and trace estimates.

Proposition 3, which bounds the rank of the posterior covariance $\Sigma_m$, seems to miss the point slightly. It would be good to have a companion result identifying the rank exactly (presumably $\operatorname{rank}(\Sigma_m) = d - m$ after $m$ iterations), and a more quantitative result for the trace, e.g. a bound on $\operatorname{tr}(\Sigma_m)$ relative to $\operatorname{tr}(\Sigma_0)$ with a constant depending only on $m$ and $d$.

Posterior and Krylov spaces.

It seems natural to ask whether the posterior nullspace $\operatorname{null}(\Sigma_m)$ coincides with the Krylov space $K_m$. Put another way, is the posterior column space $\operatorname{ran}(\Sigma_m)$ the same as the orthogonal complement of the Krylov space, in the Euclidean or a suitably weighted inner product?
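Both of these minor points can be probed numerically on small examples. The sketch below, again my own batch-conditioning construction with an identity prior covariance rather than the authors' algorithm, prints the rank and trace of $\Sigma_m$ and the principal-angle cosines between $\operatorname{ran}(\Sigma_m)$ and the Euclidean orthogonal complement of the assumed Krylov space. It is a probe, not a proof either way.

```python
# Numerical probe (my construction) of the two minor points above: the rank
# and trace of the posterior covariance, and whether its column space matches
# the Euclidean orthogonal complement of the assumed Krylov space.
import numpy as np

rng = np.random.default_rng(4)
d, m = 12, 5
A = rng.standard_normal((d, d)) + d * np.eye(d)
x_star = rng.standard_normal(d)
b = A @ x_star
x0, Sigma0 = np.zeros(d), np.eye(d)

# Assumed Krylov space of the first m search directions: K_m(A Sigma0 A^T, r0).
r0 = b - A @ x0
B = A @ Sigma0 @ A.T
S = np.linalg.qr(np.column_stack(
    [np.linalg.matrix_power(B, j) @ r0 for j in range(m)]))[0]

# Batch-conditioning posterior covariance.
M = S.T @ A
gain = Sigma0 @ M.T @ np.linalg.inv(M @ Sigma0 @ M.T)
Sigma_m = Sigma0 - gain @ M @ Sigma0

print("rank(Sigma_m) =", np.linalg.matrix_rank(Sigma_m), "   d - m =", d - m)
print("tr(Sigma_m) / tr(Sigma_0) =", np.trace(Sigma_m) / np.trace(Sigma0))

# Principal-angle cosines between ran(Sigma_m) and the Euclidean orthogonal
# complement of span(S): all cosines equal to 1 would mean the spaces coincide.
U, svals, _ = np.linalg.svd(Sigma_m)
col_space = U[:, : np.sum(svals > 1e-10)]
complement = np.linalg.svd(np.eye(d) - S @ S.T)[0][:, : d - m]
cosines = np.linalg.svd(col_space.T @ complement, compute_uv=False)
print("principal-angle cosines:", np.round(cosines, 3))
```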

Square roots.

Is the square root introduced just before Proposition 5 required to be unique? Does it even matter whether a (unique) symmetric positive semidefinite square root or a Cholesky factor is chosen? This is related to the above discussion of symmetry and generalisations.

Interpretation of termination criteria.

In Section 5, the authors refer to “probabilistic termination criteria”. As a matter of semantics, the termination criterion that they propose is in fact deterministic, albeit based on a probabilistic interpretation of the algorithmic quantities.

Acknowledgements

TJS is supported by the Freie Universität Berlin within the Excellence Strategy of the German Research Foundation (DFG), including the ECMath/MATH+ transition project CH-15 and project TrU-2 of the Berlin Mathematics Research Center MATH+ (EXC-2046/1, project ID 390685689).

References