Unreasonable effectiveness of Monte Carlo

Art B. Owen, Stanford University. 01/18/2019.

This is a comment on the article "Probabilistic Integration: A Role in Statistical Computation?" by F.-X. Briol, C. J. Oates, M. Girolami, M. A. Osborne and D. Sejdinovic to appear in Statistical Science. There is a role for statistical computation in numerical integration. However, the competition from incumbent methods looks to be stiffer for this problem than for some of the newer problems being handled by probabilistic numerics. One of the challenges is the unreasonable effectiveness of the central limit theorem. Another is the unreasonable effectiveness of pseudorandom number generators. A third is the common O(n^3) cost of methods based on Gaussian processes. Despite these advantages, the classical methods are weak in places where probabilistic methods could bring an improvement.


1 Introduction

I think that the answer to the question in the authors’ title is ‘yes’, despite some challenges that I will describe. The title of an earlier version at arXiv asked about a ‘role for statisticians in numerical analysis’. There the answer is a resounding ‘yes’. That role for statisticians includes developing Bayesian and frequentist methods, applying them to problems such as integration and approximation, and then using them to get both point estimates and uncertainty quantifications (UQ), such as interval estimates. Statistical ideas for numerical methods have a long history and there are exciting new developments too. Two examples from Cockayne, Oates, Sullivan and Girolami (2017) are: using Bayesian methods to study multiple solutions to Painlevé PDEs, and using those methods to study an entire computational pipeline, taking account of the fact that some steps are cheap to change, some are expensive, and others are completely frozen. Those problems are fascinating and important, underserved by frequentist methods, and I expect to see good progress on them from Bayesian methods in the coming years.

The paper focuses on the use of Bayesian methods to estimate integrals and especially to quantify the uncertainty in those estimates. This looks like tougher going because the incumbent methods have some ‘unreasonable effectiveness’ properties that will be hard to match. After describing those strengths, I will conclude by describing areas where the classical methods are weak, providing an opportunity for probabilistic numerics (PN).

First, a few QMC-related remarks. The finite order weights in Section 5.4.2 build in an assumption that the integrand has no interactions whatsoever of order 3 and up (not just that they are small). This is considered quite risky (Sloan, 2007). Effective dimension is not usually defined as a sum of weights, and that sum need not be smaller than the nominal dimension. For a brief history of effective dimension in QMC, going back to the 1950s, see effdimpreso. The error in higher order digital nets can be reduced by a factor of about $n^{-1/2}$ by scrambling the digits. See Dick (2011) for conditions.

The authors have not seen BMCMC used. Something like that is in the forthcoming paper of Lavine and Hodges (2018). They use unequal weights designed for autocorrelations of the form $\rho^{|i-j|}$ between observations $i$ and $j$. As a result they estimate population means by an unequally weighted sum of sample values. Their weights correspond to BMCMC if there is a first order autoregressive posterior distribution.
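To make that weighting concrete, here is a minimal sketch of my own (not code from Lavine and Hodges) of generalized least squares weights for an AR(1) autocorrelation $\rho^{|i-j|}$: the interior observations all get equal weight while the two endpoints get more.

```python
# Sketch: unequal weights tuned to AR(1) autocorrelation rho^{|i-j|}.
# The BLUE of the mean uses w proportional to Sigma^{-1} 1, normalized to sum to one.
import numpy as np

n, rho = 10, 0.8
idx = np.arange(n)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) correlation matrix
w = np.linalg.solve(Sigma, np.ones(n))               # Sigma^{-1} 1
w /= w.sum()                                         # weights summing to one
print(np.round(w, 4))   # endpoints get about 1/(1-rho) times the interior weight
```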

2 Inferential basis

The numerical approaches to integration that we compare begin by writing the quantity of interest as an expectation $\mu = \mathbb{E}(f(x))$ of a quantity $f(x)$ where $x$ has a probability density $p$. The integral estimates then take the form
$$\hat\mu = \sum_{i=1}^n w_i f(x_i)$$
where the $x_i$ are representative of $p$ in a sense that depends on the method being used. The weights $w_i$ chosen by each method generate estimates $\hat\mu_{\mathrm{MC}}$, $\hat\mu_{\mathrm{QMC}}$, $\hat\mu_{\mathrm{MCMC}}$ and $\hat\mu_{\mathrm{PN}}$, for Monte Carlo, quasi-Monte Carlo, Markov chain Monte Carlo and probabilistic numerics, respectively. For QMC we ordinarily use methods such as those in Devroye (1986) to make desired non-uniform random variables from uniform ones. That is, we arrange for $x_i$ to be a transformation of a point $u_i \in [0,1]^d$, along with any necessary compensating changes to $f$.

For MC, the law of large numbers (LLN), treating the $x_i$ as genuinely random, gives $\hat\mu \to \mu$ with probability one. For MCMC we also use an LLN but we need additional assumptions about how the $x_i$ approach their target distribution and how they mix. In QMC, the $x_i$ are ordinarily deterministic points in $[0,1]^d$. The counterpart to the LLN is that if $f$ is Riemann integrable and the star discrepancy (Niederreiter, 1992) between $x_1,\dots,x_n$ and the uniform distribution on $[0,1]^d$ vanishes, then $\hat\mu \to \mu$ (Niederreiter, 1978). For PN, the present paper proves convergence with probability one under a Gaussian process (GP) model for $f$.

For QMC, the $w_i$ are usually $1/n$. In some MC methods, $w_i$ is a function of $x_i$. For PN and some other MC methods each $w_i$ can depend on all of $x_1,\dots,x_n$. MCMC usually uses equal weights, often after skipping the first few observations and/or thinning to every $k$'th observation. In any of these cases we get $\hat\mu$, a function of both $f$ and $x_1,\dots,x_n$. We then make an error of size $|\hat\mu-\mu|$ and we would like some idea of how large that is.
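As a concrete illustration of this common form (my own sketch, not from the paper), the snippet below computes $\hat\mu = \sum_i w_i f(x_i)$ with plain MC weights $w_i = 1/n$ and with self-normalized importance sampling weights that depend on each $x_i$.

```python
# Sketch of the common estimator form mu_hat = sum_i w_i f(x_i) under two
# weighting schemes; the integrand and proposal are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2                       # example integrand; x ~ U(0,1) gives mu = 1/3

n = 10_000
x = rng.random(n)
w_mc = np.full(n, 1.0 / n)               # plain MC (and QMC-style) equal weights
mu_mc = w_mc @ f(x)

# Self-normalized importance sampling: each weight is a function of its x_i.
q = lambda t: 2.0 * t                    # proposal density on (0, 1)
x_is = np.sqrt(rng.random(n))            # draws from q by inverting its CDF t^2
w_is = 1.0 / q(x_is)
w_is /= w_is.sum()                       # normalize so the weights sum to one
mu_is = w_is @ f(x_is)

print(f"equal weights: {mu_mc:.4f}, importance weights: {mu_is:.4f} (true 1/3)")
```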

What does it mean to know $f$? Diaconis (1988) begins with a closed form expression for a function, and then asks “What does it mean to ‘know’ a function?”. He then discusses Bayesian numerical analysis, cites some historical references and shows how Bayes can recover some well known methods as special cases. His question applies with equal force to the error $\hat\mu-\mu$. In what sense is it known (or unknown) when there is a precise mathematical expression for it?

For MC and MCMC one usually models the $x_i$ as random to get a distribution on $\hat\mu-\mu$. For PN, one models $f$ as random for fixed $x_i$. It seems compelling from a Bayesian point of view to condition on the observed values of the $x_i$, thereby treating them as known and not random. The same argument can be made for $f$. We might view $f$ as a set of bytes describing a computation or, more usefully, as some (usually) smooth function describing a quantity of scientific interest. When computing $\hat\mu$, however, one such $f$ has been chosen, and even if it had been chosen at random, we could reasonably condition on it.

If we condition on both $f$ and $x_1,\dots,x_n$, then $\hat\mu-\mu$ is not random and it is hard to motivate other values it could have taken in order to fill up a confidence interval. One approach is to treat the base measure $p$ as the unknown and develop estimation and UQ methods based on reweighting the sample values. See Tan (2004) for an explanation. The resulting methods are similar to frequentist methods that take $f$ as fixed and $x_i$ as random. The next section compares the interval estimates from MC, QMC and PN.

3 Interval estimates

I consider the interval estimates from Monte Carlo, based on the central limit theorem, to be ‘unreasonably effective’, despite some caveats in Section 6. First, they are computable. Second, they are even more accurate than the estimate is, so we actually know more about our error than we do about the thing we seek to estimate.

In plain MC with $w_i = 1/n$, error estimation is typically based on the central limit theorem. We can get statements like
$$\Pr\Bigl(|\hat\mu-\mu|\le \frac{z_{1-\alpha/2}\,s}{\sqrt{n}}\Bigr) = 1-\alpha + O(1/n), \qquad\qquad (3.1)$$
where $s^2 = \frac{1}{n-1}\sum_{i=1}^n (f(x_i)-\hat\mu)^2$ is computed from $x_1,\dots,x_n$. The error term is $O(n^{-1/2})$ by the central limit theorem, but Edgeworth expansions in Hall (1988) yield the given $O(1/n)$ error term assuming that $f(x)$ has sufficiently many moments and is not supported on points of an arithmetic sequence. Equation (3.1) shows that the error rate in the probability statement is much better than the $O(n^{-1/2})$ error rate in the estimate itself. If we need more accuracy, perhaps because $n$ is small, the bootstrap-$t$ can get two-sided interval estimates with error $O(n^{-2})$ (Hall, 1988) and that calibration is quite good even in tiny samples (elsmallsamp). Other bootstrap methods (Efron and Tibshirani, 1994) can get one-sided interval estimates with error $O(n^{-1})$.
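As a small numerical illustration of (3.1) (my own sketch, with an arbitrary example integrand and the 99% level chosen for concreteness), the following checks the empirical coverage of the CLT interval:

```python
# Estimate mu = E[f(x)] by plain MC, form a 99% interval mu_hat +/- 2.5758*s/sqrt(n),
# and check empirical coverage over many repetitions.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.exp(x)                    # mu = e - 1 for x ~ U(0,1)
mu, n, reps, z = np.e - 1, 1_000, 5_000, 2.5758   # z for a two-sided 99% interval

hits = 0
for _ in range(reps):
    y = f(rng.random(n))
    mu_hat, s = y.mean(), y.std(ddof=1)
    hits += abs(mu_hat - mu) <= z * s / np.sqrt(n)
print(f"empirical coverage {hits / reps:.3f} (nominal 0.99)")
```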

For QMC, the most studied counterpart to (3.1) is the Koksma-Hlawka inequality (see Dick and Pillichshammer, 2010) that gives
$$|\hat\mu-\mu| \le D_n^*(x_1,\dots,x_n)\, V_{\mathrm{HK}}(f), \qquad\qquad (3.2)$$
where $D_n^*$ is the star discrepancy of $x_1,\dots,x_n$ and $V_{\mathrm{HK}}(f)$ is the total variation of $f$ over $[0,1]^d$ in the sense of Hardy and Krause. At first sight (3.2) looks like much better UQ than (3.1) provides for MC. There is no probability involved. Instead we get an absolute upper bound on the error and it holds for any integrand with $V_{\mathrm{HK}}(f)<\infty$. Unfortunately, the bound holding for all such $f$ means it can be extremely conservative for some $f$. Furthermore, $V_{\mathrm{HK}}(f)$ is extremely hard to compute and $D_n^*$ is much harder to get than $\hat\mu$. The upper bound in (3.2) is then a product of two unknowns. The comparison of (3.2) to (3.1) calls to mind a point made by Ronald Fisher by way of George Barnard: in statistical inference, as distinct from mathematical inference, there is a world of difference between the two statements ‘p is true’ and ‘p is known to be true’.

We can quantify uncertainty with (3.1) but not with (3.2). Equation (3.2) remains valuable because it shows that the MC rate can be improved via constructions achieving $D_n^* = O(n^{-1+\epsilon})$ for any $\epsilon>0$.

A counterpart to (3.1) from Bayesian numerical analysis is
$$\Pr\bigl(|\hat\mu_{\mathrm{PN}}-\mu| \le z_{1-\alpha/2}\,\hat\sigma_n\bigr) = 1-\alpha, \qquad\qquad (3.3)$$
where $\hat\mu_{\mathrm{PN}}$ and $\hat\sigma_n^2$ are the posterior mean and variance of $\mu$ over randomness in $f$ given $f(x_1),\dots,f(x_n)$. This also looks better than (3.1) because it has no error term at all. But we have reason to question whether the probabilities in it are well calibrated. The probability statement is ordinarily based on a GP model. It is not an objective Bayes statement because $f$ is not really sampled from the GP. It is not quite a subjective statement either. The choice of GP usually takes into account qualitative properties, such as mean squared differentiability, that are satisfied by many different GPs that we could have chosen. From that set the selected GP is based largely on familiarity and computational feasibility, not just scientific opinion. Equation (3.3) is not anybody’s belief.
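To make the ingredients of (3.3) concrete, here is a minimal Bayesian cubature sketch of my own (not the authors’ code), using a Brownian motion kernel on $[0,1]$ because its kernel integrals have simple closed forms; the $O(n^3)$ linear solves it contains are the cost issue taken up in Section 5.

```python
# Zero-mean GP prior on f with kernel k(x, x') = min(x, x') on [0, 1], conditioned
# on n noiseless evaluations; report the posterior mean and sd of mu = int_0^1 f(x) dx.
import numpy as np

def bayes_cubature(xs, ys):
    n = len(xs)
    K = np.minimum.outer(xs, xs) + 1e-12 * np.eye(n)  # kernel matrix, with jitter
    z = xs - xs**2 / 2                                 # z_i = int_0^1 min(x, x_i) dx
    Kinv_y = np.linalg.solve(K, ys)                    # O(n^3) solve
    Kinv_z = np.linalg.solve(K, z)
    post_mean = z @ Kinv_y
    post_var = 1.0 / 3.0 - z @ Kinv_z                  # int int min(x, x') dx dx' = 1/3
    return post_mean, np.sqrt(max(post_var, 0.0))

f = lambda x: np.sin(2 * np.pi * x) + x                # example integrand, true mu = 1/2
xs = np.linspace(0.05, 0.95, 20)
m, sd = bayes_cubature(xs, f(xs))
print(f"posterior mean {m:.4f}, posterior sd {sd:.4f} (true integral 0.5)")
```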

QMC accuracy can be combined with MC-based error quantification in randomized QMC (RQMC) algorithms. One replicates an $n$ point QMC rule $R$ times independently, and the variability among the $R$ replicate estimates supports an MC-style error estimate. RQMC is surveyed by L’Ecuyer and Lemieux (2002).
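A minimal RQMC sketch (my own, assuming SciPy's scipy.stats.qmc module is available; the integrand is an arbitrary example) shows the pattern: independent scramblings of one Sobol' rule, with a CLT interval formed across the replicates.

```python
# RQMC: R independently scrambled n-point Sobol' rules, error estimated over replicates.
import numpy as np
from scipy.stats import qmc

def g(x):
    # example integrand on [0,1]^2 with known integral (e - 1)^2 ~ 2.9525
    return np.exp(x).prod(axis=1)

n, R, d = 2**10, 30, 2
estimates = []
for r in range(R):
    sob = qmc.Sobol(d=d, scramble=True, seed=r)   # independent scramblings
    x = sob.random(n)                             # n scrambled Sobol' points
    estimates.append(g(x).mean())                 # equal weights w_i = 1/n
estimates = np.array(estimates)
mu_hat = estimates.mean()
se = estimates.std(ddof=1) / np.sqrt(R)           # MC-style error estimate over replicates
print(f"RQMC estimate {mu_hat:.6f} +/- {2.58 * se:.2e} (99% CLT interval over replicates)")
```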

Monte Carlo is unreasonably effective for error estimation but in practice we use pseudo-random numbers. That raises calibration issues due to flaws in the pseudo-random number generators (PRNGs), which we turn to next.

4 Testing and calibration

The numbers coming from a PRNG are meant to simulate a stream of IID random variables but they are not actually random. That seems to place MC methods on the same footing as Bayesian numerical analysis, which treats a non-random $f$ as random.

Random number generators have been the subject of thorough testing for several decades. New results still appear but the Big Crush battery in TestU01 from L’Ecuyer and Simard (2007) seems to have set the standard. Some early PRNGs such as RANDU (Lewis, Goodman and Miller, 1969) had serious flaws but things are much better now. A flaw uncovered by Ferrenberg, Landau and Wong (1992) was prominent enough to make the news. The largest error in their tables was for a quantity whose true value was known to high accuracy. Pierre L’Ecuyer assures me (personal communication) that modern generators are better than the ones used in that paper. Gelman and Shirley (2011) consider an average of roughly 100 independent draws from a posterior distribution, if we could get them, to be sufficient in statistical applications because the numerical error comes on top of a sizeable irreducible statistical error. Vats, Flegal and Jones (2015) think larger samples are needed. However, the point remains that errors from PRNGs not being truly independent and uniform are not a serious problem for those or most other MC applications.

By comparison, calibration for UQ that models $f$ as random is much less developed. There is no ‘big crush’ of problems on which to calibrate Bayesian confidence interval methods (yet). The calibration figures in this paper plot coverage probability versus credibility level. It is encouraging that they show qualitative agreement that grows better with increasing sample size. In applications we would like credible levels in a half open interval just below one, and perhaps at one as well. The credible levels displayed are much lower than that. Calibration at the extreme levels should be automatically correct, so the most interesting results are at the highest intermediate level displayed, which is not high enough for cautious users.

The function $f$ is an infinite dimensional quantity, and data may not ‘swamp the prior’ in such settings (Diaconis and Freedman, 1986). There are some signs in Xu and Stein (2017) that calibration will prove hard for GP models. They consider functions on $[0,1]$. If $f$ is sampled at $n$ equispaced points and one uses a squared exponential covariance model, then they conjecture that the maximum likelihood estimate of the scale parameter fails to converge, instead following a specific asymptotic rate in $n$. This holds theoretically in one case they study and appears to hold empirically in another in their data. A similar thing happened for the easy case, but not the hard case, in the authors’ Figure 9. We might have hoped for the GP parameters to converge to some value, as they would if they were being consistently estimated. Perhaps UQ calibrations can still converge properly in problems where a variance parameter converges to $0$ or diverges to $\infty$, but relying on that is worrisome. Of course $f$ was not drawn from the GP and getting that exact function has probability $0$. On the other hand, any function that we might work with has probability $0$ under the GP and we would want calibration for it.

It is astonishing that PRNGs work as well as they do. In practice floating point arithmetic not being the same as real arithmetic causes more trouble. We all owe a great debt to the people who did the algebra and implementations behind modern PRNGs.

5 Cubedness

The Bayesian approach to estimation and uncertainty quantification (UQ) typically includes a cost component proportional to $n^3$. That is a severe problem for integration methods. If an integration method with error rate $O(n^{-1/2})$ and cost $O(n)$ is to be replaced by a method with cost $O(n^3)$, the new method needs an error rate of $O(n^{-3/2})$ to be competitive (asymptotically). If the method is far from competitive at estimating $\mu$ then its accuracy for UQ becomes much less well motivated. Users will ordinarily, though not universally, choose the method with greater accuracy over one with better UQ.

To illustrate, suppose that plain MC can be run with $N$ observations in the same time that PN with a GP can be run with $n$ observations. Then $N$ is of the order $n^3$. In Figure 6, this would lead us to compare the QMC result at a given $n$ with the BC result at a much smaller, equal-cost value of $n$. If that is the right comparison, then plain QMC is much more accurate than BC. Drawing Figure 6 with computational cost on the horizontal axis could leave it essentially unchanged, or shift the relative positions of the QMC and BC points, or something in between.
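A back-of-the-envelope version of this cost argument (my own illustration, with made-up budgets and unit cost constants) is sketched below: at equal budget, an $O(n^3)$ method affords only about the cube root of the number of points plain MC gets, so it must achieve an $n^{-3/2}$ error rate in its own $n$ just to break even.

```python
# Equal-budget comparison of an O(n) method (plain MC, error ~ n^{-1/2}) with an
# O(n^3) method (GP-based cubature).  Unit costs are set to 1 for illustration.
for C in (10**6, 10**9, 10**12):
    n_mc = C                         # points plain MC can afford at budget C
    n_gp = round(C ** (1 / 3))       # points the O(n^3) method can afford
    err_mc = n_mc ** -0.5            # canonical MC error rate
    err_gp_needed = n_gp ** -1.5     # rate the cubic-cost method needs to match MC
    print(f"budget {C:.0e}: MC error ~ {err_mc:.1e}, "
          f"cubic-cost method with {n_gp} points must reach ~ {err_gp_needed:.1e}")
```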

For MCMC, suppose one uses $n$ observations and an $O(n^3)$ computing budget. A competitor can run that MCMC for $n^2$ observations, discarding a long initial burn-in to get samples closer to the target distribution than the probabilistic numerics method would have. Then the competitor can repeat that process independently some $n$ times to greatly reduce the estimation variance, by a factor like $n^2$. Those replicates can also be used in UQ.

There may be ways to mitigate the cubedness problem, at least for integration of smooth functions over $[0,1]^d$. Jagadeeswaran and Hickernell (2018) reduce the cost to $O(n\log n)$. They do that by choosing the $x_i$ to be certain shifted lattice points and then also using a special covariance kernel that, together with those input points, allows fast transform methods to be used.

6 Conclusions

Hennig, Osborne and Girolami (2015) delivered a call to arms for probabilistic numerical methods as an alternative to classical methods. The classical methods for integration are quite strong, making it a difficult setting in which to score early improvements. Those methods do, however, have weaknesses for integration, and probabilistic methods could make a difference. Some problems in engineering and climate modeling have an $f$ so expensive that the cost of the algebra is much less than the cost of getting even one function evaluation. That removes most or all of the computational advantage of classical methods. Sometimes $f(x)$ has an extremely skewed distribution, as for rare events, weakening the CLT, and we cannot always find a good importance sampler to compensate. It can even happen that $f(x)$ has infinite variance, which makes the frequentist uncertainty quantification problem extremely hard. Peng (2004) has a good solution, but even the best way to handle heavy-tailed problems is not as good as having a light-tailed problem. When the CLT is not available, much of the benefit of good random number generators disappears with it. With those three big advantages of the classical methods gone, we might have to turn to the scientific understanding behind the construction of $f$ to get a better answer. That puts the problem on grounds where Bayes has a big advantage over classical alternatives. These harder problems might not all be suitable for the plain Gaussian process models that are central to probabilistic numerics at present. That’s a good thing, because we need alternatives to those models and new uses will appear for them once the alternatives develop.

I’ll end with another reason for optimism about the probabilistic method. Sacks, Welch, Mitchell and Wynn (1989) find GP models to be more accurate than response surface regressions. In my experience, GP interpolation has seemed unreasonably effective for approximation of functions such as those in the test bed of Surjanovic and Bingham (2014). (I looked for survey articles to cite for this and saw that not everybody had that same experience.) When the GP approximation $\hat f$ is working that well and provides an easily integrable Bayesian approximation to $f$, we can write $f = \hat f + (f-\hat f)$ and integrate the two terms, using MC or RQMC for the second term to get a better calibrated UQ. This decomposition is a classical technique. Ritter (2000) gives it as Proposition II.4 and cites several earlier references.
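A minimal sketch of that decomposition (my own illustration, not the paper’s code, with an arbitrary one-dimensional integrand and a kernel interpolant standing in for a GP posterior mean) integrates $\hat f$ cheaply and uses plain MC with a CLT interval on the residual $f-\hat f$:

```python
# Decomposition f = fhat + (f - fhat): integrate the surrogate fhat cheaply,
# then estimate the small residual by plain MC with a CLT-based interval.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.exp(np.sin(3 * x))                  # example integrand on [0, 1]

# Kernel interpolant (GP-posterior-mean style) built from m design points.
m, ell = 12, 0.15
xd = np.linspace(0.0, 1.0, m)
k = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))
alpha = np.linalg.solve(k(xd, xd) + 1e-10 * np.eye(m), f(xd))
fhat = lambda x: k(x, xd) @ alpha

# Term 1: integral of fhat via a fine midpoint rule (a closed form using the
# error function also exists for this kernel).
mids = (np.arange(100_000) + 0.5) / 100_000
I_fhat = fhat(mids).mean()

# Term 2: plain MC on the residual f - fhat, with a 99% CLT interval.
n = 10_000
x = rng.random(n)
r = f(x) - fhat(x)
I_res, se = r.mean(), r.std(ddof=1) / np.sqrt(n)
print(f"estimate {I_fhat + I_res:.6f} +/- {2.58 * se:.1e}")
```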

Acknowledgements

I thank Michael Stein, Pierre L’Ecuyer, James Johndrow and Michael Lavine for answering some queries. I thank the authors for an interesting paper and the editorial staff of Statistical Science for the opportunity to comment.

References