Nested sampling with plateaus

10/26/2020
by Andrew Fowlie, et al.

It was recently emphasised by Riley (2019) and Schittenhelm & Wacker (2020) that, in the presence of plateaus in the likelihood function, nested sampling (NS) produces faulty estimates of the evidence and posterior densities. After informally explaining the cause of the problem, we present a modified version of NS that handles plateaus and can be applied retrospectively to NS runs from popular NS software using anesthetic. In the modified NS, live points in a plateau are evicted one by one without replacement, with ordinary NS compression of the prior volume after each eviction but taking into account the dynamic number of live points. The live points are replenished once all points in the plateau are removed. We demonstrate it on a number of examples. Since the modification is simple, we propose that it becomes the canonical version of Skilling's NS algorithm.


1 Introduction

Nested sampling (NS) (Skilling, 2004b, 2006) is a popular algorithm for Bayesian inference in cosmology, astrophysics and particle physics. The algorithm handles multimodal and degenerate problems, and returns weighted samples for parameter inference as well as an estimate of the Bayesian evidence for model comparison. It was recently emphasised independently by Riley (2019, appendix B.5.4) and Schittenhelm & Wacker (2020), however, that assumptions in NS are violated by plateaus in the likelihood, that is, regions of parameter space that share the same likelihood. We should not be surprised by subtleties caused by ties in the likelihood, as NS is based on order statistics, and this problem and possible solutions were in fact discussed by Skilling (2004b, 2006). We believe, however, that it was underappreciated prior to Riley (2019) and Schittenhelm & Wacker (2020).

Plateaus most often occur in discrete parameter spaces, in which every possible likelihood value is associated with a finite prior mass, or in physics problems when regions of a model's parameter space make predictions in such severe disagreement with observations that they are assigned a likelihood of zero. For example, in particle physics, parameter points that fail to predict electroweak symmetry breaking would be vetoed, and in cosmology portions of parameter space may be excluded if they result in unphysically large power spectra for the purposes of applying lensing, or if it is impossible to trace a consistent cosmic history (see section XIII.C of Hergt et al. 2020 for more detail). We generically refer to such points as unphysical and to the observation that renders them unphysical as $U$. In fact, within popular implementations of NS, such as MultiNest (Feroz & Hobson, 2008; Feroz et al., 2009, 2013) and PolyChord (Handley et al., 2015a, b), unphysical points can be assigned a likelihood of zero or a prior of zero through the logZero setting: any log-likelihoods below it are treated as if the prior were in fact zero. Statistically, this makes a difference to the evidences and Bayes factors, as it changes whether we consider $U$ to be prior knowledge or information with which we are updating, i.e., whether we wish to compute $p(D \,|\, U, M)$ or $p(D, U \,|\, M)$, where $D$ represents experimental data and $M$ represents a model. The latter is problematic within NS. We consider both cases interesting: on the one hand, taking a basic observation, e.g., electroweak symmetry breaking, as prior knowledge is reasonable; on the other, so is judging models by their ability to predict a basic observation. Although the problem with plateaus can in general lead to faulty posterior distributions as well, plateaus that occur only at zero likelihood do not impact posterior inferences about the parameters of the model. There are, furthermore, realistic situations in which plateaus could occur at non-zero likelihood, e.g., if in some regions of parameter space the likelihood function, or the physical observables on which it depends, are approximated by a constant.

In Riley (2019) and Schittenhelm & Wacker (2020), the problem caused by plateaus was formally demonstrated. After reviewing the relevant aspects of NS in section 2, we instead give an informal explanation of the problem in section 3. We continue in section 4 by proposing a modified NS algorithm that deals with plateaus and reduces to ordinary NS in their absence. We show examples in section 5 before concluding in section 6. An implementation of the modified NS algorithm that can be used to correct evidences and posteriors found from MultiNest and PolyChord runs is available in anesthetic starting from version 2.0.0-beta.2 (Handley, 2019).

2 Nested sampling

To establish our notation and the assumptions in ordinary NS, let us briefly review the NS algorithm. NS works by computing evidence integrals,

$Z = \int \mathcal{L}(\Theta) \, \pi(\Theta) \, \mathrm{d}\Theta,$   (1)

where $\mathcal{L}(\Theta)$ is the so-called likelihood function for the relevant experimental data and $\pi(\Theta)$ is the prior density for the model's parameters, as Riemann sums of a one-dimensional integral,

$Z = \int_0^1 \mathcal{L}(X) \, \mathrm{d}X,$   (2)

where

$X(\lambda) = \int_{\mathcal{L}(\Theta) > \lambda} \pi(\Theta) \, \mathrm{d}\Theta$   (3)

is the prior volume contained within the iso-likelihood contour at $\lambda$, and $\mathcal{L}(X)$ is the inverse of $X(\lambda)$, i.e. $\mathcal{L}(X(\lambda)) = \lambda$. This evidently requires that such an inverse exists over the range of integration.

To tackle the one-dimensional integral, we first sample $n_\text{live}$ points from the prior — the live points. Then, at each iteration of NS, we remove the live point with the worst likelihood, $\mathcal{L}^\star$, and replace it with one drawn from the prior subject to the constraint that $\mathcal{L} > \mathcal{L}^\star$. Thus we remove a sequence of samples of increasing likelihood $\mathcal{L}_i$. In NS we estimate the prior volumes $X_i$ from the properties of how that sequence was generated. Indeed, at each iteration the volume contracts by a factor $t$, where the arithmetic and geometric expectations of $t$ are

$\langle t \rangle = \frac{n_\text{live}}{n_\text{live} + 1}, \qquad \langle \ln t \rangle = -\frac{1}{n_\text{live}}.$

We can then make a statistical estimate of the prior volume at the $i$-th iteration, $X_i \approx e^{-i / n_\text{live}}$. This enables us to compute

$Z \approx \sum_i \mathcal{L}_i \, \Delta X_i, \qquad \Delta X_i = X_{i-1} - X_i.$   (4)

The estimates in section 2 assume that the live points are uniformly distributed in $X$. In Fowlie et al. (2020) we proposed a technique for testing the veracity of this assumption within the context of a numerical implementation such as MultiNest or PolyChord. In the following section we discuss why plateaus violate that assumption.
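
To make the estimates above concrete, the following is a minimal sketch of ordinary NS in Python on a toy problem. The one-dimensional flat prior on [0, 1], the normalised Gaussian likelihood and the plain rejection sampler are our own illustrative choices (correct but inefficient), not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def loglike(theta):
    # normalised Gaussian likelihood so that the true evidence is ~1
    return -0.5 * ((theta - 0.5) / 0.1) ** 2 - np.log(0.1 * np.sqrt(2 * np.pi))

nlive = 100
live = rng.uniform(size=nlive)          # live points drawn from the prior
logL = loglike(live)

Z, X = 0.0, 1.0
for i in range(8 * nlive):              # roughly 8 e-folds of compression
    worst = int(np.argmin(logL))
    X_new = X * nlive / (nlive + 1.0)   # arithmetic compression <t>
    Z += (X - X_new) * np.exp(logL[worst])   # weight times likelihood, eq. (4)
    X = X_new
    Lstar = logL[worst]
    while True:                          # rejection-sample the constrained prior
        theta = rng.uniform()
        if loglike(theta) > Lstar:
            break
    live[worst], logL[worst] = theta, loglike(theta)

Z += X * np.mean(np.exp(logL))           # contribution of the final live points
print(f"Z estimate: {Z:.3f} (exact: ~1)")
```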

3 Plateaus

In Skilling (2006), plateaus were recognised as a problem, since they offer no guidance about the existence or location of points of greater likelihood, and since they cause ambiguity in the ranking of samples by likelihood. The latter problem was addressed by breaking ties with a rankable second property assigned to each live point that is expected to be unique, such as a cryptographic hash of the parameter coordinates or just a random variate from some pre-specified distribution. This was implemented by extending the likelihood to include that tie-breaking second property, $\ell$, suppressed by a tiny factor, $\epsilon$, so that it doesn't impact the evidence estimate,

$\mathcal{L}^\dagger(\Theta) = \mathcal{L}(\Theta) + \epsilon \, \ell(\Theta).$   (5)

It was not stated explicitly in these early works that plateaus violate an assumption in NS, as later formally shown by Riley (2019) and Schittenhelm & Wacker (2020), though it was known and was mentioned in Murray (2007, section 4.4.6). Murray (2007), furthermore, noted that hashes of the parameter coordinates in discrete parameter spaces would not be unique and thus would fail to break ties.
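
As an illustration of eq. 5, a tie-breaking label can be built from a hash of the parameter coordinates. This sketch is one possible realisation, with a made-up suppression factor EPS; it is not the implementation used in any particular code.

```python
import hashlib
import numpy as np

EPS = 1e-12  # tiny suppression factor epsilon in eq. (5)

def tie_break(theta):
    """A pseudo-unique label in [0, 1) from a hash of the coordinates."""
    digest = hashlib.sha256(np.asarray(theta, dtype=float).tobytes()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64

def extended_like(like, theta):
    # eq. (5): extended likelihood = L(theta) + eps * label(theta)
    return like(theta) + EPS * tie_break(theta)
```

In a discrete parameter space two live points can share identical coordinates, in which case a coordinate hash such as the one above fails to break the tie and a random variate must be used instead.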

In the presence of plateaus, the outermost contour could contain more than one point with the same likelihood. When we replace one of the points in the plateau by a point with a greater likelihood, the volume cannot contract at all, since the outermost contour still contains other points at the same likelihood. Once we’ve replaced all the points in the plateau, the volume finally contracts. Crucially, however, it was not possible for any of the replacement points to affect the contraction, as the replacement points could never be the worst point whilst points in the plateau remain in the live points. The latter subtlety changes statistical estimates of the volume. Without plateaus, it is possible that a replacement point is or soon becomes the worst point, slowing the volume contraction. Without that possibility, the volume contracts faster and thus ordinary NS underestimates volume contraction and overestimates the evidence when there are plateaus.

Figure 1: Infographic showing the impact of plateaus on assumptions in NS. We show an ordinary Gaussian (orange) and a modified Gaussian with plateaus in the tails (blue). In the lower right panel, we see that the distribution of $X$ is no longer uniform, breaking assumptions in NS.

The problem is illustrated by fig. 1. We show a Gaussian likelihood (orange) and a modified Gaussian likelihood with a plateau in its tails (blue). In each case we consider a flat prior for the parameter. The upper right panel shows the prior distribution of the likelihood. The plateau manifests as an atom in that distribution at the plateau's likelihood. The lower left panel shows the resulting enclosed prior volume as a function of the likelihood. The plateau causes a jump discontinuity at that likelihood. Finally, the lower right panel shows the distribution of $X$. The plateau leads to an inaccessible region of prior volume and an atom of probability mass at its edge, and so $X$ is not uniformly distributed.

Figure 2: The compression from ordinary and modified NS when $q$ of 100 live points lie in a plateau. The correct linear compression is shown for reference (solid blue).

Indeed, in ordinary NS, if $q$ outermost live points were in a plateau, we would compress by

$\left(\frac{n_\text{live}}{n_\text{live} + 1}\right)^q.$   (6)

We should, however, compress by about

$1 - \frac{q}{n_\text{live}},$   (7)

which is an unbiased estimate of the volume outside the plateau based on binomial statistics (see section 4.1 for further discussion). The compressions are only similar for $q \ll n_\text{live}$, since

$\left(\frac{n_\text{live}}{n_\text{live} + 1}\right)^q \approx e^{-q / n_\text{live}} \approx 1 - \frac{q}{n_\text{live}}.$   (8)

The breakdown in the NS compression in eq. 6 is shown in fig. 2. Note that this problem doesn’t impact importance nested sampling (Feroz et al., 2013), since it does not use the estimated volumes in eq. 6.
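
The divergence between eqs. 6 and 7 is easy to tabulate; a quick check with an arbitrary example value $n_\text{live} = 100$:

```python
nlive = 100
for q in (1, 10, 50, 100):
    ordinary = (nlive / (nlive + 1.0)) ** q  # eq. (6): ordinary NS compression
    unbiased = 1.0 - q / nlive               # eq. (7): unbiased binomial estimate
    print(f"q={q:3d}  ordinary={ordinary:.3f}  unbiased={unbiased:.3f}")
```

The two agree for $q \ll n_\text{live}$, as in eq. 8, but for $q = n_\text{live}$ ordinary NS compresses by about $e^{-1} \approx 0.37$ when it should compress to zero, reproducing the breakdown in fig. 2.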

In fact, the arguments above show that in the presence of plateaus the inverse of $X(\lambda)$, denoted $\mathcal{L}(X)$ in overloaded notation, does not exist for all $X$ (see the lower right panel in fig. 1). As shown in Schittenhelm & Wacker (2020), in this case we should instead consider the generalised inverse

$\mathcal{L}(X) = \sup\{\lambda \,|\, X(\lambda) > X\},$   (9)

with which the evidence may be written as

$Z = \int_0^1 \mathcal{L}(X) \, \mathrm{d}X.$   (10)

We now introduce a modified NS algorithm that correctly computes the evidence even in the presence of plateaus via eq. 10. We summarise the treatment of plateaus in existing popular NS software in table 1.

Code | Handles plateaus | Definition of constrained prior
nestle-0.2.0 (Barbary, 2016) | No |
dynesty-1.0.0 (Speagle, 2020) | No |
DIAMONDS-#76409b2 (Corsaro & De Ridder, 2014) | No |
MultiNest-3.1.2 (Feroz & Hobson, 2008; Feroz et al., 2009, 2013) | No, but compatible with anesthetic |
PolyChord-1.18.2 (Handley et al., 2015a, b) | No, but compatible with anesthetic |
DNest4-0.2.4 (Brewer & Foreman-Mackey, 2016) | Yes, by tie-breaking labels |
Skilling's implementation (Skilling, 2004a) | Yes, by user-specified tie-breaking labels |
Table 1: Comparison of the treatment of plateaus in popular NS software. The example program in Skilling's implementation specifies tie-breaking labels.

4 Modified NS algorithm

In our modification to NS, we remove all live points at the contour $\mathcal{L} = \mathcal{L}^\star$ one by one without replacement, contracting the volume after each removal. If there is a plateau, there may be more than one such point; if not, our algorithm reduces to ordinary NS. After removing all such points, we finally replenish the live points by adding points sampled from the prior subject to $\mathcal{L} > \mathcal{L}^\star$, as usual.

After removing a point, the number of live points drops by one, such that if we were to remove $q$ such points (i.e., if there were $q$ points in the plateau) we would compress by

$\prod_{i=1}^{q} \frac{n_\text{live} - (i - 1)}{n_\text{live} - (i - 1) + 1} = \frac{n_\text{live} + 1 - q}{n_\text{live} + 1} = 1 - \frac{q}{n_\text{live} + 1}$   (11)

if using an arithmetic estimate of the compression, and by

$\sum_{i=1}^{q} \langle \ln t_i \rangle = -\sum_{i=1}^{q} \frac{1}{n_\text{live} - (i - 1)} \approx \ln\left(1 - \frac{q}{n_\text{live}}\right) \quad \text{for } q < n_\text{live}$

if using a geometric one. Thus we find in both cases that the compression from removing $q$ points would be about $1 - q / n_\text{live}$, in agreement with eq. 7, the difference being of order $1 / n_\text{live}$. The difference is noteworthy when $q = n_\text{live}$, in which case the unbiased estimate of the contraction would be zero, but the above estimates are about $1 / n_\text{live}$.

Sample $n_\text{live}$ points from the prior --- the live points;
Let $\mathcal{S}$ be the set of live points;
Set $X_0 = 1$;
Initialise evidence, $Z = 0$;
Initialise iteration, $i = 0$;
while stopping criterion not met do
       Let $\mathcal{L}^\star$ be the minimum likelihood of the live points;
       Let $p$ be the live point with $\mathcal{L} = \mathcal{L}^\star$;
       Increment iteration, $i \to i + 1$;
       Contract volume, $\ln X_i = \ln X_{i-1} - 1 / n_\text{live}$;
       Assign importance weight, $w_i = X_{i-1} - X_i$;
       Increment evidence, $Z \to Z + w_i \mathcal{L}^\star$;
       Remove the point $p$ from $\mathcal{S}$;
       Add a new live point sampled from the prior subject to $\mathcal{L} > \mathcal{L}^\star$;
end while
return Estimate of evidence, $Z$
Algorithm 1 Original NS.
Sample $n_\text{live}$ points from the prior --- the live points;
Let $\mathcal{S}$ be the set of live points;
Set $X_0 = 1$;
Initialise evidence, $Z = 0$;
Initialise iteration, $i = 0$;
while stopping criterion not met do
       Let $\mathcal{L}^\star$ be the minimum likelihood of the live points;
       Let $\mathcal{P}$ be the set of live points with $\mathcal{L} = \mathcal{L}^\star$;
       foreach point in $\mathcal{P}$ do
             Increment iteration, $i \to i + 1$;
             Contract volume, $\ln X_i = \ln X_{i-1} - 1 / |\mathcal{S}|$;
             Assign importance weight, $w_i = X_{i-1} - X_i$;
             Increment evidence, $Z \to Z + w_i \mathcal{L}^\star$;
             Remove the point from $\mathcal{S}$;
       end foreach
      Add new live points sampled from the prior subject to $\mathcal{L} > \mathcal{L}^\star$ until $|\mathcal{S}| = n_\text{live}$;
end while
return Estimate of evidence, $Z$
Algorithm 2 Modified NS.
Figure 3: Algorithm 1, original NS (left) and algorithm 2, modified NS (right). The changes are shown in red. If we were to add a number of new live points other than $n_\text{live}$ in algorithm 2, it would be a dynamic nested sampling algorithm (Higson et al., 2019).

In algorithm 1 and algorithm 2 we show the original and our modified NS, respectively. For concreteness we show the geometric estimator of the compression. We highlight the parts of the algorithm that are changed in red. The simple difference is that whereas in the original NS we replace a single live point, we instead replace all the points sharing the minimum likelihood and contract the volume appropriately. If there are no plateaus, the algorithms are identical.
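
For concreteness, a compact Python sketch of algorithm 2 follows. It assumes a unit-interval prior, draws replacement points by plain rejection sampling, and uses an illustrative stopping criterion; it also assumes the run terminates before the live points are exhausted at the top of the likelihood. None of these choices are those of any production code.

```python
import numpy as np

def modified_ns(loglike, nlive=100, tol=1e-3, rng=None):
    """Sketch of modified NS: evict *all* points sharing the minimum
    likelihood one by one without replacement, compressing with the
    dynamic number of live points, then top the live points back up."""
    if rng is None:
        rng = np.random.default_rng(0)
    live = rng.uniform(size=nlive).tolist()
    logL = [loglike(t) for t in live]
    X, Z = 1.0, 0.0
    while Z == 0.0 or X * np.exp(max(logL)) > tol * Z:
        Lstar = min(logL)
        while Lstar in logL:                # exact ties mark a plateau
            n = len(live)                   # dynamic live-point count
            X_new = X * n / (n + 1.0)       # one-point compression
            Z += (X - X_new) * np.exp(Lstar)
            X = X_new
            j = logL.index(Lstar)
            live.pop(j)                     # evict without replacement
            logL.pop(j)
        while len(live) < nlive:            # replenish the live points
            t = rng.uniform()
            if loglike(t) > Lstar:
                live.append(t)
                logL.append(loglike(t))
    Z += X * np.mean(np.exp(logL))          # final live points
    return Z

# With no plateaus this reduces to ordinary NS; with a plateau, e.g. a
# Gaussian clipped to zero likelihood in its tails, it compresses correctly.
clipped = lambda t: -0.5 * ((t - 0.5) / 0.1) ** 2 if abs(t - 0.5) < 0.3 else -np.inf
print(modified_ns(clipped))
```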

This modified NS algorithm automatically deals with plateaus whenever they occur and can be applied retrospectively to NS runs. It automatically implements the robust NS algorithm of Schittenhelm & Wacker (2020), but with the advantage that it avoids the requirement that we explicitly decompose the evidence integral into a sum of contributions from plateaus and non-trivial contributions to be computed by NS. Other advantages are that our modification fits elegantly into ordinary NS, since the modifications are small, and that it is in the spirit of dynamic NS (Higson et al., 2019), since taking into account the changing number of live points is crucial.

We could, alternatively, evict all live points in the outermost contour, compress by $1 - q / n_\text{live}$, and finally top up the live points. We don't favour this implementation of our idea, as it breaks the equivalent treatment of plateaus and non-plateaus in algorithm 2 and our use of ordinary NS compression factors, but it remains a valid possibility. This would use the same estimates of the plateau size as Schittenhelm & Wacker (2020). Lastly, we could alternatively remove plateaus entirely from the likelihood function by, e.g., using the extended likelihood function in eq. 5. If the tie-breaking terms, $\ell$, are random variates, they must be independent and identically distributed (Riley, 2019, appendix B.5.4.3). This, however, cannot be applied retrospectively to NS runs, and, as discussed in Riley (2019), might lead to problems in specific NS implementations.

Indeed, when using a random variate to break ties in a plateau, a new live point could lie anywhere in the plateau, since, for any point in the plateau, we are equally likely to draw a tie-breaking random variate that leads to acceptance. Sampling from the constrained prior thus requires sampling from the whole plateau and contours above the plateau. This becomes unavoidably inefficient for large plateaus, as almost all tie-breaking random variates would lead to rejection. NS implementations that attempt to increase the sampling efficiency thus easily lead to faulty estimates of the evidence.

For example, in ellipsoidal rejection sampling, as used in MultiNest, one samples from ellipsoids that are constructed to enclose the current live points and with a volume similar to the current estimated volume, $X_i$, so that the ellipsoids typically shrink during an NS run. In the case of plateaus, this could easily lead to sampling from only a subregion of the whole plateau, as the ellipsoids could shrink as we compress through a plateau. In slice sampling, as used in PolyChord, we step out from a current live point to find the likelihood contour. To sample from the whole plateau, we should step out until at least the edge of the plateau. This, however, could fail, as the tie-breaking random variate when stepping out that far could by chance lead to rejection and thus we wouldn't step out far enough. These problems could possibly be avoided by elevating the tie-breaking random variate to a model parameter with a prior. For ellipsoidal sampling in the presence of plateaus, this would allow the ellipsoids to contract only in that dimension and without encroaching on the plateau.

4.1 Distribution of compression factor

In our modified NS, we apply ordinary NS compression factors taking into account the dynamic number of live points. This assumes beta distributions for the compression factor. When dealing with plateaus, Schittenhelm & Wacker (2020) consider estimating the compression using the fact that the number of points inside the plateau should follow a binomial distribution parameterised by the size of the plateau. Let us consider more carefully the differences in estimates of the compression factor from these two approaches. In the latter, the number of live points, $q$, that fall in the outermost contour plateau follows a binomial distribution,

$q \sim B(n_\text{live}, 1 - t),$   (12)

where $1 - t$ is the size of the plateau and thus $t$ is the compression factor. The probability mass function is thus

$P(q \,|\, t) = \binom{n_\text{live}}{q} (1 - t)^q \, t^{\,n_\text{live} - q}$   (13)

for $q$ points in the plateau and $n_\text{live} - q$ points above it. We could make inferences about $t$ by computing a posterior distribution,

$p(t \,|\, q) \propto P(q \,|\, t) \, \pi(t),$   (14)

where $\pi(t)$ is our prior for the size of the non-plateau region. For a flat prior this is in fact the probability density for a $\text{Beta}(n_\text{live} - q + 1, q + 1)$ distribution.

On the other hand, taking the approach in modified NS, the number of points that lie in the plateau, $q$, is no longer treated as a random variable. Instead, the factor $t$ is assumed to follow a beta distribution when $q$ points are removed (Henderson & Goggans, 2014),

$t \sim \text{Beta}(n_\text{live} - q + 1, q).$   (15)

The density for $t$ is

$p(t) = \frac{n_\text{live}!}{(n_\text{live} - q)! \, (q - 1)!} \, t^{\,n_\text{live} - q} (1 - t)^{q - 1},$   (16)

corresponding to $q - 1$ points below $t$, $n_\text{live} - q$ points above it, and one point at $t$.

The fact that we must consider $q - 1$ points below $t$ and one point at $t$, rather than $q$ points inside a plateau, results in a factor of $(1 - t)$ difference between eqs. 16 and 14. A further factor of $\pi(t)$ originates from any prior knowledge about the size of the plateau. If the factors may be neglected, inferences based on ordinary NS compression may be reliable.

With a flat prior for the unknown size of the plateau, the difference is that between a $\text{Beta}(n_\text{live} - q + 1, q)$ and a $\text{Beta}(n_\text{live} - q + 1, q + 1)$ distribution. As shown in fig. 4, the first two moments are similar, with only moderate differences of order $1 / n_\text{live}$, even when $q$ is comparable to $n_\text{live}$. (The fact that the distributions as functions of $t$ are approximately identical is enough to ensure that our inferences are approximately identical; Berger & Wolpert, 1988.) This holds despite the fact that in the binomial case the number of live points in the outermost plateau is a random variable and the size of the plateau is fixed, whereas in the beta case the compression factor is a random variable and $q$ is fixed. This needn't be true for frequentist estimates of the factor $t$. The factor of $(1 - t)$ vanishes entirely for a logarithmic prior for the unknown size of the plateau,

$\pi(t) \propto \frac{1}{1 - t}.$   (17)

Thus we see that our modified NS treatment is approximately the same as the binomial treatment, the full details of which would depend on a prior for the size of the plateau. If the differences are important, though, we could instead use the binomial statistics, as discussed at the end of section 4, and a more careful treatment of the prior.
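
The closeness of the two treatments can be checked directly with scipy; here $n_\text{live}$ and $q$ are arbitrary example values.

```python
from scipy import stats

nlive, q = 10, 5
beta_ns = stats.beta(nlive - q + 1, q)       # eq. (16): modified-NS treatment
beta_bin = stats.beta(nlive - q + 1, q + 1)  # eq. (14) with a flat prior
for name, dist in [("beta (NS)", beta_ns), ("binomial + flat prior", beta_bin)]:
    print(f"{name}: mean={dist.mean():.3f}, std={dist.std():.3f}")
```

Even for this small $n_\text{live}$ and $q = n_\text{live} / 2$, the means and standard deviations differ only at order $1 / n_\text{live}$.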

4.2 Error estimates

The classic NS error estimate,

$\Delta \ln Z \approx \sqrt{\frac{H}{n_\text{live}}},$   (18)

where $H$ is the Shannon entropy between the posterior and prior, assumes the ordinary NS compression and a constant number of live points, and so is not applicable to our modified NS. The anesthetic error estimates, however, already account for a dynamic number of live points, and so are applicable to the modified NS algorithm. The anesthetic error estimates are found (as initially suggested by Skilling, 2006) through simulations; sequences of possible compression factors are drawn from beta distributions and used to make a set of estimates of $\ln Z$. One could alternatively compute an analytic estimate using the equivalents of the arithmetic expressions in section 2 for the variance, which can be found in appendix B of Handley et al. (2015b). Future versions of anesthetic will also support such analytic estimates.
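
A sketch of that simulation approach, with variable names of our own choosing: given the log-likelihoods of the dead points and the (possibly dynamic) number of live points at each death, draw many plausible volume sequences and look at the spread of the implied $\ln Z$.

```python
import numpy as np

def simulated_logZ_error(logL_dead, n_at_death, nsim=1000, rng=None):
    """Mean and standard deviation of ln Z over simulated compressions."""
    if rng is None:
        rng = np.random.default_rng(0)
    logZs = []
    for _ in range(nsim):
        t = rng.beta(n_at_death, 1.0)             # t_i ~ Beta(n_i, 1) per death
        X = np.concatenate(([1.0], np.cumprod(t)))
        w = X[:-1] - X[1:]                        # importance weights
        logZs.append(np.log(np.sum(w * np.exp(logL_dead))))
    return np.mean(logZs), np.std(logZs)
```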

Clearly, however, compression estimates for plateaus suffer from increased uncertainties, since we dynamically reduce the number of live points during the plateau period of an NS run. In fact, if the plateau at $\mathcal{L}^\star$ makes up a large fraction of the current prior volume, the error in the estimated compression could be substantial, as few points lie outside the plateau. Indeed, we show in fig. 4 that the fractional error in the compression blows up as $q \to n_\text{live}$. In such cases, there may exist efficient schemes for dynamically increasing the number of live points to ensure that sufficient points lie above the plateau. For example, in algorithm 2 we must ultimately replenish the live points such that there are $n_\text{live}$ live points at $\mathcal{L} > \mathcal{L}^\star$. We could instead dynamically increase the number of live points immediately prior to the plateau by sampling them from $\mathcal{L} \geq \mathcal{L}^\star$, stopping once $n_\text{live}$ of the points lie at $\mathcal{L} > \mathcal{L}^\star$. Once we remove the points in the plateau, there would be $n_\text{live}$ live points remaining, and no need to top up the live points.

In the worst case scenario, no live points land in contours above the plateau, e.g., the likelihood function shows a tiny peak above a broad plateau. At this point, it would be unclear whether to proceed and, if so, how to do so efficiently, since the live points in the plateau provide no clues about the presence or location of the peak. This problem would affect all NS variants known to us.

Figure 4: Estimates of compression for $q$ of $n_\text{live}$ points in the plateau from the binomial (blue) and beta (red) treatments. The estimates are shown relative to the ones from the binomial. The dashed lines show one standard deviation uncertainties.

5 Examples

We now consider a few examples. First, in section 5.1 we apply our modified NS to examples considered in Schittenhelm & Wacker (2020). Second, in section 5.2 we construct a ‘wedding-cake’ function that exhibits a series of plateaus and check that our modified NS correctly computes the evidence.

5.1 Examples from Schittenhelm & Wacker (2020)

For plateaus at the base of the likelihood function at $\mathcal{L} = 0$, ordinary NS overestimates the evidence by a factor

$\frac{\hat{Z}}{Z} \approx \frac{e^{-f}}{1 - f},$   (19)

where $f$ is the fraction of the prior volume occupied by the plateau. For example, in Scenario 2 of Schittenhelm & Wacker (2020), the plateau occupies a fixed fraction of the prior volume at the base of the likelihood, and the resulting $\Delta \log Z$ is in good agreement with the difference found numerically in Schittenhelm & Wacker (2020). In fig. 5, we check the results of this problem with modified NS in 1000 repeat runs with 500 live points. We find that the histogram of estimates forms a Gaussian peak (red bars) around the analytic result (dashed blue). The spread is well described by the average uncertainty estimates from single runs of modified NS (green). The original NS results (orange), on the other hand, lie well away from the analytic result, as expected.

Figure 5: Repeated modified NS runs on Scenario 2 of Schittenhelm & Wacker (2020). We show the results from 1000 runs (red histogram) and the average mean and uncertainty estimate from the single modified NS runs as a Gaussian (green). For reference, we show the original NS result (orange) and the analytic result (dashed blue).

For plateaus at the peak of the likelihood function at $\mathcal{L} = \mathcal{L}_\text{max}$, the error depends on the stopping conditions and treatment of the final live points in NS. In, e.g., the MultiNest implementation of NS, the run converges if all live points share the same likelihood, and the remaining evidence is correctly accounted for. If the run halts but the remaining evidence isn't accounted for, we underestimate the evidence by a term

$\Delta Z = f \, \mathcal{L}_\text{max}, \qquad \Delta \log Z = \log(Z + \Delta Z) - \log Z,$

where $f$ is the fraction of the prior volume occupied by the plateau. For example, in Scenario 3 of Schittenhelm & Wacker (2020), the plateau occupies a fixed fraction of the prior volume at the peak of the likelihood, and the resulting $\Delta \log Z$ is in good agreement with the difference found numerically in Schittenhelm & Wacker (2020).
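
Both analytic error estimates are one-liners; the fractions and run values below are made-up illustrative numbers, not those of the scenarios in Schittenhelm & Wacker (2020).

```python
import numpy as np

# Plateau at the base (eq. 19): ordinary NS overestimates log Z.
f = 0.3                                    # hypothetical plateau fraction
dlogZ_base = -f - np.log1p(-f)             # log(e^{-f} / (1 - f))
print(f"base plateau: Delta logZ = {dlogZ_base:.4f}")

# Plateau at the peak: evidence missed if the run halts without
# accounting for the remaining live points.
Z, Lmax = 1.0, 2.5                         # hypothetical run values
dlogZ_peak = np.log(Z + f * Lmax) - np.log(Z)
print(f"peak plateau: Delta logZ = {dlogZ_peak:.4f}")
```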

5.2 Wedding cake likelihood

We now construct a semi-analytical example for which we can numerically confirm our approach. Consider an infinite sequence of concentric square plateaus of geometrically decreasing volume $\alpha^i$, for some $0 < \alpha < 1$. The edges of the plateaus lie at

$\|\theta - \tfrac{1}{2}\|_\infty = \frac{\alpha^{i/D}}{2}$   (20)

for $i = 0, 1, 2, \ldots$, where $\|\cdot\|_\infty$ denotes the infinity norm. We define the height of each plateau to have a Gaussian profile:

$\mathcal{L}(\theta) = \sum_{i=0}^{\infty} \mathbb{1}_i(\theta) \, \exp\left(-\frac{\alpha^{2i/D}}{8\sigma^2}\right),$   (21)

where $\mathbb{1}_i$ is an indicator function that, for any given $\theta$, selects a single term in the sum. The resulting likelihood is therefore a set of hypercuboids with a hypercubic base of side length $\alpha^{i/D}$ and height $\exp\left(-\alpha^{2i/D} / 8\sigma^2\right)$. If the base is two-dimensional, this creates a tiered "wedding cake" surface, as can be seen in fig. 6. The $i$ selected by the indicator function is, in fact,

$i(\theta) = \left\lfloor D \log_\alpha\left(2 \, \|\theta - \tfrac{1}{2}\|_\infty\right) \right\rfloor,$   (22)

where $\lfloor \cdot \rfloor$ is the floor function (namely the greatest integer less than or equal to its argument), enabling us to write

$\mathcal{L}(\theta) = \exp\left(-\frac{\alpha^{2 i(\theta)/D}}{8\sigma^2}\right).$   (23)

Given that the volume of the $i$-th region is $\alpha^i (1 - \alpha)$, the evidence can be expressed as

$Z = \sum_{i=0}^{\infty} \alpha^i (1 - \alpha) \exp\left(-\frac{\alpha^{2i/D}}{8\sigma^2}\right),$   (24)

which as an infinite series converges sufficiently rapidly and stably to be evaluated numerically for any number of dimensions $D$. If speed is a requirement, however, a Laplace approximation shows that one only needs to consider the terms in the series around

$i_\star = \frac{D}{2} \log_\alpha\left(4 D \sigma^2\right).$   (25)

For reference, putting all of these equations together, the likelihood can be computed as

$\mathcal{L}(\theta) = \exp\left(-\frac{\alpha^{2 \lfloor D \log_\alpha(2 \|\theta - 1/2\|_\infty) \rfloor / D}}{8\sigma^2}\right),$   (26)

where $\alpha$ is a hyperparameter controlling the depth of the plateaus, and $\sigma$ controls the width of the overall Gaussian profile.
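
A sketch of eqs. (20)-(26) in Python; the default $\alpha$ and $\sigma$ below are arbitrary example values, not necessarily those used in fig. 6.

```python
import numpy as np

def logL_wedding(theta, alpha=0.7, sigma=0.01):
    """Wedding-cake log-likelihood, eqs. (22)-(23)."""
    theta = np.asarray(theta, dtype=float)
    D = theta.size
    r = np.max(np.abs(theta - 0.5))                    # infinity norm
    i = np.floor(D * np.log(2 * r) / np.log(alpha))    # eq. (22)
    return -alpha ** (2 * i / D) / (8 * sigma ** 2)    # eq. (23)

def logZ_wedding(D=2, alpha=0.7, sigma=0.01, nterms=1000):
    """Evidence from the series in eq. (24), summed stably in log space."""
    i = np.arange(nterms)
    logterms = (i * np.log(alpha) + np.log(1 - alpha)
                - alpha ** (2 * i / D) / (8 * sigma ** 2))
    m = logterms.max()                                 # log-sum-exp trick
    return m + np.log(np.sum(np.exp(logterms - m)))

print(logZ_wedding())
```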

This likelihood forms part of the anesthetic test suite, which confirms that the approach suggested in this paper recovers the true evidence.

The wedding cake likelihood can be very useful for testing nested sampling implementations as, unlike a traditional Gaussian, it can be trivially sampled using a simple random number generator, and it has no unexpected edge effects, since the boundaries of the prior are also a likelihood contour.

Figure 6: Top: Example of the wedding cake likelihood function in two dimensions. Bottom: The number of live points used in the calculation of the evidence. The target value is $n_\text{live}$, but in plateaus points are discarded one by one without replacement, causing the number of live points to drop until the plateau is passed. The variability in the number of points discarded can be seen in the varying depth of each local minimum in the number of live points.

6 Conclusions

Following Riley (2019) and Schittenhelm & Wacker (2020), which showed formally why NS breaks down if there are plateaus in the likelihood, we first presented the problem of plateaus in an informal but accessible way. We then constructed a modified version of the NS algorithm, whose simple modification removes points in a plateau one by one, without replacement. Once all points in a plateau are removed, the live points are finally replenished. This leads to the correct compression in the presence of plateaus and reduces to ordinary NS in their absence.

We discussed examples from Schittenhelm & Wacker (2020), shedding light on them by finding the impact of plateaus on ordinary NS in simple analytic formulae. The impact was previously shown only numerically. Lastly, our specially constructed wedding-cake problem showed a case with multiple plateaus. The modified NS algorithm successfully dealt with them.

Runs from popular NS software such as PolyChord and MultiNest may be resummed retrospectively via the modified NS algorithm using anesthetic starting from version 2.0.0-beta.2. Our modified NS makes a minimal change to ordinary NS, to the extent that we recommend it becomes the canonical version of NS.

Acknowledgements

AF was supported by an NSFC Research Fund for International Young Scientists grant 11950410509. WH was supported by a George Southgate visiting fellowship grant from the University of Adelaide, Gonville & Caius College, and STFC IPS grant number ST/T001054/1. We thank Brendon Brewer for valuable comments and discussion on the manuscript.

Data availability

There is little data associated with this paper, though any data or code will be shared on reasonable request to the corresponding author.
