SMC-NS
Unbiased and Consistent Nested Sampling via Sequential Monte Carlo
view repo
We introduce a new class of sequential Monte Carlo methods called Nested Sampling via Sequential Monte Carlo (NS-SMC), which reframes the Nested Sampling method of Skilling (2006) in terms of sequential Monte Carlo techniques. This new framework allows one to obtain provably consistent estimates of marginal likelihood and posterior inferences when Markov chain Monte Carlo (MCMC) is used to produce new samples. An additional benefit is that marginal likelihood estimates are also unbiased. In contrast to NS, the analysis of NS-SMC does not require the (unrealistic) assumption that the simulated samples be independent. As the original NS algorithm is a special case of NS-SMC, this provides insights as to why NS seems to produce accurate estimates despite a typical violation of its assumptions. For applications of NS-SMC, we give advice on tuning MCMC kernels in an automated manner via a preliminary pilot run, and present a new method for appropriately choosing the number of MCMC repeats at each iteration. Finally, a numerical study is conducted where the performance of NS-SMC and temperature-annealed SMC is compared on several challenging and realistic problems. MATLAB code for our experiments is made available at https://github.com/LeahPrice/SMC-NS.
READ FULL TEXT VIEW PDFUnbiased and Consistent Nested Sampling via Sequential Monte Carlo
Nested Sampling (NS) (Skilling, 2006) is based on the identity
(4) |
where is a function mapping from some space to , and . Note that
is simply the tail cdf (survival function) of the random variable
. We denote this survival function by . A simple inversion argument yields(5) |
where is the
–quantile function of the likelihood under
. This simple one-dimensional representation suggests that if one had access to the function , the integral could then be approximated by numerical methods. For example, for a discrete set of values, , one could compute the Riemann sum(6) |
as a (deterministic) approximation of . Unfortunately, the quantile function of interest is often intractable. NS provides an approximate way of performing quadrature such as (6). The core insight underlying NS is as follows. For independent samples from a density of the form
(7) |
we have that
(8) |
Put simply, consider that one has
independent samples distributed according to the prior subject to a given likelihood constraint, and then introduces a new constraint determined by choosing the minimum likelihood value of the samples. This will define a region that has less (unconstrained) prior probability by a factor that has a
distribution. As samples from this new distribution will be compressed into a smaller region of the original prior, (8) is often referred to as a compression factor.With this in mind, Skilling (2006) proposes the NS procedure that proceeds as follows. Initially, a population of independent samples (henceforth called particles) are drawn from . Then, for each iteration , the particle with the smallest value of is identified. This “worst performing” particle at iteration is denoted by and its likelihood value by . Finally, this particle is moved to a new position that is determined by drawing a sample according to . By construction, this procedure results in a population of samples from that is constrained to lie above higher values of at each iteration.
After iterations, we then have . Each corresponds to an unknown satisfying . Skilling proposes to (deterministically) approximate the values by assuming that at each iteration the compression factor (8) is equal to its geometric mean, i.e., . Thus, we have the approximation . This is the most popular implementation, and thus will be the version we consider for the remainder of this paper; however, it is worth noting that there exists another variant which randomly assigns at each iteration, where . With the pairs in hand, the numerical integration is then of the form
(9) |
In practice, the number of iterations is not set in advance, but rather the iterative sampling procedure is repeated until some termination criterion is satisfied. The standard approach is to continue until , where is some small value, say . This choice attempts to ensure that the remaining integral is sufficiently small so that error arising from omission of the final in the quadrature is negligible.
In addition to estimates of the model evidence , estimates of posterior expectations , as in (1), can be obtained by assigning to each the weight , and using
(10) |
as an estimator. A formal justification for this is given in Chopin and Robert (2010, Section 2.2), though in essence it is based on the fact that the numerator and denominator of (10) are (NS) estimators of their corresponding terms in the identity
(11) |
Pseudocode for NS is provided in Algorithm 1.
There are several potential issues with NS that we speculate are the reasons why it has not achieved mainstream adoption in the statistics community. We outline what we believe to be the main four objections to NS below (in decreasing order of severity), as well as a discussion on relevant works that have attempted to address them.
Assumption of Independent Samples. The property (8) requires at each iteration independent samples with the correct distribution. This is a strong condition, as generating samples from constrained densities of the form (7) is in general difficult. The sampling method originally proposed is to move the worst performing particle at each iteration to the position of one of the other particles, and then run an MCMC algorithm for sufficiently many iterations to create an (approximately) independent sample. This procedure itself does not ensure the assumption of independence is satisfied, as it only produces independent samples asymptotically in the number of MCMC iterations. Moreover, for problems with likelihoods that have multiple well-separated modes, the constrained density will have increasingly isolated islands of support as the algorithm progresses, making it difficult for most samplers to cross between modes in any reasonable amount of time. Thus, even approximate independence may be difficult to achieve (and verify) in practice.
Indeed, now over a decade after the introduction of the NS method, establishing consistency when MCMC transitions are used for sampling with NS remains a challenging open problem. Chopin and Robert (2010) remark that “a reason why such a theoretical result seems difficult to establish is that each iteration involves both a different MCMC kernel and a different invariant distribution”. In order to overcome the need for MCMC sampling, they propose a variant of NS for which the sampling can be performed exactly, and that demonstrate it can perform well in low dimensional problems for which is approximately Gaussian.
In a separate attempt to overcome dependency between samples, there is a class of approximate sampling methods called region sampling that attempts to generate independent samples by reparameterizing the problem so the constrained sampling problem is one of sampling uniformly within constrained regions of a unit hypercube. The most popular of these methods is MultiNest (Feroz and Hobson (2008)), which uses the population of particles to construct a region that is a union of hyperellipsoids, sampling from this region, and accepting samples which satisfy the constraint. There is no way however to ensure the proposal region is a superset of the actual region. Buchner (2016)
proposes a method that is more robust (but still not immune) to this problem; however, the results show it can be an order of magnitude more inefficient and is more susceptible to the curse of dimensionality.
Effect of Quadrature on Posterior Inferences. As shown in (10), the ratio of two NS estimators (from a single run) can be used for posterior inferences. However, the precise effect of the use of quadrature in both estimators on estimates of is not well understood. The algorithm replaces the integral of over a (random) shell with a single value, and assigns a volume to that shell according to a geometric expectation. To our knowledge, the only work toward better understanding this unique form of error is that of Higson et al (2018), which quantifies it through bootstrapping techniques.
Parallelization. While NS can be parallelized across runs, NS does not allow one to make use of parallel computing architectures within runs without modifying the algorithm. The most natural way to parallelize NS, first proposed by Burkoff et al (2012) is as follows. If we generalize (8) to consider the –th order statistic instead of simply the minimum , then has a distribution, with expectation . Thus, at each iteration, we can instead remove the points with the lowest likelihood, set , and parallelize the sampling across threads. The approach will not only increase the bias of the algorithm by introducing additional quadrature error, but will also compound the problem mentioned in the previous issue (as a single value will now be used to represent the mean of a larger shell).
Truncation Error. Finally, of lesser concern, yet still worth noting is that NS commits an truncation error (Evans, 2007) in the evidence estimate as a result of not performing quadrature on the entire
interval. A heuristic originally proposed by Skilling, which we call the
filling in procedure is to simply add after termination to the final evidence estimate. However, this is somewhat out of place with the rest of the quadrature. Using point process theory and techniques from the literature on unbiased estimation, Walter (2017)proposes an unbiased version of NS. However, this unbiasedness relies on the assumption of independent sampling, and comes with a cost of additional variance.
As mentioned earlier, all of these potential issues stem from the use of quadrature in NS. Indeed, the combined Monte Carlo/quadrature approach of NS seems to give somewhat of an overall awkwardness to the method. In the next section, we introduce SMC methodology, which we will soon discover allows us to retain the essence of NS, but allay the objections just discussed.
We begin with an introduction to importance sampling, which is the fundamental idea behind SMC. Recall that, in our setting, , where is a known function. For any probability density such that , it holds that
(12) |
where is called the weight function.
This suggests that one can draw and estimate (12) via the ratio
where we call the the normalized weights.
A common measure of the quality of using with regard to approximating is the effective sample size (ESS),
In practice, this can be estimated via
(13) |
see Liu (2001, Chapter 2.5) for a full discussion. Unfortunately, in difficult high-dimensional settings, it is often hard to make a choice of importance sampling density to ensure that the ESS will be high (equivalently, that the variance of the normalized weights will be low).
SMC samplers (Del Moral et al, 2006) extend the idea of importance sampling to a general method for sampling from a sequence of probability densities defined on a common space , as well as estimating their associated normalizing constants in a sequential manner. This is accomplished by obtaining at each time step a collection of random samples (called particles) with associated (normalized) weights , for , such that the weighted empirical measures of the cloud of particles,
(14) |
converge to their corresponding target measures as .
SMC samplers have three main elements:
Mutation. For each iteration , the population of particles are moved from to according to a (forward in time) Markov kernel , for which we denote the associated density . This implicitly defines a new importance sampling density at each iteration via the recursive formula
(15) |
Correction. The weights of the particles are updated via an incremental importance weight function , to ensure the particle system is correctly reweighted with respect to the next target density. This update involves multiplying the current weight of each particle by a corresponding incremental weight.
Selection. The particles are resampled according to their weights, which are then reset to . A variety of resampling schemes can be used (see for example Doucet and Johansen (2011, Section 3.4). However, the simplest is multinomial resampling. Here, the resampled population contains copies of for each , where .
Del Moral et al (2006) show that one can use an arbitrary mutation kernel at each stage, without being required to compute the corresponding importance sampling density at each iteration. This is achieved by introducing an artificial backward (in time) kernel, which transforms the problem into one in the well–understood setting of filtering (for a comprehensive survey, see Doucet and Johansen (2011)). Here, sample paths of the particles’ positions take values on the product space
, with the artificial joint distribution admitting each
(where is the unnormalized density) as a marginal. SMC samplers can be formulated in many different ways. For our purpose, we require SMC samplers for which for is a -invariant MCMC kernel (or several iterations thereof). This approach is most straightforward and related directly to NS. For this case, a suitable choice of the incremental weight function at time (i.e., one that will ensure the convergence (14)) is(16) |
In this setting, the implicit backward kernel will be a good approximation to the optimal backward kernel, provided that and are sufficiently close.
SMC samplers give an approximation of at each iteration via
(17) |
Further to this, at each iteration SMC samplers give estimates of the ratios of normalizing constants
Somewhat remarkably, the estimators are unbiased, i.e. . It follows readily that one can also obtain unbiased estimates of if is known, by including the term when appears in the incremental weights.
(Adaptivity) Introducing any sort of adaptivity into the SMC algorithm, for example resampling only if some criteria is met, choosing the next distribution online, or setting the parameters of according to the particle population, will not necessarily preserve the unbiasedness or convergence properties of the SMC estimators. The analysis of adaptive SMC methods is technically involved. However, there are consistency results for certain adaptive schemes, see for example Del Moral et al (2012b), Beskos et al (2016), and Cérou and Guyader (2016). Of course, one can always first run the algorithm adaptively, save the values of any adaptively chosen parameters, and then rerun the algorithm a second time and with these fixed.
Del Moral et al (2006) provide a strategy for using an SMC sampler to sample from a fixed density by defining a sequence of densities that transition from something that is easy to sample from (for example, the prior density) to . This can be accomplished in a number of ways. We outline the two most common in the SMC literature. One approach (Chopin (2002)) is to define the sequence of target distributions such that at each stage the effect of the likelihood is gradually introduced by considering more data than the last. The second method, first explored by Neal (2001) is called temperature annealing. Here, we have the sequence of densities
(18) |
parametrized by some temperature schedule
where is some initial importance sampling density. In the Bayesian setting, a natural choice is the gradual change from prior to the posterior:
In practice, it is often difficult to make a good choice for the temperature schedule. This can be achieved (approximately) by choosing the next temperature adaptively online according to the criterion of effective sample size (ESS), as proposed by Jasra et al (2011). This ensures successive distributions are sufficiently close. For some , one can approximately maintain an ESS of between successive distributions by choosing the next temperature online so that a given ESS is maintained. In other words, for a collection of particles, we choose (and thus the next density) so that the ESS for the current importance sampling step is equal to some desired amount. Formally stated, for , we solve
(19) |
via the bisection method, for example. Pseudocode for adaptive temperature–annealed SMC (TA–SMC) is given in Algorithm 2.
The similarity between SMC and NS at this point is evident. Both methods draw from some initial distribution (in our case, the prior distribution), and involve traversing a population of particles through a sequence of distributions, which is of an adaptive nature in NS, but may be either adaptive or fixed in SMC. From the outset, it would seem that nested sampling is some sort of SMC algorithm, yet it is distinct in its use of a quadrature rule. Further, it has an interesting point of difference in that NS does not transition from the prior to the posterior.
It turns out, somewhat suprisingly, that this difference is largely a matter of interpretation. Nested Sampling is a special type of adaptive SMC algorithm, where weights are assigned in a suboptimal way. In order to demonstrate this in a straightforward manner, we proceed as follows. We first present a general class of SMC methods called Nested Sampling via Sequential Monte Carlo (NS-SMC) methods. Then, we will proceed to show the correspondence with the original NS method by introducing an adaptive version of NS–SMC, and finally modifying this adaptive version further so that it more closely resembles (and is equivalent as ) to NS.
We begin by considering a set threshold schedule,
(20) |
which in turn parametrizes a sequence of nested sets
via
Next, define the sequence of constrained densities:
(21) |
We now consider directly shells of , via the sets,
Observe that sets form a partition of , for , and that because , we have that . Next, we define a second set of densities, corresponding to constrained versions of to these shells,
With the above in mind, we define a class of SMC Samplers called NS-SMC samplers, that have the following two properties:
Given samples targeting , the importance sampling branches into two paths. One path targets the next constrained prior , while the second targets (and terminates at) the constrained posterior . This branching of importance sampling paths occurs for all but , which proceeds only to . This is illustrated in Figure 1.
The importance sampling procedure just described results in (dependent) SMC samplers which output estimators of , as well as samples that can be used to estimate for each .
The resulting estimators for both and for are used together to estimate (1) via use of the identity
(22) |
For simplicity (and similarity to the original NS method), we consider the case where each is used directly as an importance sampling density for without any further resampling or moving. In such a case, we need only consider an SMC sampler that sequentially targets , because all terms in (22) can be rewritten in terms of expectations with respect to those densities. Thus, NS–SMC can be viewed as an extension to the rare–event SMC (multilevel splitting) method of Cérou et al (2012), which uses density sequences of the form (21) in order to estimate the probability (normalizing constant) .
For ease of presentation, below we use shorthand notation analogously to (17). For example, instead of , we write .
Noting that , we have
(23) |
which is estimated via
(24) |
Note that
(25) |
The final equality above implies that reweighting with respect to the (full) posterior requires that each particle targeting at iteration is assigned the weight
In turn, we have that
(26) |
is an estimator of .
Pseudocode for this version of NS–SMC is given in Algorithm 3. We call this version Fixed NS–SMC (as opposed to adaptive) as one specifies apriori. Note that resampling occurs at each iteration in order to avoid wasting computational effort moving particles with zero weight.
The issue of how to appropriately set will be addressed shortly. However, for now we return to the concerns with NS outlined earlier in Section 3, and note how they are addressed by NS–SMC:
Assumption of Independent Samples. NS-SMC has no requirement that the samples be independent. Moreover, the unbiasedness and consistency properties of Fixed NS–SMC are established in the appendix via a straightforward application of Feynman–Kac formalism.
Effect of Quadrature on Posterior Inferences. All errors in NS–SMC are solely Monte Carlo errors. The analogous error to that of NS in estimating is more natural and occurs as the result of the error in the ratios for .
Parallelization. NS-SMC is easily parellizable without any further modification. After resampling, the move step, which is often the most computationally intensive, can be parallelized at the particle level.
Truncation Error. NS-SMC commits no truncation error as the final density accounts for the interval which is omitted from the NS quadrature. However, it is important to note that the choice of the final threshold will still have an effect on the variance of NS–SMC. Nevertheless, the absence of truncation error is a key factor in allowing NS–SMC to obtain unbiased estimates of .
Generally one does not have a good idea of a choice of that will perform well. In a similar manner to adaptive TA–SMC, at each iteration in Algorithm 3 we can replace with a random threshold that is chosen adaptively. Ensuring an estimated ESS for that is at least simply reduces to choosing to be the quantile of the values . Such a choice in NS–SMC results in the online specification of both and . While is analogous to , we use this notation as it is common in adaptive multilevel splitting algorithms (Botev and Kroese, 2008, 2012; Cérou et al, 2012), where is interpreted as the proportion of particles that one desires to lie above each successive adaptively chosen threshold. For NS–SMC, can be interpreted as the desired proportion of samples with non–zero weight for .
As with NS, ANS–SMC also requires that the iteration at which termination occurs is determined online in some manner. The termination criterion/procedure we use compares the evidence estimate after an iteration with an estimate that would be obtained by instead terminating at that iteration. At each iteration, after computing , we compare the ratio of the two evidence estimates, and if it is greater than , we instead set , and declare . In our examples, we found that the choice was suitable.
For a given adaptive choice of the next threshold , experiments indicate that there is considerably less bias (particularly for small ) in the estimates of if one sets and instead of and .
The purpose of this section is to demonstrate that NS–SMC is not simply the Nested Sampling idea with an SMC twist, but that the original NS algorithm with MCMC is a variant of NS–SMC (with a different, yet suboptimal choice of weights). We follow the original NS sampling scheme more closely and derive an SMC estimator using a similar two-branched importance sampling scheme as illustrated in Figure 1. Specifically, we choose the sequence of distributions adaptively so only one particle lies below the next threshold and conduct our move step in a similar manner to NS. Just as Algorithm 3 can be viewed as an extension to the rare-event SMC algorithm of Cérou et al (2012), the more direct variant of NS we describe here can be viewed as an extension of the static Last Particle Method (LPM) for rare-event simulation of Guyader et al (2011). Unfortunately, there is a lack of theoretical results for the LPM in the setting where MCMC is used (due mainly to the special type of move step, outlined shortly).
We call this method Improved Nested Sampling (INS). The sampling scheme is identical for NS and INS, and thus one can obtain both estimates from the same nested sampling run. Somewhat surprisingly, provided the filling in procedure is used, the NS and INS estimators of model evidence and posterior quantities also become identical as . This provides insight into why NS seems to perform well in practice despite a violation of the independence assumption that underlies its quadrature.
INS is a modified version of ANS–SMC with the following differences:
We enforce for iterations that a single particle has non-zero incremental weight for . That is, like NS, we have only one particle that does not have support on the next constrained version of .
We conduct the resampling and mutation step in a manner that ensures that MCMC is only required to replenish the “worst–performing particle”.
Unfortunately, setting alone in ANS–SMC does not always ensure the first property above, which requires that all particles correspond to a unique value of . In discrete settings it is common for some particles to have the same value of . However, even if for all , there may still be duplicate particles if there is a non–zero probability that the MCMC kernel will return the same point (as is the case in Metropolis–Hastings MCMC).
The solution is reasonably straightforward. We employ auxiliary variables in a similar manner to (Murray, 2007, pgs. 96–98), who proposes a variant of NS that can be applied to discrete spaces. A similar approach is used in Cérou and Guyader (2016) to break ties in the theoretical analysis of adaptive multilevel splitting.
For brevity, we assume the aforementioned condition that for all , which is typically the case for continuous . However, this condition excludes certain cases of what Skilling (2006) refers to as “degenerate likelihoods”. Under this assumption, the approach about to be described is entirely implicit if one does not consider any auxiliary variables, ignores any duplicate particles, and just moves a single particle at each iteration, yielding the same for multiple iterations. However, in discrete cases, one must consider the extended space explicity and conduct the move step differently, see Murray et al (2005) and Murray (2007).
We extend the space from to
via a uniformly distributed variable
. That is, we haveIn this setting, define the augmented threshold schedule:
where is to be understood as either , or that both and .
Applying a similar derivation of NS–SMC to that given earlier in this section, we have the sets
(27) |
and the densities
Note that this setup ensures that the sets are nested and that the sets form a partition.
Prior to demonstrating how INS relates to NS, we stress the following. Due to the special type of mutation step, the algorithm falls outside of the standard SMC sampler framework (which requires the same forward kernel for all particles).
Despite this, we continue to use incremental weight functions of the form (16). While this choice seems to be a natural one (and matches the approach used in the LPM), it implicitly assumes that the INS procedure produces a population of particles that are marginally distributed according to (the adaptively chosen) at each time step. This may only hold approximately in practice, and even establishing that the property holds as is difficult due to the complicated combination of adaptively chosen distributions and non–standard mutation step.
Nevertheless, we present the method for the purpose of making clear the connection of NS with the NS–SMC framework. Moreover, we point out that while our assumption on the marginal distribution of the particles at each iteration is a strong one, it is substantially weaker than that required in the original formulation of NS, which assumes not only the same condition on the marginal distributions of the particles, but also that the particles are independent. Recall that both of these conditions are required for the property (8) to hold.
With the above in mind, we sketch the key aspects of the INS below.
Adaptive Choice of Densities. At each (non–final) iteration, we determine and adaptively (via the choice of of the next threshold parameters and ) as follows. First, we set . Next, denote the indices of the particles satisfying by . We “break ties” by choosing .
Reweighting. Importance sampling takes place for and with the incremental weight functions and , respectively.
By construction, we will have samples with non-zero incremental weight for , giving
Similarly, only one particle, denoted , will have non–zero incremental weight for (and thus non–zero weight for ), so we have
(28) |
Note that (28) is not only but also precisely the weight of with respect to as in (25). Recall that this is also the case with NS.
Resampling and Mutation. We resample according to a residual scheme, and reset all weights to . As we have samples with equal (non-zero) weight, residual resampling will result in unique particle positions , as well as a final particle that is a copy of one of the others.
The mutation step is as follows. We begin by applying the identity map all particles except the –th one, moving to . Then, we perform the following two –invariant moves in sequence to move to .
First, we move to by applying some fixed number of iterations of an –invariant MCMC kernel. Note that
so this is simply sampling from a constrained version of as in standard NS or NS–SMC. Next, we draw from a new position according to
Final Iteration. The reweighting and mutation steps continue up until a termination criteria is satisfied. At this point, we declare , and set the final threshold parameters , . Here, all samples will have non–zero incremental weight for , and we have
(29) |
Note that the above normalizing constant estimator bears similarity to the “filling in” heuristic in NS. However, here it arises naturally as a final step, and uses to estimate , as opposed to .
Despite this similarity, it still appears that (28) is distinct from its analogous term in NS. However, by means of some simple algebraic manipulation, we obtain the identity
which, after rearrangement, reveals that
(30) |
resembling precisely the Riemann sum quadrature rule, with the choice . The most notable aspect of this is that at no stage in the derivation of INS did we require the property given in (8), which would require samples to be independent.
We give this version of NS with the “improved” moniker as the alternative choices for have been found to yield superior estimators of when compared to those proposed originally by Skilling (2006). Guyader et al (2011) show (under the same idealized sampling assumption as NS), that the LPM estimators are unbiased estimators of . Further to this, Walter (2017, Remark 1) shows that this estimator results in superior estimates over in terms of variance so long as . In light of this, Walter suggests using Riemann sum quadrature using these alternative values for as it will result in a superior NS estimator.
The final piece of the puzzle connecting NS with INS and the overall NS–SMC framework, is that as we have that , so NS’s weights become equal to those of INS. We give a simple illustration of this convergence in Figure 2, where we plot the ratio of (30) to the standard NS Riemann sum / trapezoidal rule terms after iterations of NS for different (NS gives identical estimates for for this choice, regardless of ). The convergence will be slower for larger .
This provides some insight as to why NS seems to deliver correct results in practice, even when particles are far from independent (as will soon be demonstrated numerically). In essence, NS is an adaptive SMC algorithm on an extended space, where the choice of weights are, in a sense, sub–optimal.
In order to compare the different variants of NS–SMC and NS, we shall use a phase transition example. Nested sampling is robust to models that contain phase transitions, i.e., models for which the graph of
against is not concave. For a full discussion of the challenges of phase transition phenomena, including why they can be challenging for temperature based methods such as TA-SMC and the power posteriors method of Friel and Pettitt (2008), we refer back to the original NS paper (Skilling, 2006). In a Bayesian context, a phase transition can be understood intuitively as having a likelihood function that is spiked and changes rapidly in certain regions. While this would seem to be a pathological type of behaviour restricted to problems in physics, it is known to occur in statistical settings, see for example Brewer (2014).Similar to Skilling (2006), we consider the estimation of