Inference in complex and high-dimensional statistical models is a very challenging problem that is ubiquitous in applications such as climate informatics (Monteleoni et al., 2013), bioinformatics (Cohen, 2004) and machine learning (Wainwright and Jordan, 2008), to mention a few.
We are interested in sequential Bayesian inference in settings where we have a sequence of posterior distributions that we need to compute. To be specific, we are focusing on settings where the model (or state variable) is high-dimensional, but where there are local dependencies. One example of the type of models we consider is the class of so-called spatio-temporal models (Wikle, 2015; Cressie and Wikle, 2011; Rue and Held, 2005).
Sequential Monte Carlo (SMC) methods comprise one of the most successful methodologies for sequential Bayesian inference. However, SMC struggles in high dimensions and these methods are rarely used for dimensions, say, higher than ten (Rebeschini and van Handel, 2015). The purpose of the nested SMC (NSMC) methodology is to push this limit well beyond the single digits.
The basic strategy is to mimic the behavior of a so-called fully adapted (or locally optimal) SMC algorithm. Full adaptation can drastically improve the efficiency of SMC in high dimensions (Snyder et al., 2015). Unfortunately, it can rarely be implemented in practice since the fully adapted proposal distributions are typically intractable. NSMC addresses this difficulty by requiring only approximate, properly weighted, samples from the proposal distribution. This enables us to use a second layer of SMC to simulate approximately from the proposal. The proper weighting condition ensures the validity of NSMC, thus providing a generalisation of the family of SMC methods. This paper extends preliminary work (Naesseth et al., 2015a) with the ability to handle more expressive models, more informative central limit theorems and convergence proofs, as well as new experiments.
There has been much recent interest in using Monte Carlo methods as nested procedures of other Monte Carlo algorithms. The SMC² and IS² algorithms by Chopin et al. (2013) and Tran et al. (2013), respectively, are algorithms for learning static parameters as well as latent variables. In these methods one SMC/IS method for the parameters is coupled with another SMC method for the latent variables. Chen et al. (2011) and Johansen et al. (2012), on the other hand, address the state inference problem by splitting the state variable into two components and running coupled samplers for these. These methods solve different problems than NSMC, and the “internal” samplers are constructed differently, for approximate marginalization instead of simulation.
By viewing the state inference problem as a sequential problem in the components of the state variable we can make use of the SMC method for general graphical models by Naesseth et al. (2014b). This method is combined with the island particle filter (Vergé et al., 2015), and studied more closely by Beskos et al. (2014) under the name space-time particle filter (STPF). The STPF does not generate an approximation of the fully adapted SMC algorithm. Another key distinction is that in the STPF each particle in the “outer” sampler corresponds to a complete particle system, whereas for NSMC it will correspond to different hypotheses about the latent state as in standard SMC. This leads to lower communication costs and better memory efficiency in distributed implementations. We have also found that NSMC typically outperforms the STPF, even when run on a single machine with matched computing times.
The method proposed by Jaoua et al. (2013) can be viewed as a special case of NSMC in which the nested procedure to generate samples is importance sampling with the proposal being the transition probability. The independent resampling PF (IR-PF) introduced in Lamberti et al. (2016) generates samples in the same way as NSMC with importance sampling, instead of SMC, as the nested procedure. However, IR-PF uses a different weighting that requires both the outer and the inner number of particles to tend to infinity for consistency. Furthermore, we provide results in the supplementary material that show NSMC significantly outperforming IR-PF on an example studied in Lamberti et al. (2016).
There are other SMC-related methods that have been introduced to tackle high-dimensional problems; see the so-called block PF studied by Rebeschini and van Handel (2015), the location particle smoother by Briggs et al. (2013), and various methods reviewed in Djuric and Bugallo (2013). These methods are, however, all inconsistent because they are based on approximations that result in systematic errors.
The concept of proper weighting (or random weights) is not new and has been used in the so-called random weights particle filter (Fearnhead et al., 2010). That method requires exact samples from a proposal but uses a nested Monte Carlo method to unbiasedly estimate the importance weights. In Martino et al. (2016) the authors study proper weighting as a means to perform partial resampling, i.e., resampling only a subset of the particles at each time step. The authors introduce the concept of “unnormalized” proper weighting, which is essentially the same as the proper weighting that was introduced and used to motivate NSMC in Naesseth et al. (2015a). Furthermore, Stern (2015) uses proper weighting and NSMC to solve an inference problem in statistical historical linguistics.
2 Sequential probabilistic models
In statistics, data science and machine learning, probabilistic modeling and Bayesian inference are essential tools for finding underlying patterns and unobserved quantities of interest. To illustrate the nested SMC sampler we will make use of two general classes of sequential probabilistic models, the so-called Markov random field (MRF) and the state space model (SSM). Sequential probabilistic models are in general built up of a sequence of (probabilistic) models that share common random variables and structure. These models will serve to illustrate the usefulness and wide applicability of the method we propose. We are interested in the type of sequential models where the latent variables are fairly high-dimensional. In subsequent sections we will also show explicitly how we can make use of structure between the (latent) random variables to design an efficient sampler that lets us scale to much higher dimensions than possible with standard methods, usually by up to 1–2 orders of magnitude. Note also that NSMC is by no means restricted to the classes of models we illustrate in this section; rather, it can in principle be applied to any sequence of distributions we would like to approximate. We will refer to this sequence of distributions of interest as the target distributions.
2.1 Markov random fields
The Markov random field is a type of undirected probabilistic graphical model (Jordan, 2004). The MRF is typically not represented as a sequence of distributions (or models), but it has previously been shown (Hamze and de Freitas, 2005; Everitt, 2012; Naesseth et al., 2014a, b, 2015a, 2015c; Lindsten et al., 2016) that it can be very useful to artificially introduce a sequence to simplify the inference problem. Furthermore, it is also possible to postulate the model as an MRF that grows with “time”, which is useful in climate science (Fu et al., 2012; Naesseth et al., 2015a). In the exposition below we will first for simplicity assume that we have an MRF of fixed dimension, i.e., the latent variable x is a finite-dimensional multivariate random variable. The conditional independencies of an MRF are described by the structure of the graph G = (V, E), where V is the vertex set and E is the edge set. Given G we can define a joint probability density function for x that incorporates this structure as

p(x | y) = (1/Z(y)) ∏_{i∈V} φ(x_i, y_i) ∏_{(i,j)∈E} ψ(x_i, x_j),   (1)
where y is the observed variable and φ, ψ are called observation and interaction potentials, respectively. The normalization constant Z(y) that ensures that (1) integrates to one is given by

Z(y) = ∫ ∏_{i∈V} φ(x_i, y_i) ∏_{(i,j)∈E} ψ(x_i, x_j) dx.
Note that (1) is usually referred to as a pairwise MRF in the literature, due to factorising into potentials that only depend on pairs of components of the random variable x. For clarity we restrict ourselves to this type; however, the method we propose in this paper can be applied to more general types of graphs, see Naesseth et al. (2014b) for ideas on how to extend inference to non-pairwise MRFs.
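As a concrete illustration, the unnormalized pairwise factorisation can be evaluated in a few lines. The sketch below assumes Gaussian-shaped observation and interaction potentials on a small chain graph; the potentials, parameter names and values are illustrative, not taken from the paper.

```python
def log_unnormalized_mrf(x, y, edges, tau=1.0, lam=1.0):
    """Log of the unnormalized pairwise MRF density, cf. (1), with
    illustrative Gaussian potentials:
      observation:  phi(x_i, y_i) = exp(-tau/2 * (y_i - x_i)^2)
      interaction:  psi(x_i, x_j) = exp(-lam/2 * (x_i - x_j)^2)."""
    logp = sum(-0.5 * tau * (yi - xi) ** 2 for xi, yi in zip(x, y))
    logp += sum(-0.5 * lam * (x[i] - x[j]) ** 2 for (i, j) in edges)
    return logp

# Chain graph on 4 components: V = {0, 1, 2, 3}, E = {(0,1), (1,2), (2,3)}
edges = [(0, 1), (1, 2), (2, 3)]
print(log_unnormalized_mrf([0.0, 0.5, 1.0, 1.5], [0.1, 0.4, 1.1, 1.4], edges))
```

The normalization constant Z(y) is deliberately left out; computing it is exactly the hard part that motivates the sequential decompositions discussed later.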
Now, a sequential MRF is obtained if we consider a random variable x_{1:T} = (x_1, …, x_T), for some T ≥ 1, that factorises according to

p(x_{1:T} | y_{1:T}) = (1/Z_T) ∏_{t=1}^{T} ( ∏_{i∈V} φ(x_{t,i}, y_{t,i}) ∏_{(i,j)∈E} ψ(x_{t,i}, x_{t,j}) ) ∏_{t=2}^{T} ρ(x_{t-1}, x_t),   (2)

where G again encodes the structure of the graphical model and ρ is a new type of interaction potential that links x_{t-1} to x_t. Furthermore, the normalisation constant Z_T is given by the corresponding integral over x_{1:T}. We illustrate a typical example of a sequential MRF in Figure 1. It can amongst other things be used to model spatio-temporal phenomena; it was used by Naesseth et al. (2015a) to detect drought based on annual average precipitation rates collected from various sites in North America and Africa over the last century.
We would like to remark on one peculiarity that arises when the sequential MRF is used to model a spatio-temporal process. Consider the model without measurements as a prior on a spatio-temporal process, i.e., the observation potentials in (2) do not depend on y. In this case the marginals for x_t change depending on the value of T; in general the marginal of x_t under a model of length T differs from that under a model of length T + 1. Typically we would expect that, a priori, what happens in a dynamical process at time t should not be affected by the length of the time series we consider. The next class of models we consider can introduce dependencies in both time and space without giving rise to this counter-intuitive result.
2.2 Spatio-temporal state space models
Before we move on to define the spatio-temporal state space model, we will briefly review SSMs, a comprehensive and important model class commonly used for studying dynamical systems. For a more detailed account, with pointers to the wide range of applications, we refer the reader to Cappé et al. (2005); Douc et al. (2014); Shumway and Stoffer (2010).
In state space models the sequential structure typically enters as a known, or postulated, dynamics on the unobserved latent state x_t that is then partially observed through the measurements y_t. A common definition of SSMs is through the functional form

x_t = f(x_{t-1}, v_t),
y_t = g(x_t, e_t),   (3)

where v_t and e_t, often called process and measurement noise, respectively, are random variables with some given distributions. Furthermore, the initial state x_1 is a random variable with some given initial distribution. For simplicity we will assume that both v_t ↦ f(x_{t-1}, v_t) and e_t ↦ g(x_t, e_t) are bijective and continuously differentiable. Then by the transformation theorem we can equivalently express (3) through the corresponding probability density functions (PDFs)
for the transition and observation, and we define the sequential probabilistic model (or target distribution) as follows:

π_t(x_{1:t}) = p(x_{1:t} | y_{1:t}) ∝ p(x_1) p(y_1 | x_1) ∏_{s=2}^{t} p(x_s | x_{s-1}) p(y_s | x_s).
We will assume that these densities are available and can be evaluated pointwise. This condition is often satisfied in practical applications.
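To make the SSM construction concrete, here is a minimal sketch that simulates a scalar linear-Gaussian SSM and evaluates its joint density pointwise. The model and all parameter values (a, q, r) are illustrative assumptions, chosen only because every density is available in closed form.

```python
import math
import random

def simulate_ssm(T, a=0.9, q=1.0, r=0.5, seed=0):
    """Simulate an illustrative scalar linear-Gaussian SSM:
    x_t = a * x_{t-1} + v_t,  v_t ~ N(0, q)
    y_t = x_t + e_t,          e_t ~ N(0, r)."""
    rng = random.Random(seed)
    x, xs, ys = rng.gauss(0.0, math.sqrt(q)), [], []
    for _ in range(T):
        xs.append(x)
        ys.append(x + rng.gauss(0.0, math.sqrt(r)))
        x = a * x + rng.gauss(0.0, math.sqrt(q))
    return xs, ys

def log_norm(z, m, var):
    """Log-density of N(z; m, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (z - m) ** 2 / var)

def log_joint(xs, ys, a=0.9, q=1.0, r=0.5):
    """Pointwise evaluation of log p(x_{1:T}, y_{1:T})."""
    lp = log_norm(xs[0], 0.0, q)  # initial distribution
    for t in range(1, len(xs)):
        lp += log_norm(xs[t], a * xs[t - 1], q)  # transition density
    for xt, yt in zip(xs, ys):
        lp += log_norm(yt, xt, r)  # observation density
    return lp
```

Here pointwise evaluability of the joint density is immediate; in the spatio-temporal models considered next, it is only the unnormalized factors that are assumed evaluable.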
A typical assumption when using the SSM to model spatio-temporal systems is to introduce the spatial dependency only between time steps t - 1 and t, see the paper by Wikle and Hooten (2010). This can be achieved by defining a model such that the induced transition distribution, conditionally on x_{t-1}, completely factorizes over the components of x_t; see also Rebeschini and van Handel (2015), where SMC applied to such a model is studied. Here we will study the case where we introduce spatial dependencies within each time step through the disturbance term v_t. We define the spatio-temporal SSM as a combination of the functional and PDF representations of an SSM, where the distribution for v_t is given by an MRF as in (1).
We make no assumptions on local dependencies between x_{t-1} and x_t; however, to keep it simple we will assume that the graph describing the distribution for v_t does not depend on time t. Furthermore, we will in this paper mainly consider models where the dependencies between the components of x_t are “few”, i.e., the graph is sparse with few elements in E, and where each component of x_t in (3) only depends on a subset of the components of x_{t-1}. To illustrate the dependency structure in a spatio-temporal SSM we propose a combination of the traditional undirected graph for the MRF and the directed acyclic graph for the SSM, see Figure 2.
This allows us to model more complex dynamical processes than Naesseth et al. (2015a), who assumed that the model factorized with only local dependencies between the components of x_t. Furthermore, we can clearly see that the peculiarity discussed in Section 2.1 is not present in this model; the marginal of the prior does not change with T, as expected.
3 Nested Sequential Monte Carlo Methods
Inference in sequential probabilistic models essentially boils down to computing the target distribution π_t for each time t; typically an intractable problem with no analytical or numerically efficient solution. This means that we have to resort to approximations. In this paper we focus on one particularly successful solution to the problem, the so-called sequential Monte Carlo family of algorithms, first introduced in the papers by Gordon et al. (1993); Stewart and McCarty (1992); Kitagawa (1996).
The basic idea with SMC is to move a set of N weighted samples (particles) approximating π_{t-1} to a new set of particles which approximates π_t. These samples define an empirical approximation of the target distribution,

π_t^N(dx_{1:t}) = ∑_{i=1}^{N} w̄_t^i δ_{x_{1:t}^i}(dx_{1:t}),

where the w̄_t^i are the normalized weights and δ_x is a Dirac measure at x. In the next section we will detail an especially efficient way of moving the particles, known as fully adapted SMC (Pitt and Shephard, 1999), ensuring that all normalized weights are equal to 1/N.
3.1 Fully Adapted Sequential Monte Carlo
The procedure for moving the particles and their weights from time t - 1 to t in any SMC sampler is typically done in a three-stage approach. The first stage, resampling, stochastically chooses particles at time t - 1 that seem promising, discarding low-weighted ones. The second stage, propagation, generates new samples for time t conditioned on the resampled particles. The final stage, weighting, corrects for the discrepancy between the target distribution and the proposal, i.e., the instrumental distribution used in the propagation step.
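The three stages can be sketched in a few lines for the simplest (bootstrap) choice of proposal, which propagates through the dynamics and weights by the likelihood. This toy filter for a scalar linear-Gaussian model is an illustrative assumption for exposition, not the fully adapted sampler discussed next.

```python
import math
import random

def bootstrap_pf(ys, N=500, a=0.9, q=1.0, r=0.5, seed=1):
    """Minimal bootstrap particle filter for the illustrative scalar model
    x_t = a * x_{t-1} + N(0, q), y_t = x_t + N(0, r). Each iteration
    performs the three stages (weighting, resampling, propagation) and
    records the filtering mean."""
    rng = random.Random(seed)
    parts = [rng.gauss(0.0, math.sqrt(q)) for _ in range(N)]
    means = []
    for y in ys:
        # weighting: importance weights from the likelihood of y
        logw = [-0.5 * (y - x) ** 2 / r for x in parts]
        mx = max(logw)
        w = [math.exp(lw - mx) for lw in logw]
        s = sum(w)
        w = [wi / s for wi in w]
        means.append(sum(wi * x for wi, x in zip(w, parts)))
        # resampling: draw N ancestors proportionally to the weights
        parts = rng.choices(parts, weights=w, k=N)
        # propagation: move the survivors through the dynamics
        parts = [a * x + rng.gauss(0.0, math.sqrt(q)) for x in parts]
    return means
```

The bootstrap proposal ignores the current measurement when propagating, which is exactly what full adaptation improves upon.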
Fully adapted SMC (Pitt and Shephard, 1999) makes specific choices of the resampling weights and the proposal such that all the importance weights are equal. By introducing ancestor indices a_t^i, we can describe the resampling step by simulating N times from the distribution over ancestor indices defined by the normalized resampling weights. Propagation then follows by simulating x_t^i conditional on the resampled ancestor, for i = 1, …, N, as follows
This proposal is sometimes referred to as the (locally) optimal proposal because it minimizes the incremental variance of the importance weights. Weighting is easy since all weights are equal; the unnormalized weights are all set to 1. The fully adapted sampler in fact corresponds to a locally optimal choice of both resampling weights and proposal, with an incremental variance in the importance weights that is zero.
Note that in most cases it is impossible to implement this algorithm exactly, since we cannot calculate the resampling weights and/or simulate from the proposal. Nested SMC solves this by requiring only approximate resampling weights and approximate samples from the proposal, in a sense that is formalized in Section 3.3. However, we will start by detailing some specific cases when we can efficiently implement exact fully adapted SMC. These cases are of interest in themselves, but here we will use them to build intuition for how the approximations in NSMC are constructed.
3.2 Forward Filtering–Backward Simulation
The problems we need to solve are those of computing the resampling weights and simulating from the proposal efficiently, in such a way that the computational complexity is controlled. There are at least two important special cases when we can use fully adapted SMC. The first is if the state space is discrete and finite. Even though exact algorithms are known in this case (Cappé et al., 2005), the computational complexity typically scales quadratically with the cardinality of the state space, thus SMC methods can still be of interest (Fearnhead and Clifford, 2003; Naesseth et al., 2014a, 2015a). The second case is if the proposal is an unnormalized Gaussian distribution; in the spatio-temporal SSM this corresponds to a linear transition function x_t = A x_{t-1} + v_t for some matrix A, a Gaussian measurement model with some covariance matrix, and an MRF in the components of v_t where all pairwise potentials are Gaussian.
Now, even though in principle fully adapted SMC is available in these special cases, the computational complexity can be prohibitive; in general it grows exponentially in the dimension for the finite state space case and cubically for the Gaussian case. However, when there are local dependencies it is possible to make use of an underlying chain (or tree) structure, as proposed by Naesseth et al. (2014a) for the finite state space case, to obtain efficient implementations whose complexity is linear in the dimension. This approach makes use of forward filtering–backward simulation (sampling), from Carter and Kohn (1994); Frühwirth-Schnatter (1994), on the components of x_t to compute the resampling weights and sample exactly. Let us as an example consider the spatio-temporal SSM above with a linear transition function and the Gaussian MRF given by
for some positive constants. Then straightforward computations give the proposal and resampling weights
However, an equivalent way to simulate from this distribution and calculate the resampling weights is given below
Due to the structure in the model we can see that the distribution to sample from corresponds to a Gaussian with a chain structure in the components of x_t (Figure 2)
Because of this structure we can efficiently compute the normalization constant of (10) by means of “forward” filtering, keeping track of the incremental contributions to the normalization constant. Sampling from the distribution is then done by an explicit “backward” pass, simulating the components one at a time in reverse order. We provide an illustration of the process in Figure 3. See also Naesseth et al. (2014a) for an example of how this is done in practice for a discrete state space.
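The forward filtering–backward sampling idea is easiest to see in the discrete-state case mentioned above: a single forward pass over the components yields the normalization constant, and a backward pass draws an exact sample. The sketch below is an illustrative implementation for a chain of discrete components with generic potentials `phi` and `psi` (names assumed, not from the paper).

```python
import random

def ffbs_discrete(phi, psi, seed=0):
    """Exact forward filtering--backward sampling on a chain of K discrete
    components (a discrete-state analogue of the Gaussian chain example).
    phi[k][s] is the observation potential of component k in state s and
    psi[s][sp] the interaction potential between adjacent components.
    Returns the normalization constant Z and one exact sample x_{1:K}."""
    K, S = len(phi), len(phi[0])
    # forward pass: alpha[k][s] accumulates the unnormalized filtering mass
    alpha = [phi[0][:]]
    for k in range(1, K):
        prev = alpha[-1]
        alpha.append([phi[k][s] * sum(prev[sp] * psi[sp][s] for sp in range(S))
                      for s in range(S)])
    Z = sum(alpha[-1])  # normalization constant from the final forward step
    # backward pass: sample the last component, then walk backwards
    rng = random.Random(seed)
    x = [None] * K
    x[-1] = rng.choices(range(S), weights=alpha[-1], k=1)[0]
    for k in range(K - 2, -1, -1):
        w = [alpha[k][s] * psi[s][x[k + 1]] for s in range(S)]
        x[k] = rng.choices(range(S), weights=w, k=1)[0]
    return Z, x
```

The cost is linear in the number of components K (and quadratic in the per-component state cardinality S), instead of exponential in K.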
The main idea behind nested SMC is to emulate this behavior for arbitrary sequential probabilistic models. Because computing the resampling weights and simulating from the proposal exactly is intractable in general, we propose to run an SMC-based forward filtering–backward simulation (Godsill et al., 2004; Lindsten and Schön, 2013) method on the components of x_t (or v_t) to approximate the resampling weights and the draws from the proposal.
3.3 Nested Sequential Monte Carlo
One way to think of the nested SMC family of methods is as an exact approximation (Andrieu et al., 2010) of an SMC algorithm with resampling weights and proposal given as in fully adapted SMC. Instead of exactly evaluating each resampling weight, we run a nested (or internal) SMC sampler with M particles, for each i = 1, …, N, on the components of x_t (or v_t), with the final target equal to the fully adapted proposal, to mimic the exact forward filtering procedure. The normalization constant estimates from these internal filters give us unbiased approximations of the resampling weights, which we use to perform the resampling step. The resampling step not only selects the ancestors, but we also resample the complete internal state of the nested samplers, which will be used for the propagation step. Lastly we simulate new particles by running a backward simulation procedure (Godsill et al., 2004; Lindsten and Schön, 2013) using the resampled internal samplers to mimic the exact backward sampling described above. More formally, one step from iteration t - 1 to t of the NSMC method proceeds as follows.
Given an unweighted particle set approximating π_{t-1}, we generate the internal states by simulating each internal sampler (forward filtering). Here, the law of the internal sampler denotes the joint distribution of all random variables generated by it. Then we extract an estimate of the resampling weights, given by a function of the internal state chosen such that
This is the normalization constant estimate at the final step of the internal samplers, where the target is equal to the fully adapted proposal, and then (11) is satisfied by known unbiasedness properties of SMC (Del Moral, 2004, Proposition 7.4.1). We now proceed to resample the internal samplers, generating ancestor variables such that
which concludes the resampling step.
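The resampling step relies on the unbiasedness of the internal normalization constant estimate. The toy below illustrates this property in its simplest form, using plain importance sampling on a one-dimensional Gaussian target with a known normalization constant; the target, proposal and all constants are illustrative assumptions.

```python
import math
import random

def z_hat(M, rng):
    """One importance-sampling estimate of the normalization constant of the
    unnormalized target gamma(x) = Z_true * N(x; 1, 1), using a N(0, 2^2)
    proposal. Toy setup with Z_true = 2; the internal SMC normalization
    constant estimate enjoys the same unbiasedness property."""
    Z_true, m = 2.0, 1.0
    tot = 0.0
    for _ in range(M):
        x = rng.gauss(0.0, 2.0)
        log_gamma = math.log(Z_true) - 0.5 * math.log(2 * math.pi) - 0.5 * (x - m) ** 2
        log_q = -0.5 * math.log(2 * math.pi * 4.0) - 0.5 * x * x / 4.0
        tot += math.exp(log_gamma - log_q)  # importance weight gamma(x)/q(x)
    return tot / M

# Averaging many independent estimates recovers Z_true = 2 (unbiasedness).
rng = random.Random(0)
est = sum(z_hat(100, rng) for _ in range(200)) / 200
```

Crucially, unbiasedness holds for any number of samples M, which is why the outer sampler remains valid for a finite internal particle count.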
Next, for propagation, we generate samples by backward sampling from a distribution satisfying the following condition
This distribution can be realized by running backward simulation; however, a simple alternative that also satisfies (13) is to sample from the corresponding empirical distribution induced by the internal sampler. We discuss these choices further in the next section.
Finally, we append the new samples to the resampled trajectories and have thus obtained a new set of unweighted particles approximating π_t.
We say that the (random) pair (x, w) is properly weighted for the (unnormalized) distribution γ if w is nonnegative and, for all measurable functions f,

E[w f(x)] = c ∫ f(x) γ(x) dx,

for some positive constant c that is independent of the x’s and w’s.
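The definition can be checked numerically on a toy example: draw x from a proposal and set w to the ratio of the unnormalized target to the proposal density; the average of w f(x) then matches the unnormalized integral with c = 1. All names and numbers below are illustrative assumptions.

```python
import random

# Proper-weighting sanity check on a discrete toy target.
# gamma is an unnormalized target over {0, 1, 2}; q is a uniform proposal.
gamma = {0: 1.0, 1: 3.0, 2: 6.0}
q = 1.0 / 3.0

def draw_properly_weighted(rng):
    """Draw x ~ q and return (x, w) with w = gamma(x)/q(x); this pair is
    properly weighted for gamma with constant c = 1."""
    x = rng.randrange(3)
    return x, gamma[x] / q

rng = random.Random(0)
M = 200_000
tot = 0.0
for _ in range(M):
    x, w = draw_properly_weighted(rng)
    tot += w * x  # test function f(x) = x
est = tot / M
exact = sum(x * g for x, g in gamma.items())  # 0*1 + 1*3 + 2*6 = 15
```

Exact importance sampling is the simplest properly weighted procedure; the point of the nested construction is that far more elaborate random pairs (x, w), such as the output of an internal SMC run, satisfy the same condition.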
We provide a summary of the proposed method in Algorithm 4. Although we here focus on approximating the fully adapted sampler, the extension to arbitrary resampling weights and proposals is straightforward, see the supplementary material. Next we will illustrate how we can make use of nested, or internal, samplers to construct procedures that generate properly weighted samples.
3.4 Constructing the internal procedures
To construct the forward filter we propose to run an SMC sampler targeting the components of x_t (or v_t) one by one. This is done by choosing some sequence of (unnormalized) targets and proposals such that the final target coincides with the fully adapted proposal. For notational convenience we suppress the dependence on time in this section. We provide a summary in Algorithm 2.
A first simple alternative for the backward sampler is to simulate directly from the empirical measure defined by the approximation in Algorithm 2. Although this will be properly weighted, it can introduce significant correlation between the samples. Instead we propose to make use of backward simulation (Godsill et al., 2004; Lindsten and Schön, 2013) to construct a more efficient sampler, see Algorithm 3.
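In the spirit of Algorithm 3, backward simulation over a chain of components can be sketched as follows, given per-component particle sets and normalized weights from a forward internal sampler. The function and variable names are assumptions for illustration, not the paper's exact procedure.

```python
import random

def backward_simulate(particles, weights, psi, rng):
    """Particle-based backward simulation over K chain components:
    particles[k] is the list of particles for component k with normalized
    weights[k] from a forward pass, and psi(x, x_next) is the pairwise
    potential linking adjacent components. Draws one trajectory x_{1:K}
    backwards, reweighting each step by compatibility with the component
    already drawn."""
    K = len(particles)
    traj = [None] * K
    # draw the last component from the final weighted particle set
    i = rng.choices(range(len(particles[-1])), weights=weights[-1], k=1)[0]
    traj[-1] = particles[-1][i]
    for k in range(K - 2, -1, -1):
        bw = [w * psi(x, traj[k + 1]) for x, w in zip(particles[k], weights[k])]
        j = rng.choices(range(len(particles[k])), weights=bw, k=1)[0]
        traj[k] = particles[k][j]
    return traj
```

Compared to reading a trajectory off the forward particle system directly, the backward reweighting breaks the ancestral coupling between draws, which is the source of the reduced correlation mentioned above.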
Now, putting all this together we define the complete procedure in Definition 2.
Definition 2 (NSMC with backward simulation).
Proposition 1 (Proper weighting).
The procedure in Definition 2 generates samples and weights that are properly weighted for the corresponding target distribution.
Proof. The result follows from Theorem 2 in Naesseth et al. (2015a). ∎
Note that we can in fact replace Step 1 of Definition 2 with a run of the NSMC algorithm itself, Algorithm 4, with a corresponding modification of the backward sampler in Step 3. This will also yield properly weighted samples, as discussed in Naesseth et al. (2015a). We will in the experiments show how this can be used to design efficient algorithms by nesting several layers of samplers.
Compare with the example in Section 3.2 and Figure 3, where we used forward filtering–backward sampling by considering the components of x_t as our target. Instead of exact forward filtering we can use Algorithm 2, and instead of exact backward sampling we can use Algorithm 3, to generate properly weighted samples.
3.5 Theoretical Justification
In this section we provide a central limit theorem that further motivates NSMC, and show how the asymptotic variance depends on the internal approximation of the exact fully adapted SMC. Furthermore, we provide a result showing that this asymptotic variance converges to the corresponding asymptotic variance of the exact fully adapted method as the number of internal particles tends to infinity.
Theorem 1 (Central Limit Theorem).
Under certain (standard) regularity conditions on the test function f, specified in the supplementary material, we have the following central limit theorem:

√N ( (1/N) ∑_{i=1}^{N} f(x_{1:t}^i) - π_t(f) ) →d N(0, Σ_t(f)),
where the particles x_{1:t}^i are generated by Algorithm 4 and the asymptotic variance Σ_t(f) is given by
for the quantities defined by