A key strategy in machine learning is to break down a problem into smaller and more manageable parts, then process data or unknown variables recursively. Well-known examples of this are message passing algorithms for graphical models and annealing for optimization or sampling. SMC is a class of methods tailored to solving statistical inference problems recursively. These methods have mostly received attention in the signal processing and statistics communities. With well over two decades of research in SMC, they have enabled inference in increasingly complex and challenging models. Recently, there has been an emergent interest in this class of algorithms from the machine learning community. We have seen applications to PGM (Naesseth et al., 2014; Paige and Wood, 2016), probabilistic programming (Wood et al., 2014), VI (Maddison et al., 2017; Naesseth et al., 2018; Le et al., 2018), inference evaluation (Grosse et al., 2015; Cusumano-Towner and Mansinghka, 2017), Bayesian nonparametrics (Fearnhead, 2004) and many other areas.
We provide a unifying view of the SMC methods that have been developed since their conception in the early 1990s (Gordon et al., 1993; Stewart and McCarty, 1992; Kitagawa, 1993). In this introduction we provide relevant background material, introduce a running example, and discuss the use of code snippets throughout the tutorial.
1.1 Historical Background
SMC methods are generic tools for performing approximate (statistical) inference, predominantly Bayesian inference. They use a weighted sample set to iteratively approximate the posterior distribution of a probabilistic model. Ever since the dawn of MC methods (see e.g. Metropolis and Ulam (1949) for an early discussion), random sample-based approximations have been recognized as powerful tools for inference in complex probabilistic models. Parallel to the development of MCMC methods (Metropolis et al., 1953; Hastings, 1970), SIS (Handschin and Mayne, 1969) and sampling/importance resampling (Rubin, 1987) laid the foundations for what would one day become SMC.
SMC methods were initially known as particle filters (Gordon et al., 1993; Stewart and McCarty, 1992; Kitagawa, 1993). Particle filters were conceived as algorithms for online inference in nonlinear SSMs (Cappé et al., 2005). Since then there has been a flurry of work applying SMC and particle filters to perform approximate inference in ever more complex models. While research in SMC initially focused on SSMs, we will see that SMC can be a powerful tool in a much broader setting.
1.2 Probabilistic Models and Target Distributions
As mentioned above, SMC methods were originally developed as an approximate solution to the so-called filtering problem, which amounts to online inference in dynamical models. Several overview and tutorial articles focus on particle filters, i.e. the SMC algorithms specifically tailored to solve the online filtering problem (Arulampalam et al., 2002; Doucet and Johansen, 2009; Fearnhead and Künsch, 2018). In this tutorial we will take a different view and explain how SMC can be used to solve more general "offline" problems. We shall see how this viewpoint opens up many interesting applications of SMC in machine learning that do not fall within the traditional filtering setup, and furthermore how it gives rise to new and interesting design choices. We consider a generic probabilistic model given by a joint PDF $p(x, y)$ of latent variables $x$ and observed data $y$,
$$p(x, y) = p(y \mid x) \, p(x). \tag{1}$$
We focus on Bayesian inference, where the key object is the posterior distribution
$$p(x \mid y) = \frac{p(x, y)}{p(y)}, \tag{2}$$
where $p(y) = \int p(x, y) \, \mathrm{d}x$ is known as the marginal likelihood.
The target distributions $\bar\gamma_t$, $t = 1, \dots, T$, are a sequence of probability distributions that we recursively approximate using SMC. We define each target distribution in the sequence as a joint PDF of latent variables $x_{1:t} := (x_1, \dots, x_t)$, where $t = 1, \dots, T$. The PDF is denoted by
$$\bar\gamma_t(x_{1:t}) = \frac{\gamma_t(x_{1:t})}{Z_t}, \tag{3}$$
where $\gamma_t$ is a positive integrable function and $Z_t = \int \gamma_t(x_{1:t}) \, \mathrm{d}x_{1:t}$ is the normalization constant, ensuring that $\bar\gamma_t$ is indeed a PDF.
We connect the target distributions to the probabilistic model through a requirement on the final target distribution $\bar\gamma_T$. We enforce the condition that $\bar\gamma_T$ is either equivalent to the posterior distribution, or that it contains the posterior distribution as a marginal distribution. The intermediate target distributions, i.e. $\bar\gamma_t$ for $t < T$, are useful only insofar as they help us approximate the final target $\bar\gamma_T$. This approach is distinct from previous tutorials on particle filters and SMC that traditionally focus on the intermediate targets, i.e. the filtering distributions. We stress that it is not necessarily the case that $x_{1:T}$ (the latent variables of the target distribution) is equal to $x$ (the latent variables of the probabilistic model of interest), as the former can include additional auxiliary variables.
Below we introduce a few examples of probabilistic models and some straightforward choices of target distributions. We introduce and illustrate our running example which will be used throughout. We will return to the issue of choosing the sequence of intermediate targets in Section 3.2.
State Space Models
The SSM (or hidden Markov model) is a type of probabilistic model where the latent variables and data satisfy a Markov property. For this model we typically have $x = x_{1:T}$. Often the measured data can also be split into a sequence of the same length ($T$) as the latent variables, i.e. $y = y_{1:T}$. The model is defined by a transition PDF $f$ and an observation PDF $g$,
$$x_t \mid x_{t-1} \sim f(x_t \mid x_{t-1}), \tag{4a}$$
$$y_t \mid x_t \sim g(y_t \mid x_t). \tag{4b}$$
The joint PDF is
$$p(x_{1:T}, y_{1:T}) = p(x_1) \prod_{t=2}^{T} f(x_t \mid x_{t-1}) \prod_{t=1}^{T} g(y_t \mid x_t), \tag{5}$$
where $p(x_1)$ is the prior on the initial state $x_1$. This class of models is especially common for data that has an inherent temporal structure, such as in the field of signal processing. A common choice is to let the target distributions follow the same sequential structure as in Eq. 5:
$$\gamma_t(x_{1:t}) = p(x_{1:t}, y_{1:t}) = p(x_1) \prod_{s=2}^{t} f(x_s \mid x_{s-1}) \prod_{s=1}^{t} g(y_s \mid x_s), \tag{6}$$
which means that the final normalized target distribution satisfies $\bar\gamma_T(x_{1:T}) = p(x_{1:T} \mid y_{1:T})$ as required. This is the model class and the sequence of target distributions studied in the classical filtering setup.
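As a concrete illustration, the following sketch simulates a linear Gaussian SSM and evaluates the unnormalized target $\log \gamma_t(x_{1:t}) = \log p(x_{1:t}, y_{1:t})$. The parameter values and the standard Gaussian prior here are hypothetical choices for illustration only.

```python
import numpy as np

def simulate_ssm(T, phi=0.9, q=1.0, r=1.0, seed=0):
    """Simulate a linear Gaussian SSM: x_1 ~ N(0, 1),
    f(x_t | x_{t-1}) = N(phi * x_{t-1}, q), g(y_t | x_t) = N(x_t, r)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    y = np.zeros(T)
    x[0] = rng.normal(0.0, 1.0)
    y[0] = rng.normal(x[0], np.sqrt(r))
    for t in range(1, T):
        x[t] = rng.normal(phi * x[t - 1], np.sqrt(q))
        y[t] = rng.normal(x[t], np.sqrt(r))
    return x, y

def log_gamma(x, y, phi=0.9, q=1.0, r=1.0):
    """Evaluate log gamma_t(x_{1:t}) = log p(x_{1:t}, y_{1:t})."""
    def log_norm(z, m, v):
        return -0.5 * (np.log(2 * np.pi * v) + (z - m) ** 2 / v)
    lp = log_norm(x[0], 0.0, 1.0) + log_norm(y[0], x[0], r)
    for t in range(1, len(x)):
        lp += log_norm(x[t], phi * x[t - 1], q) + log_norm(y[t], x[t], r)
    return lp
```

Note that $\log \gamma_t$ is built up term by term, mirroring the factorization in Eq. 6; this recursive structure is what SIS and SMC exploit.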
Non-Markovian Latent Variable Models
Non-Markovian LVMs are characterized by either no, or higher-order, Markov structure between the latent variables $x_{1:T}$ and/or the data $y_{1:T}$. This can be seen as a non-trivial extension of the SSM, see Eq. 4, which has first-order Markov structure. Also for this class of models it is common to have $x = x_{1:T}$ and $y = y_{1:T}$.
Unlike the SSM, the non-Markovian LVM in its most general setting requires access to all previous latent variables to generate $x_t$ and $y_t$,
$$x_t \mid x_{1:t-1} \sim f(x_t \mid x_{1:t-1}), \tag{7a}$$
$$y_t \mid x_{1:t} \sim g(y_t \mid x_{1:t}), \tag{7b}$$
where we again refer to $f$ and $g$ as the transition PDF and observation PDF, respectively. The joint PDF is given by
$$p(x_{1:T}, y_{1:T}) = p(x_1) \prod_{t=2}^{T} f(x_t \mid x_{1:t-1}) \prod_{t=1}^{T} g(y_t \mid x_{1:t}), \tag{8}$$
where $p(x_1)$ is again the prior on $x_1$. A typical target distribution is given by
$$\gamma_t(x_{1:t}) = p(x_{1:t}, y_{1:t}), \tag{9}$$
with $\bar\gamma_t(x_{1:t}) = p(x_{1:t} \mid y_{1:t})$ and $Z_t = p(y_{1:t})$. Another option is
$$\gamma_t(x_{1:t}) = \begin{cases} p(x_{1:t}), & t < T, \\ p(x_{1:T}, y_{1:T}), & t = T. \end{cases} \tag{10}$$
For both these sequences of target distributions the final iteration is the posterior distribution, i.e. $\bar\gamma_T(x_{1:T}) = p(x_{1:T} \mid y_{1:T})$. However, the former will often lead to more accurate inference, because it introduces information from the data at an earlier stage in the SMC algorithm.
Throughout the monograph we will exemplify the different methods using a Gaussian special case of Eq. 7, see Example 1. We let the prior on $x_{1:T}$, defined by the transition PDF $f(x_t \mid x_{t-1})$, be Markovian, and introduce the non-Markov property instead through the observation PDF $g(y_t \mid x_{1:t})$.
Example 1 (Non-Markovian Gaussian Sequence Model).
As our running example for illustration purposes we use a non-Markovian Gaussian sequence model,
$$x_t \mid x_{1:t-1} \sim \mathcal{N}\left(x_t \,;\, \phi x_{t-1}, \, q\right),$$
$$y_t \mid x_{1:t} \sim \mathcal{N}\Big(y_t \,;\, \sum_{k=1}^{t} \beta^{t-k} x_k, \, r\Big),$$
with observed variables $y_{1:t}$ (data), and where $\phi$, $q$, $\beta$, and $r$ are fixed model parameters.
We let the prior at $t = 1$ be Gaussian. Artificial data was generated by simulating from the model. The distribution of interest is the posterior distribution $p(x_{1:t} \mid y_{1:t})$. We illustrate a few sample paths of $x_{1:t}$ in Fig. 1.
We can adjust the strength of the dependence of the observations $y_t$ on previous latent variables through the parameter $\beta$. If we set $\beta = 0$ we obtain a linear Gaussian SSM, since the data then depends only on the most recent latent variable $x_t$. On the other hand, if we let $\beta = 1$, each $x_k$ for $k < t$ has an equally strong effect on $y_t$ as does $x_t$.
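A minimal simulator for this kind of model might look as follows; the default parameter values and the prior $x_1 \sim \mathcal{N}(0, q)$ are placeholders, not the values used to generate the data in the text.

```python
import numpy as np

def simulate_running_example(T, phi=0.9, q=1.0, beta=0.5, r=1.0, seed=0):
    """Simulate the non-Markovian Gaussian sequence model:
    x_t | x_{t-1} ~ N(phi * x_{t-1}, q) and
    y_t | x_{1:t} ~ N(sum_{k<=t} beta**(t-k) * x_k, r).
    The prior x_1 ~ N(0, q) is an assumption made for simplicity."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    y = np.zeros(T)
    mu = 0.0  # running value of sum_{k<=t} beta**(t-k) * x_k
    for t in range(T):
        mean = phi * x[t - 1] if t > 0 else 0.0
        x[t] = rng.normal(mean, np.sqrt(q))
        mu = beta * mu + x[t]  # update the geometrically weighted sum
        y[t] = rng.normal(mu, np.sqrt(r))
    return x, y
```

The running variable `mu` updates the geometrically weighted sum in constant time per step, so simulation is linear in $T$ even though the observation density depends on the full history; setting `beta=0.0` recovers the linear Gaussian SSM.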
Another example of a non-Markovian LVM often encountered in machine learning is the stochastic RNN. We define the stochastic RNN below and will return to it again in Section 3.
Example 2 (Stochastic Recurrent Neural Network).
A stochastic RNN is a non-Markovian LVM where the parameters of the transition and observation models are defined by RNNs. A common example is using the conditional Gaussian distribution to define the transition PDF
$$f(x_t \mid x_{1:t-1}) = \mathcal{N}\left(x_t \,;\, \mu_t(x_{1:t-1}), \, \sigma_t^2(x_{1:t-1})\right),$$
where the functions $\mu_t(\cdot)$ and $\sigma_t^2(\cdot)$ are defined by RNNs.
Conditionally independent models
A common model in probabilistic machine learning is to assume that the datapoints $y = (y_1, \dots, y_n)$ in the dataset are conditionally independent given the latent $x$. This means that the joint PDF is given by
$$p(x, y) = p(x) \prod_{k=1}^{n} p(y_k \mid x),$$
where $p(y_k \mid x)$ is the likelihood. For this class of models it might not be immediately apparent that we can define a useful sequence of target distributions. However, as we shall see, we can make use of auxiliary variables to design target distributions that may help with inference.
We will discuss two approaches to designing the sequence of target distributions: data tempering and likelihood tempering. Both of these make use of an auxiliary variable technique, where each $x_t$, $t = 1, \dots, T$, is a random variable on the same space as $x$.
Data tempering: Using data tempering we add the datapoints to the target distribution one by one. In this case the data index coincides with the target index $t$ (and $T = n$). We define the target distribution
$$\gamma_t(x_{1:t}) = p(x_t) \prod_{k=1}^{t} p(y_k \mid x_t) \prod_{k=1}^{t-1} L_k(x_k \mid x_{k+1}),$$
where the distributions $L_k$ are a design choice, known as backward kernels. With this choice, we have that the marginal distribution of $x_T$ at the final iteration is exactly the posterior distribution, i.e. $p(x_T \mid y_{1:n})$. In fact, at each step $t$ the marginal target distribution of $x_t$ is a partial posterior, $p(x_t \mid y_{1:t})$.
Likelihood tempering: With likelihood tempering, instead of adding the data one by one, we change the likelihood through a sequence of positive tempering coefficients. We define the target distribution
$$\gamma_t(x_{1:t}) = p(x_t) \Big( \prod_{k=1}^{n} p(y_k \mid x_t) \Big)^{\tau_t} \prod_{k=1}^{t-1} L_k(x_k \mid x_{k+1}),$$
where $0 \leq \tau_1 < \tau_2 < \dots < \tau_T = 1$, and we again make use of the user-chosen backward kernels $L_k$. In this setting all data is considered at each iteration. Since $\tau_T = 1$, the final marginal target distribution of $x_T$ is again equal to the posterior $p(x_T \mid y_{1:n})$.
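The two tempering schemes can be sketched, up to the backward-kernel factors, for a toy conjugate model. The model (prior $x \sim \mathcal{N}(0, 1)$, likelihood $y_k \mid x \sim \mathcal{N}(x, 1)$) is an assumed illustration, not taken from the text.

```python
import numpy as np

def log_norm(z, m, v):
    # log N(z; m, v)
    return -0.5 * (np.log(2 * np.pi * v) + (z - m) ** 2 / v)

def log_target_data_tempering(x, y, t):
    """log p(x) + sum_{k<=t} log p(y_k | x); backward-kernel factors omitted."""
    return log_norm(x, 0.0, 1.0) + sum(log_norm(yk, x, 1.0) for yk in y[:t])

def log_target_likelihood_tempering(x, y, tau):
    """log p(x) + tau * sum_k log p(y_k | x); backward-kernel factors omitted."""
    return log_norm(x, 0.0, 1.0) + tau * sum(log_norm(yk, x, 1.0) for yk in y)
```

At $\tau = 0$ the tempered target reduces to the prior, and at $\tau = 1$ it matches the data-tempered target with all $n$ datapoints included, consistent with the discussion above.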
Models and Targets
We have seen several probabilistic models with examples of corresponding target distributions. While not limited to these, this illustrates the wide applicability of SMC. In fact, as long as we can design a sequence of target distributions such that the final target $\bar\gamma_T$ coincides with (or contains as a marginal) the distribution of interest, we can leverage SMC for inference.
1.3 Applications
SMC and IS methods have already seen a plethora of applications to machine learning and statistical inference problems. Before we turn to the fundamentals of the various algorithms it can be helpful to understand some of these applications. We present and discuss a few select examples of applications of SMC and IS to PGM, Bayesian nonparametric models, probabilistic programming, and inference evaluation.
Probabilistic Graphical Models
Probabilistic graphical models (PGM; see e.g. Koller et al. (2009); Wainwright et al. (2008)) are probabilistic models where the conditional independencies in the joint PDF are described by the edges in a graph. The graph structure allows for easy and strong control over the type of prior information that the user can express. The main limitation of PGMs is that exact inference is often intractable and approximate inference is challenging.
The PGM is a probabilistic model where the PDF factorizes according to an underlying graph, described by a set of cliques $\mathcal{C}$, i.e. fully connected subsets of the vertices $\mathcal{V}$, where $\mathcal{V}$ contains all individual components of $x$ and $y$. The undirected graphical model can be denoted by
$$p(x, y) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C, y_C),$$
where the $\psi_C$ are non-negative potential functions and $Z$ is a normalization constant ensuring that the right hand side is a proper PDF.
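To make the factorization concrete, here is a sketch that evaluates the unnormalized PDF of a tiny chain-structured model and computes $Z$ by brute force. The pairwise potentials $\psi(x_i, x_{i+1}) = \exp(J \, x_i x_{i+1})$ over binary variables are an assumed, illustrative choice.

```python
import numpy as np
import itertools

def unnorm_density(x, J=0.5):
    """Unnormalized PDF of a chain MRF over x in {-1, +1}^n with
    pairwise potentials psi(x_i, x_{i+1}) = exp(J * x_i * x_{i+1})."""
    x = np.asarray(x, dtype=float)
    return np.exp(J * np.sum(x[:-1] * x[1:]))

def partition_function(n, J=0.5):
    """Brute-force Z by summing over all 2^n configurations."""
    return sum(unnorm_density(cfg, J) for cfg in itertools.product([-1, 1], repeat=n))
```

The brute-force sum is exponential in the number of vertices, which illustrates why computing $Z$, and hence exact inference, quickly becomes intractable for larger graphs.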
Probabilistic Programming
Probabilistic programming languages are programming languages designed to describe probabilistic models and to automate the process of doing inference in those models. We can think of probabilistic programming as automating statistical inference, particularly Bayesian inference, using tools from computer science and computational statistics. Developing a syntax and semantics to describe and denote probabilistic models and the inference problems we are interested in solving is key to the probabilistic programming language. To define what separates a probabilistic programming language from a standard programming language we quote Gordon et al. (2014): "Probabilistic programs are usual functional or imperative programs with two added constructs: (1) the ability to draw values at random from distributions, and (2) the ability to condition values of variables in a program via observations." This aligns very well with the notion of Bayesian inference through the posterior distribution Eq. 2; through the syntax of the language we define $x$ to be the random values we sample, and $y$ the observations that we condition on through the use of Bayes' rule. The probabilistic program then essentially defines our joint probabilistic model $p(x, y)$.
One of the main challenges of probabilistic programming is to develop algorithms that are general enough to enable inference for any model (probabilistic program) that we could conceivably write down using the language. Recently Wood et al. (2014) have shown that SMC-based approaches can be used as inference back-ends in probabilistic programs.
Bayesian nonparametric models
Nonparametric models are characterized by having a complexity which grows with the amount of available data. In a Bayesian context this implies that the usual latent random variables (i.e., parameters) of the model are replaced by latent stochastic processes. Examples include Gaussian processes, Dirichlet processes, and Beta processes; see e.g. Hjort et al. (2010) for a general introduction.
Sampling from these latent stochastic processes, conditionally on observed data, can be done using SMC. To give a concrete example, consider the Dirichlet process mixture model, which is a clustering model that can handle an unknown and conceptually infinite number of mixture components. Let $y_n$, $n = 1, 2, \dots$, be a stream of data. Let $z_n$, $n = 1, 2, \dots$, be a sequence of latent integer-valued random variables, such that $z_n$ is the index of the mixture component to which datapoint $y_n$ belongs. A generative representation of the mixture assignment variables is given by
$$\Pr(z_{n+1} = k \mid z_{1:n}) = \begin{cases} \dfrac{m_k}{n + \alpha}, & k = 1, \dots, K_n, \\[6pt] \dfrac{\alpha}{n + \alpha}, & k = K_n + 1, \end{cases}$$
where $K_n$ is the number of distinct mixture components represented by the first $n$ datapoints, $m_k$ is the number of datapoints among $z_{1:n}$ that belong to the $k$th component, and $\alpha > 0$ is a concentration parameter.
The model is completed by specifying the distribution of the data, conditionally on the mixture assignment variables:
$$y_n \mid (z_n = k, \, \theta_k) \sim g(y_n \mid \theta_k), \qquad \theta_k \overset{\text{iid}}{\sim} p(\theta),$$
where $g(y_n \mid \theta_k)$ is an emission probability distribution parameterized by $\theta_k$ and $p(\theta)$ is a prior distribution over the mixture parameters $\theta_k$.
Note that the mixture assignment variables $z_n$, $n = 1, 2, \dots$, evolve according to a latent stochastic process. Solving the clustering problem amounts to computing the posterior distribution of this stochastic process, conditionally on the observed data. One way to address this problem is to use SMC; see Fearnhead (2004) for an efficient implementation tailored to the discrete nature of the problem.
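The generative mechanism for the assignment variables (the Chinese restaurant process representation) can be sketched as:

```python
import numpy as np

def crp_assignments(n, alpha, seed=0):
    """Sample z_1, ..., z_n from the Chinese restaurant process with
    concentration alpha; returns assignments and per-component counts."""
    rng = np.random.default_rng(seed)
    z = [0]          # first datapoint starts the first component
    counts = [1]     # m_k: number of datapoints in component k
    for i in range(1, n):
        # join existing component k with prob m_k / (i + alpha),
        # or start a new component with prob alpha / (i + alpha)
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        z.append(k)
    return np.array(z), np.array(counts)
```

Larger values of `alpha` make new components more likely, so the number of clusters grows with the amount of data, reflecting the nonparametric nature of the model.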
Inference Evaluation
An important problem when performing approximate Bayesian inference is to determine when our approximation is "good enough". Is it possible to give practical guarantees on the approximation we obtain? We need ways to evaluate how accurate our approximate inference algorithms are when compared to the true target distribution that we are trying to approximate. We will refer to the process of evaluating and validating approximate inference methods as inference evaluation.
Inference evaluation is mainly concerned with measuring how close our approximation is to the true object we are trying to estimate, often a posterior distribution. For simulated data, Grosse et al. (2015, 2016) have shown that we can make use of SMC and IS to bound the symmetrized KL divergence between our approximate posterior and the true posterior. In another related work, Cusumano-Towner and Mansinghka (2017) have shown that SMC-based methods show promise in estimating the symmetrized KL divergence between the approximate posterior and that of a gold-standard algorithm.
1.4 Example Code
We will be making use of inline Python code snippets throughout the manuscript to illustrate the algorithms and methods. Below we summarize the modules that are necessary to import to run the code snippets:
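The snippets in this tutorial are written against NumPy's random number generation API (an assumption; the original listing may have included further modules such as plotting utilities):

```python
import numpy as np

# Shared random number generator used by the code snippets.
rng = np.random.default_rng(0)
```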
The remainder of this tutorial is organized as follows. In Section 2, we first introduce IS, a foundational building block for SMC. Then, we discuss the limitations of IS and how SMC resolves these. Finally, the section concludes with discussing some practical issues and theoretical results relevant to SMC methods.
Section 3 is focused on the two key design choices of SMC: the proposal and target distributions. Initially we focus on the proposal, discussing various ways of adapting and learning good proposals that will make the approximation more accurate. Then we discuss the sequence of target distributions; how we can learn intermediate distributions that help us when we try to approximate the posterior.
Section 4 focuses on PM methods and other SMC methods that rely on a concept known as proper weights. First, we provide a simple and straightforward proof of the unbiasedness property of the SMC normalization constant estimate. Then, we describe and illustrate the combination of MCMC and SMC methods through PM algorithms. We move on to detail properly weighted SMC, a concept that unites and extends the random weights and nested SMC algorithms. Finally, we conclude the chapter by considering a few approaches for distributed and parallel SMC.
In Section 5 we introduce CSMC, and related methods for simulating from and computing expectations with respect to a target distribution. First, we introduce the basic CSMC algorithm and provide a straightforward proof of the unbiasedness of the inverse normalization constant estimate. Then, we show how SMC and CSMC can be combined to leverage multi-core and parallel computing architectures in the IPMCMC algorithm. Finally, we discuss recent applications of CSMC for evaluating approximate inference methods.
The tutorial concludes with a discussion and outlook in Section 6.
2 Importance Sampling to Sequential Monte Carlo
Typical applications require us to be able to evaluate or sample from the target distributions $\bar\gamma_t$, as well as compute their normalization constants $Z_t$. For most models and targets this will be intractable, and we need approximations based on e.g. Monte Carlo methods.
In this section, we first review IS and some of its shortcomings. Then, we introduce the SMC method, the key algorithm underpinning this monograph. Finally, we discuss some key theoretical properties of the SMC algorithm.
2.1 Importance Sampling
IS is a Monte Carlo method that constructs an approximation using samples from a proposal distribution, and corrects for the discrepancy between the target and proposal using (importance) weights.
Most applications of Bayesian inference can be formulated as computing the expectation of some generic function $h(x)$, referred to as a test function, with respect to the target distribution $\bar\gamma(x)$,
$$\mathbb{E}_{\bar\gamma}\left[h(x)\right] = \int h(x) \, \bar\gamma(x) \, \mathrm{d}x. \tag{15}$$
Examples include posterior predictive distributions, Bayesian p-values, and point estimates such as the posterior mean. Computing Eq. 15 is often intractable, but by a clever trick we can rewrite it as
$$\mathbb{E}_{\bar\gamma}\left[h(x)\right] = \frac{\int h(x) \frac{\gamma(x)}{q(x)} q(x) \, \mathrm{d}x}{\int \frac{\gamma(x)}{q(x)} q(x) \, \mathrm{d}x}. \tag{16}$$
The PDF $q(x)$ is a user-chosen proposal distribution; we assume it is simple to sample from and evaluate. We can now estimate the right hand side of Eq. 16 using the Monte Carlo method,
$$\mathbb{E}_{\bar\gamma}\left[h(x)\right] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} \tilde w^i \, h(x^i)}{\frac{1}{N}\sum_{j=1}^{N} \tilde w^j}, \tag{17}$$
where $\tilde w^i = \frac{\gamma(x^i)}{q(x^i)}$ and the $x^i$ are simulated iid from $q(x)$. We will usually write Eq. 17 more compactly as
$$\mathbb{E}_{\bar\gamma}\left[h(x)\right] \approx \sum_{i=1}^{N} w^i \, h(x^i), \tag{18}$$
where the normalized weights $w^i$ are defined by
$$w^i = \frac{\tilde w^i}{\sum_{j=1}^{N} \tilde w^j},$$
with $\tilde w^i$ a shorthand for $\gamma(x^i)/q(x^i)$. The estimate in Eq. 18 is strongly consistent, converging (almost surely) to the true expectation as the number of samples $N$ tends to infinity. An alternate view of IS is to consider it an (empirical) approximation of $\bar\gamma$ itself,
$$\bar\gamma(x) \approx \sum_{i=1}^{N} w^i \, \delta_{x^i}(x), \tag{19}$$
where $\delta_{x^i}$ denotes the Dirac measure at $x^i$. (This should be interpreted as an approximation of the underlying probability distribution, and not of the density function itself.) Furthermore, IS provides an approximation of the normalization constant $Z$,
$$\hat Z = \frac{1}{N} \sum_{i=1}^{N} \tilde w^i. \tag{20}$$
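As a sketch, self-normalized IS with log-weights can be implemented as follows; the function names and interface are our own.

```python
import numpy as np

def self_normalized_is(log_gamma, sample_q, log_q, h, N, seed=0):
    """Self-normalized IS: returns an estimate of E[h(x)] under
    gamma / Z, and the estimate Z_hat of the normalization constant."""
    rng = np.random.default_rng(seed)
    x = sample_q(N, rng)                     # x^i drawn iid from q
    log_w = log_gamma(x) - log_q(x)          # unnormalized log-weights
    shift = log_w.max()                      # stabilize the exponentiation
    w_tilde = np.exp(log_w - shift)
    w = w_tilde / w_tilde.sum()              # normalized weights
    Z_hat = np.exp(shift) * w_tilde.mean()   # (1/N) sum of unnormalized weights
    return np.sum(w * h(x)), Z_hat
```

For instance, with $\gamma(x) = 2\,\mathcal{N}(x; 0, 1)$ and proposal $q(x) = \mathcal{N}(x; 0, 1)$, every weight equals $2$, so $\hat Z = 2$ exactly regardless of $N$.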
Because the weights depend on the random samples $x^i$, they are themselves random variables. One of the key properties of the estimator $\hat Z$ is that it is unbiased, i.e. $\mathbb{E}\big[\hat Z\big] = Z$. This can be easily seen by noting that the $x^i$ are iid draws from $q$ and therefore
$$\mathbb{E}\big[\hat Z\big] = \mathbb{E}_q\left[\frac{\gamma(x)}{q(x)}\right] = \int \gamma(x) \, \mathrm{d}x = Z,$$
since $\int \gamma(x)\,\mathrm{d}x$ is nothing but the normalization constant for $\bar\gamma$. This property will be important for several of the more powerful IS and SMC-based methods considered in this monograph.
We summarize the IS method in Algorithm 1. This algorithm is sometimes referred to as self-normalized IS, because we are normalizing each individual weight using all samples.
The straightforward implementation of the IS method we have described thus far is impractical for many of the example models and targets in Section 1.2. It is challenging to design good proposals for high-dimensional models. A good proposal is typically more heavy-tailed than the target; if it is not, the weights can have infinite variance. Another favorable property of a proposal is that it should cover the bulk of the target probability mass, putting high probability on regions of high probability under the target distribution. Even Markovian models, such as the SSM, can have a prohibitive computational complexity without careful design of the proposal. In the next section we will describe how we can alleviate these concerns using SIS, a special case of IS, which takes a kind of divide-and-conquer approach to tackle the high dimensionality of $x$.
2.1.1 Sequential Importance Sampling
SIS is a variant of IS where we select a proposal distribution that has an autoregressive structure, and compute the importance weights recursively. By choosing a proposal defined by
$$q_t(x_{1:t}) = q_{t-1}(x_{1:t-1}) \, q_t(x_t \mid x_{1:t-1}),$$
we can decompose the proposal design problem into $T$ conditional distributions. This means we obtain samples $x_{1:t}^i$ by reusing $x_{1:t-1}^i$ from the previous iteration and appending a new sample, $x_t^i$, simulated from $q_t(x_t \mid x_{1:t-1}^i)$. The unnormalized weights can be computed recursively by noting that
$$\tilde w_t = \frac{\gamma_t(x_{1:t})}{q_t(x_{1:t})} = \tilde w_{t-1} \cdot \frac{\gamma_t(x_{1:t})}{\gamma_{t-1}(x_{1:t-1}) \, q_t(x_t \mid x_{1:t-1})}.$$
We summarize the SIS method in Algorithm 2, with the conventions $\gamma_0(x_{1:0}) \equiv 1$ and $q_1(x_1 \mid x_{1:0}) = q_1(x_1)$.
If we need to evaluate the normalization constant estimate $\hat Z_t$, we make use of Eq. 20 analogously to IS. However, we may also obtain a (strongly) consistent estimate of the ratio of normalization constants,
$$\frac{Z_t}{Z_{t-1}} \approx \frac{\sum_{i=1}^{N} \tilde w_t^i}{\sum_{i=1}^{N} \tilde w_{t-1}^i}.$$
While the estimate of the ratio is consistent, it is in general not unbiased. However, SIS is a special case of IS. This means that the SIS estimate of the normalization constant $Z_t$ for $\bar\gamma_t$, i.e. $\hat Z_t$ in Eq. 20, is still unbiased and consistent.
In Example 3 we detail a first example proposal for the running example, and derive the corresponding importance weights. Furthermore, we include a code snippet that illustrates how to implement the sampler in Python.
Example 3 (SIS for Example 1).
We revisit our running non-Markovian Gaussian example. The target distribution is
$$\gamma_t(x_{1:t}) = p(x_{1:t}, y_{1:t}) = \gamma_{t-1}(x_{1:t-1}) \, f(x_t \mid x_{1:t-1}) \, g(y_t \mid x_{1:t}).$$
A common approach is to set the proposal to be the prior (or transition) distribution, $q_t(x_t \mid x_{1:t-1}) = f(x_t \mid x_{1:t-1})$. A sample from the proposal is generated as follows
$$x_t^i \sim \mathcal{N}\left(x_t \,;\, \phi x_{t-1}^i, \, q\right), \quad i = 1, \dots, N.$$
We refer to this proposal simply as the prior proposal. The corresponding weight update is
$$\tilde w_t^i = \tilde w_{t-1}^i \cdot g(y_t \mid x_{1:t}^i) = \tilde w_{t-1}^i \cdot \mathcal{N}\big(y_t \,;\, \mu_t^i, \, r\big),$$
where $\mu_t^i = \sum_{k=1}^{t} \beta^{t-k} x_k^i$. We provide a code snippet to illustrate how to implement SIS with the prior proposal for this model in Python.
For improved numerical stability we update the log-weights and subtract the logarithm of the sum of weights (the weight normalization) before exponentiating.
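Putting the pieces together, a sketch of SIS with the prior proposal for the running example follows. The parameter defaults are placeholders, and a prior $x_1 \sim \mathcal{N}(0, q)$ is assumed for simplicity.

```python
import numpy as np

def sis_prior_proposal(y, N, phi=0.9, q=1.0, beta=0.5, r=1.0, seed=0):
    """SIS for the running example using the prior proposal.
    Returns the particles and their normalized weights."""
    rng = np.random.default_rng(seed)
    T = len(y)
    x = np.zeros((N, T))
    logw = np.zeros(N)
    mu = np.zeros(N)  # running observation means sum_{k<=t} beta**(t-k) x_k^i
    for t in range(T):
        mean = phi * x[:, t - 1] if t > 0 else np.zeros(N)
        x[:, t] = mean + np.sqrt(q) * rng.standard_normal(N)  # propose from f
        mu = beta * mu + x[:, t]
        logw += -0.5 * (np.log(2 * np.pi * r) + (y[t] - mu) ** 2 / r)  # log g
    w = np.exp(logw - logw.max())  # subtract max log-weight before exponentiating
    return x, w / w.sum()
```

The update accumulates log-weights and only exponentiates after subtracting the maximum, following the numerical-stability advice above.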
SIS can be implemented efficiently for a large class of problems; the computational complexity is usually linear in $N$ and $T$. Even so, IS methods suffer from severe drawbacks limiting their practical use for many high-dimensional problems.
2.1.2 Shortcomings of Importance Sampling
The main drawback of IS is that the variance of the estimator scales unfavorably with the dimension of the problem; the variance generally increases exponentially in $T$. Because SIS is a special case of IS, it inherits this unfavorable property.
One way in which this problem manifests itself in practice is through the normalized weights $w_t^i$. The maximum of the normalized weights, $\max_i w_t^i$, will quickly approach one as $t$ increases, and consequently all other weights approach zero; a phenomenon known as weight degeneracy. This means that, effectively, we approximate the target distribution using a single sample.
We illustrate weight degeneracy in Example 4 using the running example.
Example 4 (SIS weight degeneracy).
We return to our running example, for a fixed choice of the length $T$, the number of particles $N$, and the model parameters. Fig. 2 shows the particles $x_t^i$ and the normalized weights $w_t^i$, where the area of the discs corresponds to the size of the weights. We can see that as $t$ increases nearly all mass concentrates on the fourth particle. This means that the normalized weight of this particle is almost one, $w_t^4 \approx 1$. The remaining particles have normalized weights that are all close to zero, and thus have a negligible contribution to the approximation.
This concentration of mass onto a single particle happens very quickly in SIS. Even for very simple Markovian models, the variance of e.g. our normalization constant estimator can increase exponentially fast as a function of $T$.
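A standard diagnostic for weight degeneracy is the effective sample size (ESS); as a sketch:

```python
import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum_i (w^i)^2 for normalized weights; ranges from 1
    (complete degeneracy) to N (uniform weights)."""
    w = np.asarray(w, dtype=float)
    return 1.0 / np.sum(w ** 2)
```

An ESS close to 1 indicates that the approximation is effectively supported on a single particle, as in Example 4.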
SMC methods tackle the weight degeneracy issue by choosing a proposal that leverages the information contained in the previous iteration's target distribution approximation, $\bar\gamma_{t-1}^N$.
2.2 Sequential Monte Carlo
SMC methods improve upon IS by mitigating the weight degeneracy issue through a clever choice of proposal distribution. For certain sequence models the weight degeneracy issue can be resolved altogether, providing estimators to the final marginal distribution that do not deteriorate for increasing . For other sequence models, SMC still tends to provide more accurate estimates in practice compared to IS.
Just like in SIS we need a sequence of proposal distributions $q_t(x_t \mid x_{1:t-1})$ for $t = 1, \dots, T$. This is a user choice that can significantly impact the accuracy of the SMC approximation. For now we assume that the proposal is given and return to this choice in Section 3. Below, we detail the iterations (or steps) of a basic SMC algorithm.
The first iteration of SMC boils down to approximating the target distribution $\bar\gamma_1$ using standard IS. Simulating $N$ times independently from the first proposal,
$$x_1^i \sim q_1(x_1), \quad i = 1, \dots, N,$$
and assigning corresponding weights,
$$\tilde w_1^i = \frac{\gamma_1(x_1^i)}{q_1(x_1^i)}, \quad i = 1, \dots, N,$$
lets us approximate $\bar\gamma_1$ (cf. Eq. 19) by
$$\bar\gamma_1^N(x_1) = \sum_{i=1}^{N} w_1^i \, \delta_{x_1^i}(x_1), \qquad w_1^i = \frac{\tilde w_1^i}{\sum_{j=1}^{N} \tilde w_1^j}. \tag{26}$$
The key innovation of the SMC algorithm is that it takes advantage of the information provided in $\bar\gamma_1^N$, Eq. 26, when constructing a proposal for the next target distribution $\bar\gamma_2$.
In the second iteration of SMC we sample from the proposal $\bar\gamma_1^N(x_1) \, q_2(x_2 \mid x_1)$, rather than from $q_1(x_1) \, q_2(x_2 \mid x_1)$ as in SIS. We sample $N$ times independently from
$$x_{1:2}^i \sim \bar\gamma_1^N(x_1) \, q_2(x_2 \mid x_1), \tag{27}$$
and assign weights
$$\tilde w_2^i = \frac{\gamma_2(x_{1:2}^i)}{\gamma_1(x_1^i) \, q_2(x_2^i \mid x_1^i)}. \tag{28}$$
Simulating from Eq. 27 can be broken down into three parts: resampling $x_1^i \sim \bar\gamma_1^N(x_1)$, propagation $x_2^i \sim q_2(x_2 \mid x_1^i)$, and concatenation $x_{1:2}^i = (x_1^i, x_2^i)$. Note the overloaded notation for $x_1^i$: we replace the initial sample set $\{x_1^i\}_{i=1}^{N}$ from Step 1 with the resampled set, which we also denote by $\{x_1^i\}_{i=1}^{N}$.
Resampling can refer to a variety of methods in statistics; for our purposes it is simple (weighted) random sampling with replacement from $\{x_1^i\}_{i=1}^{N}$, with probabilities given by the normalized weights $w_1^i$. Resampling $N$ times independently means that the number of times each particle is selected is multinomially distributed. This resampling algorithm is known as multinomial resampling; see the accompanying code snippet. In Sections 2.2.2 and 2.2.1 we revisit resampling and present alternative resampling algorithms, increasing efficiency by correlation and adaptation.
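A sketch of multinomial resampling, returning the indices (ancestors) of the selected particles:

```python
import numpy as np

def multinomial_resampling(w, seed=0):
    """Draw N ancestor indices iid from Categorical(w^1, ..., w^N)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    return rng.choice(len(w), size=len(w), replace=True, p=w)
```

Returning indices rather than copies of the particles keeps the implementation generic: the same ancestors can be used to reorder the full particle histories and any per-particle sufficient statistics.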
Propagation generates new samples independently from the proposal, i.e. $x_2^i \sim q_2(x_2 \mid x_1^i)$ for each $i$. By concatenating $x_{1:2}^i = (x_1^i, x_2^i)$ we obtain a complete sample from the proposal $\bar\gamma_1^N(x_1) \, q_2(x_2 \mid x_1)$.
Finally, new importance weights are computed according to (28). This expression can be understood as a standard importance sampling weight: the target divided by the proposal. Note, however, that we use $\gamma_1$ in the denominator: when computing the weights we "pretend" that the proposal is given by $\bar\gamma_1(x_1) \, q_2(x_2 \mid x_1)$ rather than by its approximation $\bar\gamma_1^N(x_1) \, q_2(x_2 \mid x_1)$. This is of course just an interpretation, and the motivation for using this particular weight expression is given by the convergence analysis for SMC; see Section 2.3.
The resulting approximation of $\bar\gamma_2$ is
$$\bar\gamma_2^N(x_{1:2}) = \sum_{i=1}^{N} w_2^i \, \delta_{x_{1:2}^i}(x_{1:2}), \qquad w_2^i = \frac{\tilde w_2^i}{\sum_{j=1}^{N} \tilde w_2^j}.$$
SMC can in essence be described as a synthesis of SIS and resampling, which explains its alternate names SIR or SISR.
The remaining iterations follow the recipe outlined in step 2. First, the proposal is the product of the previous empirical distribution approximation and a conditional distribution,
$$\bar\gamma_{t-1}^N(x_{1:t-1}) \, q_t(x_t \mid x_{1:t-1}).$$
Samples for $t = 3, \dots, T$ are generated as follows: resample $x_{1:t-1}^i \sim \bar\gamma_{t-1}^N(x_{1:t-1})$, propagate $x_t^i \sim q_t(x_t \mid x_{1:t-1}^i)$, and concatenate $x_{1:t}^i = (x_{1:t-1}^i, x_t^i)$.
Finally, we assign the weights
$$\tilde w_t^i = \frac{\gamma_t(x_{1:t}^i)}{\gamma_{t-1}(x_{1:t-1}^i) \, q_t(x_t^i \mid x_{1:t-1}^i)}, \quad i = 1, \dots, N,$$
and approximate $\bar\gamma_t$ by
$$\bar\gamma_t^N(x_{1:t}) = \sum_{i=1}^{N} w_t^i \, \delta_{x_{1:t}^i}(x_{1:t}), \qquad w_t^i = \frac{\tilde w_t^i}{\sum_{j=1}^{N} \tilde w_t^j}.$$
The normalization constant $Z_t$ can be estimated by
$$\hat Z_t = \prod_{s=1}^{t} \left( \frac{1}{N} \sum_{i=1}^{N} \tilde w_s^i \right).$$
We summarize the full SMC sampler in Algorithm 3, with the conventions $\bar\gamma_0^N(x_{1:0}) \equiv 1$ and $q_1(x_1 \mid x_{1:0}) = q_1(x_1)$.
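Combining resampling, propagation, and weighting, a sketch of the full SMC sampler for the running example follows. The parameter defaults and the prior $x_1 \sim \mathcal{N}(0, q)$ are placeholders, and for simplicity we resample at every iteration.

```python
import numpy as np

def smc_running_example(y, N, phi=0.9, q=1.0, beta=0.5, r=1.0, seed=0):
    """SMC with multinomial resampling and the prior proposal for the
    running non-Markovian Gaussian example. Returns the particles, the
    final normalized weights, and the log normalization constant estimate."""
    rng = np.random.default_rng(seed)
    T = len(y)
    x = np.zeros((N, T))
    mu = np.zeros(N)          # running observation means
    w = np.full(N, 1.0 / N)   # normalized weights
    log_Z = 0.0
    for t in range(T):
        if t > 0:
            a = rng.choice(N, size=N, replace=True, p=w)  # resample ancestors
            x, mu = x[a], mu[a]
        mean = phi * x[:, t - 1] if t > 0 else np.zeros(N)
        x[:, t] = mean + np.sqrt(q) * rng.standard_normal(N)  # propagate
        mu = beta * mu + x[:, t]
        logw = -0.5 * (np.log(2 * np.pi * r) + (y[t] - mu) ** 2 / r)  # weight
        c = logw.max()
        log_Z += c + np.log(np.mean(np.exp(logw - c)))  # log of (1/N) sum w~_t
        w = np.exp(logw - c)
        w = w / w.sum()
    return x, w, log_Z
```

Since we resample at every iteration, the weights entering each step are uniform, so the incremental weight reduces to the observation density; the log normalization constant estimate accumulates the product-form estimator above in log space.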