# Particle Gibbs with Ancestor Sampling

Particle Markov chain Monte Carlo (PMCMC) is a systematic way of combining the two main tools used for Monte Carlo statistical inference: sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC). We present a novel PMCMC algorithm that we refer to as particle Gibbs with ancestor sampling (PGAS). PGAS provides the data analyst with an off-the-shelf class of Markov kernels that can be used to simulate the typically high-dimensional and highly autocorrelated state trajectory in a state-space model. The ancestor sampling procedure enables fast mixing of the PGAS kernel even when using seemingly few particles in the underlying SMC sampler. This is important as it can significantly reduce the computational burden that is typically associated with using SMC. PGAS is conceptually similar to the existing PG with backward simulation (PGBS) procedure. Instead of using separate forward and backward sweeps as in PGBS, however, we achieve the same effect in a single forward sweep. This makes PGAS well suited for addressing inference problems not only in state-space models, but also in models with more complex dependencies, such as non-Markovian, Bayesian nonparametric, and general probabilistic graphical models.

## Authors

• 29 publications
• 168 publications
• 60 publications
• ### Parameter elimination in particle Gibbs sampling

Bayesian inference in state-space models is challenging due to high-dime...
10/30/2019 ∙ by Anna Wigren, et al. ∙ 6

• ### Particle rolling MCMC with double block sampling: conditional SMC update approach

An efficient simulation-based methodology is proposed for the rolling wi...
09/26/2017 ∙ by Naoki Awaya, et al. ∙ 0

• ### Ancestor Sampling for Particle Gibbs

We present a novel method in the family of particle MCMC methods that we...
10/25/2012 ∙ by Fredrik Lindsten, et al. ∙ 0

• ### Sequential Monte Carlo for Graphical Models

We propose a new framework for how to use sequential Monte Carlo (SMC) a...
02/03/2014 ∙ by Christian A. Naesseth, et al. ∙ 0

• ### Bayesian inference in decomposable graphical models using sequential Monte Carlo methods

InthisstudywepresentasequentialsamplingmethodologyforBayesian inference ...
05/31/2018 ∙ by Jimmy Olsson, et al. ∙ 0

• ### Monte Carlo Methods for Tempo Tracking and Rhythm Quantization

We present a probabilistic generative model for timing deviations in exp...
06/24/2011 ∙ by A. T. Cemgil, et al. ∙ 0

• ### Discrete Sampling using Semigradient-based Product Mixtures

We consider the problem of inference in discrete probabilistic models, t...
07/04/2018 ∙ by Alkis Gotovos, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Monte Carlo methods are one of the standard tools for inference in statistical models as they, among other things, provide a systematic approach to the problem of computing Bayesian posterior probabilities. Sequential Monte Carlo ()

[1, 2] and Markov chain Monte Carlo () [3, 4] methods in particular have found application to a wide range of data analysis problems involving complex, high-dimensional models. These include state-space models (s) which are used in the context of time series and dynamical systems modeling in a wide range of scientific fields. The strong assumptions of linearity and Gaussianity that were originally invoked for s have indeed been weakened by decades of research on and .

These methods have not, however, led to a substantial weakening of a further strong assumption, that of Markovianity. It remains a major challenge to develop efficient inference algorithms for models containing a latent stochastic process which, in contrast with the state process in an , is non-Markovian. Such non-Markovian latent variable models arise in various settings, either from direct modeling or via a transformation or marginalization of an . We discuss this further in Section 6; see also [5, Section 4].

In this paper we present a new tool in the family of Monte Carlo methods which is particularly useful for inference in s and, importantly, in non-Markovian latent variable models. However, the proposed method is by no means limited to these model classes. We work within the framework of particle () [6] which is a systematic way of combining and , exploiting the strengths of both techniques. More specifically, samplers make use of to construct efficient, high-dimensional kernels with certain invariance properties. These kernels can then be used as off-the-shelf components in algorithms and other inference strategies relying on Markov kernels, such as Markovian stochastic approximation methods. has, in a relatively short period of time, found many applications in areas such as hydrology [7], finance [8], systems biology [9], and epidemiology [10], to mention a few.

Our method builds on the particle Gibbs () sampler proposed by [6]. In , the aforementioned Markov kernel is constructed by running an sampler in which one particle trajectory is set deterministically to a reference trajectory that is specified a priori. After a complete run of the algorithm, a new trajectory is obtained by selecting one of the particle trajectories with probabilities given by their importance weights. The effect of the reference trajectory is that the resulting Markov kernel leaves its target distribution invariant, regardless of the number of particles used in the underlying algorithm.

However, suffers from a serious drawback, which is that the mixing of the Markov kernel can be very poor when there is path degeneracy in the underlying sampler [5, 11]. Unfortunately, path degeneracy is inevitable for high-dimensional problems, which significantly reduces the applicability of . This problem has been addressed in the generic setting of s by adding a backward simulation step to the sampler, yielding a method denoted as with backward simulation (PGBS) [12, 13]. It has been found that this considerably improves mixing, making the method much more robust to a small number of particles as well as growth in the size of the data [5, 11, 12, 13].

Unfortunately, however, the application of backward simulation is problematic for models with more intricate dependencies than in s, such as non-Markovian latent variable models. The reason is that we need to consider complete trajectories of the latent process during the backward simulation pass (see Section 6 for details). The method proposed in this paper, which we refer to as particle Gibbs with ancestor sampling (PGAS), is geared toward this issue. PGAS alleviates the problem with path degeneracy by modifying the original kernel with a so called ancestor sampling step, thereby achieving the same effect as backward sampling, but without an explicit backward pass.

The PGAS Markov kernel is constructed in Section 3, extending the preliminary work that we have previously published in [14]. It is also illustrated how ancestor sampling can be used to mitigate the problems with path degeneracy which deteriorates the performance of . In Section 4 we establish the theoretical validity of the PGAS approach, including a novel uniform ergodicity result. We then show specifically how PGAS can be used for inference and learning of s and of non-Markovian latent variable models in Sections 5 and 6, respectively. The PGAS algorithm is then illustrated on several numerical examples in Section 7. As part of our development, we also propose a truncation strategy specifically for non-Markovian models. This is a generic method that is also applicable to PGBS, but, as we show in the simulation study in Section 7, the effect of the truncation error is much less severe for PGAS than for PGBS. Indeed, we obtain up to an order of magnitude increase in accuracy in using PGAS when compared to PGBS in this study. Finally, in Section 8 we conclude and point out possible directions for future work.

## 2 Sequential Monte Carlo

Let , for , be a sequence of unnormalized densities111With respect to some dominating measure which we denote simply as . on the measurable space , parameterized by . Let be the corresponding normalized probability densities:

 ¯γθ,t(x1:t)=γθ,t(x1:t)Zθ,t, (1)

where and where it is assumed that . For instance, in the (important) special case of an we have , , and . We discuss this special case in more detail in Section 5.

To draw inference about the latent variables , as well as to enable learning of the model parameter , a useful approach is to construct a Monte Carlo algorithm to draw samples from . The sequential nature of the problem suggests the use of methods; in particular, particle filters (PFs) [1, 2, 15].

We start by reviewing a standard sampler, which will be used to construct the PGAS algorithm in the consecutive section. We will refer to the index variable as time, but in general it might not have any temporal meaning. Let be a weighted particle system targeting . That is, the weighted particles define an empirical point-mass approximation of the target distribution given by

 ˆγ\Npθ,t−1(dx1:t−1)=\Np∑i=1wit−1∑lwlt−1δxi1:t−1(dx1:t−1). (2)

This particle system is propagated to time by sampling independently from a proposal kernel,

 Mθ,t(at,xt)=watt−1∑lwlt−1rθ,t(xt∣xat1:t−1). (3)

Note that depends on the complete particle system up to time , , but for notational convenience we shall not make that dependence explicit. Here, is the index of the ancestor particle of . In this formulation, the resampling step is implicit and corresponds to sampling these ancestor indices. When we write we refer to the ancestral path of particle . That is, the particle trajectory is defined recursively as

 xi1:t=(xait1:t−1,xit). (4)

Once we have generated ancestor indices and particles from the proposal kernel (3), the particles are weighted according to where the weight function is given by

 Wθ,t(x1:t)=γθ,t(x1:t)γθ,t−1(x1:t−1)rθ,t(xt∣x1:t−1), (5)

for . The procedure is initialized by sampling from a proposal density and assigning importance weights with . The sampler is summarized in Algorithm 1.

It is interesting to note that the joint law of all the random variables generated by Algorithm

1 can be written down explicitly. Let

 xt =\crangex1tx\Npt and at =\crangea1ta\Npt,

refer to all the particles and ancestor indices, respectively, generated at time  of the algorithm. It follows that the sampler generates a collection of random variables . Furthermore, are drawn independently (conditionally on the particle system generated up to time ) from the proposal kernel , and similarly at time

. Hence, the joint probability density function (with respect to a natural product of

and counting measure) of these variables is given by

 ψθ(x1:\T,a2:\T)≜\Np∏i=1rθ,1(xi1)\T∏t=2\Np∏i=1Mθ,t(ait,xit). (6)

## 3 The PGAS kernel

We now turn to the construction of PGAS, a family of Markov kernels on the space of trajectories . We will provide an algorithm for generating samples from these Markov kernels, which are thus defined implicitly by the algorithm.

### 3.1 Particle Gibbs

Before stating the PGAS algorithm, we review the main ideas of the algorithm of [6] and we then turn to our proposed modification of this algorithm via the introduction of an ancestor sampling step.

is based on an sampler, akin to a standard PF, but with the difference that one particle trajectory is specified a priori. This path, denoted as , serves as a reference trajectory. Informally, it can be thought of as guiding the simulated particles to a relevant region of the state space. After a complete pass of the algorithm, a trajectory is sampled from among the particle trajectories. That is, we draw with . This procedure thus maps

to a probability distribution on

, implicitly defining a Markov kernel on .

In a standard PF, the samples are drawn independently from the proposal kernel (3) for . When sampling from the kernel, however, we condition on the event that the reference trajectory is retained throughout the sampling procedure. To accomplish this, we sample according to (3) only for . The th particle and its ancestor index are then set deterministically as and . This implies that after a complete pass of the algorithm, the th particle path coincides with the reference trajectory, , .

The fact that is used as a reference trajectory in the sampler implies an invariance property of the kernel which is of key relevance. More precisely, as show by [6, Theorem 5], for any number of particles and for any , the PG kernel leaves the exact target distribution invariant. We return to this invariance property below, when it is shown to hold also for the proposed PGAS kernel.

### 3.2 Ancestor sampling

As noted above, the algorithm keeps the reference trajectory intact throughout the sampling procedure. While this results in a Markov kernel which leaves invariant, it has been recognized that the mixing properties of this kernel can be very poor due to path degeneracy [5, 11].

To address this fundamental problem we now turn to our new procedure, PGAS. The idea is to sample a new value for the index variable in an ancestor sampling step. While this is a small modification of the algorithm, the improvement in mixing can be quite considerable; see Section 3.3 and the numerical evaluation in Section 7. The ancestor sampling step is implemented as follows.

At time , we consider the part of the reference trajectory ranging from the current time to the final time point . The task is to artificially assign a history to this partial path. This is done by connecting to one of the particles . Recall that the ancestry of a particle is encoded via the corresponding ancestor index. Hence, we can connect the partial reference path to one of the particles by assigning a value to the variable . To do this, first we compute the weights

 ˜wit−1∣\T≜wit−1γθ,\T((xi1:t−1,x′t:\T))γθ,t−1(xi1:t−1) (7)

for . Here, refers to the point in formed by concatenating the two partial trajectories. Then, we sample with

. The expression above can be understood as an application of Bayes’ theorem, where the importance weight

is the prior probability of the particle

and the ratio between the target densities in (7) can be seen as the likelihood that originated from . A formal argument for why (7) provides the correct ancestor sampling distribution, in order to retain the invariance properties of the kernel, is detailed in the proof of Theorem 1 in Section 4.

The sampling procedure outlined above is summarized in Algorithm 2 and the family of PGAS kernels is formally defined below. Note that the only difference between and PGAS is on line 8 of Algorithm 2 (where, for , we would simply set ). However, as we shall see, the effect of this small modification on the mixing of the kernel is quite significant.

###### Definition 1 (PGAS kernels).

For any and any , Algorithm 2 maps stochastically into , thus implicitly defining a Markov kernel on . The family of Markov kernels , indexed by , is referred to as the PGAS family of kernels.

### 3.3 The effect of path degeneracy on and on PGAS

We have argued that ancestor sampling can considerably improve the mixing of . To illustrate this effect and to provide an explanation of its cause, we consider a simple numerical example. Further empirical evaluation of PGAS is provided in Section 7. Consider the stochastic volatility model,

 xt+1 =axt+vt, vt ∼\N(0,σ2), (8a) yt =etexp(12xt), et ∼\N(0,1), (8b)

where the state process is latent and observations are made only via the measurement process

. Similar models have been used to generalize the Black-Scholes option pricing equation to allow for the variance to change over time

[16, 17].

For simplicity, the parameter is assumed to be known. A batch of observations are simulated from the system. Given these, we seek the joint smoothing density . To generate samples from this density we employ both and PGAS with varying number of particles ranging from to . We simulate sample paths of length for each algorithm. To compare the mixing, we look at the update rate of versus , which is defined as the proportion of iterations where changes value. The results are reported in Figure 1, which reveals that ancestor sampling significantly increases the probability of updating for far from .

The poor update rates for is a manifestation of the well known path degeneracy problem of samplers (see, , [1]). Consider the process of sampling from the PG kernel for a fixed reference trajectory . A particle system generated by the PG algorithm (corresponding to Algorithm 2, but with line 8 replaced with ) is shown in Figure 2 (left). For clarity of illustration, we have used a small number of particles and time steps, and , respectively. By construction the reference trajectory (shown by a thick blue line) is retained throughout the sampling procedure. As a consequence, the particle system degenerates toward this trajectory which implies that (shown as a red line) to a large extent will be identical to .

What is, perhaps, more surprising is that PGAS is so much more insensitive to the degeneracy issue. To understand why this is the case, we analyze the procedure for sampling from the PGAS kernel for the same reference trajectory as above. The particle system generated by Algorithm 2 (with ancestor sampling) is shown in Figure 2 (right). The thick blue lines are again used to illustrate the reference particles, but now with updated ancestor indices. That is, the blue line segments are drawn between and for . It can be seen that the effect of ancestor sampling is that, informally, the reference trajectory is broken into pieces. It is worth pointing out that the particle system still collapses; ancestor sampling does not prevent path degeneracy. However, it causes the particle system to degenerate toward something different than the reference trajectory. As a consequence, (shown as a red line in the figure) will with high probability be substantially different from , enabling high update rates and thereby much faster mixing.

## 4 Theoretical justification

### 4.1 Stationary distribution

We begin by stating a theorem, whose proof is provided later in this section, which shows that the invariance property of is not violated by the ancestor sampling step.

###### Theorem 1.

For any and , the PGAS kernel leaves invariant:

 ¯γθ,\T(B) =∫P\Npθ(x′1:\T,B)¯γθ,\T(dx′1:\T), ∀B ∈\sigmaX\T.

An apparent difficulty in establishing this result is that it is not possible to write down a simple, closed-form expression for . In fact, the PGAS kernel is given by

 P\Npθ(x′1:\T,B)=\Eθ,x′1:\T[1B(xk1:\T)], (9)

where is the indicator function for the set and where denotes expectation with respect to all the random variables generated by Algorithm 2, , all the particles and ancestor indices , as well as the index . Computing this expectation is not possible in general. Instead of working directly with (9), however, we can adopt a strategy employed by [6], treating all the random variables generated by Algorithm 2, , as auxiliary variables, thus avoiding an intractable integration. In the following, it is convenient to view as a random variable with distribution .

Recall that the particle trajectory is the ancestral path of the particle . That is, we can write

 xk1:\T=xb1:\T1:\T≜\prangexb11xb\T\T, (10)

where the indices are given recursively by the ancestor indices: and . Let be the space of all random variables generated by Algorithm 2. Following [6], we then define a function as follows:

 ϕθ(x1:\T,a2:\T,k) ≜\Np∏i=1i≠b1¯γθ,\T(xb1:\T1:\T)\Np\Tmarginal\Np∏i=1i≠b1rθ,1(xi1)\T∏t=2\Np∏i=1i≠btMθ,t(ait,xit)% conditional, (11)

where we have introduced the notation

 x−it ={\rangex1txi−1t,\rangexi+1tx\Npt}, x−b1:\T1:\T =\crangex−b11x−b\T\T

and similarly for the ancestor indices. By construction, is nonnegative and integrates to one, , is a probability density function on . We refer to this density as the extended target density.

The factorization into a marginal and a conditional density is intended to reveal some of the structure inherent in the extended target density. In particular, the marginal density of the variables is defined to be equal to the original target density , up to a factor related to the index variables . This has the important implication that if are distributed according to , then, by construction, the marginal distribution of is .

By constructing an kernel with invariant distribution , we will thus obtain a kernel with invariant distribution (the PGAS kernel) as a byproduct. To prove Theorem 1 we will reinterpret all the steps of the PGAS algorithm as partially collapsed Gibbs steps for . The meaning of partial collapsing will be made precise in the proof of Lemma 2 below, but basically it refers to the process of marginalizing out some of the variables of the model in the individual steps of the Gibbs sampler. This is done in such a way that it does not violate the invariance property of the Gibbs kernel, , each such Gibbs step will leave the extended target distribution invariant. As a consequence, the invariance property of the PGAS kernel follows. First we show that the PGAS algorithm in fact implements the following sequence of partially collapsed Gibbs steps for .

###### Procedure 1 (Instrumental reformulation of PGAS).

Given and :

1. Draw and, for to , draw:

 {x−btt,a−btt}∼ ϕθ(⋅∣x−b′1:t−11:t−1,a2:t−1,x′,b′1:\T1:\T,b′t−1:\T), abtt∼ ϕθ(⋅∣x−b′1:t−11:t−1,a2:t−1,x′,b′1:\T1:\T,b′t:\T),
2. Draw .

###### Lemma 1.

Algorithm 2 is equivalent to the partially collapsed Gibbs sampler of Procedure 1, conditionally on and .

###### Proof.

From (11) we have, by construction,

 ϕθ(x−b1:\T1:\T,a−b2:\T2:\T∣xb1:\T1:\T,b1:\T)=\Np∏i=1i≠b1rθ,1(xi1)\T∏t=2\Np∏i=1i≠btMθ,t(ait,xit). (12)

By marginalizing this expression over we get

 ϕθ(x−b1:t1:t,a−b2:t2:t∣xb1:\T1:\T,b1:\T)=\Np∏i=1i≠b1rθ,1(xi1)t∏s=2\Np∏i=1i≠bsMθ,s(ais,xis), (13)

It follows that

 ϕθ(x−b11∣xb1:\T1:\T,b1:\T) =\Np∏i=1i≠b1rθ,1(xi1), (14a) and, for t=\range2\T, ϕθ (x−btt,a−btt∣x−b1:t−11:t−1,a−b2:t−12:t−1,xb1:\T1:\T,b1:\T) =ϕθ(x−b1:t1:t,a−b2:t2:t∣xb1:\T1:\T,b1:\T)ϕθ(x−b1:t−11:t−1,a−b2:t−12:t−1∣xb1:\T1:\T,b1:\T)=\Np∏i=1i≠btMθ,t(ait,xit). (14b)

Hence, we can sample from (14a) and (14b) by drawing for and for , respectively. Consequently, with the choice for , the initialization at line 1 and the particle propagation at line 5 of Algorithm 2 correspond to sampling from (14a) and (14b), respectively.

Next, we consider the ancestor sampling step. Recall that identifies to . We can thus write

 ϕθ(abtt ∣x1:t−1,a2:t−1,xbt:\Tt:\T,bt:\T)∝ϕθ(x1:t−1,a2:t−1,xbt:\Tt:\T,bt−1:\T) =γθ,\T(xb1:\T1:\T)γθ,t−1(xb1:t−11:t−1)γθ,t−1(xb1:t−11:t−1)Zθ,\T\Np\T\Np∏i=1i≠b1rθ,1(xi1)t−1∏s=2\Np∏i=1i≠bsMθ,s(ais,xis). (15)

To simplify this expression, note first that we can write

 γθ,t−1(x1:t−1)=γθ,1(x1)t−1∏s=2γθ,s(x1:s)γθ,s−1(x1:s−1). (16)

By using the definition of the weight function (5), this expression can be expanded according to

 γθ,t−1(x1:t−1)=Wθ,1(x1)rθ,1(x1)t−1∏s=2Wθ,s(x1:s)rθ,s(xs∣x1:s−1). (17)

Plugging the trajectory into the above expression, we get

 γθ,t−1(xb1:t−11:t−1) =wb11rθ,1(xb11)t−1∏s=2wbssrθ,s(xbss∣xb1:s−11:s−1) =⎛⎝t−1∏s=1\Np∑l=1wls⎞⎠wb11∑lwl1rθ,1(xb11)t−1∏s=2wbss∑lwlsrθ,s(xbss∣xb1:s−11:s−1) =wbt−1t−1∑lwlt−1⎛⎝t−1∏s=1\Np∑l=1wls⎞⎠rθ,1(xb11)t−1∏s=2Mθ,s(abss,xbss). (18)

Expanding the numerator in (15) according to (18) results in

 ϕθ( abtt∣x1:t−1,a2:t−1,xbt:\Tt:\T,bt:\T) ∝γθ,\T(xb1:\T1:\T)γθ,t−1(xb1:t−11:t−1)wbt−1t−1∑lwlt−1(∏t−1s=1∑lwls)Zθ,\T\Np\T\Np∏i=1rθ,1(xi1)t−1∏s=2\Np∏i=1Mθ,s(ais,xis) ∝wbt−1t−1γθ,\T((xb1:t−11:t−1,xbt:\Tt:\T))γθ,t−1(xb1:t−11:t−1). (19)

Consequently, with and , sampling from (19) corresponds to the ancestor sampling step of line 8 of Algorithm 2. Finally, analogously to (19), it follows that , which corresponds to line 12 of Algorithm 2. ∎

Next, we show that Procedure 1 leaves invariant. This is done by concluding that the procedure is a properly collapsed Gibbs sampler; see [18]. Marginalization, or collapsing, is commonly used within Gibbs sampling to improve the mixing and/or to simplify the sampling procedure. However, it is crucial that the collapsing is carried out in the correct order to respect the dependencies between the variables of the model.

###### Lemma 2.

The Gibbs sampler of Procedure 1 is properly collapsed and thus leaves invariant.

###### Proof.

Consider the following sequence of complete Gibbs steps:

1. Draw and, for to , draw:

 {x−btt,at,x––−b′t+1:\Tt+1:\T,a––−b′t+1:\Tt+1:\T}∼ϕθ(⋅∣x−b′1:t−11:t−1,a2:t−1,x′,b′1:\T1:\T,b′t:\T).
2. Draw .

In the above, all the samples are drawn from conditionals under the full joint density . Hence, it is clear that the above procedure will leave invariant. Note that some of the variables above have been marked by an underline. It can be seen that these variables are in fact never conditioned upon in any subsequent step of the procedure. That is, the underlined variables are never used. Therefore, to obtain a valid sampler it is sufficient to sample all the non-underlined variables from their respective marginals. Furthermore, from (14b) it can be seen that are conditionally independent of , , it follows that the complete Gibbs sweep above is equivalent to the partially collapsed Gibbs sweep of Procedure 1. Hence, the Gibbs sampler is properly collapsed and it will therefore leave invariant. ∎

###### Proof (Theorem 1).

Let

 L(dx−b′1:\T1:\T,da2:\T,dk∣x′1:\T,b′1:\T) (20)

denote the law of the random variables generated by Procedure 1, conditionally on and on . Using Lemma 2 and recalling that we have

 ¯γθ,\T(B) =∫1B(xk1:\T)L(dx−b′1: