Scalable Monte Carlo inference for state-space models

We present an original simulation-based method to estimate likelihood ratios efficiently for general state-space models. Our method relies on a novel use of the conditional Sequential Monte Carlo (cSMC) algorithm introduced in Andrieu_et_al_2010 and presents several practical advantages over standard approaches. The ratio is estimated using a unique source of randomness instead of estimating separately the two likelihood terms involved. Beyond the benefits in terms of variance reduction one may expect in general from this type of approach, an important point here is that the variance of this estimator decreases as the distance between the likelihood parameters decreases. We show how this can be exploited in the context of Monte Carlo Markov chain (MCMC) algorithms, leading to the development of a new class of exact-approximate MCMC methods to perform Bayesian static parameter inference in state-space models. We show through simulations that, in contrast to the Particle Marginal Metropolis-Hastings (PMMH) algorithm of Andrieu_et_al_2010, the computational effort required by this novel MCMC scheme scales very favourably for large data sets.

Authors

• 9 publications
• 11 publications
• 67 publications
• Replica Conditional Sequential Monte Carlo

We propose a Markov chain Monte Carlo (MCMC) scheme to perform state inf...
05/13/2019 ∙ by Alexander Y. Shestopaloff, et al. ∙ 0

• Ensemble MCMC: Accelerating Pseudo-Marginal MCMC for State Space Models using the Ensemble Kalman Filter

Particle Markov chain Monte Carlo (pMCMC) is now a popular method for pe...
06/05/2019 ∙ by Christopher Drovandi, et al. ∙ 0

• Efficient State-space Exploration in Massively Parallel Simulation Based Inference

Simulation-based Inference (SBI) is a widely used set of algorithms to l...
06/29/2021 ∙ by Sourabh Kulkarni, et al. ∙ 0

• Sequential Stratified Regeneration: MCMC for Large State Spaces with an Application to Subgraph Counting Estimation

This work considers the general task of estimating the sum of a bounded ...
12/07/2020 ∙ by Carlos H. C. Teixeira, et al. ∙ 0

• Global consensus Monte Carlo

For Bayesian inference with large data sets, it is often convenient or n...
07/24/2018 ∙ by Lewis J. Rendell, et al. ∙ 0

• Learning of state-space models with highly informative observations: a tempered Sequential Monte Carlo solution

Probabilistic (or Bayesian) modeling and learning offers interesting pos...
02/06/2017 ∙ by Andreas Svensson, et al. ∙ 0

• State Space LSTM Models with Particle MCMC Inference

Long Short-Term Memory (LSTM) is one of the most powerful sequence model...
11/30/2017 ∙ by Xun Zheng, et al. ∙ 0

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

State-space models (SMMs) form an important class of statistical model used in many fields; see Douc et al. (2014) for a recent overview. In its simplest form a SSM is comprised of an -valued latent Markov chain and a -valued observed process

. The latent process has initial probability with density

and transition density ; both probability densities defined on and with respect to a common dominating measure on denoted generically as and parametrized by some . Naturally, non-dynamical models for which form a particular case. The observation at time

is assumed conditionally independent of all other random variables given

and its conditional observation density is on with respect to the dominating measure on . For a given value we will refer to this model as , and the corresponding joint density of the latent and observed variables up to time is

 pθ(x1:T,y1:T)=μθ(x1)T∏t=2fθ(xt−1,xt)T∏t=1gθ(xt,yt). (1)

from which the likelihood function associated to the observations can be obtained

 lθ(y1:T):=∫XTpθ(x1:T,y1:T)dx1:T. (2)

Such models are typically intractable, therefore requiring the use of numerical methods to carry out inference about . Significant progress was made in the 1990s and early 2000’s to solve numerically the so-called filtering/smoothing problem, that is, assuming known, efficient methods were proposed to approximate the posterior density or some of its marginals. Indeed particle filters, or more generally Sequential Monte Carlo methods (SMC), have been shown to provide a set of versatile and efficient tools to approximate the aforementioned posteriors by exploiting the sequential structure of , and their theoretical properties are now well understood (Del Moral, 2004).

Estimating , the static parameter, is however known to be much more challenging. Indeed, likelihood based methods (e.g. maximum likelihood or Bayesian estimation) usually require evaluation of or its derivatives in order to be implemented practically; see Kantas et al. (2015) for a recent review. As we shall see, of particular interest is the estimation of the likelihood ratio, that is for ,

 L(θ,θ′):=lθ′(y1:T)lθ(y1:T).

In a classical set-up

plays a central role in testing, for example, but is also a direct route to the numerical evaluation of the gradient of the log-likelihood function or the implementation of Markov chain Monte Carlo (MCMC) algorithms used to perform Bayesian inference.

The first contribution of the present paper is the realization that the conditional SMC (cSMC) kernel introduced in Andrieu et al. (2010), an MCMC kernel to sample from , can be combined with Annealing Importance Sampling (AIS) (Crooks, 1998; Neal, 2001) in order to develop efficient estimators of . Central to the good behaviour of this class of estimators is the fact that rather than estimating numerator and denominator independently, as suggested by current methods, this is here performed jointly using a unique source of randomness. Alternative approaches exploiting this principle have been explored briefly in Lee and Holmes (2010) and studied more thoroughly in Deligiannidis et al. (2015) in the context of MCMC simulations. Our estimator differs substantially from these earlier proposals. The second contribution here is to provide theory for this novel likelihood ratio estimator and show how this estimator can be exploited in numerical procedures in order to design algorithms which scale well with the number of data points. In particular we present a new exact approximate MCMC scheme for perform Bayesian static parameter inference in SSMs and we demonstrate its performance through simulations.

2 Likelihood ratio estimation in SSM with cSMC

An efficient technique to estimate for consists of using SMC methods. The algorithm is presented in Algorithm 1

; it requires a user defined instrumental probability distribution

and a Markov kernel , referred to as can be made time dependent, but we aim to keep notation simple here. We also use the notation to refer to the probability distribution of a discrete valued random variable taking values in such that . An estimator of the likelihood can be obtained by

 ^lθ(y1:T):=T∏t=11NN∑i=1w(i)t. (3)

This estimator has attractive properties. It is unbiased (Del Moral, 2004) and has a relative variance which scales linearly in under practically relevant conditions (Cérou et al., 2011). One can therefore use two such independent SMC estimators for and and compute their ratio to estimate . However better estimators are possible if one introduces positive dependence between the two estimators, this is exploited in Deligiannidis et al. (2015) and Lee and Holmes (2010). Our approach relies on the same idea but the estimator we propose is very different from these earlier proposals and complementary, as discussed later in the paper.

We rely here on the AIS method of Crooks (1998); Neal (2001) which is a state-of-the-art Monte Carlo approach to estimate ratio of normalizing constants. For the method requires one to first choose a family of probability distributions defined on , whose aim is to “bridge” and , and a family of transition probabilities and a mapping . The role of these quantities is clarified below. For we say that , and associated with and satisfy (A2) if

• Conditions on , and ,

1. is a family of probability distributions on satisfying

1. the end point conditions and as defined by and ,

2. for any and such that , implies ,

2. is such that for any , leaves invariant,

3. a non-decreasing mapping such that and .

In order to implement the AIS procedure, one chooses and considers the sub-family of probability distributions such that for any and the corresponding family of transition kernels . The integer therefore represents the number of intermediate distributions introduced to bridge and , which is allowed to be zero. For notational simplicity we will drop the dependence on of the elements of and when no ambiguity is possible. Let and consider the non-homogeneous Markov chain such that and for . It is routine to show that under these assumptions the quantity

has expectation . The interest of this identity is that whenever where is an unknown normalising constant but can be evaluated pointwise then

 K∏k=0γθ,θ′,k+1(Uk)γθ,θ′,k(Uk) (4)

is an unbiased estimator of

. Consequently, for and , this provides us with a way of estimating the desired likelihood ratio . The algorithm is summarized in Algorithm 2, which should be initialised with to lead to an unbiased estimator of .

In general sampling exactly from is not possible. Instead one can run an MCMC with transition kernel , hence targetting , for iterations. Provided is ergodic one can feed into and control bias through . There are several ways one can reduce variability of this estimator. Under natural smoothness assumptions on and the mapping , and ergodicity of one can show that this estimator is consistent as . More simply it is also possible, for fixed, to consider independent copies of the estimator and consider their average–the latter strategy has the advantage that it lends itself trivially to parallel computing architectures, in contrast to the former.

There is an additional natural and “free” control of both bias and variance when computation of is required only for and “close”. Indeed in such scenarios, provided the models considered are smooth enough in , one expects the estimation of to be easier since the densities and will be close to one another. For illustration and concreteness we briefly describe this fact in the context of a stochastic gradient algorithm to maximize –the main focus of the paper is on sampling, but this requires additional technicalities. Assume is intractable and that we wish to use a finite difference method to approximate this quantity. The simultaneous perturbation (SPSA) approach of Spall (1992) is such a method, which naturally lends itself to the use of our class of estimators . Let be a, possibly random,

dimensional vector such that

, then a possible estimator of could be the vector whose th component is

 12[δ]ilog(lθ+δ(y1:T)lθ−δ(y1:T))≈12[δ]i(lθ+δ(y1:T)lθ−δ(y1:T)−1),

which depends on the likelihood ratio . A natural idea is to plug-in the AIS estimator developed earlier and note that such a strategy is likely to be better than a strategy which would consists of estimating numerator and denominator independently.

We now discuss natural choices of and for this AIS procedure in the context of state-space models. These choices are crucial to the good performance of the algorithm.

Choice of Pθ,θ′

A standard choice consists of using geometric annealing, that is define for ,

 γθ,θ′,ς(x1:T):=γθ(x1:T)1−ςγθ′(x1:T)ς,

and, for example, set for . This can be written in a form similar to that arising from a state-space model

 γθ,θ′,ς(x1:T)=~μθ,θ′,ς(x1)T∏t=2~fθ,θ′,ς(xt−1,xt)T∏t=1~gθ,θ′,ς(xt,yt),

where for and , , and . This could at first sight be a good choice since the sequential structure of the model crucial to the implementation of efficient sampling techniques is preserved. However, except for very specific cases such as when the densities involved belong to the exponential family, the normalising constant of may be intractable, while being dependent on and . While this is not an issue for the computation of 4, this may lead to complications when implementing sampling techniques relying on SMC (see Algorithm 1 and Remark 1). A way around this problem consists of defining such that and , and

 γθ,θ′,ς(x1:T)=γϑ(ς)(x1:T),

which trivially admits the desired sequential structure and defines a tractable model. For example when is convex the choice will always work.

Choice of Rθ,θ′

The conditional SMC (cSMC) algorithm belongs to the class of particle MCMC algorithms introduced in Andrieu et al. (2009, 2010). It is an SMC based algorithm (see Algorithm 1) particularly well suited to sampling from distributions arising from models with a sequential structure, similar to that of for any . More precisely, for the cSMC targetting yields a Markov transition kernel of invariant distribution , therefore lending itself to being used as an MCMC method. The cSMC update has been shown both empirically and theoretically to possess good convergence properties–see Andrieu et al. (2013); Chopin and Singh (2015); Lindsten et al. (2015) for recent studies of its theoretical properties. In its original form the algorithm, corresponding to in Algorithm 3, may suffer from the so-called path degeneracy, meaning that because of the successive resampling steps involved the particle paths at time have few distinct values for , resulting in poor mixing of the corresponding MCMC. The cSMC with backward resampling as suggested by Whiteley (2010) overcomes this problem by enabling reselection of ancestors; a closely related approach is the ancestor resampling technique of Lindsten et al. (2014). This is described in the second part of Algorithm 3, and corresponds to .

Reversibility of cSMC with or without backward sampling with respect to as well as its theoretical superiority over the original cSMC are proven in Chopin and Singh (2015). As shown in (Chopin and Singh, 2015; Andrieu et al., 2013; Lindsten et al., 2015), convergence to stationarity can be made arbitrarily fast as increases. For conciseness we will refer to in Algorithm 2 for which consists of for all relevant ’s as where is the set of instrumental methods required to implement the cSMCs targetting the distributions in .

Remark 1.

Contrary to the original cSMC, cSMC with backward sampling is limited to scenarios where the transition density is computable pointwise. Even when pointwise evaluation is feasible, the backward sampling approach will be inefficient if is close to singular; e.g. if arises from the fine time discretization of a diffusion process.

Remark 2.

It is clear that there is another way of reducing variability : one can draw several paths in the backward sampling stage and average the corresponding estimators. We do not pursue this here.

3 Application to exact approximate MCMC for SSM

In a Bayesian framework, the static parameter is ascribed a probability distribution with density (with respect to a dominating measure denoted ) from which one defines the posterior distribution of given observations with density

 π(θ,x1:T)∝η(θ)pθ(x1:T,y1:T), (6)

(we drop in for notational simplicity). This posterior distribution and its marginal

are potentially highly complex objects to manipulate in practice and (sampling) Monte Carlo methods are often the only viable methods available to extract information from such models. Assume for a moment that our primary interest is in inferring

, and therefore that sampling from is our concern. Among Monte Carlo methods, MCMC techniques are often the only possible option–we however refer the reader to Crişan and Miguez (2013); Kantas et al. (2015) for purely particle based on-line methods. MCMC rely on the design of ergodic Markov chains with the distribution of interest as invariant distribution, say with invariant distribution for our problem. The Metropolis–Hastings (MH) algorithm plays a central role in the design of MCMC transition probabilities, and proceeds as follows in our context. Given a family of user defined and instrumental probability distributions on ,

We will refer to as the acceptance ratio and call this MH algorithm targeting the marginal MH algorithm. A crucial point for the implementation of the algorithm is the requirement to be able to evaluate the likelihood ratio . This significantly reduces the class of models for which the algorithm above can be used. In particular, one cannot apply this algorithm to non-linear non-Gaussian SSMs as the likelihood (2) is intractable.

3.1 State of the art

A classical way around this type of intractability problem consists of running an MCMC algorithm targeting the joint distribution

when evaluating this density, possibly up to a constant, is feasible. This significantly broadens the class of models under consideration to which MCMC can be applied. There are, however, well documented difficulties with this approach. The standard strategy consists of updating alternately conditional upon and conditional upon . As is a high-dimensional vector, one typically updates it by sub-blocks using MH steps with tailored proposal distributions (Shephard and Pitt, 1997). However, for complex SSMs, it is very difficult to design efficient proposal distributions. An alternative consists of using the cSMC update described in Algorithm 3 which allows one to update the state conditional upon in one block. A strong dependence between and may however still lead to underperforming algorithms. We will come back to this point later in the paper.

A powerful alternative method to tackle intractability which has recently attracted some interest consists of replacing the value of with a non-negative random estimator whenever it is required in (7) for the implementation of the marginal MH algorithm. If for all and a constant it turns out to lead to exact algorithms, that is sampling from is guaranteed at equilibrium under very mild assumptions on . This approach leads to so called pseudo-marginal algorithms (Andrieu and Roberts, 2009). As SMC provides a nonnegative unbiased estimate (3) of for SSMs (Del Moral, 2004), a pseudo-marginal approximation of the marginal MH algorithm for state-space models is possible in this context. The resulting algorithm, the particle marginal MH (PMMH) introduced Andrieu et al. (2009, 2010), is presented in Algorithm 5 .

The PMMH defines a Markov chain which leaves invariant marginally. However, as shown in Andrieu et al. (2009, 2010), it is easy to recover samples from by adding an additional step to Algorithm 5.

Although the PMMH has been recognised as significantly extending the applicability of MCMC to a broader class of state-space models Flury and Shephard (2011), it comes with some drawbacks. In particular the performance of the resulting MCMC algorithm depends heavily on the variability of the induced acceptance ratio (Andrieu and Roberts, 2009; Andrieu and Vihola, 2015, 2014; Doucet et al., 2015; Pitt et al., 2012; Sherlock et al., 2015), and overestimates of lead to an algorithm rejecting many transitions away from , resulting in poor performance. This means for example that should scale linearly with in order to maintain a set level of performance as increases. In the following, we present another new class of exact approximate MCMC algorithms targetting , which still update jointly but can be interpreted as using unbiased estimates of the acceptance ratio computed afresh at each iteration of the MCMC algorithm. This lack of memory is to be contrasted with the potentially calamitous reliance of the PMMH’s acceptance ratio on the estimate of the likelihood obtained the last time an acceptance occurred (refreshing this quantity using SMC would lead to an invalid algorithm, see Beaumont (2003); Andrieu and Roberts (2009)). In addition, as we shall see, algorithms such as the marginal MH in Algorithm 4 requires a proposal such that the distance between and is of order in order to account for the concentration of the posterior distribution. This turns out to provide us with an additional built-in beneficial mechanism to reduce variability of our estimator of the acceptance ratio, independent of .

3.2 AIS within Metropolis-Hastings

In order to define a valid MH update which uses the estimators of described in Section 2, additional conditions to those of (A2) are required–fortunately these conditions are satisfied by the cSMC update, with or without backward sampling (Chopin and Singh, 2015).

• For any , and satisfying (A2), and such that

1. the distributions in satisfy for any

2. the transition kernels in satisfy, for any ,

1. ,

2. is reversible.

Following the setup above, the pseudocode of MCMC AIS is given in Algorithm 6.

It can be shown that this algorithm is reversible with respect to for any ; see Neal (2004) and Karagiannis and Andrieu (2013) for details. An important point here is that although the approximated acceptance ratio is reminiscent of that of a MH algorithm targeting , the present algorithm targets the joint density : the simplification occurs only because the random variable corresponding to will be approximately distributed according to when is large enough, under proper mixing conditions. When this transition leads to a reducible algorithm since is not updated. However this scheme can be used as part of a Metropolis-within-Gibbs where is updated conditional upon the parameter using, say, . We will refer to the latter algorithm for which is a cSMC with backward sampling as Metropolis-within-Particle-Gibbs (MwPG) in the rest of the paper.

Remark 3.

In the scenario where a cSMC procedure involving particles is used, the algorithm above may seem wasteful as only one particle is used in order to approximate the likelihood ratio in (8). Ideally one would want to use particles and average likelihood ratio estimators in order to reduce variability and improve the properties of the algorithm. Using this averaged estimator of the likelihood ratio in Algorithm 6 would, however, lead to a Markov kernel which does preserve as an invariant density. A novel methodology allowing the use of such averaged estimators within MCMC has been developed in Andrieu et al. (2016).

4 A theoretical analysis

In this section we develop an analysis of the likelihood ratio estimator and of the MCMC AIS algorithm in a scenario which can be treated rigorously in a few pages, but yet is of practical interest–in particular our findings are supported empirically by the simulations of Section 5, where more general scenarios are considered, and shed some light on some of our empirical results. Extension to more general scenarios is however far beyond the scope of the present manuscript. We consider the scenario where for any , is independent of , that is for any

 pθ(x1:T,y1:T)=T∏t=1pθ(xt,yt),

with

 pθ(xt,yt):=fθ(xt)gθ(yt∣xt).

We define the conditional distributions where . We further assume that the marginal MH algorithm underpinning our update is a random walk Metropolis (RWM) algorithm and that . Our aim is to show that as the algorithm does not degenerate, in a sense to be made more precise below, provided the RWM proposal distribution is properly scaled with and sufficiently large, where is the number of particles used in the cSMC. In particular is not required to grow with , as observed in simulations–see Theorem 1 for a precise formulation of our result. This should be contrasted with results from the simulated likelihood literature where the condition is necessary to ensure asymptotic efficiency of the maximum simulated likelihood estimator (Flury and Shephard, 2011; Lee, 1992) . We now introduce some notation useful in order to formulate and prove our result. The intermediate distribution is defined as

 γθ,θ′,1(x1:T):=p(θ+θ′)/2(x1:T,y1:T);

it will be clear from our proof that this is in no way a restriction but has the advantage of keeping our development as simple as possible. To define our RWM we require an increment proposal distribution based on a symmetric increment distribution (independent of ) and such that . It will be convenient in what follows to define a proposed sample in the following way: for any ( will be distributed according to ) we let

 θ′(ϵ,T):=θ+ϵ√Tand~θ(ϵ,T):=θ+θ′(ϵ,T)2.

For simplicity of presentation we assume that . As a result for any and we let

be the marginal acceptance ratio, which is zero whenever . For the acceptance ratio of the MCMC-AIS algorithm can be written as

 ~rT(θ,ϵ;ω,ξ):=rT(θ,ϵ;ω)exp(ΛT(θ,ϵ;ω,ξ))

where for ,

 ΛT(θ,ϵ;ω,ξ) :=logp~θ(ϵ,T)(x1:T∣y1:T)pθ(x1:T∣y1:T)+logpθ′(ϵ,T)(x′1:T∣y1:T)p~θ(ϵ,T)(x′1:T∣y1:T) (9) =T∑t=1⎧⎨⎩logp~θ(ϵ,T)(xt∣yt)pθ(xt∣yt)+logpθ′(ϵ,T)(x′t∣yt)p~θ(ϵ,T)(x′t∣yt)⎫⎬⎭.

In order to limit the amount of notation we will not distinguish between random variables and their realisations using small/capital letters whenever Greek letters are used. For any and we let denote an MCMC kernel targeting the probability distribution of density using a tuning parameter governing its ergodicity properties: we have here in mind a conditional SMC using particles, but this will not be a requirement (one could iterate a given ergodic and reversible kernel times for example). Now for any and we define the process as a sequence of independent random vectors with marginal laws given by –we omit the dependence of on (and ) for notational simplicity, may write for when no ambiguity is possible, but we should bear in mind that we will deal with triangular arrays of random variables in what follows. We let , and