# Stochastic Gradient MCMC for State Space Models

State space models (SSMs) are a flexible approach to modeling complex time series. However, inference in SSMs is often computationally prohibitive for long time series. Stochastic gradient MCMC (SGMCMC) is a popular method for scalable Bayesian inference for large independent data. Unfortunately when applied to dependent data, such as in SSMs, SGMCMC's stochastic gradient estimates are biased as they break crucial temporal dependencies. To alleviate this, we propose stochastic gradient estimators that control this bias by performing additional computation in a `buffer' to reduce breaking dependencies. Furthermore, we derive error bounds for this bias and show a geometric decay under mild conditions. Using these estimators, we develop novel SGMCMC samplers for discrete, continuous and mixed-type SSMs. Our experiments on real and synthetic data demonstrate the effectiveness of our SGMCMC algorithms compared to batch MCMC, allowing us to scale inference to long time series with millions of time points.

## Authors

• 6 publications
• 12 publications
• 11 publications
• 27 publications
• ### Stochastic Gradient MCMC for Nonlinear State Space Models

State space models (SSMs) provide a flexible framework for modeling comp...
01/29/2019 ∙ by Christopher Aicher, et al. ∙ 12

• ### Stochastic Gradient MCMC Methods for Hidden Markov Models

Stochastic gradient MCMC (SG-MCMC) algorithms have proven useful in scal...
06/14/2017 ∙ by Yi-an Ma, et al. ∙ 0

• ### Scalable approximate inference for state space models with normalising flows

By exploiting mini-batch stochastic gradient optimisation, variational i...
10/02/2019 ∙ by Tom Ryder, et al. ∙ 0

• ### Scalable MCMC for Mixed Membership Stochastic Blockmodels

We propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algo...
10/16/2015 ∙ by Wenzhe Li, et al. ∙ 0

• ### Mini-batch Metropolis-Hastings MCMC with Reversible SGLD Proposal

Traditional MCMC algorithms are computationally intensive and do not sca...
08/08/2019 ∙ by Tung-yu Wu, et al. ∙ 1

• ### Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series

Integrating deep learning with latent state space models has the potenti...
05/30/2019 ∙ by Zhi-Xuan Tan, et al. ∙ 0

• ### Rao-Blackwellized Stochastic Gradients for Discrete Distributions

We wish to compute the gradient of an expectation over a finite or count...
10/10/2018 ∙ by Runjing Liu, et al. ∙ 0

## Code Repositories

### sgmcmc_ssm_code

Python code for SGMCMC for Time Series SSMs

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

State space models (SSMs) are ubiquitous in the analysis of time series in fields as diverse as biology [69], finance and economics [40, 74], and systems and control [26]. As a defining feature, SSMs augment the observed time series with a latent state sequence

to model complex time series dynamics with a latent Markov chain dependence structure. Given a time series, inference of model parameters involves sampling or marginalizing this latent state sequence. Unfortunately, both the runtime and memory required scale with the length of the time series, which is prohibitive for long time series (e.g. high frequency stock prices

[33], genome sequences [25], or neural impulse recordings [16]). In practice, given a long time series, one could ‘segment’ or ‘downsample’ to reduce length; however, this preprocessing can destroy or change important signals and computational considerations should ideally not limit scientific modeling.

To help scale inference in SSMs, we consider stochastic gradient Markov chain Monte Carlo (SGMCMC), a popular method for scaling Bayesian inference to large data sets [12, 47, 68]. The key idea of SGMCMC is to employ stochastic gradient estimates based on subsets or ‘minibatches’ of data, avoiding costly computation of gradients on the full dataset, such that the resulting dynamics produce samples from the posterior distribution over SSM parameters. This approach has found much success in independent data models, where the stochastic gradients are unbiased estimates of the true gradients. However, when applying SGMCMC to SSMs, naive stochastic gradients are biased, as subsampling the data breaks dependencies in the SSM’s latent state sequence. This bias can destroy the dynamics of SGMCMC causing it to fail when applied to SSMs. The challenge is to correct these stochastic gradients for SSMs while maintaining the computational benefits of SGMCMC.

In this work, we develop computationally efficient stochastic gradient estimators for inference in general discrete-time SSMs. To control the bias of stochastic gradients, we marginalize the latent state sequence in a buffer around each subsequence, propagating critical information from outside each subsequence to its local gradient estimate while avoiding costly full-chain computations. Similar buffering ideas have been previously considered for belief propagation [32], variational inference [27]

, and in our earlier work on SGMCMC for hidden Markov models (HMMs)

[48], but all are limited to discrete latent states. Here, we present buffering as an approximation to Fisher’s identity [9], allowing us to naturally extend buffering trick to continuous and mixed-type latent states.

We further develop analytic bounds on the bias of our proposed gradient estimator that, under mild conditions, decay geometrically in the buffer size. To obtain these bounds we prove that the latent state sequence posterior distribution has an exponential forgetting property [9, 17]. However unlike classic results which prove a geometric decay between the approximate and exact marginal posterior distributions in total variation distance, we use Wasserstein distance [66] to allow analysis of continuous and mixed-type latent state SSMs. Our approach is similar to proofs of Wasserstein ergodicity in homogeneous Markov chains [24, 49, 57]; however we extend these ideas to the nonhomogeneous Markov chains defined by the latent state sequence posterior distribution. These geometrically decaying bounds guarantee that we only need a small buffer size in practice, allowing scalable inference in SSMs.

Although our proposed gradient estimator can be generally applied to any stochastic gradient method, here, we develop SGMCMC samplers for Bayesian inference in a variety of SSMs such as HMMs, linear Gaussian SSMs (LGSSM), and switching linear dynamical systems (SLDS) [9, 29]. We also derive preconditioning matrices to take advantage of information geometry, which allows for more rapid mixing and convergence of our samplers [31, 52]. Finally, we validate our algorithms and theory on a variety of synthetic and real data experiments, finding that our gradient estimator can provide orders of magnitude run-time speed ups compared to batch sampling.

This paper significantly expands upon our initial work [48], by (i) connecting buffering to Fisher’s identity, simplifying its presentation and analysis, (ii) non-trivially generalizing the approach to SSMs beyond the HMM, including continuous and mixed-type latent states, (iii) developing a general framework for bounding the error of buffered gradient estimators using Wasserstein distance, and (iv) providing extensive validation on a number of real and synthetic datasets.

The paper is organized as follows. First, we review background on SSMs and SGMCMC methods in Section 2. We then present our framework of constructing buffered gradient estimators to extend SGMCMC to SSMs in Section 3. We prove the geometrically decaying bounds for our proposed buffered gradient estimate in Section 4. We apply our framework and error bounds to discrete, continuous and mixed-type latent state SSMs in Section 5. Finally, we investigate our algorithms on both synthetic and real data in Section 6.

## 2 Background

### 2.1 State Space Models for Time Series

State space models (SSMs) for time series are a class of discrete-time bivariate stochastic process , , consisting of a latent state sequence generated by a homogeneous Markov chain and an observation sequence generated independently conditioned on  [9]. Examples of state space models include: HMMs, LGSSMs, and SLDSs (see Section 5

for details). For a generic SSM, the joint distribution of

and factorizes as

 Pr(y,u|θ)=T∏t=−TPr(yt|ut,θ)Pr(ut|ut−1,θ), (1)

where are model-specific parameters. As the latent state sequence is unobserved, the likelihood of given only the observations (marginalizing ) is

 Pr(y|θ)=∫T∏t=−TPr(yt|ut,θ)Pr(ut|ut−1,θ)du, (2)

Unconditionally, the observations are not independent and the graphical model of this marginal likelihood, Eq. (2), has many long term dependencies, Figure 1 (right). In contrast, when conditioned on the observations are independent and the complete-data likelihood, Eq. (1), has a simpler chain structure, Figure 1 (left).

To infer given , we can maximize the marginal likelihood or, given a prior , sample from the posterior . However, traditional inference methods for

, such as expectation maximization (EM), variational inference, or Markov chain Monte Carlo (MCMC), take advantage of the conditional independence structure in

, Eq. (1), rather than working directly with , Eq. (2[5, 58]. To use with unobserved , these methods rely on sampling or taking expectations of from the posterior . As an example, gradient-based methods take advantage of Fisher’s identity [9]

 ∇logPr(y|θ)=Eu|y,θ[∇logPr(y,u|θ)]=Eu∼γ[∇logPr(y,u|θ)], (3)

which allows gradients of Eq. (2) to be computed in terms of Eq. (1). To compute the posterior , these methods use the well-known forward-backward algorithm [9, 58]. The algorithm works by recursively computing a sequence of forward messages and backward messages which are used to compute the pairwise marginals of . More specifically,

 αt(ut) :=Pr(ut,y≤t|θ)=∫Pr(yt,ut|ut−1,θ)αt−1(ut−1)dut−1 (4) βt(ut) :=Pr(y>t|ut,θ)=∫Pr(yt+1,ut+1|ut,θ)βt+1(ut+1)dut+1 (5) γt−1:t(ut−1,ut) :=Pr(ut−1,ut|y,θ)∝αt−1(ut−1)Pr(yt,ut|ut−1,θ)βt(ut). (6)

When message passing is tractable (i.e., when Eqs. (4)-(5) involve discrete or conjugate likelihoods), the forward-backward algorithm can be calculated in closed form. When message passing is intractable, the messages can be approximated using Monte-Carlo sampling methods (e.g. blocked Gibbs sampling [10, 28], particle methods [2, 21, 60]). In both cases, when the length of the time series is much larger than the dimension of , the forward-backward algorithm (running over the entire sequence) requires time and memory at each iteration.

The SSM challenge is to scale inference of model parameters to long time series when the computation and storage per iteration is prohibitive.

One popular method for scalable Bayesian inference is stochastic gradient Markov chain Monte Carlo (SGMCMC) [12, 47, 68]. The idea behind gradient-based MCMC is to simulate continuous dynamics of a potential energy function such that the dynamics generate samples from the posterior distribution . For example, the Langevin diffusion over is given by the stochastic differential equation (SDE)

 dθs=−∇U(θ)ds+√2dWs, (7)

where is Brownian motion, , and indexes continuous time. As , the distribution of converges to the SDE’s stationary distribution, which by the Fokker-Planck equation is the posterior  [47, 55]. Because we cannot perfectly simulate Eq. (7), in practice we use a discretized numerical approximation. One straightforward approximation is the Euler-Mayurma discretization

 θ(s+1)←θ(s)−h∇U(θ(s))+N(0,2h), (8)

where is the stepsize and indexes discrete time steps. This recursive update defines the Langevin Monte-Carlo (LMC) algorithm. Typically, a Metropolis-Hastings correction step is added to account for the discretization error [55].

For large datasets, computing at every step in Eq. (8) is computationally prohibitive. To alleviate this, the key ideas of stochastic gradient Langevin dynamics (SGLD) are to replace

with a quick-to-compute unbiased estimator

and to use a decreasing stepsize to avoid costly Metropolis-Hastings correction steps [68]

 θ(s+1)←θ(s)−h(s)∇˜U(θ(s))+N(0,2h(s)). (9)

For i.i.d. data, an example of is to use a random minibatch ,

 ∇ˆU(θ)=−1Pr(S)∑t∈S∇logPr(yt|θ)−∇logPr(θ), (10)

which only requires time to compute. When is unbiased and with an appropriate decreasing stepsize schedule , SGLD converges to the posterior distribution [12, 61]. However, in practice one uses a small, finite step-size for greater efficiency, which introduces a small bias [15].

A Riemannian extension of SGLD (SGRLD) simulates the Langevin diffusion over a Riemannian manifold with metric by preconditioning the gradient and noise of Eq. (9) by . By incorporating geometric information about structure of , SGRLD aims for a diffusion which mixes more rapidly. Suggested examples of the metric are the Fisher information matrix or a noisy Hessian estimate  [31, 52]. Given , each step of SGRLD is

 θ(s+1)←θ(s)−h[D(θ(s))∇~U(θ(s))+Γ(θ(s))]+N(0,2hD(θ(s))) (11)

where the vector

is a correction term to ensure the dynamics converge to the target posterior [47, 70]. Many other SGMCMC have been proposed [4, 12, 13, 19, 43]. Although our ideas extend to these formulations as well, we focus on the popular SGLD and SGRLD algorithms.

To apply SGMCMC to SSMs, we must choose whether to use the complete-data loglikelihood or the marginal data loglikelihood in the potential . If we use the complete-data loglikelihood, then we treat as the parameters. Although the observations conditioned on are independent, we must calculate gradients for at each iteration, which is prohibitive for long sequences and intractable for discrete or mixed-type . On the other hand, if we use the marginal loglikelihood, then we only need to take gradients in . However, the observations conditioned on alone are not independent and therefore the minibatch gradient estimator Eq. (10) breaks crucial dependencies causing it to be biased. Our SGMCMC challenge is correcting the bias in stochastic gradient estimates when applied to SSMs.

## 3 General Framework

We now present our framework for scalable Bayesian inference in SSMs with long observation sequences. Our approach is to extend SGMCMC to SSMs by developing a gradient estimator that ameliorates the issue of broken temporal dependencies. In particular, we develop a computationally efficient gradient estimator that uses a buffer to avoid breaking crucial dependencies, only breaking weak dependencies. We first present a (computationally prohibitive) unbiased estimator of for SSMs using Fisher’s identity. We then derive a general computationally efficient gradient estimate that accounts for the dependence in observations using a buffer. We also propose preconditioning matrices for SGRLD with SSMs. Finally, we present our general SGMCMC pseudocode for SSMs.

The main challenge in constructing an efficient estimate of for SSMs is handling the lack of independence (marginally) in . Because the observations in SSMs are not independent, we cannot produce an unbiased estimate of with a randomly selected subset of data points as in Eq.(10). For example, a naive estimate for is to take the gradient of a random contiguous subsequence with

 ∇ˆU(θ)=−1Pr(S)∇logPr(yS|θ)−∇logPr(θ). (12)

This estimate only requires time compared to the for . However because the marginal likelihood does not factorize as in the independent observations case, this estimate is biased .

To obtain an unbiased estimate for , we use Fisher’s identity Eq. (3) to rewrite in terms of the complete-data loglikelihood as a sum over time points

 ∇U(θ) =−∇logPr(y|θ)−∇logPr(θ) (13) =−Eu|y,θ[∇logPr(y,u|θ)]−∇logPr(θ) =−∑t∈TEu|y,θ[∇logPr(yt,ut|ut−1,θ)]−∇logPr(θ)

From this, we straightforwardly identify an unbiased estimator for a subsequence

 ∇¯¯¯¯U(θ)=−1Pr(S)∑t∈SEu|y,θ[∇logPr(yt,ut|ut−1,θ)]−∇logPr(θ). (14)

Although Eq. (14) reduces the number of gradient terms to compute from to , the summation terms require calculating expectations of . More specifically, Eq. (14) requires expectations with respect to the pairwise marginal posteriors for . Recall that computing these marginals take time to pass messages over the entire sequence . This defeats the purpose of using a subsequence. If we instead only pass messages over the subsequence , then the pairwise marginals are and we return to the naive biased gradient estimator

 ∇ˆU(θ)=−1Pr(S)∑t∈SEu|yS,θ[∇logPr(yt,ut|ut−1,θ)]∇logPr(yS|θ)−∇logPr(θ). (15)

We instead propose passing messages over a buffered subsequence for some positive buffer size , with (see Figure 2). The idea is that there exists a large enough such that . Our buffered gradient estimator sums only over , but takes expectations over instead of

 ∇~U(θ)=−1Pr(S)∑t∈SEu|yS∗,θ[∇logPr(yt,ut|ut−1,θ)]−∇logPr(θ). (16)

When this is equivalent to the naive biased estimator of Eq. (15). When this is equivalent to the unbiased estimator of Eq. (14).

The trade-off between accuracy (bias) and runtime depends on the size of the buffer and current model parameters . Intuitively, when produces pairwise marginals that are similar to i.i.d. data, we can use a small buffer . When produces strongly dependent pairwise marginals, we must use a larger buffer . In Section 4, we analyze, for a fixed value of , how quickly the bias between and decays with increasing . We show a geometric decay

 ES∥∇¯¯¯¯U−∇˜U∥2≤Cθρ−Bθ, for some Cθ>0, (17)

where is large for i.i.d. data and small for strongly dependent data. For a gradient accuracy of , we only need a logarithmic buffer size 111As . Therefore our buffered gradient estimator reduces the computation time from to . In our experiments (Section 6), we find that modest buffers are both necessary and sufficient to correct for bias.

The correction factor in Eq. (16) needs to be calculated carefully to ensure points near the beginning and end of the sequence

are properly weighted. Furthermore, an average over gradient estimates of a collection of subsequences can be used to control the variance of

. For our analysis, we consider the simple case of uniformly sampling a single sequence from separate subsequences (). Details for more complex sampling schemes are provided in the Supplement.

### 3.3 Preconditioning and Fisher Information

The desirable properties for the preconditioning matrix for SGRLD are (i) the resulting dynamics takes advantage of the geometric structure of , (ii) both and can be efficiently computed, and (iii) neither nor are numerically unstable.

The expected Fisher information is the Riemannian metric proposed in [31]

 D−1(θ)=G(θ)=Iy=Ey|θ[∇2logPr(y|θ)]. (18)

This corresponds to the Riemannian manifold over with distance measured by the symmetric Kullback-Liebler (KL) divergence between marginal likelihoods. Unfortunately for SSMs, the lack of independence in the marginal likelihood requires a double sum over to compute , which is computationally intractable for long time series. We instead replace with the complete data Fisher information

 Iu,y=Eu,y|θ[∇2logPr(y,u θ)]=|T|⋅Eu,y|θ[∇2logPr(yt,ut|ut−1,θ)]. (19)

Similar to , using corresponds to the manifold with distance measured by the symmetric KL divergence between complete-data likelihoods. Because can be calculated analytically for the SSMs we consider (Section 5), we use when possible. In our experiments, we find that in practice, using the complete data Fisher information matrix works well and outperforms vanilla SGLD.

### 3.4 Algorithm Pseudocode

Algorithms 1 and 2 summarize our generic SGMCMC method for SSMs222Python code for our method will be available at https://github.com/aicherc/sgmcmc_ssm_code.

## 4 Buffered Gradient Estimator Error Bounds

In this section, we establish a bound on the expected error between the full data gradient and our buffered gradient estimator Eq. (16). Given such a bound, we can control the overall error in our SGLD or SGRLD scheme when the SGMCMC dynamics possess a contraction property [36]. Using the triangle inequality, we decompose the error of using the unbiased gradient estimator Eq. (14)

 ES∥∇U(θ)−∇˜U(θ)∥2≤ES∥∇U(θ)−∇¯¯¯¯U(θ)∥2+ES∥∇¯¯¯¯U(θ)−∇˜U(θ)∥2, (20)

where the expectation is taken over the random subsequence . The first term is the error from subsampling and is the same error from minibatching in SGMCMC for i.i.d. data, Eq. (10[68]. The second term is the error from approximating the latent state posterior, which is unique to our buffered gradient estimator. Because error in the first term is studied in depth in the existing stochastic gradient MCMC literature [4, 11, 22, 61], we restrict our focus to bounding the second term. In particular, we will uniformly bound for any .

Our approach is to bound in terms of the Wasserstein distance between the exact posterior and our approximate posterior and then show this Wasserstein distance decays geometrically. To bound the Wasserstein distance, we follow existing work on bounding Markov processes in Wasserstein distance [24, 49, 57]. However, unlike previous work that focuses on the homogeneous Markov process of the joint model , we instead focus on the induced nonhomogeneous Markov process of the conditional model . To do so, we use the forward () and backward () random maps of  [18]

 ut∼Pr(ut|y,θ) ⇒(ft(ut),ut)∼Pr(ut+1,ut|y,θ) (21) ut∼Pr(ut|y,θ) ⇒(bt(ut),ut)∼Pr(ut−1,ut|y,θ), (22)

If and satisfy a contractive property, then we can bound the Wasserstein distance between in terms of and respectively. Bounding the error of the induced nonhomogeneous Markov process has been previously studied in the SSM literature using total variation (TV) distance [9, 17, 42, 62]. These works bound the error in total variation distance by quantifying how quickly the smoothed posterior forgets the initial condition. However, these bounds typically require stringent regularity conditions, which are hard to prove outside of finite or compact spaces. In particular, these bounds are not immediately applicable for LGSSMs. In contrast, we bound the error in Wasserstein distance by proving contraction properties of and , allowing us to extend results to continuous and mixed-type SSMs such as the LGSSM (Section 5.3.1).

Our main result is that if, for fixed , the gradient of satisfies a Lipschitz condition and the random maps all satisfy a contraction property, then the error decays geometrically in the buffer size .

###### Theorem 1.

Let and be the 1-Wasserstein distances between and at the left and right ends of respectively. Let . If the gradients of are all Lipschitz in with constant , and random maps and are all Lipschitz in with constant , then we have

 ∥∇¯¯¯¯U(θ)−∇˜U(θ)∥2≤|T|⋅LU⋅1+L1−L⋅1−LSS⋅LB⋅2ϵ1. (23)

As , Theorem 1 states that the error of the buffered gradient estimator decays geometrically as . Therefore, the required buffer size for an error tolerance of scales logarithmically as . In contrast, the error of the gradient estimator decays only linearly in the subsequence length, ; therefore much longer subsequences, , are required to reduce bias. This agrees with the intuition that the bias is dominated by the error at the endpoints of subsequence. A similar result for when the gradient of the complete data loglikelihood is Lipschitz in instead of (as needed for LGSSM) will be proved in Section 4.3.

The remainder of this section is as follows. First, in Section 4.1, we show how to bound the error in in terms of Wasserstein distances between . Second, in Section 4.2, we show these Wasserstein distances decay geometrically in . Finally, in Section 4.3, we prove our main results: Theorems 1 and 2. To keep the presentation clean, we leave proofs of Lemmas to the Supplement.

### 4.1 Functional Bound in terms of Wasserstein

We first review the definition of Wasserstein distance. Let be the -Wasserstein distance

 Wp(γ,˜γ):=[infξ∫∥u−˜u∥p2dξ(u,˜u)]1/p (24)

where is a joint measure or coupling over with marginals and . Wasserstein distance satisfies all the properties of a metric. A useful property of the -Wasserstein distance is the following Kantorovich-Rubinstein duality formula for the difference of expectations of Lipschitz functions [66]

 W1(γ,˜γ)=sup∥ψ∥Lip≤1{∫ψdγ−∫ψd˜γ}⇒|Eγ[ψ]−E˜γ[ψ]|≤∥ψ∥Lip⋅W1(γ,˜γ), (25)

where denotes the Lipchitz constant of .

We connect the error to the Wasserstein distances between , by applying this duality formula Eq. (25) to the difference of Eqs. (14) and (16)

 ∇¯¯¯¯U−∇˜U=|T|S∑t∈SEγt−1:t[∇logPr(yt,ut|ut−1,θ)]−E˜γt−1:t[∇logPr(yt,ut|ut−1,θ)]. (26)

Applying the triangle inequality gives Lemma 1.

###### Lemma 1.

If are Lipschitz in with constant ,

 ∥∇¯¯¯¯U(θ)−∇˜U(θ)∥2≤|T|S⋅LU⋅∑t∈SW1(γt−1:t,˜γt−1:t). (27)

If is not Lipschitz in , but is Lipschitz in (as in LGSSMs), then the following Lemma lets us bound the -Wasserstein distance of in terms of the -Wasserstein distance of .

###### Lemma 2.

Let be the distribution of . Let be the distribution of . Let . (Note implies .) Then,

 W1(γ′,˜γ′)≤(2√M+1)⋅max{W2(γ,˜γ)1/2,W2(γ,˜γ)}.

### 4.2 Geometric Wasserstein Decay

We first review why contractive random maps induce Wasserstein bounds. If two distributions have identically distributed random maps , that is there exists a random function satisfying

 u∼γt and u′∼γ′t⇒ft(u)∼γt+1 and ft(u)∼γ′t+1, (28)

then we can bound the Wasserstein distance of in terms of the Wasserstein distance of given a bound on the random map’s Lipschitz constant

 Wp(γt+1,γ′t+1)p =infξt+1∫∥ut+1−u′t+1∥p2dξt+1(ut+1,u′t+1) (29) ≤infξt∫∥ft(ut)−ft(u′t)∥p2dξt(ut,u′t)dft ≤infξt∫Lp⋅∥ut−u′t∥p2dξt(ut,u′t)=Lp⋅Wp(γt,γ′t)p.

Unfortunately for SSMs, Eq. (29) does not apply as the random maps of and of are not identically

distributed. To see this, we first review the conditional probability distributions used to define

. The forward random map draws from the forward smoothing kernel

 Ft(ut+1|ut):=Pr(ut+1|ut,y>t)=Pr(ut+1|ut)Pr(yt+1|ut+1)βt+1(ut+1)/βt(ut) (30)

and the backward random map draws from the backward smoothing kernel

 Bt(ut−1|ut):=Pr(ut−1|ut,y≥t)=Pr(ut|ut−1)Pr(yt|ut)αt−1(ut−1)/αt(ut). (31)

Because uses different forward and backward messages , in Eqs. (30) and (31), the kernels are not identical to (and the random maps are not identically distributed). This is unlike homogeneous Markov chains, where the kernels are identical at each time (and the random maps are identically distributed).

Instead of connecting to directly, we use the triangle inequality to connect them through an intermediate distribution

 Wp(γ,˜γ)≤Wp(γ,ˆγ)+Wp(ˆγ,˜γ). (32)

Introducing this particular intermediate distribution is the key step for our Wasserstein bounds between and . Because conditions on all after , and have identical backward messages and therefore identically distributed forward random maps . Similarly, because does not condition on before , and have identical forward messages and identically distributed backward random maps . Therefore, we can bound using and bound using with the contraction trick Eq. (29) giving us Lemma 3.

###### Lemma 3.

If there exists such that for all , and , then for all we have

 Wp(γt−1:t,ˆγt−1:t) ≤(1+Lpf)1/p⋅Wp(γt−1,ˆγt−1) (33) ≤(1+Lpf)1/p⋅LB+t−1f⋅Wp(γ−B,ˆγ−B) Wp(ˆγt−1:t,˜γt−1:t) ≤(1+Lpb)1/p⋅Wp(ˆγt,˜γt) (34) ≤(1+Lpb)1/p⋅LB+S−tb⋅Wp(ˆγS+B,˜γS+B)

We show sufficient conditions for the random maps to be contractions (i.e. ) for specific models in Section 5.1.1 (HMMs) and 5.3.1 (LGSSMs).

### 4.3 Proof of Main Theorems

Putting together the results of the previous two subsections gives us our geometric error bounds: Theorem 1 when the gradient terms are Lipschitz in and Theorem 2 when the gradient terms are Lipschitz in . Both theorems require the random maps of the forward and backward smoothing kernels are contractions. We first prove Theorem 1.

###### Proof of Theorem 1.

Combining Lemmas 1 and 3 with some algebra

 ∥∇¯¯¯¯U(θ)−∇˜U(θ)∥2 ≤|T|S⋅LU⋅∑t∈SW1(γt−1:t,˜γt−1:t) ≤|T|S⋅LU⋅S∑t=1(1+Lf)LB+t−1fϵ1+(1+Lb)LB+S−tbϵ1 ≤|T|⋅LU⋅1+L1−L⋅1−LSS⋅LB⋅2ϵ1,

where . ∎

We now prove a similar result for when is Lipschitz in .

###### Theorem 2.

Let . If the gradients are Lipschitz in with constant , and there exists for Lemma 3, then with and

 ∥∇¯¯¯¯U(θ)−∇˜U(θ)∥2≤|T|⋅LU⋅√1+L21−L1/2⋅1−LS/4S/2⋅LB/2⋅maxr∈{1/2,1}(2ϵ2)r.

Similar to Theorem 1, Theorem 2 states that the squared error of the buffered gradient estimator decays geometrically if the complete-data loglikelihood is Lipschitz in instead of . However, the price we pay is a square-root: the error decays instead of .

###### Proof of Theorem 2.

Applying Lemmas 2 and 3, we have

 ∥∇¯¯¯¯U(θ)−∇˜U(θ)∥2 ≤|T|S⋅LU⋅S∑t=1maxr∈{1/2,1}[W2(γt−1:t,ˆγt−1:t)+W2(ˆγt−1:t,˜γt−1:t)]r ≤|T|S⋅LU⋅S∑t=1maxr∈{1/2,1}[(LB+t−1+LB+S−t)√1+L2