Stochastic Gradient MCMC for Nonlinear State Space Models

01/29/2019 · Christopher Aicher, et al.

State space models (SSMs) provide a flexible framework for modeling complex time series via a latent stochastic process. Inference for nonlinear, non-Gaussian SSMs is often tackled with particle methods that do not scale well to long time series. The challenge is two-fold: not only do computations scale linearly with time, as in the linear case, but particle filters additionally suffer from increasing particle degeneracy with longer series. Stochastic gradient MCMC methods have been developed to scale inference for hidden Markov models (HMMs) and linear SSMs using buffered stochastic gradient estimates to account for temporal dependencies. We extend these stochastic gradient estimators to nonlinear SSMs using particle methods. We present error bounds that account for both buffering error and particle error in the case of nonlinear SSMs that are log-concave in the latent process. We evaluate our proposed particle buffered stochastic gradient using SGMCMC for inference on both long synthetic sequences and minute-resolution financial returns data, demonstrating the importance of this class of methods.



1 Introduction

Nonlinear state space models (SSMs) are widely used in many scientific domains for modeling time series and sequential data. For example, nonlinear SSMs can be applied in engineering (e.g. target tracking; Gordon et al., 1993), in epidemiology (e.g. compartmental disease models; Dukic et al., 2012), and to financial time series (e.g. stochastic volatility models; Shephard, 2005). To capture complex dynamical structure, nonlinear SSMs augment the observed time series with a latent state sequence, inducing a Markov chain dependence structure. Parameter inference for nonlinear SSMs requires us to handle this latent state sequence, which is typically achieved using particle filtering methods.

Particle filtering algorithms are a set of flexible Monte Carlo simulation-based methods, which use a set of samples, or particles, to approximate the posterior distribution over the latent states. Unfortunately, inference in nonlinear SSMs does not scale well to long sequences: (i) the cost of each pass through the data scales linearly with the length of the sequence, and (ii) the number of particles (and hence the computation per data point) required to control the variance of the particle filter grows with the length of the sequence.

Stochastic gradient Markov chain Monte Carlo (SGMCMC) is a popular method for scaling Bayesian inference to large data sets, which replaces full data gradients with stochastic gradient estimates based on subsets of data (Ma et al., 2015). In the context of SSMs, naive stochastic gradients are biased because subsampling breaks temporal dependencies in the data (Ma et al., 2017; Aicher et al., 2018). To correct for this, Ma et al. (2017) and Aicher et al. (2018) have developed buffered stochastic gradient estimators that control the bias. The latent state sequence is marginalized in a buffer around each subsequence, allowing fewer dependencies to be broken. However, the work so far has been limited to SSMs where analytic marginalization is possible (e.g. HMMs and linear dynamical systems).

In this work, we propose particle buffered gradient estimators that generalize the buffered gradient estimators to nonlinear SSMs. In particular, we show how buffering in nonlinear SSMs can be approximated with a modified particle filter. Beyond the usual speedup from using a subsequence rather than the full sequence, our method also reduces the number of particles required to control the variance of the particle filter. We provide an error analysis of our proposed estimators by decomposing the error into buffering error and particle filter error. We also extend the buffering error bounds of Aicher et al. (2018) to nonlinear SSMs with log-concave likelihoods, and show that the buffering error decays geometrically in the buffer size, ensuring that a small buffer can be used in practice.

This paper is organized as follows. First, we review background on particle filtering in nonlinear SSMs and SGMCMC for analytic SSMs in Section 2. We then present our particle buffered stochastic gradient estimator and its error analysis in Section 3. Finally, we test our estimator for nonlinear SSMs on both synthetic and EUR-US exchange rate data in Section 4.

2 Background

2.1 Nonlinear State Space Models for Time Series

State space models are a class of discrete-time bivariate stochastic processes consisting of a latent state process $\{x_t\}$ and a second observed process $\{y_t\}$. The evolution of the state variables is typically assumed to be a time-homogeneous Markov process, such that the latent state at time $t$, $x_t$, is determined only by the latent state at time $t-1$, $x_{t-1}$. The observed states $y_t$ are therefore conditionally independent given the latent states. Given the prior $x_0 \sim p(x_0 \mid \theta)$ and parameters $\theta$, the generative model for $\{x_t, y_t\}_{t=1}^{T}$ is thus

$x_t \mid x_{t-1} \sim p(x_t \mid x_{t-1}, \theta), \qquad y_t \mid x_t \sim p(y_t \mid x_t, \theta),$    (1)

where $p(x_t \mid x_{t-1}, \theta)$ is the transition density and $p(y_t \mid x_t, \theta)$ is the emission density.

Examples of nonlinear SSMs include the stochastic volatility model (SVM) (Shephard, 2005) and the generalized autoregressive conditional heteroskedasticity (GARCH) model (Bollerslev, 1986). For a review of applications of state space modeling, see Langrock (2011).

For an arbitrary sequence $\{a_t\}$, we use $a_{s:t}$ to denote the sequence $(a_s, \ldots, a_t)$. To infer the model parameters $\theta$, a quantity of interest is the score function, or the gradient of the marginal loglikelihood, $\nabla_\theta \log p(y_{1:T} \mid \theta)$. Using the score function, the loglikelihood can for instance be maximized iteratively via a (batch) gradient ascent algorithm (Robbins and Monro, 1951), given the observations $y_{1:T}$.

If the latent state posterior $p(x_{0:T} \mid y_{1:T}, \theta)$ can be expressed analytically, we can calculate the score using Fisher's identity (Cappé et al., 2005),

$\nabla_\theta \log p(y_{1:T} \mid \theta) = \mathbb{E}_{x_{0:T} \mid y_{1:T}, \theta}\left[ \nabla_\theta \log p(y_{1:T}, x_{0:T} \mid \theta) \right].$    (2)

However, if the latent state posterior $p(x_{0:T} \mid y_{1:T}, \theta)$ is not available in closed form, we can only approximate expectations with respect to it. One popular approach is via particle filtering methods.

2.1.1 Particle Filtering and Smoothing

Particle filtering algorithms (see e.g. Doucet and Johansen, 2009; Fearnhead and Künsch, 2018) can be used to create an empirical approximation of the expectation of a function $h$ with respect to the posterior density $p(x_{0:t} \mid y_{1:t}, \theta)$. This is done by generating a collection of $N$ random samples or particles $\{x_t^{(i)}\}_{i=1}^{N}$ and calculating their associated importance weights $\{w_t^{(i)}\}_{i=1}^{N}$, recursively over time. We update the particles and weights with sequential importance resampling (SIR) (Doucet and Johansen, 2009) in the following manner.

  1. Resample auxiliary ancestor indices $a_t^{(i)}$ with probabilities proportional to the importance weights, i.e. $\Pr(a_t^{(i)} = j) \propto w_{t-1}^{(j)}$.

  2. Propagate particles $x_t^{(i)} \sim q(x_t \mid x_{t-1}^{(a_t^{(i)})}, y_t, \theta)$, using a proposal distribution $q$.

  3. Update and normalize the weight of each particle,

    $w_t^{(i)} \propto \dfrac{p(y_t \mid x_t^{(i)}, \theta)\, p(x_t^{(i)} \mid x_{t-1}^{(a_t^{(i)})}, \theta)}{q(x_t^{(i)} \mid x_{t-1}^{(a_t^{(i)})}, y_t, \theta)}.$    (3)

The auxiliary variables $a_t^{(i)}$ represent the indices of the ancestors of the particles $x_t^{(i)}$ sampled at time $t$. The introduction of ancestor indices allows us to keep track of the lineage of particles over time (Andrieu et al., 2010). The multinomial resampling scheme given in step 1 describes the procedure by which offspring particles are produced.

Resampling at each iteration is used to mitigate against the problem of weight degeneracy. This phenomenon occurs when the variance of the importance weights grows, causing more and more particles to have negligible weight. Aside from the multinomial resampling scheme described above, there are various other resampling schemes outlined in the particle filtering literature, such as stratified sampling (Kitagawa, 1996) and residual sampling (Liu and Chen, 1998).

If the proposal density is the transition density $p(x_t \mid x_{t-1}, \theta)$, SIR is also known as the bootstrap particle filter (Gordon et al., 1993). By using the transition density for proposals, the importance weight recursion in Eq. (3) simplifies to $w_t^{(i)} \propto p(y_t \mid x_t^{(i)}, \theta)$.

When our target function decomposes into a pairwise sum $h(x_{0:T}) = \sum_{t=1}^{T} h_t(x_{t-1}, x_t)$ – such as for Fisher's identity – then we only need to keep track of the partial sums $H_t^{(i)}$ rather than the full particle paths $x_{0:t}^{(i)}$ during SIR. The complete particle filtering scheme is detailed in Algorithm 1.

1:Input: number of particles $N$, pairwise statistics $\{h_t\}$, observations $y_{1:T}$, proposal density $q$
2:Draw $x_0^{(i)} \sim p(x_0 \mid \theta)$, set $w_0^{(i)} = 1/N$, and $H_0^{(i)} = 0$.
3:for $t = 1, \ldots, T$ do
4:   Resample ancestor indices $a_t^{(i)}$ with $\Pr(a_t^{(i)} = j) \propto w_{t-1}^{(j)}$.
5:   Propagate particles $x_t^{(i)} \sim q(x_t \mid x_{t-1}^{(a_t^{(i)})}, y_t, \theta)$.
6:   Update each $w_t^{(i)}$ according to Eq. (3).
7:   Update statistics $H_t^{(i)} = H_{t-1}^{(a_t^{(i)})} + h_t(x_{t-1}^{(a_t^{(i)})}, x_t^{(i)})$.
8:end for
9:Return $\bar{H} = \sum_{i=1}^{N} w_T^{(i)} H_T^{(i)}$.
Algorithm 1 Particle Filter
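As a concrete illustration, Algorithm 1 with the bootstrap proposal can be sketched in a few lines of NumPy. The scalar linear-Gaussian model used for propagation and weighting here, and all variable names, are our own illustrative choices, not the paper's released implementation.

```python
import numpy as np

def particle_filter(y, h, N, phi, sigma, tau, rng):
    """Algorithm 1 sketch: bootstrap particle filter accumulating a
    pairwise statistic H ~ E[sum_t h(x_{t-1}, x_t) | y_{1:T}].
    Illustrative model: x_t ~ N(phi x_{t-1}, sigma^2), y_t ~ N(x_t, tau^2)."""
    T = len(y)
    x = rng.normal(0.0, sigma, size=N)   # draw x_0^{(i)} from an assumed prior
    H = np.zeros(N)                      # partial sums H_0^{(i)} = 0
    w = np.full(N, 1.0 / N)
    for t in range(T):
        # Step 4: resample ancestor indices proportional to the weights
        a = rng.choice(N, size=N, p=w)
        x_prev, H = x[a], H[a]
        # Step 5: propagate with the transition density (bootstrap proposal)
        x = phi * x_prev + sigma * rng.normal(size=N)
        # Step 6: bootstrap weights reduce to the emission density, Eq. (3)
        logw = -0.5 * (y[t] - x) ** 2 / tau ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Step 7: accumulate the pairwise statistic along each surviving path
        H = H + h(x_prev, x, t)
    return np.sum(w * H)  # Step 9: weighted average over final particles
```

For example, passing `h = lambda xp, xc, t: xp * xc` estimates the smoothed cross-moment $\mathbb{E}[\sum_t x_{t-1} x_t \mid y_{1:T}]$.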

A key challenge for particle filters is handling large $T$. Not only do long sequences require $\mathcal{O}(T)$ computation, but particle filters require a large number of particles $N$ to avoid particle degeneracy: the use of resampling in the particle filter causes path-dependence over time, depleting the number of distinct particles available overall. For Algorithm 1, the variance in $\bar{H}$ scales as $\mathcal{O}(T^2/N)$ (Poyiadjis et al., 2011). Therefore, to maintain a constant variance, the number of particles would need to increase quadratically with $T$, which is computationally infeasible for long sequences. Poyiadjis et al. (2011); Nemeth et al. (2016) and Olsson et al. (2017) propose alternatives to Step 7 of Algorithm 1 that trade additional computation or bias to decrease the variance in $\bar{H}$. Fixed-lag particle smoothers provide another approach to avoid particle degeneracy, where sample paths are not updated after a fixed lag (Kitagawa and Sato, 2001; Dahlin et al., 2015). All of these methods perform a full pass over the data $y_{1:T}$, which requires $\mathcal{O}(T)$ computation.

2.2 Stochastic Gradient MCMC

One popular method to conduct scalable Bayesian inference for large data sets is stochastic gradient Markov chain Monte Carlo (SGMCMC). Given a prior $p(\theta)$, to draw a sample from the posterior $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$, gradient-based MCMC methods simulate a stochastic differential equation (SDE) based on the gradient of the loglikelihood $\nabla_\theta \log p(y \mid \theta)$, such that the posterior is the stationary distribution of the SDE. SGMCMC methods replace the full-data gradients with stochastic gradients $\widehat{\nabla \log p}(y \mid \theta)$, using subsamples of the data to avoid costly computation.

A fundamental method within the SGMCMC family is the stochastic gradient Langevin dynamics (SGLD) algorithm (Welling and Teh, 2011):

$\theta^{(k+1)} = \theta^{(k)} + \frac{\epsilon^{(k)}}{2} \left( \nabla \log p(\theta^{(k)}) + \widehat{\nabla \log p}(y \mid \theta^{(k)}) \right) + \mathcal{N}(0, \epsilon^{(k)}),$    (4)

where $\epsilon^{(k)}$ is the stepsize. When $\widehat{\nabla \log p}$ is unbiased and with an appropriate decreasing stepsize, the distribution of $\theta^{(k)}$ asymptotically converges to the posterior distribution (Teh et al., 2016). Dalalyan and Karagulyan (2017) provide non-asymptotic bounds on the Wasserstein distance to the posterior after $K$ steps of SGLD for fixed $\epsilon$ and possibly biased $\widehat{\nabla \log p}$.
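A single SGLD update of Eq. (4) is short to state in code. The following is a generic sketch of one step; the gradient callables are placeholders for model-specific gradients, not an API from the paper.

```python
import numpy as np

def sgld_step(theta, grad_log_prior, stoch_grad, stepsize, rng):
    """One SGLD update, Eq. (4): a half-step along the (stochastic)
    gradient of the log-posterior plus injected Gaussian noise."""
    drift = 0.5 * stepsize * (grad_log_prior(theta) + stoch_grad(theta))
    noise = np.sqrt(stepsize) * rng.normal(size=theta.shape)
    return theta + drift + noise
```

With an exact gradient this reduces to the unadjusted Langevin algorithm; e.g. iterating it with `grad_log_prior = lambda t: -t` (a standard Gaussian target) produces samples whose variance is close to 1 for small stepsizes.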

Many extensions of SGLD exist in the literature, including using control variates to reduce the variance of $\widehat{\nabla \log p}$ (Nagapetyan et al., 2017; Baker et al., 2018; Chatterji et al., 2018) and augmented dynamics to improve mixing (Ma et al., 2015), such as SGHMC (Chen et al., 2014), SGNHT (Ding et al., 2014), and SGRLD (Girolami and Calderhead, 2011; Patterson and Teh, 2013).

2.2.1 Stochastic Gradients for SSMs

An additional challenge when applying SGMCMC to SSMs is handling the temporal dependence between observations. Based on a subset $\mathcal{S} \subset \{1, \ldots, T\}$ of size $S$, an unbiased stochastic gradient estimate of Eq. (2) is

$\widehat{\nabla \log p}(y \mid \theta) = \frac{T}{S} \sum_{t \in \mathcal{S}} \mathbb{E}_{x \mid y_{1:T}, \theta}\left[ \nabla_\theta \log p(y_t, x_t \mid x_{t-1}, \theta) \right].$    (5)

Although Eq. (5) requires only a sum over $S$ terms, it requires taking expectations with respect to $p(x \mid y_{1:T}, \theta)$, which requires processing the full sequence $y_{1:T}$. One approach to reduce computation is to randomly sample $\mathcal{S}$ as a contiguous subsequence and approximate Eq. (5) using only $y_{\mathcal{S}}$

$\widehat{\nabla \log p}(y \mid \theta) \approx \frac{T}{S} \sum_{t \in \mathcal{S}} \mathbb{E}_{x \mid y_{\mathcal{S}}, \theta}\left[ \nabla_\theta \log p(y_t, x_t \mid x_{t-1}, \theta) \right].$    (6)

However, Eq. (6) is biased because the expectation over the latent states is conditioned only on $y_{\mathcal{S}}$ rather than $y_{1:T}$.

To reduce the bias in stochastic gradients while avoiding accessing the full sequence, previous work on SGMCMC for SSMs proposed buffered stochastic gradients (Ma et al., 2017; Aicher et al., 2018)¹:

$\widehat{\nabla \log p}_B(y \mid \theta) = \frac{T}{S} \sum_{t \in \mathcal{S}} \mathbb{E}_{x \mid y_{\mathcal{S}^*}, \theta}\left[ \nabla_\theta \log p(y_t, x_t \mid x_{t-1}, \theta) \right],$    (7)

where $\mathcal{S}^* = \{t : \min_{s \in \mathcal{S}} |s - t| \le B\}$ is the buffered subsequence for $\mathcal{S}$ (see Figure 1). Note Eq. (5) is $\widehat{\nabla \log p}_{B=T}$ and Eq. (6) is $\widehat{\nabla \log p}_{B=0}$. As $B$ increases from $0$ to $T$, the estimator trades computation for reduced bias.

¹Previous work on inference in general SSMs has shown that the Markov chain displays a forgetting property (see Chapter 3 of Cappé et al. (2005)). Therefore, conditional on the current value of $\theta$, it is sensible to use buffering, as we expect distant time points to have negligible impact.

Figure 1: Graphical model of the buffered subsequence $\mathcal{S}^*$ for a subsequence $\mathcal{S}$ with buffer size $B$.

In particular, when the SSM model and gradient both satisfy a Lipschitz property, the bias decays geometrically in buffer size (see Theorem 4.1 of Aicher et al. (2018)). Specifically,

$\left\| \mathbb{E}\big[\widehat{\nabla \log p}_B\big] - \mathbb{E}\big[\widehat{\nabla \log p}_{B=T}\big] \right\| = \mathcal{O}(L^B),$    (8)

where $L < 1$ is a bound for the Lipschitz constants of the forward and backward smoothing kernels²

$\overrightarrow{\mathcal{P}}(x_{t+1} \mid x_t) = p(x_{t+1} \mid x_t, y_{1:T}, \theta), \qquad \overleftarrow{\mathcal{P}}(x_t \mid x_{t+1}) = p(x_t \mid x_{t+1}, y_{1:T}, \theta).$    (9)

²We follow Aicher et al. (2018) and consider the Lipschitz constant of a kernel $\mathcal{P}$ measured in terms of the 1-Wasserstein distance between $\mathcal{P}(\cdot \mid x)$ and $\mathcal{P}(\cdot \mid x')$ relative to $|x - x'|$. See the Supplement for additional details.

The bound provided in Eq. (8) ensures that only a modest buffer size is required (e.g. $B = \mathcal{O}(\log_L \epsilon)$ for an accuracy of $\epsilon$). Unfortunately, neither the buffered stochastic gradient nor the smoothing kernels have a closed form for nonlinear SSMs.
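For intuition on the bound, Eq. (8) implies that a bias tolerance of $\epsilon$ requires a buffer of only about $\log \epsilon / \log L$ time points. A quick sanity check (the values of $\epsilon$ and $L$ below are illustrative, not from the paper):

```python
import math

def buffer_size(eps, L):
    """Smallest B with L**B <= eps, from the geometric bound in Eq. (8)."""
    assert 0 < L < 1 and 0 < eps < 1
    return math.ceil(math.log(eps) / math.log(L))

# Even a tight tolerance only needs a modest buffer when L < 1.
print(buffer_size(1e-6, 0.9))
```

The required buffer grows only logarithmically as the tolerance shrinks, which is what makes small buffers viable in practice.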

3 Method

In this section, we propose a particle buffered stochastic gradient for nonlinear SSMs, by applying the particle approximations of Section 2.1 to Eq. (7). In addition, we extend the error bounds of Aicher et al. (2018) to the nonlinear SSM case, guaranteeing that the error decays geometrically in the buffer size $B$, without requiring an explicit form for the smoothing kernels. We also analyze the approximation error by decomposing it into buffering error and particle filter error.

3.1 Buffered Stochastic Gradient Estimates for Nonlinear SSMs

Let $\widehat{\nabla \log p}_{B,N}$ denote the particle approximation of $\widehat{\nabla \log p}_B$ with $N$ particles. We approximate the expectation over $p(x \mid y_{\mathcal{S}^*}, \theta)$ in Eq. (7) using Algorithm 1. In particular, the complete data loglikelihood $\log p(y_{\mathcal{S}^*}, x_{\mathcal{S}^*} \mid \theta)$ in Eq. (7) decomposes into a sum of pairwise statistics $h_t$, where

$h_t(x_{t-1}, x_t) = \begin{cases} \frac{T}{S} \nabla_\theta \log p(y_t, x_t \mid x_{t-1}, \theta), & t \in \mathcal{S}, \\ 0, & t \in \mathcal{S}^* \setminus \mathcal{S}. \end{cases}$    (10)

We highlight that the statistic $h_t$ is zero for $t$ in the left and right buffers $\mathcal{S}^* \setminus \mathcal{S}$. Although $H$ is not updated by $h_t$ for $t$ in the buffers, running the particle filter over the buffers is crucial to reduce the bias of $\widehat{\nabla \log p}_{B,N}$.

Note that $\widehat{\nabla \log p}_{B,N}$ allows us to approximate the non-analytic expectation in Eq. (7) with a modest number of particles $N$, by avoiding the particle degeneracy and full sequence runtime bottlenecks, as the particle filter is only run over $\mathcal{S}^*$, which has length $S + 2B$.
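The construction of the buffered subsequence $\mathcal{S}^*$ of Eq. (7), together with the zero-on-buffer mask of Eq. (10), can be sketched as follows; the function and variable names are ours, not taken from the released codebase.

```python
import numpy as np

def sample_buffered_subsequence(T, S, B, rng):
    """Draw a contiguous subsequence S of length S, then pad it with B
    observations on each side (clipped to the ends of the sequence) to
    form the buffered subsequence S*. The returned mask marks which
    entries of S* lie in S, i.e. where h_t of Eq. (10) is nonzero."""
    start = rng.integers(0, T - S + 1)           # uniform contiguous block
    subseq = np.arange(start, start + S)         # S
    left, right = max(0, start - B), min(T, start + S + B)
    buffered = np.arange(left, right)            # S*
    in_subseq = (buffered >= start) & (buffered < start + S)
    return subseq, buffered, in_subseq
```

A particle filter is then run over `buffered` (length at most $S + 2B$), while the gradient statistic is accumulated only where `in_subseq` is true.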

3.2 SGMCMC Algorithm

Using $\widehat{\nabla \log p}_{B,N}$ as our stochastic gradient estimate in SGLD, Eq. (4), gives us Algorithm 2.³

³Python code for Algorithm 2 and the experiments of Section 4 is available at https://github.com/aicherc/sgmcmc_ssm_code.

1:Input: data $y_{1:T}$, initial $\theta^{(0)}$, stepsizes $\epsilon^{(k)}$, subsequence size $S$, buffer size $B$, number of particles $N$
2:for $k = 0, 1, \ldots, K-1$ do
3:   Sample a contiguous subsequence $\mathcal{S} \subset \{1, \ldots, T\}$ of size $S$.
4:   Set $\mathcal{S}^* = \{t : \min_{s \in \mathcal{S}} |s - t| \le B\}$.
5:   Calculate $g^{(k)} = \widehat{\nabla \log p}_{B,N}(\theta^{(k)})$ using Algorithm 1 on Eq. (10).
6:   Set $\theta^{(k+1)} = \theta^{(k)} + \frac{\epsilon^{(k)}}{2}\left( \nabla \log p(\theta^{(k)}) + g^{(k)} \right) + \mathcal{N}(0, \epsilon^{(k)})$.
7:end for
8:Return $\theta^{(1:K)}$
Algorithm 2 Buffered PF-SGLD

Algorithm 2 can be extended and improved by (i) averaging over multiple sequences or varying the subsequence sampling method (Schmidt et al., 2015; Ou et al., 2018), (ii) using different particle filters such as those listed in Section 2.1.1, and (iii) using more advanced SGMCMC schemes such as those listed in Section 2.2.

3.3 Error Analysis

Although defining the particle variant of SGMCMC is relatively straightforward by building on Aicher et al. (2018), the error analysis presents new challenges. To analyze the error of the SGMCMC sampler, it is sufficient to bound the bias and variance of our stochastic gradient estimator relative to the exact full-data gradient (Dalalyan and Karagulyan, 2017). We link the error between the full gradient $\nabla \log p(y \mid \theta)$ and $\widehat{\nabla \log p}_{B,N}$ through the intermediate estimators $\widehat{\nabla \log p}_{B=T}$ and $\widehat{\nabla \log p}_B$,

$\nabla \log p - \widehat{\nabla \log p}_{B,N} = \underbrace{\left( \nabla \log p - \widehat{\nabla \log p}_{B=T} \right)}_{\text{(I)}} + \underbrace{\left( \widehat{\nabla \log p}_{B=T} - \widehat{\nabla \log p}_{B} \right)}_{\text{(II)}} + \underbrace{\left( \widehat{\nabla \log p}_{B} - \widehat{\nabla \log p}_{B,N} \right)}_{\text{(III)}}.$    (11)

Therefore there are three error sources to consider in Eq. (11):

  1. Subsequence Error (I): the error in approximating Fisher's identity with a stochastic subsequence. This term follows the standard stochastic gradient literature and depends on the subsequence length and on how subsequences are sampled. For a random minibatch of size $S$ sampled without replacement, the variance scales as $\mathcal{O}(1/S)$; however, for a random contiguous subsequence of size $S$, the variance scales as $\mathcal{O}(\tau/S)$, where $\tau$ is a bound on the autocorrelation between terms (see the Supplement for details).

  2. Buffering Error (II): the error in approximating the latent state posterior $p(x \mid y_{1:T}, \theta)$ with $p(x \mid y_{\mathcal{S}^*}, \theta)$. If the smoothing kernels are contractions for all $t$ (i.e. $L < 1$), then from Eq. (8) the error in this term scales as $\mathcal{O}(L^B)$ (Aicher et al., 2018). In Section 3.3.1, we show sufficient conditions for $L < 1$.

  3. Particle Error (III): the error from the particle smoother Monte Carlo approximation. This error depends on the number of particles $N$ and the length of the buffered subsequence $S + 2B$. For the particle filter, Algorithm 1, the asymptotic variance in this term scales as $\mathcal{O}((S + 2B)^2 / N)$ (Poyiadjis et al., 2011).

Which of the error terms (I)–(III) dominates depends on the regime of $(S, B, N)$. For example, increasing $B$ decreases the error in term (II), but increases the error in term (III); therefore, increasing $B$ to reduce buffering bias will not be effective if $N$ is not sufficiently large to avoid particle degeneracy.

3.3.1 Buffering Error for Nonlinear SSMs

To obtain a bound for the buffering error term (II), we require the Lipschitz constant of the smoothing kernels to be less than $1$. Typically the smoothing kernels are not available in closed form for nonlinear SSMs, and therefore directly bounding their Lipschitz constant is difficult. Instead, we show that we can bound the Lipschitz constant of the smoothing kernels in terms of the Lipschitz constant of either the prior kernels $p(x_{t+1} \mid x_t, \theta)$, or the filtered kernels

$\mathcal{P}^{\mathrm{filter}}(x_{t+1} \mid x_t) = p(x_{t+1} \mid x_t, y_{1:t+1}, \theta).$    (12)

The prior kernels are defined by the model and are therefore usually available. When the filtered kernels are also available, they can be used to obtain even tighter bounds.

We now present our results for the forward kernels $\overrightarrow{\mathcal{P}}$; similar arguments can be made for the backward kernels $\overleftarrow{\mathcal{P}}$. These results rely on the transition and emission densities being log-concave in $x$.

Theorem 1 (Lipschitz Bound for Log-Concave Models).

Assume the prior for $x_0$ is log-concave. If the transition density $p(x_{t+1} \mid x_t, \theta)$ is log-concave in $(x_{t+1}, x_t)$ and the emission density $p(y_t \mid x_t, \theta)$ is log-concave in $x_t$, then

$\mathrm{Lip}\big(\overrightarrow{\mathcal{P}}\big) \;\le\; \mathrm{Lip}\big(\mathcal{P}^{\mathrm{filter}}\big) \;\le\; \mathrm{Lip}\big(p(x_{t+1} \mid x_t, \theta)\big).$    (13)

This theorem lets us bound $L$ with the Lipschitz constant of either the prior kernels or the filtered kernels. The proof of Theorem 1 is provided in the Supplement and is based on Caffarelli's log-concave perturbation theorem (Villani, 2008; Colombo et al., 2015). Examples of SSMs that are log-concave include the LGSSM, the stochastic volatility model, and any linear SSM with log-concave transition and emission noise. Examples of SSMs that are not log-concave include the GARCH model and any linear SSM with a transition or emission noise distribution that is not log-concave (e.g. Student's $t$).
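The log-concavity condition of Theorem 1 is easy to check numerically for a given emission density. Below we verify that an SVM-style emission log-density $\log p(y \mid x) \propto -x/2 - y^2 e^{-x}/(2\tau^2)$ (with $\tau = 1$ and a fixed $y$, values we picked purely for illustration) has negative curvature in the latent state $x$ everywhere on a grid.

```python
import numpy as np

def second_diff(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

# SVM-style emission log-density in the latent state x (tau = 1, fixed y):
y_obs = 1.5
log_emit = lambda x: -0.5 * x - 0.5 * y_obs ** 2 * np.exp(-x)

xs = np.linspace(-5.0, 5.0, 101)
curv = [second_diff(log_emit, x) for x in xs]
# Negative curvature everywhere on the grid => log-concave in x there.
print(max(curv))
```

The analytic second derivative is $-y^2 e^{-x}/2 < 0$, so the numeric check agrees with the claim that the SVM emission is log-concave in $x$.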

4 Experiments

We first introduce three models: (i) the linear Gaussian SSM (LGSSM), a case where analytic buffering is possible, to assess the impact of the particle filter; (ii) the SVM, where the emissions are non-Gaussian; and (iii) a GARCH model, where the latent transitions are nonlinear. We then empirically test the gradient error of our particle buffered gradient estimator on synthetic data for fixed $\theta$. Finally, we evaluate the performance of our proposed SGLD algorithm (Algorithm 2) on both real and synthetic data.

4.1 Models

4.1.1 Linear Gaussian SSM

The linear Gaussian SSM (LGSSM) is

$x_t = \phi\, x_{t-1} + \nu_t, \quad \nu_t \sim \mathcal{N}(0, \sigma^2),$    (14)
$y_t = x_t + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \tau^2),$    (15)

with $|\phi| < 1$ and parameters $\theta = (\phi, \sigma, \tau)$.

The transition and emission distributions are both Gaussian and log-concave in $(x_t, x_{t-1})$, allowing Theorem 1 to apply. In the Supplement, we show that the filtered kernels of the LGSSM are Lipschitz with constant $L = |\phi|\, \tau^2 / (\sigma^2 + \tau^2)$. Thus, the buffering error decays geometrically with increasing buffer size when $L < 1$. This linear model serves as a useful baseline, since the various terms in Eq. (11) can be calculated analytically.
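To experiment with this baseline, the LGSSM of Eqs. (14)–(15) can be simulated directly. The generator below is our own sketch (parameter values in usage are illustrative); for $|\phi| < 1$ the latent process is a stationary AR(1) with variance $\sigma^2/(1 - \phi^2)$.

```python
import numpy as np

def simulate_lgssm(T, phi, sigma, tau, rng):
    """Generate (x, y) from the LGSSM of Eqs. (14)-(15):
    x_t = phi x_{t-1} + sigma nu_t,  y_t = x_t + tau eta_t."""
    x = np.empty(T)
    y = np.empty(T)
    x_prev = rng.normal(0.0, sigma / np.sqrt(1.0 - phi ** 2))  # stationary init
    for t in range(T):
        x[t] = phi * x_prev + sigma * rng.normal()
        y[t] = x[t] + tau * rng.normal()
        x_prev = x[t]
    return x, y
```

Because the model is linear Gaussian, the same simulated data can be fed to both the particle filter and an exact Kalman filter, which is what makes the $N = \infty$ comparison below possible.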

4.1.2 Stochastic Volatility Model

The stochastic volatility model (SVM) is

$x_t = \phi\, x_{t-1} + \nu_t, \quad \nu_t \sim \mathcal{N}(0, \sigma^2),$    (16)
$y_t = \tau \exp(x_t / 2)\, \eta_t, \quad \eta_t \sim \mathcal{N}(0, 1),$    (17)

with parameters $\theta = (\phi, \sigma, \tau)$.

For the SVM, the transition and emission distributions are log-concave in $x$, allowing Theorem 1 to apply. In the Supplement, we show that the prior kernels of the SVM are Lipschitz with constant $L = |\phi|$. Thus, the buffering error decays geometrically with increasing buffer size when $|\phi| < 1$.

4.1.3 GARCH Model

We finally consider a GARCH(1,1) model (with observation noise)

$\sigma_t^2 = \alpha + \beta\, x_{t-1}^2 + \gamma\, \sigma_{t-1}^2,$    (18)
$x_t \mid \sigma_t^2 \sim \mathcal{N}(0, \sigma_t^2),$    (19)
$y_t \mid x_t \sim \mathcal{N}(x_t, \tau^2),$    (20)

with parameters $\theta = (\alpha, \beta, \gamma, \tau)$. Unlike the LGSSM and SVM, the noise between $x_{t-1}$ and $x_t$ is multiplicative rather than additive. This model is not log-concave, and therefore our theory (Theorem 1) does not hold. However, we see empirically that buffering can help reduce the gradient error for the GARCH model in the experiments below and in the Supplement.
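The multiplicative state noise is easy to see in simulation. A sketch of the generative model in Eqs. (18)–(20) follows; the parameter values used below are illustrative, chosen only so the variance recursion is stationary ($\beta + \gamma < 1$).

```python
import numpy as np

def simulate_garch(T, alpha, beta, gamma, tau, rng):
    """Generate (x, y) from the GARCH(1,1) SSM of Eqs. (18)-(20). The
    variance recursion sigma2_t = alpha + beta x_{t-1}^2 + gamma sigma2_{t-1}
    makes the state noise multiplicative rather than additive."""
    x = np.empty(T)
    y = np.empty(T)
    sigma2 = alpha / (1.0 - beta - gamma)  # stationary variance as initial value
    x_prev = 0.0
    for t in range(T):
        sigma2 = alpha + beta * x_prev ** 2 + gamma * sigma2
        x[t] = np.sqrt(sigma2) * rng.normal()
        y[t] = x[t] + tau * rng.normal()
        x_prev = x[t]
    return x, y
```

When $\beta + \gamma < 1$, the stationary second moment of the latent process is $\alpha / (1 - \beta - \gamma)$, which gives a quick check on the simulator.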

4.2 Stochastic Gradient Error

We compare the error of stochastic gradient estimates using a buffered subsequence of size $S$, while varying the buffer size $B$ and the number of particles $N$, on synthetic data from each model. We generated a synthetic sequence of length $T$ from each of the three models at fixed data-generating parameter values.

Figure 2 displays the mean squared error (MSE) between our particle buffered stochastic gradient $\widehat{\nabla \log p}_{B,N}$ and the exact gradient, averaged over 100,000 replications. We evaluate the gradients at $\theta$ equal to the data-generating parameters, varying the buffer size $B$ and the number of particles $N$. For the LGSSM, we also consider $N = \infty$, calculating the buffered gradient $\widehat{\nabla \log p}_B$ exactly using the Kalman filter, which is tractable in the linear setting. The exact gradient is likewise computed with the Kalman filter for the LGSSM; for the SVM and the GARCH model, we instead use a particle filter estimate with a large number of particles and a large buffer, which we assume is sufficient for a highly accurate approximation.

Figure 2 demonstrates the trade-off between the buffering error (II) and the particle error (III) from Section 3.3. For all $N$, when $B$ is small, the buffering error (II) dominates, and therefore the MSE decays exponentially as $B$ increases. However for finite $N$, the particle error (III) dominates for larger values of $B$. In fact, the MSE slightly increases due to particle degeneracy, as the buffered length $S + 2B$ increases with $B$. For $N = \infty$ in the LGSSM case, we see that the error continues to decrease exponentially for large $B$, as there is no particle filter error when using the Kalman filter.

Figure 2 also shows that buffering cannot be ignored in these three example models: there is high MSE for $B = 0$. In general, buffering has diminishing returns once $B$ is large enough that the particle error, which grows with $S + 2B$, dominates the remaining buffering bias.

Figure 2: Buffered stochastic gradient estimate error plots: (left) LGSSM, (middle) SVM, (right) GARCH.

4.3 SGLD Experiments

Having examined the stochastic gradient error, we now consider using our stochastic gradient estimators in SGLD.

4.3.1 SGLD Evaluation Methods

We assess the performance of our samplers given a fixed computation budget, by measuring both the heldout and predictive loglikelihoods on a test sequence $\tilde{y}_{1:\tilde{T}}$. Given a sampled parameter value $\theta$, the heldout loglikelihood is

$\log p(\tilde{y}_{1:\tilde{T}} \mid \theta) = \sum_{t=1}^{\tilde{T}} \log p(\tilde{y}_t \mid \tilde{y}_{1:t-1}, \theta),$    (21)

and the $k$-step ahead predictive loglikelihood is

$\frac{1}{\tilde{T}} \sum_{t=1}^{\tilde{T}-k} \log p(\tilde{y}_{t+k} \mid \tilde{y}_{1:t}, \theta),$    (22)

where the predictive densities are obtained from the particle filter on the test sequence. For synthetic data, we also measure the mean squared error (MSE) of the posterior sample average $\bar{\theta}$ to the true parameters $\theta^*$.

We measure the sample quality of our MCMC chains using the kernel Stein discrepancy (KSD) given equal computation time (Gorham and Mackey, 2017; Liu et al., 2016). We choose to use KSD rather than classic MCMC diagnostics such as effective sample size (ESS) (Gelman et al., 2013), because KSD penalizes the bias present in our MCMC chains. Given a sample chain (after burn-in and thinning) $\theta^{(1)}, \ldots, \theta^{(K)}$, let $\mu$ be the empirical distribution of the samples. Then the KSD between $\mu$ and the posterior distribution $p(\theta \mid y)$ is

$\mathrm{KSD}(\mu) = \sqrt{\frac{1}{K^2} \sum_{k=1}^{K} \sum_{k'=1}^{K} k_0\big(\theta^{(k)}, \theta^{(k')}\big)},$    (23)

where

$k_0(\theta, \theta') = \nabla_\theta \cdot \nabla_{\theta'} k(\theta, \theta') + \nabla_\theta k(\theta, \theta') \cdot \nabla_{\theta'} \log p(\theta' \mid y) + \nabla_{\theta'} k(\theta, \theta') \cdot \nabla_\theta \log p(\theta \mid y) + k(\theta, \theta')\, \nabla_\theta \log p(\theta \mid y) \cdot \nabla_{\theta'} \log p(\theta' \mid y),$    (24)

and $k(\cdot, \cdot)$ is a valid kernel function. Following Gorham and Mackey (2017), we use the inverse multiquadric kernel $k(\theta, \theta') = (1 + \|\theta - \theta'\|_2^2)^{-1/2}$ in our experiments. Since Eq. (24) requires full gradient evaluations of $\nabla \log p(\theta \mid y)$ that are computationally intractable, we replace these terms with corresponding stochastic estimates using $\widehat{\nabla \log p}_{B,N}$.
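For a scalar parameter, the KSD of Eqs. (23)–(24) with the inverse multiquadric base kernel reduces to a short computation. The implementation below is our own sketch (with $c = 1$, $\beta = -1/2$), taking a score-function callable in place of the stochastic gradient estimates used in the paper.

```python
import numpy as np

def imq_stein_kernel(th1, th2, g1, g2, c=1.0, beta=-0.5):
    """Stein kernel k_0 of Eq. (24) for the inverse multiquadric base
    kernel k(x, y) = (c^2 + (x - y)^2)^beta, for scalar parameters.
    g1, g2 are score evaluations grad log p(th1 | y), grad log p(th2 | y)."""
    d = th1 - th2
    r2 = c ** 2 + d ** 2
    k = r2 ** beta
    dk1 = 2.0 * beta * d * r2 ** (beta - 1.0)    # d k / d th1
    dk2 = -dk1                                    # d k / d th2
    # cross second derivative d^2 k / (d th1 d th2)
    d2k = -2.0 * beta * r2 ** (beta - 1.0) \
          - 4.0 * beta * (beta - 1.0) * d ** 2 * r2 ** (beta - 2.0)
    return d2k + dk1 * g2 + dk2 * g1 + k * g1 * g2

def ksd(samples, score):
    """Kernel Stein discrepancy, Eq. (23), of an empirical sample."""
    g = score(samples)
    K0 = imq_stein_kernel(samples[:, None], samples[None, :],
                          g[:, None], g[None, :])
    return float(np.sqrt(K0.mean()))
```

As a sanity check, samples drawn from a standard Gaussian give a small KSD when paired with the correct score $\nabla \log p(\theta) = -\theta$, and a larger KSD when the score is deliberately mis-specified.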

4.3.2 SGLD on Synthetic LGSSM Data

To assess the effect of using particle filters with buffered stochastic gradients, we first focus on SGLD applied to synthetic LGSSM data, where calculating the buffered gradient exactly (via the Kalman filter) is possible. We generate training sequences of two lengths, one short and one long, along with test sequences, using the same parametrization as in Section 4.2.

We consider three pairs of gradient estimators: Full ($S = T$), Buffered ($S = 40$, $B = 10$) and No Buffer ($S = 40$, $B = 0$), each computed both approximately with the particle filter and exactly with the Kalman filter ($N = \infty$). To select the stepsize, we performed a grid search and selected the method with smallest KSD to the posterior on the training set. We present the KSD results (for the best stepsize) in Table 1 and trace plots of the metrics in Figure 3.


Figure 3: SGLD on synthetic LGSSM data: (top) short training sequence, (bottom) long training sequence; (left) heldout loglikelihood, (right) MSE of the estimated posterior mean of $\theta$ to the true $\theta^*$.

From Figure 3, we see that the methods without buffering ($B = 0$) have lower heldout loglikelihoods on the test sequence and higher MSE, as they are biased. We also see that the full sequence methods ($S = T$) perform poorly for large $T$.

The KSD results further support this story. Table 1 presents the mean and standard deviation of our estimated KSD for $\theta$. Tables of the marginal KSD for individual components of $\theta$ can be found in the Supplement. The methods without buffering have larger KSD, as the inherent bias of $B = 0$ leads to an incorrect stationary distribution. The full sequence methods perform poorly for large $T$ because of the limited number of samples that can be computed in a fixed runtime.


S | B | KSD (short $T$) | KSD (long $T$)
$T$ | – | 0.85 (0.08) | 4.92 (0.40)
$T$ | – | 0.64 (0.17) | 4.85 (0.36)
40 | 0 | 1.58 (0.03) | 4.68 (0.10)
40 | 0 | 1.55 (0.03) | 4.68 (0.11)
40 | 10 | 0.68 (0.25) | 3.43 (0.19)
40 | 10 | 0.61 (0.21) | 3.25 (0.29)
Table 1: KSD for synthetic LGSSM. Mean and SD; each $(S, B)$ setting appears twice, once per gradient variant (particle filter and exact Kalman filter).

In the Supplement, we present similar results for SGLD on synthetic SVM and GARCH data, as well as results for SGLD on the LGSSM in higher dimensions. As is typical in the particle filtering literature, performance degrades with increasing dimension for a fixed number of particles $N$.

4.3.3 SGLD on Exchange Rate Log-Returns

We now consider fitting the SVM and the GARCH model to EUR-US exchange rate data at the minute resolution from November 2017 to October 2018. The data consist of 350,000 observations of demeaned log-returns. As the market is closed during non-business hours, we further break the data into 53 weekly segments of roughly 7,000 observations each. In our model, we assume independence between weekly segments and divide the data into a training set of the first 45 weeks and a test set of the last 8 weeks. Full processing details and example plots are in the Supplement. Note that our method (Algorithm 2) easily scales to the unsegmented series; however, the abrupt changes between the starts of weeks are not adequately modeled by Eqs. (16)-(17).

We fit both the SVM and the GARCH model using SGLD with four different gradient methods: (i) Full, the full gradient over all segments in the training set; (ii) Weekly, a stochastic gradient over a randomly selected segment in the training set; (iii) No Buffer, a stochastic gradient over a randomly selected subsequence of length $S$; and (iv) Buffer, our buffered stochastic gradient for a subsequence of length $S$ with buffer size $B$. To estimate the stochastic gradients, we use Algorithm 1 with $N$ particles. To select the stepsize parameter, we performed a grid search and selected the method with smallest KSD. We present the KSD results in Table 2. Figure 4 shows trace plots of the heldout and predictive loglikelihoods for the four different SGLD methods, each averaged over 5 chains.


Figure 4: SGLD trace plots on exchange rate data: (top) SVM, (bottom) GARCH; (left) heldout loglikelihood, (right) 3-step ahead predictive loglikelihood.

KSD
Method SVM GARCH
Full 4.03 (0.14) 2.84 (0.30)
Weekly 3.87 (0.08) 2.81 (0.21)
No Buffer 4.48 (0.01) 2.09 (0.09)
Buffer 3.56 (0.08) 2.19 (0.05)
Table 2: KSD for SGLD on exchange rate data. Mean and SD over 5 chains each.

For the SVM, we see that buffering improves performance on both heldout and predictive loglikelihoods, Figure 4 (top), and also leads to more accurate MCMC samples, Table 2 (left column). In particular, the samples from SGLD without buffering have a smaller $\phi$ and a larger $\sigma$, indicating that its posterior is (inaccurately) centered around an SVM with larger latent state noise. We also again see that the full sequence and weekly segment methods perform poorly due to the limited number of samples that can be computed in a fixed runtime.

For the GARCH model, Figure 4 (bottom) and Table 2 (right column), we see that the subsequence methods outperform the full sequence methods, but, unlike in the SVM, buffering does not help with inference on the GARCH data. This is because the GARCH model that we recover on the exchange rate data (for all gradient methods) is close to white noise. The model therefore believes the observations are close to independent, hence no buffer is necessary. Although buffering performs worse on a runtime scale here, it leads to a more accurate posterior estimate (less bias) in all settings.

5 Discussion

In this work, we developed particle buffered stochastic gradient estimators for nonlinear SSMs. Our key contributions are (i) combining buffered stochastic gradient MCMC with particle filtering for nonlinear SSMs (Algorithm 2), (ii) decomposing the error of our proposed gradient estimator into parts due to buffering and particle filtering, and (iii) generalizing the geometric decay error bound for buffering to nonlinear SSMs with log-concave likelihoods (Theorem 1). We evaluated our proposed gradient estimator with SGLD for three models (LGSSM, SVM, GARCH) on both synthetic data and EUR-US exchange rate data. We find that our stochastic gradient methods (Algorithm 2) are able to outperform batch methods on long sequences.

Possible future extensions of this work include relaxing the log-concave restriction of Theorem 1, extending Algorithm 2 as discussed at the end of Section 3.2, and applying our particle buffered stochastic gradient estimates to applications beyond SGMCMC, such as optimization in variational autoencoders for sequential data (Maddison et al., 2017; Naesseth et al., 2018).

Acknowledgements

We would like to thank Nicholas Foti for helpful discussions. This work was supported in part by ONR Grants N00014-15-1-2380 and N00014-18-1-2862, NSF CAREER Award IIS-1350133, and EPSRC Grants EP/L015692/1, EP/S00159X/1, EP/R01860X/1, EP/R018561/1 and EP/R034710/1.

References

  • Aicher et al. [2018] Christopher Aicher, Yi-An Ma, Nicholas J. Foti, and Emily B. Fox. Stochastic Gradient MCMC for State Space Models. arXiv preprint arXiv:1810.09098, 2018.
  • Andrieu et al. [2010] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
  • Baker et al. [2018] Jack Baker, Paul Fearnhead, Emily B Fox, and Christopher Nemeth. Control variates for stochastic gradient MCMC. Statistics and Computing, Aug 2018. ISSN 1573-1375.
  • Bollerslev [1986] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327, 1986. ISSN 0304-4076.
  • Cappé et al. [2005] Olivier Cappé, Eric Moulines, and Tobias Rydén. Inference in Hidden Markov Models. Springer, 2005.
  • Chatterji et al. [2018] Niladri S Chatterji, Nicolas Flammarion, Yi-An Ma, Peter L Bartlett, and Michael I Jordan. On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo. In Proceedings of the 35th International Conference on Machine Learning, pages 764–773, 2018.
  • Chen et al. [2014] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
  • Colombo et al. [2015] Maria Colombo, Alessio Figalli, and Yash Jhaveri. Lipschitz changes of variables between perturbations of log-concave measures. arXiv preprint arXiv:1510.03687, 2015.
  • Dahlin et al. [2015] Johan Dahlin, Fredrik Lindsten, and Thomas B Schön. Particle Metropolis–Hastings using gradient and Hessian information. Statistics and Computing, 25(1):81–92, 2015.
  • Dalalyan and Karagulyan [2017] Arnak S Dalalyan and Avetik G Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.
  • Ding et al. [2014] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.
  • Doucet and Johansen [2009] Arnaud Doucet and Adam M Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656-704):3, 2009.
  • Dukic et al. [2012] Vanja Dukic, Hedibert F Lopes, and Nicholas G Polson. Tracking epidemics with Google flu trends data and a state-space SEIR model. Journal of the American Statistical Association, 107(500):1410–1426, 2012.
  • Fearnhead and Künsch [2018] Paul Fearnhead and Hans R. Künsch. Particle Filters and Data Assimilation. Annual Review of Statistics and Its Application, 5(1):421–449, 2018.
  • Gelman et al. [2013] Andrew Gelman, John B Carlin, Donald B Rubin, Aki Vehtari, David B Dunson, and Hal S Stern. Bayesian Data Analysis. CRC Press, 2013.
  • Girolami and Calderhead [2011] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
  • Gordon et al. [1993] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F - Radar and Signal Processing, 140(2):107–113, 1993.
  • Gorham and Mackey [2017] Jackson Gorham and Lester Mackey. Measuring Sample Quality with Kernels. arXiv preprint arXiv:1703.01717, 2017.
  • Kastner [2016] Gregor Kastner. Dealing with stochastic volatility in time series using the R package stochvol. Journal of Statistical Software, 69(5):1–30, 2016. doi: 10.18637/jss.v069.i05.
  • Kitagawa [1996] Genshiro Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25, 1996.
  • Kitagawa and Sato [2001] Genshiro Kitagawa and Seisho Sato. Monte Carlo smoothing and self-organising state-space model. In Sequential Monte Carlo Methods in Practice, pages 177–195. Springer, 2001.
  • Langrock [2011] Roland Langrock. Some applications of nonlinear and non-Gaussian state–space modelling by means of hidden Markov models. Journal of Applied Statistics, 38(12):2955–2970, 2011.
  • Liu and Chen [1998] Jun S. Liu and Rong Chen. Sequential Monte Carlo methods for Dynamic Systems. Journal of the American Statistical Association, 93(443):1032–1044, 1998.
  • Liu et al. [2016] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284, 2016.
  • Ma et al. [2015] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
  • Ma et al. [2017] Yi-An Ma, Nicholas J Foti, and Emily B Fox. Stochastic Gradient MCMC Methods for Hidden Markov Models. In International Conference on Machine Learning, pages 2265–2274, 2017.
  • Maddison et al. [2017] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6573–6583, 2017.
  • Naesseth et al. [2018] Christian Naesseth, Scott Linderman, Rajesh Ranganath, and David Blei. Variational Sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 968–977, 2018.
  • Nagapetyan et al. [2017] Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian J Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.
  • Nemeth et al. [2016] Christopher Nemeth, Paul Fearnhead, and Lyudmila Mihaylova. Particle approximations of the score and observed information matrix for parameter estimation in state–space models with linear computational cost. Journal of Computational and Graphical Statistics, 25(4):1138–1157, 2016.
  • Olsson et al. [2017] Jimmy Olsson, Johan Westerborn, et al. Efficient particle-based online smoothing in general hidden Markov models: the PaRIS algorithm. Bernoulli, 23(3):1951–1996, 2017.
  • Ou et al. [2018] Rihui Ou, Alexander L Young, and David B Dunson. Clustering-Enhanced Stochastic Gradient MCMC for Hidden Markov Models with Rare States. arXiv preprint arXiv:1810.13431, 2018.
  • Patterson and Teh [2013] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.
  • Poyiadjis et al. [2011] George Poyiadjis, Arnaud Doucet, and Sumeetpal S Singh. Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 98(1):65–80, 2011.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 09 1951.
  • Saumard and Wellner [2014] Adrien Saumard and Jon A Wellner. Log-concavity and strong log-concavity: a review. Statistics surveys, 8:45, 2014.
  • Schmidt et al. [2015] Mark Schmidt, Reza Babanezhad, Mohamed Ahmed, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. In Artificial Intelligence and Statistics, pages 819–828, 2015.
  • Shephard [2005] Neil Shephard. Stochastic Volatility: Selected Readings. Oxford University Press, 2005.
  • Teh et al. [2016] Yee Whye Teh, Alexandre H Thiery, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):193–225, 2016.
  • Villani [2008] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Welling and Teh [2011] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

Appendix A Error Analysis Supplement

A.1 Stochastic Subsequence Error

We are interested in the (mean-squared) error between the full gradient $g(\theta) = \sum_{t=1}^{T} g_t(\theta)$ and the unbiased stochastic gradient estimate $\hat{g}(\theta)$, specifically for the case of randomly sampling a contiguous subsequence $\mathcal{S}$. Because $\hat{g}(\theta)$ is unbiased, this reduces to calculating the variance of $\hat{g}(\theta)$ with respect to the sampling distribution of the subsequence $\mathcal{S}$. For simplicity, we consider the 1-D case and assume the subsequence length $S$ divides $T$.

Let $\mathcal{S} = \{s+1, \ldots, s+S\}$ and $\hat{g}(\theta) = \frac{T}{S} \sum_{t \in \mathcal{S}} g_t(\theta)$, thus

$\mathbb{E}_{\mathcal{S}}\left[ \hat{g}(\theta) \right] = g(\theta)$ (A.1)
$\mathbb{E}_{\mathcal{S}}\left[ | \hat{g}(\theta) - g(\theta) |^2 \right] = \mathrm{Var}_{\mathcal{S}}\left( \hat{g}(\theta) \right)$ (A.2)

Consider the uniform random variable $s$ over the block starting offsets $\{0, S, 2S, \ldots, T-S\}$. To bound the variance of $\hat{g}(\theta)$, we additionally assume $\mathrm{Corr}(g_t, g_{t'}) \leq \rho^{|t-t'|}$ for some $\rho < 1$, that is, the correlation between gradient terms decays with the time between them. This assumption is reasonable when both the observations and posterior latent states are ergodic (i.e. exhibit an exponential forgetting property) [Cappé et al., 2005]. Let $\sigma^2$ be the variance of the individual terms $g_t$ and recall the following covariance formula $\mathrm{Var}\left( \sum_i X_i \right) = \sum_{i,j} \mathrm{Cov}(X_i, X_j)$. Then we have

$\mathrm{Var}_{\mathcal{S}}\left( \hat{g}(\theta) \right) = \frac{T^2}{S^2} \, \mathrm{Var}\left( \sum_{t \in \mathcal{S}} g_t \right)$ (A.3)
$= \frac{T^2}{S^2} \sum_{t \in \mathcal{S}} \sum_{t' \in \mathcal{S}} \mathrm{Cov}(g_t, g_{t'})$ (A.4)
$\leq \frac{T^2}{S^2} \, \sigma^2 \sum_{t \in \mathcal{S}} \sum_{t' \in \mathcal{S}} \rho^{|t-t'|}$ (A.5)
$\leq \frac{T^2}{S^2} \, \sigma^2 \cdot S \cdot \frac{1+\rho}{1-\rho} = \frac{T^2 \sigma^2}{S} \cdot \frac{1+\rho}{1-\rho}$ (A.6)

Note that without the decaying correlation assumption (i.e. if $\rho = 1$), the double sum in (A.5) is $S^2$ rather than $O(S)$: there is no decay in the covariance terms, and thus the variance of $\hat{g}(\theta)$ does not necessarily decay with increasing subsequence length $S$.
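To illustrate this bound numerically (an informal check, not part of the original analysis), the following Python sketch simulates the gradient terms as a stationary AR(1) sequence, whose correlations decay geometrically as assumed above, and compares the empirical mean-squared error of the block-subsequence estimator with the $1/S$ rate of the bound in (A.6). The AR(1) surrogate and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma, rho = 10_000, 1.0, 0.9

# Surrogate gradient terms g_1, ..., g_T: a stationary AR(1) sequence with
# Var(g_t) = sigma^2 and Corr(g_t, g_t') = rho ** |t - t'| (geometric decay).
g = np.empty(T)
g[0] = rng.normal(0.0, sigma)
for t in range(1, T):
    g[t] = rho * g[t - 1] + rng.normal(0.0, sigma * np.sqrt(1.0 - rho**2))

full_grad = g.sum()

def block_estimates(g, S):
    """Stochastic gradient estimates (T/S) * sum(g[s:s+S]), one per block."""
    T = len(g)
    return np.array([(T / S) * g[s:s + S].sum() for s in range(0, T, S)])

mse = {}
for S in (10, 100, 1000):
    est = block_estimates(g, S)
    mse[S] = np.mean((est - full_grad) ** 2)            # variance over blocks
    bound = (T**2 * sigma**2 / S) * (1 + rho) / (1 - rho)  # rate in (A.6)
    print(f"S={S:5d}  empirical MSE={mse[S]:.3e}  bound={bound:.3e}")
```

Averaging the block estimates recovers the full gradient exactly, and the empirical mean-squared error shrinks as the subsequence length $S$ grows, consistent with the $1/S$ rate of the bound.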

A.2 Proof of Theorem 1

Theorem 1 states that if the prior distribution for the initial latent state, the transition distribution, and the emission distribution are log-concave, then we can bound the Lipschitz constant of the latent state smoothing kernel in terms of the log-concavity constants of the transition and emission distributions.

We first briefly review Wasserstein distance, random mappings, and Lipschitz constants of kernels [Aicher et al., 2018, Villani, 2008]. Then we review Caffarelli’s log-concave perturbation theorem, the main tool we use in our proof. Finally, we present the proof in Section A.2.3.

A.2.1 Wasserstein Distance and Random Mappings

The $p$-Wasserstein distance between measures $\mu$ and $\nu$ with respect to Euclidean distance is

$\mathcal{W}_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int \| x - y \|_2^p \, d\gamma(x, y) \right)^{1/p}$ (A.7)

where $\gamma \in \Gamma(\mu, \nu)$ is a joint measure or coupling over $(x, y)$ with marginals $\mu$ and $\nu$.
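As a brief aside (not from the original text): in one dimension the optimal coupling in the Wasserstein definition is the monotone (quantile) coupling, so the empirical $p$-Wasserstein distance between two equal-size samples can be computed by sorting. A minimal sketch, with illustrative parameter values:

```python
import numpy as np

def wasserstein_p_1d(xs, ys, p=2):
    """Empirical p-Wasserstein distance between equal-size 1-D samples.

    In one dimension, the infimum over couplings is attained by the
    monotone (quantile) coupling, so it suffices to sort both samples
    and pair up order statistics.
    """
    xs, ys = np.sort(xs), np.sort(ys)
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
n = 100_000
mu_samples = rng.normal(0.0, 1.0, n)   # mu = N(0, 1)
nu_samples = rng.normal(2.0, 1.0, n)   # nu = N(2, 1)

# Translating a measure by c costs exactly c under every W_p, so the
# distance between these two Gaussians should be close to 2.
w = wasserstein_p_1d(mu_samples, nu_samples, p=2)
print(w)
```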

To bound the Wasserstein distance, we must first introduce the concept of a random mapping associated with a transition kernel.

Let $\mathcal{K}(x, dx')$ be a transition kernel for random variables $X$ and $X'$; then for any measure $\mu$ over $X$, we define the induced measure over $X'$ as $(\mu \mathcal{K})(dx') = \int \mathcal{K}(x, dx') \, \mu(dx)$.

A random mapping $\psi$ is a random function that maps $X$ to $X'$ such that if $X \sim \mu$ then $\psi(X) \sim \mu \mathcal{K}$. For example, if $\mathcal{K}(x, dx') = \mathcal{N}(dx' ; x, \sigma^2)$, then a random mapping for $\mathcal{K}$ is the identity function plus Gaussian noise, $\psi(x) = x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Note that if $\psi$ is deterministic, $\mu \mathcal{K}$ is the push-forward measure of $\mu$ through the mapping $\psi$; otherwise it is the average (or marginal) over realizations of $\psi$ of push-forward measures [Villani, 2008].
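The Gaussian-noise example above can be checked numerically. The following sketch (illustrative only, with assumed parameter values) draws $X \sim \mu$, applies the random mapping $\psi(x) = x + \epsilon$, and confirms that $\psi(X)$ has the moments of $\mu \mathcal{K}$, the convolution of $\mu$ with the noise distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5                      # noise scale of the kernel (assumed value)
n = 200_000

x = rng.normal(1.0, 0.3, n)      # X ~ mu, here mu = N(1, 0.3^2)

def psi(x):
    """Random mapping for the Gaussian kernel: identity plus noise."""
    return x + rng.normal(0.0, sigma, size=np.shape(x))

x_prime = psi(x)                 # psi(X) ~ mu K

# mu K is the convolution of mu with N(0, sigma^2), i.e. N(1, 0.3^2 + 0.5^2).
print(x_prime.mean(), x_prime.var())
```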

We say the kernel $\mathcal{K}$ (and random mapping $\psi$) has Lipschitz constant $L$ with respect to Euclidean distance if

$\left( \mathbb{E}_{\psi} \left[ \| \psi(x) - \psi(x') \|_2^p \right] \right)^{1/p} \leq L \cdot \| x - x' \|_2 \quad \text{for all } x, x'$ (A.8)

Note that $L$ is an upper bound on the expected value of Lipschitz constants for random instances of $\psi$.

These definitions are useful for proving bounds in Wasserstein distance. For example, we can show the kernel $\mathcal{K}$ induces a contraction in $p$-Wasserstein distance if $L < 1$; that is, $\mathcal{W}_p(\mu \mathcal{K}, \nu \mathcal{K}) \leq L \cdot \mathcal{W}_p(\mu, \nu)$.

Proof.
Let $(X, Y)$ be a coupling of $(\mu, \nu)$ attaining the infimum in (A.7). Since $(\psi(X), \psi(Y))$ is a coupling of $(\mu \mathcal{K}, \nu \mathcal{K})$,

$\mathcal{W}_p(\mu \mathcal{K}, \nu \mathcal{K})^p \leq \mathbb{E}\left[ \| \psi(X) - \psi(Y) \|_2^p \right]$ (A.9)