Spectral Subsampling MCMC for Stationary Time Series

Robert Salomone, et al. · 10/30/2019

Bayesian inference using Markov Chain Monte Carlo (MCMC) on large datasets has developed rapidly in recent years. However, the underlying methods are generally limited to relatively simple settings where the data have specific forms of independence. We propose a novel technique for speeding up MCMC for time series data by efficient data subsampling in the frequency domain. We demonstrate a speedup of two orders of magnitude while incurring negligible bias compared to MCMC on the full dataset. We also propose alternative control variates for variance reduction based on data grouping and coreset constructions.


1 Introduction

Bayesian inference has gained widespread use in Statistics and Machine Learning largely due to convenient and quite generally applicable Markov Chain Monte Carlo (MCMC) and Hamiltonian Monte Carlo (HMC) algorithms that simulate from the posterior distribution of the model parameters.

However, it is now increasingly common for datasets to contain millions or even billions of observations. This is particularly true for temporal data recorded by sensors at increasingly faster sampling rates. MCMC is often too slow for such big data problems and practitioners are replacing MCMC with more scalable approximate methods such as Variational Inference (Blei et al., 2017), Approximate Bayesian Computation (Marin et al., 2012) and Integrated Nested Laplace Approximation (Rue et al., 2009).

Another strand of the literature proposes methods that speed up MCMC and HMC by data subsampling, where the costly likelihood evaluation in each MCMC iteration is replaced by an estimate from a subsample of data observations (Quiroz et al., 2018b, 2019a; Dang et al., 2019) or by a weighted coreset of data points found by optimization (Campbell and Broderick, 2018; Campbell and Beronov, 2019; Campbell and Broderick, 2019). Data subsampling methods require that the log-likelihood is a sum, where each term depends on a unique piece of data — a condition satisfied for conditionally independent observations or for conditionally independent subjects in longitudinal data (with potentially dependent data within each subject) — but does not hold for general time series problems.

Our paper extends the applicability of previously proposed subsampling methods to stationary time series. The method is based on using the Fast Fourier Transform (FFT) to evaluate the likelihood function in the frequency domain for the periodogram data. The advantage of working in the frequency domain is that under quite general conditions the periodogram observations are known to be asymptotically independent and exponentially distributed with scale equal to the spectral density. The logarithm of this so-called Whittle likelihood approximation of the likelihood is therefore a sum even when the data are dependent in the time domain. The asymptotic nature of the Whittle likelihood makes it especially suitable here since subsampling tends to be used for large-scale problems where the Whittle likelihood is expected to be accurate. Moreover, our algorithm can also be used with recently proposed refinements of the Whittle likelihood (Sykulski et al., 2019) which give better likelihood approximations with smaller datasets.

It is by now well established that Subsampling MCMC methods require likelihood estimators with low variance to be efficient (Quiroz et al., 2019a, b). Variance reduction is typically achieved by using control variates that approximate the individual log-likelihood terms. The second contribution here is to propose two alternative schemes that may be more robust than the original control variates for Subsampling MCMC.

The structure of our paper is as follows. Section 2 introduces the necessary frequency domain concepts and defines the Whittle likelihood. Section 3 gives an overview of the Subsampling MCMC approach of Quiroz et al. (2019a) and its two crucial aspects — variance reduction of the log-likelihood estimator via control variates, and efficient sampling via block pseudo-marginal MCMC. Section 4 introduces our novel control variate schemes. Section 5 summarizes the results of experiments on two examples of models that have previously not been feasible with large data methods — an Autoregressive Moving Average model and an extension involving fractional integration. Section 6 concludes by discussing several possible extensions.

2 Data Subsampling Using the Whittle Likelihood

2.1 Discrete Fourier Transformed Data

Let $x_1, \dots, x_n$ be a covariance stationary zero-mean time series with covariance function $\gamma_{\theta}(h) = \mathbb{E}(x_t x_{t+h})$ for $h \in \mathbb{Z}$, where $\theta$ is the vector of model parameters. The spectral density is the Fourier transform of $\gamma_{\theta}$ (Lindgren, 2012)

$$f_{\theta}(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \gamma_{\theta}(h)\, e^{-i\omega h}, \qquad (1)$$

where $\omega \in [-\pi, \pi]$ is called the angular frequency. The Discrete Fourier Transform (DFT) of $x_1, \dots, x_n$ is the complex valued series

$$J(\omega_k) = \frac{1}{\sqrt{2\pi n}} \sum_{t=1}^{n} x_t\, e^{-i\omega_k t}, \qquad (2)$$

for $\omega_k$ in the set of Fourier frequencies $\{\omega_k = 2\pi k / n : k = -\lceil n/2 \rceil + 1, \dots, \lfloor n/2 \rfloor\}$. The DFT is efficiently computed by the Fast Fourier Transform (FFT). The periodogram $I(\omega_k) = |J(\omega_k)|^2$ is an asymptotically unbiased estimate of $f_{\theta}(\omega_k)$.

2.2 Frequency Domain Asymptotics

The DFT in (2) is a linear transformation that acts like a weighted average of time-domain data. A central limit theorem can therefore be used to prove that the $J(\omega_k)$ are asymptotically complex Gaussian under quite general conditions (Shao et al., 2007, Corollary 2.1). Moreover, the real and imaginary parts are asymptotically independent and the scaled periodogram ordinate $2 I(\omega_k)/f_{\theta}(\omega_k)$ is asymptotically distributed as $\chi^2_2$, or equivalently, $I(\omega_k)/f_{\theta}(\omega_k)$ is asymptotically a standard exponential variable. The exceptions are at the frequencies $\omega = 0$ and $\omega = \pm\pi$, where $I(\omega)/f_{\theta}(\omega) \sim \chi^2_1$ asymptotically. The orthogonality of the Fourier bases at the Fourier frequencies also means that $\mathrm{Cov}\big(I(\omega_j), I(\omega_k)\big) \to 0$ for any two distinct Fourier frequencies $\omega_j \neq \omega_k$. Hence, under mild conditions we have the following asymptotic distribution of the periodogram (Shao et al., 2007, Corollary 2.1)

$$I(\omega_k) \sim \mathrm{Exp}\big(f_{\theta}(\omega_k)\big) \qquad (3)$$

independently as $n \to \infty$, with the exponential distribution in the scale parameterization, i.e., parameterized by its mean. The asymptotic distribution of the periodogram ordinates in (3) motivates the Whittle log-likelihood (Whittle, 1953)

$$\ell_W(\theta) = -\sum_{k} \left( \log f_{\theta}(\omega_k) + \frac{I(\omega_k)}{f_{\theta}(\omega_k)} \right). \qquad (4)$$

For real-valued data, both $I(\omega)$ and $f_{\theta}(\omega)$ are symmetric about the origin, so the Whittle log-likelihood can be evaluated by summing only over the positive frequencies, multiplying each term by two. Also, for demeaned data, the term for $\omega = 0$ can be removed.

For the purposes of enabling scalable Bayesian inference methods, the Whittle log-likelihood has several desirable properties:

  • The periodogram does not depend on the parameter vector $\theta$ and can therefore be computed before the MCMC at a cost of $O(n \log n)$ via the Fast Fourier Transform algorithm. After this one-time cost, likelihood evaluations have the same cost as for independent data.

  • The Whittle log-likelihood is a sum in the frequency domain and is therefore amenable to subsampling using the same algorithms developed for independent data in the time domain.

  • As the Whittle log-likelihood relies on large sample properties of the periodogram, it is suited to big data situations where Subsampling MCMC and related methods are used.

In the sequel, the term log-likelihood refers to the Whittle log-likelihood, and we take $n$ to be the number of unique summands in (4) — i.e., we assume the DFT has been performed and we are working in the frequency domain.
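To make this concrete, the following minimal Python sketch computes the periodogram once with the FFT and then evaluates the Whittle log-likelihood (4) over the positive frequencies. It is only an illustration under our own naming and an AR(1) spectral density chosen as a simple example; it is not the code used for the experiments.

```python
import numpy as np

def periodogram(x):
    """One-time O(n log n) step: periodogram at the positive Fourier frequencies."""
    n = len(x)
    J = np.fft.rfft(x) / np.sqrt(2 * np.pi * n)   # DFT scaled as in (2)
    I = np.abs(J) ** 2                            # periodogram ordinates
    omega = 2 * np.pi * np.arange(len(I)) / n     # Fourier frequencies in [0, pi]
    return omega[1:], I[1:]                       # drop omega = 0 (demeaned data)

def ar1_spectral_density(omega, phi, sigma2):
    """Spectral density of an AR(1) process, used here purely as an illustration."""
    return sigma2 / (2 * np.pi * np.abs(1 - phi * np.exp(-1j * omega)) ** 2)

def whittle_loglik(omega, I, f):
    """Whittle log-likelihood (4), summing over positive frequencies with weight 2."""
    fw = f(omega)
    return -2.0 * np.sum(np.log(fw) + I / fw)

# Usage: simulate an AR(1) series and evaluate its Whittle log-likelihood.
rng = np.random.default_rng(0)
n, phi, sigma2 = 10_000, 0.7, 1.0
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal(scale=np.sqrt(sigma2))
omega, I = periodogram(x - x.mean())
print(whittle_loglik(omega, I, lambda w: ar1_spectral_density(w, phi, sigma2)))
```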

3 Subsampling MCMC

3.1 MCMC with an estimated likelihood

Let $\pi(\theta) \propto p(y \mid \theta)\, p(\theta)$ denote the posterior distribution from a sample $y = (y_1, \dots, y_n)$ of $n$ observations with likelihood function $p(y \mid \theta)$. MCMC and HMC algorithms sample iteratively from $\pi(\theta)$ by proposing a parameter vector $\theta'$ at the $j$th iteration and accepting it with probability

$$\alpha = \min\left\{1, \; \frac{p(y \mid \theta')\, p(\theta')\, q(\theta_{j-1} \mid \theta')}{p(y \mid \theta_{j-1})\, p(\theta_{j-1})\, q(\theta' \mid \theta_{j-1})}\right\}, \qquad (5)$$

where $q$ is the proposal distribution. Repeated evaluations of the likelihood in the acceptance probability are costly when $n$ is large. Quiroz et al. (2019a) propose speeding up MCMC for large $n$ by replacing $p(y \mid \theta)$ with an estimate $\hat{p}_m(y \mid \theta)$ based on a small random subsample of $m \ll n$ observations, where $u$ indexes the selected observations. Their algorithm samples $\theta$ and $u$ jointly from an extended target distribution $\bar{\pi}(\theta, u)$. Andrieu et al. (2009) prove that such pseudo-marginal MCMC algorithms sample from the full-data posterior if the likelihood estimator is unbiased; i.e., $\mathbb{E}_u[\hat{p}_m(y \mid \theta)] = p(y \mid \theta)$. Quiroz et al. (2019a) use an unbiased estimator of the log-likelihood and subsequently debias to estimate the full-data likelihood. Although the debiasing approach in general cannot remove all bias, their pseudo-marginal sampler is still a valid MCMC algorithm, targeting a slightly perturbed posterior which is shown to be close to the true posterior in total variation distance. See Quiroz et al. (2018b) for an alternative completely unbiased likelihood estimator and Dang et al. (2019) for an HMC extension.
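The following Python sketch shows the shape of one such MH step with an estimated likelihood; the key point, which the code makes explicit, is that the estimate at the current draw is carried over rather than recomputed. The function names and arguments are ours, not from the papers cited above.

```python
import numpy as np

def pseudo_marginal_step(theta, loglike_hat_current, estimate_loglike, log_prior,
                         propose, log_q, rng):
    """One pseudo-marginal MH step as in (5), with the exact log-likelihood replaced
    by an estimate. estimate_loglike(theta) is assumed to draw a fresh subsample
    internally; log_q(x, y) is assumed to return log q(x | y)."""
    theta_prop = propose(theta, rng)
    loglike_hat_prop = estimate_loglike(theta_prop)      # e.g. a subsampling estimator
    log_alpha = (loglike_hat_prop + log_prior(theta_prop) + log_q(theta, theta_prop)
                 - loglike_hat_current - log_prior(theta) - log_q(theta_prop, theta))
    if np.log(rng.uniform()) < log_alpha:
        return theta_prop, loglike_hat_prop              # accept
    return theta, loglike_hat_current                    # reject: keep the old estimate
```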

3.2 Estimators based on control variates

Assume that the log-likelihood decomposes as a sum $\ell(\theta) = \sum_{k=1}^{n} \ell_k(\theta)$; either by assuming independent data or by using the Whittle likelihood in the frequency domain for temporally dependent data. A naive estimator of the log-likelihood from a subsample $u = (u_1, \dots, u_m)$ drawn uniformly with replacement is

$$\hat{\ell}(\theta) = \frac{n}{m} \sum_{i=1}^{m} \ell_{u_i}(\theta),$$

where $u_i \in \{1, \dots, n\}$ and we suppress dependence on $u$ in the notation for $\hat{\ell}$ for notational clarity. This estimator typically has large variance and is prone to occasional gross overestimates of the likelihood, causing the MCMC sampler to become stuck for extended periods and thus become very inefficient.

Quiroz et al. (2019a) propose using control variates $q_k(\theta)$ to reduce the variance in the so-called difference estimator

$$\hat{\ell}(\theta) = \sum_{k=1}^{n} q_k(\theta) + \frac{n}{m} \sum_{i=1}^{m} d_{u_i}(\theta), \qquad (6)$$

where $d_k(\theta) = \ell_k(\theta) - q_k(\theta)$. The $q_k(\theta)$ is the control variate for the $k$th observation. It is evident from (6) that the variance of $\hat{\ell}(\theta)$ is small when the $q_k(\theta)$ approximate the $\ell_k(\theta)$ well. Quiroz et al. (2019a) follow Bardenet et al. (2017) and propose using a second order Taylor expansion of $\ell_k(\theta)$ around some central value $\theta^*$ as the control variate:

$$q_k(\theta) = \ell_k(\theta^*) + \nabla_\theta \ell_k(\theta^*)^{\top} (\theta - \theta^*) + \tfrac{1}{2} (\theta - \theta^*)^{\top} \nabla_\theta^2 \ell_k(\theta^*) (\theta - \theta^*).$$

One advantage of this control variate is that the otherwise $O(n)$ term $\sum_{k=1}^{n} q_k(\theta)$ can be computed at $O(1)$ cost per iteration after a single $O(n)$ precomputation at $\theta^*$; see Bardenet et al. (2017).
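A compact Python sketch of the difference estimator (6) with second order Taylor control variates follows. The arrays ell_star, grad_star and hess_star hold the per-term log-likelihoods and their derivatives evaluated once at theta_star, and ell_terms is assumed to return the exact log-likelihood terms for a given set of indices; all names are ours.

```python
import numpy as np

def difference_estimator(theta, theta_star, ell_terms, ell_star, grad_star, hess_star,
                         m, rng):
    """Difference estimator (6): full sum of Taylor control variates plus a scaled
    subsample of the residuals d_k = ell_k - q_k."""
    n, delta = len(ell_star), theta - theta_star
    # Full sum of the control variates; in practice the three sums over k are
    # precomputed once at theta_star, making this an O(1) step per iteration.
    q_sum = (ell_star.sum() + grad_star.sum(axis=0) @ delta
             + 0.5 * delta @ hess_star.sum(axis=0) @ delta)
    u = rng.integers(0, n, size=m)                        # subsample with replacement
    q_u = (ell_star[u] + grad_star[u] @ delta
           + 0.5 * np.einsum("kij,i,j->k", hess_star[u], delta, delta))
    d_u = ell_terms(theta, u) - q_u                       # residuals on the subsample
    return q_sum + (n / m) * d_u.sum()
```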

3.3 Block Pseudo-Marginal Sampling

The acceptance probability in (5) reveals that it is actually the variability of the ratio of estimates at the proposed and current draw that matters for MCMC efficiency (Deligiannidis et al., 2018). Tran et al. (2016) propose a blocked pseudo-marginal scheme for subsampling that partitions the subsample indicators $u$ into $G$ blocks and only updates one of the blocks in each MCMC iteration. Under simplifying assumptions, Lemma 2 in Tran et al. (2016) shows that blocking induces a controllable correlation between subsequent estimates in the MCMC of the simple form $\rho = 1 - 1/G$.
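A minimal sketch of the block update, with names of our own choosing: only one of the G blocks of subsample indices is refreshed per iteration, which is what induces the correlation above.

```python
import numpy as np

def update_block(u, G, n, rng):
    """Refresh one of the G blocks of the subsample indices u (0-based indices into
    the n log-likelihood terms), leaving the remaining blocks, and hence most of the
    estimator's randomness, unchanged."""
    m = len(u)
    block = rng.integers(G)                        # pick one block uniformly
    idx = np.arange(block, m, G)                   # positions belonging to that block
    u = u.copy()
    u[idx] = rng.integers(0, n, size=len(idx))     # resample only that block
    return u
```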

4 Alternative Control Variates via Grouping

The control variates presented in Section 3.2 are only good approximations if the individual log-likelihood terms $\ell_k(\theta)$ are approximately quadratic in $\theta$ or if $\theta - \theta^*$ is sufficiently small. We propose two new control variates that may be preferable when this is not the case.

4.1 Grouped Quadratic Control Variates

Rather than sampling individual observations, we can sample observations in groups. The advantage of sampling groups is that the quadratic control variates are expected to be more accurate for the group as a whole compared to individual observations. The reason is that the Bernstein-von Mises theorem (asymptotic normality of the posterior) suggests an approximately quadratic log-likelihood for the group provided the number of observations in the group is large enough; see Tamaki (2008) for a Bernstein-von Mises theorem specifically for the Whittle likelihood.

Let $\{\mathcal{G}_1, \dots, \mathcal{G}_J\}$ be a partition of the set of indices $\{1, \dots, n\}$ into $J$ groups; i.e. $\bigcup_{j=1}^{J} \mathcal{G}_j = \{1, \dots, n\}$ with $\mathcal{G}_j \cap \mathcal{G}_{j'} = \emptyset$ for $j \neq j'$, where $\mathcal{G}_j$ is the set of data indices associated with the $j$th group. Similarly, write

$$\ell_{(j)}(\theta) = \sum_{k \in \mathcal{G}_j} \ell_k(\theta)$$

for the sum of log-likelihood terms corresponding to the observations in the $j$th group, noting that $\ell(\theta) = \sum_{j=1}^{J} \ell_{(j)}(\theta)$. Since $\ell_{(j)}(\theta)$ is based on $|\mathcal{G}_j|$ observations we expect it to be closer to a quadratic function than the $\ell_k(\theta)$ belonging to the individual samples in the group. Now, define the control variate for group $j$ as

$$q_{(j)}(\theta) = \ell_{(j)}(\theta^*) + \nabla_\theta \ell_{(j)}(\theta^*)^{\top} (\theta - \theta^*) + \tfrac{1}{2} (\theta - \theta^*)^{\top} \nabla_\theta^2 \ell_{(j)}(\theta^*) (\theta - \theta^*),$$

where the same $\theta^*$ is used for all groups. The grouped difference estimator for a sample of $\nu$ groups, with indices $v_1, \dots, v_\nu$ drawn uniformly with replacement from $\{1, \dots, J\}$, is then

$$\hat{\ell}(\theta) = \sum_{j=1}^{J} q_{(j)}(\theta) + \frac{J}{\nu} \sum_{i=1}^{\nu} d_{(v_i)}(\theta), \qquad (7)$$

where $d_{(j)}(\theta) = \ell_{(j)}(\theta) - q_{(j)}(\theta)$.
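The grouped estimator (7) has the same shape as the sketch above, only at the level of groups; the following fragment, again with our own naming, makes this explicit. group_loglik(theta, g) is assumed to return the sum of the Whittle terms with indices in g, and ell_star, grad_star, hess_star now hold one row per group, evaluated at theta_star.

```python
import numpy as np

def grouped_difference_estimator(theta, theta_star, group_loglik, ell_star, grad_star,
                                 hess_star, groups, nu, rng):
    """Grouped difference estimator (7): Taylor control variates built for whole
    groups of frequencies rather than for individual periodogram ordinates."""
    J, delta = len(groups), theta - theta_star
    q = ell_star + grad_star @ delta + 0.5 * np.einsum("jab,a,b->j", hess_star, delta, delta)
    v = rng.integers(0, J, size=nu)                       # sample nu groups
    d_v = np.array([group_loglik(theta, groups[j]) for j in v]) - q[v]
    return q.sum() + (J / nu) * d_v.sum()
```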

4.2 Grouped Coreset Control Variates

When the grouped log-likelihoods are far from quadratic, we propose an alternative method using Bayesian Coresets (Huggins et al., 2016) to construct control variates. An advantage of this approach is that it is unnecessary to select a central point $\theta^*$ or to rely on a quadratic expansion, which may be unsuitable.

Bayesian Coresets replace the true log-likelihood $\ell(\theta) = \sum_{k=1}^{n} \ell_k(\theta)$ with the approximation $\ell(w, \theta) = \sum_{k=1}^{n} w_k \ell_k(\theta)$, where $w$ is a sparse vector with a number of non-zero elements that is much less than $n$. Let $\hat{\pi}$ denote some weighting distribution that has the same support as the posterior and can easily be sampled from — for example a Gaussian based on a Laplace approximation, or the empirical distribution of samples from an MCMC on a smaller data set. The log-likelihood approximation is constructed using a greedy algorithm, which, after $M$ steps, provides an approximate solution to

$$\min_{w} \; \mathbb{E}_{\hat{\pi}}\!\left[\big(\ell(\theta) - \ell(w, \theta)\big)^2\right]$$

subject to the constraints that $w_k \geq 0$ for $k = 1, \dots, n$, and $\|w\|_0 \leq M$, where $M$ is a user-specified number of iterations in the coreset optimization procedure. We use the Greedy Iterative Geodesic Ascent (GIGA) method in Campbell and Broderick (2018) for tackling the optimization.

We propose approximating the group log-likelihoods $\ell_{(j)}(\theta)$ by a separate coreset approximation for each group. We can then use the grouped difference estimator in (7) with the coreset approximations as control variates for each respective group. The coreset control variates are attractive as by design they approximate $\ell_{(j)}(\theta)$ well for each group. While the construction of coreset control variates requires $J$ runs of the coreset procedure, each run is only on a single group of the dataset, so the overall effort is roughly that of the standard coreset approach, or less if run in parallel.
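As a rough illustration of the idea only, the toy sketch below builds a nonnegative sparse weighting of a group's log-likelihood terms by least squares over draws from a weighting distribution; the paper uses GIGA for this step, so this stand-in and all of its names are our own simplification.

```python
import numpy as np
from scipy.optimize import nnls

def toy_group_coreset(ell_terms_at_samples, coreset_size, rng):
    """Toy stand-in for a per-group coreset control variate. ell_terms_at_samples has
    shape (S, n_g): ell_k(theta_s) for S draws theta_s from the weighting distribution
    and the n_g log-likelihood terms in one group. Returns sparse weights w so that
    sum_k w_k ell_k(theta) can serve as that group's control variate."""
    S, n_g = ell_terms_at_samples.shape
    total = ell_terms_at_samples.sum(axis=1)                  # group log-likelihood per draw
    idx = rng.choice(n_g, size=coreset_size, replace=False)   # random candidate support
    w_sub, _ = nnls(ell_terms_at_samples[:, idx], total)      # nonnegative least squares fit
    w = np.zeros(n_g)
    w[idx] = w_sub
    return w
```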

5 Experiments

5.1 Settings and Performance Measures

There are many ways to partition the dataset into groups for the control variates. We use the same number of observations for each group. The $j$th group is chosen by starting with the $j$th lowest frequency and then systematically sampling every $J$th frequency after that. This way we ensure that each group contains periodogram ordinates across the entire frequency range. The homogeneity of the groups makes it possible to use the same $\theta^*$ for all groups.
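A small sketch of that systematic grouping (hypothetical names):

```python
import numpy as np

def systematic_groups(n_freq, n_groups):
    """Group j takes frequencies j, j + n_groups, j + 2*n_groups, ..., so every
    group spans the whole frequency range (frequencies assumed sorted)."""
    return [np.arange(j, n_freq, n_groups) for j in range(n_groups)]

# e.g. systematic_groups(10, 3) -> [array([0, 3, 6, 9]), array([1, 4, 7]), array([2, 5, 8])]
```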

For the choice of weighting function $\hat{\pi}$ in the coreset approximation, we use the Laplace approximation of the posterior, truncated to the region of admissible parameters. Each coreset is fitted using the GIGA algorithm for a fixed number of iterations, using random projections; see Campbell and Broderick (2018) for details. We note that the mode used in the Laplace approximation comes with no extra cost compared to full-data MCMC, as the latter uses the mode as a starting value for the sampler and to build the covariance matrix of the random walk Metropolis proposal. Likewise, the Taylor control variates are constructed using the mode as $\theta^*$. Therefore, this part of the start-up cost is assumed to be the same for all algorithms. However, both control variates have additional start-up costs compared to full-data MCMC. The coreset control variate needs to perform the GIGA optimization, which makes one sweep of the data per iteration and hence costs on the order of the number of iterations times $n$ density evaluations. As discussed above, this can in practice be done in parallel over the $J$ groups, where each group only uses its own $n/J$ observations. The Taylor control variate requires summing all the $q_k$ once (first term in (6)), hence adding $n$ density evaluations to the total cost. For simplicity, assume all groups have the same number of observations $n/J$. During run time, full-data MCMC requires $n$ density evaluations in each iteration, whereas the Taylor control variate requires only the subsampled terms, and the coreset control variate requires the subsampled terms plus the evaluation of all non-zero coreset weights, where the second term is the cost of evaluating the summation of all group coreset approximations; both are much faster than full-data MCMC if the subsample and coreset sizes are small in relation to $n$.

We follow Quiroz et al. (2019a) and use the computational time (CT) as our measure of performance. This measure balances the cost (number of density evaluations as discussed above) and the efficiency of the Markov chain. It is defined as

$$\mathrm{CT} = \mathrm{IF} \times (\text{number of density evaluations per iteration}),$$

where the inefficiency factor (IF) is proportional to the asymptotic variance when estimating a posterior mean based on MCMC output. The IF is interpreted as the number of (correlated) samples needed to obtain the equivalent of a single independent sample. It is convenient to measure the cost using density evaluations since it makes the comparisons independent of the implementation. We use the CODA package (Plummer et al., 2006) in R to estimate the IF. Our measure of interest is the relative CT (RCT), which we define as the ratio between the CT of full-data MCMC and that of the subsampling algorithm of interest. Hence values larger than 1 mean that the subsampling algorithm is more efficient when balancing computing cost (density evaluations) and statistical efficiency (variance of the posterior mean estimator).
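In code form, the ratio reduces to the following (a sketch with our own names):

```python
def relative_computational_time(if_full, evals_full, if_sub, evals_sub):
    """RCT = CT_full / CT_subsampling with CT = IF * (density evaluations per
    iteration); values above 1 favour the subsampling algorithm."""
    return (if_full * evals_full) / (if_sub * evals_sub)
```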

The following two examples use parametric models traditionally considered in the time domain. However, in a Bayesian setting the Whittle likelihood has been used for both semi-parametric and non-parametric spectral density estimation, see for example Carter and Kohn (1997), Choudhuri et al. (2004), and Edwards et al. (2019).

5.2 ARMA

The first example considers the Autoregressive Moving Average (ARMA) model, a classic parsimonious model for univariate time series.

For non-negative integers $p$ and $q$, define the lag polynomials $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$ and $\psi(B) = 1 + \psi_1 B + \cdots + \psi_q B^q$, where $B$ is the lag operator, that is, $B x_t = x_{t-1}$. The ARMA($p$, $q$) model is

$$\phi(B)\, x_t = \psi(B)\, \epsilon_t,$$

where $\{\epsilon_t\}$ is white noise with variance $\sigma^2$. The process is stationary if and only if the roots of $\phi(z)$ lie outside of the unit circle in the complex plane, and in this case has spectral density (see for example Brockwell and Davis, 2016, Ch. 4)

$$f(\omega) = \frac{\sigma^2}{2\pi} \, \frac{|\psi(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2}.$$
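For reference, a direct evaluation of this spectral density, in the same hypothetical Python notation as the Section 2 sketch, is short; it can be passed straight to the whittle_loglik function given there.

```python
import numpy as np

def arma_spectral_density(omega, phi, psi, sigma2):
    """Spectral density of an ARMA(p, q) process with AR coefficients phi,
    MA coefficients psi and innovation variance sigma2."""
    z = np.exp(-1j * omega)
    ar = 1.0 - sum(c * z ** (k + 1) for k, c in enumerate(phi))
    ma = 1.0 + sum(c * z ** (k + 1) for k, c in enumerate(psi))
    return sigma2 / (2 * np.pi) * np.abs(ma) ** 2 / np.abs(ar) ** 2
```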

As an alternative to performing MCMC under the somewhat complicated non-linear constraint on the autoregressive parameters, we reparameterize the autoregressive parameters in terms of their partial autocorrelations as in Barndorff-Nielsen and Schou (1973), which we denote by $r_1, \dots, r_p$, so that $|r_i| < 1$ for $i = 1, \dots, p$ implies stationarity. We apply the same reparameterization to the moving average parameters to obtain $\tilde{r}_1, \dots, \tilde{r}_q$, which ensures the ARMA process is invertible provided that the constraint $|\tilde{r}_i| < 1$ for $i = 1, \dots, q$ is satisfied.
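The mapping from partial autocorrelations back to AR coefficients is the Durbin-Levinson-type recursion underlying the Barndorff-Nielsen and Schou (1973) parameterization; a small sketch (our own naming) is given below, and the same mapping is applied on the moving average side.

```python
import numpy as np

def pacf_to_ar(r):
    """Map partial autocorrelations r_1, ..., r_p (each in (-1, 1)) to the AR
    coefficients of a stationary AR(p) polynomial via the Durbin-Levinson recursion."""
    phi = np.array([])
    for k, r_k in enumerate(r, start=1):
        phi_new = np.empty(k)
        phi_new[:k - 1] = phi - r_k * phi[::-1]   # update the lower-order coefficients
        phi_new[k - 1] = r_k                      # the new highest-order coefficient
        phi = phi_new
    return phi
```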

We use hourly temperature data for the city of Vancouver during the years 2012 to 2017, sourced from openweathermap.org. The original series is first decomposed and its trend and yearly seasonal effect are removed using the stl function in R. After this, we confirm that the series passes the Augmented Dickey-Fuller and Phillips-Perron unit-root tests for stationarity. The final series yields a Whittle likelihood with one summand per positive Fourier frequency.

We employ the auto.arima function in the forecast package (Hyndman and Khandakar, 2008) in R, which uses a stepwise procedure to choose the order of the autoregressive and moving average parameters, and fit an ARMA model of the selected order.

In addition to the partial autocorrelation reparameterization defined above, we apply a log-transformation to $\sigma^2$. We impose priors directly on the transformed parameters.

Figure 1 shows the kernel density estimates obtained from the MCMC samples using full-data MCMC and Spectral Subsampling MCMC with the grouped Taylor control variate. The bias due to subsampling is negligible.

Figure 1: Kernel Density Estimates of some marginal distributions for the ARMA example — note that the incurred bias from subsampling is negligible. Results for the other marginal distributions are similar.

Figure 2 reports the relative computational time for all six parameters, showing that our method achieves close to a two orders of magnitude speedup on this example.

Figure 2: Relative computational time for the ARMA example (larger is better). Results are relative to full-data MCMC.

Figure 3 plots the posterior mean of the spectral density obtained by Spectral Subsampling MCMC, overlaid on the periodogram. We note that the plot is indistinguishable from the corresponding plot based on the samples from full-data MCMC (not shown here).

Figure 3: Periodogram and Spectral Subsampling MCMC posterior mean spectral density for the ARMA example.

5.3 Fractionally Integrated Time Series

The second example fits an Autoregressive Fractionally Integrated Moving Average (ARFIMA) model, a generalization of the ARMA model considered in the previous example, using simulated data.

Define the fractional differencing operator (see for example, Granger and Joyeux, 2008)

$$(1 - B)^d = \sum_{k=0}^{\infty} \binom{d}{k} (-B)^k,$$

where $d$ is called the fractional integration parameter. The ARFIMA($p$, $d$, $q$) process is defined as

$$\phi(B)\, (1 - B)^d x_t = \psi(B)\, \epsilon_t.$$

Analogously to the ARMA process, the ARFIMA process is stationary for any $d \in (-\tfrac{1}{2}, \tfrac{1}{2})$ if and only if the roots of $\phi(z)$ lie outside of the unit circle in the complex plane. The spectral density of the process then exists, and is given by (Brockwell and Davis, 2016, Ch. 7)

$$f(\omega) = \frac{\sigma^2}{2\pi} \, \frac{|\psi(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2} \, |1 - e^{-i\omega}|^{-2d}. \qquad (8)$$

The ARFIMA process is of particular interest as it possesses long-memory, i.e., $\sum_{h=-\infty}^{\infty} |\gamma(h)| = \infty$, for $d \in (0, \tfrac{1}{2})$.
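Evaluating (8) only requires multiplying the ARMA spectral density by the long-memory factor; a self-contained sketch in the notation of the earlier examples (names ours) is:

```python
import numpy as np

def arfima_spectral_density(omega, phi, psi, d, sigma2):
    """ARFIMA(p, d, q) spectral density (8): the ARMA density times |1 - e^{-i w}|^{-2d}.
    Intended for the positive Fourier frequencies, where the factor is finite."""
    z = np.exp(-1j * omega)
    ar = 1.0 - sum(c * z ** (k + 1) for k, c in enumerate(phi))
    ma = 1.0 + sum(c * z ** (k + 1) for k, c in enumerate(psi))
    return sigma2 / (2 * np.pi) * np.abs(ma) ** 2 / np.abs(ar) ** 2 * np.abs(1.0 - z) ** (-2.0 * d)
```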

To test our proposed methodology, we employ two simulated ARFIMA(2, 0.4, 1) time series generated using the ARTFIMA package (Sabzikar et al., 2019) in R, both using the same autoregressive, moving average, and innovation variance parameters. Figure 4 illustrates the qualitative difference between an ARMA and an ARFIMA model, showing both models simulated with the same autoregressive and moving average parameters.

We use the same priors for the AR, MA, and variance parameters as in the last example. For the fractional integration parameter $d$, we apply a scaled Fisher transformation and place a prior on the transformed parameter, which implies a weakly informative prior on $d$ over its admissible region.

Figure 4: The long-memory property induced by fractional integration results in the ARFIMA process having more pronounced trends.

Series 1 — Fewer Frequencies

The first series is the shorter of the two. We use this example to compare the posterior distribution of the full-data MCMC to that of the subsampling algorithms with the different control variates. Our aim is to test whether the subsampling approximation accurately recovers a distribution close to the full-data Whittle posterior. Figure 5 confirms the accuracy. Note that the true parameters are recovered.

Figure 6 shows the RCT, illustrating that the coreset control variate does not provide much improvement compared to full-data MCMC. This is because we have a small number of observations per group and the coreset is relatively large in comparison (an average size of approximately 20, i.e. about 20% of the number of observations per group). It is our experience that better coreset results are obtained for larger sizes of the initial groups (as illustrated shortly in the second part of this example).

We compute the variance of the log-likelihood estimator using the ungrouped control variate and grouped control variates using different group sizes. Table 1 shows the variance reduction obtained by grouping; see the caption for details. The grouping is expected to be more effective in terms of variance reduction if the model is more complex. Finally, Figure 7 confirms that the spectral density corresponding to the posterior mean of samples obtained by Spectral Subsampling MCMC matches the known spectral density (recall that the data is simulated and thus the true spectral density is known).

Observations per group:   (1)    (10)    (100)
min                        1     1.02    1.03
median                     1     1.07    1.20
max                        1     2.71    3.82
Table 1: Relative variance of the log-likelihood estimator over draws from the posterior for the ARFIMA example (larger is better). The results are shown for different numbers of groups (number of observations per group specified in parentheses). The results are relative to no grouping, i.e., one observation per group (first column).

Series 2 — More Frequencies

The second series is considerably longer than the first, yielding many more frequency terms in the Whittle likelihood. As mentioned, for the previous series the coreset control variate did not yield much improvement since the number of observations per group was small relative to the coreset size. By increasing the number of frequency terms we now have larger groups, and therefore the relative overhead cost of evaluating the coreset for each group decreases. Figure 8 shows a large improvement over Figure 6 for the coreset control variate, although the Taylor control variate remains the best for this problem.

Figure 5: Kernel density estimates of the marginal posterior distributions for the ARFIMA example — note that the incurred bias from subsampling is negligible.
Figure 6: Relative computational time for the ARFIMA example (larger is better), using the shorter series (Series 1). Results are relative to full-data MCMC.
Figure 7: Periodogram, True Spectral Density, and Spectral Subsampling MCMC posterior mean spectral density for the ARFIMA example.
Figure 8: Relative computational time for the ARFIMA example (larger is better), using the longer series (Series 2). Results are relative to full-data MCMC.

6 Discussion

We introduce novel methods allowing efficient Bayesian inference in stationary models for large time series datasets. The idea is simple and elegant: overcome the lack of independence in the data by transforming them to the frequency domain where they are independent. This work is, to our knowledge, the first to extend scalable MCMC algorithms to models with temporal dependence. The focus here is on Subsampling MCMC, but the ideas introduced here can be directly applied to many other scalable inference methods involving subsampling and coresets, as well as algorithms that produce exact inferences (Quiroz et al., 2018b, a; Cornish et al., 2019). Subsampling of periodogram frequencies also extends beyond the MCMC setting, for example to Doubly Stochastic Variational Inference (Titsias and Lázaro-Gredilla, 2014). We also introduce novel control variate schemes based on grouping and coresets to improve the robustness of Subsampling MCMC.

In terms of further extensions to different models, one promising avenue is to extend our methods to vector valued and/or locally stationary processes. Dahlhaus et al. (2000) define a suitable analogue to the univariate Whittle likelihood for these cases. Finally, it has not escaped our notice that our approach can be directly extended to spatial and spatio–temporal data using the multidimensional DFT.

Acknowledgements

Salomone and Kohn were partially supported by the Australian Centre of Excellence for Mathematical and Statistical Frontiers, under Australian Research Council grant CE140100049. Villani was supported by the Swedish Foundation for Strategic Research (Smart Systems: RIT 15-0097).

References

  • Andrieu et al. (2009) Andrieu, C., Roberts, G. O., et al. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725.
  • Bardenet et al. (2017) Bardenet, R., Doucet, A., and Holmes, C. (2017). On Markov chain Monte Carlo methods for tall data. The Journal of Machine Learning Research, 18(1):1515–1557.
  • Barndorff-Nielsen and Schou (1973) Barndorff-Nielsen, O. and Schou, G. (1973). On the parametrization of autoregressive models by partial autocorrelations. Journal of Multivariate Analysis, 3(4):408–419.
  • Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
  • Brockwell and Davis (2016) Brockwell, P. and Davis, R. (2016). Introduction to Time Series and Forecasting. Springer Texts in Statistics. Springer International Publishing.
  • Campbell and Beronov (2019) Campbell, T. and Beronov, B. (2019). Sparse variational inference: Bayesian coresets from scratch. arXiv preprint arXiv:1906.03329.
  • Campbell and Broderick (2018) Campbell, T. and Broderick, T. (2018). Bayesian coreset construction via greedy iterative geodesic ascent. In Proceedings of the 35th International Conference on Machine Learning, pages 698–706.
  • Campbell and Broderick (2019) Campbell, T. and Broderick, T. (2019). Automated scalable Bayesian inference via Hilbert coresets. The Journal of Machine Learning Research, 20(1):551–588.
  • Carter and Kohn (1997) Carter, C. K. and Kohn, R. (1997). Semiparametric Bayesian inference for time series with mixed spectra. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):255–268.
  • Choudhuri et al. (2004) Choudhuri, N., Ghosal, S., and Roy, A. (2004). Bayesian estimation of the spectral density of a time series. Journal of the American Statistical Association, 99(468):1050–1059.
  • Cornish et al. (2019) Cornish, R., Vanetti, P., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. (2019). Scalable Metropolis-Hastings for exact Bayesian inference with large datasets. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pages 1351–1360.
  • Dahlhaus et al. (2000) Dahlhaus, R. et al. (2000). A likelihood approximation for locally stationary processes. The Annals of Statistics, 28(6):1762–1794.
  • Dang et al. (2019) Dang, K.-D., Quiroz, M., Kohn, R., Tran, M.-N., and Villani, M. (2019). Hamiltonian Monte Carlo with energy conserving subsampling. Journal of Machine Learning Research, 20(100):1–31.
  • Deligiannidis et al. (2018) Deligiannidis, G., Doucet, A., and Pitt, M. K. (2018). The correlated pseudomarginal method. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):839–870.
  • Edwards et al. (2019) Edwards, M. C., Meyer, R., and Christensen, N. (2019). Bayesian nonparametric spectral density estimation using B-spline priors. Statistics and Computing, 29(1):67–78.
  • Granger and Joyeux (2008) Granger, C. and Joyeux, R. (2008). An introduction to long memory time series models and fractional differencing. Journal of Time Series Analysis, 1:15 – 29.
  • Huggins et al. (2016) Huggins, J., Campbell, T., and Broderick, T. (2016). Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088.
  • Hyndman and Khandakar (2008) Hyndman, R. and Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3):1–22.
  • Lindgren (2012) Lindgren, G. (2012). Stationary stochastic processes: theory and applications. Chapman and Hall/CRC.
  • Marin et al. (2012) Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180.
  • Plummer et al. (2006) Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11.
  • Quiroz et al. (2019a) Quiroz, M., Kohn, R., Villani, M., and Tran, M.-N. (2019a). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 114(526):831–843.
  • Quiroz et al. (2018a) Quiroz, M., Tran, M.-N., Villani, M., and Kohn, R. (2018a). Speeding up MCMC by delayed acceptance and data subsampling. Journal of Computational and Graphical Statistics, 27(1):12–22.
  • Quiroz et al. (2018b) Quiroz, M., Tran, M.-N., Villani, M., Kohn, R., and Dang, K.-D. (2018b). The block-Poisson estimator for optimally tuned exact subsampling MCMC. arXiv preprint arXiv:1603.08232v5.
  • Quiroz et al. (2019b) Quiroz, M., Villani, M., Kohn, R., Tran, M.-N., and Dang, K.-D. (2019b). Subsampling MCMC - An introduction for the survey statistician. Sankhya A, 80(1):33–69.
  • Rue et al. (2009) Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392.
  • Sabzikar et al. (2019) Sabzikar, F., McLeod, A. I., and Meerschaert, M. M. (2019). Parameter estimation for ARTFIMA time series. Journal of Statistical Planning and Inference, 200:129 – 145.
  • Shao et al. (2007) Shao, X., Wu, W. B., et al. (2007). Asymptotic spectral theory for nonlinear time series. The Annals of Statistics, 35(4):1773–1801.
  • Sykulski et al. (2019) Sykulski, A. M., Guillaumin, A. P., Olhede, S. C., Early, J. J., and Lilly, J. M. (2019). The debiased Whittle likelihood. Biometrika, 106(2):251–266.
  • Tamaki (2008) Tamaki, K. (2008). The Bernstein-von Mises theorem for stationary processes. Journal of the Japan Statistical Society, 38(2):311–323.
  • Titsias and Lázaro-Gredilla (2014) Titsias, M. and Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, pages 1971–1979.
  • Tran et al. (2016) Tran, M.-N., Kohn, R., Quiroz, M., and Villani, M. (2016). The block pseudo-marginal sampler. arXiv preprint arXiv:1603.02485.
  • Whittle (1953) Whittle, P. (1953). Estimation and information in stationary time series. Arkiv för matematik, 2(5):423–434.
