Optimal mean squared error bandwidth for spectral variance estimators in MCMC simulations

04/16/2018
by   Ying Liu, et al.
University of California, Riverside

This paper proposes optimal mean squared error bandwidths for a family of multivariate spectral variance estimators. The asymptotic mean squared error of the spectral variance estimator is derived under conditions that are convenient to verify for Markov chain Monte Carlo simulations. Optimal bandwidths are obtained by minimizing the asymptotic mean squared error, along with techniques to estimate the unknown proportionality constant. Autoregressive examples illustrate the quality of the estimation procedure. Finally, we show the optimal bandwidths proposed here outperform current bandwidth selection methods.


1 Introduction

In Markov chain Monte Carlo (MCMC) simulations, it is desirable to estimate the variability of ergodic averages to assess the quality of estimation (see e.g. Flegal et al., 2008; Geyer, 2011; Jones and Hobert, 2001). Estimation of this variability can be approached through a multivariate Markov chain central limit theorem (CLT). To this end, let $\pi$ be a probability distribution with support $\mathcal{X}$ and $g : \mathcal{X} \to \mathbb{R}^p$ be a $\pi$-integrable function. We are interested in estimating the $p$-dimensional vector

$$\theta = \mathrm{E}_\pi g$$

using draws from a Harris $\pi$-ergodic Markov chain, say $\{X_1, X_2, \dots\}$. Then, letting $\theta_n = n^{-1} \sum_{t=1}^{n} g(X_t)$ for $n \ge 1$, $\theta_n \to \theta$ with probability 1 as $n \to \infty$. The sampling distribution for $\theta_n$ is available via a Markov chain CLT if there exists a positive definite symmetric $p \times p$ matrix $\Sigma$ such that

$$\sqrt{n}\,(\theta_n - \theta) \xrightarrow{d} N_p(0, \Sigma) \quad \text{as } n \to \infty, \tag{1}$$

where

$$\Sigma = \sum_{k=-\infty}^{\infty} \mathrm{Cov}_\pi\bigl(g(X_1), g(X_{1+k})\bigr).$$

We assume throughout that the CLT at (1) holds (see e.g. Jones, 2004) and consider estimation of $\Sigma$. Provided such an estimator is available, one can assess the variability of $\theta_n$ by constructing a $p$-dimensional confidence ellipsoid. Recent work of Vats et al. (2018b, a) obtains necessary conditions for strongly consistent estimation for multivariate spectral variance (SV) and batch means (BM) estimators. Liu and Flegal (2018) propose computationally efficient weighted BM estimators that allow for a flexible choice of window function similar to SV estimators. This broad family of estimators accounts for the serial correlation in the Markov chain up to a certain lag. This lag, denoted $b_n$, is called the bandwidth in SV estimators and the batch size in BM estimators. We will refer to both as the batch size, unless otherwise stated. The choice of $b_n$ is crucial to finite sample performance, but estimating an optimal $b_n$ has not been carefully addressed in multivariate output analysis. Instead, suboptimal bandwidths and batch sizes inherited from univariate estimators are widely used by practitioners.

Optimal batch size selection has been studied carefully in other contexts such as nonparametric density and spectral density function estimation. For example, Politis (2003, 2011) and Politis and Romano (1999) discuss optimal bandwidth selection of SV estimators using the flat-top window function. Estimation of the time-average variance constant using a recursive estimator is proposed by Chan and Yau (2017) where an optimal batch size is suggested. An interested reader is further directed to Jones et al. (1996), Silverman (1999), Woodroofe (1970), Sheather (1983), Sheather and Jones (1991) for bandwidth selection in density estimation. Broadly speaking, these results are not computationally viable for high-dimensional MCMC where long run lengths are standard.

Until recently, most MCMC simulations only considered estimation of the diagonal entries of $\Sigma$. An incomplete list of appropriate univariate estimators includes BM and overlapping BM (Jones et al., 2006; Flegal and Jones, 2010; Meketon and Schmeiser, 1984), SV methods including flat-top estimators (Anderson, 1994; Politis and Romano, 1996, 1995), initial sequence estimators (Geyer, 1992), and regenerative simulation (Mykland et al., 1995; Hobert et al., 2002; Seila, 1982). For more general dependent processes, Song and Schmeiser (1995) and Damerdji (1995) consider univariate BM estimators and obtain optimal batch sizes that minimize the asymptotic mean squared error (MSE). Flegal and Jones (2010) also consider these estimators for MCMC simulations under weaker mixing and moment conditions. These papers show the optimal batch size is proportional to $n^{1/3}$, but there has been no work in MCMC settings with regard to estimating the proportionality constant. As a result, Flegal and Jones (2010) suggest using the suboptimal batch size $\lfloor n^{1/2} \rfloor$ to avoid estimating the unknown proportionality constant.

This paper has two major contributions. First, we provide a fast and stable estimation procedure for the optimal batch size proportionality constant. In short, we use a stationary autoregressive process of order $m$ to approximate the marginals of $\{g(X_t)\}$, which yields a closed-form expression for the unknown proportionality constant. This gives a different batch size for each diagonal entry of $\Sigma$, which is difficult to implement in practice. Hence we propose using the average of the univariate optimal batch sizes as an overall batch size.

Second, we obtain an optimal batch size expression for multivariate SV estimators. This is a substantial generalization since prior results only consider univariate overlapping BM estimators (Damerdji, 1995; Flegal and Jones, 2010); that is, they only consider SV estimators with a Bartlett lag window. As in the univariate case, the optimal batch size is proportional to $n^{1/3}$. As a comparison, we obtain the optimal batch size for BM estimators, which is also proportional to $n^{1/3}$.

A vector autoregressive process of order 1 is used to evaluate the finite sample performance of our proposed method in comparison to the more common flat-top pilot estimators (Politis and Romano, 1999; Politis, 2011, 2003). In this simple example, the optimal batch size can be calculated analytically. Next, we present a Bayesian logistic regression example with real data and compare the performance of the optimal batch size methods with the more commonly used batch sizes of $\lfloor n^{1/2} \rfloor$ and $\lfloor n^{1/3} \rfloor$. A similar analysis is done for a Bayesian dynamic space-time model with real data.

The rest of this paper is organized as follows. Section 2 presents the SV and BM estimators. Section 3 summarizes results on the optimal batch size and proposes a novel estimation technique. Section 4 presents the theoretical details on the derivation of the optimal batch size. Section 5 considers three examples to compare the performance of the suggested batch sizes with more commonly used choices. All proofs are relegated to the appendices.

2 Batch means and spectral variance estimators

The BM estimator of $\Sigma$ is constructed using the sample covariance matrix from batches of the Markov chain. For a Monte Carlo sample size $n$, let $n = a_n b_n$, where $a_n$ is the number of batches and $b_n$ is the batch size. For $k = 0, \dots, a_n - 1$, let $\bar{\theta}_k = b_n^{-1} \sum_{t=1}^{b_n} g(X_{k b_n + t})$ denote the mean vector of the $k$th batch. The BM estimator is defined as

$$\hat{\Sigma}_{BM} = \frac{b_n}{a_n - 1} \sum_{k=0}^{a_n - 1} (\bar{\theta}_k - \theta_n)(\bar{\theta}_k - \theta_n)^T. \tag{2}$$

Strong and mean square consistency require both the batch size and the number of batches to increase with $n$. Large batch sizes capture more lag correlations but yield fewer batches, implying larger variance. Conversely, small batch sizes yield higher bias. We present theoretical results on the optimal choice of $b_n$ in Section 4.
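To make the construction at (2) concrete, the following is a minimal R sketch of the multivariate BM estimator, assuming the chain is stored as an $n \times p$ matrix Y with one row per iteration; the function name and storage convention are ours, not the paper's.

```r
# Minimal sketch of the multivariate BM estimator in (2).
# Assumes the chain is an n x p matrix Y; `bm_estimator` is our own name.
bm_estimator <- function(Y, b) {
  n <- nrow(Y)
  a <- floor(n / b)                       # number of batches a_n
  Y <- Y[seq_len(a * b), , drop = FALSE]  # drop an incomplete final batch
  theta_n <- colMeans(Y)                  # overall mean vector
  batch_means <- matrix(0, a, ncol(Y))
  for (k in seq_len(a)) {
    rows <- ((k - 1) * b + 1):(k * b)
    batch_means[k, ] <- colMeans(Y[rows, , drop = FALSE])
  }
  centered <- sweep(batch_means, 2, theta_n)
  (b / (a - 1)) * crossprod(centered)     # b_n/(a_n - 1) * sum of outer products
}
```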

A similar trade-off is seen for SV estimators, which we present below. Consider estimating the lag-$k$ autocovariance, denoted $R(k) = \mathrm{Cov}_\pi\bigl(g(X_1), g(X_{1+k})\bigr)$, with

$$\hat{R}(k) = \frac{1}{n} \sum_{t=1}^{n-k} (g(X_t) - \theta_n)(g(X_{t+k}) - \theta_n)^T.$$

The SV estimator truncates and downweights the summed lag autocovariances. That is,

$$\hat{\Sigma}_{SV} = \hat{R}(0) + \sum_{k=1}^{b_n - 1} w_n(k) \bigl( \hat{R}(k) + \hat{R}(k)^T \bigr),$$

where $b_n$ is the bandwidth or truncation point and $w_n(\cdot)$ is the lag window. The bandwidth serves the same role as the batch size in the BM estimator, hence we denote both by $b_n$.

We assume throughout that the lag window $w_n(\cdot)$ is an even function defined on $\mathbb{Z}$. In addition, we assume (i) $|w_n(k)| \le 1$ for all $n$ and $k$, (ii) $w_n(0) = 1$ for all $n$, and (iii) $w_n(k) = 0$ for all $|k| \ge b_n$. Some examples of lag windows are the Bartlett, Bartlett flat-top, and Tukey-Hanning lag windows, defined as

$$w_n(k) = \left(1 - \frac{|k|}{b_n}\right) I(|k| < b_n), \tag{3}$$

$$w_n(k) = \min\left\{1,\; 2\left(1 - \frac{|k|}{b_n}\right)\right\} I(|k| < b_n), \tag{4}$$

$$w_n(k) = \frac{1}{2}\left(1 + \cos\frac{\pi k}{b_n}\right) I(|k| < b_n), \tag{5}$$

respectively. Figure 1 illustrates these lag windows. In MCMC, the Bartlett lag window is by far the most commonly used. However, the downweighting of lags induces significant bias. The motivation behind flat-top lag windows is to reduce the downweighting of small lag terms by letting $w_n(k) = 1$ for $k$ near 0 (see e.g. Politis and Romano, 1995, 1996).
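For illustration, here is a minimal R sketch of the SV estimator using the lag windows in (3)-(5); the function names and the $n \times p$ storage convention are ours.

```r
# The three lag windows in (3)-(5), written as functions of x = k / b_n.
lag_window <- function(x, type = c("bartlett", "flattop", "tukey")) {
  type <- match.arg(type)
  ax <- abs(x)
  switch(type,
    bartlett = pmax(1 - ax, 0),                       # (3)
    flattop  = pmin(1, pmax(2 * (1 - ax), 0)),        # (4): flat for |x| <= 1/2
    tukey    = ifelse(ax < 1, 0.5 * (1 + cos(pi * ax)), 0)  # (5)
  )
}

# Sketch of the multivariate SV estimator for an n x p chain matrix Y.
sv_estimator <- function(Y, b, type = "bartlett") {
  n <- nrow(Y)
  Yc <- sweep(Y, 2, colMeans(Y))       # centered chain
  Sigma <- crossprod(Yc) / n           # lag-0 sample autocovariance R_hat(0)
  for (k in seq_len(b - 1)) {
    # R_hat(k) = (1/n) sum_t (Y_t - theta_n)(Y_{t+k} - theta_n)'
    Rk <- crossprod(Yc[1:(n - k), , drop = FALSE],
                    Yc[(k + 1):n, , drop = FALSE]) / n
    Sigma <- Sigma + lag_window(k / b, type) * (Rk + t(Rk))
  }
  Sigma
}
```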

Figure 1: Plot of Bartlett, Tukey-Hanning, and Bartlett flat-top lag windows.

3 Estimating optimal batch sizes

As we show later for BM and SV estimators, the optimal $b_n$ for estimating the $i$th diagonal element of $\Sigma$, $\Sigma_{ii}$, is

$$b_n^{(i)} = \left( c \, \frac{\Gamma_{ii}^2}{\Sigma_{ii}^2} \, n \right)^{1/3}, \tag{6}$$

where

$$\Gamma = -\sum_{k=1}^{\infty} k \bigl( R(k) + R(k)^T \bigr)$$

with components $\Gamma_{ij}$. Hence estimating $b_n^{(i)}$ requires estimates of $\Sigma_{ii}$ and $\Gamma_{ii}$, which is the focus of this section.

Choosing a different $b_n$ for each element of $\Sigma$ requires substantial computational effort, and it is unclear whether it makes intuitive sense. Since $b_n^{(i)}$ can be calculated for each univariate component, we define the overall optimal $b_n$ by averaging over the diagonals of $\Sigma$. That is,

$$b_n = \frac{1}{p} \sum_{i=1}^{p} b_n^{(i)}. \tag{7}$$

The proportionality constant $c$ is known and depends on the choice of variance estimator (see Section 4 for details).

To estimate $b_n$, we require estimates of $\Sigma_{ii}$ and $\Gamma_{ii}$, which is our initial goal. A common solution is to use pilot estimators, i.e. a pilot run of the process is obtained and used to estimate $\Sigma_{ii}$ and $\Gamma_{ii}$ (see e.g. Woodroofe, 1970; Jones et al., 1996; Loader, 1999; Politis, 2003). Two such procedures are the flat-top pilot estimator with empirical rule (Politis, 2003) and the iterative plug-in pilot estimator (Brockmann et al., 1993; Bühlmann, 1996). In both settings, an SV estimator is constructed for each of the $p$ components of the pilot run, where the bandwidth is chosen by an empirical or iterative rule that monitors lag autocorrelations.

Unlike final estimators of $\Sigma$, we do not require consistency for the pilot estimators of $\Sigma_{ii}$ and $\Gamma_{ii}$. Thus, the pilot step need not be based on BM or SV estimators. The choice of pilot estimator only impacts the finite sample properties of the estimator of $\Sigma$, while maintaining the asymptotic results. That is, if the asymptotic results of strong consistency and mean square consistency hold for $b_n = \lfloor c \, n^{1/3} \rfloor$, then they also hold for $b_n = \lfloor a \, n^{1/3} \rfloor$ for any constant $a > 0$.

We argue that pilot estimators should be (i) computationally inexpensive and (ii) have low variability. Both the empirical rule and iterative plug-in estimators can be computationally involved (especially for slow mixing chains) and hence fail our first criterion. These estimators can also yield highly variable estimates of $b_n$, which we illustrate in our examples. Low variability is particularly important since the user cannot be expected to run multiple pilot runs to obtain a stable estimate of $b_n$.

We now present pilot estimators of $\Sigma_{ii}$ and $\Gamma_{ii}$ satisfying our two criteria. In short, we propose using a stationary autoregressive process of order $m$ (AR($m$)) approximation to the marginals of $\{g(X_t)\}$. For $i = 1, \dots, p$, write $Y_t^{(i)}$ for the $i$th component of $g(X_t)$ and suppose

$$Y_t^{(i)} = \phi_1 Y_{t-1}^{(i)} + \dots + \phi_m Y_{t-m}^{(i)} + \varepsilon_t,$$

where $\varepsilon_t$ has mean 0 and variance $\sigma^2$, and $\phi_1, \dots, \phi_m$ are the autoregressive coefficients. Let $\rho(k)$ be the lag-$k$ autocovariance function for the process. Then by the Yule-Walker equations, it is known that for $k \ge 1$, $\rho(k) = \sum_{j=1}^{m} \phi_j \, \rho(k - j)$, and $\sigma^2 = \rho(0) - \sum_{j=1}^{m} \phi_j \, \rho(j)$. The asymptotic variance in the CLT is known to be

$$\Sigma_{ii} = \frac{\sigma^2}{\bigl(1 - \sum_{j=1}^{m} \phi_j\bigr)^2}.$$

Obtaining an expression for $\Gamma_{ii}$ is more involved; a closed-form expression in terms of the AR($m$) coefficients and autocovariances follows from Taylor (2018).

We propose to fit an AR($m$) model for each marginal of the Markov chain, where $m$ is determined by the Akaike information criterion (AIC). The autocovariances $\rho(k)$ are estimated by the sample lag autocovariances $\hat{\rho}(k)$. Then $\sigma^2$ and $\phi_j$ are estimated by $\hat{\sigma}^2$ and $\hat{\phi}_j$, respectively, by solving the Yule-Walker equations. For the $i$th component of the Markov chain, the resulting pilot estimators are

$$\hat{\Sigma}_{ii} = \frac{\hat{\sigma}^2}{\bigl(1 - \sum_{j=1}^{m} \hat{\phi}_j\bigr)^2}, \tag{8}$$

with $\hat{\Gamma}_{ii}$ obtained by the analogous plug-in.

An AR approximation for MCMC has also been studied by Thompson (2010), who considers estimating the integrated autocorrelation time of a process. Further, the R package coda (Plummer et al., 2006) uses (8) as the final estimator of $\Sigma_{ii}$ when calculating univariate effective sample sizes.
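The pilot procedure can be sketched with base R's ar(), which performs Yule-Walker fitting with AIC order selection. Since the closed-form $\Gamma_{ii}$ expression from Taylor (2018) is not reproduced in this extract, the sketch approximates $\Gamma_{ii} = -\sum_{k \ne 0} |k| \, \gamma_i(k)$ (with $\gamma_i$ the lag autocovariance of component $i$) by truncating the sum at lag $K$ using the fitted model's implied autocorrelations; the truncation device, the default $K$, and the function names are ours.

```r
# Sketch of the AR(m) pilot estimators behind (8), using base R's ar()
# (Yule-Walker fitting, order chosen by AIC). The lag-K truncation used
# to approximate Gamma_ii is our own device, not the paper's closed form.
ar_pilot <- function(y, K = 1000) {
  fit <- ar(y, aic = TRUE, method = "yule-walker")
  phi <- fit$ar                                # hat phi_1, ..., hat phi_m
  Sigma_ii <- fit$var.pred / (1 - sum(phi))^2  # hat sigma^2 / (1 - sum phi_j)^2
  # Implied autocorrelations at lags 1..K; order 0 (white noise) gives zeros.
  rho <- if (length(phi)) ARMAacf(ar = phi, lag.max = K)[-1] else numeric(K)
  gamma0 <- var(y)                             # sample lag-0 autocovariance
  Gamma_ii <- -2 * gamma0 * sum(seq_len(K) * rho)
  c(Sigma = Sigma_ii, Gamma = Gamma_ii)
}

# Overall batch size via (6) and (7): average the component-wise optima.
# `c_w` stands in for the window-dependent constant c in (6).
opt_batch <- function(Y, c_w = 1) {
  n <- nrow(Y)
  b <- apply(Y, 2, function(y) {
    est <- ar_pilot(y)
    (c_w * est["Gamma"]^2 / est["Sigma"]^2 * n)^(1 / 3)
  })
  max(1, floor(mean(b)))
}
```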

4 Theoretical results

This section presents optimal bandwidth and batch size results for SV and BM estimators, respectively. We require the following two assumptions for mean square consistency.

Assumption 1.

The batch size $b_n$ is an integer sequence such that $b_n \to \infty$ and $n/b_n \to \infty$ as $n \to \infty$, where $b_n$ and $n/b_n$ are both monotonically non-decreasing.

Denote the Euclidean norm by $\|\cdot\|$ and let $\{B(t)\}_{t \ge 0}$ be a $p$-dimensional standard Brownian motion. Our second assumption is that of a strong invariance principle.

Assumption 2.

There exists a $p \times p$ lower triangular matrix $L$, a non-negative increasing function $\psi$ on the positive integers, a finite random variable $D$, and a sufficiently rich probability space $\Omega$ such that for almost all $\omega \in \Omega$ and for all $n$,

$$\left\| \sum_{t=1}^{n} g(X_t) - n\theta - L B(n) \right\| < D(\omega) \, \psi(n). \tag{9}$$

Assumption 2 implies the CLT at (1) holds and is needed to obtain expressions for the element-wise variances of BM and SV estimators. We now turn to deriving the bias and variance of the SV estimators.

Consider an alternative representation of the SV estimator, denoted $\tilde{\Sigma}$, which differs from $\hat{\Sigma}_{SV}$ only in some end effects; its exact expression is given in Appendix B. This representation has been used previously by Damerdji (1995) and Flegal and Jones (2010). When $b_n$ is fixed, (20) shows the two estimators differ by an asymptotically negligible term, hence they are asymptotically equivalent. Our first result shows $\tilde{\Sigma} - \hat{\Sigma}_{SV} \to 0$ element-wise with probability 1 as $n \to \infty$. Preliminaries for the following result can be found in Appendix A, followed by the proof in Appendix B.

Theorem 1.

Assume Assumption 1 holds and Assumption 2 holds for both $g$ and its component-wise square. If the rate conditions on $b_n$ and $\psi$ hold as $n \to \infty$, then $\tilde{\Sigma} - \hat{\Sigma}_{SV} \to 0$ with probability 1 as $n \to \infty$.

We now derive the asymptotic bias and variance of $\tilde{\Sigma}$, which enables calculation of an optimal bandwidth by minimizing the MSE. We use this MSE to approximate that of $\hat{\Sigma}_{SV}$, which is difficult to obtain directly. Let $\Sigma_{ij}$ be the $(i,j)$th entry of $\Sigma$; we begin by calculating the asymptotic bias.

Theorem 2.

Suppose $\mathrm{E}_\pi \|g\|^{2+\delta} < \infty$ for some $\delta > 0$ and the lag window satisfies the conditions above. If $\{X_t\}$ is a polynomially ergodic Markov chain of sufficiently high order, then

Proof.

By Vats and Flegal (2018, Theorem 2), . Then under Assumption 1, for all and ,

(10)

Since , by (10),

Next, we present the element-wise variance of the SV estimator. Appendix C contains a number of preliminary results, followed by the proof of Theorem 3 in Appendix D.

Theorem 3.

Suppose Assumption 2 holds for with , , and with , , . Further, suppose , Assumption 1 holds and if

  1. , and

  2. , then,

Combining results from Theorem 2 and Theorem 3 yields the asymptotic MSE as $n \to \infty$. If the conditions of Theorem 3 hold and $b_n = c \, n^{1/3}$ for a constant $c > 0$, then the asymptotic MSE follows.

Further, the bandwidth that minimizes the asymptotic MSE is

Thus, for the SV estimator, the proportionality constant $c$ in (6) depends on the lag window; expressions of $c$ can be derived for different lag windows.

Example 1.

Consider the Bartlett window in (3). Here, and for . So, , and since , using Theorem 2, . Condition 1 of Theorem 3 is satisfied since , and since , . Thus, using (7),

We now present MSE results for the BM estimator. The result is a consequence of setting in Theorems 5 and 6 of Vats and Flegal (2018).

Corollary 1.

Let for some and let be a polynomially ergodic Markov chain of order for . Then Assumption 2 holds for for . If in addition, , , and as , then

As a consequence, the optimal batch size according to (7) is

5 Examples

5.1 Vector auto-regressive example

Consider the $p$-dimensional vector autoregressive process of order 1 (VAR(1)),

$$Y_t = \Phi Y_{t-1} + \varepsilon_t,$$

for $t = 1, 2, \dots$, where $Y_t \in \mathbb{R}^p$, the $\varepsilon_t$ are i.i.d. $N_p(0, \Omega)$, and $\Phi$ is a $p \times p$ matrix. The Markov chain is geometrically ergodic when the largest eigenvalue of $\Phi$ in absolute value is less than 1 (Tjøstheim, 1990). In addition, if $\otimes$ denotes the Kronecker product, the invariant distribution is $N_p(0, V)$, where $\mathrm{vec}(V) = (I_{p^2} - \Phi \otimes \Phi)^{-1} \mathrm{vec}(\Omega)$. Consider estimating $\theta = \mathrm{E} Y_t = 0$ with $\theta_n = \bar{Y}_n$. By the CLT at (1),

$$\Sigma = (I_p - \Phi)^{-1} V + V (I_p - \Phi^T)^{-1} - V.$$

With some algebra, it can be shown that

$$\Gamma = -\Phi (I_p - \Phi)^{-2} V - V (I_p - \Phi^T)^{-2} \Phi^T. \tag{11}$$

Thus, the true coefficient of the optimal batch size can be obtained using the diagonals of $\Sigma$ and $\Gamma$ above. To ensure geometric ergodicity, we generate the process as follows. Consider a matrix with each entry generated from a standard normal distribution, symmetrize it, and rescale it so that its largest eigenvalue in absolute value equals a target value $\rho < 1$; set $\Phi$ to the rescaled matrix. We then evaluate a series of values of $\rho$, where larger values imply stronger autocovariance and cross-covariance of the chain.
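The generation scheme and the closed forms above can be sketched in R as follows; the dimension, seed, target $\rho$, and innovation covariance are arbitrary illustrative choices, and (11) is used as reconstructed above.

```r
# VAR(1) example: build Phi with spectral radius rho, compute the true
# Sigma and Gamma from the closed forms above, and simulate the chain.
set.seed(1)
p <- 5
A <- matrix(rnorm(p * p), p, p)
B <- (A + t(A)) / 2                          # symmetrize
rho <- 0.9
Phi <- rho * B / max(abs(eigen(B)$values))   # largest |eigenvalue| = rho
Omega <- diag(p)                             # innovation covariance

# Invariant covariance: vec(V) = (I - Phi (x) Phi)^{-1} vec(Omega)
V <- matrix(solve(diag(p^2) - Phi %x% Phi, as.vector(Omega)), p, p)

M <- solve(diag(p) - Phi)                    # (I - Phi)^{-1}
Sigma_true <- M %*% V + V %*% t(M) - V
Gamma_true <- -(Phi %*% M %*% M %*% V + V %*% t(M) %*% t(M) %*% t(Phi))

# Simulate the chain (started at zero for simplicity; burn-in omitted)
n <- 1e4
Y <- matrix(0, n, p)
for (t in 2:n) Y[t, ] <- Phi %*% Y[t - 1, ] + rnorm(p)
```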

We compare our proposed optimal batch size coefficient estimates with those of the flat-top pilot estimator and empirical rule of Politis (2003). For each $\rho$, optimal coefficients using both methods are computed for the corresponding $\Phi$ based on a VAR(1) chain of length 1e4. MSEs over 1000 replications are plotted in Figure 2 with confidence intervals. Note that our proposed AR approximation method consistently provides better estimates of the coefficients while yielding smaller variability.

For an estimator $\hat{\Sigma}$ of $\Sigma$, the MSE across entries of $\Sigma$ can be reflected by the Frobenius distance $\|\hat{\Sigma} - \Sigma\|_F$, which we use to evaluate various batch sizes. Figure 2 shows the log of the average $\|\hat{\Sigma} - \Sigma\|_F$ over 1000 replications of Markov chain length 1e5, illustrating that the optimal batch size leads to smaller losses than batch sizes $\lfloor n^{1/2} \rfloor$ and $\lfloor n^{1/3} \rfloor$. The two estimation methods perform similarly in terms of the Frobenius loss of the matrix.
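As a usage sketch combining the pieces above (bm_estimator, opt_batch, and the VAR(1) objects from the earlier snippets), one can compare batch sizes by Frobenius loss:

```r
# Compare batch sizes by Frobenius loss, reusing the sketches above.
frob <- function(est) norm(est - Sigma_true, "F")
b_hat <- opt_batch(Y)                          # AR-pilot overall batch size
c(opt    = frob(bm_estimator(Y, b_hat)),
  sqrt_n = frob(bm_estimator(Y, floor(sqrt(n)))),
  cbrt_n = frob(bm_estimator(Y, floor(n^(1 / 3)))))
```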

Figure 2: Plots of average MSE with confidence intervals (left) and log of average Frobenius loss with different batch sizes for BM (center) and SV (right) estimators.

5.2 Bayesian logistic regression

We consider the Anguilla australis data from Elith et al. (2008), available in the dismo R package. The dataset records the presence or absence of the short-finned eel at 1000 sites over New Zealand. Following Leathwick et al. (2008), we choose six of the twelve covariates recorded in the data: SegSumT, DSDist, USNative, DSMaxSlope, and DSSlope are continuous, and Method is categorical with five levels.

For $i = 1, \dots, 1000$, let $y_i$ record the presence ($y_i = 1$) or absence ($y_i = 0$) of Anguilla australis at site $i$, and let $x_i$ denote the vector of covariates for observation $i$. We fit a model with an intercept so that, with Method expanded into four indicator variables, the regression coefficient is $\beta \in \mathbb{R}^{10}$. Let

$$y_i \mid x_i, \beta \sim \text{Bernoulli}\left( \frac{1}{1 + e^{-x_i^T \beta}} \right).$$

We set the prior for $\beta$ as in Boone et al. (2014). The posterior distribution is intractable, and we use the MCMClogit function in the R package MCMCpack to obtain posterior samples; this random walk Metropolis-Hastings sampler is geometrically ergodic (Vats et al., 2018a).

Figure 3: Boxplots of the estimated coefficient of the optimal batch size using our proposed method and the flat-top pilot estimates. The three horizontal lines correspond to the coefficient values at which the optimal batch size equals $\lfloor n^{1/2} \rfloor$ for the three Monte Carlo sample sizes.

In 1000 replications, we ran a pilot run of length 1e4 to estimate the optimal batch size using our AR approximation method and the flat-top pilot estimators. We then reran the chain to estimate $\Sigma$ using the optimal batch size estimates. This was repeated for three Monte Carlo sample sizes: 1e4, 5e4, and 1e5. Figure 3 presents the variability in the estimates of the coefficient of the optimal batch size. Note that our AR estimator has significantly lower variability compared to the flat-top estimators. This is particularly useful since a pilot estimator is typically only run once by a user. In addition, since $\lfloor n^{1/2} \rfloor$ is relatively close to the estimated optimal batch sizes, we expect a batch size of $\lfloor n^{1/2} \rfloor$ to perform relatively well here.

Coverage probabilities over the 1000 replications are provided in Table 1. Given that the estimated coefficient is significantly larger than 1, it is not surprising that $\lfloor n^{1/3} \rfloor$ performs poorly. For small sample sizes, both of the optimal methods have better coverage probabilities. For Monte Carlo sample size 1e5, as expected, $\lfloor n^{1/2} \rfloor$ fares marginally better. Almost universally, our AR approximation method outperforms the flat-top pilot estimators.

                        Batch Means                    Spectral Variance
Sample Size   n^{1/3}  n^{1/2}    AR      FT     n^{1/3}  n^{1/2}    AR      FT
1e4            0.279    0.722   0.725   0.716    0.276    0.727   0.733   0.733
5e4            0.499    0.826   0.832   0.826    0.497    0.834   0.838   0.839
1e5            0.615    0.861   0.857   0.853    0.615    0.864   0.864   0.863

Table 1: Coverage probabilities for confidence regions over 1000 replications.
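For concreteness, the coverage check behind Table 1 can be sketched as follows: an asymptotic confidence ellipsoid for $\theta$ covers the truth exactly when the standardized quadratic form falls below a chi-squared quantile. The 90% level below is a placeholder, since the nominal level is not stated in this extract.

```r
# Does the confidence ellipsoid from one run cover the true theta?
# Ellipsoid: { theta : n (theta_n - theta)' Sigma_hat^{-1} (theta_n - theta)
#              <= qchisq(level, p) }; the 90% level is a placeholder.
covers <- function(Y, theta_true, Sigma_hat, level = 0.90) {
  n <- nrow(Y)
  p <- ncol(Y)
  d <- colMeans(Y) - theta_true
  n * drop(t(d) %*% solve(Sigma_hat, d)) <= qchisq(level, df = p)
}
# The coverage probability is the average of covers(...) over replications.
```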

5.3 Bayesian dynamic space-time model

This example considers the Bayesian dynamic model of Finley et al. (2012) to model monthly temperature data collected at 10 nearby stations in the northeastern United States in 2000. A data description can be found in the spBayes R package (Finley and Banerjee, 2013).

Suppose $y_t(s)$ denotes the temperature observed at location $s$ and time $t$ for $s = 1, \dots, 10$ and $t = 1, \dots, 12$. Let $x_t(s)$ be a vector of predictors, $\beta_t$ be a coefficient vector that is a purely temporal component, and $u_t(s)$ be a space-time component. The model is

$$y_t(s) = x_t(s)^T \beta_t + u_t(s) + \varepsilon_t(s),$$

where $u_t(s)$ is a spatial Gaussian process, $\rho(\cdot \, ; \phi_t)$ is an exponential correlation function with $\phi_t$ controlling the correlation decay, and $\sigma_t^2$ represents the spatial variance component. The Gaussian spatial process allows closer locations to have higher correlations. The time effects for both $\beta_t$ and $u_t(s)$ are characterized by transition equations to achieve a reasonable dependence structure. We are interested in estimating the posterior expectation of 185 parameters; their priors follow the spDynLM function in the spBayes package.

The only predictor in our analysis is elevation, hence $x_t(s) \in \mathbb{R}^2$, where the first component of $\beta_t$ is the intercept and the second is the coefficient for elevation. Consider estimating the coefficient of elevation for the last two months, $\beta_{11,1}$ and $\beta_{12,1}$. We obtain the true posterior mean of these two components by averaging over 1000 chains of length 1e6. Our simulation setup is similar to that in Section 5.2. In Figure 4 we present boxplots of the estimated coefficient of the optimal batch size; here again the variability of the flat-top estimator is high. Also, the large average estimated coefficient immediately indicates the inappropriateness of using batch sizes $\lfloor n^{1/3} \rfloor$ and $\lfloor n^{1/2} \rfloor$. We also present the confidence regions created in one run of length 1e5 using the BM estimators for varying batch sizes. Given the dependence in time of the two components, we expect such thin ellipses, due to high posterior correlation, assisted by high Markov chain lag cross-correlation.

Figure 4: Confidence regions for $(\beta_{11,1}, \beta_{12,1})$ based on varying batch sizes and a chain length of 1e5.

Coverage probabilities over 1000 replications are shown in Table 2. Unsurprisingly, $\lfloor n^{1/3} \rfloor$ and $\lfloor n^{1/2} \rfloor$ do not perform well for both the BM and SV estimators. The AR approximation method and the flat-top pilot estimators are comparable, with our AR approximation method yielding marginally better coverage probabilities.

                        Batch Means                    Spectral Variance
Sample Size   n^{1/3}  n^{1/2}    AR      FT     n^{1/3}  n^{1/2}    AR      FT
1e4            0.274    0.571   0.771   0.758    0.274    0.565   0.765   0.761
1e5            0.426    0.764   0.869   0.864    0.424    0.756   0.872   0.865
2e5            0.445    0.789   0.856   0.857    0.446    0.787   0.865   0.859

Table 2: Coverage probabilities for confidence regions over 1000 replications.

6 Discussion

This paper provides theoretical evidence and practical guidance for optimal batch size selection in MCMC simulations. Estimators with the proposed optimal batch sizes are shown to have superior performance versus conventional batch sizes. Optimal batch size selection had not previously been addressed in the multivariate MCMC setting, although sampling multivariate posteriors is ubiquitous in Bayesian analyses. Optimal batch sizes in univariate MCMC simulations have been discussed previously, but sub-optimal batch sizes that ignore the proportionality constant are conventionally used.

To reduce computational effort, we used pilot runs of length 1e4 to obtain pilot estimates regardless of the total chain length. As a result, performance of the estimator can be improved without significant additional computation, which is crucial for multivariate problems in practice. Our choice was a compromise between computational effort and accuracy.

Optimal batch sizes with estimated coefficients can also be extended to other estimators such as weighted BM estimators (Liu and Flegal, 2018). These estimators are based on fewer batches than multivariate SV estimators; hence they are more computationally efficient, at the cost of a larger variance. Nevertheless, obtaining their optimal batch sizes should follow the results discussed in this paper and is a direction of future work.

Appendix

Appendix A Preliminaries

First, we introduce some notation and propositions. Recall that $\{B(t)\}$ is a $p$-dimensional standard Brownian motion, and let $B_i(t)$ denote the $i$th component of the vector $B(t)$. The Brownian motion counterpart of the batch mean is defined analogously.

Proposition 1.

(Csörgő and Révész, 1981) Suppose Assumption 1 holds, then for all and for almost all sample paths, there exists such that for all and all

Let $C(t) = L B(t)$, where $L$ is the lower triangular matrix from Assumption 2, and let $C_i(t)$ be the $i$th component of $C(t)$.

Proposition 2.

(Vats et al., 2018b) For all and for almost all sample paths, there exists such that for all and all

where is the th diagonal entry of

Proposition 3.

(Vats et al., 2018b) If Assumption 1 holds, then for all and for almost all sample paths, there exists such that for all and all

where is the th diagonal entry of

Proposition 4.

If variables $X$ and $Y$ are jointly normally distributed with mean 0,

then $\mathrm{E}[X^2 Y^2] = \mathrm{E}[X^2] \, \mathrm{E}[Y^2] + 2 \, (\mathrm{E}[XY])^2$.

Proposition 5.

(Janssen and Stoica, 1987) If $X_1$, $X_2$, $X_3$, and $X_4$ are jointly normally distributed with mean 0, then

$$\mathrm{E}[X_1 X_2 X_3 X_4] = \mathrm{E}[X_1 X_2] \, \mathrm{E}[X_3 X_4] + \mathrm{E}[X_1 X_3] \, \mathrm{E}[X_2 X_4] + \mathrm{E}[X_1 X_4] \, \mathrm{E}[X_2 X_3].$$