In Markov chain Monte Carlo (MCMC) simulations, it is desirable to estimate the variability of ergodic averages to assess the quality of estimation (see e.g. Flegal et al., 2008; Geyer, 2011; Jones and Hobert, 2001)
. Estimation of this variability can be approached through a multivariate Markov chain central limit theorem (CLT). To this end, let
be a probability distribution with supportand be a -integrable function. We are interested in estimating the -dimensional vector
using draws from a Harris -ergodic Markov chain, say . Then, letting for , with probability 1 as . The sampling distribution for is available via a Markov chain CLT if there exists a positive definite symmetric matrix such that
We assume throughout that the CLT at (1) holds (see e.g. Jones, 2004) and consider estimation of . Provided such an estimator is available, one can access variability of by constructing a -dimensional confidence ellipsoid. Recent work of Vats et al. (2018b, a) obtains necessary conditions for strongly consistent estimation for multivariate spectral variance (SV) and batch means (BM) estimators. Liu and Flegal (2018) proposes computationally efficient weighted BM estimators that allow for flexible choice of window function similar to SV estimators. These broad family of estimators account for the serial correlation in the Markov chain up to a certain lag. This lag, denoted as , is called the bandwidth and batch size in SV and BM estimators, respectively. We will refer to both as the batch size, unless otherwise stated. The choice of is crucial to finite sample performance, but estimating an optimal has not been carefully addressed in multivariate output analysis. Instead, suboptimal bandwidths and batch sizes inherited from univariate estimators are widely used by practitioners.
Optimal batch size selection has been studied carefully in other contexts such as nonparametric density and spectral density function estimation. For example, Politis (2003, 2011) and Politis and Romano (1999) discuss optimal bandwidth selection of SV estimators using the flat-top window function. Estimation of the time-average variance constant using a recursive estimator is proposed by Chan and Yau (2017) where an optimal batch size is suggested. An interested reader is further directed to Jones et al. (1996), Silverman (1999), Woodroofe (1970), Sheather (1983), Sheather and Jones (1991) for bandwidth selection in density estimation. Broadly speaking, these results are not computationally viable for high-dimensional MCMC where long run lengths are standard.
Until recently, most MCMC simulations only considered estimation of the diagonal entries of . An incomplete list of appropriate univariate estimators includes BM and overlapping BM (Jones et al., 2006; Flegal and Jones, 2010; Meketon and Schmeiser, 1984), SV methods including flat-top estimators (Anderson, 1994; Politis and Romano, 1996, 1995), initial sequence estimators (Geyer, 1992), and regenerative simulation (Mykland et al., 1995; Hobert et al., 2002; Seila, 1982). For more general dependent processes, Song and Schmeiser (1995) and Damerdji (1995) consider univariate BM estimators and obtain optimal batch sizes that minimize the asymptotic mean squared error (MSE). Flegal and Jones (2010)
also consider these estimators for MCMC simulations under weaker mixing and moment conditions. These papers show the optimal batch size is proportional to, but there has been no work in MCMC settings with regard to estimating the proportionality constant. As a result, Flegal and Jones (2010) suggest using the suboptimal batch size equal to to avoid estimation of the unknown proportionality constant.
This paper has two major contributions. First, we provide a fast and stable estimation procedure for the optimal batch size proportionality constant. In short, we use a stationary autoregressive process of order to approximate the marginals of , which yields a closed form expression for the unknown proportionality constant. This gives a different batch size for each entry of that is difficult to implement in practice. Hence we propose using the average of the univariate optimal batch sizes as an overall batch size.
Second, we obtain an optimal batch size expression for multivariate SV estimators. This is a substantial generalization since prior results only consider univariate overlapping BM estimators (Damerdji, 1995; Flegal and Jones, 2010). That is, these results only consider SV estimators with a Bartlett lag window. As in the univariate case, the optimal batch size is proportional to . As a comparison, we obtain the optimal batch size for BM estimators, which are also proportional to .
A vector autoregressive process of order 1 is used to evaluate finite sample performance of our proposed method in comparison to the more common flat-top pilot estimators (Politis and Romano, 1999; Politis, 2011, 2003). In this simple example, the optimal batch size can be calculated analytically. Next, we present a Bayesian logistic regression example with real data and compare the performance of the optimal batch size methods with the more commonly used batch sizes of and . A similar analysis is done for a Bayesian dynamic space-time model with real data.
The rest of this paper is organized as follows. Section 2 presents the SV and BM estimators. Section 3 summarizes results on the optimal batch size and proposes a novel estimation technique. Section 4 presents the theoretical details on the derivation of the optimal batch size. Section 5 considers three examples to compare performances between suggested and more commonly used batch sizes. All proofs are relegated to the appendices.
2 Batch means and spectral variance estimators
The BM estimator of is constructed using the sample covariance matrix from batches of the Markov chain. For a Monte Carlo sample size , let , where is the number of batches and is the batch size. For , let denote the mean vector of the batch. The BM estimator is defined as,
Strong and mean square consistency requires both the batch size and the number of batches to increase with . Large batch sizes capture more lag correlations with fewer batches, implying larger variance. Alternatively, small batch sizes yield higher bias. We present theoretical results on the optimal choice of in Section 4.
A similar trade-off is seen for SV estimators, which we present below. Consider estimating the lag autocovariance denoted by with
The SV estimator truncates and downweights the summed lag autocovariances. That is,
where is the bandwidth or truncation point and is the lag window. The bandwidth serves the same role as the batch size in the BM estimator, hence we denote both by .
We assume throughout that the lag window is an even function defined on . In addition, we assume (i) for all and , (ii) for all , and (iii) for all . Some examples of lag windows are Bartlett, Bartlett flat-top, and Tukey-Hanning lag windows defined as
respectively. Figure 1 illustrates these lag windows. In MCMC, the Bartlett lag window is by far the most commonly used. However, the downweighting of lags induces significant bias. The motivation behind flat-top lag windows is to reduce downweighting of small lag terms by letting for near (see e.g. Politis and Romano, 1995, 1996).
3 Estimating optimal batch sizes
As we show later for BM and SV estimators, the optimal for estimating the th element of , , is
with components . Hence estimating requires estimates of and , which is the focus of this section.
Choosing a different for each element of requires substantial computational effort and it is unclear if it makes intuitive sense. Since can be calculated for each univariate component, we define the overall optimal by averaging the diagonals of . That is,
The proportionality constant is known and depends on the choice of variance estimator (see Section 4 for details).
To estimate , we require estimates of and , which is our initial goal. A common solution is to use pilot estimators, i.e. a pilot run of the process is obtained and used to estimate and (see e.g. Woodroofe, 1970; Jones et al., 1996; Loader, 1999; Politis, 2003). Two such procedures are the flat-top pilot estimator and empirical rule (Politis, 2003) and the iterative plug-in pilot estimator (Brockmann et al., 1993; Bühlmann, 1996). In both settings, an SV estimator is constructed for each of the components for the pilot run where the bandwidth is chosen by an empirical or iterative rule that monitors lag autocorrelations.
Unlike final estimators of , we do not require consistency for estimators of . Thus, the pilot step need not be based on BM or SV estimators. The choice of the estimator for only impacts the finite sample properties of the estimator of , while maintaining the asymptotic results. That is, if the asymptotic results of strong consistency and mean square consistency hold for , then they also hold for , for some constant .
We argue estimators of should be (i) computationally inexpensive and (ii) have low variability. Both the empirical rule and iterative plug-in estimators can be computationally involved (especially for slow mixing chains) and hence they fail our first criteria. These estimators can also yield highly variable estimates of , which we illustrate in our examples. Low variability is particularly important since the user cannot be expected to run multiple pilot runs to obtain a stable estimate of .
We now present pilot estimators of and satisfying our two criteria. In short, we propose using a stationary autoregressive process of order (AR) approximation to the marginals of . For , let be such that
where has mean 0 and variance , and are the autoregressive coefficients. Let be the lag
autocovariance function for the process. Then by the Yule-Walker equations, it is known that for, , and . The asymptotic variance in the CLT is known to be
Obtaining an expression for is more involved. Following Taylor (2018),
We propose to fit an AR model for each marginal of the Markov chain, where is determined by Akaike information criterion. The autocovariances are estimated by the sample lag autocovariances, . Then and are estimated by and , respectively, by solving the Yule-Walker equations. For the th component of the Markov chain, the resulting pilot estimators are
4 Theoretical results
This section presents optimal bandwidth and batch size results for SV and BM estimators, respectively. We assume the following two assumptions for mean square consistency.
The batch size is an integer sequence such that and as , where and are both monotonically non-decreasing.
Denote the Euclidean norm by and let be a -dimensional Brownian motion. Our second assumption is that of a strong invariance principle.
There exists a lower triangular matrix , a non-negative increasing function on the positive integers, a finite random variable
on the positive integers, a finite random variable, and a sufficiently rich probability space such that for almost all and for all ,
Consider an alternative representation of the SV estimator, which differs on only some end effects. This expression has been used previously by Damerdji (1995) and Flegal and Jones (2010). First define and . Let for and consider the estimator
Suppose that is such that (the expression for is in Appendix B). When is fixed, (20) shows hence the estimators are asymptotically equivalent. Our first result shows with probability 1 as . Define where the square is component-wise for . Preliminaries for the following result can be found in Appendix A followed by the proof in Appendix B.
We now derive the asymptotic bias and variance for , which enables calculation of an optimal bandwidth by minimizing the MSE. We use this MSE to approximate that of , which is difficult to obtain directly. Let be the th entry of , then we begin by calculating the asymptotic bias.
Suppose for and the lag window satisfies . If is a polynomially ergodic Markov chain of order for some , then
Further, the bandwidth that minimizes the asymptotic MSE is
Thus, for the SV estimator, the proportionality constant in (6) is . Expressions of and can be derived for different lag windows.
We now present MSE results for the BM estimator. The result is a consequence of setting in Theorems 5 and 6 of Vats and Flegal (2018).
5.1 Vector auto-regressive example
Consider the -dimensional vector auto-regressive process of order 1 (VAR(1))
for where , are i.i.d. and is a
matrix. The Markov chain is geometrically ergodic when the largest eigenvalue ofin absolute value is less than 1 (Tjøstheim, 1990). In addition if denotes the Kronecker product, the invariant distribution is , where . Consider estimating with . By the CLT at (1),
With some algebra, it can be shown that
Thus, the true coefficient of the optimal batch size can be obtained by using the diagonals of and above. To ensure geometric ergodicity, we generate the process as follows. Consider a matrix
with each entry generated from standard normal distribution, letbe a symmetric matrix with the largest eigenvalue , then set . We then evaluate a series of , where , where larger values imply stronger auto-covariance and cross-covariance of the chain.
We compare our proposed optimal batch size coefficient estimates with that of the flat-top pilot estimator and empirical rule of Politis (2003). For each , optimal coefficients using both methods are computed for the corresponding based on VAR(1) of length 1e4. MSE over 1000 replications are plotted in Figure 2 with confidence intervals. Note, our proposed AR approximation method consistently provides better estimates of the coefficients while yielding smaller variability.
For an estimator of , let . MSE across entries of can be reflected by
which we use to evaluate various batch sizes. Figure 2 shows the log of the average over 1000 replications of Markov chain length 1e5, illustrating the optimal batch size leads to smaller than batch sizes and . The two estimation methods perform similarly in terms of of the matrix.
5.2 Bayesian logistic regression
We consider the Anguilla australis data from Elith et al. (2008) available in the dismo R package. The dataset records the presence or absence of the short-finned eel in 1000 sites over New Zealand. Following Leathwick et al. (2008), we choose six of the twelve covariates recorded in the data; SegSumT, DSDist, USNative, DSMaxSlope and DSSlope are continuous and Method is categorical with five levels.
For , let record the presence () or absence of Anguilla australis. Let denote the vector of covariates for observation . We fit a model with intercept so that the regression coefficient . Let
We set as in Boone et al. (2014). The posterior distribution is intractable and we use the MCMClogit function in the R package MCMCpack to obtain posterior samples; this random walk Metropolis-Hastings sampler is geometrically ergodic (Vats et al., 2018a).
In 1000 replications, we ran a pilot run of length 1e4 to estimate the optimal batch size using our AR approximation methods and the flat-top pilot estimators. We then reran the chain to estimate using the optimal batch size estimates. This was repeated for three Monte Carlo sample sizes, and . Figure 3 presents the variability in the estimates of the coefficient of the optimal batch size. Note that our AR estimator has significantly lower variability compared to the flat-top estimators. This is particularly useful since a pilot estimator is only run once by a user. In addition, since is relatively close to the estimated coefficients of the optimal batch size, we expect a batch size of to perform relatively well for and .
Coverage probabilities over the 1000 replications are provided in Table 1. Given that the estimated coefficient is significantly larger than 1, it is not surprising that performs poorly. For small sample sizes, both the optimal methods have better coverage probabilities. For Monte Carlo sample size , as expected, fares marginally better. Almost universally, our AR approximation method outperforms the flat-top pilot estimators.
|Batch Means||Spectral Variance|
|Sample Size /||AR||FT||AR||FT|
5.3 Bayesian dynamic space-time model
This example considers the Bayesian dynamic model of Finley et al. (2012) to model monthly temperature data collected at 10 nearby station in northeastern United States in 2000. A data description can be found in the spBayes R package (Finley and Banerjee, 2013).
Suppose denotes the temperature observed at location and time for and . Let be a vector of predictors and be a coefficient vector, which is a purely time component and be a space-time component. The model is
where is a spatial Gaussian process with , is an exponential correlation function with controlling the correlation decay, and represents the spatial variance components. The Gaussian spacial process allows closer locations to have higher correlations. Time effect for both and are characterized by transition equations to achieve reasonable dependence structure. We are interested in estimating posterior expectation of 185 parameters , their prior follows the spDynlM function in the spBayes package.
The only predictor in our analysis is elevation, hence for where is the intercept and is the coefficient for elevation. Consider estimating the coefficient of the covariate for the last two months, and . We obtain the true posterior mean of these two components by averaging over 1000 chains of length 1e6. Our simulation setup is similar to that in Section 5.2. In Figure 4 we present boxplots of the estimated coefficient of the optimal batch size; here again the variability in the flat-top estimator is high. Also note that the average estimated coefficient of , immediately indicates the inappropriateness of using batch size and . We also present the confidence regions created in one run of length using the BM estimators for varying batch sizes. Given the dependence in time of the two components, we expect such thin ellipses, due to high posterior correlation, assisted by high Markov chain lag cross-correlation.
Coverage probabilities over 1000 replications are shown in Table 2. Unsurprisingly, the and do not perform well for both the BM and SV estimators. The AR approximation method and the flat-top pilot estimators are comparable, with our AR approximation method being yielding marginally better coverage probabilities.
|Batch Means||Spectral Variance|
|Sample Size /||AR||FT||AR||FT|
This paper provides theoretical evidence and practical guidance for optimal batch size selection in MCMC simulations. Estimators with the proposed optimal batch sizes are shown to have superior performance versus conventional batch sizes. Optimal batch size has not been addressed previously in multivariate MCMC setting although sampling multivariate posteriors is ubiquitous in Bayesian analyses. Optimal batch size in univariate MCMC simulations has been discussed previously, where sub-optimal batch sizes that ignore the proportionality constant are conventionally used.
To reduce computational effort, we used MCMC samples to obtain pilot estimates regardless of the total chain length. As a result, performance of the estimator can be improved without significant additional computation, which is crucial for multivariate problems in practice. Our choice was a compromise between computation effort and accuracy.
Optimal batch sizes with estimated coefficients can also be expanded to other estimators such as weighted BM estimators (Liu and Flegal, 2018). These estimators are based on fewer batches compared with mSV estimators. Hence they are more computationally efficient with a larger variance. Nevertheless, obtaining optimal batch sizes should follow results discussed in this paper and is a direction of future work.
Appendix A Preliminaries
First, we introduce some notation and propositions. Recall is a -dimensional standard Brownian motion. Let be the th component of vector . Denote , . The Brownian motion counterpart of is
Let , where is a lower triangular matrix. Let and be the th component of . Suppose , and .
(Vats et al., 2018b) For all and for almost all sample paths, there exists such that for all and all
where is the th diagonal entry of
If variable and are jointly normally distributed with
(Janssen and Stoica, 1987) If , , , and are jointly normally distributed with mean 0, then