1 Introduction
It is common to thin a Markov chain sample, taking every ’th observation instead of all of them. Such subsampling is done to produce values that are more nearly independent. It also saves storage costs. It is well known that the average over a thinned sample set has greater variance than the plain average over all of the computed values (Geyer, 1992).
Most authors recommend against thinning, except where it is needed to reduce storage. MacEachern and Berliner (1994) go so far as to provide a ‘justification for the ban against subsampling’. Link and Eaton (2011) write that “Thinning is often unnecessary and always inefficient”. In discussing thinning of the Gibbs sampler, Gamerman and Lopes (2006)
say: “There is no gain in efficiency, however, by this approach and estimation is shown below to be always less precise than retaining all chain values.”
One exception is Geyer (1991) who acknowledges that thinning can in fact increase statistical efficiency. Thinning reduces the average cost of iterations which then makes it possible to run a thinned Markov chain longer than an unthinned one at the same computational cost. He gives some qualitative remarks about this effect, but ultimately concludes that it is usually a negligible benefit because the autocorrelations in the Markov chain decay exponentially fast. Link and Eaton (2011) also acknowledge this possibility in their discussion as does Neal (1993, page 106).
This paper revisits the thinning problem and shows that the usual advice against thinning can be misleading, by quantifying the argument of Geyer (1991) described above. The key variables are the cost of computing the quantity of interest (after advancing the Markov chain) and the speed at which correlations in the quantity of interest decay. When the cost is expensive and the decay is slow, then thinning can improve efficiency by a large factor.
We suppose that it costs one unit to advance the Markov chain and units each time the quantity of interest is computed. If lag autocorrelations satisfy , then there is always a for which thinning by a factor of will improve efficiency.
For a first order autoregressive autocorrelation structure in the quantity of interest, very precise results are possible. Given the update costs and the autocorrelation parameter we can compute the optimal thinning factor as well as the efficiency improvement with that factor. The autoregressive assumption is very convenient because it reduces the dependence problem to just one scalar parameter. Also, realworld autocorrelations commonly resemble those of an AR(1) model. In the social sciences, the book by Jackman (2009) shows many sample autocorrelation functions that resemble AR(1). The physicists Newman and Barkema (1999) writing about the Ising model state that “the autocorrelation is expected to fall off exponentially at long times” (p 60). Geyer (1991) notes an exponential upper bound for autocorrelations when processes are mixing.
Sometimes thinning is built in to standard simulation practice. For instance an Ising model may be simulated as a sequence of ‘passes’ with each pixel being examined on average once per pass. The state of the Markov chain might only be inspected once per pass. That represents a substantial, though not necessarily optimal amount of thinning. It might really be better to sample several times per pass or just once every passes.
An outline of this paper is as follows. Section 2 defines asymptotic efficiency of thinning to every ’th observation when the samples have unit cost to generate, the function of interest costs each time we compute it. If the autocorrelations for are nonnegative and nonincreasing and then there is always some finite for which thinning by a factor of is more efficient than not thinning. Much sharper results can be obtained when the autocorrelations take the form at lag . In many cases the optimal thinning factor is greater than one.
Section 3 presents some inequalities among the efficiency levels at different subsampling frequencies in the AR(1) case. Thinning never helps when . For , if any thinning level is to help, then taking every second sample must also help, and as a result we can get sharp expressions for the autocorrelation level at which thinning increases efficiency. In the limit very large thinning factors become optimal but frequently much smaller factors are nearly as good. The efficiency gain does not exceed for any and . Section 4 considers autocorrelations that are bounded between two autoregressive forms . The range of optimal thinning factors widens, but it is often possible to find meaningful efficiency improvements from thinning. Section 5 describes how to compute the optimal thinning factor given the parameters and an autoregression parameter . Section 6 has conclusions and discusses consequences of rejected proposals having essentially zero cost while accepted ones have a meaningfully large cost. An appendix has R code to compute the optimal .
We close with some practical remarks. When thinning benefits, it does not appear to be critical to find the optimal factor . Instead there are many near optimal thinning factors. If the autocorrelations decay slowly and the cost is large then a suggestion of Hans Anderson is to thin in such a way that about half of the cost is spent advancing the Markov chain and about half is spent computing the quantity of interest. That should be nearly as efficient as using the optimal .
2 Asymptotic efficiency
To fix ideas, suppose that we generate a Markov chain for . We have a starting value and then it costs one unit of computation to transition from to . The state space for can be completely general in the analysis below.
Interest centers on the expected value of for some realvalued function . There is ordinarily more than one such function, but here we focus on a single one. The cost to compute is . Often but it is also possible that is comparable to or even larger. For instance it may be inexpensive to perform one update on a system of particles, but very expensive to find the new minimum distance among all those particles or some similar quantity of interest. Or, it may be very inexpensive to flip one or more edges in a simulated network but expensive to compute a connectivity property of the resulting network. Finally, when computation must pause to store , then the cost of pausing is included in .
The efficiency of thinning derived here depends on the cost of computing from , the cost of transition from to , and the autocovariances of the series . We assume that is stationary: any necessary warmup has taken place.
The variance of is asymptotically where and . We assume that . Now suppose that we thin the chain as follows. We compute only for every ’th observation. The number of function values we get will depend on . If we take of them then we estimate by
To compute we must advance the chain times and evaluate at each of points for a total cost of . When our computational budget is a cost of , then we will use the largest with . That is .
The relative efficiency of thinning by a factor compared to not thinning at all is
The dependence on is minor and is a nuisance. We work instead with
(1) 
which is also the limit of as .
2.1 Generic autocorrelations
The efficiency of thinning depends on the autocorrelations only through certain sums of them. We can use this to get inequalities on autocorrelations that are equivalent to statements on the efficiency . Then under a monotonocity constraint on autocorrelations we can get a condition that ensures that thinning will help.
Lemma 1.
Let , and for a thinning factor , define and . Then if and only if
(2) 
Only one out of every consecutive autocorrelations contributes to while the other of them contribute to . If we let , then equation (2) becomes . For a Markov chain with slowly converging autocorrelations we will have . Then for thinning to be inefficient, the autocorrelations contributing to have to be enough larger than the others to overcome the factor . When is large we would then need every ’th autocorrelation to be surprisingly large compared to the nearby ones, in order to make thinning inefficient.
Now suppose that the autocorrelations satisfy
(3) 
This quite mild sufficient condition rules out a setting where every ’th autocorrelation is unusually large compared to its predecessors.
Theorem 1.
If (3) holds then can only hold for
Proof.
If are large and slowly decreasing, then will be quite large and will be very small. Then even for mild costs , Theorem 1 ensures that some form of thinning will improve asymptotic efficiency. The converse does not hold: thinning might still help, even if .
2.2 AR(1) autocorrelations
Here we consider a first order autoregressive model,
for and . In this setting it is possible to find the most efficient values of and to measure the efficiency gain from them. It is reasonable to expect qualitatively similar results from autocorrelations that have approximately the AR(1) form. Some steps in that direction are in Section 4.Under an AR(1) model
(4) 
We use to denote an efficiency computed under the autoregressive assumption and to denote a more general efficiency. The efficiency in (1) is a continuous function of the underlying inside of it, so small departures from the autoregressive assumption will make small changes in efficiency. When the peak of is flat then small changes in may bring large changes in .
Table 1 shows for a range of correlations and costs . This is computed via a search described in Section 5. As one would expect, the optimal thinning factor increases with both and .
Perhaps surprisingly, the optimal thinning factor can be large even for , when the chain mixes slowly. For instance with and , the optimal thinning takes every ’nd value. But Table 2 shows that in such cases only a small relative efficiency gain occurs. For and the improvement is just under % and this gain may not be worth the trouble of using thinning.
When the cost is comparable to one, then thinning can bring a meaningful efficiency improvement for slow mixing chains. The efficiency gain approaches in the limit as . See equation (7) in Section 3.
0.1  0.5  0.9  0.99  0.999  0.9999  0.99999  0.999999  

0.001  1  1  1  4  18  84  391  1817 
0.01  1  1  2  8  39  182  843  3915 
0.1  1  1  4  18  84  391  1817  8434 
1  1  2  8  39  182  843  3915  18171 
10  2  4  17  83  390  1816  8433  39148 
100  3  7  32  172  833  3905  18161  84333 
1000  4  10  51  327  1729  8337  39049  181612 
0.1  0.5  0.9  0.99  0.999  0.9999  0.99999  0.999999  

0.001  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00 
0.01  1.00  1.00  1.00  1.01  1.01  1.01  1.01  1.01 
0.1  1.00  1.00  1.06  1.09  1.10  1.10  1.10  1.10 
1  1.00  1.20  1.68  1.93  1.98  2.00  2.00  2.00 
10  1.10  2.08  5.53  9.29  10.59  10.91  10.98  11.00 
100  1.20  2.79  13.57  51.61  85.29  97.25  100.17  100.82 
1000  1.22  2.97  17.93  139.29  512.38  845.38  963.79  992.79 
A more efficient thinning rule allows the user to wait less time for an answer, or to attain a more accurate answer in the same amount of time. It may be a slight nuisance to incorporate thinning and when storage is not costly, we might even prefer to explore a larger set of sampled values. Table 3 shows the least amount of thinning that we can do to get at least % efficiency relative to the most efficient value of . That is, we find the smallest with . When % efficiency is adequate and is small then there is no need to thin. Theorem 3 below shows that in the AR(1) model, there is no need to thin at any , if efficiency is acceptable.
0.1  0.5  0.9  0.99  0.999  0.9999  0.99999  0.999999  

0.001  1  1  1  1  1  1  1  1 
0.01  1  1  1  1  1  1  1  1 
0.1  1  1  2  2  2  2  2  2 
1  1  2  5  11  17  19  19  19 
10  2  4  12  45  109  164  184  189 
100  2  5  22  118  442  1085  1632  1835 
1000  2  6  31  228  1182  4415  10846  16311 
3 Some inequalities
Here we compare efficiencies for different choices of the thinning factor , under the autoregressive assumption . We find that thinning never helps when . In the limit as , the optimal diverges to infinity but we can attain nearly full efficiency by taking to be a modest multiple of . When , the critical value of , meaning one large enough to make , is an increasing function of . As a result we can determine when is optimal. The following basic lemma underpins several of the results.
Lemma 2.
Let be integers, and . Then if and only if
(5) 
Proof.
Because , the given inequality in efficiencies is equivalent to
Equation (5) follows by rearranging this inequality. ∎
It is obvious that thinning cannot improve efficiency when . Here we find that the same holds for .
Proposition 1.
If and then for all integers .
Proof.
With negative
there is an advantage to taking an odd integer
compared to nearby even integers, stemming from the factor in . For instance with Lemma 2 we find that when , but remains the best odd integer. From here on we restrict attention to . Also it is obvious that is best when and so we assume .Many applications have correlations very close to . Then
(7) 
The optimal grows without bound as and it has asymptotic efficiency . From Tables 1 and 2 we might anticipate that there are diminishing returns to very large in this limit. If we do not insist on maximum efficiency we can use much smaller . To obtain efficiency at least relative to the best in the large limit we impose
for . Rearranging this inequality we obtain
For instance to attain % efficiency relative to the best in the large limit, we may take and then .
The next proposition introduces a critical cost beyond which thinning by the factor is more efficient than not thinning. That threshold cost increases with and the result is that we may then characterize when (no thinning) is optimal.
Proposition 2.
Let and choose an integer . Then if and only if
(8) 
Proof.
This follows from Lemma 2 using and . ∎
For fixed and very large we find that
If is near zero, then choosing large is only efficient for very large . If is close to then the threshold for ’s efficiency can be quite low.
Proposition 3.
For the critical from (8) is an increasing function of .
Proof.
Let and put
It suffices to show that is increasing over . Differentiating,
and we need only show that the numerator
is positive. We will show that
First we show that . This function and its first three derivatives are
Because and we find that . Similarly, and so . Finally, so that , completing the first step.
For the second step, treating as a continuous variable,
We now show that this partial derivative is nonnegative. Because , it suffices to show that the second factor where . Differentiating yields
Proceeding as before, so this first partial derivative is nonnegative. Then so is nonpositive as required. ∎
Theorem 2.
For , the choice maximizes efficiency over integers whenever
(9) 
For , the choice maximizes efficiency if
(10) 
Proof.
From the monotonicity of in Proposition 3, if any is better than then . Then is most efficient if
establishing (9). The equation has two roots for fixed and (9) holds for outside the open interval formed by those two roots. One of those roots is larger than and the other is given as the right hand side of (10). ∎
The upper limit in (10) is asymptotically for large . That is .
Theorem 3.
For integer lag , cost and autocorrelation , the function is nondecreasing in , and so
Proof.
The second conclusion follows from the first using the limit in (7). The derivative of with respect to is
(11) 
It suffices to show that the numerator in (11) is nonnegative for . The numerator simplifies to twice . Now
for a factor . Because and we have . Therefore and because we conclude that and so is nondecreasing in . ∎
Next we consider how to locate the maximizer over of . The next result on relaxing to be a nonnegative real value helps.
Proposition 4.
For and the function is strictly concave in .
Proof.
We write for
Now , so is strictly concave for . It now suffices to show that is concave. Let . Then
using , and so
(12) 
The numerator in (12) is . We need only show that on which includes all relevant values of . The result follows because and . ∎
From Proposition 4, we know that if we relax to real values in , then is either nonincreasing as increases from , or it increases to a peak before descending, or it increases indefinitely as increases. Because we can rule out the third possibility.
As a result, we know that if for , then the optimal value of is in the set . Neither the value nor any larger value can be optimal because the function is already in its descending region by the time we consider lags as large as .
4 Approximately autoregressive dependence
In this section we consider how efficiencies and optimal thinning factors behave when the autocorrelations are nearly but not exactly of autoregressive form. Recall that the efficiency of thinning to every ’th observation versus not thinning () is given by of (1). More generally, the efficiency of thinning by factor versus thinning by factor is
Thinning should help for autocorrelations that are approximately autoregressive and decay very slowly. Then the numerator in will be large. For there to be a meaningful efficiency gain, the denominator in should not be too large. That is reasonable as we ordinarily expect that . We suppose now that the autocorrelations of satisfy
(13) 
Then need not follow exactly the autoregressive form and indeed they need not be monotonically decreasing in .
Under condition (13) we can get upper and lower bounds on the summed autocorrelations and these yield
Any value for which holds for some cannot be the optimal thinning factor.
There are two main ways that thinning can help. One is that the autocorrelations decay slowly. The other is that the cost to compute is large. We consider one example of each type.
First, consider a slow but not extremely slow correlation decay, and with moderately large . If , then thinning at factor must be more efficient than not thinning for any autocorrelation satisfying (13). We find that this holds whenever . If , then thinning at factor must be at least twice as efficient as not thinning. This holds for . If then thinning is at least four times as efficient as not thinning. In this not very extreme example, there are gains from thinning and they hold over a wide range of thinning factors . The thinning factors that are not dominated by some other thinning factor are given by . Any other cannot be optimal. The given values of , and allow for a large set of possible optimal , but they do not allow for to be optimal. Instead, is suboptimal by at least fourfold.
As a second example, consider a high cost with moderately slow correlation decay given by and . Then there is at least a fold efficiency gain for any and the optimal must satisfy .
5 Optimization
The most direct way to maximize over is to compute for all and then choose
It is necessary to find a value that we can be sure is at least as large as . In light of the discussion following the log concavity Proposition 4, we need only find a value where holds for some . We do this by repeatedly doubling until we encounter a decreased efficiency.
For moderately large values of and it is numerically very stable to compute . But for more extreme cases it is better to work with
where does not depend on . Many computing environments contain a special function that is a numerically more precise way to compute for small . Ignoring we then work with
Now, to find we set and then while set . At convergence take . R code to implement this optimization is given in the Appendix. Only in extreme circumstances will be larger than one million, and so the enumerative approach will ordinarily have a trivial cost and it will not then be necessary to use more sophisticated searches. It takes about of a second for this search to produce the values in Tables 1 and 2 on a MacBook Air. If is thought to be extraordinarily large then one could run a safeguarded Newton method to find and whichever of or maximizes .
6 Discussion
Contrary to common recommendations, thinning a Markov chain sample can improve statistical efficiency. This phenomenon always holds for monotonically decreasing nonnegative autocorrelations if the cost of evaluating is large enough. When the correlations follow an autoregressive model, the optimal subsampling rate grows rapidly as increases towards becoming unbounded in the limit. Sometimes those large subsampling rates correspond to only modest efficiency improvements. The magnitude of the improvement depends greatly on the ratio of the cost of function evaluation to the cost of updating the Markov chain. When is of order or higher, a meaningful efficiency improvement can be attained by thinning such a Markov chain. When the autocorrelations decay slowly but do not necessarily follow the exact autoregression pattern we may still find that thinning brings a large efficiency gain.
In some problems, the cost may have an important dependence on . In an MCMC, it is common to have because a proposal was rejected. In such cases need not be recomputed. Then an appropriate cost measure for would be the CPU time taken to evaluate , normalized by the time to generate a proposal, and then multiplied by the acceptance rate. Larger values of increase the chance that a proposal has been accepted and hence the average cost of computing . For instance, Gelman et al. (1996) find that an acceptance rate of is most efficient in high dimensional Metropolis random walk sampling. Then when thinning by factor , the appropriate cost is where is the cost of an accepted proposal and the efficiency becomes
under an autoregressive assumption. Optimizing this case is outside the scope of this article. It is more difficult because the autocorrelation depends on the acceptance rate . At any level of thinning, the optimal may depend on .
It is also common that one has multiple functions to evaluate. They might each have different optimal thinning ratios. Optimizing the efficiency over such a collection raises issues that are outside the scope of this article. For instance, the cost of evaluating a subset of these functions may be subadditive in the costs of evaluating them individually due to shared computations. The importance of estimating those different means may also be unequal. Finally, there may be greater statistical efficiency for comparisons of those corresponding means when the are evaluated on common inputs.
Acknowledgments
This work was supported by the NSF under grants DMS1407397 and DMS1521145. I thank Hera He, Christian Robert, Hans Andersen, Michael Giles and some anonymous reviewers for helpful comments.
References

Gamerman and Lopes (2006)
Gamerman, D. and Lopes, H. F. (2006).
Markov chain Monte Carlo: stochastic simulation for Bayesian inference
. CRC Press, Boca Raton,FL, 2nd edition.  Gelman et al. (1996) Gelman, A., Roberts, G., and Gilks, W. (1996). Efficient Metropolis jumping rules. In Bernardo, J. M., Berger, J. O., and Smith, A. F. M., editors, Bayesian statistics 5, pages 599–608.
 Geyer (1991) Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Keramides, E. M., editor, Proceedings of the 23rd Symposium on the Interface. Interface Foundation of North America.
 Geyer (1992) Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science, 7(2):473–483.
 Jackman (2009) Jackman, S. (2009). Bayesian analysis for the social sciences. John Wiley & Sons, New York.
 Link and Eaton (2011) Link, W. A. and Eaton, M. J. (2011). On thinning of chains in MCMC. Methods in ecology and evolution, 3(1):112–115.
 MacEachern and Berliner (1994) MacEachern, S. N. and Berliner, L. M. (1994). Subsampling the Gibbs sampler. The American Statistician, 48(3):188–190.
 Neal (1993) Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical report, University of Toronto.
 Newman and Barkema (1999) Newman, M. E. J. and Barkema, G. T. (1999). Monte Carlo Methods in Statistical Physics. Oxford University Press, New York.
Appendix: R code
# Code to find the optimal amount of thinning for a Markov sample. # It costs 1 unit to advance the chain, and theta units to evaluate # the function. The autocorrelation is rho. effk = function(k,theta,rho){ # Asymptotic efficiency of thinning factor k vs using k=1 # Compute and exponentiate log( effk ) # NB: log1p( x ) = log( 1+x ) t1 = log1p(theta)  log(k+theta) t2 = log1p(rho)  log1p(rho) t3 = log1p(rho^k)  log1p(rho^k) exp( t1 + t2 + t3 ) } leffkprime = function(k,theta,rho){ # Log of asymptotic efficiency at thinning factor k. # It ignores terms that do not depend on k. if( any( rho!=0 ) & any( abs(rho^k) == 0) ){ # Basic detection of underflow while still allowing rho=0 badk = min( k[abs(rho^k)==0] ) msg = paste("Underflow for k >=",badk,sep=" ") stop(msg) }  log(k+theta) + log1p(rho^k)  log1p(rho^k) } getkmax = function(theta,rho){ # Find an upper bound for the optimal thinning fraction k if( theta<0 )stop("Negative theta") if( rho<0 )stop("Negative rho") if( rho >=1 )stop("rho too close to one") m=1 while( leffkprime(2*m,theta,rho) > leffkprime(m,theta,rho) ) m = m*2 2*m } kopt = function(theta,rho,klimit=10^7){ # Find optimal k for the given theta and rho. # Stop if kmax is too large. That usually # means that theta is very large or rho is very nearly one kmax = getkmax(theta,rho) if( kmax > klimit ){ msg = paste("Optimal k too expensive. It requires checking",kmax,"values.") stop(msg) } leffvals = leffkprime( 1:kmax,theta,rho ) best = which.max(leffvals) best } kok = function(theta,rho,klimit=10^7,eta=.05){ # Find near optimal k for the given theta and rho. # This is the smallest k with efficiency >= 1eta times best. # NB: computations in kopt are repeated rather than # saved. This is inefficient but the effect is minor. best = kopt(theta,rho,klimit) leffvals = leffkprime( 1:best,theta,rho ) ok = min( which(leffvals >= leffvals[best] + log1p(eta) ) ) ok } kopttable = function( thvals = 10^c(3:3), rhovals = c(.1,.5,110^c(1:6)),eta=.05){ # Prepare tables of optimal k, its efficiency, and smallest # k with at least 1eta efficiency T = length(thvals) R = length(rhovals) bestk = matrix(0,T,R) row.names(bestk) = thvals colnames(bestk) = rhovals effbk = bestk okk = bestk for( i in 1:T ) for( j in 1:R ){ theta = thvals[i] rho = rhovals[j] bestk[i,j] = kopt(theta,rho) effbk[i,j] = leffkprime(bestk[i,j],theta,rho)leffkprime(1,theta,rho) effbk[i,j] = exp(effbk[i,j]) okk[i,j] = kok(theta,rho,eta=eta) } list( bestk=bestk, effbk=effbk, okk=okk ) }
Comments
There are no comments yet.