Finding our Way in the Dark: Approximate MCMC for Approximate Bayesian Methods

05/16/2019 ∙ by Evgeny Levi, et al. ∙ UNIVERSITY OF TORONTO 0

With larger data at their disposal, scientists are emboldened to tackle complex questions that require sophisticated statistical models. It is not unusual for the latter to have likelihood functions that elude analytical formulations. Even under such adversity, when one can simulate from the sampling distribution, Bayesian analysis can be conducted using approximate methods such as Approximate Bayesian Computation (ABC) or Bayesian Synthetic Likelihood (BSL). A significant drawback of these methods is that the number of required simulations can be prohibitively large, thus severely limiting their scope. In this paper we design perturbed MCMC samplers that can be used within the ABC and BSL paradigms to significantly accelerate computation while maintaining control on computational efficiency. The proposed strategy relies on recycling samples from the chain's past. The algorithmic design is supported by a theoretical analysis while practical performance is examined via a series of simulation examples and data analyses.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Since the early 1990s Bayesian statisticians have been able to operate largely due to the rapid development of Markov chain Monte Carlo (MCMC) sampling methods

(see, for example Craiu and Rosenthal, 2014, for a recent review). Given observed data with sampling density indexed by parameter

, Bayesian inference for functions of

rely on the characteristics of the posterior distribution

(1)

where denotes the prior distribution. When the posterior distribution in (1) cannot be studied analytically, we rely on MCMC algorithms to generate samples from . While traditional MCMC samplers such as Metropolis-Hastings or Hamiltonian MCMC (see Brooks et al., 2011, and references therein) can sample distributions with unknown normalizing constants, they rely on the closed form of the unnormalized posterior, .

The advent of large data has altered in multiple ways the framework we just described. For example, larger data tend to yield likelihood functions that are much more expensive to compute, thus exposing the liability inherent in the iterative nature of MCMC samplers. In response to this challenge, new computational methods based on divide and conquer (Scott et al., 2016; Wang and Dunson, 2013; Entezari et al., 2018), subsampling (Bardenet et al., 2014; Quiroz et al., 2015), or sequential Balakrishnan et al. (2006); Maclaurin and Adams (2015) strategies have emerged. Second, it is understood that larger data should yield answers to more complex problems. This implies the use of increasingly complex models, in as much as the sampling distribution is no longer available in closed form.

In the absence of a tractable likelihood function, statisticians have developed approximate methods to perform Bayesian inference when, for any parameter value , data can be sampled from the model. Here we consider two alternative approaches that have been proposed and gained considerable momentum in recent years: the Approximate Bayesian Computation (ABC) (Marin et al., 2012; Baragatti and Pudlo, 2014; Sisson et al., 2018a; Drovandi, 2018) and the Bayesian Synthetic Likelihood (BSL)(Wood, 2010; Drovandi et al., 2018; Price et al., 2018). Both algorithms are effective when they are combined with Markov chain Monte Carlo sampling schemes to produce samples from an approximation of and both share the need for generating many pseudo-data sets . This comes with serious challenges when the data is large and generating a pseudo-data set is computationally expensive. In this paper we tackle the reduction of computational burden by recycling draws from the chain’s history. While this reduces drastically the computation time, it alters the transition kernel of the original MCMC chain. We demonstrate that we can control the approximating error introduced when perturbing the original kernel using some of the error analysis for perturbed Markov chains developed recently by Mitrophanov (2005), Johndrow et al. (2015b) and Johndrow and Mattingly (2017).

The paper is structured as follows. Section 2 briefly reviews the ABC method and Section 3 introduces the proposed MCMC algorithms for ABC. Section 4 reviews BSL sampling and extends the proposed methods to this class of approximations. The practical impact of these algorithms is evaluated via simulations in Section 5 and data analyses in Section 6. The theoretical analysis showing control of perturbation errors in total variation norm is in Section 7. The paper closes with conclusions and ideas for future work.

2 Approximate Bayesian Computation

In order to illustrate the ABC sampler, let us consider the following Hidden Markov Model (HMM)

(2)
(3)

Unless Gaussian distributions are used to specify the transition and emission laws given in (

2) and (3), respectively, the marginal distribution

cannot be calculated in closed form. It is possible to treat the hidden random variables

as auxiliary and sample them using Particle MCMC (PMCMC) (Andrieu et al., 2010) or ensemble MCMC (Shestopaloff and Neal, 2013). However, computations become increasingly difficult as increases. Moreover, for some financial time series models such as Stochastic Volatility for log return, the

-Stable distribution may be useful to model transition and/or emission probabilities

(Nolan, 2003). The challenge is that the stable distributions do not have closed form densities, thus rendering the particle and ensemble MCMC impossible to use. Other widely used examples where the likelihood functions cannot be expressed analytically include various networks models (e.g., Kolaczyk and Csárdi, 2014) and Markov random fields (Rue and Held, 2005). For such models with intractable or computationally expensive likelihood evaluations, simulation based algorithms such as ABC are frequently used for inference. In its simplest form, the ABC is an accept/reject sampler. Given a user-defined summary statistic , the Accept/Reject ABC is described in Algorithm 1.

1:Given observed and required number of samples .
2:for  do
3:     Match = FALSE
4:     while Not Match do
5:         
6:         
7:         if  then
8:              ,
9:              Match = TRUE.
10:         end if
11:     end while
12:end for
Algorithm 1 Accept/Reject ABC

We emphasize that a closed form equation for the likelihood is not needed, only the ability to generate from for any . If is a sufficient statistics and then the algorithm yields posterior samples from the true posterior . Alas, the level of complexity for models where ABC is needed, makes it unlikely for these two conditions to hold. In order to implement ABC under more realistic assumptions, a (small) constant is chosen and is accepted whenever , where is a user-defined distance function. The introduction of and the use of non-sufficient statistics remove layers of exactness from the target distribution. The approximating distribution is denoted and we have

(4)

In light of (4) one would like to have , but if the sample size of

is large, then the curse of dimensionality leads to

. Thus, obtaining even a moderate number of samples using ABC can be an unattainable goal in this case. In almost all cases of interest, is not a sufficient statistics, implying that some information about is lost. Not surprisingly, much attention has been focused on finding appropriate low-dimensional summary statistics for inference (see, for example Robert et al., 2011; Fearnhead and Prangle, 2012; Marin et al., 2014; Prangle, 2015). In this paper we assume that the summary statistic is given.

In the absence of information about the model parameters, the prior and posterior distributions may be misaligned, having non-overlapping regions of mass concentration. Hence, parameter values that are drawn from the prior will be rarely retained making the algorithm very inefficient.

1:Given , and required number of samples .
2:Find initial with simulated such that .
3:for  do
4:     Generate
5:     Simulate and let
6:     Calculate
7:     Generate independent
8:     if  then
9:         .
10:     else
11:         .
12:     end if
13:end for
Algorithm 2 ABC MCMC

Algorithm 2 presents the ABC-MCMC algorithm of Marjoram et al. (2003) which avoids sampling from the prior and instead relies on building a chain with a Metropolis-Hastings (MH) transition kernel, with state space , proposal distribution and target distribution

(5)

where . Note that the goal is the marginal distribution for which is:

(6)

There are a few alternatives to Algorithm 2. For instance, Lee et al. (2012) approximates

via one of its unbiased estimators,

where and each is simulated from . The use of unbiased estimators for when computing the MH acceptance ratio can be validated using the theory of pseudo-marginal MCMC samplers (Andrieu and Roberts, 2009). Clearly, when the probability is very small, this method would require simulating a large number of s (or equivalently s) in order to move to a new state. Other MCMC designs suitable for ABC can be found in Bornn et al. (2014).

Sequential Monte Carlo (SMC) samplers have also been successfully used for ABC (henceforth denoted ABC-SMC) (Sisson et al., 2007; Lee, 2012; Filippi et al., 2013). ABC-SMC requires a specified decreasing sequence . Lee’s method Lee (2012) uses the Particle MCMC design (Andrieu et al., 2010) in which samples are updated as the target distribution evolves with . More specifically, it starts by sampling from using Accept-Reject ABC. Subsequently, at time all samples are sequentially updated so their distribution is (see Lee, 2012, for a complete description). The advantage of this method is not only that it starts from large , but also that it generates independent draws. A comprehensive coverage of computational techniques for ABC can be found in Sisson et al. (2018b) and references therein. We also note a general lack of guidelines concerning the selection of , which is unfortunate as the performance of ABC sampling depends heavily on its value. To make a fair comparison between different methods, we revise ABC-MCMC algorithm by introducing a decreasing sequence ( is number of ”steps”) similar to ABC-SMC and ”learning” transition kernel during burn-in as in Algorithm 3.

1:Given , sequence , constant , burn-in period and required number of samples .
2:Define .
3:Find initial with simulated such that .
4:Let be expectation of prior distribution and where is covariance of the prior .
5:Define, and define sequence .
6:for  do
7:     if  for some  then
8:         Set .
9:         Find as mean of and where is covariance of .
10:     end if
11:     Generate
12:     Simulate and let
13:     Calculate
14:     Generate independent
15:     if  then
16:         .
17:     else
18:         .
19:     end if
20:end for
Algorithm 3 ABC MCMC modified (ABC-MCMC-M)

Since the choice of proposal distribution can considerably influence the performance of ABC-MCMC, we consider finite adaptation during the burn-in period of length . In addition, during burn-in the also varies, starting with a higher value (which makes it easier to find the initial value) and gradually decreasing in accordance to a pre-determined scheme. In our implementations we use independent MH sampling or RWM. In the former case, the proposal is Gaussian with . The RWM proposal is with (Roberts et al., 1997, 2001).

All the algorithms discussed so far rely on numerous generations of pseudo-data. Since the latter can be computationally costly, proposals for reducing the simulation cost are made in Wilkinson (2014) and Järvenpää et al. (2018). The approaches are based on learning the dependence between and and, from it, establishing directly whether a proposal should be accepted or not. Flexible regression models are used to model these unknown functional relationships. The overall performance depends on the signal to noise ratio and on the model’s performance in capturing patterns that can be highly complex.

To accelerate ABC-MCMC we consider a different approach and propose to store and utilize past simulations (with appropriate weights) in order to speed up the calculation while keeping under control the resulting approximating errors. The objective is to approximate for any at every MCMC iteration using past simulated proposals, making the whole procedure computationally faster. The changes proposed perturb the chain’s transition kernel and we rely on the theory developed by Mitrophanov (2005) and Johndrow et al. (2015a)

to assess the approximating error for the posterior. The k-Nearest-Neighbor (kNN) method is used to integrate past observations into the transition kernel. The main advantage of kNN is that it is uniformly strongly consistent which guarantees that for a large enough chain history, we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC as shown in Section 

7.

3 Approximated ABC-MCMC (AABC-MCMC)

In this section we describe an ABC-MCMC algorithm that utilizes past simulations to significantly improve computational efficiency. As noted previously, the ABC-MCMC with threshold targets the density

(7)

where with and . Denote and note that if were known for every then we could run an MH-MCMC chain with invariant target density proportional to . Alas, is almost always unknown and unbiased estimates can be computationally expensive or statistically inefficient. We build an alternative approach that relies on consistent estimates of that rely on the chain’s past history, are much cheaper to compute, and require a new theoretical treatment.

To fix ideas, suppose that at time we generate the proposal and suppose that at iteration , all the proposals , regardless whether they were accepted or rejected, along with corresponding distances are available for . This past history is stored in the set . Given a new proposal , we generate and compute . Set , , and estimate using

(8)

where are weights and is a decreasing function. We discuss a couple of choices for the function below.

Remark 1: Note that if some of the past proposals have been accepted, then the Markovian property of the chain is violated since the acceptance probability does not depend solely on the current state, but also on the past ones. We defer the theoretical considerations for dealing with adaptation in the context of perturbed Markov chains to a future communication. Below, we modify slightly the construction above while respecting the core idea.

In order to separate the samples used as proposals from those used to estimate in (8), we will generate at each time two independent samples and from . Then, the history collects the samples while the proposal used to update the chain is the sample. With this notation (8) becomes

(9)

where and .

1:Given with summary statistics , sequence , constant , burn-in period , required number of samples , initial simulations with , and .
2:Define .
3:Find initial with simulated such that .
4:Let be expectation of prior distribution and where is covariance of the prior .
5:Define, and define sequence
6:for  do
7:     if  for some  then
8:         Set .
9:         Find as mean of and where is covariance of .
10:     end if
11:     Generate and .
12:     Simulate and let
13:     Add the dual simulated pair of parameter and discrepancy to the past set: and set .
14:     .
15:     .
16:     Calculate
17:     Generate independent
18:     if  then
19:         .
20:     else
21:         .
22:     end if
23:end for
Algorithm 4 Approximated ABC MCMC (AABC-MCMC)

Remark 2: Even if is greater than (which would trigger automatically rejection for ABC-MCMC), suppose there is a close neighbour of whose corresponding is less than . Then the estimated

will not be zero and there is a chance of moving to a different state. Intuitively, this is expected to reduce the variance of the accepting probability estimate.

Remark 3: When comparing the unbiased estimator

(10)

with the consistent estimator

(11)

we hope to outperform both the small and large cases in (10). For the small , we expect to reduce the variability in our acceptance probabilities, while for larger we expect to reduce the computational costs without sacrificing much in terms of precision.

Since the proposed weighted estimate is no longer an unbiased estimator of , a new theoretical evaluation is needed to study the effect of perturbing the transition kernel on the statistical analysis. Central to the algorithm’s utility is the ability to control the total variation distance between the desired distribution of interest given in (7) and the modified chain’s target. As will be shown in Section 7, we rely on three assumptions to ensure that the chain would approximately sample from (7): 1) compactness of ; 2) uniform ergodicity of the chain using the true and 3) uniform convergence in probability of to as .

The k-Nearest-Neighbor (kNN) regression approach (Fix and Hodges, 1951; Biau and Devroye, 2015) has a property of uniform consistency (Cheng, 1984). Define (in our numerical experiments we have used ). Without loss of generality we relabel the elements of according to distance so that and corresponds to the smallest and largest among all distances , respectively. The kNN method sets to zero for all . For , we focus on the following two weighting schemes:

  1. The uniform kNN with for all ;

  2. The linear kNN with for so that the weight decreases from to as increases from to .

The kNN’s theoretical properties that are used to validate our sampler rely on independence between the pairs . Therefore, throughout the paper, we use an independent proposal in the MH sampler, i.e. and is Gaussian. The entire procedure is outlined in Algorithm 4.

To conclude, at the end of a simulation of size the MCMC samples are and the history used for updating the chain is . The two sequences are independent of one another, i.e. for any , the elements in are independent of the chain’s history up to time .

Note also that is required in order to determine the acceptance probability at step . In this case the -value may be updated if is small enough.

In the next section we extend the approximate MCMC construction to Bayesian Synthetic Likelihood. In Sections 5 and 6 we use numerical experiments to show that the proposed procedure generally improves the mixing of a chain.

4 BSL and Approximated BSL (ABSL)

An alternative approach to bypass the intractability of the sampling distribution is proposed by Wood (2010). His approach is based on the assumption that the conditional distribution for a user-defined statistic given is Gaussian with mean and covariance matrix . The Synthetic Likelihood (SL) procedure assigns to each the likelihood , where and denotes the density of a normal with mean and covariance . SL can be used for maximum likelihood estimation as in Wood (2010) or within the Bayesian paradigm as proposed by Drovandi et al. (2018) and Price et al. (2018). The latter work proposes to sample the approximate posterior generated by the Bayesian Synthetic Likelihood (BSL) approach, , using a MH sampler. Direct calculation of the acceptance probability is not possible because the conditional mean and covariance are unknown for any . However, both can be estimated based on statistics sampled from their conditional distribution given . More precisely, after simulating and setting , , one can estimate

(12)

so that the synthetic likelihood is

(13)
1:Given , number of simulations and required number of samples .
2:Get initial , estimate by simulating statistics given .
3:Define .
4:for  do
5:     Generate
6:     Estimate by simulating pseudo-data points and corresponding statistics given .
7:     Calculate .
8:     Calculate
9:     Independently generate
10:     if  then
11:         Set .
12:     else
13:         Set .
14:     end if
15:end for
Algorithm 5 Bayesian Synthetic Likelihood (BSL-MCMC)

The pseudo-code in Algorithm 5 shows the steps involved in the BSL-MCMC sampler. Since each MH step requires calculating the likelihood ratios between two SLs calculated at different parameter values, one can anticipate the heavy computational load involved in running the chain for thousands of iterations, especially if sampling data is expensive. Note that even though these estimates for the conditional mean and covariance are unbiased, the estimated value of the Gaussian likelihood is biased and therefore pseudo marginal MCMC theory is not applicable. Price et al. (2018) presented an unbiased Gaussian likelihood estimator and have empirically showed that using biased and unbiased estimates generally perform similarly. They have also remarked that this procedure is very robust to the number of simulations , and demonstrate empirically that using to produce similar results.

The normality assumption for summary statistics is certainly a strong assumption which may not hold in practice. Following up on this, An et al. (2018) relaxed the jointly Gaussian assumption to Gaussian copula with non-parametric marginal distribution estimates (NONPAR-BSL), which includes joint Gaussian as a special case, but is much more flexible. The estimation is based, as in the BSL framework, on pseudo-data samples simulated for each .

Clearly, BSL is computationally costly and requires many pseudo-data simulations to obtain Monte Carlo samples of even moderate sizes. To accelerate BSL-MCMC we propose to store and utilize past simulations of to approximate for any , making the whole procedure computationally faster. As in the previous section, we separate the simulation used to update the chain from the simulation used to enrich the history of the chain. The approach can trivially be extended for NONPAR-BSL but we do not pursue it further here. K-Nearest-Neighbor (kNN) method is used as a non-parametric estimation tool for different quantities described above. As will be shown in Section 7 with the proposed method we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC.

Approximated Bayesian Synthetic Likelihood (ABSL)

Setting and assuming conditional normally for this statistic the objective is to sample from

(14)

During the MCMC run, the proposal is generated from and the history is enriched using , and . Then for any

, the conditional mean and covariance of statistics vector is estimated using past samples as weighted averages:

(15)
1:Given , constant , burn-in period , number of adaption points during burn-in , required number of samples , initial pseudo data simulations with , and .
2:Get initial .
3:Let be expectation of prior distribution and where is covariance of the prior .
4:Define, and define sequence
5:for  do
6:     if  for some  then
7:         Find as mean of and where is covariance of .
8:     end if
9:     Generate
10:     Simulate and let for .
11:     Add simulated parameter and statistics to the past set: and set .
12:     Calculate:
13:     Calculate:
14:     .
15:     .
16:     Calculate .
17:     Generate independent
18:     if  then
19:         .
20:     else
21:         .
22:     end if
23:end for
Algorithm 6 Approximated Bayesian Synthetic Likelihood (ABSL)

Again the weights are functions of distance between proposed value and parameters’ values from the past , where is the Euclidean norm. To get appropriate convergence properties we use the kNN approach to calculate weights , where only the closest values to are used in the calculation of conditional means and covariances. As in the previous section, uniform (U) and linear (L) weights are used. Once again we expect that the use of the chain’s cumulated history can significantly speed up the whole procedure since it relieves the pressure to simulate many data sets at every step. The use of the independent Metropolis kernel ensures that contains independent draws which is required for theoretical validation in Section 7. We will also show that under mild assumptions and if is compact, the proposed algorithm exhibits good error control properties. In order to get a rough idea about the proposal, we propose to perform finite adaptation with adaptation points during the burn-in period. Algorithm 6 outlines the proposed Approximated BSL (ABSL) method. For the simulations we report on in the next section, we have used and to be consistent with AABC-MCMC, ABC-MCMC-M and ABC-SMC procedures.

5 Simulations

We analyze the following statistical models:

  1. Simple Moving Average model of lag 2;

  2. Ricker’s model;

  3. Stochastic volatility with Gaussian emission noise;

  4. Stochastic volatility with -Stable errors.

For all these models, the simulation of pseudo data for any parameter is simple and computationally fast, but the use of standard estimation methods can be quite challenging, especially for (R), (SVG) and (SVS). For ABC samplers before running a MCMC chain we estimate initial and final thresholds and (15 equal steps in log scale were used for all models) and the matrix which is used to calculate the discrepancy .
To estimate , we use the following steps:

  • Set

  • Repeat steps I and II below for times (=3 in our implementations)

    I

    Generate 500 pairs from and calculate discrepancies with

    II

    Let with smallest discrepancy. Finally generate 100 pseudo-data from , compute corresponding summary statistics and set to be the inverse of covariance matrix of .

We set

to be the 5% quantile of the observed discrepancies. The final

is obtained by implementing a Random Walk version of Algorithm 3 and decreasing gradually by setting as the 1% quantile of discrepancies corresponding to accepted samples generated between adaption points and , for .

The number of simulations was set to 500 and 100 just for computational convenience and is not driven by any theoretical arguments.

We compare the following algorithms:

  1. Standard Sequential Monte Carlo for ABC;

  2. The modified ABC-MCMC algorithm which updates and the random walk Metropolis transition kernel during burn-in;

  3. The modified ABC-MCMC algorithm which updates and the Independent Metropolis transition kernel during burn-in;

  4. Modified BSL where it adapts the random walk Metropolis transition kernel during burn-in;

  5. Modified BSL where it adapts the independent Metropolis transition kernel during burn-in;

  6. Approximated ABC-MCMC with independent proposals and uniform (U) weights;

  7. Approximated ABC-MCMC with independent proposals and linear (L) weights;

  8. Approximated BSL-MCMC with independent proposals and uniform (U) weights;

  9. Approximated BSL-MCMC with independent proposals and linear (L) weights.

  10. Likelihood is computable and posterior samples are generated using an MCMC algorithm that is example-specific.

For SMC 500 particles were used, total number of iterations for ABC-RW, ABC-IS, AABC-U, AABC-L, ABSL-U and ABSL-L is 50000 with 10000 for burn-in. Since BSL-RW and BSL-IS are much more computationally expensive, total number of iterations were fixed at 10000 with 2000 burn-in and 50 pseudo-data simulations for every proposed parameter value (i.e. ). The Exact chain was run for 5000 iterations and 2000 for burn-in. It must be pointed out that all approximate samplers are based on the same summary statistics, same discrepancy function and the same sequence, so that they all start with the same initial conditions.

For more reliable results we compare these sampling algorithms under data set replications. In this study we set the number of replicates , so that for each model 100 data sets were generated and each one was analyzed with the described above sampling methods. Various statistics and measures of efficiency were calculated for every model and data set, letting represent posterior samples from replicate , iteration and parameter component and similarly