Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets

Bayesian inference via standard Markov Chain Monte Carlo (MCMC) methods such as Metropolis-Hastings is too computationally intensive to handle large datasets, since the cost per step usually scales like O(n) in the number of data points n. We propose the Scalable Metropolis-Hastings (SMH) kernel that exploits Gaussian concentration of the posterior to require processing on average only O(1) or even O(1/√(n)) data points per step. This scheme is based on a combination of factorized acceptance probabilities, procedures for fast simulation of Bernoulli processes, and control variate ideas. Contrary to many MCMC subsampling schemes such as fixed step-size Stochastic Gradient Langevin Dynamics, our approach is exact insofar as the invariant distribution is the true posterior and not an approximation to it. We characterise the performance of our algorithm theoretically, and give realistic and verifiable conditions under which it is geometrically ergodic. This theory is borne out by empirical results that demonstrate overall performance benefits over standard Metropolis-Hastings and various subsampling algorithms.

Authors

• 3 publications
• 1 publication
• 8 publications
• 10 publications
• 44 publications
• Multilevel Monte Carlo for Scalable Bayesian Computations

Markov chain Monte Carlo (MCMC) algorithms are ubiquitous in Bayesian co...
09/15/2016 ∙ by Mike Giles, et al. ∙ 0

• Efficient MCMC Sampling with Dimension-Free Convergence Rate using ADMM-type Splitting

Performing exact Bayesian inference for complex models is intractable. M...
05/23/2019 ∙ by Maxime Vono, et al. ∙ 0

• Scaling Bayesian inference of mixed multinomial logit models to very large datasets

Variational inference methods have been shown to lead to significant imp...
04/11/2020 ∙ by Filipe Rodrigues, et al. ∙ 0

• The promises and pitfalls of Stochastic Gradient Langevin Dynamics

Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC a...
11/25/2018 ∙ by Nicolas Brosse, et al. ∙ 0

• Newtonian Monte Carlo: single-site MCMC meets second-order gradient methods

Single-site Markov Chain Monte Carlo (MCMC) is a variant of MCMC in whic...
01/15/2020 ∙ by Nimar S. Arora, et al. ∙ 0

• Bayesian inference based process design and uncertainty analysis of simulated moving bed chromatographic systems

Prominent features of simulated moving bed (SMB) chromatography processe...
11/27/2019 ∙ by Qiao-Le He, et al. ∙ 0

• Efficient Bernoulli factory MCMC for intractable likelihoods

Accept-reject based Markov chain Monte Carlo (MCMC) algorithms have trad...
04/16/2020 ∙ by Dootika Vats, et al. ∙ 0

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Bayesian inference is concerned with the posterior distribution , where denotes parameters of interest and are observed data. We assume the prior admits a Lebesgue density and that the data are conditionally independent given with likelihoods , which means

 p(θ|y1:n)∝p(θ)n∏i=1p(yi|θ).

In most scenarios of interest, does not admit a closed-form expression and so we must resort to approximation, for which Markov Chain Monte Carlo (MCMC) provides a widely applicable framework. However, standard MCMC schemes in this context can become very computationally expensive for large datasets. For example, the celebrated Metropolis–Hastings (MH) algorithm requires the computation of a likelihood ratio at each iteration. A direct implementation of this algorithm thus requires computational cost per step, which is prohibitive for large .

Many ideas for mitigating this cost have been suggested. We refer the reader to (Bardenet et al., 2017) for a recent comprehensive review. Broadly speaking these approaches are distinguished by whether or not they exactly preserve the true posterior as the invariant distribution of the Markov chain produced. Approximate methods that have been proposed include divide-and-conquer schemes, which run parallel MCMC chains on a partition of the data (Neiswanger et al., 2013; Scott et al., 2016)

. Other approaches consist of replacing the likelihood ratio in MH with an approximation computed from a subsample of observations. The error introduced can be controlled heuristically using central limit theorem approximations

(Korattikara et al., 2014; Quiroz et al., 2018) or rigorously using concentration inequalities (Bardenet et al., 2014, 2017). Another popular class of schemes is based on Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011; Dubey et al., 2016; Baker et al., 2018; Chatterji et al., 2018), which is a time-discretized Langevin dynamics where the gradient of the log-likelihood is approximated by subsampling. SGLD is usually implemented using a fixed step-size discretization, which does not exactly preserve the posterior distribution.

In addition to these approximate methods, several MCMC methods exist that do preserve the target as invariant distribution while only requiring access to a subset of the data at each iteration. However, various restrictions of these approaches have so far limited their widespread use. Firefly Monte Carlo (Maclaurin & Adams, 2014) considers an extended target that can be evaluated using a subset of the data at each iteration, but requires the user specify global lower bounds to the likelihood factors that can be difficult to derive. It is as yet also unclear what the convergence properties of this scheme are. Banterle et al. (2015) proposed a delayed acceptance scheme that uses a factorized version of the MH acceptance probability. This allows a proposal to be rejected without necessarily computing every likelihood term, but still requires evaluating each term in order to accept a proposal. Finally, various non-reversible continuous-time MCMC schemes based on Piecewise Deterministic Markov Processes have been proposed which, when applied to large-scale datasets (Bouchard-Côté et al., 2018; Bierkens et al., 2018), only require evaluating the gradient of the log-likelihood for a subset of the data. However, these schemes can be difficult to understand theoretically, falling outside the scope of existing geometric ergodicity results, and can be challenging to implement in practice.

In this paper we present a novel subsampling MH-type scheme that exactly preserves the posterior as the invariant distribution while still enjoying attractive theoretical properties and being straightforward to implement and tune. This scheme relies on a combination of a factorized MH acceptance probability (Ceperley, 1995; Christen & Fox, 2005; Banterle et al., 2015; Michel et al., 2017; Vanetti et al., 2018) and fast methods for sampling non-homogeneous Bernoulli processes (Shanthikumar, 1985; Devroye, 1986; Fukui & Todo, 2009; Michel et al., 2017; Vanetti et al., 2018), which allows iterating without computing every likelihood factor. The combination of these ideas has proven useful for some physics models (Michel et al., 2017), but a naïve application is not efficient for large-scale Bayesian inference. Our contribution here is an MH-style MCMC kernel that realises the potential computational benefits of this method in the Bayesian setting. We refer to this kernel as Scalable Metropolis-Hastings (SMH) and, in addition to empirical results, provide a rigorous theoretical analysis of its behaviour under realistic and verifiable regularity assumptions. In particular, we show that our approach only requires processing on average or even at each iteration as illustrated in Figure 1, has a non-vanishing average acceptance probability in the stationary regime, and is geometrically ergodic.

Key to our approach is the use of control variate ideas, which allow us to exploit the concentration around the mode frequently observed for posterior distributions with large datasets. Control variate ideas based on posterior concentration have been used successfully for large-scale Bayesian analysis in numerous recent contributions (Dubey et al., 2016; Bardenet et al., 2017; Bierkens et al., 2018; Baker et al., 2018; Chatterji et al., 2018; Quiroz et al., 2018). In our setting, this may be understood as making use of a computationally cheap approximation of the posterior.

The remainder of this work is organized as follows. In Section 2

we review the factorized MH method and show how to implement this using procedures for fast simulation of Bernoulli random variables. We also provide original sufficient conditions for ensuring the method is geometrically ergodic. In Section

3 we show how this idea can be exploited in the Bayesian big data setting, detailing our proposed factorization scheme and giving realistic and verifiable conditions under which it can be expected to yield a benefit. Finally in Section 4 we show empirically that the predicted computational gains are indeed achieved for two real-world models. The Supplement includes all our proofs as well as a guide to our notation in Section A.

2 Factorised Metropolis-Hastings

In this section we review the use of a factorised acceptance probability inside an MH-style algorithm. For now we assume a completely generic target before specialising to the Bayesian big data setting in the next section.

2.1 Transition kernel

Assume we factorise our target and our proposal like

 π(θ)∝m∏i=1πi(θ)q(θ,θ′)∝m∏i=1qi(θ,θ′)

for some and some choice of non-negative functions and . These factors are not themselves required to be integrable; for instance, we may take any . The Factorised Metropolis-Hastings (FMH) kernel is defined for any and measurable by

 PFMH(θ,A):=(1−∫q(θ,θ′)αFMH(θ,θ′)dθ′)1A(θ)+∫Aq(θ,θ′)αFMH(θ,θ′)dθ′, (1)

where the FMH acceptance probability is defined

 αFMH(θ,θ′):=m∏i=11∧πi(θ′)qi(θ′,θ)πi(θ)qi(θ,θ′)=:αFMHi(θ,θ′). (2)

It is straightforward and well-known that is -reversible; see Section B.1 in the Supplement for a proof. Factorised acceptance probabilities have appeared numerous times in the literature and date back at least to (Ceperley, 1995). The MH acceptance probability and kernel correspond to and when .

2.2 Poisson subsampling implementation

The acceptance step of can be implemented by sampling directly independent Bernoulli trials with success probability , and returning if every trial is a failure. Since we can reject as soon as a single success occurs, this allows us potentially to reject without computing each factor at each iteration (Christen & Fox, 2005; Banterle et al., 2015).

However, although this can lead to efficiency gains in some contexts, it remains of limited applicability for Bayesian inference with large datasets since we are still forced to compute every factor whenever we accept a proposal. It was realized independently by Michel et al. (2017) and Vanetti et al. (2018) that if one has access to lower bounds on , hence to an upper bound on , then techniques for fast simulation of Bernoulli random variables can be used that potentially avoid this problem. One such technique is given by the discrete-time thinning algorithms introduced in (Shanthikumar, 1985); see also (Devroye, 1986, Chapter VI Sections 3.3-3.4). This is used in (Michel et al., 2017).

We use here an original variation of a scheme developed in (Fukui & Todo, 2009). Denote

 λi(θ,θ′):=−log(αFMHi(θ,θ′)),

and assume we have the bounds

 λi(θ,θ′)≤φ(θ,θ′)ψi=:¯¯¯λi(θ,θ′) (3)

for nonnegative . This condition holds for a variety of statistical models: for instance, if is log-Lipschitz and is symmetric with (say) , then

 λi(θ,θ′)≤Ki∥θ−θ′∥. (4)

This case illustrates that (3) is usually a local constraint on the target and therefore not as strenuous as the global lower-bounds required by Firefly (Maclaurin & Adams, 2014). We exploit this to provide a methodology for producing and mechanically when we consider Bayesian targets in Section 3. Letting , it is straightforward to show that if

• [label=, leftmargin=*]

• independently for

then where (and if ). See Proposition C.1 in the supplement for a proof. These steps may be interpreted as sampling a discrete Poisson point process with intensity on via thinning (Devroye, 1986). Thus, to perform the FMH acceptance step, we can simulate these and check whether each is .

We may exploit (3) to sample each and in time per MCMC step as after paying some once-off setup costs. Note that

 ¯¯¯λ(θ,θ′)=φ(θ,θ′)m∑i=1ψi, (5)

so that we may compute in time per iteration by simply evaluating if we pre-compute ahead of our run. This incurs a one-time cost of , but assuming our run is long enough this will be negligible overall. Similarly, note that

 ¯¯¯λi(θ,θ′)¯¯¯λ(θ,θ′)=ψi∑mj=1ψj,

so that does not depend on . Thus, we can sample each in time using Walker’s alias method (Walker, 1977; Kronmal & Peterson, 1979) having paid another once-off cost.

Algorithm 1 shows how to implement using this approach. Observe that if we are guaranteed not to evaluate every target factor even if we accept the proposal . Of course, since is random, in general it is not obvious that will necessarily hold on average, and indeed this will not be so for a naïve factorisation. We show in Section 3 how to use Algorithm 1 as the basis of an efficient subsampling method for Bayesian inference.

In many cases we will not have bounds of the form (3) for every factor. However, Algorithm 1 can still be useful provided the computational cost for computing these extra factors, which we refer to as solitary factors, is . In this case we can directly simulate a Bernoulli trial for each solitary factor, which by assumption does not change the asymptotic complexity of this method.

2.3 Geometric ergodicity

We consider now the theoretical implications of using rather than . We refer the reader to Section B.2 in the supplement for a review of the relevant definitions and theory of Markov chains. It is straightforward to show and well-known that the following holds.

Proposition 2.1.

For all , .

See Section B in the supplement for a proof. As such, we do not expect FMH to enjoy better convergence properties than MH. Indeed, Proposition 2.1

immediately entails that FMH produces ergodic averages of higher asymptotic variance than standard MH

(Peskun, 1973; Tierney, 1998). Moreover can fail to be geometrically ergodic even when is, as noticed in (Banterle et al., 2015). Geometric ergodicity is a desirable property of MCMC algorithms because it ensures the central limit theorem holds for some ergodic averages (Roberts & Rosenthal, 1997, Corollary 2.1)

. The central limit theorem in turn is the foundation of principled stopping criteria based on Monte Carlo standard errors

(Jones & Hobert, 2001).

To address the fact that might not be geometrically ergodic, we introduce the Truncated FMH (TFMH) kernel which is obtained by simply replacing in (1) the term with the acceptance probability

 αTFMH(θ,θ′):={αFMH(θ,θ′),¯¯¯λ(θ,θ′)

for some choice of . Observe that FMH is a special case of TFMH with . When is symmetric in and , Proposition B.3 in the Supplement shows that is still -reversible. The following theorem shows that under mild conditions TFMH inherits the desirable convergence properties of MH.

Theorem 2.1.

If is -irreducible, aperiodic, and geometrically ergodic, then is too if

 δ:=inf¯¯¯λ(θ,θ′)0. (7)

In this case, , and for

 var(f,PTFMH)≤(δ−1−1)var(f,π)+δ−1var(f,PMH).

Here denotes the spectral gap and the asymptotic variance of the ergodic averages of . See Section B.2 in the supplement for full definitions and a proof. Proposition B.1 in the supplement shows that , and hence (7) quantifies the worst-case cost we pay for using the FMH acceptance probability rather than the MH one. The condition (7) is easily seen to hold in the common case that each is bounded away from and on , which is a fairly weak requirement when .

Recall from the previous section that requires computing factors for a given . In this way, TFMH yields the additional benefit of controlling the maximum expected number of factors we will need to compute via the choice of . An obvious choice is to take , which ensures we will not compute more factors for FMH than for MH on average. Thus, overall, TFMH yields the computational benefits of when our bounds (3) are tight (usually near the mode), and otherwise falls back to MH as a default (usually in the tails).

3 FMH for Bayesian big data

We now consider the specific application of FMH to the problem of Bayesian inference for large datasets, where . It is frequently observed that such targets concentrate at a rate around the mode as , in what is sometimes referred to as the Bernstein-von Mises phenomenon. We describe here how to leverage this phenomenon to devise an effective subsampling algorithm based on Algorithm 1. Our approach is based on control variate ideas similar to (Dubey et al., 2016; Bardenet et al., 2017; Bierkens et al., 2018; Baker et al., 2018; Chatterji et al., 2018; Quiroz et al., 2018). We emphasise that all these techniques also rely on a posterior concentration assumption but none of them only requires processing data points per iteration as we do.

To see why this approach is needed, observe that the most natural factorisations of the posterior have . This introduces a major pitfall: each new factor introduced can only lower the value of , which in the aggregate can therefore mean as .

Consider heuristically a naïve application of Algorithm 1 to . Assuming a flat prior for simplicity, the obvious factorisation takes and each . Suppose the likelihoods are log-Lipschitz and that we use the bounds (4) derived above. For smooth likelihoods these bounds will be tight in the limit as . Consequently, if we scale as to match the concentration of the target, then since

 ¯¯¯λ(θ,θ′)=∥θ−θ′∥=Ω(1/√n)n∑i=1Ki=Ω(n)=Ω(√n).

Recall that Algorithm 1 requires the computation of at most factors, and hence in this case we do obtain a reduced expected cost per iteration of as opposed to . Nevertheless, we found empirically that the increased asymptotic variance produced by the decaying acceptance probability entails an overall loss of performance compared with standard MH. We could consider using a smaller stepsize such as which would give a stable acceptance probability, but then our proposal not match the concentration of the posterior. We again found this increases the asymptotic variance to the extent that it negates the benefits of subsampling overall.

3.1 Scalable Metropolis-Hastings

Our approach is based on controlling , which ensures both a low computational cost and a large acceptance probability. We assume an initial factorisation

 π(θ)∝p(θ)n∏i=1p(yi|θ)∝m∏i=1˜πi(θ) (8)

for some (not necessarily equal to ) and (e.g. using directly the factorisation of prior and likelihoods). Let

 Ui(θ):=−log˜πi(θ)U(θ):=m∑i=1Ui(θ).

We choose some fixed not depending on that is near the mode of as in (Dubey et al., 2016; Bardenet et al., 2017; Bierkens et al., 2018; Baker et al., 2018; Chatterji et al., 2018; Quiroz et al., 2018). Assuming sufficient differentiability, we then approximate with a -th order Taylor expansion around , which we denote by

 ˆUk,i(θ)≈Ui(θ).

We also define

 ˆπk,i(θ):=exp(−ˆUk,i(θ))≈˜πi(θ).

In practice we are exclusively interested in the cases and , which correspond to first and second-order approximations respectively. Explicitly, in these cases

 ˆU1,i(θ) = U(ˆθ)+∇Ui(ˆθ)⊤(θ−ˆθ), ˆU2,i(θ) = ˆU1,i(θ)+12(θ−ˆθ)⊤∇2Ui(ˆθ)(θ−ˆθ),

where denotes the gradient and the Hessian. Letting

 ˆUk(θ):=m∑i=1ˆUk,i(θ)ˆπk(θ):=exp(−ˆUk(θ)),

additivity of the Taylor expansion further yields

 ˆU1(θ) = U(ˆθ)+∇U(ˆθ)⊤(θ−ˆθ) ˆU2(θ) = ˆU1(θ)+12(θ−ˆθ)⊤∇2U(ˆθ)(θ−ˆθ). (9)

Thus when (i.e. symmetric positive-definite), is seen to be a Gaussian approximation to around the (approximate) mode .

We use the to define the Scalable Metropolis-Hastings (SMH or SMH-) acceptance probability

 αSMHk(θ,θ′):=(1∧ˆπk(θ′)q(θ′,θ)ˆπk(θ)q(θ,θ′))m∏i=11∧˜πi(θ′)ˆπk,i(θ)ˆπk,i(θ′)˜πi(θ). (10)

Note that SMH- is a special case of FMH with factors given by

 π=ˆπk=:πm+1m∏i=1˜πiˆπk,i=:πiq=q=:qm+1m∏i=11=:qi (11)

and hence defines a valid acceptance probability. (Note that is not integrable, but recall this is not required of FMH factors.) We could consider any factorisation of , but we will not make use of this generality.

can be computed in constant time after precomputing the relevant partial derivatives at before our MCMC run. This allows us to treat as a solitary factor. For the remaining factors we have

 λi(θ,θ′) = −log(1∧˜πi(θ′)ˆπk,i(θ)˜πi(θ)ˆπk,i(θ′)).

We can obtain a bound of the form (3) provided is -times continuously differentiable. In this case, if we can find constants

 ¯¯¯¯Uk+1,i≥supθ∈Θ|β|=k+1|∂βUi(θ)|, (12)

(here is multi-index notation; see Section A of the supplement) it follows that

 ¯¯¯λ(θ,θ′):=(∥θ−ˆθ∥k+11+∥θ′−ˆθ∥k+11)=:φ(θ,θ′)m∑i=1¯¯¯¯Uk+1,i(k+1)!=:ψi (13)

defines an upper bound of the required form (5). See Proposition D.1 in the Supplement for a derivation. Observe this is symmetric in and and therefore can be used to define a truncated version of SMH as described in Section 2.3.

Although we concentrate on Taylor expansions here, other choices of may be possible. In some cases it may be possible to have log-Lipschitz or log-concave, which might allow for better bounds. However, Taylor expansions have the advantage of generality and (13) is sufficiently tight for our purposes.

Heuristically, if the posterior concentrates like , if we scale our proposal like , and if is not too far (specifically ) from the mode, then both and will be , and will be . If moreover , then the summation will be and hence overall . When this is and when this is , which entails a substantial improvement over the naïve approach. In particular, we expect stable acceptance probabilities in both cases, constant expected cost in for , and indeed decreasing cost for . We make this argument rigorous in Theorem 3.1 below.

Beyond what is already needed for MH, and are all the user must provide for our method. In practice neither of these seems problematic in typical settings. We have found deriving to be a fairly mechanical procedure, and give examples for two models in Section 4. Likewise, while computing does entail some cost, we have found that standard gradient descent finds an adequate result in time negligible compared with the full MCMC run.

3.2 Choice of proposal

We now consider the choice of proposal and its implications for the acceptance probability. As mentioned, it is necessary to ensure that, roughly speaking, to match the concentration of the target. In this section we describe heuristically how to ensure this. Theorem 3.1 below and Section F.1.2 in the supplement give precise statements of what is required.

Two main classes of are of interest to us. When is symmetric, (10) simplifies to

 αSMHk(θ,θ′)=(1∧ˆπk(θ′)ˆπk(θ))m∏i=11∧˜πi(θ′)ˆπk,i(θ)˜πi(θ)ˆπk,i(θ′). (14)

We can realise this with the correct scaling with for example

 q(θ,θ′)=Normal(θ′∣θ,σ2nId), (15)

where is fixed in . Alternatively, we can more closely match the covariance of our proposal to the covariance of our target with

 q(θ,θ′)=Normal(θ′∣θ,σ2[∇2U(ˆθ)]−1). (16)

Under usual circumstances is approximately (since in general this will include a non-flat prior term) proportional to the inverse observed information matrix, and hence the correct scaling is achieved automatically. See Section F.1.2 in the supplement for more details.

We can improve somewhat on a symmetric proposal if we choose to be -reversible in the sense that

 ˆπk(θ)q(θ,θ′)=ˆπk(θ′)q(θ′,θ)

for all ; see, e.g., (Tierney, 1994; Neal, 1999; Kamatani, 2018). In this case we obtain

 αSMHk(θ,θ′)=m∏i=11∧˜πi(θ′)ˆπk,i(θ)˜πi(θ)ˆπk,i(θ′).

Note that using a -reversible proposal allows us to drop the first term in (14), and hence obtain a higher acceptance probability for the same . Moreover, when , we see from (9) that a -reversible proposal corresponds to an MCMC kernel that targets a Gaussian approximation to , and may therefore be more suited to the geometry of than a symmetric one.

We now consider how to produce -reversible proposals. For of the form

 q(θ,θ′)=Normal(θ′∣Aθ+b,C)

where with and , Theorem E.1 in the supplement gives necessary and sufficient conditions for and -reversibility. Specific useful choices that satisfy these conditions and ensure the correct scaling are then as follows. For we can use for example

 A=Idb=−σ2n∇U(ˆθ)C=σnId (17)

for some , where

is the identity matrix. For

, assuming (which will hold if is sufficiently close to the mode), we can use a variation of the preconditioned-Crank Nicholson proposal (pCN) (Neal, 1999) defined by taking

 A=√ρIdC=(1−ρ)[∇2U(ˆθ)]−1 b=(1−√ρ)(ˆθ−[∇2U(ˆθ)]−1∇U(ˆθ))

where . When this corresponds to an independent Gaussian proposal: . Note that this can be re-interpreted as the exact discretization of an Hamiltonian dynamics for the Gaussian .

3.3 Performance

We now show rigorously that SMH addresses the issues of a naive approach and entails an overall performance benefit. In our setup we assume some unknown data-generating distribution , with data . We denote the (random) targets by , for which we assume a factorisation (8) involving terms. We denote the mode of by

, and our estimate of the mode by

. Observe that is a deterministic function of the data, and we assume this holds for also. In general may depend on additional randomness, say

, if for instance it is the output of a stochastic gradient descent algorithm. In that case, our statements should involve conditioning on

but are otherwise unchanged.

For the case of data points we denote the proposal by , and model the behaviour of our chain at stationarity by considering and sampled conditionally independently of all other randomness given . The following theorem allows us to show that both the computational cost and the acceptance probability remain stable in this case as grows large. See Section F in the supplement for a proof.

Theorem 3.1.

Suppose each is -times continuously differentiable, each , and . Likewise, assume each of , , and is in , and each of , , and is as . Then defined by (13) satisfies

 E[¯¯¯λ(θ(n),θ′(n))|Y1:n]=OP0(n(1−k)/2).

For given and , recall that the method described in Section 2.2 requires the computation of at most factors. Under the conditions of Theorem 3.1, we therefore have

 E[N(n)|Y1:n] = E[E[N(n)|θ(n),θ′(n),Y1:n]|Y1:n] = E[¯¯¯λ(θ(n),θ′(n))|Y1:n] = OP0(n(1−k)/2).

In other words, with arbitrarily high probability with respect to the data-generating distribution, the average computational cost per iteration is when using a first-order approximation, and for a second-order one.

This result also ensures that the acceptance probability for SMH does not vanish as . Denoting by our approximation in the case of data points, observe that

 0≤E[−logαFMH(θ(n),θ′(n))|Y1:n]≤E[−log(1∧ˆπ(n)k(θ′(n))q(n)(θ′(n),θ(n))ˆπ(n)k(θ(n))q(n)(θ(n),θ′(n)))|Y1:n]+E[¯¯¯λ(θ(n),θ′(n))|Y1:n]

Here the second right-hand side term is by Theorem 3.1. For a -reversible proposal the first term is simply , while for a symmetric proposal Theorem F.2 in the supplement shows it is . In either case, we see that the acceptance probability is stable in the limit of large . In the case of a -reversible proposal, we in fact have .

Note that both these implications also apply if we use a truncated version of SMH as per Section 2.3. This holds since in general TFMH ensures both that the expected number of factor evaluations is not greater than for FMH, and that the acceptance probability is not less than for FMH.

The conditions of Theorem 3.1 hold in realistic scenarios. The integrability assumptions are mild and mainly technical. We will see in Section 4 that in practice is usually a function of , in which case

 E[m(n)∑i=1¯¯¯¯Uk,i|Y1:n]=n∑i=1¯¯¯¯Uk+1(Yi)=OP0(n)

by the law of large numbers. In general, we might also have one

for the prior also, but the addition of this term still gives the same asymptotic behaviour.

The condition essentially states that the posterior must concentrate at rate around the mode. This is a consequence of standard, widely-applicable assumptions that are used to prove BvM, which we describe in depth in Section F.1.1 of the supplement. Note in particular that we do not require our model to be well-specified (i.e. we do not need for some ). The remaining two conditions correspond to the heuristic conditions given in Section 3.1. In particular, the proposal should scale like . We show Section F.1.2 of the supplement that this condition holds for the proposals described in Section 3.2. Likewise, should be distance from the mode. When the posterior is log-concave it can be shown this holds for instance for an SGD algorithm after performing single pass through the data (Baker et al., 2018, Section 3.4). In practice, we interpret this condition to mean that should be as close as possible to , but that some small margin for error is acceptable.

4 Experimental results

In this section we apply SMH to Bayesian logistic regression. An application to robust linear regression is also provided in Section

G.2 of the supplement. We chose these models due to the availability of lower bounds on the likelihoods required by Firefly. In both cases we write our covariates as and responses as , and our target is the posterior

 π(θ)=p(θ|x1:n,y1:n)∝p(θ)n∏i=1p(yi|θ,xi).

For logistic regression we have , , and

 p(yi|θ,xi) = Bernoulli(yi|11+exp(−θ⊤xi)).

For simplicity we assume a flat prior , which allows factorising like (8) with and . It is then easy to show that

 Ui(θ)=−log˜πi(θ)=log(1+exp(θTxi))−yiθ⊤xi.

We require upper bounds of the form (12) for these terms. A straightforward calculation shows that

 ¯¯¯¯U2,i=14max1≤j≤d|xij|2¯¯¯¯U3,i=16√3max1≤j≤d|xij|3

will do. See Section G.1 in the supplement for the details.

In our experiments we took . For both SMH-1 and SMH-2 we used truncation as described in Section 2.3, with . Our estimate of the mode was computed using stochastic gradient descent. We compare our algorithms to standard MH, Firefly (Maclaurin & Adams, 2014) and Zig-Zag (Bierkens et al., 2018), which all have the exact posterior as the invariant distribution. We used the MAP-tuned variant of Firefly (which also makes use of ) with implicit sampling (this uses an algorithmic parameter , the optimal choice of is an open question) and the likelihood lower bounds specified in (Maclaurin & Adams, 2014, Section 3.1). For MH, SMH-1, SMH-2, and Firefly, we use the proposal (16).

Figure 1 (in Section 1) illustrates the average number of likelihood evaluations per iteration and confirms the predictions of Theorem 3.1. Figure 2 displays the integrated autocorrelation times (IACT) for the posterior mean estimate rescaled by execution time, and demonstrates that eventually SMH-2 significantly outperforms competing techniques. Figure 3 shows the performance of the pCN proposal when varying . As the target concentrates, the Gaussian approximation of the target improves and an independent proposal () becomes optimal. Finally, we also illustrate the average acceptance rate when varying in Figure 4.

References

• Andrieu et al. (2013) Andrieu, C., Lee, A., and Vihola, M. Uniform ergodicity of the iterated conditional smc and geometric ergodicity of particle gibbs samplers. arXiv preprint arXiv:1312.6432, 2013.
• Baker et al. (2018) Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. Control variates for stochastic gradient MCMC. Statistics and Computing, 2018. to appear.
• Banterle et al. (2015) Banterle, M., Grazian, C., Lee, A., and Robert, C. P. Accelerating Metropolis-Hastings algorithms by delayed acceptance. arXiv preprint arXiv:1503.00996, 2015.
• Bardenet et al. (2014) Bardenet, R., Doucet, A., and Holmes, C. Towards scaling up Markov chain Monte Carlo: An adaptive subsampling approach. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pp. I–405–I–413. JMLR.org, 2014.
• Bardenet et al. (2017) Bardenet, R., Doucet, A., and Holmes, C. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18(47):1–43, 2017.
• Bierkens et al. (2018) Bierkens, J., Fearnhead, P., and Roberts, G. The Zig-Zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 2018. to appear.
• Bouchard-Côté et al. (2018) Bouchard-Côté, A., Vollmer, S. J., and Doucet, A. The Bouncy Particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 113:855–867, 2018.
• Ceperley (1995) Ceperley, D. M. Path integrals in the theory of condensed helium. Reviews of Modern Physics, 67(2):279, 1995.
• Chatterji et al. (2018) Chatterji, N. S., Flammarion, N., Ma, Y.-A., Bartlett, P. L., and Jordan, M. I. On the theory of variance reduction for stochastic gradient Monte Carlo. arXiv preprint arXiv:1802.05431, 2018.
• Christen & Fox (2005) Christen, J. A. and Fox, C. Markov chain Monte Carlo using an approximation. Journal of Computational and Graphical Statistics, 14(4):795–810, 2005. ISSN 10618600.
• Devroye (1986) Devroye, L. Non-Uniform Random Variate Generation. Springer, 1986.
• Dubey et al. (2016) Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pp. 1154–1162, 2016.
• Fukui & Todo (2009) Fukui, K. and Todo, S. Order-n cluster Monte Carlo method for spin systems with long-range interactions. Journal of Computational Physics, 228(7):2629–2642, 2009.
• Geyer (1998) Geyer, C. J. Markov chain Monte Carlo lecture notes. 1998.
• Hastings (1970) Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
• Jones & Hobert (2001) Jones, G. L. and Hobert, J. P.

Honest Exploration of Intractable Probability Distributions via Markov Chain Monte Carlo.

Statistical Science, 16(4):312–334, 2001. ISSN 0883-4237.
• Jones et al. (2014) Jones, G. L., Roberts, G. O., and Rosenthal, J. S. Convergence of conditional Metropolis–Hastings samplers. Advances in Applied Probability, 46(2):422–445, 2014.
• Kamatani (2018) Kamatani, K. Efficient strategy for the Markov chain Monte Carlo in high-dimension with heavy-tailed target probability distribution. Bernoulli, 24(4B):3711–3750, 2018.
• Kleijn & van der Vaart (2012) Kleijn, B. and van der Vaart, A.

The Bernstein-Von-Mises theorem under misspecification.

Electron. J. Statist., 6:354–381, 2012. doi: 10.1214/12-EJS675.
• Korattikara et al. (2014) Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: Cutting the Metropolis–Hastings budget. In International Conference on Machine Learning, pp. 181–189, 2014.
• Kronmal & Peterson (1979) Kronmal, R. A. and Peterson, A. V. J. On the alias method for generating random variables from a discrete distribution. The American Statistician, 33(4):214–218, 1979.
• Maclaurin & Adams (2014) Maclaurin, D. and Adams, R. P. Firefly Monte Carlo: Exact MCMC with subsets of data. In

Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence

, UAI’14, pp. 543–552, Arlington, Virginia, United States, 2014. AUAI Press.
ISBN 978-0-9749039-1-0.
• Meyn & Tweedie (2009) Meyn, S. and Tweedie, R. L. Markov Chains and Stochastic Stability. Cambridge University Press, New York, NY, USA, 2nd edition, 2009. ISBN 0521731828, 9780521731829.