# Global consensus Monte Carlo

For Bayesian inference with large data sets, it is often convenient or necessary to distribute the data across multiple machines. We consider a likelihood function expressed as a product of terms, each associated with a subset of the data. Inspired by global variable consensus optimisation, we introduce an instrumental hierarchical model associating auxiliary statistical parameters with each term; these are conditionally independent given the top-level parameters, one of which controls their unconditional strength of association. This model leads to a distributed MCMC algorithm on an extended state space yielding approximations of posterior expectations. A trade-off between computational tractability and fidelity to the original model can be controlled by changing the association strength in the instrumental model. We propose the use of a SMC sampler with a sequence of association strengths, allowing both the automatic determination of appropriate strengths and for a bias correction technique to be applied. In contrast to similar distributed Monte Carlo algorithms, this approach requires few distributional assumptions. The performance of the algorithms is illustrated with a number of simulated examples.


## 1 Introduction

Large data sets arising in modern statistical applications present serious challenges for standard computational techniques for Bayesian inference, such as Markov chain Monte Carlo (MCMC) and other approaches requiring repeated evaluations of the likelihood function. We consider here the situation where the data are distributed across multiple computing nodes. This could be because the likelihood function cannot be computed on a single computing node in a reasonable amount of time, e.g. the data might not fit into main memory.

We assume that the likelihood function can be expressed as a product of terms so that the posterior density for the statistical parameter satisfies

$$\pi(z) \propto \mu(z) \prod_{j=1}^{b} f_j(z), \tag{1}$$

where $z$ takes values in a space $\mathsf{E}$, and $\mu$ is a prior density. We assume that $f_j$ is computable on computing node $j$ and involves consideration of $y_j$, the $j$th subset or ‘block’ of the full data set, which comprises $b$ such blocks.

Many authors have considered embarrassingly parallel MCMC algorithms approximating expectations with respect to (1), following the consensus Monte Carlo approach of (Scott et al., 2016). Such algorithms require communication between the nodes only at the very beginning and end of the procedure, falling into the MapReduce framework (Dean and Ghemawat, 2008); their use is therefore more advantageous when inter-node communication is relatively costly, for example due to high latency. We later review some of these approaches, and some issues surrounding their use, in Section 2.4.

The approach we propose is motivated by global variable consensus optimisation (see, e.g., Boyd et al., 2011, Section 7). Instead of aiming to avoid communication between nodes entirely, our proposed algorithm is intended to be robust to significant differences between the likelihood contributions $f_j$, and therefore between the blocks of data. We introduce an instrumental hierarchical model, associating an auxiliary parameter with each likelihood contribution (and therefore with each computing node). These are conditionally independent given the variable of interest $z$ and an additional top-level parameter $\lambda$, which controls their unconditional strength of association. This allows the construction of an MCMC algorithm on an extended state space, yielding estimates of posterior expectations. By tuning the association strength through the top-level parameter, a trade-off between computational tractability and fidelity to the original model can be controlled.

The same framework has recently and independently been proposed in a serial context by Vono et al. (2018), who construct a Gibbs sampler via a ‘variable splitting’ approach. Rather than distributing the computation, the authors focus on the setting where $b = 1$, in order to obtain a relaxation of the original simulation problem (the case in which $b > 1$ is described in an appendix). Our contemporaneous work focuses on distributed settings, providing a sequential Monte Carlo implementation of the framework that may be used to generate bias-corrected estimates. The aforementioned authors’ application of the idea to linear Gaussian inverse problems provides a useful further demonstration of the framework; their proposed use of data augmentation could equally be applied in distributed contexts.

We introduce our proposed framework and the resulting algorithmic structure in Section 2, including some discussion of issues in its implementation, and comparisons with related approaches in the literature. A case study for Gaussian densities is provided in Section 3, giving some insight into the asymptotic behaviour of the algorithm; for this simple case we analyse the resulting bias–variance trade-off, and show how to minimise the asymptotic mean squared error. We then introduce our proposed sequential Monte Carlo implementation of the framework in Section 4. Various simulation examples are presented in Section 5, before conclusions are provided in Section 6.

## 2 The instrumental model and MCMC

For simplicity, we shall occasionally abuse notation by using the same symbol for a probability measure on $\mathsf{E}$, and for its density with respect to some dominating measure. For the numerical examples presented herein, all densities are defined with respect to a suitable version of the Lebesgue measure. We write $x_{1:b} \coloneqq (x_1, \ldots, x_b)$ for arbitrary $x_1, \ldots, x_b$. For a probability density function $\nu$ and function $\varphi$, we denote by $\nu(\varphi)$ the expectation of $\varphi(X)$ when $X \sim \nu$, i.e.

$$\nu(\varphi) \coloneqq \int \varphi(x)\,\nu(x)\,dx.$$

The goal of the present paper is to approximate $\pi(\varphi)$ for suitable functions $\varphi$.

### 2.1 The instrumental model

We take an approach that has also been developed in contemporaneous work by Vono et al. (2018), although their objectives were somewhat different. Alongside the variable of interest $z$, we introduce a collection of instrumental variables $x_1, \ldots, x_b$, each also defined on $\mathsf{E}$. On the extended state space $\mathsf{E}^{b+1}$, we define the probability density function $\tilde{\pi}_\lambda$ by

$$\tilde{\pi}_\lambda(z, x_{1:b}) \propto \mu(z) \prod_{j=1}^{b} K_j^{(\lambda)}(z, x_j)\, f_j(x_j), \tag{2}$$

where, for each $j \in \{1, \ldots, b\}$, $\{K_j^{(\lambda)} : \lambda > 0\}$ is a family of Markov transition densities. Defining

$$f_j^{(\lambda)}(z) \coloneqq \int_{\mathsf{E}} K_j^{(\lambda)}(z, x)\, f_j(x)\,dx,$$

the density of the $z$-marginal of $\tilde{\pi}_\lambda$ may be written as

$$\pi_\lambda(z) \coloneqq \int_{\mathsf{E}^b} \tilde{\pi}_\lambda(z, x_{1:b})\,dx_{1:b} \propto \mu(z) \prod_{j=1}^{b} f_j^{(\lambda)}(z). \tag{3}$$

The role of $\lambda$ is to control the fidelity of $\pi_\lambda$ to $\pi$, and so we assume the following in the sequel.

###### Assumption 1

For each $j \in \{1, \ldots, b\}$, $f_j^{(\lambda)}$ is bounded for all $\lambda > 0$, and $f_j^{(\lambda)} \to f_j$ pointwise as $\lambda \to 0$.

For example, Assumption 1 implies that $\pi_\lambda$ converges in total variation to $\pi$ by Scheffé’s lemma (Scheffé, 1947), and therefore $\pi_\lambda(\varphi) \to \pi(\varphi)$ for bounded $\varphi$. A sufficient condition for this is that for each $j$ and for almost all $z$, the probability measure associated with $K_j^{(\lambda)}(z, \cdot)$ converges weakly to the Dirac measure concentrated at $z$. On a first reading one may wish to assume that the $K_j^{(\lambda)}$ are chosen to be independent of $j$; for example, with $\mathsf{E} = \mathbb{R}$ one could take $K_j^{(\lambda)}(z, x) = \mathcal{N}(x; z, \lambda)$. In Section 2.3.3 we consider settings in which choosing these to differ with $j$ may be beneficial.
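To make the smoothing operation concrete, the following sketch evaluates $f_j^{(\lambda)}$ by numerical quadrature, under the assumed choices (for illustration only) of a Gaussian likelihood term $f_j(x) = \mathcal{N}(\mu_j; x, \sigma^2)$ and the Gaussian kernel $K_j^{(\lambda)}(z, x) = \mathcal{N}(x; z, \lambda)$; the convolution of two Gaussians is again Gaussian with variances added, so the result can be checked in closed form and is seen to approach $f_j$ pointwise as $\lambda \to 0$.

```python
import numpy as np

def smoothed_likelihood(z, lam, mu_j=1.0, sigma2=1.0):
    """Evaluate f_j^{(lambda)}(z) = integral of K_j^{(lambda)}(z, x) f_j(x) dx
    by quadrature, for f_j(x) = N(mu_j; x, sigma2) and the Gaussian kernel
    K_j^{(lambda)}(z, x) = N(x; z, lambda)."""
    xs = np.linspace(z - 10.0, z + 10.0, 20001)  # fine grid around z
    K = np.exp(-0.5 * (xs - z) ** 2 / lam) / np.sqrt(2 * np.pi * lam)
    f = np.exp(-0.5 * (mu_j - xs) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return np.sum(K * f) * (xs[1] - xs[0])  # Riemann-sum approximation

# The Gaussian convolution gives f_j^{(lambda)}(z) = N(mu_j; z, sigma2 + lambda),
# which converges pointwise to f_j(z) = N(mu_j; z, sigma2) as lambda -> 0.
```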

The instrumental hierarchical model is presented diagrammatically in Figure 1. The variables $x_{1:b}$ may be seen as ‘proxies’ for $z$ associated with each of the data subsets, which are conditionally independent given $z$ and the newly-introduced parameter $\lambda$. Loosely speaking, $\lambda$ represents the extent to which we allow the local variables $x_{1:b}$ to differ from the global variable $z$.

We observe that each $K_j^{(\lambda)}$ simultaneously provides a pseudo-prior for $x_j$ and a pseudo-likelihood for $z$. The prior density $\mu$ provides a form of regularisation at the ‘global’ level. This approach avoids the need to define, e.g., fractionated prior densities to assign prior information to each block of data, as is typical in embarrassingly parallel approaches. In terms of computation, it is this separation of $z$ from the subsets of the data, given $x_{1:b}$, introduced by the instrumental model that can be exploited by distributed algorithms.

There are connections between this setup and various concepts in the distributed optimisation literature; this motivation is also explored in the contemporaneous work of Vono et al. (2018). The global consensus problem is the problem of minimising a sum of functions on a common domain, under the constraint that their arguments are all equal to some global common value. If, with $\mathsf{E} = \mathbb{R}$, one uses for each $j$ the Gaussian kernel density $K_j^{(\lambda)}(z, x) = \mathcal{N}(x; z, \lambda)$, then taking the negative logarithm of (2) gives

$$-\log\tilde{\pi}_\lambda(z, x_{1:b}) = C - \log\mu(z) - \sum_{j=1}^{b}\log f_j(x_j) + \frac{1}{2\lambda}\sum_{j=1}^{b}(z - x_j)^2. \tag{4}$$

Maximising $\tilde{\pi}_\lambda$ is equivalent to minimising this function under the constraint that $x_j = z$ for $j \in \{1, \ldots, b\}$, which may be achieved using the alternating direction method of multipliers (Bertsekas and Tsitsiklis, 1989). Specifically, (4) corresponds to the use of $1/\lambda$ as the penalty parameter in this procedure.

There are some similarities between this framework and approximate Bayesian computation (see Marin et al., 2012, for a review of such methods) – in both cases one introduces a kernel that can be viewed as acting to smooth the likelihood. In the case of (2) the role of $\lambda$ is to control the scale of smoothing that occurs in the parameter space; the tolerance parameter used in ABC, in contrast, controls the extent of a comparable form of smoothing in the observation (or summary statistic) space.

### 2.2 Distributed Metropolis-within-Gibbs

The instrumental model described forms the basis of our proposed global consensus framework; ‘global consensus Monte Carlo’ is correspondingly the application of any Monte Carlo method to form an approximation of $\pi_\lambda$. We focus here on the construction of a Metropolis-within-Gibbs Markov kernel that leaves $\tilde{\pi}_\lambda$ invariant. If $\lambda$ is chosen to be sufficiently small, then the $z$-marginal $\pi_\lambda$ provides an approximation of $\pi$, and so given a chain with $z$-values denoted $Z_t$ for $t \in \{1, \ldots, T\}$, an approximation of $\pi(\varphi)$ for some function $\varphi$ is given by

$$\pi_\lambda^T(\varphi) \coloneqq \frac{1}{T}\sum_{t=1}^{T}\varphi(Z_t). \tag{5}$$

The Metropolis-within-Gibbs kernel we consider utilises the full conditional densities

$$\tilde{\pi}_\lambda(x_j \mid z) \propto K_j^{(\lambda)}(z, x_j)\, f_j(x_j) \tag{6}$$

for $j \in \{1, \ldots, b\}$, and

$$\tilde{\pi}_\lambda(z \mid x_{1:b}) \propto \mu(z) \prod_{j=1}^{b} K_j^{(\lambda)}(z, x_j), \tag{7}$$

where (6) follows from the mutual conditional independence of $x_{1:b}$ given $z$.

We define $M_1^{(\lambda)}$ to be a $\tilde{\pi}_\lambda$-invariant Markov kernel that fixes the first component $z$. Denoting by $\delta_z$ the Dirac measure at $z$, we have

$$M_1^{(\lambda)}\big((z, x_{1:b});\, d(z', x'_{1:b})\big) = \delta_z(dz') \prod_{j=1}^{b} P_{j,z}^{(\lambda)}(x_j, dx'_j),$$

where for each $j$, $P_{j,z}^{(\lambda)}$ is a Markov kernel leaving (6) invariant. A simple choice is

$$P_{j,z}^{(\lambda)}(x_j, dx'_j) = \tilde{\pi}_\lambda(x'_j \mid z)\,dx'_j, \qquad j \in \{1, \ldots, b\}, \tag{8}$$

which corresponds to sampling each $x_j$ from its full conditional distribution. We similarly define $M_2^{(\lambda)}$ to be a $\tilde{\pi}_\lambda$-invariant Markov kernel that fixes $x_{1:b}$,

$$M_2^{(\lambda)}\big((z, x_{1:b});\, d(z', x'_{1:b})\big) = \left[\prod_{j=1}^{b}\delta_{x_j}(dx'_j)\right] P_{x_{1:b}}^{(\lambda)}(z, dz'),$$

where $P_{x_{1:b}}^{(\lambda)}$ is a Markov kernel leaving (7) invariant. Again, a simple choice is

$$P_{x_{1:b}}^{(\lambda)}(z, dz') = \tilde{\pi}_\lambda(z' \mid x_{1:b})\,dz'. \tag{9}$$

The interest from a distributed perspective is that $M_1^{(\lambda)}$ can be implemented by having each node $j$ sample a new value $x'_j$ from $P_{j,z}^{(\lambda)}(x_j, \cdot)$; these values may then be communicated to a central node that implements $M_2^{(\lambda)}$. The resulting algorithm is presented as Algorithm 1.

The simple cases (8)–(9) correspond to a two-variable Gibbs sampler, in which the two variables are $z$ and $x_{1:b}$. This is particularly amenable to analysis, and we pursue this in Section 3. The same Gibbs sampler construction has recently been proposed independently by Vono et al. (2018); their main objective was to improve algorithmic performance, and they do not consider distributed computation.
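As an illustration, the following sketch implements the two-variable Gibbs sampler of Algorithm 1 in the conjugate Gaussian setting analysed in Section 3 (all parameter values and function names here are assumed for illustration): the prior is $\mathcal{N}(\mu_0, \sigma_0^2)$, each partial likelihood satisfies $f_j(x) \propto \mathcal{N}(\mu_j; x, \sigma_j^2)$, and $K_j^{(\lambda)}(z, x) = \mathcal{N}(x; z, c_j\lambda)$, so that both (8) and (9) can be sampled exactly. In a genuinely distributed implementation the local draws would run in parallel on the worker nodes.

```python
import numpy as np

def global_consensus_gibbs(mu0, s0sq, mu, ssq, c, lam, T, seed=0):
    """Two-variable Gibbs sampler on the extended space (z, x_{1:b}) for the
    Gaussian model: prior N(mu0, s0sq), partial likelihoods f_j(x) prop. to
    N(mu[j]; x, ssq[j]), kernels K_j(z, x) = N(x; z, c[j]*lam).
    Returns the z-chain."""
    rng = np.random.default_rng(seed)
    z, zs = mu0, np.empty(T)
    for t in range(T):
        # Local step (kernel M_1): each worker j draws x_j from its full
        # conditional (6); the draws are conditionally independent given z.
        denom = ssq + c * lam
        x = rng.normal((ssq * z + c * lam * mu) / denom,
                       np.sqrt(c * lam * ssq / denom))
        # Global step (kernel M_2): the central node draws z from its full
        # conditional (7) after collecting the x_j.
        prec = 1.0 / s0sq + np.sum(1.0 / (c * lam))
        z = rng.normal((mu0 / s0sq + np.sum(x / (c * lam))) / prec,
                       np.sqrt(1.0 / prec))
        zs[t] = z
    return zs
```

The ergodic average of the returned chain approximates $\pi_\lambda(\varphi)$ as in (5); for small $\lambda$ this in turn approximates $\pi(\varphi)$, at the cost of slower mixing.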

### 2.3 Implementation considerations

#### 2.3.1 Repeated MCMC kernel iterations

A typical straightforward MCMC approach to drawing samples according to (1) would require computation of the full likelihood function in each iteration. While the computation of the partial likelihood functions $f_j$ on each node can be carried out in parallel, this requires communication between the master node and the worker nodes for each computation of the full likelihood, and therefore for each accept/reject step.

In contrast, while each iteration of Algorithm 1 requires similar communication between nodes, computations of the conditional densities (6)–(7) may each occur on a single node. As such, our approach will be most advantageous when implementation of the Gibbs kernel (8) is not possible, with the Markov kernels $P_{j,z}^{(\lambda)}$ instead comprising multiple iterations of an MCMC kernel leaving (6) invariant. In this case, multiple computations of each $f_j$ (and therefore multiple accept/reject steps) may be conducted on each worker node. Our analysis of the Gibbs sampler setting in Section 3 may also be informative about this more general Metropolis-within-Gibbs setting, in cases where each $P_{j,z}^{(\lambda)}$ comprises enough MCMC iterations to exhibit good mixing.

Vono et al. (2018) demonstrate an example based on linear Gaussian inverse problems in which the Gibbs sampler form of Algorithm 1 results in more efficient simulation than direct sampling from $\pi$. This serial setting constitutes a ‘divide-and-conquer’ approach, in which the resulting Gibbs kernels are more computationally tractable than sampling from the full posterior.

#### 2.3.2 Choosing the regularisation parameter

The regularisation parameter $\lambda$ takes the role of a tuning parameter, and we can view its effect on the mean squared error of approximations of $\pi(\varphi)$ using the bias–variance decomposition,

$$\mathbb{E}\big[(\pi_\lambda^T(\varphi) - \pi(\varphi))^2\big] = \big[\pi_\lambda(\varphi) - \pi(\varphi)\big]^2 + \operatorname{var}\big[\pi_\lambda^T(\varphi)\big], \tag{10}$$

which holds when $\mathbb{E}[\pi_\lambda^T(\varphi)] = \pi_\lambda(\varphi)$. In many practical cases this decomposition will provide a very accurate approximation for large $T$, as the squared bias of $\pi_\lambda^T(\varphi)$ is typically asymptotically negligible in comparison to its variance.

The decomposition (10) separates the contributions to the error from the bias introduced by the instrumental model and the variance associated with the MCMC approximation. If $\lambda$ is too large, the squared bias term in (10) can dominate, while if $\lambda$ is too small, the Markov chain may exhibit poor mixing due to strong conditional dependencies between $z$ and $x_{1:b}$, and so the variance term in (10) can dominate. For example, if $K_j^{(\lambda)}(z, \cdot)$ converges weakly to the Dirac measure at $z$ as $\lambda \to 0$, it is straightforward to show that the conditional distribution of $x_j$ given $z$ will converge weakly to the Dirac measure at $z$.

It follows that $\lambda$ should ideally be chosen in order to balance these two considerations; the effect of $\lambda$ is investigated theoretically in the analysis of Section 3 and empirically in the examples of Section 5. An alternative that we explore in Section 4 is to use Markov kernels formed via Algorithm 1 within a sequential Monte Carlo sampler. In this manner a decreasing sequence of $\lambda$ values may be considered, which may result in lower-variance estimates for small values of $\lambda$; we also describe a possible bias correction technique.

#### 2.3.3 Choosing the Markov transition densities

Depending on the forms of $\mu$ and the $f_j$, appropriate choices of the $K_j^{(\lambda)}$ may enable sampling from some of the conditional distributions in (6)–(7). For example, each $K_j^{(\lambda)}$ is a pseudo-likelihood for $z$, so if $\mu$ is conjugate to these pseudo-likelihoods, then the conditional distribution of $z$ given $x_{1:b}$ will be from the same family as $\mu$. Similarly, one might choose for each $j$ a kernel $K_j^{(\lambda)}(z, \cdot)$ forming a conjugate prior for the partial likelihood term $f_j$, so that the conditional distribution of each $x_j$ given $z$ and $\lambda$ is from the same family as $K_j^{(\lambda)}(z, \cdot)$.

It may also be appropriate to choose the Markov transition densities to have relative scales comparable to those of the corresponding partial likelihood terms. To motivate this, consider a univariate setting in which the partial likelihood terms are Gaussian, so that we may write $f_j(z) \propto \mathcal{N}(\mu_j; z, \sigma_j^2)$ for each $j$. Suppose one uses Gaussian transition densities $K_j^{(\lambda)}(z, x) = \mathcal{N}(x; z, c_j\lambda)$, where $c_1, \ldots, c_b$ are positive values controlling the relative strengths of association between $z$ and the local variables $x_j$. As seen in (3), in the approximating density $\pi_\lambda$ the partial likelihood terms are replaced by smoothed terms, in this case given by

$$f_j^{(\lambda)}(z) \propto \mathcal{N}(\mu_j;\, z,\, \sigma_j^2 + c_j\lambda). \tag{11}$$

The resulting smoothed posterior density is presented as (12) in Section 3, where this setting is further explored.

In this case, the role of $K_j^{(\lambda)}$ may be seen as ‘diluting’ or downweighting the contribution of each partial likelihood to the posterior distribution. A natural choice is to take $c_j \propto \sigma_j^2$, so that the dilution of each $f_j$ is in proportion to the strength of its contribution to $\pi$. Writing $c_j = c\sigma_j^2$, (11) becomes

$$f_j^{(\lambda)}(z) \propto \mathcal{N}\big(\mu_j;\, z,\, (1 + c\lambda)\sigma_j^2\big)$$

for some constant $c > 0$. The relative strengths of contribution of the $f_j$ are thereby preserved in the posterior density $\pi_\lambda$. If one has prior beliefs or approximations of the relative scales of the partial likelihood terms, then scaling the regularisation parameters in this way may therefore be beneficial. An additional advantage is that if one uses an improper uniform prior $\mu$, then $\pi_\lambda$ will have the same mean as $\pi$ for all $\lambda$, allowing low-bias estimates of the posterior mean to be formed.

A particular case of interest is that in which the blocks of data differ in size. If each observation $y_i$ has a likelihood contribution of the form $\mathcal{N}(y_i; z, \sigma^2)$, then the $j$th partial likelihood may be expressed as $f_j(z) \propto \mathcal{N}(\bar{y}_j; z, \sigma^2/n_j)$, where $n_j$ is the number of data in the $j$th block and $\bar{y}_j$ is their mean. Taking $c_j = c/n_j$, the smoothed partial likelihood (11) becomes

$$f_j^{(\lambda)}(z) \propto \mathcal{N}\big(\bar{y}_j;\, z,\, (\sigma^2 + c\lambda)/n_j\big)$$

for some constant $c > 0$, so that the information from each observation is diluted in a consistent way. Motivated by Bayesian asymptotic arguments, we suggest that this scaling of the regularisation parameters in inverse proportion to the relative block sizes may be beneficial in more general settings.

The effect of such choices on the MCMC algorithm is most readily seen by considering the improper uniform prior $\mu(z) \propto 1$ (a Gaussian prior is considered in Section 3). Taking $K_j^{(\lambda)}(z, x) = \mathcal{N}(x; z, c_j\lambda)$, the conditional density of $z$ given $x_{1:b}$ is

$$\tilde{\pi}_\lambda(z \mid x_{1:b}) = \mathcal{N}\left(z;\, \frac{\sum_{j=1}^{b} x_j/c_j}{\sum_{j=1}^{b} 1/c_j},\, \frac{\lambda}{\sum_{j=1}^{b} 1/c_j}\right).$$

Therefore, when updating $z$ given the local variables’ current values $x_{1:b}$, the choice of $c_{1:b}$ dictates the relative influence of each such value. For example, we might expect the local variables corresponding to larger blocks to be more informative about the distribution of $z$, which further justifies choosing the $c_j$ to be inversely proportional to the block sizes.

In a multidimensional setting, one could control the covariance structure of each $x_j$ given $z$ by using transition densities of the form $K_j^{(\lambda)}(z, x) = \mathcal{N}(x; z, \lambda\Sigma_j)$, where $\Sigma_1, \ldots, \Sigma_b$ are positive semi-definite matrices. By a similar Gaussian analysis, one could preserve the relative strengths of contribution of the partial likelihood terms by choosing each $\Sigma_j$ to be an approximation of the covariance matrix of the corresponding $f_j$.

### 2.4 Comparisons with related approaches

Our algorithm has similar objectives to embarrassingly parallel techniques proposed by many authors; these require communication between nodes only at the beginning and end of the algorithm, thereby reducing the costs of communication latency to a near-minimum. Consensus Monte Carlo (Scott et al., 2016) involves simulating different MCMC chains on each node, each invariant with respect to a ‘subposterior’ distribution with density proportional to $\mu(z)^{1/b} f_j(z)$, and combining the chains to produce approximations of posterior expectations using a final post-processing step. The post-processing step amounts to forming a ‘consensus chain’ by weighted averaging of the separate chains. In the case that each subposterior density is Gaussian, this approach can be used to produce samples asymptotically distributed according to $\pi$, by weighting each chain using the precision matrices of the subposterior distributions, or sample approximations thereof.

Motivated by Bayesian asymptotics, the authors suggest using this approach more generally; indeed, in cases where the subposterior distributions exhibit near-Gaussianity this performs well, with the final ‘consensus chain’ providing a good approximation of posterior expectations. However, the application of this approach in non-Gaussian settings comes with no theoretical guarantees. Other weighting schemes are possible; the authors use unweighted averages in one example, though note that this choice may produce poor results when the subposterior distributions differ considerably.

In cases where at least some of the likelihood contributions are highly non-Gaussian, consensus Monte Carlo sometimes performs poorly (as in examples of Wang et al., 2015; Srivastava et al., 2015). Various authors have therefore proposed more generally-applicable techniques for utilising the values from each of these chains in order to approximate posterior expectations. Scott (2017) suggests a strategy based on finite mixture models, following an earlier proposal by Neiswanger et al. (2014) employing kernel density estimates, though notes that both methods may be infeasible or impractical in high-dimensional settings. The same author proposes a model-agnostic method employing sequential importance sampling, which is also observed to perform poorly in high dimensions.

While the averaging technique used in the consensus Monte Carlo procedure of Scott et al. (2016) is simple, it can exhibit other shortcomings. As we shall later exemplify, one case of interest is that in which the data in different blocks are not exchangeable, perhaps due to each block of data originating from a different source, or describing a different subpopulation. There may then be significant differences between the functions $f_j$, and so the final combination procedure can result in inaccurate representations of the true target density. While permuting and re-partitioning the data may result in greater homogeneity across the data blocks, this may not always be feasible.

Various other techniques have been proposed for combining the outputs of the individual chains in embarrassingly parallel approaches. Rabinovich et al. (2015) suggest that instead of combining the chains’ values by averaging, one could use variational optimisation to choose a function with which to aggregate the chains. Wang and Dunson (2013) build on the idea of using kernel density estimates by using approximations based on Weierstrass transforms; Wang et al. (2015) suggest a method employing random partition trees. Another direction is taken by Minsker et al. (2014) and Srivastava et al. (2015), who consider forming empirical measures approximating each subposterior distribution and taking a suitably-defined mean or median.

Another possible issue with embarrassingly parallel approaches concerns the prior density $\mu$. Typically, each subposterior receives an equal share of the prior information encoded by $\mu$, in the form of the fractionated prior density $\mu^{1/b}$. It is not clear, however, when this approach is justified. For example, suppose the prior distribution belongs to an exponential family; any property that is not invariant to multiplying the canonical parameters by a constant will not be preserved when fractionating the density. For several common distributions (including gamma and Wishart), this is true of the first moment. As such, even if $\mu^{1/b}$ is proportional to a valid probability density function, the distribution to which this corresponds may be qualitatively very different to the full prior. Although Scott et al. (2016) note that fractionated priors perform poorly on one example (for which a tailored solution is provided), no other obvious way of assigning prior information to each block naturally presents itself. In contrast, as described in Section 2.1, our approach avoids this problem entirely, with $\mu$ providing prior information for $z$ at the ‘global’ level.

An alternative approach to the problem has been proposed by Jordan et al. (2018), who construct a surrogate approximation to the global likelihood based on gradient information from each block. In another proposal, Xu et al. (2014) consider approximating each $f_j$ by an unnormalised density belonging to an exponential family; separate MCMC chains are run in parallel on each node, targeting densities in which those likelihood contributions that cannot be computed on that node are replaced by these approximating densities. Regular moment-sharing between the nodes allows the parameters of these densities to be iteratively updated via expectation propagation, in order that each chain’s target density forms a close approximation of the true target $\pi$. Again, however, the effectiveness of this method relies on approximations that may not always be appropriate.

Finally, we believe this work is complementary to approaches that try to reduce the amount of computation associated with each likelihood calculation on a single node, e.g. by using only a subsample or batch of the data (Korattikara et al., 2014; Bardenet et al., 2014; Huggins et al., 2016), the use of bounds on the likelihood contribution of each datum (Maclaurin and Adams, 2014), and the application of various stochastic optimisation procedures (Welling and Teh, 2011; Hoffman et al., 2013).

## 3 Theoretical analysis for a simple model

### 3.1 Inferring the mean of a normal distribution

We consider a simple model where the goal is to infer the mean of a normal distribution, as the resulting Markov chain is particularly amenable to theoretical analysis. The results may also be indicative of performance for regular models with abundant data, due to the Bernstein–von Mises theorem (see, e.g., van der Vaart, 2000, Chapter 10).

Let $\mu(z) = \mathcal{N}(z; \mu_0, \sigma_0^2)$, and for each $j \in \{1, \ldots, b\}$ let $f_j(z) \propto \mathcal{N}(\mu_j; z, \sigma_j^2)$ and $K_j^{(\lambda)}(z, x) = \mathcal{N}(x; z, c_j\lambda)$, following Section 2.3.3. We obtain

$$\pi_\lambda(z) = \mathcal{N}\left(z;\, \delta_\lambda^2\left[\frac{\mu_0}{\sigma_0^2} + \sum_{j=1}^{b}\frac{\mu_j}{\sigma_j^2 + c_j\lambda}\right],\, \delta_\lambda^2\right), \qquad \delta_\lambda^2 = \left(\frac{1}{\sigma_0^2} + \sum_{j=1}^{b}\frac{1}{\sigma_j^2 + c_j\lambda}\right)^{-1}, \tag{12}$$

and $\pi$ can be recovered by taking $\lambda = 0$ in (12). The corresponding full conditional densities are

$$\tilde{\pi}_\lambda(x_j \mid z) = \mathcal{N}\left(x_j;\, \frac{\sigma_j^2 z + c_j\lambda\mu_j}{\sigma_j^2 + c_j\lambda},\, \frac{c_j\lambda\sigma_j^2}{\sigma_j^2 + c_j\lambda}\right)$$

and

$$\tilde{\pi}_\lambda(z \mid x_{1:b}) = \mathcal{N}\left(z;\, \tilde{\delta}_\lambda^2\left[\frac{\mu_0}{\sigma_0^2} + \sum_{j=1}^{b}\frac{x_j}{c_j\lambda}\right],\, \tilde{\delta}_\lambda^2\right), \qquad \tilde{\delta}_\lambda^2 = \left(\frac{1}{\sigma_0^2} + \sum_{j=1}^{b}\frac{1}{c_j\lambda}\right)^{-1},$$

and we consider the case where $M_1^{(\lambda)}$ and $M_2^{(\lambda)}$ are Gibbs kernels as in (8)–(9).

For a two-variable Gibbs Markov chain, each of the two ‘marginal’ chains (the sequences of states for each of the two variables) is also a Markov chain. In this setting we may therefore consider the $z$-chain, with transition kernel

$$P^{(\lambda)}(z, dz') = \int_{\mathsf{E}^b} \left[\prod_{j=1}^{b}\tilde{\pi}_\lambda(x_j \mid z)\,dx_j\right] \tilde{\pi}_\lambda(z' \mid x_{1:b})\,dz'. \tag{13}$$

Observing that (13) depends on $x_{1:b}$ only through the sum $\sum_{j=1}^{b} x_j/c_j$, one can thereby show that the $z$-chain defined by (13) is an AR(1) process. Specifically, we have

$$Z_k = C + \alpha Z_{k-1} + \epsilon_k, \qquad k > 0,$$

where

$$\alpha \coloneqq \tilde{\delta}_\lambda^2\sum_{j=1}^{b}\frac{\sigma_j^2}{c_j\lambda(\sigma_j^2 + c_j\lambda)}, \qquad C \coloneqq \tilde{\delta}_\lambda^2\left(\frac{\mu_0}{\sigma_0^2} + \sum_{j=1}^{b}\frac{\mu_j}{\sigma_j^2 + c_j\lambda}\right),$$

and the $\epsilon_k$ are i.i.d. zero-mean normal random variables, with variance

$$\tilde{\delta}_\lambda^2\left(1 + \tilde{\delta}_\lambda^2\sum_{j=1}^{b}\frac{\sigma_j^2}{c_j\lambda(\sigma_j^2 + c_j\lambda)}\right).$$

It follows that the autocorrelation at lag $s$ is given by $\alpha^s$ for $s \geq 0$, and that $\alpha \to 1$ as $\lambda \to 0$.
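These identities admit a direct numerical consistency check. With illustrative parameter values (assumed here), the stationary mean $C/(1-\alpha)$ and stationary variance of the AR(1) process must coincide with the mean and variance $\delta_\lambda^2$ of $\pi_\lambda$ in (12), since the $z$-chain is $\pi_\lambda$-stationary:

```python
import numpy as np

# Illustrative parameter values for the Gaussian model of Section 3.1.
mu0, s0sq, lam = 0.0, 4.0, 0.5
mu = np.array([1.0, 2.0, 3.0]); ssq = np.ones(3); c = np.ones(3)

dt2 = 1.0 / (1.0 / s0sq + np.sum(1.0 / (c * lam)))       # tilde-delta_lambda^2
alpha = dt2 * np.sum(ssq / (c * lam * (ssq + c * lam)))  # AR(1) coefficient
C = dt2 * (mu0 / s0sq + np.sum(mu / (ssq + c * lam)))    # AR(1) intercept
eps_var = dt2 * (1.0 + alpha)                            # innovation variance

# Mean and variance of pi_lambda from (12), to be compared with the AR(1)
# stationary mean C / (1 - alpha) and stationary variance eps_var / (1 - alpha^2).
d2 = 1.0 / (1.0 / s0sq + np.sum(1.0 / (ssq + c * lam)))
pi_mean = d2 * (mu0 / s0sq + np.sum(mu / (ssq + c * lam)))
```

Simulating the Gibbs chain of Algorithm 1 for this model and estimating its lag-1 autocorrelation and mean empirically recovers the same quantities.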

### 3.2 Asymptotic bias and variance with n observations

We now consider the setting of Section 3.1, making the number of observations explicit. In particular, for some $n$ consider realisations of i.i.d. $\mathcal{N}(\theta, \sigma^2)$ random variables $y_1, \ldots, y_n$, grouped into $b$ blocks. For simplicity, assume that $b$ divides $n$, that each block contains $n/b$ observations, and that the observations are allocated to the blocks sequentially, so that the $j$th block comprises those $y_i$ for which $i \in B_j$. Then

$$f_j(z) = \prod_{i \in B_j}\mathcal{N}(y_i; z, \sigma^2) \propto \mathcal{N}\left(\frac{b}{n}\sum_{i \in B_j} y_i;\, z,\, \frac{b}{n}\sigma^2\right).$$

Since the blocks are of equal size in this case, so that each partial likelihood is of the same scale, we consider using $c_j = 1$ for each $j$. From (12), we obtain

$$\pi_\lambda(z) = \mathcal{N}\left(z;\, \delta_\lambda^2\left[\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{y}}{\sigma^2 + n\lambda/b}\right],\, \delta_\lambda^2\right), \qquad \delta_\lambda^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2 + n\lambda/b}\right)^{-1}. \tag{14}$$

Denoting the identity function on $\mathbb{R}$ by $\mathrm{Id}$, we consider the estimator $\pi_\lambda^T(\mathrm{Id})$ of the posterior first moment $\pi(\mathrm{Id})$; we analyse its mean squared error using the bias–variance decomposition (10). The bias is

$$\pi_\lambda(\mathrm{Id}) - \pi(\mathrm{Id}) = \frac{n^2(\lambda/b)\,\sigma_0^2\,(\mu_0 - \bar{y})}{(\sigma^2 + n\sigma_0^2)(\sigma^2 + n\sigma_0^2 + n\lambda/b)}. \tag{15}$$
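This expression can be checked directly against (14): the sketch below (with all parameter values assumed for illustration) compares the difference of the posterior means at $\lambda > 0$ and at $\lambda = 0$ with the closed form (15).

```python
import numpy as np

def post_mean(n, b, lam, ybar, mu0, s0sq, ssq):
    """Mean of pi_lambda in (14); lam = 0 recovers the true posterior mean."""
    L = n * lam / b
    d2 = 1.0 / (1.0 / s0sq + n / (ssq + L))
    return d2 * (mu0 / s0sq + n * ybar / (ssq + L))

def bias(n, b, lam, ybar, mu0, s0sq, ssq):
    """Closed-form bias (15) of the pi_lambda posterior mean."""
    L = n * lam / b
    return n * L * s0sq * (mu0 - ybar) / (
        (ssq + n * s0sq) * (ssq + n * s0sq + L))
```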

To assess the variance of $\pi_\lambda^T(\mathrm{Id})$, we consider the asymptotic variance associated with the ergodic average (5),

$$v_\lambda(\varphi) \coloneqq \lim_{T \to \infty} T\operatorname{var}\big[\pi_\lambda^T(\varphi)\big] \tag{16}$$

for $\varphi$ square-integrable w.r.t. $\pi_\lambda$. As discussed earlier the $z$-chain is an AR(1) process, and the autocorrelations are entirely determined by the autoregressive parameter

$$\alpha = \frac{n\sigma^2\sigma_0^2}{(\sigma^2 + n\lambda/b)(n\sigma_0^2 + n\lambda/b)},$$

from which one can find that the asymptotic variance for $\varphi = \mathrm{Id}$ is

$$\frac{\sigma_0^2(\sigma^2 + n\lambda/b)\big[(n\lambda/b)^2 + (\sigma^2 + n\sigma_0^2)(n\lambda/b) + 2n\sigma^2\sigma_0^2\big]}{(n\lambda/b)(\sigma^2 + n\sigma_0^2 + n\lambda/b)^2}. \tag{17}$$

Following the definition (16) of this asymptotic variance, dividing this expression by $T$ gives an approximation of the variance term in (10) for large $T$.

As a caveat to this and the following analysis, estimation of the mean in Gaussian settings may not accurately reflect what happens in more complex settings. For example, if one uses an improper uniform prior then $\pi_\lambda^T(\mathrm{Id})$ is an unbiased estimator of $\pi(\mathrm{Id})$ for any $\lambda$, as seen by taking $\sigma_0^2 \to \infty$ in (15); this will not be true in general.

One may also note that in this Gaussian setting, the variance of $\pi_\lambda$ will always exceed the variance of the true target $\pi$, since the variance expression in (14) is an increasing function of $\lambda$. The effect is that estimation of the posterior variance in Gaussian settings is likely to result in positive bias, and confidence intervals for $z$ may be conservative. This, of course, simply reflects the fact that marginally the instrumental model can be viewed as replacing the original likelihood with a smoothed version, as shown in (3).

### 3.3 Asymptotic optimisation of λ for large T

For fixed $n$, we consider the problem of choosing $\lambda$ as a function of the chain length $T$ so as to minimise the mean squared error of the posterior mean estimator. This involves considering the contributions of the bias and variance to the mean squared error (10) in light of (15) and (17). Intuitively, with larger values of $T$, smaller values of $\lambda$ can be used to reduce the bias while keeping the variance small. Defining $B(\lambda)$ to be the bias as given in (15), we see that as $\lambda \to 0$,

$$\frac{B(\lambda)}{\lambda} \to \frac{n^2\sigma_0^2(\mu_0 - \bar{y})}{b(\sigma^2 + n\sigma_0^2)^2} \eqqcolon B_\star.$$

Similarly, denoting by $V(\lambda)$ the asymptotic variance (17), we see that

$$\lambda V(\lambda) \to \frac{2b\sigma^4\sigma_0^4}{(\sigma^2 + n\sigma_0^2)^2} \eqqcolon V_\star.$$

For small $\lambda$, the mean squared error of the estimate is given approximately by

$$\mathbb{E}\big[(\pi_\lambda^T(\mathrm{Id}) - \pi(\mathrm{Id}))^2\big] \approx (\lambda B_\star)^2 + \frac{V_\star}{T\lambda},$$

which may be shown to be minimised when

$$\lambda^3 = \frac{V_\star}{2B_\star^2 T} = \frac{b^3\sigma^4(\sigma^2 + n\sigma_0^2)^2}{n^4 T(\mu_0 - \bar{y})^2}.$$

We see that, for a fixed number of data $n$, we should scale $\lambda$ with the number of samples as $T^{-1/3}$; the corresponding minimal mean squared error may be shown to behave as $\mathcal{O}(T^{-2/3})$.
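The optimisation can be checked numerically. In the sketch below (all parameter values assumed for illustration), the closed-form minimiser is compared against a grid of surrounding $\lambda$ values for the approximate mean squared error, and the $T^{-1/3}$ scaling is verified by increasing $T$ eightfold, which should halve the optimal $\lambda$.

```python
import numpy as np

# Illustrative values for the quantities appearing in Section 3.3.
n, b, T = 1000, 10, 10_000
ssq, s0sq, mu0, ybar = 1.0, 1.0, 0.0, 0.3

B_star = n**2 * s0sq * (mu0 - ybar) / (b * (ssq + n * s0sq) ** 2)
V_star = 2 * b * ssq**2 * s0sq**2 / (ssq + n * s0sq) ** 2

def mse(lam, T=T):
    """Small-lambda approximation of the mean squared error."""
    return (lam * B_star) ** 2 + V_star / (T * lam)

lam_opt = (V_star / (2 * B_star**2 * T)) ** (1.0 / 3.0)
grid = lam_opt * np.linspace(0.2, 5.0, 401)  # lambda values around the optimum
```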

### 3.4 Posterior consistency and coverage of credible intervals as n→∞

We now consider the behaviour of the algorithm as the number of data $n$ tends to infinity. For this example, in the smoothed likelihood (14) all dependence on $\lambda$ and $b$ is through their ratio $\lambda/b$. The result is that splitting the data into more blocks has the same effect as reducing $\lambda$, which may not be representative of these variables’ behaviour for other models.

Recalling that we assume the true parameter value to be $\theta$, we may consider the consistency of the posterior distribution (14) by treating the data as random. We denote their mean by $\bar{Y}_n$, which is normally distributed with mean $\theta$ and variance $\sigma^2/n$. We shall also consider allowing $\lambda$ and $b$ to vary with $n$; making this explicit in the notation, (14) becomes

$$\pi_{\lambda_n}(z) = \mathcal{N}\big(z;\, \mu_{(n)},\, \delta_{(n)}^2\big) = \mathcal{N}\big(z;\, \delta_{(n)}^2\,\xi_{(n)},\, \delta_{(n)}^2\big),$$

where

$$\xi_{(n)} = \frac{\mu_0}{\sigma_0^2} + \frac{n\bar{Y}_n}{\sigma^2 + n\lambda_n/b_n}, \qquad \delta_{(n)}^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2 + n\lambda_n/b_n}\right)^{-1}.$$

We consider and using the fact that , we obtain the following convergence results for different values of :

Hence, it can be seen that the posterior is consistent (see, e.g., Ghosh and Ramamoorthi, 2003, Chapter 1) if λ_n/b_n → 0. Moreover, if nλ_n/b_n → 0 then credible intervals will asymptotically have exactly their nominal coverage probability, due to the convergence nδ²(n) → σ².

If λ_n/b_n → 0 but nλ_n/b_n → ∞ then the rate of approximate posterior contraction is too conservative, while if nλ_n/b_n → c for some constant c ∈ (0, ∞) the corresponding credible intervals will be too wide by a constant factor depending on c. From a practical perspective, one can consider the case where n/b_n corresponds to the maximum number of data points that can be processed on an individual computing node. In such a setting, letting b_n ∝ n is reasonable, and we require in addition that λ_n is decreasing to obtain credible intervals with asymptotically exact coverage.
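These regimes can be checked numerically from the expression for δ²(n). The sketch below uses illustrative values of σ² and σ0² (assumptions, not from the text), writing ρ for the ratio λ_n/b_n:

```python
import numpy as np

# Illustrative constants (assumptions): data variance and prior variance.
sigma2, sigma02 = 1.0, 4.0

def delta2(n, rho):
    """Variance of the smoothed posterior (14), with rho = lambda_n / b_n."""
    return 1.0 / (1.0 / sigma02 + n / (sigma2 + n * rho))

n = 10**8
c = 2.0

# n * rho -> 0: exact contraction rate, n * delta2(n) -> sigma^2 (nominal coverage).
assert abs(n * delta2(n, rho=n**-1.5) - sigma2) < 1e-3

# n * rho -> c > 0: contraction at rate 1/n, but variance inflated to sigma^2 + c.
assert abs(n * delta2(n, rho=c / n) - (sigma2 + c)) < 1e-3

# rho -> 0 with n * rho -> infinity: contraction only at the slower rate rho_n.
rho = n**-0.5
assert abs(delta2(n, rho) / rho - 1.0) < 1e-2
```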

## 4 Sequential Monte Carlo approach

We consider here particle approximations of a sequence of distributions with densities π^{λ_0}, …, π^{λ_P} using sequential Monte Carlo (SMC) methodology, where λ_0 > λ_1 > ⋯ > λ_P is a decreasing sequence. The procedure is specified in Algorithm 2. SMC methodology employs sequential importance sampling and resampling; recent surveys include Doucet and Johansen (2011) and Doucet and Lee (2018). The specific method employed here is an SMC sampler within the framework of Del Moral et al. (2006), and was proposed by Gilks and Berzuini (2001) and Chopin (2002), building upon ideas in Crooks (1998) and Neal (2001).

The algorithm presented involves simulating N particles using π^{λ_p}-invariant Markov kernels, and has a genealogical structure imposed by the ancestor indices A_p^i for i ∈ {1, …, N} and p ∈ {1, …, P}. The specific scheme for simulating the ancestor indices here is known as multinomial resampling; other schemes can certainly be used (see Douc et al., 2005; Gerber et al., 2017, for a summary of some schemes and their properties). We use this simple scheme here as it validates the use of the variance estimators used in Section 4.1. This optional resampling step is used to prevent the degeneracy of the particle set; a common approach is to carry out this step whenever the particles’ effective sample size (Liu and Chen, 1995) falls below a pre-determined threshold.

Under weak conditions, the resulting particle approximations converge to the corresponding expectations as N → ∞. One can also define the particle approximations of π^{λ_p} via

  \pi^N_{\lambda_p} \coloneqq \frac{1}{N}\sum_{i=1}^{N} W_p^i\, \delta_{Z_p^i}, (20)

where Z_p^i denotes the first component of the i-th particle at step p. With regard to initialisation, if it is not possible to sample from π^{λ_0}, one could instead use samples obtained by importance sampling or, at the expense of introducing an additional approximation, by running a thinned π^{λ_0}-invariant Markov chain. One could also initialise the SMC algorithm from a tractable distribution and use tempering or similar techniques to reach π^{λ_0}.

Although the algorithm is specified for simplicity in terms of a fixed sequence λ_0 > ⋯ > λ_P, a primary motivation for the SMC approach is that the sequence used can be determined while running the algorithm, as explored in Section 4.2. Other reasons in favour of this approach are that many of the particle approximations (19) can be used to form a final estimate of the quantity of interest, as explored in Section 4.1, and that SMC methods can be more robust to multimodality than simple Markov chain schemes. We finally note that in such an SMC sampler, a careful implementation of the MCMC kernels used may allow the inter-node communication described in Section 2.3.1 to be interleaved with likelihood computations associated with the particles, thereby reducing the costs associated with communication latency.

### 4.1 Bias correction using local linear regression

We present an approach to making use of many of the particle approximations produced by Algorithm 2. A natural idea is to regress the values of the estimates π^N_{λ_p}(f) on λ_p, extrapolating to λ = 0 to obtain an estimate of π(f). A similar idea has been used for bias correction in the context of approximate Bayesian computation, albeit not in an SMC setting, regressing simulated parameter values on the discrepancy between the observed data and simulated pseudo-observations (Beaumont et al., 2002; Blum and François, 2010).

Under very mild assumptions on the transition densities, π^λ(f) is continuous as a function of λ, and so a simple approach is to model its dependence on λ as being approximately linear for λ sufficiently close to 0. The Gaussian setting described in Section 3.2 illustrates this approach; in that case, define η(λ) to be the first moment of π^λ, which has density (14). A Taylor expansion about λ = 0 gives

  \eta(\lambda) = \frac{\mu_0\sigma^2 + n\sigma_0^2\bar{y}}{\sigma^2 + n\sigma_0^2} + \sum_{k=1}^{\infty} (-1)^{k+1} \frac{n^{k+1}\sigma_0^2(\mu_0 - \bar{y})}{b^k(\sigma^2 + n\sigma_0^2)^{k+1}}\, \lambda^k, (21)

in which the linear term of the sum dominates for sufficiently small λ. A similar argument may be applied to the second and higher moments of π^λ.
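This near-linearity can be checked numerically. The sketch below computes η(λ) directly from the mean of the smoothed posterior (14) and compares it with the linear approximation η(0) + λB⋆; the constants μ0, σ0², σ², n, b and ȳ are illustrative assumptions:

```python
# Illustrative constants (assumptions): prior N(mu0, sigma02), n observations
# with variance sigma2 split into b blocks, observed sample mean ybar.
mu0, sigma02, sigma2, n, b, ybar = 0.0, 1.0, 1.0, 100, 4, 0.3

def eta(lam):
    """First moment of the smoothed posterior (14)."""
    s = sigma2 + n * lam / b
    d2 = 1.0 / (1.0 / sigma02 + n / s)
    return d2 * (mu0 / sigma02 + n * ybar / s)

a = sigma2 + n * sigma02
post_mean = (mu0 * sigma2 + n * sigma02 * ybar) / a   # eta(0) = pi(Id)
B_star = n**2 * sigma02 * (mu0 - ybar) / (b * a**2)   # linear coefficient

# For small lambda the linear term dominates: eta(lam) ~ pi(Id) + lam * B_star.
for lam in [1e-2, 1e-3, 1e-4]:
    lin_err = abs(eta(lam) - (post_mean + lam * B_star))
    assert lin_err < abs(lam * B_star) * 0.05  # higher-order terms are relatively small
```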

Having determined a subset of the values of λ used for which a linear approximation is appropriate, e.g. using the automatic approach in Section 4.2, one can use linear least squares to carry out the regression. To account for the SMC estimates having different variances, we propose the use of weighted least squares, with the ‘observations’ assigned weights inversely proportional to their variances.

To make this explicit, first consider the case in which f is real-valued, so that all estimates are one-dimensional. For each value λ_p denote the corresponding SMC estimate by η_p, and let v_p denote the variance of this estimate. Then for some indices m ≤ n, chosen such that the relationship between λ_p and η_p is approximately linear for p ∈ {m, …, n}, a bias-corrected estimate for π(f) may be computed as

  \tilde{\eta}_{(m:n)} - \tilde{\lambda}_{(m:n)} \frac{\sum_{p=m}^{n} (\lambda_p - \tilde{\lambda}_{(m:n)})(\eta_p - \tilde{\eta}_{(m:n)})/v_p}{\sum_{p=m}^{n} (\lambda_p - \tilde{\lambda}_{(m:n)})^2/v_p}, (22)

where λ̃_(m:n) and η̃_(m:n) denote weighted means given by

  \tilde{\lambda}_{(m:n)} \coloneqq \frac{\sum_{p=m}^{n} \lambda_p/v_p}{\sum_{p=m}^{n} 1/v_p}, \qquad \tilde{\eta}_{(m:n)} \coloneqq \frac{\sum_{p=m}^{n} \eta_p/v_p}{\sum_{p=m}^{n} 1/v_p}.

This corresponds to the estimated intercept term in a weighted least squares regression, and therefore to the extrapolated value of the estimate at λ = 0. In the more general case where the quantity of interest is multidimensional, we simply propose evaluating (22) for each component separately, which corresponds to fitting an independent weighted least squares regression to each component, though in principle one could use multivariate weighted least squares.
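A minimal implementation of the weighted least squares intercept (22), with a noiseless linear check (the input values are illustrative):

```python
import numpy as np

def bias_corrected_estimate(lams, etas, vs):
    """Weighted-least-squares intercept: extrapolate the estimates to lambda = 0.

    lams: smoothing values lambda_m..lambda_n; etas: the corresponding SMC
    estimates; vs: the variances of those estimates (weights are 1/vs).
    """
    lams, etas, vs = map(np.asarray, (lams, etas, vs))
    w = 1.0 / vs
    lam_bar = np.sum(w * lams) / np.sum(w)   # weighted mean of the lambdas
    eta_bar = np.sum(w * etas) / np.sum(w)   # weighted mean of the estimates
    slope = (np.sum(w * (lams - lam_bar) * (etas - eta_bar))
             / np.sum(w * (lams - lam_bar) ** 2))
    return eta_bar - lam_bar * slope         # intercept at lambda = 0

# Noiseless check: if eta is exactly linear in lambda, the intercept is recovered.
lams = [0.4, 0.2, 0.1, 0.05]
etas = [1.0 + 2.0 * l for l in lams]   # intercept 1.0, slope 2.0
vs = [0.1, 0.2, 0.4, 0.8]              # unequal variances
assert abs(bias_corrected_estimate(lams, etas, vs) - 1.0) < 1e-12
```

With equal variances this reduces to ordinary least squares; the unequal weights downweight the noisier estimates obtained at the smallest values of λ.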

We propose the weighted form of least squares here since, as the values of λ used in the SMC procedure approach zero, the estimators generated may increase in variance – partly due to the poorer mixing of the MCMC kernels as previously described, but also due to the gradual degeneracy of the particle set. In order to estimate the variances of estimates generated using SMC, several recent approaches have been proposed that allow this estimation to be carried out online by considering the genealogy of the particles. The first such procedure was proposed by Chan and Lai (2013), which was generalised by Lee and Whiteley (2018). More recent work by Olsson and Douc (2018) has focused on improving the numerical stability of such estimators when they are used in online settings over very long time horizons. Using any such procedure, one may estimate the variance v_p of η_p for each p.