Unbiased Bayes for Big Data: Paths of Partial Posteriors

01/14/2015 · Heiko Strathmann et al. · University of Warwick

Key quantities of interest in Bayesian inference are expectations of functions with respect to a posterior distribution. Markov Chain Monte Carlo is a fundamental tool for consistently computing these expectations by averaging samples drawn from an approximate posterior. However, its feasibility is being challenged in the era of so-called Big Data, as all data needs to be processed in every iteration. Realising that such simulation is an unnecessarily hard problem if the goal is estimation, we construct a computationally scalable methodology that allows unbiased estimation of the required expectations -- without explicit simulation from the full posterior. The scheme's variance is finite by construction and straightforward to control, leading to algorithms that are provably unbiased and naturally arrive at a desired error tolerance. This is achieved at an average computational complexity that is sub-linear in the size of the dataset, and its free parameters are easy to tune. We demonstrate the utility and generality of the methodology on a range of common statistical models applied to large-scale benchmark and real-world datasets.


1 Introduction

Markov Chain Monte Carlo (MCMC), used for sampling from posterior distributions, is one of the most fundamental tools in Bayesian data analysis. However, the recent explosion in the amount of data to be analysed poses serious challenges for this methodology, as it is often infeasible to scale it to today's statistical problems. This development has led to a recent focus on methods to make MCMC practical for large datasets. Most research thus far has focused on devising alternative Markov transition kernels based on mini-batches of the data. These approaches either (1) draw samples from an asymptotically approximate posterior distribution, reducing the amount of computation at the expense of introducing bias [Welling2011, Korattikara2014, ChenFoxGuestrin2014, Bardenet2014], or (2) preserve the asymptotically correct invariant distribution at the expense of technical requirements and mixing properties that might limit applicability in practice [Maclaurin2014]. An alternative approach is to run MCMC on small shards of the data and then construct a 'Consensus Monte Carlo' estimator [Scott2013]. At present, the Consensus Monte Carlo algorithm lacks any theoretical guarantees.

In this contribution, we propose a different view on the problem. We construct a scheme that directly estimates posterior expectations with neither simulation from the full posterior nor construction of alternative approximate transition kernels -- without introducing bias. While the variance of these estimators is naturally increased, we will show that this increase is bounded by construction and straightforward to control, in particular in the Big Data context. We arrive at the following desiderata for a useful methodology for unbiased Bayesian inference: (i) computational complexity that is sub-linear in the number of observations, (ii) controllable and bounded variance, and (iii) no problems with transition kernel design.

The presented framework is very general in the sense that it is neither restricted to a particular form of the underlying Bayesian model (such as factorising likelihoods), nor does it rely on a particular inference technique used within it. Any free parameters are easy to tune. Furthermore, it is competitive in practice: experimental examples show that we are able to accurately and confidently estimate posterior expectations faster than with simulation methods. In addition, and without exploiting any domain-specific structure, we are able to outperform state-of-the-art results obtained from (approximate) stochastic variational inference methods on large-scale real-world data [Hensman2013].

By no means do we aim to replace existing simulation (or any other) techniques for Bayesian inference. If the goal is to obtain a representation of the full posterior density, simulation remains the method of choice. Our contribution rather provides a different perspective on problems where a given Bayesian expectation lies at the core of the performed statistical analysis.

The setting

we consider is as follows. Given data $\mathcal{D} = \{x_i\}_{i=1}^{N}$, a statistical model with parameters $\theta$, and likelihood $p(\mathcal{D} \mid \theta)$, we wish to obtain an unbiased estimator of the expectation

$$\mathbb{E}_{\pi_N}[\varphi(\theta)] = \int \varphi(\theta)\, \pi_N(\theta)\, d\theta \qquad (1)$$

of a given functional $\varphi$ over the full posterior $\pi_N(\theta) \propto p(\mathcal{D} \mid \theta)\, \pi(\theta)$. While we focus on real-valued $\varphi$ here, multivariate extensions are possible. A typical way to approach this problem is to generate samples $\{\theta^{(j)}\}_{j=1}^{n}$ from $\pi_N$ using MCMC, and then compute the empirical expectation $\widehat{\mathbb{E}}[\varphi] = \frac{1}{n} \sum_{j=1}^{n} \varphi(\theta^{(j)})$. Note that the goal here is not to obtain samples from the full posterior measure -- the focus is rather on the estimation of particular expectations. For example, in a regression setting, we might be interested in predictive posterior means and variances. This is the ubiquitous end goal in many situations in which Bayesian inference is employed. Therefore, we challenge the paradigm of solely working towards posterior simulation for such estimation problems, and propose a complementary methodology.

A subtlety when dealing with MCMC-based estimates is that the empirical expectation $\widehat{\mathbb{E}}[\varphi]$ for any posterior is only asymptotically correct. Therefore, any finite-time MCMC algorithm produces estimates that contain a systematic bias, which is subsequently made arbitrarily small via careful choice of simulation parameters. Parts of our methodology are based on such estimators. For the sake of simplicity, we will here assume access to the asymptotic limit, in the same sense in which finite-time MCMC estimates are treated as 'correct'. We hint at a way to address this issue as an outlook.

Moreover, our approach is not restricted to MCMC, but easily applies to situations where expectations over $\pi_N$ are available in closed form but require prohibitive amounts of computation. Such cases can, for example, be found in large-scale spatial statistics [Lyne2013], or Gaussian Process models deployed in Big Data regimes [RasmussenCarlEdwardWilliams2006, Hensman2013].

Our work builds on several breakthroughs in related areas, such as unbiased estimation for stochastic differential equations [RheeGlynn2013] and for Markov chain equilibrium expectations [GlynnRhee2014]. These developments demonstrate the overarching principle that estimation is often an easier problem than simulation – a dictum we adopt and apply here in the context of Bayesian inference.

Paper outline

We begin by summarising previous sub-sampling based simulation approaches in Section 2. In Section 3, we construct unbiased estimators for Bayesian expectations from paths of partial posterior distributions. Section 4 contains a number of experimental illustrations where we demonstrate that our estimator is particularly useful in the Big Data regime. In Section 5, we point out a number of extensions and conduct experiments that showcase the generality of the developed framework. We close with a discussion of shortcomings and point out future work in Section 6.

2 Previous work

A practical difficulty in dealing with the full posterior $\pi_N$ is that the number of observations $N$ is often large. This renders the computation of the likelihood $p(\mathcal{D} \mid \theta)$ extremely expensive -- if not impossible. This, for example, limits the feasibility of MCMC algorithms that simulate from $\pi_N$, as they require access to the full likelihood in every iteration.

Biased MCMC

The infeasibility of exact likelihood computation has been the main focus of [Welling2011, Korattikara2014, ChenFoxGuestrin2014, Bardenet2014], where this issue is circumvented by approximations to the transition kernel in MCMC. This is done using, e.g., stochastic gradient Langevin [Welling2011] or Hamiltonian [ChenFoxGuestrin2014] approaches, or by using a noisy acceptance ratio and employing a statistical test [Korattikara2014] or concentration bounds [Bardenet2014]. The well-known issue with this vein of work is that the approximate finite step-size diffusions, defined by mini-batches of data, are no longer corrected for the induced bias. Consequently, convergence to the correct posterior measure (or indeed any measure) is no longer guaranteed. In practice, these approaches are difficult to tune, and obtaining a well-mixing chain is hard. Furthermore, artefacts of these methodologies, such as the use of parametric hypothesis testing [Korattikara2014], might even lead to over-confident accept/reject decisions in the Markov transition kernel. The latter was illustrated by Bardenet2014, who also substantially improve on these constructions by providing total variation bounds to assess the quality of the approximation.

Noisy MCMC

Recently, Alquier2014 provided quantification of the approximation quality of many 'noisy' MCMC algorithms, including the ones by Welling2011, Korattikara2014, and Bardenet2014. These results are an important step towards understanding the extent of the bias induced by employing approximate transition kernels. However, in practice their results require uniform or geometric ergodicity of the original Markov chain to explicitly quantify the approximation error [Alquier2014, Theorem 2.1] or to merely guarantee convergence [Alquier2014, Theorem 2.2], respectively. The first condition is too strong for most problems in practice, while the second does not give important details on how and when the approximate chain converges. Our work can be seen as an orthogonal approach, as we avoid simulation from the full posterior and rather directly attack the underlying estimation of the expectations of interest, i.e. (1).

Firefly MCMC

In contrast to these approximate, biased sampling methods, Firefly MCMC [Maclaurin2014] introduces an exact construction that neither introduces bias nor requires computation of the full likelihood. It is an elegant way of exploiting additional auxiliary variable structure. In the regime of likelihoods that are computationally intractable due to data size (as opposed to likelihoods that are intractable due to an inherently intractable functional defining them), it is seen as the only approach that can ensure coherence of subsequent inference through simulation from the asymptotically correct posterior. One complication with Firefly MCMC is that it requires the availability of appropriate, easily computable, and tight lower bounds on the likelihood function, tuning of which requires at least one sweep through all data. Of course, the models for which such bounds can be obtained are often precisely those relatively simple models where the need for full and exact Bayesian inference over variational or other approximations might be questionable. While investigating the formal construction of such bounds in more general model classes is a promising way forward, the generality and applicability of Firefly MCMC is clearly limited. Moreover, while Firefly MCMC can significantly reduce the number of likelihood evaluations at each iteration of MCMC, the complexity of the scheme is linear in the number of observations, as resampling of auxiliary variables is required at each iteration for a given fraction $q$ of the available data. There is a limit to how small $q$ can be chosen: mixing time is increased by a factor of at least $1/q$. This means that the reduced number of likelihood evaluations at each iteration of MCMC comes at the cost of having to run the chains longer by a factor of $1/q$.

In contrast to all but one of the previous sub-sampling schemes considered, the estimators that we propose are provably unbiased and also have a sub-linear average complexity in the number of observations. Unlike the only unbiased competitor, Firefly MCMC, our approach does not require a lower bound on the likelihood and even extends to several other situations: where posterior expectations are available in closed form but computationally infeasible, where likelihoods need not factorise across observations, and where likelihoods might themselves be unavailable in an analytic form, as for example in the context of pseudo-marginal MCMC [Andrieu2009].

3 Partial posterior path estimators

In this section, we present a different approach to coherent Bayesian inference in the Big Data regime, which exploits the paths of induced partial posterior distributions through the debiasing device developed in RheeGlynn2013, GlynnRhee2014. A similar approach was very recently taken by Agapiou2014, who exploit RheeGlynn2013's work for unbiased estimation of posterior expectations over intractable infinite-dimensional models which can be parametrised in terms of a series expansion of basis functions. In contrast, our work directly attacks the intractability that arises from large datasets. We see our contribution as a pragmatic complement to existing work on debiasing Monte Carlo estimates.

Our approach follows similar ideas to Chopin2002, who presented a sequential Monte Carlo procedure for static target distributions by exploiting a sequence of partial posterior targets. Given the full posterior $\pi_N$, we define a subset $\mathcal{D}_t = \{x_i\}_{i \in I_t} \subseteq \mathcal{D}$ of all data, where $I_t \subseteq \{1, \dots, N\}$ is a (possibly random) index set, with sizes $n_1 < n_2 < \dots < n_\tau = N$. The partial posterior corresponding to $\mathcal{D}_t$ is then $\tilde\pi_t(\theta) \propto p(\mathcal{D}_t \mid \theta)\, \pi(\theta)$. Paths of partial posterior measures can be constructed by starting from the prior and increasing the size of the batches until the whole set of observations is exhausted, i.e.

$$\pi \;\to\; \tilde\pi_1 \;\to\; \tilde\pi_2 \;\to\; \dots \;\to\; \tilde\pi_\tau,$$

where $\tilde\pi_\tau = \pi_N$ is the full posterior.

For simplicity of exposition, we here consider the case of a geometric increase in batch sizes. More precisely, we set $n_t = a\,2^{t-1}$ for $t = 1, \dots, \tau$, where $\tau = \log_2(N/a) + 1$ is the number of possible batch sizes and $a$ is the smallest batch size considered. We assume that $\log_2(N/a)$ is an integer. For example, with $a = 32$ and $N = 2^{15}$, the batch sizes are $32, 64, \dots, 2^{15}$ and $\tau = 11$. Note that common ratios other than 2 are possible, and are used in the experiments.

The next section presents the debiasing device, which is a key component in transforming estimates of expectations over multiple partial posteriors $\tilde\pi_t$ into an unbiased estimator of the expectation over the full posterior $\pi_N$.

3.1 Debiasing Lemma & algorithm

The debiasing Lemma provides a way to construct an unbiased estimator for the limit of a convergent sequence -- without evaluating all of its elements. Here, we transform a sequence of asymptotically correct, but biased, estimators into an unbiased estimator. In different contexts, the result was originally discussed independently by McLeish2011 and RheeGlynn2012. It was shown in the present form in RheeGlynn2013; see also JacobThiery2013.

Lemma 1 (Telescoping Estimator).

Let $\varphi$ and $(\varphi_t)_{t=1}^{\infty}$ be square-integrable, real-valued random variables with $\lim_{t \to \infty} \mathbb{E}\left[(\varphi_t - \varphi)^2\right] = 0$ (Agapiou2014 very recently developed a Hilbert space version of the Lemma), and let $T$ be an integer-valued random variable, independent of $\varphi$ and of $(\varphi_t)_{t=1}^{\infty}$, with $\mathbb{P}[T \ge t] > 0$ for all $t \ge 1$. With the convention $\varphi_0 = 0$, assume that

$$\sum_{t=1}^{\infty} \frac{\mathbb{E}\left[(\varphi_{t-1} - \varphi)^2\right]}{\mathbb{P}[T \ge t]} < \infty. \qquad (2)$$

Then,

$$\tilde\varphi = \sum_{t=1}^{T} \frac{\varphi_t - \varphi_{t-1}}{\mathbb{P}[T \ge t]}$$

is an unbiased estimator of $\mathbb{E}[\varphi]$, i.e. $\mathbb{E}[\tilde\varphi] = \mathbb{E}[\varphi]$. Moreover, $\mathbb{E}[\tilde\varphi^2] = \sum_{t=1}^{\infty} \frac{\bar\nu_{t-1} - \bar\nu_t}{\mathbb{P}[T \ge t]}$, where $\bar\nu_t = \mathbb{E}\left[(\varphi_t - \varphi)^2\right]$.

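The unbiasedness claim admits a short proof sketch, which we include here for convenience; condition (2) is what justifies exchanging expectation and infinite summation:

$$\mathbb{E}[\tilde\varphi] = \mathbb{E}\left[\sum_{t=1}^{\infty} \mathbb{1}\{T \ge t\}\, \frac{\varphi_t - \varphi_{t-1}}{\mathbb{P}[T \ge t]}\right] = \sum_{t=1}^{\infty} \frac{\mathbb{P}[T \ge t]\; \mathbb{E}[\varphi_t - \varphi_{t-1}]}{\mathbb{P}[T \ge t]} = \lim_{t \to \infty} \mathbb{E}[\varphi_t] = \mathbb{E}[\varphi],$$

where the second equality uses the independence of $T$ from the sequence, the third uses the telescoping of the partial sums, and the last uses $L^2$ (and hence $L^1$) convergence of $\varphi_t$ to $\varphi$.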

Finite variance for unbiased Bayesian expectations

The above Lemma 1 is directly applicable in our context. We set $\varphi_t = \widehat{\mathbb{E}}_{\tilde\pi_t}[\varphi(\theta)]$ for $t \le \tau$, where the empirical expectation is computed by, e.g., MCMC on the partial posterior $\tilde\pi_t$ (we realise that empirical expectations computed with MCMC are technically biased and will comment on this in the next Section), and $\varphi_t = \varphi_\tau$ for $t > \tau$. In the finite data regime, the conditions of the above Lemma are trivially satisfied since we set $\varphi_t = \varphi_\tau$ almost surely for $t \ge \tau$. The variance of the estimators is thus finite by construction, the truncation variable $T$ needs only to be supported on $\{1, \dots, \tau\}$, and all infinite sums can be replaced with sums up to $\tau$. However, the variance might still increase with $N$ without bound. Therefore, in order to ensure stability of the estimators in the Big Data regime, we here require a condition analogous to (2) that guarantees the variance remains bounded as $N \to \infty$, i.e.

$$\sum_{t=1}^{\infty} \frac{\mathbb{E}\left[(\varphi_{t-1} - \varphi)^2\right]}{\mathbb{P}[T \ge t]} = \mathcal{O}(1) \quad \text{as } N \to \infty. \qquad (3)$$

Intuitively, we require that the tail of the stochastic truncation variable $T$ matches the rate of convergence of the partial posterior expectations. See the next Section and Appendix A for details on a simple setup where this condition holds, along with a way to tune the truncation distribution $\mathbb{P}[T = t]$.

Note also that, in the same fashion as in RheeGlynn2013, one can replicate the random truncation procedure $R$ times and thus reduce variance. More precisely, $\bar\varphi_R = \frac{1}{R} \sum_{r=1}^{R} \tilde\varphi_r$ is an unbiased estimator of $\mathbb{E}_{\pi_N}[\varphi(\theta)]$ and its variance scales as $\mathcal{O}(1/R)$. Here, the $\tilde\varphi_r$ are independent copies of $\tilde\varphi$, and each is computed on a different partial posterior path. This implies that the scheme can be repeated until a desired error tolerance is attained. The latter can be estimated from the empirical variance of the $\tilde\varphi_r$.

Algorithm 1 summarises our approach, and Figure 1 illustrates both the construction of the $\tilde\varphi_r$ from partial posterior paths, and their distribution.

Input: Discrete distribution $\mathbb{P}[T = t]$ over $\{1, \dots, \tau\}$, corresponding to batch sizes $n_1 < \dots < n_\tau = N$
For $r = 1, \dots, R$ (number of replications), or
while not achieved desired error tolerance

  • Sample truncation variable $T_r \sim \mathbb{P}[T = t]$

  • For $t = 1, \dots, T_r$

    • Compute $\varphi_t$ (expectation over the partial posterior on a batch of size $n_t$), e.g. by running MCMC on $\tilde\pi_t$

  • Compute the debiased estimate $\tilde\varphi_r = \sum_{t=1}^{T_r} \frac{\varphi_t - \varphi_{t-1}}{\mathbb{P}[T \ge t]}$

Return: the average of the debiased estimates, $\bar\varphi_R = \frac{1}{R} \sum_{r=1}^{R} \tilde\varphi_r$.

Algorithm 1 Debiasing posterior expectations
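The following is a minimal Python sketch of Algorithm 1. It is not the implementation used in our experiments: `partial_posterior_expectation` is a hypothetical user-supplied routine (e.g. wrapping an MCMC run on the given batch), we assume nested batches along each path via a random permutation, and the truncation distribution takes the geometric form $\mathbb{P}[T = t] \propto 2^{-\alpha t}$ used later in the paper.

```python
import numpy as np

def debiased_expectation(data, partial_posterior_expectation, a=32,
                         alpha=0.6, R=100, rng=None):
    """Unbiased estimate of a posterior expectation from partial posterior paths.

    data: NumPy array of observations, with len(data) / a a power of 2.
    partial_posterior_expectation(batch): estimate of the expectation under
    the partial posterior of `batch`, e.g. via MCMC.
    Returns the replication average and its standard error.
    """
    rng = rng or np.random.default_rng()
    N = len(data)
    tau = int(np.log2(N / a)) + 1                # batch sizes a, 2a, ..., N
    sizes = a * 2 ** np.arange(tau)

    p = 2.0 ** (-alpha * np.arange(1, tau + 1))  # P[T = t], truncated at tau
    p /= p.sum()
    p_geq = np.cumsum(p[::-1])[::-1]             # survival function P[T >= t]

    estimates = np.empty(R)
    for r in range(R):
        T = rng.choice(tau, p=p) + 1             # sampled truncation index
        idx = rng.permutation(N)                 # one random path; nested batches
        phi_prev, est = 0.0, 0.0
        for t in range(1, T + 1):
            phi_t = partial_posterior_expectation(data[idx[:sizes[t - 1]]])
            est += (phi_t - phi_prev) / p_geq[t - 1]
            phi_prev = phi_t
        estimates[r] = est                       # one debiased replication

    return estimates.mean(), estimates.std(ddof=1) / np.sqrt(R)
```

The returned standard error provides the stopping criterion of Algorithm 1: replications can be added until it falls below the desired tolerance.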

In summary, the key property of the described methodology is that full posterior expectations over $\pi_N$ can be estimated with no bias introduced and only a bounded increase in variance. This is achieved by using sub-samples of the available data -- at a sub-linear average computational cost, as we will see next.

Figure 1: Illustration of Algorithm 1 for the posterior mean of a 2D Gaussian with unknown mean and fixed covariance. We aim to compute the posterior mean of the model parameters. Debiasing computes multiple partial posterior paths (coloured solid lines), which are randomly truncated (solid line stops), and then plugged into the debiasing estimator of Lemma 1 to estimate the posterior mean of both components (coloured round dots; dotted lines connect each path end-point to its estimate). The procedure is averaged over $R$ replications (gray dots), after which the empirical mean matches the full posterior mean. A kernel density estimate of the gray dots is shown in the background.

3.2 Practical considerations

We now list several properties, implications, and key advantages of our scheme.


Computational costs and variance

Let us denote by $C$ the time required to generate a single debiasing estimator $\tilde\varphi$. Since computing $\tilde\varphi$ requires running MCMC chains on batches of sizes $n_1, \dots, n_T$, $C$ scales linearly with the overall number of likelihood evaluations, resulting in an average time complexity proportional to $\mathbb{E}\left[\sum_{t=1}^{T} n_t\right]$. If the batch-size increase is geometric, i.e. $n_t = a\,2^{t-1}$, the cost of a path truncated at $T$ becomes $\mathcal{O}(a\,2^T)$. By matching this with truncation probabilities $\mathbb{P}[T = t] \propto 2^{-\alpha t}$, for $\alpha \in (0, 1)$, we obtain an average cost of $\mathcal{O}(a^{\alpha} N^{1-\alpha})$, which is sub-linear in $N$; see also Appendix A. This cost reflects the amount of computation when only a single core is available, and the trivial parallelisation of the scheme allows further savings, as described below.
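To make the sub-linearity concrete, here is the cost computation in sketch form, with the normalisation of $\mathbb{P}[T = t]$ and constant factors suppressed, and using $2^{\tau} = 2N/a$:

$$\mathbb{E}[C] \;\propto\; \sum_{t=1}^{\tau} \mathbb{P}[T = t] \sum_{s=1}^{t} a\,2^{s-1} \;\le\; \sum_{t=1}^{\tau} 2^{-\alpha t}\, a\,2^{t} \;=\; a \sum_{t=1}^{\tau} 2^{(1-\alpha)t} \;=\; \mathcal{O}\!\left(a\,2^{(1-\alpha)\tau}\right) \;=\; \mathcal{O}\!\left(a^{\alpha} N^{1-\alpha}\right).$$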

The variance of the debiasing estimator depends on the rate of convergence of the partial posterior expectations. In order to ensure that the variance stays bounded as $N$ increases, assume that there exist a constant $c > 0$ and a rate $\beta > 0$, such that for large enough $n_t$ and $N$:

$$\mathbb{E}\left[(\varphi_t - \varphi)^2\right] \le \frac{c}{n_t^{\beta}}.$$

From here, as shown in Appendix A, (3) is satisfied and therefore the variance remains bounded as long as $\alpha < \beta$. Thus, fast convergence of partial posterior expectations, e.g. $\beta$ close to $1$, can result in significant speed-ups of the scheme. We give examples of empirical fits for $\beta$ in Appendix A.

Tuning truncation probabilities

We now describe how to tune the free parameters of the scheme. Following GlynnWhitt1992, if both the average time complexity $\mathbb{E}[C]$ and the variance $\sigma^2 = \operatorname{Var}[\tilde\varphi]$ are finite, a central limit theorem holds in the limit where the computational budget $b \to \infty$. Namely, for a given computational budget $b$, if we denote by $R(b)$ the number of debiasing replications that can be generated in time $b$, and by $\bar\varphi_{R(b)}$ the resulting average of the debiasing replications, then

$$\sqrt{b}\left(\bar\varphi_{R(b)} - \mathbb{E}_{\pi_N}[\varphi(\theta)]\right) \;\Rightarrow\; \sqrt{\mathbb{E}[C]\,\sigma^2}\;\mathcal{N}(0, 1). \qquad (4)$$

Thus, the asymptotic efficiency of the debiasing estimator is characterised precisely by the work-variance product $\mathbb{E}[C]\,\sigma^2$. Figure 2 demonstrates how the distribution of the stochastic truncation variable (here in the parametric form $\mathbb{P}[T = t] \propto 2^{-\alpha t}$) can be optimised in order to yield minimal asymptotic variance of the estimators; see also Appendix A.
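A short numerical sketch of this tuning step follows, assuming that the convergence-rate constants $c$ and $\beta$ of the bound above have already been estimated from a few partial posteriors; the function names and default values are illustrative only.

```python
import numpy as np

def work_variance_product(alpha, a, N, c, beta):
    """Proxy for E[C] * Var under P[T = t] ∝ 2^(-alpha t), truncated at tau."""
    tau = int(np.log2(N / a)) + 1
    t = np.arange(1, tau + 1)
    p = 2.0 ** (-alpha * t)
    p /= p.sum()
    p_geq = np.cumsum(p[::-1])[::-1]
    expected_cost = np.sum(p * a * (2.0 ** t - 1))   # likelihood evaluations
    nu = c / (a * 2.0 ** (t - 1)) ** beta            # bound on E[(phi_t - phi)^2]
    nu_prev = np.concatenate(([c], nu[:-1]))         # nu_0 ~ E[phi^2]; crude proxy
    second_moment = np.sum((nu_prev - nu) / p_geq)   # from Lemma 1
    return expected_cost * second_moment

# Grid search for the truncation exponent minimising the work-variance product
alphas = np.linspace(0.05, 1.5, 300)
best = min(alphas, key=lambda al: work_variance_product(al, a=32, N=2**20,
                                                        c=1.0, beta=0.8))
print(f"best alpha: {best:.2f}")
```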

Figure 2: Complexity-variance tradeoff as a function of the truncation exponent $\alpha$. Optimising these quantities for a fixed minimum batch size gives a best $\alpha$. The inset shows the region around the minimum (with a different scaling on both axes) and reveals that the minimum is well defined.

MCMC and finite time bias

Any empirical expectation computed from a finite MCMC run is only correct in the asymptotic limit. Consequently, setting $\varphi_t = \widehat{\mathbb{E}}_{\tilde\pi_t}[\varphi(\theta)]$ for samples drawn from a finite-time Markov chain is problematic, as the unbiasedness of the overall approach is technically corrupted. However, the same is true for any MCMC estimate. In practice, careful tuning of simulation parameters such as burn-in, thinning, and running multiple chains [gelman1992inference] is successfully used to reduce finite-time biases to a negligible level. We adopt this mindset here for the sake of practicality and simplicity of presentation. A way to address the issue could be to apply debiasing to the Monte Carlo estimate itself. In Lemma 1, set $\varphi_k$ to the estimate obtained from an approximate (not yet converged) Markov chain after $k$ iterations, as opposed to samples from the asymptotically invariant distribution. This gives a sequence $(\varphi_k)_{k=1}^{\infty}$. Via applying the debiasing Lemma, an unbiased estimator can be constructed for any partial posterior's expectation. In a second stage, these debiased partial posterior expectation estimates can be used to estimate the full posterior expectation -- now fully unbiased. Unfortunately, as $(\varphi_k)_{k=1}^{\infty}$ is an infinite sequence, the variance expression in Lemma 1 is no longer trivially guaranteed to be finite. See GlynnRhee2014, McLeish2011, Agapiou2014 for an explicit and in-depth treatment of unbiased Monte Carlo estimation.

MCMC and mixing time

If a Markov chain is, in line with the above considerations, used for computing partial posterior expectations $\varphi_t$, it need not be induced by any form of approximation, noise injection, or state-space augmentation of the transition kernel. As a result, the notorious difficulties of ensuring acceptable mixing and the problems of stickiness that such augmented or approximate kernels bring are conveniently side-stepped -- in sharp contrast to all existing approaches. Furthermore, while the latter are limited to particular MCMC schemes (random walk in [Bardenet2014], Langevin in [Welling2011]), any MCMC procedure can be employed in our construction. This allows us to harvest the decades of mathematical and engineering effort that went into both methodology and software packages (e.g. Stan [stan2014]). Mixing time when using MCMC to estimate a partial posterior's expectation is not compromised by our approach, in contrast to, for example, Firefly MCMC, whose mixing time degrades as the mini-batch size decreases. MCMC chains over partial posteriors do not suffer from such problems; indeed, they are in practice often easier to handle due to their similarity to the (usually simply structured) prior distribution.


Parallelisation

As the computation required for each partial posterior in a single path can be performed independently, the method parallelises embarrassingly, and expectations computed in parallel only need to be combined in a telescoping sum. The same holds true for replications of the scheme. Since the wall-time within each truncated posterior path is dictated by the largest batch that needs to be processed, parallelising within a path with geometric batch-size increases roughly halves the required wall-time. Therefore, with sufficient computational resources, the potential speed-up factor through parallelisation is of order $2R$, where in practice the number of replications $R$ is usually in the 100s to 1000s, as we will see in the experiments.

4 Experiments

In this section, we illustrate the utility of the debiasing approach and compare it against other unbiased approaches: full posterior sampling and Firefly MCMC. In particular, we show that for large-scale datasets, debiasing can accurately and confidently estimate posterior expectations before full MCMC and Firefly have produced a single estimate.

It is clear that running an MCMC chain on the full posterior produces, for any statistic, more accurate estimates than the debiasing approach, which by construction has an additional intrinsic source of variance. This means that if it is possible to produce even a single MCMC sample (after burn-in), the resulting posterior expectation can be estimated with lower expected error. It is therefore not instructive to compare the approaches in that regime.

On comparing to Firefly MCMC

For a fair comparison of our method to Firefly MCMC, we give an estimate of the number of likelihood evaluations necessary for Firefly MCMC to produce its first sample -- for which there are two notable obstacles. The first is computing a maximum a posteriori (MAP) estimate to initialise a lower bound on the likelihood: Maclaurin2014 reported Firefly's performance to be inferior to standard MCMC otherwise. For large datasets, MAP estimates are challenging, as a standard gradient-based optimisation scheme such as BFGS needs multiple evaluations of the full likelihood. For example, Stan's BFGS implementation on a commonly used benchmark dataset, a9a [Welling2011, Lin2008], takes around 40 iterations to reach a reasonable convergence level. This issue can be somewhat side-stepped by using stochastic gradient descent. However, given a MAP estimate, a one-off cost of $\mathcal{O}(N)$, i.e. computing sufficient statistics of all data [Maclaurin2014], cannot be avoided. This is challenging for extremely large datasets. Furthermore, Firefly is based on binary indicator variables that determine whether a point in a factorised likelihood is used for an MCMC update (bright) or not (dark). The second obstacle comes from Firefly's parameter $q$, the probability of brightening a dark point. First, an expected $qN$ points need to be evaluated in each iteration, which is linear in $N$. Second, mixing time suffers by a factor of at least $1/q$, which means that the burn-in time and the number of MCMC iterations need to be multiplied by that factor to compare to full MCMC in a fair way. Together, this means that Firefly MCMC needs roughly the same number of likelihood evaluations as full MCMC before it produces its first sample -- implying that our comparisons to full MCMC directly apply to Firefly MCMC as well.

We now provide a number of examples, in which we analyse convergence of the debiasing estimator up to the number of likelihood evaluations necessary for full MCMC to produce a single sample. All estimates are given as a function of the number of likelihood evaluations needed to compute them (including burn-in). Note that, in favour of the competing methods, we do not take parallelisation into account, which (given appropriate hardware) would increase the effective number of likelihood evaluations per unit time by a factor of up to $2R$.

4.1 Synthetic log-Gaussian

We first consider a toy model from Bardenet2014, but with more data (attempting to resist the commonly followed temptation of applying large-scale methodology to only medium-sized datasets): rather than the original dataset size, we generate a larger number of observations from a log-Gaussian model, $x_i \sim \exp(\mathcal{N}(\mu, \sigma^2))$. Using flat priors, we sample from the joint posterior over $(\mu, \sigma)$ and aim to estimate the marginal posterior mean of $\sigma$. This posterior has extremely wide tails, which causes problems for MCMC methods based on approximate transition kernels. In particular, Bardenet2014 illustrate that even when using an appropriate setting of the mini-batch size, Korattikara2014's scheme (confidently) gives completely wrong results. Bardenet2014's sampler is able to recover the model's standard deviation, but this comes at the cost of using almost all available data in every MCMC iteration.

This happens despite the fact that such simple posterior expectations converge rapidly. Figure 3 shows convergence of the partial posterior mean of $\sigma$ as a function of mini-batch size. Even though all such estimates are biased, the plot reveals that running multiple MCMC chains on subsets of constant size and averaging quickly gives a small estimation error. This raises the question of whether manipulating the Markov transition kernel is the best way of addressing such problems.

Debiasing, in contrast, is a way of exploiting the rapid convergence of posterior expectations while remaining unbiased -- as demonstrated in the upper part of Figure 4, where we show that we can recover the true model parameter confidently and quickly. We run a number of debiasing replications with a small minimum batch size and a geometric truncation distribution. Each of the partial posterior expectations is computed via MCMC (we use the NUTS sampler [stan2014], assuming a constant number of leap-frog iterations in HMC), with a fixed number of iterations after burn-in. We count the number of likelihood evaluations for each partial posterior, taking into account the MCMC iterations. We run full MCMC with the same number of iterations and burn-in (note that this is in favour of full MCMC, as partial posterior distributions should be explorable in fewer iterations), counting $N$ likelihood evaluations per MCMC iteration, with an offset for the burn-in iterations.

Remarkably, as Figure 4 indicates, the largest partial posterior size drawn across all replications remained a small fraction of $N$, with an even smaller median single-replication cost, as depicted in Figure 4. Taking into account the MCMC iterations used to estimate each partial posterior expectation, the total number of likelihood evaluations over all replications is less than a quarter of the cost of a single full MCMC iteration ($N$ likelihood evaluations), and a small fraction of the number of likelihood evaluations required to complete the burn-in of full MCMC.

Figure 3: Convergence of the partial posterior mean of $\sigma$ as a function of sub-sample size for Bardenet2014's log-Gaussian model. We randomly sub-sample data from the dataset and run MCMC to estimate the marginal posterior mean of $\sigma$. Error bars are 95% over 150 trials.
Figure 4: Convergence and data usage for the debiased estimate of the posterior mean of $\sigma$ for Bardenet2014's log-Gaussian model. Top: Debiasing estimates converge to the true posterior statistic quickly. Bottom: Small per-replication and cumulative data usage.

The fast convergence of posterior expectations seen in Figure 3 is not an artefact of the above model. As we demonstrate in the next experiment, such situations arise in more involved inference tasks as well.

4.2 Large-scale logistic regression

We now apply our methodology to a large-scale Bayesian logistic regression problem. As full posterior simulation is infeasible at this data size, we choose a synthetic dataset in order to be able to quantify the estimation error.

We model binary labels $y_i$ of observations $x_i$ with $d$ features as $p(y_i = 1 \mid x_i, w) = g(w^\top x_i)$, where $g$ is the logistic function and the weights $w_j$ have independent Laplace priors with a scale of 1. The bias parameter is absorbed into $w$ and $x_i$. To generate data, we sample covariates $x_i$ and label them positively with probabilities $g(w^\top x_i)$, where the true regression weights are fixed in advance.

As in the previous example, simple posterior statistics, such as the mean regression weights $\mathbb{E}_{\pi_N}[w_j]$, converge quickly: Figure 5 (top) reveals that the statistics do not significantly change when computed from randomly sub-sampled mini-batches larger than 10000, which is 4 orders of magnitude smaller than $N$. Note that it is not possible to run even a single MCMC chain on the full dataset -- in contrast to our debiasing approach.

We apply our debiasing estimator, using a small minimum batch size and a geometric batch-size increase, and run Stan's NUTS sampler for 500 iterations after a burn-in of 100 iterations. Figure 5 (bottom) shows examples of the convergence of the debiasing estimator over the replications. Taking into account the 500+100 inherent MCMC iterations, both the median and the average data usage per replication remain a vanishing fraction of the dataset. Summing over all replications, debiasing uses fewer likelihood evaluations than the 100 burn-in iterations of a single full MCMC chain would require. This means that a full MCMC chain would not even have passed burn-in by the time debiasing has already converged close to the ground-truth posterior statistic. We stress that this comparison is extremely conservative: given appropriate computational resources, parallelisation allows for a further large speed-up, meaning that we can reduce error bars by an additional large factor without increasing wall-time.

Figure 5: Logistic regression on synthetic data. Top: Posterior mean convergence as a function of mini-batch size, for all 9 regression weights and the bias term (the outlier in the plot). They converge from about 10000 data points (averaged over 100 trials). Note how the bias term compensates for small weights in the low-data region of the plot. Bottom: Debiasing convergence as a function of the number of replications $R$, for the last 3 regression weights and the bias term. The other regression weights behave similarly. 95% error bars; dashed lines indicate the ground truth from the above plot. Likelihood evaluations are counted without taking parallelisation into account.

5 Extensions

We now describe two extensions of our framework that illustrate its generality compared to other sub-sampling based approaches, and give experimental illustration. This includes an experiment where we are able to outperform stochastic variational inference for Gaussian Processes on a large-scale real-world dataset.

5.1 Likelihoods need not factorise

The debiasing device for constructing unbiased estimators of posterior expectations does not require the likelihood to factorise, i.e. that

$$p(\mathcal{D} \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta).$$

We simply require access to partial likelihoods $p(\mathcal{D}_t \mid \theta)$ for a given batch size $n_t$. To the best of our knowledge, this is in sharp contrast to all other methods available for MCMC in the Big Data regime, where the likelihood has to be computable and is typically assumed to factorise. As such, debiasing over partial posterior paths can also be applied to cases where posterior distributions are available in closed form -- but only at a prohibitive computational cost.

Approximate Gaussian Process regression

is a typical example of a non-factorising likelihood. We focus on the simple case of the predictive posterior in Gaussian Process (GP) regression,

$$p(f^* \mid \mathbf{y}) = \mathcal{N}\!\left(\mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y},\;\; k(x^*, x^*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*\right), \qquad (5)$$

where $\mathbf{K}$ is the covariance function $k$ evaluated at all pairs of training covariates, $\mathbf{k}_* = [k(x^*, x_1), \dots, k(x^*, x_N)]^\top$, and $\sigma_n^2$ is the observation noise variance [RasmussenCarlEdwardWilliams2006, Section 2]. This requires the inversion of an $N \times N$ covariance matrix and therefore costs $\mathcal{O}(N^3)$ computation. Note that the predictive mean and variance here can be computed exactly -- no MCMC simulation is required. With the debiasing approach, it suffices to look at the expectations of partial predictive posteriors, which are again based on sub-sampling the available covariates and labels. As each evaluation then requires $\mathcal{O}(n_t^3)$ computation, the average computational cost is $\sum_{t=1}^{\tau} \mathbb{P}[T = t]\,\mathcal{O}(n_t^3)$, with $n_t = a\,2^{t-1}$ and $\mathbb{P}[T = t] \propto 2^{-\alpha t}$ as before.

The above, however, can still be infeasible in practice, and further savings can be obtained by applying the debiasing to the primal form of (5), in combination with an explicit finite-rank representation of the kernel function, $k(x, x') \approx \phi(x)^\top \phi(x')$ with $\phi(x) \in \mathbb{R}^m$, e.g. inducing variables [QuinoneroCandela2005], random Fourier features [Rahimi2007], or incomplete Cholesky [Fine2001]. By performing Bayesian linear regression in this explicit (approximate) feature space (assuming a standard normal prior on the weights), the posterior for a single test feature $\phi_* = \phi(x^*)$ becomes

$$p(f^* \mid \mathbf{y}) = \mathcal{N}\!\left(\sigma_n^{-2}\,\phi_*^\top \mathbf{A}^{-1} \mathbf{\Phi}^\top \mathbf{y},\;\; \phi_*^\top \mathbf{A}^{-1} \phi_*\right), \qquad \mathbf{A} = \sigma_n^{-2}\,\mathbf{\Phi}^\top \mathbf{\Phi} + \mathbf{I}, \qquad (6)$$

with feature matrix $\mathbf{\Phi} \in \mathbb{R}^{N \times m}$. Evaluation now requires a reduced cost of $\mathcal{O}(N m^2 + m^3)$. Having a cost linear in $N$ for obtaining each expectation gives a debiasing average computational cost that is again sub-linear in $N$ for a fixed feature-space dimension $m$. As before, each of the mini-batches can be processed in parallel.
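As an illustration of the primal-form computation, here is a minimal Python sketch of the partial predictive mean with random Fourier features; the helper names are ours, the kernel and prior choices match the assumptions above, and the debiasing wrapper of Section 3 can call `predictive_mean` on each sub-sampled batch along a path.

```python
import numpy as np

def rff_features(X, omega, b):
    """Random Fourier features: phi(x) = sqrt(2/m) * cos(x @ omega + b)."""
    m = omega.shape[1]
    return np.sqrt(2.0 / m) * np.cos(X @ omega + b)

def predictive_mean(X_batch, y_batch, X_test, omega, b, noise_var=0.1):
    """Predictive mean of Bayesian linear regression in RFF space, cf. (6)."""
    Phi = rff_features(X_batch, omega, b)        # (n_t, m) feature matrix
    Phi_star = rff_features(X_test, omega, b)    # (n_test, m)
    m = Phi.shape[1]
    A = Phi.T @ Phi / noise_var + np.eye(m)      # O(n_t * m^2)
    w_mean = np.linalg.solve(A, Phi.T @ y_batch) / noise_var
    return Phi_star @ w_mean

# Fixed random basis approximating a unit length-scale Gaussian kernel
rng = np.random.default_rng(0)
d, m = 1, 200
omega = rng.normal(size=(d, m))                  # frequencies ~ N(0, I)
b = rng.uniform(0.0, 2.0 * np.pi, size=m)        # phases ~ U[0, 2*pi]
```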

Experimental illustration

To illustrate the above idea, we generate toy data for a univariate non-linear regression problem in an approximate feature space given by Rahimi2007's random Fourier features. Note that this exactly corresponds to Bayesian linear regression with the mapped features. More specifically, we choose a Gaussian kernel with unit length scale, whose associated random feature space of dimension $m$ is given by the mapping $\phi(x) = \sqrt{2/m}\,\cos(\omega^\top x + b)$, where the frequencies $\omega$ and phases $b$ are randomly drawn once and then fixed, and the covariates are randomly spread over the input domain. We sample a set of training labels from the corresponding approximate GP prior, add observation noise, and resample the feature space basis $(\omega, b)$. We then fit the data and compute the predictive mean from equation (6) for a set of randomly chosen test covariates with test features $\phi(x^*)$. The ground truth is chosen to be the predictive mean using all data. (Note that this is different from the predictive mean of an exact GP, but it suffices for illustration purposes here, as the MSE is zero by construction when all data is used.)

As before, we begin by exploring convergence of the desired posterior statistic, here averaged over multiple test features $\phi(x^*)$. For a given partial posterior size, we repeatedly sub-sample observations and compute the predictive mean. Figure 6 (top) shows convergence of the mean squared error (MSE) of the predictive means over all test features as a function of the partial posterior size. The MSE only gets close to zero when almost all data is used. This is unlike the previous examples, and shows that the functional corresponding to GP regression, i.e. equation (5), is more complicated.

We apply the debiasing scheme and compute the average computational complexity, i.e. the average size of all partial posteriors of a single debiasing replication. For that same complexity, Figure 6 (top) shows the MSE obtained if we were instead to average predictions over multiple constant-sized mini-batches. In contrast, Figure 6 (bottom) reveals that debiasing achieves a much better MSE at the same average computational cost.

To our knowledge, none of the other approximate or exact sub-sampling-based MCMC schemes can be applied to this example. We therefore resort to comparing against a popular approximate inference method for such GP models.

Figure 6: Top: Convergence of the MSE on test features as a function of the data used for training. We randomly sub-sample the available data, train the GP, and compute the test MSE. 95% error bars obtained by averaging over 200 trials. Note that the MSE eventually vanishes, as we compare against the predictive mean obtained using all data. Bottom: MSE as a function of debiasing replications. We compute multiple debiasing estimates for a given average cost and plot their running mean. The given average cost corresponds to a vertical slice in the above plot: averaging predictions of constant-sized mini-batches at that cost gives a large MSE, while debiasing with the same average cost reaches almost zero error.

5.2 Comparing to stochastic variational inference on real-world data

Another way to approach Gaussian Processes in the Big Data context is via stochastic variational inference (GP-SVI). Hensman2013 combine a decade's work on sparse GPs, variational bounds, and stochastic gradient descent to fit huge GP models in a streaming fashion. The clever usage of a number of approximations allows them to cut the computational cost from $\mathcal{O}(N^3)$ down to $\mathcal{O}(m^3)$ per update, where $m$ is the number of inducing variables, with constants depending on the number of iterations and the mini-batch size.

Airtime delays

We apply a combination of random Fourier features and debiasing (as in the previous Section) to the real-world problem of predicting arrival time delays in flight records [Hensman2013, Section 4.3]. The data consist of 8-dimensional covariates and real-valued labels. We aim to estimate the predictive mean of a GP at randomly chosen test covariates. We use the exponentiated quadratic covariance function, with a finite-rank expansion via random Fourier features. For the sake of simplicity, we do not apply a different length scale to each dimension and include no bias term; instead, we centre the data and re-scale it to unit variance in a preprocessing step. (Indeed, we were not able to obtain significant differences working with varying length scales or other hyper-parameters. Furthermore, Hensman2013 do not report predictive variance, for which tuning such parameters is more essential. Random Fourier features are easily adapted to such covariance functions.) We match the number of random Fourier features to the number of inducing points in the GP-SVI experiment.

For debiasing, we use a small minimum batch size, and set the truncation distribution to match a fixed average computational cost for each of the replications, which is an order of magnitude less than the mini-batch size times the number of iterations used in the GP-SVI experiment.

Remarkably, as shown in Figure 7, debiasing outperforms Hensman2013: it achieves a lower root mean squared error (RMSE) than GP-SVI. Care has to be taken when drawing conclusions from these RMSE comparisons -- both methods are likely to be improved by tuning, the full experimental protocols are not available, there are slight differences in the finite-rank approximations, etc. Instead, we make the point that debiasing achieves competitive performance. Moreover, while GP-SVI is highly engineered for these very same GP regression models, debiasing is a general method for estimation in Bayesian inference -- with GP regression being only one of its applications.

While this example is promising, we leave a thorough comparison with streaming variational Bayes for future work.

Figure 7: Debiasing for the airtime delays dataset. Top: Convergence of the RMSE as a function of mini-batch size using random Fourier features. The results using all data are slightly better than those reported in Hensman2013, supporting our choice of hyper-parameters. Bottom: With an order of magnitude less average computational cost than GP-SVI (see text), we are able to reproduce a comparable RMSE on randomly chosen test covariates. In particular, as in the previous GP example (Figure 6), we obtain a better RMSE than averaging predictions obtained from constant batch sizes with the same computational cost. 95% error bars are computed over 20 repetitions.

5.3 Reducing bias in streaming applications

The debiasing formalism is easily applicable in scenarios where the amount of data is unknown or unlimited, e.g. in a streaming scenario. Previously, we discussed targeting the posterior given a fixed number of observations $N$, and constructing the stochastic truncation variable such that a small probability remains that all $N$ observations are used for computing the desired expectation. In the streaming scenario, by contrast, we are unable to process all observations at once, and the nature of the problem forces us to process observations in batches -- which are discarded afterwards. Debiasing is still possible: we fix a worst-case budget $N_{\max}$, which is the largest number of observations that can be processed at a time (e.g. guided by hardware restrictions). $N_{\max}$ then replaces $N$ from the static case: the stochastic truncation variable allows processing at most $N_{\max}$ observations. This means that a bias with respect to the full posterior (over a dataset of unknown size) still remains. However, under the convergence assumption (7) it is of the order $\mathcal{O}(N_{\max}^{-\beta/2})$, and is therefore subsumed by the error bars over the replications, which are of the order $\mathcal{O}(1/\sqrt{R})$.

Note that in the streaming scenario, no fully unbiased scheme is available.
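A minimal sketch of one streaming replication, reusing the structure of the Algorithm 1 code from Section 3 (the `stream` iterator and helper names are illustrative; the truncation support is capped by the budget $N_{\max}$):

```python
import numpy as np
from itertools import islice

def streaming_replication(stream, partial_posterior_expectation,
                          a, alpha, N_max, rng):
    """One debiasing replication from a data stream; truncation capped at N_max."""
    tau_max = int(np.log2(N_max / a)) + 1          # largest batch index in budget
    t = np.arange(1, tau_max + 1)
    p = 2.0 ** (-alpha * t)
    p /= p.sum()
    p_geq = np.cumsum(p[::-1])[::-1]               # P[T >= t]
    T = rng.choice(tau_max, p=p) + 1
    batch = np.array(list(islice(stream, a * 2 ** (T - 1))))  # consume n_T points
    phi_prev, est = 0.0, 0.0
    for s in range(1, T + 1):
        phi_s = partial_posterior_expectation(batch[: a * 2 ** (s - 1)])
        est += (phi_s - phi_prev) / p_geq[s - 1]
        phi_prev = phi_s
    return est                                     # batch is discarded afterwards
```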

Toy example

We compare the debiasing scheme with a constant-batch scheme on a simple posterior mean estimation problem in a Gaussian model with known variance and a conjugate Gaussian prior on the mean. Results are given in Figure 8. They show that the debiasing estimator is less biased and has more appropriate error bars, whereas the constant batch-size approach is over-confident and strongly biased. The constant batch size was chosen to make the computational costs of the two schemes comparable. Results show estimates towards the end of the replications, after each scheme has streamed the same number of data points.

Figure 8: Reduced bias for mean estimation in a Gaussian model with known variance. Our estimator is less biased and has more appropriate error bars than averaging over constant batch sizes, which is over-confident and biased. Both approaches have the same average computational cost.

6 Discussion

In this section, we present shortcomings and open problems, both conceptual and experimental, that have to be addressed in future work. We close by summarising our contributions.

Bias from memory restrictions

In order for the telescoping estimator of Lemma 1 to be unbiased, one needs to assign a non-zero probability to each of the possible values of the truncation variable $T$ -- and the resulting partial posterior expectations need to be computable in finite time. In all presented examples, sampling large values of $T$ only results in a long runtime. However, such large $T$ might also result in partial posterior statistics that are impossible to estimate due to restrictions of the available computing resources. A common example is memory limits arising from large Gaussian covariance matrices; see for example [Lyne2013]. We side-stepped such problems in our experiments by using a finite-rank kernel expansion. In general, however, our estimator is not unbiased in cases where partial posteriors exceed available machine memory. In practice, allowing a fixed computational budget and tweaking the truncation distribution such that larger values are almost never sampled yields good results. Developing a more sophisticated solution is left for future work.

Convergence on short posterior paths

In experiments on smaller datasets, we could not beat full posterior sampling in terms of estimation error per computation time. Only when the sub-linear average computational complexity is significantly less than $N$ does debiasing outperform MCMC. It will be interesting to study the connection between data size and the truncation parameter for different classes of posterior functionals $\varphi$.

Figure 9 shows results for (sparse) logistic regression on the a9a dataset [Lin2008, Welling2011]. Using the same model as in Section 4.2, we aim to estimate the posterior mean of the first regression weight. Note that full posterior sampling on this dataset takes days. Figure 9 (top) shows convergence of the partial posterior statistics, which tend to stabilise from about 1000 data points. Convergence of debiasing, Figure 9 (bottom), behaves well at first sight. However, as $N$ is relatively small here, the probability of sampling a partial posterior path whose truncation includes the whole dataset is relatively high. In the presented debiasing run, this happens around replication 100. As this amounts to full posterior sampling, debiasing is pointless in this regime. Unfortunately, convergence at this point has not yet reached an acceptable level.

Figure 9: Logistic regression on the a9a dataset. We estimate the posterior mean of the first regression weight for increasing data size. Top: Convergence of partial posterior statistics. Bottom: Convergence of the debiasing estimator looks promising. However, at around replication 100, full posterior sampling is performed.

Summary

We presented an alternative perspective on large-scale Bayesian inference problems, and developed a novel framework for approaching them in practice. For cases where the goal is estimation of Bayesian posterior expectations, rather than simulation from the posterior, we side-stepped the many serious convergence problems that arise from employing approximate Markov transition kernels for simulation. By exploiting the debiasing Lemma, we were able to estimate these posterior statistics efficiently from partial posterior statistics. Data complexity is sub-linear in $N$, no bias is introduced, and the variance is finite.

Implementing our approach is trivial, as it exploits existing work on MCMC and easily fits in with other inference schemes. Free parameters are easy to tune, and the scheme is embarrassingly parallelisable. We conducted experiments to illustrate cases where debiasing can accurately and confidently estimate posterior statistics before competing simulation methods are able to produce a single estimate. The presented methodology is limited neither to factorising likelihoods nor to MCMC as the internal inference scheme. We carried out experimental examples that showcased the competitiveness of debiasing compared to full posterior sampling and stochastic variational inference.

The most essential areas for future work are (i) exploring the computation-variance tradeoff in detail, also in the context of truncation distributions other than geometric, (ii) dealing with finite-time bias when MCMC is used, and (iii) a thorough formal and experimental comparison with other large-scale inference schemes such as stochastic variational inference.

References

Appendix A Computational complexity and variance for geometric batch schedule

In this section, we show that for the simple choice of a geometrically increasing batch schedule and a geometric stochastic truncation variable, it is possible to obtain a sub-linear expected complexity of the debiasing scheme. Furthermore, the variance remains bounded by a constant as $N$ increases.

Number of likelihood evaluations

For simplicity, we assume the common ratio 2, i.e. the batch sizes are $n_t = a\,2^{t-1}$ for $t = 1, \dots, \tau$, where $\tau = \log_2(N/a) + 1$, $a$ is the smallest batch size considered, and $\log_2(N/a)$ is an integer. We express computational costs in terms of the number of likelihood evaluations $C$. It is easy to see that $C$ is a function of the stochastic truncation variable $T$, i.e.

$$C(T) = L \sum_{t=1}^{T} n_t = L\,a\,(2^T - 1),$$

where $L$ is the length of the MCMC chains (assumed to be constant throughout). Namely, chains need to be run on partial posteriors w.r.t. $n_1, \dots, n_T$ datapoints respectively.

We write the truncation probability as $\mathbb{P}[T = t] \propto 2^{-\alpha t}$ for some $\alpha > 0$. The normalising constant of the corresponding density is $\left(\sum_{t=1}^{\tau} 2^{-\alpha t}\right)^{-1}$, leading to the expected number of likelihood evaluations

$$\mathbb{E}[C] = \frac{\sum_{t=1}^{\tau} 2^{-\alpha t}\, L\,a\,(2^t - 1)}{\sum_{t=1}^{\tau} 2^{-\alpha t}} = \mathcal{O}\!\left(L\,a^{\alpha} N^{1-\alpha}\right),$$

i.e. the overall complexity is sub-linear in the number of observations for $\alpha \in (0, 1)$.

Variance

The tail of $T$ is

$$\mathbb{P}[T \ge t] \propto \sum_{s=t}^{\tau} 2^{-\alpha s} = \mathcal{O}\!\left(2^{-\alpha t}\right).$$

In addition to $\alpha$, the variance will also depend on the rate of convergence of the partial posterior expectations to the full posterior expectation. We denote the mean squared difference between the expectation estimated on a partial posterior (corresponding to $n_t$ points) and the expectation estimated on the full posterior by

$$\bar\nu_t = \mathbb{E}\left[(\varphi_t - \varphi)^2\right].$$

Note that we have $\bar\nu_t = 0$ almost surely for $t \ge \tau$, as the sequence of estimators terminates with the full posterior. This suffices for the variance of the debiasing scheme to be finite for any finite $N$. However, it might grow without bound as $N$ grows large -- which is an undesirable situation. Remarkably, it is possible to ensure that the variance is $\mathcal{O}(1)$, i.e. that it stays bounded as $N$ increases. Namely, let us assume that for large enough $n_t$, there exist a constant $c > 0$ and a rate $\beta > 0$, such that:

$$\bar\nu_t = \mathbb{E}\left[(\varphi_t - \varphi)^2\right] \le \frac{c}{n_t^{\beta}}. \qquad (7)$$

The coefficient $\beta$ clearly depends on the function $\varphi$: it is typically close to 1 for simple models and functionals, and could be closer to zero for complicated functionals exhibiting slow convergence of partial posterior expectations. Now, from Lemma 1, the second moment of the debiasing estimator is precisely

$$\mathbb{E}[\tilde\varphi^2] = \sum_{t=1}^{\tau} \frac{\bar\nu_{t-1} - \bar\nu_t}{\mathbb{P}[T \ge t]} \le \bar\nu_0\,\mathcal{O}(1) + \sum_{t=2}^{\tau} \mathcal{O}\!\left(2^{\alpha t}\right) \frac{c}{(a\,2^{t-2})^{\beta}},$$

where the last sum remains finite for $\tau \to \infty$ as long as $\alpha < \beta$. This implies that the variance of the scheme remains bounded by a constant as the number of observations grows large. In terms of $c$, $a$, $\alpha$ and $\beta$, this upper bound on the second moment is of order $c\,a^{-\beta}$, up to a factor depending only on $\alpha - \beta$, for large $N$.

Figure 10 shows fits of equation (7) to the observed convergence rates of $\bar\nu_t$ for several of the presented examples. It shows that $\beta$ can be chosen close to 1 or larger for simple, quickly converging posterior statistics. Note that values of $\beta$ larger than 1 imply a constant average computational cost of debiasing (independent of the number of observations). However, note that these are empirical fits, sensitive to noise, etc. In practice, it is possible to estimate $\beta$ by investigating the desired expectations on the first few partial posteriors only -- with comparisons made w.r.t. the expectation given the largest batch considered among these, rather than w.r.t. the full posterior. We stress that an accurate estimate of $\beta$ is not required for our scheme: a conservative lower bound on $\beta$ suffices to ensure that the parameter $\alpha$ used in the truncation probabilities is smaller, and thus that the variance remains bounded.

Figure 10: Empirical least-squares fits of $\beta$ in equation (7) to the observed convergence of partial posterior statistics. Top-left: Gaussian mean, from Figure 1. Top-right: Log-Gaussian standard deviation, from Figure 3. Bottom-left: First regression weight of synthetic logistic regression, from Figure 5. Bottom-right: A single prediction in approximate GP regression, from Figure 6.

Minimising asymptotic variance

From (4), the work-variance product $\mathbb{E}[C]\,\sigma^2$ determines the asymptotic variance of the debiasing scheme. Thus, provided $\beta$ is known, one can use the derivations above to select the parameter $\alpha$ of the stochastic truncation distribution by minimising this product, as illustrated in Figure 2.