Markov Chain Monte Carlo (MCMC), used for sampling from posterior distributions, is one of the most fundamental tools in Bayesian data analysis. However, the recent explosion in the amount of data to be analysed poses serious challenges for this methodology as it is often infeasible to scale it to today’s statistical problems. This development led to a recent focus on methods to make MCMC practical for large datasets. Most research thus far has focused on devising alternative Markov transition kernels based on mini-batches of the data. These approaches either lead to (1) samples being drawn from an asymptotically approximate posterior distribution, thus reducing the amount of computation at the expense of introducing bias [Welling2011, Korattikara2014, ChenFoxGuestrin2014, Bardenet2014], or (2) preserving the asymptotically correct invariant distribution at the expense of technical requirements and mixing properties that might limit applicability in practice [Maclaurin2014]. The alternative approach is to run MCMC on small shards of the data and then construct a ’Consensus Monte Carlo’ estimator [Scott2013]22todo: 2lot more work on distributed MCMC other than consensus monte carlo (see references in Balajis NIPS paper). i think david dunson’s work has some theory, would be good to check. At present the Consensus Monte Carlo algorithm lacks any theoretical guarantees.
In this contribution, we propose a different view on the problem. We construct a scheme that directly estimates posterior expectations with neither simulation from the full posterior nor construction of alternative approximate transition kernels – without introducing bias. While variance of these estimators is naturally increased, we will show that this increase is bounded by construction and straightforward to control. This in particular holds for the Big Data context. We arrive at the following desiderata for a useful methodology in unbiased Bayesian inference: (i) computational complexity that is sub-linear in the number of observations, (ii) controllable and bounded variance, and (iii) no problems with transition kernel design.
The presented framework is very general in the sense that it is neither restricted to a particular form of the underlying Bayesian model (such as factorising likelihoods), nor does it rely on a particular inference technique used from within. Any free parameters are easy to tune. Furthermore, it is competitive in practice: experimental examples show that we are able to accurately and confidently estimate posterior expectations, faster than with simulation methods. In addition, and without exploiting any domain specific structure, we are able to outperform state-of-the-art results obtained from (approximate) stochastic variational inference methods on large-scale real-world data [Hensman2013].
By no means do we aim to replace existing simulation (or any other) techniques for Bayesian inference. If the goal is to obtain a representation of the full posterior density, simulation remains the method of choice. Our contribution rather provides a different perspective on problems where a given Bayesian expectation lies at the core of the performed statistical analysis.
we consider is as follows. Given data , a statistical model with parameters , and likelihood , we wish to obtain an unbiased estimator of the expectation
of a given functional over the full posterior 33todo: 3Clearly state assumptions. While we focus on the real-valued here, multivariate extensions are possible. A typical way to approach this problem is to generate samples from using MCMC, and then compute the empirical expectation . Note that the goal here is not to obtain samples from the full posterior measure – the focus is rather on the estimation of particular expectations. For example, in a regression setting, we might be interested in predictive posterior means and variances. This is the ubiquitous end goal in many situations in which Bayesian inference is employed. Therefore, we challenge the paradigm of solely working towards posterior simulation for such estimation problems, and propose a complementary methodology.
A subtlety when dealing with MCMC based estimates is that for any posterior is only asymptotically correct. Therefore, any finite time MCMC algorithm produces estimates that contain a systematic bias, which is subsequently made arbitrarily small via careful choice of simulation parameters. Parts of our methodology are based on such estimators. For the sake of simplicity, we will here assume access to the asymptotic limit in the same sense as finite time MCMC estimates are treated as ’correct’. We hint at a way to address this issue as an outlook.
Moreover, our approach is not restricted to MCMC, but easily applies to situations where expectations over are available in closed form but require prohibitive amounts of computation. Such cases can for example be found in large-scale spatial statistics [Lyne2013], or Gaussian Process models deployed to Big Data regimes [RasmussenCarlEdwardWilliams2006, Hensman2013].
Our work builds on several breakthroughs in related areas, such as unbiased estimation for stochastic differential equations [RheeGlynn2013] and for Markov chain equilibrium expectations [GlynnRhee2014]. These developments demonstrate the overarching principle that estimation is often an easier problem than simulation – a dictum we adopt and apply here in the context of Bayesian inference.
We begin by summarising previous sub-sampling based simulation approaches in Section 2. In Section 3, we construct unbiased estimators for Bayesian expectations from paths of partial posterior distributions. Section 4 contains a number of experimental illustrations where we demonstrate that our estimator is in particular useful in the Big Data regime. In Section 5, we point out a number of extensions and conduct experiments that showcase the generality of the developed framework. We close with a discussion of shortcomings and point out future work in Section 6.
2 Previous work
A practical difficulty in dealing with the full posterior is that is often large. This renders the computation of a likelihood extremely expensive – if not impossible. This, for example, limits feasibility of MCMC algorithms to simulate from as they require access to in every iteration.
The infeasibility of exact likelihood computation has been the main focus of [Welling2011, Korattikara2014, ChenFoxGuestrin2014, Bardenet2014] where this issue is circumvented by approximations to the transition kernel in MCMC. This is done using, e.g. stochastic gradient Langevin [Welling2011] or Hamiltonian [ChenFoxGuestrin2014] approaches, or by using a noisy acceptance ratio and employing a statistical test [Korattikara2014] or concentration bounds [Bardenet2014]. The well-known issue with this vein of work is that the approximate finite step-size diffusions, defined by mini-batches of data, are no longer corrected for induced bias. Consequently convergence to the correct posterior (and indeed any) measure is no longer guaranteed. The practical effect of these approaches makes them difficult to tune and to obtain a well-mixing chain. Furthermore, artefacts of these methodologies, such as using parametric hypothesis testing [Korattikara2014], might even lead to over-confident accept/reject decisions in the Markov transition kernel. The latter was illustrated in Bardenet2014, who also substantially improve on these constructions by providing total variation bounds to assess the quality of the approximation.
Recently, Alquier2014 provided quantification of the approximation quality of many ’noisy’ MCMC Algorithms, including the ones by Welling2011, Korattikara2014, Bardenet2014. These results are an important step towards understanding the extent of the bias induced by employing approximate transition kernels. However, in practice their results require uniform or geometric ergodicity of the original Markov Chain to explicitly quantify the approximation error [Alquier2014, Theorem 2.1] or just guarantee convergence [Alquier2014, Theorem 2.2], respectively. The first condition is too strong for most problems in practice while the second one does not give important details on how and when the approximate chain converges. Our work can be seen as an orthogonal approach, as we avoid simulation from the full posterior and rather directly attack the underlying estimation of expectations of interest, i.e. (1).
In contrast to these approximate, biased sampling methods, Firefly MCMC [Maclaurin2014] introduces an exact construction that neither introduces bias nor requires computation of a full likelihood. It is an elegant way of exploiting additional auxiliary variable structure. In this regime of computationally intractable likelihoods due to data size111Data size as opposed to computationally intractable likelihoods due to an inherently intractable functional defining the likelihood., it is seen as the only approach that can ensure coherence of subsequent inference through the simulation from the asymptotically correct posterior. One complication with Firefly MCMC is that it requires availability of appropriate easily computable and tight lower bounds on the likelihood function, tuning of which requires at least one sweep through all data. Of course, the models for which such bounds can be obtained are often precisely those relatively simple models where the need for full and exact Bayesian inference over variational or other approximations might be questionable44todo: 4Add non-trivial example. While investigating the formal construction of such bounds in more general model classes is a promising way forward, the generality and applicability of Firefly MCMC is clearly limited. Moreover, while Firefly MCMC can significantly reduce the number of likelihood evaluations at each iteration of MCMC, the complexity of the scheme is linear in the number of observations, as resampling of auxiliary variables is required at each iteration, for a given fraction of the available data . There is a limit as to how small the parameter can be chosen: mixing time cannot be smaller than . This means that the reduced number of likelihood evaluations at each iteration of MCMC comes at the cost of requiring to run the chains by a factor of longer.
In contrast to all but one of the previous sub-sampling schemes considered, the estimators that we propose are provably unbiased and also have a sub-linear average complexity in the number of observations. Unlike the only unbiased competitor, Firefly MCMC, our approach does not require a lower bound on the likelihood and even extends to several other situations: where posterior expectations are available in closed form but computationally infeasible, where likelihoods need not factorise across observations, and where likelihoods might themselves be unavailable in an analytic form, as for example in the context of pseudo-marginal MCMC [Andrieu2009].
3 Partial posterior path estimators
In this section, we present a different approach to coherent Bayesian inference in the Big Data regime which exploits the paths of induced partial posterior distributions through the debiasing device developed in RheeGlynn2013, GlynnRhee2014. A similar approach was very recently taken by Agapiou2014, who exploit RheeGlynn2013’s work for unbiased posterior estimation of expectations over intractable infinite-dimensional models which can be parametrised in terms of a series expansion of basis functions. In contrast, our work directly attacks intractability that arises from large datasets. We see our contribution as a pragmatic complement to existing work on debiasing Monte Carlo estimates.
Our approach follows similar ideas as Chopin2002, who presented a sequential Monte Carlo procedure for static target distributions by exploiting a sequence of partial posterior targets. Given the full posterior , we define a subset of size of all data, where is a (possibly random) index set55todo: 5this footnote doesn’t make sense - haven’t introduced yet, with sizes . The partial posterior corresponding to is then 222Note that while is the number of subsets, they are indexed with the subscript for some variable that will be introduced later.. Paths of partial posterior measures can be constructed by starting from the prior and increasing the size of the batches until exhausting the whole set of observations, i.e.
where is the full posterior.
For simplicity of exposition, we here consider the case of a geometric increase in batch sizes. More precisely, we set , where is the number of possible batch sizes, is the smallest batch size considered. We assume that is an integer. Note that common ratios other than 2 are possible, and are used in the experiments.
The next section presents the debiasing device, which is a key component in transforming estimates of expectations over multiple partial posteriors into an unbiased estimator of the expectation over the full posterior .
3.1 Debiasing Lemma & algorithm
The debiasing Lemma provides a way to construct an unbiased estimator for the limit of a converging a sequence – without evaluating all elements. Here, we transform a sequence of asymptotically correct, but biased estimators into an unbiased estimator. In different contexts, the result was originally discussed independently by McLeish2011 and RheeGlynn2012. It was shown in the present form in RheeGlynn2013; see also JacobThiery2013.
Lemma 1 (Telescoping Estimator).
Let and be real-valued random variables
be real-valued random variables333Agapiou2014 very recently developed a Hilbert space version of the Lemma., and let be an integer-valued random variable independent of and of with . With the convention , assume that
is an unbiased estimator of , i.e. . Moreover,
Finite variance for unbiased Bayesian expectations
The above Lemma 1 is directly applicable in our context. We set , for where the empirical expectation is computed by, e.g. MCMC444We realise that empirical expectations computed with MCMC are technically biased and will comment on this in the next Section. on the partial posterior , and for . In the finite data regime, the conditions of the above Lemmas are trivially satisfied since we set almost surely for . The variance of estimators is thus finite by construction and the truncation variable needs only to be supported on , and all infinite sums can be replaced with sums up to . However, variance might still increase increase with without a bound. Therefore, in order to ensure stability of the estimators in the Big Data regime, we here require an analogous condition to (2) that will guarantee that the variance remains constant, i.e. that as , 77todo: 7Check expectation arguments
Intuitively, we require that the tail of the stochastic truncation variable matches the rate of convergence of partial posterior expectations. See the next Section and Appendix A for details on a simple setup where this condition holds, along with a way to tune the truncation distribution .
Note also that in the same fashion as in RheeGlynn2013, one can replicate the random truncation procedure times and thus reduce variance. More precisely, is an unbiased estimator of and its variance scales as . Here, are independent copies of and each is computed on a different partial posterior path. This implies that the scheme can be repeated until a desired error tolerance is attained. The latter can be estimated from the empirical variance of the .
In summary, the key properties of the described methodology is that full posterior expectations over can be estimated, with no bias introduced and with a bounded increase in variance. This is achieved by using sub-samples of the available data – at a sub-linear average computational cost as we will see next.
3.2 Practical considerations
We now list several properties, implications, and key advantages of our scheme.
Computational costs and variance
Let us denote by the time required to generate a single debiasing estimator . Since computing requires running MCMC chains, on batch sizes , would scale linearly with the overall number of likelihood evaluations, resulting in the average time complexity of . If the batch-size increase is geometric, i.e. , the cost becomes
. By matching this with truncation probabilities, for , we obtain an average cost of , which is sub-linear in , see also Appendix A. This cost reflects the amount of computation when only a single core is available, and the trivial parallelisation of the scheme allows further savings, as described below.
The variance of the debiasing estimator depends on the rate of convergence of the partial posterior expectations. In order to ensure that te variance stays bounded as increases, assume that there exist a constant and , such that for large enough and :
From here, as shown in Appendix A, (3) is satisfied and therefore variance remains bounded as long as . Thus, fast convergence of partial posterior expectations, e.g., close to , can result in significant speed-ups of the scheme. We give examples of empirical fits for in Appendix A.
Tuning truncation probabilities
We now describe how to tune free parameters of the scheme. Following GlynnWhitt1992, if both the average time complexity and the variance
are finite, a central limit theorem holds in the limit where computational budget. Namely, for a given computational budget , if we denote by the number of debiasing replications that can be generated in time, and by the resulting average of debiasing replications, then
Thus, the asymptotic efficiency of the debiasing estimator is characterised precisely by the work-variance product. Figure 2 demonstrates how the distribution of the stochastic truncation variable (here in the parametric form ) can be optimised in order to yield minimal asymptotic variance of the estimators, see also Appendix A.
MCMC and finite time bias
Any empirical expectation computed from finite MCMC algorithms is only correct in the asymptotic limit. Consequently, setting for sampled from a finite time Markov chain is problematic, as unbiasedness of the overall approach technically corrupted. However, the same is true for any MCMC estimate. In practice, careful tuning of simulation parameters such as burn-in, thinning, and running multiple chains [gelman1992inference], is successfully used to reduce finite time biases to a neglectable level. We adopt this mindset here for the sake of practicality and simplicity of presentation. A way to address the issue could be to apply debiasing to the Monte Carlo estimate itself. In Lemma 1, set for drawn from an approximate (not converged) Markov chain after iterations, as opposed to be drawn from the asymptotically invariant distribution. This gives a sequence . Via applying the debiasing Lemma, an unbiased estimator can be constructed for any partial posterior ’s expectation. In a second stage, these debiased partial posterior expectation estimates can be used to estimate the full posterior expectation – now fully unbiased. Unfortunately, as is an infinite sequence, the variance expression in Lemma 1 is not trivially guaranteed to be finite anymore. See GlynnRhee2014, McLeish2011, Agapiou2014 for an explicit and in-depth treatment of unbiased Monte Carlo estimation.
MCMC and mixing time
If a Markov chain is, in line with above considerations, used for computing partial posterior expectations , it need not be induced by any form of approximation, noise injection, or state-space augmentation of the transition kernel. As a result, the notorious difficulties of ensuring acceptable mixing and problems of stickiness are conveniently side-stepped – which is in sharp contrast to all existing approaches. Furthermore, while the latter are limited to particular MCMC schemes [Bardenet2014, random walk], and [Welling2011, Langevin], any MCMC procedure can be employed in our construction. This allows us to harvest decades of mathematical and engineering effort that went into both methodology and software packages (e.g. Stan [stan2014]. Mixing time when using MCMC to estimate partial posterior’s expectations is not compromised by our approach, in contrast to for example Firefly MCMC, whose mixing time gets worse as the mini-batch size decreases. MCMC chains over partial posteriors do not suffer from such problems. Indeed they are in practice often easier to handle due to their similarity to the (usually simply structured) prior distribution. 1111todo: 11Peter D: He did not get that we talk about mixing problems come from the augmented transition kernels, so ellaborate on that
As computation required for each partial posterior in a single path can be performed independently, the method embarrassingly parallelises and expectations computed in parallel only need to be combined in a telescoping sum. The same holds true for replications of the scheme: since the computational cost within each truncated posterior path is dictated by the largest batch size that needs to be processed in parallel, the required wall-time in the case of geometric batch-size increases is roughly halved. Therefore, with sufficient computational resources, the potential speed-up factor through parallelisation is , where in practice is usually in the 100s to 1000s as we will see in the experiments.1313todo: 13Reference Appendix for derivation
In this section, we illustrate the utility of the debiasing approach and compare it against other unbiased approaches: full posterior sampling and Firefly MCMC. In particular, we show that for large-scale datasets, debiasing can accurately and confidently estimate posterior expectations before full MCMC and Firefly have produced a single estimate.
It is clear that running an MCMC chain on the full posterior, for any statistic, produces more accurate estimates than the debiasing approach, which by construction has an additional intrinsic source of variance. This means that if it is possible to produce even only a single MCMC sample (after burn-in), the resulting posterior expectation can be estimated with less expected error. It is therefore not instructive to compare approaches in that region. 1414todo: 14Why is it clear? Reference?
On comparing to Firefly MCMC
For a fair comparison of our method to Firefly MCMC, we give an estimate for the number of likelihood evaluation necessary for Firefly MCMC to produce the first sample – for which there are two notable obstacles. The first is computing a maximum a posteriori (MAP) estimate to initialise a lower bound on the likelihood: Maclaurin2014 reported Firefly’s performance to be inferior to standard MCMC otherwise. For large datasets, MAP estimates are challenging as a standard gradient based optimisation scheme such as BFGS needs multiple evaluations of the full likelihood. For example, Stan’s BFGS implementation on a commonly used benchmark dataset, a9a [Welling2011, Lin2008]
, takes around 40 iterations to reach a reasonable convergence level. While this issue can be somewhat sidestepped via using stochastic gradient descent. However, given a MAP estimate, a one-off cost of, i.e. computing sufficient statistics of all data [Maclaurin2014], cannot be avoided. This is challenging for extremely large datasets. Furthermore, Firefly is based on binary indicator variables that determine whether a point in a factorised likelihood is used for an MCMC update (bright) or not (dark). The second obstacle comes from Firefly’s parameter that is the probability of brightening a dark point. First, at least points need to be evaluated in each iterations, which is linear in . Second, mixing time suffers at least by a factor of , which means that burn-in time and number of MCMC iterations need to be multiplied by that factor to compare to full MCMC in a fair way. Together, this means that Firefly MCMC roughly needs the same number of likelihood evaluations as full MCMC before it produces the first sample – implying that our comparisons to the full MCMC directly apply to Firefly MCMC as well.
We now provide a number of examples, where we analyse convergence of the debiasing estimator up to the number of likelihood evaluations necessary to produce a single sample. All estimates are given as function of the number of likelihood evaluations needed to compute them (including burn-in). Note that, in favour of competing methods, we do not take parallelisation into account, which (given appropriate hardware) would increase the effective number of likelihood evaluations per unit time by a factor of .
4.1 Synthetic log-Gaussian
We first consider a toy model from Bardenet2014, but with more data555Attempting to resist the commonly followed temptation of applying large-scale methodology to only medium sized datasets. : rather than the original , we generate data from a log-Gaussian , where and . Using flat priors, we sample from the joint posterior over and aim to estimate the marginal posterior mean of . This posterior has extremely wide tails, which causes problems for MCMC methods based on approximate transition kernels. In particular, Bardenet2014 illustrate that even when using an appropriate setting of the mini-batch size, Korattikara2014’s scheme (confidently) gives completely wrong results. Bardenet2014
’s sampler is able to recover the model’s standard deviation but this comes at the cost of using almostall available data in every MCMC iteration.
This happens despite the fact that such simple posterior expectations converge rapidly. Figure 3 shows convergence of the partial posterior mean of as a function of mini-batch size . Even though all such estimates are biased, the plot reveals that using multiple MCMC chains on subsets of constant size and averaging gives a small estimation error quickly. This raises the question whether manipulating the Markov transition kernel is the best way of addressing such problems.
Debiasing, in contrast, is a way of exploiting rapid convergence of posterior expectations, while remaining unbiased – as demonstrated in the upper part of Figure 4, where we show that we can recover the true model parameter confidently and quickly. We run a number of debiasing replications with a minimum batch size of , and geometric truncation probability close to . Each of the partial posterior expectations are computed via MCMC666We use the NUTS sampler [stan2014], assuming a constant number of leap-frog iterations in HMC. with iterations after a burn-in of iterations. We count the number of likelihood evaluations for each partial posterior, taking into account the MCMC iterations. We run full MCMC with the same number of iterations and burn-in (note that this is in favour of full MCMC as partial posterior distributions should be explored in fewer iterations). We count likelihood evaluations per MCMC iteration, with an offset of burn-in iterations.
Remarkably, as Figure 4 indicates, the largest partial posterior size was only , leading to a maximum single replication cost of as depicted in Figure 4, and a median of only . After replications, the number of data touched in total is . Taking into account the MCMC iterations to estimate each partial posterior expectation, this sums up to likelihood evaluations, which is less than a quarter of a single full MCMC burn-in iteration (), and less than times the number of likelihood evaluations required to complete the burn-in of full MCMC.
Fast convergence of posterior expectations as in Figure 3 is not an artefact of the above model. As we demonstrate in the next experiment, such situations arise in more involved inference tasks as well.
4.2 Large-scale logistic regression
We now apply our methodology to a large-scale Bayesian logistic regression problem ondata. As full posterior simulation is infeasible for models of such size, we choose a synthetic dataset in order to quantify estimation error.
We model binary labels of observations of features as , where
is the logit function andare independent Laplace priors with a scale of 1. The bias parameter is absorbed into and . To generate data, we sample covariates and label them positively with probabilities . True regression weights are set to for .
As in the previous example, simple posterior statistics such as mean regression weights, i.e. for , converge quickly: Figure 5 (top) reveals that the statistics do not significantly change when computed from randomly sub-sampled mini-batches larger than 10000, which is 4 order of magnitudes smaller than . Note that it not possible to run even a single MCMC chain on the full dataset – in contrast to our debiasing approach.
We apply our debiasing estimator, using a minimum batch size of only , a geometric batch size increase of , and run Stan’s NUTS sampler for 500 iterations after a burn-in of 100 iterations. Figure 5 (bottom) shows examples of the convergence of the debiasing estimator over replications. Taking into account the 500+100 inherent MCMC iterations, the median data usage per replication is likelihood evaluations data points, the average is . Summing over all replications, debiasing takes likelihood evaluations. This means that a full MCMC chain would not even have passed the iterations of burn-in, while debiasing already converged close to the ground truth posterior statistic. We stress that this comparison is extremely conservative: given appropriate computational resources, parallelisation allows for a speed-up of up to factor . This means that we can reduce error bars by an additional large factor without increasing computation wall-time.
Posterior mean convergence as a function of mini-batch size, for all 9 regression weights and the bias term (outlier in the plot). They converge from about 10000 data (averaged over 100 trials). Note how the bias term compensates for small weights in the low data regions of the plot.Bottom: Debiasing convergence as a function of the number of replications , for the last 3 regression weights and bias term. Other regression weights behave similarly. 95% error bars, dashed line indicates ground truth from the above plot. All replications correspond to likelihood evaluations, not taking parallelisation into account.
We now describe two extensions of our framework that illustrate its generality compared to other sub-sampling based approaches, and give experimental illustration. This includes an experiment where we are able to outperform stochastic variational inference for Gaussian Processes on a large-scale real-world dataset.
5.1 Likelihoods need not factorise
The debiasing device for constructing the unbiased estimators of posterior expectations does not require that the likelihood factorises, i.e. that
We simply require access to partial likelihoods for a given batch size . To the best of our knowledge, this is in sharp contrast to all other methods available for MCMC in the Big Data regime, where the likelihood has to be computable and typically is assumed to factorise. As such, debiasing over partial posterior paths can also be applied to cases where posterior distributions are available in closed form – but only at a prohibitive amount of computational cost.
Approximate Gaussian Process regression
is a typical example for a non-factorising likelihood. We focus on a simple case of predictive posterior in Gaussian Process (GP) regression,
where is the covariance function evaluated at pairwise training covariates , and , and observation noise variance [RasmussenCarlEdwardWilliams2006, Section 2]. This requires the inversion of an covariance matrix and therefore costs computation. Note that predictive mean and variance here can be computed exactly – no MCMC simulation is required. With the debiasing approach, it suffices to look at the expectations of partial predictive posteriors , which is again based on sub-sampling all available data and . As each evaluation then requires computation, the average computational costs are given by , where we set , and as before.
The above, however, can still be infeasible in practice, and further savings can be obtained by applying the debiasing onto the primal form of (5), in combination with an explicit finite rank representation of the kernel function , with , e.g. inducing variables [QuinoneroCandela2005], random Fourier features [Rahimi2007], or Incomplete Cholesky [Fine2001]
. By performing Bayesian linear regression on this explicit (approximate) feature space, the posterior for a single test featurebecomes
with feature matrix . Evaluation now requires a reduced cost of . Having a cost linear in for obtaining each expectation gives a debiasing average computational cost of , which again is sub-linear in for a fixed feature space dimension . As before, each of the mini-batches can be processed in parallel.
To illustrate the above idea, we generate toy data for a univariate non-linear regression problem in an approximate feature space given by Rahimi2007’s random Fourier features. Note that this exactly corresponds to a Bayesian linear regression with the mapped features. More specifically, we choose a Gaussian kernel with unit length scale, , whose associated random feature space of dimension is given by the mapping
where and are fixed and the covariates are randomly spread in . We sample a set of training labels from the corresponding approximate GP prior, add observation noise, and resample the feature space basis via . We then fit the data and compute the predictive mean from equation (5.1) for a set of randomly chosen test covariates with test features . The ground truth is chosen to be the predictive mean using all data.777Note that this is different to the predictive mean using an exact GP, but suffices for illustration purposes here, as the MSE is zero by construction when all data is used .
As before, we begin by exploring convergence of the desired posterior statistic, here averaged for multiple test features . For a given partial posterior size, we repeatedly sub-sample observations and compute the predictive mean. Figure 6 (top) shows convergence of the mean squared error (MSE) of the predictive means for all test features as a function of partial posterior size. The MSE only gets close to zero when almost all data is used. This is unlike in previous examples and therefore shows that the functional corresponding to GP regression, i.e. equation (5), is more complicated.
We apply the debiasing scheme and compute the average computational complexity, i.e. the average size of all partial posteriors of a single debiasing replication. Given that complexity, Figure 6 (top) shows the MSE if we were to average predictions over multiple mini-batches. In contrast, Figure 6 (bottom) reveals that debiasing achieves a much better MSE at the same average computational cost.
To our knowledge, none of the other approximate or exact sub-sampling-based MCMC schemes can be applied to this example. We therefore resort to comparing against a popular approximate inference method for such GP models.
5.2 Comparing to stochastic variational inference on real-world data
Another way to approach Gaussian Processes in the Big Data context is via stochastic variational inference (GP-SVI). Hensman2013 combine a decade’s work on sparse GPs, variational bounds, and stochastic gradient descent to fit huge GP models in a streaming fashion. The clever usage of a number of approximations allows them to cut the computational costs from down to , where is the number of inducing variables and constants depending on number of iterations and mini-batch size.
We apply a combination of random Fourier features and debiasing (as in previous Section) to the real-world problem of predicting arrival time delays in flight records [Hensman2013, Section 4.3]. This involves data consisting of 8-dimensional covariates and real labels. We aim to estimate the predictive mean of a GP for randomly chosen test covariates. We use the exponentiated quadratic covariance function, with a finite-rank expansion via random Fourier features. For the sake of simplicity, we do not apply a different length-scale to each dimension and include no bias term. Instead, we centre the data and re-scale to unit variance in a preprocessing step888 Indeed, we were not able to obtain significant differences working with varying length-scales or other hyperparameters. Furthermore,
Indeed, we were not able to obtain significant differences working with varying length-scales or other hyperparameters. Furthermore,Hensman2013 do not report predictive variance, for which tuning such parameters is more essential. However, random Fourier features are easily adapted to such covariance functions.. We match the number of random Fourier features to the number of inducing points in the GP-SVI experiment.
In debiasing, We use the minimum batch size , and set the trucation distribution to match an average computational cost of roughly for each of the replications, which is an order of magnitude less than the batch size of for iterations in the GP-SVI experiment.
Remarkably, as shown in Figure 7, debiasing outperforms Hensman2013. GP-SVI achieves a square rooted mean squared error (RMSE) of , while debiasing achieves less than . Care has to be taken when concluding from these RMSE comparisons – both methods are likely to be improved by tuning, the full experimental protocols are not available, there are slight differences in finite-rank approximations, etc. Instead, we make the point that debiasing achieves a competitive performance. However, while GP-SVI is highly engineered to these very same GP regression models, debiasing is a more general method for estimation in Bayesian inference – with GP regression only being one of its applications.
While this example is promising, we leave a thorough comparison with streaming variational Bayes for future work.
5.3 Reducing bias in streaming applications
The debiasing formalism is easily applicable in a scenario where the amount of data is unknown or unlimited, e.g. in a streaming scenario. Previously, we discussed targeting the posterior given a fixed number of observations , and constructing the stochastic truncation variable such that a small probability remains that all observations are used for computing the desired expectation. In contrast, in the streaming scenario, we are unable to process all observations at a time, and the nature of the problem forces us to process observations in batches – which are discarded afterwards. Debiasing is still possible: we fix a worst case budget , which is the largest number of observations that can be processed at a time (e.g. guided by the hardware restrictions). then replaces in the static case: the stochastic truncation variable allows processing observations at maximum. This means that the bias with respect to the full posterior (of unknown size) still remains. However, as , it is typically of the order , and is therefore subsumed by the error bars over replications, which are of the order .
Note that in the streaming scenario, no fully unbiased scheme is available.
We compare the debiasing scheme with the constant-batch scheme on a simple posterior mean estimation in a Gaussian model with known variance, where with prior and true . Results are given in Figure 8. They show that the debiasing estimator is less biased and has more appropriate error bars where the constant batch-size approach is overconfident and strongly biased. The constant batch size was chosen to make the computational cost of the two schemes comparable. Results show estimates towards the end of the total of replications after each scheme has streamed around datapoints.
In this section, we present shortcomings and problems, both conceptual experimental, that have to be addressed in future work. We close by summarising our contributions.
Bias from memory restrictions
In order for the estimator in (1) to be unbiased, one needs to assign a non-zero probability for each of the possible values of the truncation variable – and the resulting partial posterior expectations need to be computable in finite time. In all presented examples, sampling large values of only results in a long runtime. However, such large might also result in a partial posterior statistics that are impossible to estimate due to restrictions of available computing resources. A common example are memory limits arising from large Gaussian covariance matrices, see for example [Lyne2013]. We side-stepped such problems in our experiments by using a finite-rank kernel expansion. However, in general our estimator is not unbiased in cases where partial posteriors exceed available machine memory. In practice, allowing a fixed computational budget and tweaking the truncation distribution such that larger values are almost never sampled, yields good results. Developing a more sophisticated solution is left for future work.
Convergence on short posterior paths
In experiments on smaller datasets, we could not beat full posterior sampling in terms of estimation error per computation time. Only when the sub-linear average computational complexity is significantly less than , debiasing outperforms MCMC. It will be interesting to study the connection of data size and truncation parameter for different classes of posterior functionals .
Figure 9 shows results for (sparse) logistic regression on the a9a dataset [Lin2008, Welling2011], which consists of covariates of dimension . Using the same model as in Section 4.2, we aim to estimate the posterior mean of the first regression weight . Note that full posterior sampling on this dataset takes days. Figure 9 (top) shows convergence of partial posterior statistics, which tend to stabilise from about 1000 data. Convergence of debiasing, Figure 9 (middle), behaves well at first sight. However, as in this case is relatively small, the probability to sample a partial posterior path truncation that includes the whole dataset is relatively high. In the presented debiasing run, this happens around replication 100. As this results in full posterior sampling, debiasing is pointless. Unfortunately, the convergence at this point has not yet reached an acceptable level.
We presented an alternative perspective on large-scale Bayesian inference problems, and developed a novel framework for approaching those in practice. For cases where the goal is estimation of Bayesian posterior expectations, rather than simulation from the posterior, we side-stepped the many serious convergence problems arising from employing approximate transition kernels of Markov chains for simulation. By exploiting the debiasing Lemma, we were able to estimate these posterior statistics efficiently from partial posterior statistics. Data complexity is sub-linear in , no bias is introduced, variance is finite.
Implementing our approach is trivial as it exploits existing work on MCMC and easily fits in with other inference schemes. Free parameters are easy to tune. It furthermore is embarrassingly parallelisable. We conducted experiments to illustrate cases where debiasing can accurately and confidently estimate posterior statistics before competing simulation methods are able to produce a single estimate. The presented methodology is not limited to factorising likelihoods or MCMC as an internal inference scheme. We carried out experimental examples that showcased competitiveness of debiasing compared to full posterior sampling and stochastic variational inference.
Most essential areas for future work are (i) exploring the computation-variance tradeoff in detail, also in context of other than geometric truncation distributions (ii) dealing with finite time bias when MCMC is used, (iii) a thorough formal and experimental comparison with other large-scale inference schemes such as stochastic variational inference.
Appendix A Computational complexity and variance for geometric batch schedule
In this section, we show that for the simple choice of a geometrically increasing batch schedule, and a geometric stochastic truncation variable, it is possible to obtain sub-linear expected complexity of the debiasing scheme. Furthermore, variance remains bounded by a constant as increases.
Number of likelihood evaluations
For simplicity, we assume the common ratio , i.e. the batch sizes are , where , and is the smallest batch size considered, and is an integer. We express computational costs in terms of the number of likelihood evaluations . It is easy to see that is a function of the stochastic truncation variable , i.e.
where is the length of the MCMC chains (assumed to be constant throughout). Namely, chains need to be run on partial posteriors w.r.t. datapoints respectively.
We write the truncation probability as for some . The normalizing constant of the corresponding density is given by , leading to the expected number of likelihood evaluations being
i.e. the overall complexity is sub-linear in the number of observations for .
The tail of is
In addition to , variance will also depend on the rate of convergence of partial posterior expectations to the full posterior expectation. We denote the difference between the expectation estimated on a partial posterior (corresponding to points) and the expectation estimated on the full posterior by
Note that we have almost surely for as the sequence of estimators terminates with the full posterior. This suffices that for any finite , variance of the debiasing scheme is finite as well. However, it might grow without bound as grows large – which is an undesirable situation. Remarkably, it is possible to ensure that variance is , i.e. it stays bounded as increases. Namely, let us assume that for large enough , there exist a constant and , such that :
The coefficient clearly depends on the function : it is typically for simple models and and could be closer to zero for complicated functionals exhibiting slow convergence of partial posterior expectations. Now, from Lemma 1
, the second moment of the debiasing estimator is precisely
where the last sum remains finite for as long as . This implies that the variance of the scheme remains bounded by a constant as the number of observations grows large. In terms of and , this upper bound on the second moment is approximately for large .
Figure 10 shows fits of equation (7) to the convergence rates of the for various of the presented examples. This shows that can be chosen close to or larger for simple, quickly converging posterior statistics. Note that values larger than 1 imply a constant average computational cost of debiasing (independently of the number of observations). However, note that these are empirical fits, sensitive to noise etc. In practice, it is possible to estimate by investigating the desired expectations on the first few partial posteriors only – comparisons being w.r.t. the expectation given the largest batch considered among these, rather than w.r.t. the full posterior. We stress that getting an accurate estimate of is not required for our scheme - a conservative lower bound on suffices to ensure that the parameter used in the truncation probabilities is smaller and thus variance remains bounded.
Minimising asymptotic variance
From (4), the variance cost determines the asymptotic variance of the debiasing scheme. Thus, provided is known, one can use the derivations above to select the parameter in the stochastic truncation distribution,