Stochastic Bouncy Particle Sampler

09/03/2016 ∙ by Ari Pakman, et al. ∙ 0

We introduce a novel stochastic version of the non-reversible, rejection-free Bouncy Particle Sampler (BPS), a Markov process whose sample trajectories are piecewise linear. The algorithm is based on simulating first arrival times in a doubly stochastic Poisson process using the thinning method, and allows efficient sampling of Bayesian posteriors in big datasets. We prove that in the BPS no bias is introduced by noisy evaluations of the log-likelihood gradient. On the other hand, we argue that efficiency considerations favor a small, controllable bias in the construction of the thinning proposals, in exchange for faster mixing. We introduce a simple regression-based proposal intensity for the thinning method that controls this trade-off. We illustrate the algorithm in several examples in which it outperforms both unbiased, but slowly mixing stochastic versions of BPS, as well as biased stochastic gradient-based samplers.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 8

page 15

page 17

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The advent of the Big Data era presents special challenges to practitioners of Bayesian modeling because typical sampling-based inference methods have a computational cost per sample linear in the size of the dataset. This computational burden has been addressed in recent years through two major approaches (see (Bardenet et al., 2015)

for a recent overview): (i) split the data into batches and combine posterior samples obtained in parallel from each batch, or (ii) use variants of the Markov Chain Monte Carlo (MCMC) algorithm that only query a subset of the data at every iteration. Our interest in the paper is in the latter approach, where many methods are based on modifying both steps of the Metropolis-Hastings (MH) algorithm: in the proposal step, only a mini-batch of the data is used, and the accept-reject step is either ignored or approximated 

(Korattikara et al., 2013; Bardenet et al., 2014). This strategy has been explored using proposals from Langevin (Welling & Teh, 2011), Riemannian Langevin (Patterson & Teh, 2013), Hamiltonian (Chen et al., 2014) and Riemannian Hamiltonian (Ma et al., 2015) dynamics. Other relevant works include (Ahn et al., 2012; Ding et al., 2014).

Despite the success of the above approach, the partial accept-reject step is a source of bias, the precise size of which is difficult to control, and which tends to be amplified by the noisy evaluation of the gradient. This has motivated the search for unbiased stochastic samplers, such as the Firefly MCMC algorithm (Maclaurin & Adams, 2014), the debiased pseudolikelihood approach of (Quiroz et al., 2016), and the quasi-stationary distribution approach of (Pollock et al., 2016).

The present work is motivated by the idea that the bias could be reduced by starting from a rejection-free MCMC algorithm, avoiding thus the Metropolis-Hastings algorithm altogether. Two similar algorithms of this type have been recently proposed: the Bouncy Particle Sampler (BPS) (Peters & de With, 2012; Bouchard-Côté et al., 2015), and Zig-Zag Monte Carlo (Bierkens & Roberts, 2015; Bierkens et al., 2016). These algorithms sample from the target distribution through non-reversible, piecewise linear Markov processes. Non-reversibility (i.e., the failure to satisfy detailed balance) has been shown in many cases to yield faster mixing rates (Neal, 2004; Vucelja, 2014; Bouchard-Côté et al., 2015).

Our contributions in this paper are twofold. Firstly, we show that the BPS algorithm is particularly well suited to sample from posterior distributions of big datasets, because the target distribution is invariant under zero-mean noisy perturbations of the log-likelihood gradient, such as those introduced by using mini-batches of the full dataset in each iteration.

Stochastic variants of BPS or Zig-Zag that preserve exactly the target distribution have been proposed, such as Local BPS (Bouchard-Côté et al., 2015) or Zig-Zag with subsampling (ZZ-SS) (Bierkens et al., 2016)

, but they lead to extremely slow mixing because are based on overly conservative bounds (which moreover must be derived on a case-by-case basis, and in many cases may not hold at all). This leads us to our second contribution, the Stochastic Bouncy Particle Sampler (SBPS), a stochastic version of the BPS algorithm which trades a small amount of bias for significantly reduced variance, yielding superior performance (and requiring no parameter tuning or derivation of problem-specific bounds) compared to existing subsampling-based Monte Carlo methods. SBPS inherits the piecewise linear sample paths of BPS, and therefore enjoys faster convergence of empirical means, particularly of rapidly varying test functions, compared to more standard approaches.

We organize this paper as follows. In Section 2 we review the Bouncy Particle Sampler, in Section 3 we study the invariance of the target distribution under noise perturbations to the BPS updates, in Section 4 we introduce SBPS, and in Section 5 a preconditioned variant. In Section 6 we discuss related works and in Section 7 we illustrate the advantages of SBPS in several examples.

2 The Bouncy Particle Sampler

Figure 1: Noisy vs. noiseless gradients in BPS. Above: Contour plot of the 2D density considered and sample BPS trajectories during 50 bounces, with exact and noisy gradients. The noise was sampled in each component from a distribution. Below, Left: smoothed histogram of travel times until bouncing, with shorter times for noisy gradients. Middle: QQ-plot of one of the coordinates, showing that the invariant distribution is not changed by the noise. Right: ACFs of one of the coordinates, with slower mixing per iteration in the noisy case. However, note that these ACF plots do not account for computational cost per iteration.
Algorithm 1 Bouncy Particle Sampler
  Initialize particle position and velocity
  while desired do
     Sample Poisson process first arrivals with rates ,
     Let
     Move
     if  then
        Reflect
     else
        Refresh: sample
     end if
     Let
  end while
  RETURN piecewise linear trajectory of

Consider a distribution , where the normalization factor may be intractable. The Bouncy Particle Sampler (BPS), proposed in (Peters & de With, 2012; Monmarché, 2014) and formalized and developed in (Bouchard-Côté et al., 2015)

, introduces a random velocity vector

distributed uniformly in the unit sphere , and defines a continuous Markov process in . To describe this process we begin in discrete time and then take the continuous-time limit. Denoting time by , consider a discrete-time Markov process that acts on the variables as

(1)

where

(2)
(3)
(4)

Note that in (3) is the directional derivative of in the direction , and is a reflection of with respect to the plane perpendicular to the gradient , satisfying and . In other words, the particle moves along a straight line in the direction of and this direction is reflected as (4

) with probability

. This probability is non-zero only if the particle is moving in a direction of lower target probability , or equivalently higher potential .

Applying the transition (1) repeatedly and taking , the random reflection point becomes an event in an inhomogeneous Poisson process with intensity . The resulting sampling procedure generates a piecewise linear Markov process (Davis, 1984; Dufour et al., 2015), and is summarized in Algorithm 1. Note that the algorithm also includes occasional resamplings of , to ensure ergodicity (Bouchard-Côté et al., 2015). Remarkably, in the limit , the algorithm leaves the joint factorized distribution invariant, as we review in Supp. Material A.1.

The Zig-Zag process (Bierkens & Roberts, 2015; Bierkens et al., 2016) is similar to BPS, but velocity components can take only values, and the piecewise linear trajectories change direction only in a single coordinate at each random breakpoint. For a review of these methods, see (Fearnhead et al., 2016; Bierkens et al., 2017).

3 Noise Resilience and Big Data

3.1 Noise Resilience

Let us assume that only a noisy version of the gradient is available to compute the probability of bouncing and the reflected velocity in (4). In the Big Data scenario described below, this is the result of using a random subset of the data at each gradient evaluation, and can be represented as

(5)

where and has zero mean.

Theorem 1: The invariance of under the BPS algorithm is unaffected by the zero-mean noise (5) if and are independent for .

See Supp. Material A.2 for a proof sketch. Defining , the intensity of the inhomogeneous Poisson process , which determines the time of the velocity bounce, now becomes stochastic, and the resulting point process is called a doubly stochastic, or Cox, process (Cox, 1955; Grandell, 1976). The effect of the gradient noise is to increase the average point process intensity, since , from Jensen’s inequality. This leads to more frequent bounces and typically a slower mixing of the Markov process, as illustrated in Figure 1.

Many Cox processes are based on Poisson intensities obeying stochastic differential equations, or assume that the joint distribution at several

’s has a non-trivial -dependent structure. Our case is different because we assume that and are independent even when and are infinitesimally close.

3.2 Sampling from Big Data posteriors

In a prototypical Bayesian setting, we have a prior , i.i.d. data points , and the negative log-posterior gradient is

(6)

When is big we consider replacing the above gradient by the noisy approximation

(7)

where and the indices are sampled randomly without replacement. To sample from the posterior using the noisy gradient (7), we want to simulate the first arrival time in a doubly stochastic Poisson process with random intensity , where

(8)

Note that is a stochastic process, and noise independence for different ’s implies that different ’s require independent mini-batches. Out of several methods to sample from (noisy) Poisson processes, the thinning method (Lewis & Shedler, 1979) is compatible with the noise independence assumption. This is a form of rejection sampling which proposes a first arrival time , sampled from an inhomogeneous Poisson process with intensity such that The particle moves a distance , and accepts the proposal to bounce the velocity with probability . Note that this accept-reject step is different from the MH algorithm (Robert & Casella, 2013), since the particle always moves the distance , and a rejection only affects the velocity bouncing. This can greatly improve the efficiency of the sampler. As in the noiseless case, one should in general also resample occasionally, to ensure ergodicity (Bouchard-Côté et al., 2015), although in the examples we considered this was not empirically necessary, since the mini-batch noise serves to randomize the velocity sufficiently, preventing “non-ergodic” trajectories that do not explore the full space.

In some special cases one can derive a bound that always holds (Bouchard-Côté et al., 2015; Bierkens et al., 2017). But this is atypical, due to the dependence of in (8) on the changing velocity and the mini-batch noise. Even when such bounds do exist, they tend to be conservatively high, leading to an inefficient sampler with many rejected proposals (wasting many mini-batches of data) before accepting.

Instead, we propose below an adaptive approximate bound which achieves a bias-variance trade-off between the frequency of the bounce proposals and a controllable probability of bound violation.

4 Proposal from Local Regression

Our approach to an adaptive and tractable proposal intensity relies on a predictive model of  based on previous observations; the key idea is to exploit the correlations between nearby values. The upper value of the resulting predictive confidence band can then be used as , and this band is adaptively updated as more proposals are generated.

While there are many possibilities for such a predictive model, we found that a simple local linear model was very effective and computationally trivial. Consider then the linear regression of

observed values since the previous bounce,

(9)

where

and the noise variance can be estimated from the mini-batch in (

7) as

(10)

Here denotes the sample variance of the mini-batch, and we included the finite population correction factor because the indices are sampled without replacement. The Gaussian noise assumption in in (9

) is valid when the mini-batch is sufficiently large that we can appeal to a central limit theorem. (For heavy-tailed noise we could consider more robust estimators, but we do not pursue this direction here.)

Adding a Gaussian prior to , and defining , the log posterior of is

Let and be the mean and covariance of this distribution. Using these estimates, we obtain the predictive distribution for for ,

(11)
(12)

with . Note that as usual the noise variance is different in (9) and (11), since in (9) we are fitting observed pairs , while in (11) we are predicting the value of and we include the uncertainty from the estimates. Also, for simplicity we extrapolate the observation noise to be the same as in the last mini-batch, .

We can now construct a tractable approximate thinning proposal intensity by choosing a confidence band multiple , and defining

as a linear interpolation between selected points along the non-linear curve

(13)

The proposal intensity is now , and sampling from an inhomogeneous Poisson process with piecewise linear rate can be done analytically using the inverse CDF method. When a bounce time is proposed at time , the particle moves a distance , a noisy observation is made as in (8) and the bounce time is accepted with probability . If the bounce is accepted, the velocity is reflected as in (4) (using instead of ), and the set of observed values is reinitialized with , which are the values one would obtain from sampling the same mini-batch after the bounce, since . On the other hand, if the proposal is rejected, the observed

are added to the set of observed values. The hyperparameters

of the regression model can be learned by performing, after each bounce, a gradient ascent step on the marginal likelihood, ; this gradient can be computed analytically and does not significantly impact the computational cost.

Figure 2:

Thinning proposal intensity for bounce times from a linear regression predictive confidence interval applied to a two-dimensional logistic regression posterior with

. Left: Starting from , the piecewise linear intensity is used to propose bounce times (green points). As these proposals are rejected additional observations are made until a proposal is accepted (red point). The decrease in the slope of indicates the decreasing uncertainty in the estimated regression parameters as observations increase; note that the linear approximation for the true is quite accurate here. Note also the reduced observation frequency at lower values of indicating more efficient sampling than is achievable with the constant and much higher bounds used in (Bouchard-Côté et al., 2015; Bierkens et al., 2016), which were in the range for this data. Right: The corresponding SBPS particle trajectory, with arrows indicating the initial velocity and the velocity after the bounce. The contours show the Laplace approximation of the log posterior.

The linear model for is good when the target distribution can be locally approximated by a Gaussian, since in (8) is a projection of the derivative of the negative log posterior. When the posterior is highly non-Gaussian, a decaying weight can be used for more-distant observations, leading to a local regression; the scale of this decay can be fit again via stochastic gradient ascent on the predictive likelihood. We have also explored a Gaussian Process regression model, but it did not improve over the linear model in the cases we considered. In Supp. Material E we discuss a potential problem with our approach in the case of multimodal distributions, and propose a solution for such cases.

Finally, note that the directional derivative of needed in (8) can in many cases be computed at a cheaper cost (by a factor of ) than the full gradient. The latter is only needed when a bounce is accepted. This is in contrast to other gradient based samplers which require the full gradient at every step.

We dub this approach to BPS with noisy gradients Stochastic BPS (SBPS). See Supp. Material C for pseudocode. Figure 2 illustrates the evolution of these dynamic proposal intensities in a simple example. In Section 5, we consider a variant to SBPS, called pSBPS, that learns a diagonal preconditioning factor for the gradient, and leads to a more efficient exploration of the space when the posterior is highly anisotropic and roughly axis-aligned.

4.1 Bias in the Samples

The constant in (13) controls the tradeoff between bias from possible cases and lower computational cost: higher leads to a more conservative (higher) proposal intensity and therefore a less-biased but more data-inefficient sampler. We present a bound on the Wasserstein distance between the exact and bias distributions in Supp. Material B, and explore this bias-variance tradeoff further in Supp. Material F. A quick bias diagnostic is the rate at which the bound is violated, i.e., cases with ; if this rate is significantly higher than expected under the local linear regression model, then a different approach should be considered.

5 Preconditioned SBPS

Consider now the linear transformation

with an arbitrary square matrix . A distribution of interest can be expressed in terms of as

(14)
(15)

The SBPS algorithm can be applied to the density using the gradients of . For this note that . The Poisson intensity to compute bounces is , with , and the velocity reflection is computed as

(16)

The piecewise linear trajectory becomes . The matrix is called a preconditioner in the optimization literature, but can also be used in a sampling context to reduce anisotropy of posterior distributions; it is often the case that a good preconditioner is not known in advance but is instead learned adaptively (Duchi et al., 2011).

Figure 3: Effect of diagonal preconditioning on SBPS performance. Sampling is from the logistic regression posterior as described in Section 7.1 , with . The preconditioner parameters are . Left: Contour plots of posterior log likelihood under the Laplace approximation. Center, right: ACF and trajectories in the direction of greatest covariance.

We use a diagonal preconditioner for simplicity. Denoting the th component at the th evaluation of the gradient by , we define

(17)
(18)

for some . The preconditioner at iteration is defined as . This is the same preconditioner used in  (Li et al., 2016), up to the factor; the latter is needed here in order to prevent scaling of .

As noted in (Li et al., 2016), a time dependent preconditioner requires adding a term proportional to to the gradient, yet this term is negligibly small and can be ignored when , since in this parameter regime the preconditioner changes slowly as a function of and thus of .

We call this preconditioned variant pSBPS. It performs favorably compared to SBPS when the posterior is anisotropic and axis-aligned, since we use a diagonal approximation of the Hessian in the preconditioner. See (Bierkens et al., 2017) for a related approach. As Figure 3 shows, pSBPS converges to the posterior mode faster than SBPS, and mixes faster in the direction of greatest covariance.111pSBPS code at https://github.com/dargilboa/SBPS-public.

6 Related Works

[width=1]lreg_d=20.pdf

Figure 4: Logistic regression posterior sampling, with (best seen in color). Top Left: Negative Log Likelihood (NLL) per data point of samples from SBPS compared with SGLD (step sizes ), mSGNHT (step size ), lipSBPS and SS-ZZ (see text for definitions and references), all initialized at the same random positions. Also shown are the normalized NLL of the MAP estimator and the mean std. dev. of the Laplace approximation NLL (distributed as ). The continuous samplers (SBPS, SS-ZZ, lipSBPS) were run to match the data cost of the discrete (SGLD, mSGNHT), and for their ACFs we discretized the continuous paths uniformly to obtain the same number of samples. Note that SBPS is the fastest to converge. Center/Right:

Trajectories and ACFs in the directions of largest and smallest eigenvalues of the Laplace approximation inverse Hessian. The ACFs were calculated after burn-in, while the trajectory plots only show initial convergence.

Inset:

CPU runtime for 100 epochs, showing a

advantage of SBPS over SS-ZZ and lipSBPS

Biased Samplers: Many stochastic gradient samplers (e.g. (Welling & Teh, 2011)) can be formulated exactly using a Wiener process (Ma et al., 2015), but they are biased because (i) the Gaussian assumption in the noise may not hold for small mini-batches, and (ii) the MH correction to the time discretization is avoided or approximated. Recently, irreversible samplers have been studied in this context (Ma et al., 2016). Choosing the step size in these samplers can be quite challenging, as discussed below: too-large step sizes increase the bias, while too-small step sizes slow the mixing, and in generic high-dimensional examples there is no way to automatically tune the step size (though see (Giles et al., 2016) for recent progress). In contrast, the bias in SBPS, controlled by the constant , does not come from time discretization, but from easy-to-track violations of the thinning bound when .

Exact non-BPS-like Samplers: Firefly MCMC (Maclaurin & Adams, 2014)

augments the target distribution with one binary variable per data point, and yields unbiased samples while only querying a subset of data points at each iteration. But it needs distribution-dependent lower bounds on the likelihood and requires an initial full sweep of the data. Also mixing can be extremely slow 

(Quiroz et al., 2015; Bardenet et al., 2015), and all the dataset must be available for access all the time.

Two recent novel proposals are (Quiroz et al., 2016), based on debiased pseudolikelihood combined with variance reduction techniques, and (Pollock et al., 2016), based on quasi-stationary distributions. These methods are relatively more complex, and we have not yet systematically compared them against SBPS.

Exact BPS-like Samplers: Two subsampling variants of BPS which preserve the exact distribution are Local BPS (Bouchard-Côté et al., 2015), that needs a pre-processing step of computational cost , and ZZ-SS (Bierkens et al., 2016). In these approaches, the requirement to preserve the distribution exactly leads to extremely conservative thinning bounds, which in turn yield a very slow exploration of the space, as we will see below. Also, the bounds need to be rederived for each new model (if possible at all), unlike SBPS which can be used for any differentiable posterior distribution.

7 Experiments

7.1 Logistic Regression

Although simpler MCMC methods perform well in Bayesian logistic regression (BLR) models (Chopin & Ridgway, 2015), we begin with this well-understood case for comparing SBPS against a few of the existing stochastic MCMC methods discussed in the previous section. To generate the data, we sampled the components of the true from and data points from a -dimensional zero-mean Gaussian, with one component of the diagonal covariance set to 6 and all the rest to 1. Labels are drawn from , where . In the regime the Laplace approximation holds fairly well, providing another good comparison method. Figure 4 shows results for .

We run comparisons against the biased stochastic samplers Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011) and multivariate Stochastic Gradient Nose-Hoover Thermostat (mSGNHT) (Li et al., 2015) with fixed step sizes. As noted above, choosing optimal step sizes for these samplers is challenging. To allow SGLD and mSGNHT to perform best, we performed a scan to find the largest (fastest-mixing) step size that did not lead to overly large bias compared to the Laplace approximation. (Note, importantly, that this scan is expensive and is not possible in high-dimensional examples where the Laplace approximation does not hold - precisely the cases where MCMC methods are most valuable.) See Supp. Material E for details of this scan, which led to an optimal step size of for SGLD. Larger step sizes led to visible biases in the samples (not shown); we also show the results with step size for comparison to note that the results do depend sensitively on this parameter.

We also compare against ZZ-SS. Instead of Local BPS, we ran comparisons against an unbiased method we call lipSBPS (short for Lipshitz BPS), where the velocity bounces occur as first arrival events in a Poisson process with noisy intensity built from a noisy gradient (7) of minimal size , and simulated with thinning using an exact upper bound derived in Supp. Material F. One can verify that the resulting stochastic process is identical to that of Local BPS. Our bound is higher than that used in (Bouchard-Côté et al., 2015) by up to a factor of 2, which results in up to twice as many bounce proposals. On the other hand, our bound can be computed in time, does not require non-negative covariates, and can be used also for . Again, we note that this lipSBPS method, like Local BPS and ZZ-SS, are not generally applicable because the derived bounds only apply in special cases.

The results of Figure 4 show that SBPS outperforms the optimally tuned SGLD and mSGNHT, and converges orders of magnitude faster than lipSBPS and ZZ-SS. While the latter two methods are unbiased, our results suggest that the small bias introduced by SBPS is worth the massive reduction in variance.

In Supp. Material F we explore the effects of the hyperparameters: , , and refresh rate . The conclusion is that in this logistic example no manual hyperparameter tuning was required (in stark contrast to the careful step size tuning required for SGLD): the bias-controlling constant can be set in the range (consistent with the tails of the Gaussian in the linear regression model) and the mini-batch size should be small, but large enough for the CLT to justify the noise term in (9); worked well, but the results were not sensitively dependent on . For small values of the mini-batch variability provided sufficient velocity randomness that no additional velocity refreshes were necessary, so we did not have to tune either.

The comparison to pSBPS shows an improvement in the rate of convergence to the posterior mode. The MAP estimator was calculated using SAG (Roux et al., 2012), and the Hessian was computed exactly.

7.2 Continuous Trajectory Sampling

Figure 5: Estimated mean of , under continuous and discrete samples, with different ratios where is the average linear trajectory length. The posterior distribution and settings are as in Figure 4. Assuming the Laplace approximation holds, the expectation of is . Left: For there is little difference between continuous or discrete samples. Center: For the continuous mean converges faster than the discrete. Right: Expectation of the absolute value of the test function averaged over 5 runs of 1000 epochs, as a function of . The advantage of the continuous expectation when this ratio is is evident.

Figure 6: Neural network posterior sampling for a single hidden layer network trained on MNIST. . For SBPS . The posterior is compared to an expensive Metropolis-Hastings run. SBPS shows comparable mixing to an appropriately chosen SGLD without the need for a scan over step sizes. As can be seen, a poor choice of SGLD step size can lead to slow mixing or bias in the narrow directions of the target

A unique feature of BPS-like samplers is that their output is a continuous trajectory. Given and a set of velocities and bounce times , the estimated expectation of a test function is

(19)

where and is the total particle travel time. For simple test functions this integral is analytic, while more generally it can be computed numerically with standard efficient one-dimensional quadrature methods. When varies across a characteristic length shorter than the average trajectory length of the linear segments, we intuitively expect the error in the estimate (19) to be smaller than in estimators based on discrete samples. Note that this advantage tends to diminish for higher SBPS noise, since the linear segments become shorter.

Figure 5 explores empirically this idea in a simple setting by comparing the value of the expectation of under the posterior distribution of the logistic example considered above. Here are the first coordinates of the vectors , is the MAP value, and the characteristic length of . As expected, the error in the expectation is lower in the continuous case for .

7.3 Neural Network Posterior Sampling

We considered a simple model of one hidden layer followed by a softmax. For Bayesian approaches to neural networks see (Neal, 2012; Gal, 2016). The likelihood was the standard cross entropy with an additional regularization term where

is the probability of classifying the

th example correctly. was approximated via subsampling, and . This architecture was trained on the MNIST dataset. A subset of the training set was preprocessed by downsampling the images to , removing pixels that are 0 for all training examples and decreasing the number of digits to 4. The resulting training set size was . The resulting dimensionality of the posterior was . Mini-batch size was for all methods. All weights were initialized at 0 and all methods were run for epochs. SBPS is compared with SGLD at different step sizes, and performance is comparable to SGLD with an appropriate step size without requiring an expensive scan over step sizes. Since the additional regularization term can lead to unbounded gradients of the log posterior one can no longer use the bounds derived for the Local BPS and ZZ-SS algorithms and thus they cannot be applied to this problem without further work. This is not the case for SBPS. The posterior is not Gaussian due to the likelihood terms and thus the Laplace approximation is not effective unless the posterior is dominated by the prior.

In order to assess the quality of the sampling, we compare the trajectories to a standard costly Metropolis-Hastings MCMC using a Gaussian with variance as the proposal distribution. This algorithm was run for epochs and the proposal acceptance rate was 0.43. Figure 6 shows samples in the directions of the largest, median and smallest variance of the empirical covariance matrix of the Metropolis-Hastings samples.

8 Conclusions

This paper introduced a non-reversible sampler that can be applied to big datasets by means of subsampling the data in each iteration. At the price of a small, controllable bias, it provides the benefits of (i) high mixing speed associated with non-reversibility, and (ii) continuous sample trajectories, with (iii) minimal hyperparameter tuning required, leading to state of the art performance and making it a convenient alternative to biased, difficult-to-tune MH-based stochastic samplers.

Stochastic Bouncy Particle Sampler

Supplementary Material


Appendix A Proof Sketch of Invariance under Noisy Gradients

In this section we start with a simple reformulation of the proof in (Bouchard-Côté et al., 2015) that the BPS Markov process leaves invariant the distribution where

(A.1)
(A.2)

where is the -dimensional one-sphere. This will set the stage for the noisy case considered next. For a more formal and detailed treatment of the BPS algorithm, including ergodicity, see (Bouchard-Côté et al., 2015). For simplicity, we do not include here the velocity refreshments, which do not change the proof.

The proof sketches below are presented using a discrete-time approach followed by letting . We have found this approach more accessible for a machine learning audience. After submitting a preliminary version of this work to the arXiv, the preprint (Fearnhead et al., 2016) was submitted to the arXiv, which presents similar proofs of invariance by first deriving a general Fokker-Planck equation and then showing that the equation is satisfied both in noiseless and noisy cases.

a.1 Exact Gradient

To understand why the algorithm is correct, consider first the transition rule

(A.3)

where

(A.4)
(A.5)

and

(A.6)

This rule acts on the probability density as,

(A.7)

The two terms in (A.7) correspond to the two ways to reach at time . First, we can start at at time and move a distance without bouncing. This occurs with probability , so we have

(A.8)
(A.9)
(A.10)

where in (A.9) we did a Taylor expansion and in (A.10) we used (A.1).

The second term in (A.7) corresponds to being at at time , moving and bouncing. This occurs with probability , so we have

(A.11)
(A.12)

where again we did a Taylor expansion in (A.11). Adding (A.10) and (A.12), and using

(A.13)

equation (A.7) becomes

(A.14)

which implies that the distribution is stationary, .

a.2 Noisy Gradient

Consider now a noisy gradient represented as

(A.15)

where we assume that has zero mean.

First note that the requirement that and are conditionally independent given and , with , is needed to preserve under the noise the Markov property of the sampler, which requires the bounce point process intensity to depend only on , and not the past history of the trajectory.

Next we decompose the random vector into two orthogonal components,

(A.16)

with , and . This induces a corresponding decomposition in the probability density as

(A.17)

and note that from the assumption that has zero mean it follows that has zero mean. The noisy projected gradient becomes

(A.18)

To study the invariance of under the noisy BPS, let us consider again the decomposition (A.7) into straight and bounced infinitesimal trajectories. The probability that the particle is at at time and moves a distance without bouncing is the average of over all the possible realizations of , and is therefore given by

(A.19)
(A.20)

where the above expression defines . The first term of (A.7) is therefore

(A.22)

similarly to (A.8)-(A.10).

The second term in (A.7) now has contributions from all those values at time , such that a reflection of with respect to a noisy gives . Such a exists for every value of the noise vector , and is given by

(A.23)

Therefore the second term in (A.7) contains contributions from all the possible realizations of and is

(A.25)

where we used , the measure decomposition (A.17), and defined

(A.26)

Adding now (A.22) and (A.25), using (since is uniform) and

(A.27)

which follows from (A.20) and (A.26), and the fact that has zero mean, we get again the stationarity condition

(A.28)

Appendix B Biased Approximation

b.1 Biased bouncing rate

In the noiseless case, the velocity bounce is an event in a Poisson process with intensity while in the noisy case, the average Poisson intensity is where

(B.29)

When a thinning upper bound for is unknown and the distribution of is Gaussian with predicted variance , our algorithm makes a bounce proposal from a Poisson process with intensity

(B.30)

where is our estimate of . At the proposed bounce point , we evaluate , and accept with probability . The evaluation of also provides an estimate of the variance of . Assuming is Gaussian, the probability of the bound violation event , is

(B.31)

where is the standard normal CDF. For a given , the intensity is therefore,

(B.32)

where is the indicator function. Averaging over we get

(B.33)
(B.34)

If the probability of bound violation has a universal upper bound , we assume

(B.35)

where is a constant.

b.2 Preliminaries

We are interested bounding the distance between the equilibrium distribution of the biased, noisy BPS process with mean intensity , and the exact, noisy process with mean intensity . We start with some preliminary results.

Wasserstein Distance and Kantorovich Duality

We will consider the Wasserstein distance, defined as

(B.36)

where is the set of 1-Lipshitz continuous functions,

(B.37)

Given random variables

, a coupling is a joint distribution with marginals and . The Kantorovich duality (Villani, 2008) asserts that

(B.38)

Generators

To simplify the notation, let us define , . The infinitesimal generator of a stochastic process is defined as

(B.39)

and note that it satisfies

(B.40)
(B.41)
(B.42)

where the expectation is with respect to the distribution invariant under the stochastic process, and we used . In our case, the generator of a BPS process with intensity is (Davis, 1984; Fearnhead et al., 2016)

(B.43)

and similarly for .

Let us define

(B.44)

where the expectation is with respect to the distribution of the stochastic process with intensity at time and with a given initial condition. This expression satisfies the backward Kolmogorov equation

(B.45)

and also (Jacod & Shiryaev, 1987)

(B.46)

where the expectation is with respect to the distribution invariant under the stochastic process with intensity .

Ergodicity

We assume that the random process defined by SBPS is polynomial ergodic (although see the recent (Deligiannidis et al., 2017)). In particular, we assume that two distributions started at reflected velocities converge as

(B.47)

where are constants.222This assumed property follows usually from the existence of small sets (Lemma 3 in (Bouchard-Côté et al., 2015)) along with an appropriate Lyapunov function (Roberts et al., 2004).

Poisson Equation

Given a function , we will consider below the Poisson equation

(B.48)

We assume the existence of the solution

(B.49)

where was defined in (B.44). The fact that this expression solves (B.48) can be easily verified using (B.45), (B.46) and . For (see (B.37) ), this solution satisfies