# Stochastic Bouncy Particle Sampler

We introduce a novel stochastic version of the non-reversible, rejection-free Bouncy Particle Sampler (BPS), a Markov process whose sample trajectories are piecewise linear. The algorithm is based on simulating first arrival times in a doubly stochastic Poisson process using the thinning method, and allows efficient sampling of Bayesian posteriors over big datasets. We prove that in the BPS no bias is introduced by noisy evaluations of the log-likelihood gradient. On the other hand, we argue that efficiency considerations favor a small, controllable bias in the construction of the thinning proposals, in exchange for faster mixing. We introduce a simple regression-based proposal intensity for the thinning method that controls this trade-off. We illustrate the algorithm in several examples in which it outperforms both unbiased but slowly mixing stochastic versions of BPS and biased stochastic gradient-based samplers.


## 1 Introduction

The advent of the Big Data era presents special challenges to practitioners of Bayesian modeling, because typical sampling-based inference methods have a computational cost per sample that is linear in the size of the dataset. This computational burden has been addressed in recent years through two major approaches (see (Bardenet et al., 2015) for a recent overview): (i) split the data into batches and combine posterior samples obtained in parallel from each batch, or (ii) use variants of the Markov Chain Monte Carlo (MCMC) algorithm that only query a subset of the data at every iteration. Our interest in this paper is in the latter approach, where many methods are based on modifying both steps of the Metropolis-Hastings (MH) algorithm: in the proposal step, only a mini-batch of the data is used, and the accept-reject step is either ignored or approximated (Korattikara et al., 2013; Bardenet et al., 2014). This strategy has been explored using proposals from Langevin (Welling & Teh, 2011), Riemannian Langevin (Patterson & Teh, 2013), Hamiltonian (Chen et al., 2014) and Riemannian Hamiltonian (Ma et al., 2015) dynamics. Other relevant works include (Ahn et al., 2012; Ding et al., 2014).

Despite the success of the above approach, the partial accept-reject step is a source of bias, the precise size of which is difficult to control, and which tends to be amplified by the noisy evaluation of the gradient. This has motivated the search for unbiased stochastic samplers, such as the Firefly MCMC algorithm (Maclaurin & Adams, 2014), the debiased pseudolikelihood approach of (Quiroz et al., 2016), and the quasi-stationary distribution approach of (Pollock et al., 2016).

The present work is motivated by the idea that the bias could be reduced by starting from a rejection-free MCMC algorithm, thus avoiding the Metropolis-Hastings accept-reject step altogether. Two similar algorithms of this type have recently been proposed: the Bouncy Particle Sampler (BPS) (Peters & de With, 2012; Bouchard-Côté et al., 2015), and Zig-Zag Monte Carlo (Bierkens & Roberts, 2015; Bierkens et al., 2016). These algorithms sample from the target distribution through non-reversible, piecewise linear Markov processes. Non-reversibility (i.e., the failure to satisfy detailed balance) has been shown in many cases to yield faster mixing rates (Neal, 2004; Vucelja, 2014; Bouchard-Côté et al., 2015).

Our contributions in this paper are twofold. Firstly, we show that the BPS algorithm is particularly well suited to sample from posterior distributions of big datasets, because the target distribution is invariant under zero-mean noisy perturbations of the log-likelihood gradient, such as those introduced by using mini-batches of the full dataset in each iteration.

Stochastic variants of BPS or Zig-Zag that preserve the target distribution exactly have been proposed, such as Local BPS (Bouchard-Côté et al., 2015) or Zig-Zag with subsampling (ZZ-SS) (Bierkens et al., 2016), but they lead to extremely slow mixing because they are based on overly conservative bounds (which moreover must be derived on a case-by-case basis, and in many cases may not hold at all). This leads us to our second contribution, the Stochastic Bouncy Particle Sampler (SBPS), a stochastic version of the BPS algorithm which trades a small amount of bias for significantly reduced variance, yielding superior performance (and requiring no parameter tuning or derivation of problem-specific bounds) compared to existing subsampling-based Monte Carlo methods. SBPS inherits the piecewise linear sample paths of BPS, and therefore enjoys faster convergence of empirical means, particularly of rapidly varying test functions, compared to more standard approaches.

We organize this paper as follows. In Section 2 we review the Bouncy Particle Sampler, in Section 3 we study the invariance of the target distribution under noise perturbations to the BPS updates, in Section 4 we introduce SBPS, and in Section 5 a preconditioned variant. In Section 6 we discuss related works and in Section 7 we illustrate the advantages of SBPS in several examples.

## 2 The Bouncy Particle Sampler

Consider a distribution $p(w) \propto e^{-U(w)}$, $w \in \mathbb{R}^d$, where the normalization factor may be intractable. The Bouncy Particle Sampler (BPS), proposed in (Peters & de With, 2012; Monmarché, 2014) and formalized and developed in (Bouchard-Côté et al., 2015), introduces a random velocity vector $v$, distributed uniformly in the unit sphere $S^{d-1}$, and defines a continuous Markov process in $(w, v)$. To describe this process we begin in discrete time and then take the continuous-time limit. Denoting time by $t$, consider a discrete-time Markov process that acts on the variables $(w, v)$ as

$$(w, v)_{t+\Delta t} = \begin{cases} (w + v\,\Delta t,\; v) & \text{with prob. } 1 - \Delta t\,[G]_+ \\ (w + v\,\Delta t,\; v_r) & \text{with prob. } \Delta t\,[G]_+ \end{cases} \tag{1}$$

where

$$[x]_+ = \max(x, 0), \tag{2}$$
$$G = v \cdot \nabla U(w), \tag{3}$$
$$v_r = v - 2\,\frac{(v \cdot \nabla U(w))\,\nabla U(w)}{\|\nabla U(w)\|^2}. \tag{4}$$

Note that $G$ in (3) is the directional derivative of $U$ in the direction $v$, and $v_r$ is the reflection of $v$ with respect to the plane perpendicular to the gradient $\nabla U(w)$, satisfying $v_r \cdot \nabla U = -v \cdot \nabla U$ and $\|v_r\| = \|v\|$. In other words, the particle moves along a straight line in the direction of $v$, and this direction is reflected as in (4) with probability $\Delta t\,[G]_+$. This probability is non-zero only if the particle is moving in a direction of lower target probability $p(w)$, or equivalently higher potential $U(w)$.

Applying the transition (1) repeatedly and taking $\Delta t \to 0$, the random reflection point becomes an event in an inhomogeneous Poisson process with intensity $[G(t)]_+$. The resulting sampling procedure generates a piecewise linear Markov process (Davis, 1984; Dufour et al., 2015), and is summarized in Algorithm 1. Note that the algorithm also includes occasional resamplings of $v$, to ensure ergodicity (Bouchard-Côté et al., 2015). Remarkably, in the limit $\Delta t \to 0$, the algorithm leaves the joint factorized distribution $p(w)p(v)$ invariant, as we review in Supp. Material A.1.
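
As a concrete illustration of the noiseless case (a minimal sketch of our own, not the paper's reference implementation; all names are ours), the following implements BPS for an isotropic Gaussian target $U(w) = \|w\|^2/2$. There $G(t) = v \cdot (w + vt) = v \cdot w + t$ is linear in $t$, so bounce times can be sampled exactly by the inverse-CDF method rather than by thinning:

```python
import numpy as np

def bps_gaussian(d=2, T=5000.0, ref_rate=0.2, grid=0.1, seed=0):
    """Minimal BPS for an isotropic Gaussian target U(w) = ||w||^2 / 2.
    With ||v|| = 1, G(t) = v.(w + v t) = v.w + t is linear in t, so the
    bounce time solves  int_0^tau [a + t]_+ dt = E,  E ~ Exp(1)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    v = rng.standard_normal(d); v /= np.linalg.norm(v)
    samples, t_tot = [], 0.0
    while t_tot < T:
        a = v @ w                          # G(0) = v . grad U(w)
        E = rng.exponential()
        tau = -a + np.sqrt(a * a + 2 * E) if a >= 0 else -a + np.sqrt(2 * E)
        tau_ref = rng.exponential(1.0 / ref_rate)   # refresh, for ergodicity
        t = min(tau, tau_ref)
        for u in np.arange(0.0, t, grid):  # record the continuous segment
            samples.append(w + v * u)
        w = w + v * t
        if tau < tau_ref:                  # bounce: reflect v off grad U = w
            v = v - 2 * (v @ w) * w / (w @ w)
        else:                              # refresh v uniformly on the sphere
            v = rng.standard_normal(d); v /= np.linalg.norm(v)
        t_tot += t
    return np.array(samples)
```

Discretizing the trajectory on a uniform time grid, as above, gives samples whose empirical mean and variance should match the standard Gaussian target.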

The Zig-Zag process (Bierkens & Roberts, 2015; Bierkens et al., 2016) is similar to BPS, but the velocity components can take only the values $\pm 1$, and the piecewise linear trajectories change direction only in a single coordinate at each random breakpoint. For a review of these methods, see (Fearnhead et al., 2016; Bierkens et al., 2017).

## 3 Noise Resilience and Big Data

### 3.1 Noise Resilience

Let us assume that only a noisy version $\nabla \tilde{U}(w)$ of the gradient is available to compute the probability of bouncing and the reflected velocity in (4). In the Big Data scenario described below, this is the result of using a random subset of the data at each gradient evaluation, and can be represented as

$$\nabla \tilde{U}(w) = \nabla U(w) + n_w, \qquad n_w \sim p(n_w|w), \tag{5}$$

where $n_w \in \mathbb{R}^d$ and $n_w$ has zero mean.

Theorem 1: The invariance of $p(w)p(v)$ under the BPS algorithm is unaffected by the zero-mean noise (5) if $n_w$ and $n_{w'}$ are independent for $w \neq w'$.

See Supp. Material A.2 for a proof sketch. Defining $\tilde{G} = v \cdot \nabla \tilde{U}(w)$, the intensity $[\tilde{G}(t)]_+$ of the inhomogeneous Poisson process, which determines the time of the velocity bounce, now becomes stochastic, and the resulting point process is called a doubly stochastic, or Cox, process (Cox, 1955; Grandell, 1976). The effect of the gradient noise is to increase the average point process intensity, since $E[\tilde{G}]_+ \geq [E\tilde{G}]_+ = [G]_+$ by Jensen's inequality. This leads to more frequent bounces and typically a slower mixing of the Markov process, as illustrated in Figure 1.
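
The Jensen's-inequality effect is easy to verify numerically (an illustrative check of our own, with an arbitrary noiseless value and noise scale):

```python
import numpy as np

# Zero-mean gradient noise increases the *average* bounce intensity:
# E[G + n]_+ >= [E(G + n)]_+ = [G]_+ because x -> max(x, 0) is convex.
rng = np.random.default_rng(1)
G = 0.5                                   # a fixed noiseless intensity value
noise = rng.normal(0.0, 2.0, size=100_000)  # zero-mean gradient noise
noisy_rate = np.maximum(G + noise, 0.0).mean()
print(noisy_rate)                         # ≈ 1.07, well above [G]_+ = 0.5
```

The larger the noise variance, the larger the inflation of the average intensity, and hence the more frequent the bounces.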

Many Cox processes are based on Poisson intensities obeying stochastic differential equations, or assume that the joint intensity distribution at several $t$'s has a non-trivial $t$-dependent structure. Our case is different, because we assume that $n_w$ and $n_{w'}$ are independent even when $w$ and $w'$ are infinitesimally close.

### 3.2 Sampling from Big Data posteriors

In a prototypical Bayesian setting, we have a prior $f(w)$, i.i.d. data points $x_i$, and the negative log-posterior gradient

$$\nabla U(w) = -\nabla\Big[\log f(w) + \sum_{i=1}^{N} \log p(x_i|w)\Big]. \tag{6}$$

When $N$ is big we consider replacing the above gradient by the noisy approximation

$$\nabla \tilde{U}(w) = -\nabla\Big[\log f(w) + \frac{N}{n}\sum_{i=1}^{n} \log p(x_{r_i}|w)\Big], \tag{7}$$

where $n \ll N$ and the indices $r_i$ are sampled randomly without replacement. To sample from the posterior using the noisy gradient (7), we want to simulate the first arrival time in a doubly stochastic Poisson process with random intensity $[\tilde{G}(t)]_+$, where

$$\tilde{G}(t) = v \cdot \nabla \tilde{U}(w + vt). \tag{8}$$

Note that $\tilde{G}(t)$ is a stochastic process, and noise independence for different $w$'s implies that different $t$'s require independent mini-batches. Out of several methods to sample from (noisy) Poisson processes, the thinning method (Lewis & Shedler, 1979) is compatible with the noise independence assumption. This is a form of rejection sampling which proposes a first arrival time $t$, sampled from an inhomogeneous Poisson process with an intensity $\lambda(t)$ such that $\lambda(t) \geq [\tilde{G}(t)]_+$. The particle moves a distance $vt$, and accepts the proposal to bounce the velocity with probability $[\tilde{G}(t)]_+/\lambda(t)$. Note that this accept-reject step is different from the MH algorithm (Robert & Casella, 2013), since the particle always moves the distance $vt$, and a rejection only affects the velocity bouncing. This can greatly improve the efficiency of the sampler. As in the noiseless case, one should in general also resample $v$ occasionally, to ensure ergodicity (Bouchard-Côté et al., 2015), although in the examples we considered this was not empirically necessary, since the mini-batch noise serves to randomize the velocity sufficiently, preventing "non-ergodic" trajectories that do not explore the full space.
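
The thinning step itself can be sketched generically as follows (our own illustrative implementation, not the paper's; in SBPS each evaluation of the noisy intensity at a proposed time would consume a fresh mini-batch):

```python
import numpy as np

def first_arrival_by_thinning(intensity, bound, rng, t_max=100.0):
    """First arrival time of an inhomogeneous Poisson process with rate
    intensity(t) <= bound, via thinning (Lewis & Shedler, 1979): propose
    from a homogeneous process with constant rate `bound`, and accept
    each proposed time t with probability intensity(t) / bound."""
    t = 0.0
    while t < t_max:
        t += rng.exponential(1.0 / bound)        # homogeneous proposal gap
        if rng.uniform() < intensity(t) / bound: # accept-reject step
            return t                             # accepted arrival time
    return None                                  # no arrival before t_max
```

For example, with the linear intensity $\lambda(t) = t$ (capped at a bound of 5), the mean first arrival time should approach $\sqrt{\pi/2} \approx 1.25$.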

In some special cases one can derive a bound $\lambda(t) \geq [\tilde{G}(t)]_+$ that always holds (Bouchard-Côté et al., 2015; Bierkens et al., 2017). But this is atypical, due to the dependence of $\tilde{G}(t)$ in (8) on the changing velocity and the mini-batch noise. Even when such bounds do exist, they tend to be conservatively high, leading to an inefficient sampler with many rejected proposals (wasting many mini-batches of data) before accepting.

Instead, we propose below an adaptive approximate bound which achieves a bias-variance trade-off between the frequency of the bounce proposals and a controllable probability of bound violation.

## 4 Proposal from Local Regression

Our approach to an adaptive and tractable proposal intensity relies on a predictive model of $\tilde{G}(t)$ based on previous observations; the key idea is to exploit the correlations between nearby values of $\tilde{G}$. The upper edge of the resulting predictive confidence band can then be used as the proposal intensity $\lambda(t)$, and this band is adaptively updated as more proposals are generated.

While there are many possibilities for such a predictive model, we found that a simple local linear model was very effective and computationally trivial. Consider then the linear regression of the $m$ values $\tilde{G}_i$ observed since the previous bounce,

$$\tilde{G}_i = \beta_1 t_i + \beta_0 + \varepsilon_{t_i}, \qquad \varepsilon_{t_i} \sim N(0, c_{t_i}^2), \tag{9}$$

where $\tilde{G}_i = \tilde{G}(t_i)$, and the noise variance $c_t^2$ can be estimated from the mini-batch in (7) as

$$c_t^2 = \frac{N^2}{n}\Big(1 - \frac{n}{N}\Big)\,\mathrm{Var}_i\big[v \cdot \nabla \log p(x_{r_i}|w)\big]. \tag{10}$$

Here $\mathrm{Var}_i$ denotes the sample variance over the mini-batch, and we included the finite population correction factor $(1 - n/N)$ because the indices are sampled without replacement. The Gaussian noise assumption in (9) is valid when the mini-batch is sufficiently large that we can appeal to a central limit theorem. (For heavy-tailed noise we could consider more robust estimators, but we do not pursue this direction here.)
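
The estimator (10) is a one-liner given the per-example gradients (an illustrative sketch; the function name and array layout are our own conventions):

```python
import numpy as np

def minibatch_noise_variance(grad_logp_terms, v, n, N):
    """Estimate the variance c_t^2 of the noisy directional derivative
    when the gradient uses n of N likelihood terms sampled without
    replacement, as in Eq. (10):
        c_t^2 = (N^2 / n) (1 - n/N) Var_i[ v . grad log p(x_{r_i}|w) ].
    `grad_logp_terms` is an (n, d) array of per-example gradients."""
    proj = grad_logp_terms @ v   # directional derivative per example
    return (N**2 / n) * (1.0 - n / N) * proj.var(ddof=1)
```

Note the `ddof=1` (sample variance); for the mini-batch sizes used in practice the distinction is minor.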

Adding a Gaussian prior $\beta_1 \sim N(\mu, \sigma^2)$ to the slope, and defining $\mathbf{x}_i = (t_i, 1)$ and $\beta = (\beta_1, \beta_0)$, the log posterior of $\beta$ is

$$2 \log p\big(\beta \,\big|\, \{t_i, \tilde{G}_i, c_{t_i}^2\}\big) = -\sum_{i=1}^{m} \frac{(\tilde{G}_i - \mathbf{x}_i \cdot \beta)^2}{c_{t_i}^2} - \frac{(\beta_1 - \mu)^2}{\sigma^2} + \text{const.}$$

Let $\hat{\beta}$ and $\Sigma$ be the mean and covariance of this distribution. Using these estimates, we obtain the predictive distribution for $\tilde{G}(t)$ for $t > t_m$,

$$\hat{G}(t) = \hat{\beta}_1 t + \hat{\beta}_0 + \eta_t, \qquad \eta_t \sim N(0, \rho^2(t)), \tag{11}$$
$$\text{where} \qquad \rho^2(t) = \mathbf{x}\,\Sigma\,\mathbf{x}^T + c_{t_m}^2, \tag{12}$$

with $\mathbf{x} = (t, 1)$. Note that, as usual, the noise variance is different in (9) and (11): in (9) we are fitting the observed pairs $(t_i, \tilde{G}_i)$, while in (11) we are predicting the value of $\tilde{G}(t)$, and we include the uncertainty from the estimates of $\beta$. Also, for simplicity we extrapolate the observation noise to be the same as in the last mini-batch, $c_t^2 = c_{t_m}^2$.

We can now construct a tractable approximate thinning proposal intensity by choosing a confidence band multiple $k$, and defining $\lambda(t)$ as a linear interpolation between selected points along the non-linear curve

$$\hat{\beta}_1 t + \hat{\beta}_0 + k\rho(t). \tag{13}$$

The proposal intensity is now the positive part of this piecewise linear interpolation, and sampling from an inhomogeneous Poisson process with piecewise linear rate can be done analytically using the inverse-CDF method. When a bounce is proposed at time $t$, the particle moves a distance $vt$, a noisy observation $\tilde{G}(t)$ is made as in (8), and the bounce is accepted with probability $[\tilde{G}(t)]_+/\lambda(t)$. If the bounce is accepted, the velocity is reflected as in (4) (using $\nabla \tilde{U}$ instead of $\nabla U$), and the set of observed values is reinitialized with $\{-\tilde{G}(t), c_t^2\}$, which are the values one would obtain from sampling the same mini-batch after the bounce, since $\tilde{v}_r \cdot \nabla \tilde{U} = -v \cdot \nabla \tilde{U}$. On the other hand, if the proposal is rejected, the observed $\{\tilde{G}(t), c_t^2\}$ are added to the set of observed values. The hyperparameters $\mu, \sigma^2$ of the regression model can be learned by performing, after each bounce, a gradient ascent step on the marginal likelihood; this gradient can be computed analytically and does not significantly impact the computational cost.
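
The inverse-CDF step for a piecewise linear proposal rate can be sketched as follows (our own illustrative implementation, assuming non-negative rate values at the knots; names are ours):

```python
import numpy as np

def first_arrival_piecewise_linear(ts, lams, rng):
    """First arrival time of an inhomogeneous Poisson process whose rate
    is the piecewise-linear interpolation of knots (ts[i], lams[i]) with
    lams[i] >= 0, via the inverse-CDF method: draw E ~ Exp(1) and solve
    int_0^tau lambda(s) ds = E, segment by segment."""
    E = rng.exponential()
    acc = 0.0                              # accumulated integrated rate
    for i in range(len(ts) - 1):
        dt = ts[i + 1] - ts[i]
        a, b = lams[i], lams[i + 1]
        seg = 0.5 * (a + b) * dt           # trapezoid: integral over segment
        if acc + seg >= E:
            # solve  a*u + 0.5*slope*u^2 = E - acc  for u in [0, dt]
            need = E - acc
            slope = (b - a) / dt
            if abs(slope) < 1e-12:
                u = need / a
            else:
                u = (-a + np.sqrt(a * a + 2 * slope * need)) / slope
            return ts[i] + u
        acc += seg
    return None   # no arrival before the last knot; extend the grid
```

With a constant rate the arrivals reduce to ordinary exponential waiting times, which gives a quick sanity check.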

The linear model for $\tilde{G}(t)$ is good when the target distribution can be locally approximated by a Gaussian, since $\tilde{G}(t)$ in (8) is a projection of the derivative of the negative log posterior. When the posterior is highly non-Gaussian, a decaying weight can be used for more-distant observations, leading to a local regression; the scale of this decay can again be fit via stochastic gradient ascent on the predictive likelihood. We have also explored a Gaussian Process regression model, but it did not improve over the linear model in the cases we considered. In Supp. Material E we discuss a potential problem with our approach in the case of multimodal distributions, and propose a solution for such cases.

Finally, note that the directional derivative of $\tilde{U}$ needed in (8) can in many cases be computed at a cheaper cost than the full gradient, which is only needed when a bounce is accepted. This is in contrast to other gradient-based samplers, which require the full gradient at every step.

We dub this approach to BPS with noisy gradients Stochastic BPS (SBPS). See Supp. Material C for pseudocode. Figure 2 illustrates the evolution of these dynamic proposal intensities in a simple example. In Section 5 we consider a variant of SBPS, called pSBPS, that learns a diagonal preconditioning factor for the gradient, leading to more efficient exploration when the posterior is highly anisotropic and roughly axis-aligned.

### 4.1 Bias in the Samples

The constant $k$ in (13) controls the tradeoff between the bias from possible bound violations and the computational cost: a higher $k$ leads to a more conservative (higher) proposal intensity, and therefore a less-biased but more data-inefficient sampler. We present a bound on the Wasserstein distance between the exact and biased distributions in Supp. Material B, and explore this bias-variance tradeoff further in Supp. Material F. A quick bias diagnostic is the rate at which the bound is violated, i.e., the rate of cases with $\tilde{G}(t) > \lambda(t)$; if this rate is significantly higher than expected under the local linear regression model, then a different approach should be considered.

## 5 Preconditioned SBPS

Consider now the linear transformation $w = Az$, with $A$ an arbitrary square matrix. A distribution of interest $p(w)$ can be expressed in terms of $z$ as

$$p_z(z)\,dz = p(w(z))\,dw = p(Az)\,|A|\,dz \tag{14}$$
$$= \exp(-U_z(z))\,dz. \tag{15}$$

The SBPS algorithm can be applied to the density $p_z(z)$ using the gradients of $U$. For this, note that $\nabla_z U_z(z) = A^T \nabla U(w)$ (the $|A|$ factor contributes only a constant), and that $A^T = A$ for the diagonal preconditioners used below. The Poisson intensity to compute bounces is $[v \cdot A\nabla U(w)]_+$, with $w = Az$, and the velocity reflection is computed as

$$v_r = v - 2\,\frac{(v \cdot A\nabla U(w))\,A\nabla U(w)}{\|A\nabla U(w)\|^2}. \tag{16}$$

The piecewise linear trajectory becomes $w(t) = w_0 + Avt$. The matrix $A$ is called a preconditioner in the optimization literature, but can also be used in a sampling context to reduce the anisotropy of posterior distributions; it is often the case that a good preconditioner is not known in advance but is instead learned adaptively (Duchi et al., 2011).

We use a diagonal preconditioner for simplicity. Denoting the $i$th component of the $j$th evaluation of the gradient by $g_i^j$, we define

$$a_i^j = \beta (g_i^j)^2 + (1 - \beta)\,a_i^{j-1}, \tag{17}$$
$$\tilde{a}^j = \frac{1}{d}\sum_{i=1}^{d} \frac{1}{\sqrt{a_i^j} + \epsilon}, \tag{18}$$

for some $\beta \in (0, 1)$ and small $\epsilon > 0$. The preconditioner at iteration $j$ is defined as $A^j = \mathrm{diag}\big\{\big(\tilde{a}^j(\sqrt{a_i^j} + \epsilon)\big)^{-1}\big\}$. This is the same preconditioner used in (Li et al., 2016), up to the $\tilde{a}^j$ factor; the latter is needed here in order to prevent scaling of the velocity norm $\|Av\|$.

As noted in (Li et al., 2016), a time-dependent preconditioner requires adding a correction term involving the derivative of $A$ to the gradient; yet this term is negligibly small and can be ignored when the exponential averaging in (17) makes the preconditioner change slowly as a function of $j$, and thus of $t$.

We call this preconditioned variant pSBPS. It performs favorably compared to SBPS when the posterior is anisotropic and axis-aligned, since we use a diagonal approximation of the Hessian in the preconditioner. See (Bierkens et al., 2017) for a related approach. As Figure 3 shows, pSBPS converges to the posterior mode faster than SBPS, and mixes faster in the direction of greatest covariance. (pSBPS code at https://github.com/dargilboa/SBPS-public.)

## 6 Related Works

Biased Samplers: Many stochastic gradient samplers (e.g. (Welling & Teh, 2011)) can be formulated exactly using a Wiener process (Ma et al., 2015), but they are biased because (i) the Gaussian assumption on the noise may not hold for small mini-batches, and (ii) the MH correction to the time discretization is avoided or approximated. Recently, irreversible samplers have been studied in this context (Ma et al., 2016). Choosing the step size in these samplers can be quite challenging, as discussed below: too-large step sizes increase the bias, while too-small step sizes slow the mixing, and in generic high-dimensional examples there is no way to automatically tune the step size (though see (Giles et al., 2016) for recent progress). In contrast, the bias in SBPS, controlled by the constant $k$, does not come from time discretization, but from easy-to-track violations of the thinning bound, i.e., cases with $\tilde{G}(t) > \lambda(t)$.

Exact non-BPS-like Samplers: Firefly MCMC (Maclaurin & Adams, 2014) augments the target distribution with one binary variable per data point, and yields unbiased samples while only querying a subset of data points at each iteration. But it needs distribution-dependent lower bounds on the likelihood and requires an initial full sweep of the data. Moreover, mixing can be extremely slow (Quiroz et al., 2015; Bardenet et al., 2015), and the whole dataset must be available for access at all times.

Two recent novel proposals are (Quiroz et al., 2016), based on debiased pseudolikelihood combined with variance reduction techniques, and (Pollock et al., 2016), based on quasi-stationary distributions. These methods are relatively more complex, and we have not yet systematically compared them against SBPS.

Exact BPS-like Samplers: Two subsampling variants of BPS which preserve the exact distribution are Local BPS (Bouchard-Côté et al., 2015), which needs a costly pre-processing step, and ZZ-SS (Bierkens et al., 2016). In these approaches, the requirement to preserve the distribution exactly leads to extremely conservative thinning bounds, which in turn yield a very slow exploration of the space, as we will see below. Also, the bounds need to be rederived for each new model (if possible at all), unlike SBPS, which can be used for any differentiable posterior distribution.

## 7 Experiments

### 7.1 Logistic Regression

Although simpler MCMC methods perform well in Bayesian logistic regression (BLR) models (Chopin & Ridgway, 2015), we begin with this well-understood case to compare SBPS against a few of the existing stochastic MCMC methods discussed in the previous section. To generate the data, we sampled the components of the true $w$ from a standard Gaussian, and sampled the data points from a $d$-dimensional zero-mean Gaussian with diagonal covariance, with one variance set to 6 and all the rest to 1. Labels are drawn from the logistic model, $p(y = 1|x, w) = \sigma(w \cdot x)$. In the regime $N \gg d$ the Laplace approximation holds fairly well, providing another good comparison method. Figure 4 shows the results.

We run comparisons against the biased stochastic samplers Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011) and multivariate Stochastic Gradient Nosé-Hoover Thermostat (mSGNHT) (Li et al., 2015) with fixed step sizes. As noted above, choosing optimal step sizes for these samplers is challenging. To allow SGLD and mSGNHT to perform at their best, we performed a scan to find the largest (fastest-mixing) step size that did not lead to an overly large bias relative to the Laplace approximation. (Note, importantly, that this scan is expensive and is not possible in high-dimensional examples where the Laplace approximation does not hold, precisely the cases where MCMC methods are most valuable.) See Supp. Material E for details of this scan, which determined the optimal SGLD step size. Larger step sizes led to visible biases in the samples (not shown); we also show the results at a different step size for comparison, to note that the results do depend sensitively on this parameter.

We also compare against ZZ-SS. Instead of Local BPS, we ran comparisons against an unbiased method we call lipSBPS (short for Lipschitz BPS), where the velocity bounces occur as first arrival events in a Poisson process with noisy intensity built from a noisy gradient (7) of minimal mini-batch size, simulated with thinning using an exact upper bound derived in Supp. Material F. One can verify that the resulting stochastic process is identical to that of Local BPS. Our bound is higher than that used in (Bouchard-Côté et al., 2015) by up to a factor of 2, which results in up to twice as many bounce proposals. On the other hand, our bound can be computed cheaply and does not require non-negative covariates. Again, we note that this lipSBPS method, like Local BPS and ZZ-SS, is not generally applicable, because the derived bounds only apply in special cases.

The results of Figure 4 show that SBPS outperforms the optimally tuned SGLD and mSGNHT, and converges orders of magnitude faster than lipSBPS and ZZ-SS. While the latter two methods are unbiased, our results suggest that the small bias introduced by SBPS is worth the massive reduction in variance.

In Supp. Material F we explore the effects of the hyperparameters: the confidence multiple $k$, the mini-batch size $n$, and the velocity refresh rate. The conclusion is that in this logistic example no manual hyperparameter tuning was required (in stark contrast to the careful step size tuning required for SGLD): the bias-controlling constant $k$ can be set to a few standard deviations (consistent with the tails of the Gaussian in the linear regression model), and the mini-batch size should be small, but large enough for the CLT to justify the noise term in (9); the results were not sensitively dependent on $n$. For small values of $n$ the mini-batch variability provided sufficient velocity randomness that no additional velocity refreshes were necessary, so we did not have to tune the refresh rate either.

The comparison to pSBPS shows an improvement in the rate of convergence to the posterior mode. The MAP estimator was calculated using SAG (Roux et al., 2012), and the Hessian was computed exactly.

### 7.2 Continuous Trajectory Sampling

A unique feature of BPS-like samplers is that their output is a continuous trajectory. Given an initial point $w_0$ and a set of velocities and bounce times $\{v_i, t_i\}_{i=0}^{R-1}$, the estimated expectation of a test function $f$ is

$$\langle f(w) \rangle_{BPS} \equiv \frac{1}{T}\sum_{i=0}^{R-1} \int_0^{t_i} f(w_i + v_i t)\,dt, \tag{19}$$

where $w_{i+1} = w_i + v_i t_i$ and $T = \sum_i t_i$ is the total particle travel time. For simple test functions this integral is analytic, while more generally it can be computed numerically with standard efficient one-dimensional quadrature methods. When $f$ varies across a characteristic length shorter than the average length of the linear trajectory segments, we intuitively expect the error in the estimate (19) to be smaller than in estimators based on discrete samples. Note that this advantage tends to diminish for higher SBPS noise, since the linear segments become shorter.
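
Eq. (19) can be sketched directly in code (an illustrative implementation of our own; midpoint quadrature stands in for the "standard one-dimensional quadrature", and is exact for linear $f$):

```python
import numpy as np

def traj_expectation(f, w0, vs, ts, n_quad=32):
    """Estimate <f(w)> from a piecewise-linear trajectory, Eq. (19):
    (1/T) * sum_i int_0^{t_i} f(w_i + v_i t) dt, with each segment
    integral computed by the midpoint rule."""
    w = np.array(w0, dtype=float)
    total, T = 0.0, float(np.sum(ts))
    for v, t in zip(vs, ts):
        u = (np.arange(n_quad) + 0.5) * (t / n_quad)   # quadrature midpoints
        seg_vals = np.array([f(w + np.asarray(v) * ui) for ui in u])
        total += seg_vals.mean() * t                   # segment integral
        w = w + np.asarray(v) * t                      # next breakpoint w_{i+1}
    return total / T
```

For instance, for a trajectory in one dimension that moves from 0 to 1 and back with unit speed, the time average of $f(w) = w$ is exactly 0.5.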

Figure 5 explores this idea empirically in a simple setting, by comparing continuous and discrete estimates of the expectation of a rapidly oscillating test function of $w$ under the posterior distribution of the logistic example considered above, with the characteristic length of the test function shorter than the typical trajectory segment. As expected, the error in the expectation is lower in the continuous case in this regime.

### 7.3 Neural Network Posterior Sampling

We considered a simple model with one hidden layer followed by a softmax; for Bayesian approaches to neural networks see (Neal, 2012; Gal, 2016). The likelihood was the standard cross entropy with an additional regularization term depending on $p_i$, the probability of classifying the $i$th example correctly; this term was approximated via subsampling. The architecture was trained on the MNIST dataset. A subset of the training set was preprocessed by downsampling the images, removing pixels that are 0 for all training examples, and decreasing the number of digit classes to 4. The same mini-batch size was used for all methods, all weights were initialized at 0, and all methods were run for the same number of epochs. SBPS is compared with SGLD at different step sizes, and its performance is comparable to SGLD with an appropriate step size, without requiring an expensive scan over step sizes. Since the additional regularization term can lead to unbounded gradients of the log posterior, one can no longer use the bounds derived for the Local BPS and ZZ-SS algorithms, and thus they cannot be applied to this problem without further work. This is not the case for SBPS. The posterior is not Gaussian due to the likelihood terms, and thus the Laplace approximation is not effective unless the posterior is dominated by the prior.

In order to assess the quality of the sampling, we compare the trajectories to a standard, costly Metropolis-Hastings MCMC run with a Gaussian proposal distribution. This algorithm was run for many more epochs, and the proposal acceptance rate was 0.43. Figure 6 shows samples in the directions of the largest, median and smallest variance of the empirical covariance matrix of the Metropolis-Hastings samples.

## 8 Conclusions

This paper introduced a non-reversible sampler that can be applied to big datasets by means of subsampling the data in each iteration. At the price of a small, controllable bias, it provides the benefits of (i) high mixing speed associated with non-reversibility, and (ii) continuous sample trajectories, with (iii) minimal hyperparameter tuning required, leading to state of the art performance and making it a convenient alternative to biased, difficult-to-tune MH-based stochastic samplers.

# Supplementary Material

## Appendix A Proof Sketch of Invariance under Noisy Gradients

In this section we start with a simple reformulation of the proof in (Bouchard-Côté et al., 2015) that the BPS Markov process leaves invariant the distribution $p(w)p(v)$, where

$$p(w) \propto e^{-U(w)}, \qquad w \in \mathbb{R}^D, \tag{A.1}$$
$$p(v) = \mathrm{Unif}[S^{D-1}], \tag{A.2}$$

and $S^{D-1}$ is the $(D-1)$-dimensional unit sphere. This will set the stage for the noisy case considered next. For a more formal and detailed treatment of the BPS algorithm, including ergodicity, see (Bouchard-Côté et al., 2015). For simplicity, we do not include here the velocity refreshments, which do not change the proof.

The proof sketches below are presented using a discrete-time approach followed by letting $\Delta t \to 0$; we have found this approach more accessible for a machine learning audience. After a preliminary version of this work was submitted to the arXiv, the preprint (Fearnhead et al., 2016) appeared, which presents similar proofs of invariance by first deriving a general Fokker-Planck equation and then showing that the equation is satisfied in both the noiseless and noisy cases.

To understand why the algorithm is correct, consider first the transition rule

$$(w, v)_{t+\Delta t} = \begin{cases} (w + v\,\Delta t,\; v) & \text{with probability } 1 - \Delta t\,[G]_+ \\ (w + v\,\Delta t,\; v_r) & \text{with probability } \Delta t\,[G]_+ \end{cases} \tag{A.3}$$

where

$$[x]_+ = \max(x, 0), \tag{A.4}$$
$$G = v \cdot \nabla U(w), \tag{A.5}$$

and

$$v_r = v - 2\,\frac{(v \cdot \nabla U(w))\,\nabla U(w)}{\|\nabla U(w)\|^2}. \tag{A.6}$$

This rule acts on the probability density as

$$p_{t+\Delta t}(w, v) = [p_{t+\Delta t}(w, v)]_d + [p_{t+\Delta t}(w, v)]_r. \tag{A.7}$$

The two terms in (A.7) correspond to the two ways to reach $(w, v)$ at time $t + \Delta t$. First, we can start at $(w - v\Delta t, v)$ at time $t$ and move a distance $v\Delta t$ without bouncing. This occurs with probability $1 - \Delta t\,[v \cdot \nabla U]_+$, so we have

$$[p_{t+\Delta t}(w, v)]_d = \big(1 - \Delta t\,[v \cdot \nabla U]_+\big)\,p_t(v)\,p_t(w - v\Delta t) \tag{A.8}$$
$$= \big(1 - \Delta t\,[v \cdot \nabla U]_+\big)\,p_t(v)\,\big(p_t(w) - \Delta t\,v \cdot \nabla p_t(w) + O(\Delta t^2)\big) \tag{A.9}$$
$$= p_t(v)\,p_t(w)\,\big[1 + \Delta t\,v \cdot \nabla U - \Delta t\,[v \cdot \nabla U]_+\big] + O(\Delta t^2), \tag{A.10}$$

where in (A.9) we did a Taylor expansion and in (A.10) we used (A.1).

The second term in (A.7) corresponds to being at $(w - v_r\Delta t, v_r)$ at time $t$, moving a distance $v_r\Delta t$ and bouncing. Since $v_r \cdot \nabla U = -v \cdot \nabla U$, this occurs with probability $\Delta t\,[-v \cdot \nabla U]_+$, so we have

$$[p_{t+\Delta t}(w, v)]_r = \Delta t\,[-v \cdot \nabla U]_+\,p_t(w - v_r\Delta t, v_r) \tag{A.11}$$
$$= \Delta t\,[-v \cdot \nabla U]_+\,p_t(w, v_r) + O(\Delta t^2), \tag{A.12}$$

where again we did a Taylor expansion to obtain (A.12). Adding (A.10) and (A.12), using $p_t(w, v_r) = p_t(w, v)$ (since the distribution factorizes and $p(v)$ is uniform) and

$$[v \cdot \nabla U]_+ - [-v \cdot \nabla U]_+ = v \cdot \nabla U, \tag{A.13}$$

equation (A.7) becomes

$$p_{t+\Delta t}(w, v) = p_t(w, v) + O(\Delta t^2), \tag{A.14}$$

which, upon taking $\Delta t \to 0$, implies that the distribution is stationary, $\partial p_t(w, v)/\partial t = 0$.

Consider now a noisy gradient represented as

$$\nabla \tilde{U}(w) = \nabla U(w) + n_w, \qquad n_w \sim p(n_w|w), \qquad n_w \in \mathbb{R}^D, \tag{A.15}$$

where we assume that $n_w$ has zero mean.

First note that the requirement that $n_w$ and $n_{w'}$ be independent for $w \neq w'$ is needed to preserve, under the noise, the Markov property of the sampler, which requires the bounce point process intensity to depend only on the present state $(w, v)$, and not on the past history of the trajectory.

Next we decompose the random vector into two orthogonal components,

 nw=yv+nv, (A.16)

with , and . This induces a corresponding decomposition in the probability density as

$$dn_w\, p(n_w|w) = dy\, dn_v\, p(y|w)\, p(n_v | y, w, v), \tag{A.17}$$

and note that from the assumption that $n_w$ has zero mean it follows that $y$ has zero mean. The noisy projected gradient becomes

$$v \cdot \tilde{\nabla} U(w) = v \cdot \nabla U(w) + y, \qquad y \sim p(y|w). \tag{A.18}$$
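The decomposition (A.16) is just a projection onto $v$; a minimal sketch, with synthetic Gaussian noise standing in for the minibatch gradient noise:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
v = rng.normal(size=D)
v /= np.linalg.norm(v)              # unit-speed velocity
n_w = rng.normal(size=(100000, D))  # zero-mean noise samples, as in (A.15)

y = n_w @ v                         # scalar component along v (Eq. A.16)
n_v = n_w - y[:, None] * v          # orthogonal remainder, v . n_v = 0
```

Since $n_w$ has zero mean, so does the projection $y$, which is the only part of the noise that enters the bounce rate (A.18).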

To study the invariance of $p_t(w,v)$ under the noisy BPS, let us consider again the decomposition (A.7) into straight and bounced infinitesimal trajectories. The probability that the particle is at $(w - v\Delta t,\, v)$ at time $t$ and moves a distance $v\Delta t$ without bouncing is the average of $1 - \Delta t\,[v\cdot\nabla U(w) + y]_+$ over all the possible realizations of $y$, and is therefore given by

$$1 - \Delta t\, P_v \equiv 1 - \Delta t \int_{-\infty}^{+\infty} [v\cdot\nabla U(w) + y]_+\, p(y|w)\, dy, \tag{A.19}$$

$$= 1 - \Delta t \int_{-v\cdot\nabla U(w)}^{+\infty} (v\cdot\nabla U(w) + y)\, p(y|w)\, dy, \tag{A.20}$$

where the above expression defines $P_v$. The first term of (A.7) is therefore

$$[p_{t+\Delta t}(w,v)]_d = (1 - \Delta t\, P_v)\, p_t(w - v\Delta t,\, v), \tag{A.21}$$

$$= p_t(w,v) - \Delta t\, v\cdot\nabla p_t(w)\, p_t(v) - \Delta t\, P_v\, p_t(w)\, p_t(v) + O(\Delta t^2),$$

$$= p_t(w)\, p_t(v) \left[ 1 + \Delta t\, v\cdot\nabla U(w) - \Delta t\, P_v \right] + O(\Delta t^2), \tag{A.22}$$

similarly to (A.8)-(A.10).

The second term in (A.7) now has contributions from all those velocities $\tilde{v}_r$ at time $t$ such that a reflection of $\tilde{v}_r$ with respect to a noisy gradient $\tilde{\nabla} U(w)$ gives $v$. Such a $\tilde{v}_r$ exists for every value of the noise vector $n_w$, and is given by

$$\tilde{v}_r = v - 2\,\frac{(v \cdot \tilde{\nabla} U(w))\, \tilde{\nabla} U(w)}{\|\tilde{\nabla} U(w)\|^2}. \tag{A.23}$$

Therefore the second term in (A.7) contains contributions from all the possible realizations of $n_w$ and is

$$[p_{t+\Delta t}(w,v)]_r = \Delta t \int_{\mathbb{R}^D} dn_w\, [\tilde{v}_r \cdot \tilde{\nabla} U(w)]_+\, p(n_w|w)\, p_t(w - \tilde{v}_r\Delta t,\, \tilde{v}_r), \tag{A.24}$$

$$= \Delta t\, p_t(w, \tilde{v}_r) \int_{-\infty}^{+\infty} dy\, p(y|w)\, [-v\cdot\nabla U(w) - y]_+ \int dn_v\, p(n_v|y,w,v) + O(\Delta t^2),$$

$$= \Delta t\, P_{v_r}\, p_t(w, \tilde{v}_r) + O(\Delta t^2), \tag{A.25}$$

where we used $\tilde{v}_r \cdot \tilde{\nabla} U(w) = -(v\cdot\nabla U(w) + y)$, the measure decomposition (A.17), and defined

$$P_{v_r} = \int_{-\infty}^{-v\cdot\nabla U(w)} dy\, \left( -v\cdot\nabla U(w) - y \right) p(y|w). \tag{A.26}$$

Adding now (A.22) and (A.25), using $p_t(w, \tilde{v}_r) = p_t(w, v)$ (since $p_t(v)$ is uniform) and

$$P_v - P_{v_r} = v\cdot\nabla U(w), \tag{A.27}$$

which follows from (A.20) and (A.26) and the fact that $y$ has zero mean, we again obtain the stationarity condition

$$p_{t+\Delta t}(w,v) = p_t(w,v) + O(\Delta t^2). \tag{A.28}$$
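The cancellation (A.27) that drives the proof can also be checked numerically. The sketch below evaluates $P_v$ from (A.20) and $P_{v_r}$ from (A.26) by Riemann sums for a zero-mean Gaussian $p(y|w)$; any zero-mean density would do:

```python
import numpy as np

y = np.linspace(-40.0, 40.0, 400001)
dy = y[1] - y[0]
sigma = 3.0
p_y = np.exp(-0.5 * (y / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def P_v(G):
    """P_v from (A.20): E[(G + y)_+] under p(y)."""
    return np.sum(np.maximum(G + y, 0.0) * p_y) * dy

def P_vr(G):
    """P_{v_r} from (A.26): E[(-G - y)_+] under p(y)."""
    return np.sum(np.maximum(-G - y, 0.0) * p_y) * dy

# For zero-mean noise, P_v - P_{v_r} = G = v . grad U(w), as in (A.27).
```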

## Appendix B Biased Approximation

### b.1 Biased bouncing rate

In the noiseless case, the velocity bounce is an event in a Poisson process with intensity $\lambda(w) = [v\cdot\nabla U(w)]_+$, while in the noisy case the average Poisson intensity is $\lambda_n(w) = \mathbb{E}_y[\lambda_n(w,y)]$, where

$$\lambda_n(w,y) = [v\cdot\nabla U(w) + y]_+. \tag{B.29}$$

When a thinning upper bound for $\lambda_n$ is unknown and the distribution of $v\cdot\tilde{\nabla} U(w)$ is Gaussian with predicted variance $\rho^2(w)$, our algorithm makes a bounce proposal from a Poisson process with intensity

$$\lambda_\rho(w) = \hat{G} + k\,\rho(w), \tag{B.30}$$

where $\hat{G}$ is our estimate of $G = v\cdot\nabla U(w)$. At the proposed bounce point $w$, we evaluate $\tilde{G} = v\cdot\tilde{\nabla} U(w)$ and accept with probability $\min(1, [\tilde{G}]_+/\lambda_\rho(w))$. The evaluation of $\tilde{G}$ also provides an estimate $\sigma^2(w)$ of the variance of $\tilde{G}$. Assuming $\tilde{G}$ is Gaussian, the probability of the bound violation event $\lambda_n(w,y) > \lambda_\rho(w)$ is

$$q(w) = 1 - \Phi\!\left( \frac{\lambda_\rho(w) - v\cdot\nabla U(w)}{\sigma(w)} \right), \tag{B.31}$$

where $\Phi$ is the standard normal CDF. For a given $y$, the intensity of accepted bounce proposals is therefore

$$\lambda_b(w,y) = I\!\left[ \tfrac{\lambda_n}{\lambda_\rho} < 1 \right] \lambda_n(w,y) + I\!\left[ \tfrac{\lambda_n}{\lambda_\rho} > 1 \right] \lambda_\rho(w), \tag{B.32}$$

where $I[\cdot]$ is the indicator function. Averaging over $y$ we get

$$\lambda_b(w) = \mathbb{E}_y[\lambda_b(w,y)] \tag{B.33}$$

$$= (1 - q(w))\, \mathbb{E}_{\lambda_n \le \lambda_\rho}[\lambda_n(w,y)] + q(w)\, \lambda_\rho(w). \tag{B.34}$$

If the probability of bound violation has a universal upper bound $q$, we assume

$$|\lambda_b(w) - \lambda_n(w)| \le K_q = Cq + O(q^2), \tag{B.35}$$

where $C$ is a constant.
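In code, the biased thinning step of Section b.1 looks roughly as follows. This is a sketch under the simplifying assumption of a locally constant proposal intensity; `noisy_rate` is a hypothetical callable returning a noisy evaluation of $[\,v\cdot\nabla U + y\,]_+$ at the proposed point:

```python
import numpy as np

def thinned_bounce_time(lambda_rho, noisy_rate, rng, t_max=100.0):
    """Propose bounce times from a Poisson process with intensity
    lambda_rho (Eq. B.30) and accept by thinning.  When the proposal
    underestimates the true rate (lambda_n > lambda_rho) the acceptance
    probability is clipped at 1 -- the bound-violation event whose
    probability q(w) controls the bias, cf. (B.35)."""
    t = 0.0
    while t < t_max:
        t += rng.exponential(1.0 / lambda_rho)       # next proposal time
        accept_prob = min(1.0, noisy_rate(t) / lambda_rho)
        if rng.random() < accept_prob:
            return t
    return None                                      # no bounce before t_max
```

With `noisy_rate` equal to `lambda_rho` itself, every proposal is accepted and the returned times are Exponential(`lambda_rho`), which gives a simple sanity check.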

### b.2 Preliminaries

We are interested in bounding the distance between the equilibrium distribution of the biased, noisy BPS process with mean intensity $\lambda_b(w)$ and that of the exact, noisy process with mean intensity $\lambda_n(w)$. We start with some preliminary results.

### Wasserstein Distance and Kantorovich Duality

We will consider the Wasserstein distance, defined as

$$d_W(p_1, p_2) = \sup_{f \in C_L} \left| \mathbb{E}_{p_1}[f] - \mathbb{E}_{p_2}[f] \right|, \tag{B.36}$$

where $C_L$ is the set of 1-Lipschitz continuous functions,

$$C_L = \left\{ f : \mathbb{R}^d \to \mathbb{R} \;:\; |f(y) - f(x)| \le |y - x| \right\}. \tag{B.37}$$

Given random variables $z_1 \sim p_1$ and $z_2 \sim p_2$, a coupling $p_{12}$ is a joint distribution with marginals $p_1$ and $p_2$. The Kantorovich duality (Villani, 2008) asserts that

$$d_W(p_1, p_2) = \inf_{p_{12}} \mathbb{E}_{p_{12}}\left[ |z_1 - z_2| \right]. \tag{B.38}$$
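Both sides of the duality can be illustrated in one dimension, where the monotone (quantile) coupling is known to be optimal. For two unit Gaussians shifted by $m$, both the supremum in (B.36) and the infimum in (B.38) equal $|m|$; a numerical sketch, not part of the argument:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1.5
x1 = rng.normal(0.0, 1.0, size=100000)   # samples from p1 = N(0, 1)
x2 = rng.normal(m, 1.0, size=100000)     # samples from p2 = N(m, 1)

# Lower bound from (B.36), using the 1-Lipschitz test function f(x) = x.
lower = abs(x1.mean() - x2.mean())

# Upper bound from (B.38), via the monotone coupling (sort both samples).
upper = np.abs(np.sort(x1) - np.sort(x2)).mean()
```

Up to Monte Carlo error both bounds concentrate on $|m|$, so the sup and the inf meet, as the duality asserts.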

### Generators

To simplify the notation, let us define $z = (w,v)$ and $z_r = (w, v_r)$. The infinitesimal generator of a stochastic process is defined as

$$\mathcal{L} f(z) = \lim_{\delta t \to 0} \frac{\mathbb{E}[f(z_{t+\delta t}) \,|\, z_t = z] - f(z)}{\delta t}, \tag{B.39}$$

and note that it satisfies

$$\mathbb{E}[\mathcal{L} f] = \lim_{\delta t \to 0} \frac{\int dz_{t+\delta t}\, dz\; p(z_{t+\delta t}|z)\, p(z)\, f(z_{t+\delta t}) - \mathbb{E}[f(z)]}{\delta t}, \tag{B.40}$$

$$= \lim_{\delta t \to 0} \frac{\int dz_{t+\delta t}\; p(z_{t+\delta t})\, f(z_{t+\delta t}) - \mathbb{E}[f(z)]}{\delta t}, \tag{B.41}$$

$$= 0, \tag{B.42}$$

where the expectation is with respect to the distribution invariant under the stochastic process, and we used $\int dz\, p(z_{t+\delta t}|z)\, p(z) = p(z_{t+\delta t})$. In our case, the generator of a BPS process with intensity $\lambda_n$ is (Davis, 1984; Fearnhead et al., 2016)

$$\mathcal{L}_{\lambda_n} f(z) = v \cdot \nabla_w f(z) + \mathbb{E}_y\left[ \lambda_n(w,y)\, (f(z_r) - f(z)) \right], \tag{B.43}$$

and similarly for $\mathcal{L}_{\lambda_b}$.
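As a sanity check of (B.40)-(B.42), one can estimate $\mathbb{E}[\mathcal{L}f]$ by Monte Carlo under the invariant law in the noiseless one-dimensional case, taking $U(w) = w^2/2$, unit speeds $v = \pm 1$, and the test function $f(w,v) = wv$ (all choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400000
w = rng.normal(size=n)                    # invariant p(w) = N(0, 1)
v = rng.choice([-1.0, 1.0], size=n)       # invariant p(v): uniform on {-1, +1}

def f(w, v):
    return w * v                          # a smooth test function

# Generator (B.43) in the noiseless 1D case:
#   Lf(w, v) = v * df/dw + [v * w]_+ * (f(w, -v) - f(w, v))
Lf = v * v + np.maximum(v * w, 0.0) * (f(w, -v) - f(w, v))
estimate = Lf.mean()                      # should vanish, as in (B.42)
```

The drift term contributes $+1$ on average while the bounce term contributes $-1$, so the estimate fluctuates around zero at the Monte Carlo rate.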

Let us define

$$f_\lambda(z, t) = \mathbb{E}_\lambda[f(z_t) \,|\, z_0 = z], \tag{B.44}$$

where the expectation is with respect to the distribution of the stochastic process with intensity $\lambda$ at time $t$, given the initial condition $z_0 = z$. This expression satisfies the backward Kolmogorov equation

$$\frac{\partial f_\lambda(z,t)}{\partial t} = \mathcal{L}_\lambda f_\lambda(z,t), \tag{B.45}$$

and also (Jacod & Shiryaev, 1987)

$$\lim_{t \to \infty} f_\lambda(z,t) = \mathbb{E}_\lambda[f], \tag{B.46}$$

where the expectation is with respect to the distribution invariant under the stochastic process with intensity $\lambda$.

### Ergodicity

We assume that the random process defined by SBPS is polynomially ergodic (although see the recent (Deligiannidis et al., 2017)). In particular, we assume that the distributions started at $z$ and at the reflected $z_r$ converge as

$$d_W\!\left(p_{\lambda_n,t,z},\, p_{\lambda_n,t,z_r}\right) \le \frac{C_A}{(\alpha + t)^\beta}, \tag{B.47}$$

where $C_A$, $\alpha$, $\beta$ are constants. (This assumed property usually follows from the existence of small sets (Lemma 3 in (Bouchard-Côté et al., 2015)) along with an appropriate Lyapunov function (Roberts et al., 2004).)

### Poisson Equation

Given a function $f(z)$, we will consider below the Poisson equation

$$\mathcal{L}_\lambda u_f(z) = f(z) - \mathbb{E}_\lambda[f]. \tag{B.48}$$

We assume the existence of the solution

$$u_f(z) = \int_0^\infty ds\, \left( \mathbb{E}_\lambda[f] - f_\lambda(z,s) \right), \tag{B.49}$$

where $f_\lambda$ was defined in (B.44). The fact that this expression solves (B.48) can be easily verified using (B.45), (B.46), and $f_\lambda(z,0) = f(z)$. For $f \in C_L$ (see (B.37)), this solution satisfies

$$|u_f(z) - u_f(z_r)| =$$