1 Introduction
In developing algorithms for large datasets, much of the focus has been on optimization approaches that produce a point estimate with no characterization of uncertainty. This motivates the development of scalable Bayesian algorithms. As variational methods and related analytic approximations often lack theoretical support and can be inaccurate, this article focuses on posterior sampling algorithms.
One such class of methods is divide-and-conquer Markov chain Monte Carlo. The main idea is to divide the data into chunks, run Markov chain Monte Carlo independently for each chunk, and then combine the samples. There has been substantial effort in this direction [32, 27, 24, 29]. However, combining samples inevitably introduces some bias, and accuracy theorems require sample sizes to increase within each subset.
Another approach targets an approximate version of the target distribution. One strategy first constructs a cheap estimate of the acceptance probability for a Metropolis–Hastings algorithm [17] and uses it as a first accept/reject step. Only if samples are accepted at this stage is the full acceptance probability calculated. This is known as delayed-acceptance Markov chain Monte Carlo [15, 28]. However, this also introduces a bias into the invariant distribution. An alternative strategy uses subsamples to approximate transition probabilities and reduce bottlenecks in calculating likelihoods and gradients [33]. Such approaches typically rely on uniform random subsamples, which can be badly inefficient, as noted in a growing frequentist literature on biased subsampling [13, 30]. The Bayesian literature has largely overlooked the use of biased subsampling in efficient algorithm design, though recent coreset approaches address a related problem [18, 9].
A problem with subsampling Markov chain Monte Carlo is that it is almost impossible to preserve the correct invariant distribution. While there has been work on quantifying the resulting error [20, 1], doing so is usually difficult in practice. The pseudo-marginal approach of [3] provides a potential solution, but it is generally impossible to obtain the required unbiased estimators of likelihoods using subsamples [19]. A promising recent direction uses non-reversible samplers with subsampling within the framework of piecewise deterministic Markov processes [8, 11, 6]. These approaches use the gradient of the log-likelihood, which can be replaced by a subsample-based unbiased estimator, so that the exactly correct invariant distribution is maintained. This article focuses on improving the efficiency of such approaches through different forms of biased subsampling, motivated concretely by logistic regression problems.
2 Logistic regression with sparse imbalanced data
2.1 Model
We focus on the logistic regression model
(1) 
where , , and . Consider data from model (1), where . For a prior distribution on , the posterior distribution is
(2) 
where and denotes the potential function.
A popular algorithm for sampling from (2) is Pólya–Gamma data augmentation [26]. However, it performs poorly when there is large imbalance in the class labels [21]. Similar issues arise when the covariates are imbalanced. Logistic regression is routinely used across a broad range of fields, and such imbalance is extremely common, leading to a clear need for more efficient algorithms. While adaptive Metropolis–Hastings can perform well in small-scale settings [16], issues arise in scaling to large datasets.
2.2 The zigzag process
The zigzag process [6] is a piecewise deterministic Markov process which is particularly useful for logistic regression. Define to be the uniform measure on . The zigzag process is a continuous-time stochastic process on the augmented space , which under fairly mild conditions is ergodic with respect to the product measure ; in particular, is ergodic with respect to the target measure : for any integrable function ,
(3) 
holds almost surely, with the position and the velocity of the process at time .
For a starting point and velocity , the zigzag process evolves deterministically as
(4) 
At a random time , a bouncing event flips the sign of one component of the velocity . The process then evolves as equation 4 with the new velocity until the next change in velocity. The time is the first arrival time of independent Poisson processes with intensity functions , that is, with . The sign flip applies to , with if and if . The intensity functions are of the form , where is a rate function. A sufficient condition for the zigzag process to satisfy equation 3 is
(5) 
[6]. This condition is satisfied if and only if there exist nonnegative functions , such that ; here denotes the positive part of .
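For concreteness, the condition above and the associated decomposition can be written out as follows. This is a sketch in the notation standard for the zigzag process of [6]; the symbols $U$, $F_i$, and $\gamma_i$ are our own labels for the potential, the operator flipping the $i$-th velocity component, and the excess (refreshment) rates:

```latex
% For each component i, with F_i v denoting v with the sign of v_i flipped:
\lambda_i(\theta, v) - \lambda_i(\theta, F_i v) = v_i \, \partial_i U(\theta).
% Equivalently, there exist nonnegative functions \gamma_i with
% \gamma_i(\theta, v) = \gamma_i(\theta, F_i v) such that
\lambda_i(\theta, v) = \bigl(v_i \, \partial_i U(\theta)\bigr)^{+} + \gamma_i(\theta, v),
% where (a)^{+} = \max\{a, 0\} denotes the positive part.
```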
If has a simple closed form, the arrival times can be sampled as for . Otherwise, they are obtained via Poisson thinning [23]. Assume we have continuous functions such that

here are upper computational bounds. Let denote the first arrival times of nonhomogeneous Poisson processes with rates , and let . A zigzag process with intensity is still obtained if:

(i) the process is evolved according to equation 4 for time instead of ; and

(ii) the sign of is flipped at time with probability .
Algorithmic details are provided in the Supplement.
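The thinning scheme above can be sketched in code. The following is a minimal illustration, not the paper's supplementary algorithm: it assumes constant upper bounds valid along the whole trajectory (which holds, e.g., for targets with globally bounded gradients), and all function names and the toy potential are ours.

```python
import numpy as np

def zigzag(grad_U, bounds, T, theta0, v0, seed=0):
    """Sketch of zigzag simulation via Poisson thinning with constant bounds:
    bounds[i] >= max(0, v_i * dU/dtheta_i) everywhere. Candidate event times
    come from d homogeneous Poisson processes; each proposal flips the
    corresponding velocity component only with probability rate / bound."""
    rng = np.random.default_rng(seed)
    theta, v, t = theta0.astype(float).copy(), v0.astype(float).copy(), 0.0
    skeleton = [(t, theta.copy(), v.copy())]
    while t < T:
        taus = rng.exponential(1.0 / bounds)      # candidate arrival times
        i = int(np.argmin(taus))                  # first proposed event
        theta = theta + taus[i] * v               # linear dynamics, equation 4
        t += taus[i]
        rate = max(0.0, v[i] * grad_U(theta)[i])  # true bouncing rate
        if rng.uniform() < rate / bounds[i]:      # thinning accept step
            v[i] = -v[i]                          # flip one velocity component
            skeleton.append((t, theta.copy(), v.copy()))
    return skeleton

# Toy target with globally bounded gradients: U(theta) = sum_i log cosh(theta_i),
# so |dU/dtheta_i| = |tanh(theta_i)| < 1 and constant bounds of 1 are valid.
grad_U = lambda th: np.tanh(th)
skel = zigzag(grad_U, bounds=np.ones(2), T=100.0, theta0=np.zeros(2), v0=np.ones(2))
```

The skeleton of event times, positions, and velocities is all that is needed to compute ergodic averages, since the trajectory is linear between events.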
The subsampling approach of [6] uses uniform subsampling of a single data point to obtain an unbiased estimate of the th partial derivative of the potential function
where is from the prior and with is from the data. Their subsampling algorithm preserves the correct stationary distribution. Let denote a reference point, usually chosen as a mode of the target distribution. The authors consider estimates of the form
(6) 
where indexes the sampled data point and ; if , the second term is a control variate. This estimate is used to construct a stochastic rate function as
(7) 
By using upper bounds satisfying for all , the rate functions can be replaced by stochastic versions , with being resampled at every iteration. This results in a zigzag process with effective bouncing rate satisfying equation 5. We refer to the resulting algorithm as zigzag with subsampling when and zigzag with control variates when .
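As a concrete check of the estimator in equation (6), the sketch below builds the single-sample control-variate estimate for a generic sum-of-terms potential; the function names and toy data are ours, not the paper's. Averaging the estimate over all equally likely indices recovers the exact partial derivative, which is the unbiasedness the subsampling algorithm relies on.

```python
import numpy as np

def cv_estimate(grads, grads_ref, i, j):
    """Single-sample control-variate estimate of the i-th partial derivative
    of a data potential U = sum_j U^j (a sketch of equation (6)).
    grads[j]     : gradient of the j-th term at the current position
    grads_ref[j] : gradient of the j-th term at the reference point
    Returns  sum_k d_i U^k(ref) + N * (d_i U^j(current) - d_i U^j(ref))."""
    N = grads.shape[0]
    return grads_ref[:, i].sum() + N * (grads[j, i] - grads_ref[j, i])

# Unbiasedness check: averaging over all N equally likely indices j gives
# exactly the full partial derivative sum_j d_i U^j at the current position.
rng = np.random.default_rng(1)
grads, grads_ref = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
mean_est = np.mean([cv_estimate(grads, grads_ref, 0, j) for j in range(5)])
exact = grads[:, 0].sum()
```

In a real implementation only the sampled term is evaluated at each event; the reference gradients and their sum are precomputed once.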
3 Improved subsampling
3.1 General framework
We introduce generalizations of the above approach to replace uniform subsampling with importance sampling (Section 3.2), allow general minibatches instead of subsamples of size one (Section 3.3), and further improve mixing via stratified subsampling with control variates (Sections 3.4–3.5). Our motivation is to (a) increase sampling efficiency, and (b) simplify the construction of upper bounds. We achieve (b) by letting the Poisson process which determines bouncing times in component be the superposition of two independent Poisson processes with state-dependent bouncing rates
(8) 
respectively. A nonnegative refreshment intensity is induced by subsampling, so the superpositioning of the two processes defines a zigzag process which preserves the target measure . This follows from the fact that the corresponding state dependent bouncing rate
(9) 
of the superposition satisfies equation 5. We achieve (a) through general forms of the estimator in the Poisson thinning step, obtained through non-uniform subsampling. More precisely, we implement variants of the zigzag process with data subsampling in which we simulate the Poisson process with rate using unbiased estimators of the partial derivatives of the negative log-likelihood function in the Poisson thinning step. The following proposition shows that such an approach, implemented in Algorithm 1, results in a zigzag process preserving the target measure .
Proposition 3.1.
Let , , be a probability measure such that , , is an unbiased estimator of the th partial derivative of the negative log-likelihood function . Let and be such that for all and , . Similarly, let and be such that for all , . Then the zigzag process generated by Algorithm 1 preserves the target measure , and the effective bouncing rate of the generated zigzag process is of the form , with
(10) 
We have decomposed the stochastic rate function from equation 7 as the sum of a deterministic part for the prior and a stochastic part for the data, where the stochasticity is with respect to subsampling. This construction is related to ideas in [8]. The proof of Proposition 3.1 generalizes that of Theorem 4.1 in [6], and results similar to Proposition 3.1 appear in [31]. However, these ideas have not been exploited to design improved subsampling strategies. Our construction using two independent Poisson processes allows Poisson thinning of each process separately and simplifies the construction of tight upper bounds for certain subsampling schemes; see Section 3.2 below.
In the remainder of this article, we restrict attention to situations where one of the following assumptions on the terms of the log-likelihood function applies.
Assumption 3.1.
The partial derivatives of are uniformly bounded, that is, there exist constants , such that for all , .
Assumption 3.2.
The partial derivatives of are globally Lipschitz, that is, for suitable , there are constants , such that for all ,
To adapt equation 6 to Algorithm 1, choose the estimator of partial derivatives as
(11) 
with and . For under Assumption 3.1, we can bound realizations of the stochastic rate function by for . Similarly, if Assumption 3.2 holds and , we can bound realizations of by for . The state dependent bouncing rate of the resulting zigzag process is as defined in equation 9, with having the explicit form
(12) 
In the sequel, we consider the above two cases of Algorithm 1 as baselines for comparison. We introduce several subsampling schemes, corresponding to the measures in Proposition 3.1, with associated estimators and bounds, as alternative cases of Algorithm 1. These are designed to improve sampling efficiency by improving the diffusive properties of the zigzag process, reducing the integrated autocorrelation time, or reducing the computational cost per simulated unit time interval.
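To make Assumption 3.1 concrete for logistic regression, note that each log-likelihood term's partial derivative has the form x_i^j (sigmoid(x^j · θ) − y^j), whose magnitude is bounded by |x_i^j| because the bracketed factor lies in (−1, 1). The sketch below, in our own notation with design matrix X and labels y, computes these per-term constants and verifies the bound numerically:

```python
import numpy as np

def per_term_grads(X, y, theta):
    """Row j is the gradient of the j-th negative log-likelihood term of
    logistic regression: x^j * (sigmoid(x^j . theta) - y^j)."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return X * (p - y)[:, None]

def bound_constants(X):
    """Constants c_i^j = |x_i^j| satisfying |d_i U^j(theta)| <= c_i^j for
    every theta, since |sigmoid(z) - y| < 1 for labels y in {0, 1}."""
    return np.abs(X)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 2, size=50).astype(float)
theta = rng.normal(size=4)
G, C = per_term_grads(X, y, theta), bound_constants(X)
```

These constants are exactly the inputs needed by the thinning bounds of the uniform-subsampling baseline, and by the importance and stratified schemes that follow.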
3.2 Improving bounds via importance sampling
A generalization of the estimator is to consider the index to be sampled from a non-uniform probability distribution , defined by , where are weights satisfying . It follows that with defines an unbiased estimator of . For ,
(13) 
defines an upper bound for rate function under Assumption 3.1. Similarly, if and Assumption 3.2 holds,
(14) 
is an upper bound for . By Proposition 3.1, the resulting zigzag process generated by Algorithm 1 preserves in either setup the target measure . The contribution to the effective bouncing rate is , which is identical to (12). Hence, for either choice of , the effective bouncing rate remains unchanged from that for uniform subsampling. We can minimize the magnitudes of the upper bounds (13) and (14) by choosing the weight vector such that the constants and are minimized. This can be verified to be the case when the weights are chosen proportionally to the corresponding constants.

3.3 Improving mixing via minibatches
In algorithms such as stochastic gradient descent [7], stochastic gradient Langevin dynamics [33], and their variants, minibatches of size larger than one are used to decrease the variance of gradient estimates, which typically improves convergence. In the context of piecewise deterministic Markov processes, the motivation for using minibatches of size larger than one is to reduce the effective refreshment rate, which for large effective refreshment rates improves the mixing of the process [2]. The following propositions provide an explicit form for the refreshment rate induced by subsampling, together with corresponding upper bounds.
Proposition 3.2.
The refreshment rate induced by subsampling is .
Proposition 3.3.
The framework of Algorithm 1 allows us to consider estimates of using subsamples of size larger than one: we consider a minibatch of random indices , so that is an unbiased estimator of . Typically, the entries of the minibatch are sampled uniformly and independently from the dataset, so that is the uniform measure on . This yields unbiased estimators of the form
. Since for any function we have , it follows that upper bounds for minibatch size are also upper bounds for the stochastic rate functions associated with the estimate if . By the same arguments, we can conclude the following Lemma 3.1, which justifies using the bounds derived in Section 3.2 for importance subsampling with a minibatch of size in the setup of minibatches of arbitrary size .
Lemma 3.1.
Let
where and and are as defined in Section 3.2. The value of does not depend on the size of the minibatch .
If we consider minibatches of size , the effective bouncing rate of the process generated by Algorithm 1 when used with the above described estimators can be computed as
The effective refreshment rate is reduced with increased minibatch size, as the following lemma shows.
Lemma 3.2.
For all , , , we have
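The monotonicity asserted by Lemma 3.2 can be illustrated numerically. The excess of E[max(0, Ŝ_m)] over max(0, s), where Ŝ_m is a scaled minibatch mean estimating s = Σ_j s_j, corresponds (up to notation) to the refreshment rate induced by subsampling, and it shrinks as the minibatch grows because the estimator concentrates around s. The sketch below is a Monte Carlo illustration in our own notation, not the paper's proof:

```python
import numpy as np

def induced_refreshment(terms, m, n_mc=200_000, seed=3):
    """Monte Carlo estimate of E[max(0, S_m)] - max(0, s), where
    S_m = (N/m) * (sum of m uniformly drawn terms) is an unbiased estimate
    of s = sum(terms). By Jensen's inequality the excess is nonnegative,
    and it decreases as the minibatch size m grows."""
    rng = np.random.default_rng(seed)
    N, s = len(terms), terms.sum()
    idx = rng.integers(N, size=(n_mc, m))
    S = N * terms[idx].mean(axis=1)
    return np.maximum(S, 0.0).mean() - max(s, 0.0)

terms = np.array([-1.0, 2.0, -3.0, 1.0, -0.5])   # s = -1.5 < 0
g1, g16 = induced_refreshment(terms, 1), induced_refreshment(terms, 16)
```

For minibatch size one the excess can be computed exactly here (it equals 3), while for size sixteen it is markedly smaller, matching the lemma's prediction.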
3.4 Improving mixing via stratified subsampling
To further enhance mixing, we also design an approach based on stratified subsampling using control variates. To simplify notation, we only describe the construction of rates and bounds when in equation 11; this is easily adapted to the case . Suppose we divide the data index set into strata , such that for every component index , the sets form a partition of ; that is, , and for . If the minibatch is sampled such that the th entry of is drawn independently and uniformly from the stratum , that is, , , then it is easy to verify that
(15) 
defines an unbiased estimator for . When constructing upper bounds for the corresponding stochastic rate function, stratified subsampling can only improve upon the upper bounds used in uniform subsampling, since
where are the upper bounds derived in Section 3.1. Hence, Algorithm 1 parameterized with as defined in equation 15 and corresponding upper bounds generates a zigzag process with effective bouncing rates , where .
It is also possible to combine stratified subsampling with importance subsampling as follows. As in Section 3.2, weights , , can be assigned within the strata such that . Defining measures on respectively such that , one can verify that , (), is an unbiased estimator for . The optimal importance weights can be chosen as in Section 3.2.
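The variance reduction from stratification, and the bound-minimizing in-stratum importance weights, can both be checked directly. The sketch below is in our own notation: it computes the exact variance of a stratified estimator of s = Σ_j t_j in the style of equation (15) against a plain uniform minibatch of the same cost, and weights proportional to per-term constants as in Section 3.2.

```python
import numpy as np

def stratified_variance(terms, strata):
    """Exact variance of the stratified estimator: one uniform draw per
    stratum, each scaled by its stratum size |Omega_k|."""
    return sum(len(S) ** 2 * terms[S].var() for S in strata)

def uniform_minibatch_variance(terms, m):
    """Exact variance of the plain estimator (N/m) * (sum of m uniform draws)."""
    N = len(terms)
    return (N ** 2 / m) * terms.var()

def optimal_weights(c):
    """Weights proportional to per-term bound constants minimise the
    importance-sampling bound max_j c_j / w_j, collapsing it to sum_j c_j."""
    return c / c.sum()

terms = np.sort(np.random.default_rng(4).normal(size=40))
strata = np.array_split(np.arange(40), 4)      # contiguous blocks of sorted terms
v_strat = stratified_variance(terms, strata)
v_unif = uniform_minibatch_variance(terms, 4)  # same cost: 4 term evaluations
c = np.abs(terms) + 0.1
w = optimal_weights(c)
```

Because each stratum is internally homogeneous, the stratified variance is far below the uniform-minibatch variance at equal cost, while the estimator remains exactly unbiased.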
3.5 Construction of strata
We now describe how gradient information at a reference point can be used to construct strata for the above estimator so that, under the regularity conditions of Assumption 3.2, the refreshment rate of the associated zigzag process is provably reduced within some vicinity of . By Proposition 3.2, the effective refreshment rate induced by the stratified subsampling approach is of the form
Moreover, if Assumption 3.2 is satisfied, it follows that for all , since
for all . This suggests that in order to minimize the refreshment rate in the vicinity of the reference point , we construct the strata as the solution of the minimization problem
(16) 
Solving equation 16 may be hard if the number of strata is large. For this reason, we instead consider strata constructed as the solution of the minimization problem
(17) 
where, for a discrete set , denotes the diameter of the set . This approximation is justified by noting that the objective function of (17) is an upper bound on the original objective, as shown in the following proposition, which is a consequence of Lemma A.1 in the appendix.
Proposition 3.4.
Consider the estimator as defined in equation 15. The refreshment rate due to subsampling is bounded from above as .
We provide an efficient algorithm in the supplementary material which can be used to compute an approximate solution of this optimization problem. Under certain regularity conditions on the log-likelihood function, one may construct strata by minimizing an upper bound on the variance of the estimate (15) which holds uniformly in [35]. Since by Proposition 3.3 that upper bound is also an upper bound on the refreshment rate, this provides an alternative approach to constructing strata. Moreover, in practice one may also use other a priori available information on the data to inform the construction of the strata (see the construction of strata for the example described in Section 5).
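A simple heuristic for a diameter-minimization objective of the type in (17), assuming componentwise reference gradients are available, is to sort the observations by their gradient value at the reference point and cut the sorted order into equal-sized contiguous blocks: contiguous blocks of a sorted sequence have small diameters. This is only a sketch in our own notation; the algorithm in the supplementary material may differ.

```python
import numpy as np

def strata_by_sorting(grad_ref, K):
    """Partition data indices into K roughly equal-sized strata by sorting
    the per-observation reference gradients d_i U^j(theta*). Contiguous
    blocks of sorted values keep within-stratum diameters small, which
    heuristically shrinks a (17)-type objective."""
    return np.array_split(np.argsort(grad_ref), K)

def weighted_diameter_objective(grad_ref, strata):
    """Objective of the form: sum over strata of |Omega_k| * diam(values)."""
    return sum(len(S) * np.ptp(grad_ref[S]) for S in strata)

rng = np.random.default_rng(5)
g = rng.normal(size=60)                          # toy reference gradients
sorted_strata = strata_by_sorting(g, 6)
random_strata = np.array_split(rng.permutation(60), 6)
obj_sorted = weighted_diameter_objective(g, sorted_strata)
obj_random = weighted_diameter_objective(g, random_strata)
```

Against a random partition of the same sizes, sorting reduces the weighted-diameter objective substantially on this toy example.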
4 Numerical Examples
4.1 Computational bounds
For simplicity, we consider the prior to be ; other choices are discussed in the supplementary material. The corresponding negative log density function is . We let and derive computational bounds as in [6]. We write with and . For , the th partial derivatives are