Efficient posterior sampling for high-dimensional imbalanced logistic regression

05/27/2019, by Deborshee Sen, et al.

High-dimensional data are routinely collected in many application areas. In this article, we are particularly interested in classification models in which one or more variables are imbalanced. This creates difficulties in estimation. To improve performance, one can apply a Bayesian approach with Markov chain Monte Carlo algorithms used for posterior computation. However, current algorithms can be inefficient as n and/or p increase due to worsening time per step and mixing rates. One promising strategy is to use a gradient-based sampler to improve mixing while using data sub-samples to reduce per step computational complexity. However, usual sub-sampling breaks down when applied to imbalanced data. Instead, we generalize recent piece-wise deterministic Markov chain Monte Carlo algorithms to include stratified and importance-weighted sub-sampling. We also propose a new sub-sampling algorithm based on sorting data-points. These approaches maintain the correct stationary distribution with arbitrarily small sub-samples, and substantially outperform current competitors. We provide theoretical support and illustrate gains in simulated and real data applications.


1 Introduction

In developing algorithms for large datasets, much of the focus has been on optimization approaches that produce a point estimate with no characterization of uncertainty. This lack of uncertainty quantification motivates scalable Bayesian algorithms. As variational methods and related analytic approximations lack theoretical support and can be inaccurate, this article focuses on posterior sampling algorithms.

One such class of methods is divide-and-conquer Markov chain Monte Carlo. The main idea is to divide data into chunks, run Markov chain Monte Carlo independently for each chunk, and then combine samples. There has been serious effort in this direction [32, 27, 24, 29]. However, combining samples inevitably leads to some bias, and accuracy theorems require sample sizes to increase within each subset.

Another approach targets an approximate version of the target distribution. One strategy first constructs a cheap estimate of the acceptance probability for a Metropolis-Hastings algorithm [17] and uses that as a first accept/reject step. Only if samples are accepted at this stage is the full acceptance probability calculated. This is known as delayed-acceptance Markov chain Monte Carlo [15, 28]. However, this also introduces a bias into the invariant distribution.

An alternative strategy uses sub-samples to approximate transition probabilities and reduce bottlenecks in calculating likelihoods and gradients [33]. Such approaches typically rely on uniform random sub-samples, which can be badly inefficient, as noted in an increasing frequentist literature on biased sub-sampling [13, 30]. The Bayesian literature has largely overlooked the use of biased sub-sampling in efficient algorithm design, though recent coreset approaches address a related problem [18, 9].

A problem with sub-sampling Markov chain Monte Carlo is that it is almost impossible to preserve the correct invariant distribution. While there has been work on quantifying the error [20, 1], it is usually difficult to do so in practice. The pseudo-marginal approach of [3] provides a potential solution, but it is generally impossible to obtain the required unbiased estimators of likelihoods using sub-samples [19].

A promising recent direction has been using non-reversible samplers with sub-sampling within the framework of piece-wise deterministic Markov processes [8, 11, 6]. These approaches use the gradient of the log-likelihood, which can be replaced by a sub-sample-based unbiased estimator, so that the exactly correct invariant distribution is maintained. This article focuses on improving the efficiency of such approaches by using different forms of biased sub-sampling motivated concretely by logistic regression problems.

2 Logistic regression with sparse imbalanced data

2.1 Model

We focus on the logistic regression model

(1)   $\mathrm{pr}(y_j = 1 \mid x_j, \xi) = \frac{\exp(x_j^{T}\xi)}{1 + \exp(x_j^{T}\xi)}, \qquad j = 1, \ldots, n,$

where $y_j \in \{0, 1\}$ is a binary response, $x_j \in \mathbb{R}^p$ is a vector of covariates, and $\xi \in \mathbb{R}^p$ is the parameter vector. Consider data $(x_1, y_1), \ldots, (x_n, y_n)$ from model (1). For a prior distribution $\pi_0(\xi)$ on $\xi$, the posterior distribution is

(2)   $\pi(\xi) \propto \exp\{-U(\xi)\},$

where $U(\xi) = -\log \pi_0(\xi) - \sum_{j=1}^{n} \log\, \mathrm{pr}(y_j \mid x_j, \xi)$ denotes the potential function.
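The potential and its gradient are the only ingredients required by the samplers discussed below. A minimal sketch, assuming a Gaussian $N(0, \sigma^2 I)$ prior and the notation of (1)-(2) (variable names are illustrative, not taken from the paper):

import numpy as np

def potential_and_grad(xi, X, y, sigma2=10.0):
    """Negative log-posterior U(xi) and its gradient for logistic regression
    with a N(0, sigma2 * I) prior; X is (n, p), y is (n,) with entries in {0, 1}."""
    z = X @ xi                              # linear predictor, shape (n,)
    log1pexp = np.logaddexp(0.0, z)         # log(1 + exp(z)), computed stably
    U = np.sum(log1pexp - y * z) + 0.5 * np.dot(xi, xi) / sigma2
    prob = 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
    grad = X.T @ (prob - y) + xi / sigma2   # sum of per-observation gradients plus prior part
    return U, grad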

A popular algorithm for sampling from (2) is Pólya-Gamma data augmentation [26]. However, this performs poorly if there is large imbalance in the class labels [21]. Similar issues arise when the covariates are imbalanced. Logistic regression is routinely used across a broad range of fields, and such imbalance issues are extremely common, leading to a clear need for more efficient algorithms. While adaptive Metropolis-Hastings can perform well in small $n$ and $p$ settings [16], issues arise in scaling to large datasets.

2.2 The zig-zag process

The zig-zag process [6] is a type of piece-wise deterministic Markov process which is particularly useful for logistic regression. Define $\mu$ to be the uniform measure on $\{-1, +1\}^p$. The zig-zag process is a continuous time stochastic process $\{(\xi(t), \theta(t))\}_{t \ge 0}$ on the augmented space $\mathbb{R}^p \times \{-1, +1\}^p$, which under fairly mild conditions is ergodic with respect to the product measure $\pi \otimes \mu$; in particular, $\{\xi(t)\}_{t \ge 0}$ is ergodic with respect to the target measure $\pi$: for any $\pi$-integrable function $f$,

(3)   $\lim_{T \to \infty} \frac{1}{T} \int_0^T f\{\xi(t)\}\, dt = \int_{\mathbb{R}^p} f(\xi)\, \pi(d\xi)$

holds almost surely, with $\xi(t)$ the position and $\theta(t)$ the velocity of the process at time $t$.

For a starting point $\xi(0)$ and velocity $\theta(0)$, the zig-zag process evolves deterministically as

(4)   $\xi(t) = \xi(0) + t\, \theta(0), \qquad \theta(t) = \theta(0).$

At a random time $T$, a bouncing event flips the sign of one component of the velocity $\theta$. The process then evolves as equation 4 with the new velocity until the next change in velocity. The time $T$ is the first arrival time of $p$ independent Poisson processes with intensity functions $t \mapsto \lambda_i\{\xi(t), \theta(t)\}$, $i = 1, \ldots, p$; that is, $T = \min_i \tau_i$, with $\tau_i$ the first arrival time of the $i$-th process. The sign flip applies to the component $i_0 = \arg\min_i \tau_i$, with the new velocity $\theta'$ satisfying $\theta'_i = \theta_i$ if $i \ne i_0$ and $\theta'_{i_0} = -\theta_{i_0}$. The intensity functions are of the form $\lambda_i\{\xi(t), \theta(t)\}$, where $\lambda_i$ is a rate function. A sufficient condition for the zig-zag process to satisfy equation 3 is

(5)   $\lambda_i(\xi, \theta) - \lambda_i(\xi, F_i\theta) = \theta_i\, \partial_i U(\xi), \qquad i = 1, \ldots, p,$

where $F_i\theta$ denotes $\theta$ with the sign of its $i$-th component flipped [6]. This condition is satisfied if and only if there exist non-negative functions $\gamma_i$, not depending on $\theta_i$, such that $\lambda_i(\xi, \theta) = \{\theta_i\, \partial_i U(\xi)\}_+ + \gamma_i(\xi, \theta)$; here $(a)_+ = \max(a, 0)$ denotes the positive part of $a \in \mathbb{R}$.

If the cumulative intensity $\Lambda_i(t) = \int_0^t \lambda_i\{\xi(s), \theta(s)\}\, ds$ has a simple closed form, the arrival times $\tau_i$ can be sampled as $\tau_i = \Lambda_i^{-1}(-\log u_i)$ for $u_i \sim \mathrm{Uniform}(0, 1)$, $i = 1, \ldots, p$. Otherwise, $\tau_1, \ldots, \tau_p$ are obtained via Poisson thinning [23]. Assume we have continuous functions $M_i$ such that

$\lambda_i\{\xi(t), \theta(t)\} \le M_i(t) \quad \text{for all } t \ge 0, \qquad i = 1, \ldots, p;$

here $M_1, \ldots, M_p$ are upper computational bounds. Let $\tau_1, \ldots, \tau_p$ denote the first arrival times of non-homogeneous Poisson processes with rates $M_1(t), \ldots, M_p(t)$, and let $i_0 = \arg\min_i \tau_i$. A zig-zag process with intensity $\lambda_i$ is still obtained if

  1. the process is evolved according to equation 4 for time $\tau_{i_0}$ instead of $T$, and

  2. the sign of $\theta_{i_0}$ is flipped at time $\tau_{i_0}$ with probability $\lambda_{i_0}(\tau_{i_0}) / M_{i_0}(\tau_{i_0})$.

Algorithmic details are provided in the Supplement.
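As a small illustration of the thinning mechanism, the helper below simulates a first arrival time by thinning; it is a generic sketch, not specific to this paper, and assumes a constant bound, whereas the bounds used below may depend on $t$:

import numpy as np

def first_arrival_by_thinning(rate, bound, rng):
    """First arrival time of an inhomogeneous Poisson process with intensity rate(t),
    simulated by thinning a homogeneous process with constant intensity bound >= rate(t)."""
    t = 0.0
    while True:
        t += rng.exponential(1.0 / bound)     # candidate event from the bounding process
        if rng.uniform() < rate(t) / bound:   # accept with probability rate(t) / bound
            return t

For a single rate this loop runs until acceptance; in the multi-component zig-zag construction above, only the earliest candidate across components is tested, and a rejected candidate simply leaves the velocity unchanged.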

The sub-sampling approach of [6] uses uniform sub-sampling of a single data point to obtain an unbiased estimate of the $i$-th partial derivative of the potential function

$\partial_i U(\xi) = \partial_i U^0(\xi) + \sum_{j=1}^{n} \partial_i U^j(\xi), \qquad i = 1, \ldots, p,$

where $U^0$ is the contribution from the prior and $U^j$, $j = 1, \ldots, n$, is the contribution from the $j$-th data point. Their sub-sampling algorithm preserves the correct stationary distribution. Let $\xi^*$ denote a reference point, usually chosen as a mode of the target distribution. The authors consider estimates of the form

(6)

where $J$ indexes the sampled data point, drawn uniformly from $\{1, \ldots, n\}$; when the reference point $\xi^*$ is used, the second term is a control variate. This estimate is used to construct a stochastic rate function as

(7)

By using upper bounds that hold for all realizations of the stochastic rate function, the rate functions can be replaced by their stochastic versions, with $J$ resampled at every iteration. This results in a zig-zag process with effective bouncing rate satisfying equation 5. We refer to the resulting algorithm as zig-zag with sub-sampling when no control variate is used, and as zig-zag with control variates otherwise.
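A minimal sketch of a control-variate estimator of this kind for the logistic regression gradient (illustrative; the precise form of (6) and the associated bounds are given in [6]), with the reference point and its full gradient precomputed:

import numpy as np
from scipy.special import expit

def grad_estimate_cv(i, xi, xi_ref, grad_ref, X, y, rng):
    """Unbiased single-point estimate of the i-th partial derivative of the
    log-likelihood part of U, using the gradient at a reference point xi_ref
    (grad_ref[i] = sum_j d_i U^j(xi_ref)) as a control variate."""
    n = X.shape[0]
    j = rng.integers(n)                                 # uniform sub-sample of one data point
    g_j  = X[j, i] * (expit(X[j] @ xi) - y[j])          # i-th partial of U^j at xi
    g_j0 = X[j, i] * (expit(X[j] @ xi_ref) - y[j])      # same term at the reference point
    return grad_ref[i] + n * (g_j - g_j0)               # control-variate estimator

When $\xi$ is close to the reference point the difference between the two sampled terms is small, which is what keeps the variance of the estimator, and hence the thinning bounds below, tight.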

3 Improved sub-sampling

3.1 General framework

We introduce generalizations of the above approach to replace uniform sub-sampling with importance sampling (Section 3.2), allow general mini-batches instead of sub-samples of size one (Section 3.3), and further improve mixing via stratified sub-sampling with control variates (Sections 3.4-3.5). Our motivation is to (a) increase sampling efficiency, and (b) simplify construction of upper bounds. We achieve (b) by letting the Poisson process which determines bouncing times in component $i$ be the superposition of two independent Poisson processes with state dependent bouncing rates

(8)

respectively. A non-negative refreshment intensity is induced by sub-sampling, so the superposition of the two processes defines a zig-zag process which preserves the target measure $\pi$. This follows from the fact that the corresponding state dependent bouncing rate

(9)

of the superposition satisfies equation 5. We achieve (a) through general forms of the estimator in the Poisson thinning step, obtained through non-uniform sub-sampling. More precisely, we implement variants of the zig-zag process with data sub-sampling in which we simulate the Poisson process with the corresponding rate using unbiased estimators of the partial derivatives of the negative log-likelihood function in the Poisson thinning step. The following proposition shows that such an approach, implemented in Algorithm 1, results in a zig-zag process preserving the target measure $\pi$.

Proposition 3.1.

Let , , be a probability measure such that , , is an unbiased estimator of the -th partial derivative of negative log-likelihood function . Let and be such that for all and , . Similarly, let and be such that for all , . Then the zig-zag process generated by Algorithm 1 preserves the target measure , and the effective bouncing rate of the generated zig-zag process is of the form , with

(10)
Input: Starting point $\xi^{(0)}$, initial velocity $\theta^{(0)}$, maximum number of bouncing attempts $N$.
1:  Set $T_0 = 0$.
2:  for $k = 1, \ldots, N$ do
3:     Draw candidate arrival times $\tau_1, \ldots, \tau_p$ such that $\tau_i$ is the first arrival time of the superposition, in component $i$, of the process associated with the prior part of the rate and the bounding process of the sub-sampled data part.
4:     Let $i_0 = \arg\min_i \tau_i$, and set $T_k = T_{k-1} + \tau_{i_0}$.
5:     Evolve position: set $\xi^{(k)} = \xi^{(k-1)} + \tau_{i_0}\, \theta^{(k-1)}$.
6:     if the candidate event in component $i_0$ was generated by the prior-part process then
7:        Bounce: flip sign of the velocity of the $i_0$-th component, $\theta^{(k)}_{i_0} = -\theta^{(k-1)}_{i_0}$.
8:     else
9:        Draw a sub-sample index $J$ from the measure $\rho_{i_0}$ and $V \sim \mathrm{Uniform}(0, 1)$.
10:        if $V$ does not exceed the ratio of the estimated rate, based on the sub-sample $J$, to its upper bound then
11:           Bounce: flip sign of the $i_0$-th component of the velocity, $\theta^{(k)}_{i_0} = -\theta^{(k-1)}_{i_0}$.
12:        else
13:           Set $\theta^{(k)} = \theta^{(k-1)}$.
14:        end if
15:     end if
16:  end for
Output: The path of a zig-zag process specified by skeleton points $\xi^{(0)}, \ldots, \xi^{(N)}$ and bouncing times $T_1, \ldots, T_N$.
Algorithm 1 Zig-zag algorithm with non-uniform sub-sampling

We have decomposed the stochastic rate function from equation 7 as the sum of a deterministic part for the prior and a stochastic part for the data, where the stochasticity is with respect to sub-sampling. This construction is related to ideas in [8]. The proof of Proposition 3.1 generalizes that of Theorem 4.1 in [6], and results similar to Proposition 3.1 appear in [31]. However, these ideas have not been exploited to design improved sub-sampling strategies. Our construction using two independent Poisson processes allows Poisson thinning of each process separately and simplifies the construction of tight upper bounds for certain sub-sampling schemes; see Section 3.2 below.
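A minimal runnable sketch of this construction for the logistic regression model, under the assumptions of a Gaussian $N(0, \sigma^2 I)$ prior and a single-point sub-sample per attempt; the weights proportional to $|x_{ji}|$ and all variable names are illustrative choices (one instance of the non-uniform measures of Proposition 3.1), not the specific tuning used in the paper:

import numpy as np
from scipy.special import expit

def zigzag_subsampled(X, y, sigma2=10.0, n_events=10_000, seed=1):
    """Sketch of a zig-zag sampler with non-uniform sub-sampling for logistic regression
    with a N(0, sigma2*I) prior; assumes no all-zero column in X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.abs(X)                          # |x_ji|
    col_sum = w.sum(axis=0)                # c_i = sum_j |x_ji|: bound on the reweighted data term
    rho = w / col_sum                      # rho[j, i]: probability of picking data point j for component i
    xi = np.zeros(p)
    theta = rng.choice([-1.0, 1.0], size=p)
    t, skeleton = 0.0, [(0.0, np.zeros(p))]
    for _ in range(n_events):
        # prior-part process: rate ((theta_i*xi_i + s)/sigma2)_+ along the ray, simulated exactly
        a = theta * xi
        tau_prior = -a + np.sqrt(np.maximum(a, 0.0) ** 2
                                 + 2.0 * sigma2 * rng.exponential(size=p))
        # data-part bounding process: constant rate c_i per component
        tau_data = rng.exponential(size=p) / col_sum
        from_prior = tau_prior < tau_data
        tau = np.where(from_prior, tau_prior, tau_data)
        i0 = int(np.argmin(tau))
        s = tau[i0]
        xi = xi + s * theta                # deterministic drift up to the candidate event
        t += s
        if from_prior[i0]:
            theta[i0] = -theta[i0]         # prior-part event: bounce (its rate was simulated exactly)
        else:
            j = rng.choice(n, p=rho[:, i0])                        # non-uniform sub-sample
            g_hat = X[j, i0] * (expit(X[j] @ xi) - y[j]) / rho[j, i0]
            if rng.uniform() < max(theta[i0] * g_hat, 0.0) / col_sum[i0]:
                theta[i0] = -theta[i0]     # data-part event accepted by thinning: bounce
        skeleton.append((t, xi.copy()))
    return skeleton

The prior part of each rate is affine along the current ray and is therefore simulated without thinning, while the data part is thinned against the constant bound $\sum_j |x_{ji}|$; by construction the superposed rates satisfy equation 5, so this sketch targets the correct posterior.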

In the remainder of this article, we restrict attention to situations where one of the following assumptions on the terms in the log-likelihood function applies.

Assumption 3.1.

The partial derivatives of the likelihood terms $U^j$ are uniformly bounded; that is, there exist constants $c_{ij} \ge 0$ such that $|\partial_i U^j(\xi)| \le c_{ij}$ for all $\xi \in \mathbb{R}^p$, $i = 1, \ldots, p$, $j = 1, \ldots, n$.

Assumption 3.2.

The partial derivatives of the likelihood terms $U^j$ are globally Lipschitz; that is, for a suitable norm $\|\cdot\|$ on $\mathbb{R}^p$, there are constants $C_{ij} \ge 0$ such that for all $\xi, \xi' \in \mathbb{R}^p$,

$|\partial_i U^j(\xi) - \partial_i U^j(\xi')| \le C_{ij}\, \|\xi - \xi'\|.$

To adapt equation 6 to Algorithm 1, choose the estimator of partial derivatives as

(11)

with and . For under Assumption 3.1, we can bound realizations of the stochastic rate function by for . Similarly, if Assumption 3.2 holds and , we can bound realizations of by for . The state dependent bouncing rate of the resulting zig-zag process is as defined in equation 9, with having the explicit form

(12)

In the sequel, we consider the above two cases of Algorithm 1 as baselines for comparison. We introduce several sub-sampling schemes, corresponding to different choices of the sub-sampling measures in Proposition 3.1, and associated estimators and bounds, as alternative cases of Algorithm 1. These are designed to improve sampling efficiency either by improving the diffusive properties of the zig-zag process, thereby reducing the integrated auto-correlation time, or by reducing the computational cost per simulated unit time interval.
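As a worked illustration using the notation of Section 2.1 (the constants below follow from elementary calculus and are an illustration rather than a quotation from the paper), the logistic regression terms satisfy both assumptions with explicit constants. Writing $U^j(\xi) = \log\{1 + \exp(x_j^T \xi)\} - y_j\, x_j^T \xi$, one has

$\partial_i U^j(\xi) = x_{ji}\, \{\mathrm{expit}(x_j^T \xi) - y_j\}, \qquad \mathrm{expit}(z) = (1 + e^{-z})^{-1},$

so $|\partial_i U^j(\xi)| \le |x_{ji}|$ uniformly in $\xi$ (Assumption 3.1), and, since the derivative of $\mathrm{expit}$ is at most $1/4$,

$|\partial_i U^j(\xi) - \partial_i U^j(\xi')| \le \tfrac{1}{4}\, |x_{ji}|\, \|x_j\|\, \|\xi - \xi'\|$

in the Euclidean norm (Assumption 3.2). Constants of this form are the natural inputs to the computational bounds discussed in Section 4.1.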

3.2 Improving bounds via importance sampling

A generalization of the estimator is to consider the index $J$ to be sampled from a non-uniform probability distribution on $\{1, \ldots, n\}$, defined by $\mathrm{pr}(J = j) = \omega_j$, where $\omega_1, \ldots, \omega_n$ are non-negative weights satisfying $\sum_{j=1}^{n} \omega_j = 1$. It follows that the correspondingly reweighted estimator, in which the contribution of the sampled data point is rescaled by $1/\omega_J$ instead of $n$, defines an unbiased estimator of the $i$-th partial derivative of the negative log-likelihood. Without control variates,

(13)

defines an upper bound for the rate function under Assumption 3.1. Similarly, if control variates are used and Assumption 3.2 holds,

(14)

is an upper bound for the stochastic rate function. By Proposition 3.1, the resulting zig-zag process generated by Algorithm 1 preserves in either setup the target measure $\pi$. The contribution to the effective bouncing rate is identical to (12); hence, for either variant, the effective bouncing rate remains unchanged from that for uniform sub-sampling. We can minimize the magnitudes of the upper bounds (13) and (14) by choosing the weight vector $\omega = (\omega_1, \ldots, \omega_n)$ such that the corresponding bound constants are minimized. This can be verified to be the case when each weight is proportional to the corresponding per-observation constant appearing in the bound.
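To see the effect on the bounds, a small sketch comparing the constant bound obtained under uniform sub-sampling with the one obtained when, as one natural choice consistent with the optimality criterion above, the weights are taken proportional to the per-observation constants $|x_{ji}|$ of Assumption 3.1 (illustrative code, not from the paper; assumes no all-zero column in X):

import numpy as np

def data_rate_bounds(X):
    """Constant upper bounds on the sub-sampled data part of the rate for each component:
    uniform sub-sampling gives n * max_j |x_ji|; weights proportional to |x_ji| give sum_j |x_ji|."""
    n = X.shape[0]
    uniform_bound = n * np.abs(X).max(axis=0)
    importance_bound = np.abs(X).sum(axis=0)      # never larger than the uniform bound
    weights = np.abs(X) / np.abs(X).sum(axis=0)   # weights[j, i]: probability of picking j for component i
    return uniform_bound, importance_bound, weights

For sparse or heavy-tailed covariates the ratio of the uniform bound to the importance bound can be large, which translates directly into fewer rejected bouncing attempts per unit of process time.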

3.3 Improving mixing via mini-batches

In algorithms such as stochastic gradient descent [7], stochastic gradient Langevin dynamics [33], and their variants, mini-batches of size larger than one are used to decrease the variance of gradient estimates, which typically improves convergence. In the context of piece-wise deterministic Markov processes, the motivation for using mini-batches of size larger than one is to reduce the effective refreshment rate, which improves the mixing of the process when that rate would otherwise be large [2].

The following propositions provide an explicit form for the refreshment rate induced by sub-sampling, and corresponding upper bounds.

Proposition 3.2.

The refreshment rate induced by sub-sampling is .

Proposition 3.3.

The refreshment rate induced by sub-sampling is bounded from above by the standard deviation of the estimator.

The framework of Algorithm 1 allows us to consider estimates using sub-samples of size larger than one: we consider a mini-batch of random indices $J = (J_1, \ldots, J_m)$, so that the average of the corresponding single-point estimators is an unbiased estimator of the partial derivative. Typically, entries of the mini-batch are sampled uniformly and independently from the dataset, so that the sub-sampling measure is the uniform measure on $\{1, \ldots, n\}^m$. This yields unbiased estimators formed by averaging $m$ independent single-point estimates. Since an average of quantities sharing a common upper bound satisfies the same bound, it follows that upper bounds valid for mini-batches of size one are also upper bounds for the stochastic rate functions associated with mini-batches of size $m \ge 1$. By the same argument, we can conclude the following Lemma 3.1, which justifies using the bounds derived in Section 3.2 for importance sub-sampling with a mini-batch of size one in the setup of mini-batches of arbitrary size $m$.

Lemma 3.1.

Let

where and and are as defined in Section 3.2. The value of does not depend on the size of the mini-batch .
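A minimal sketch of the mini-batch version of the control-variate estimator (illustrative; it simply averages $m$ independent single-point estimates, so its variance shrinks like $1/m$, and by Proposition 3.3 the induced refreshment rate is controlled by the estimator's standard deviation):

import numpy as np
from scipy.special import expit

def grad_estimate_minibatch(i, xi, xi_ref, grad_ref, X, y, m, rng):
    """Average of m independent single-point control-variate estimates of the
    i-th partial derivative of the log-likelihood part of U (unbiased for any m >= 1)."""
    n = X.shape[0]
    J = rng.integers(n, size=m)                     # uniform, independent mini-batch
    g  = X[J, i] * (expit(X[J] @ xi) - y[J])        # terms at the current position
    g0 = X[J, i] * (expit(X[J] @ xi_ref) - y[J])    # terms at the reference point
    return grad_ref[i] + n * np.mean(g - g0)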

If we consider mini-batches of size $m$, the effective bouncing rate of the process generated by Algorithm 1 when used with the above-described estimators can be computed as

The effective refreshment rate is reduced with increased mini-batch size, as the following lemma shows.

Lemma 3.2.

For all , , , we have

3.4 Improving mixing via stratified sub-sampling

To further enhance mixing, we also design an approach based on stratified sub-sampling using control variates. To simplify notation, we only describe the construction of rates and bounds for one of the two cases of equation 11; this can easily be adapted to the other case. Suppose we divide the data index set into strata $S_{i,1}, \ldots, S_{i,m}$ which are such that, for every component index $i$, the sets $S_{i,1}, \ldots, S_{i,m}$ form a partition of $\{1, \ldots, n\}$; that is, $\bigcup_{k=1}^{m} S_{i,k} = \{1, \ldots, n\}$, and $S_{i,k} \cap S_{i,l} = \emptyset$ for $k \ne l$. If we consider the mini-batch to be sampled such that the $k$-th entry of $J$ is sampled independently and uniformly from the stratum $S_{i,k}$, that is, $J_k \sim \mathrm{Uniform}(S_{i,k})$, $k = 1, \ldots, m$, then it is easy to verify that

(15)

defines an unbiased estimator for . When constructing upper bounds for the corresponding stochastic rate function, stratified sub-sampling can only improve upon the upper bounds used in uniform sub-sampling, since

where the quantities on the right-hand side are the upper bounds derived in Section 3.1. Hence, Algorithm 1 parameterized with the estimator defined in equation 15 and the corresponding upper bounds generates a zig-zag process whose effective bouncing rates take the form of equation 9.

It is also possible to combine stratified sub-sampling with importance sub-sampling as follows. As in Section 3.2, weights can be assigned to the data points within each stratum so that they sum to one within that stratum. Defining the corresponding probability measures on the strata, one can verify that the resulting reweighted estimator is an unbiased estimator of the partial derivative. The optimal importance weights can be chosen as in Section 3.2.
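A minimal sketch of the stratified estimator idea for one component (illustrative; strata are passed in as disjoint index arrays, each stratum contributes its size times one uniformly drawn term, and the control-variate correction of Section 3.2 is omitted for brevity):

import numpy as np
from scipy.special import expit

def grad_estimate_stratified(i, xi, strata, X, y, rng):
    """Unbiased stratified estimate of the i-th partial derivative of the
    log-likelihood part of U; `strata` is a list of disjoint index arrays covering 0..n-1."""
    est = 0.0
    for S in strata:
        j = rng.choice(S)                                 # one uniform draw per stratum
        est += len(S) * X[j, i] * (expit(X[j] @ xi) - y[j])
    return est

If the strata group data points with similar gradient contributions, the within-stratum ranges, and hence the estimator's variance and the induced refreshment rate, stay small; this is the criterion formalized in Section 3.5.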

3.5 Construction of strata

We describe how gradient information at a reference point can be used to construct strata for the above-described estimator so that, under the regularity conditions of Assumption 3.2, the refreshment rate of the associated zig-zag process is provably reduced within some vicinity of the reference point: by Proposition 3.2, the effective refreshment rate which is induced by the stratified sub-sampling approach is of the form

Moreover, if Assumption 3.2 is satisfied, it follows that for all , since

for all $\xi$. This suggests that, in order to minimize the refreshment rate in the vicinity of the reference point, we construct the strata as the solution of the minimization problem

(16)

Solving equation 16 may be hard if the number of strata is large. For this reason, we instead consider strata constructed as the solution of the minimization problem

(17)

where, for a discrete set $A$, $\mathrm{diam}(A)$ denotes the diameter of the set, that is, the largest distance between two of its elements. This approximation is justified by noticing that the objective function of (17) is an upper bound on the original objective, as shown in the following proposition, which is a consequence of Lemma A.1 in the appendix.

Proposition 3.4.

Consider the estimator as defined in equation 15. The refreshment rate due to sub-sampling is bounded from above as .

We provide an efficient algorithm in the supplementary material that can be used to compute an approximate solution of this optimization problem. Under certain regularity conditions on the log-likelihood function, one may instead construct strata by minimizing an upper bound for the variance of the estimate (15) which holds uniformly in $\xi$ [35]. Since by Proposition 3.3 that upper bound is also an upper bound for the refreshment rate, this provides an alternative approach for constructing strata. Moreover, in practice one may also include other a priori available information on the data to inform the construction of the strata (see the construction of strata for the example described in Section 5).
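The optimization in (17) suggests a simple heuristic, in line with the sorting-based construction mentioned in the abstract: sort the per-observation gradient contributions at the reference point and cut the sorted order into contiguous blocks, so that each stratum contains observations with similar contributions. A sketch for one component (illustrative; equal-sized contiguous blocks, not necessarily the exact procedure of the supplementary material):

import numpy as np
from scipy.special import expit

def strata_by_sorting(i, xi_ref, X, y, m):
    """Partition data indices into m strata for component i by sorting the
    per-observation gradient contributions d_i U^j(xi_ref) and cutting into blocks."""
    g_ref = X[:, i] * (expit(X @ xi_ref) - y)   # gradient contribution of each data point
    order = np.argsort(g_ref)                   # sort data points by this contribution
    return np.array_split(order, m)             # m contiguous blocks of (nearly) equal size

Used together with the stratified estimator sketched in Section 3.4, points whose contributions are close end up in the same stratum, which keeps the within-stratum diameters appearing in (17) small.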

4 Numerical Examples

4.1 Computational bounds

To keep things simple, we consider the prior to be ; we discuss other choices in the supplementary material. The corresponding negative log density function is . We let and derive computational bounds as in [6]. We write with and . For , the -th partial derivatives are