Local case-control sampling: Efficient subsampling in imbalanced data sets

06/16/2013
by William Fithian, et al.

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ^*. By contrast, our estimator is consistent for θ^* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to 1+1/c if we multiply the baseline acceptance probabilities by c>1 (and weight points with acceptance probability greater than 1), taking roughly (1+c)/2 times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
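
The accept-reject scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The abstract does not spell out the acceptance rule or the correction, so the sketch assumes the forms given in the full paper: a point (x, y) is kept with probability |y − p̃(x)|, where p̃ is the pilot model's fitted probability, and the adjustment adds the pilot coefficients to the coefficients fitted on the subsample. The function names, the simulated data, the Newton solver, and the choice to fit the pilot on a held-out uniform split are all assumptions made for the example. Only the baseline case c = 1 is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Plain logistic regression via Newton's method.
    X is expected to already contain an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # Tiny ridge term only guards against a singular Hessian.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        theta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One pass over the data: accept each point with probability
    |y - p_pilot(x)| (assumed rule), fit on the accepted points,
    then add the pilot coefficients back (assumed adjustment)."""
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)          # large when y is "surprising"
    keep = rng.random(len(y)) < accept_prob    # independent accept/reject
    theta_sub = fit_logistic(X[keep], y[keep])
    return theta_sub + theta_pilot, keep

# --- Hypothetical simulated example (imbalanced, correctly specified) ---
rng = np.random.default_rng(0)
n = 200_000
theta_true = np.array([-4.0, 1.0, -1.0])
X_all = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y_all = rng.binomial(1, sigmoid(X_all @ theta_true)).astype(float)

# Pilot fitted on a separate uniform split so it is independent of the
# data that are subsampled afterwards.
n_pilot = 20_000
theta_pilot = fit_logistic(X_all[:n_pilot], y_all[:n_pilot])

theta_hat, keep = local_case_control(
    X_all[n_pilot:], y_all[n_pilot:], theta_pilot, rng
)
print("fraction of points kept:", keep.mean())
print("estimate:", theta_hat)
```

Because each point's accept/reject decision depends only on that point and the pilot estimate, the scan can be split across workers, which matches the abstract's remark that one parallelizable pass suffices.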

research
07/06/2020

Surprise sampling: improving and extending the local case-control sampling

Fithian and Hastie (2014) proposed a new sampling scheme called local ca...
research
10/08/2022

Unweighted estimation based on optimal sample under measurement constraints

To tackle massive data, subsampling is a practical approach to select th...
research
10/25/2021

Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data

We investigate the issue of parameter estimation with nonuniform negativ...
research
04/27/2016

Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression

A major challenge for building statistical models in the big data era is...
research
09/01/2022

Estimation for the Cox Model with Biased Sampling Data via Risk Set Sampling

Prevalent cohort sampling is commonly used to study the natural history ...
research
08/14/2021

Equity-Directed Bootstrapping: Examples and Analysis

When faced with severely imbalanced binary classification problems, we o...
research
05/30/2023

Predicting Rare Events by Shrinking Towards Proportional Odds

Training classifiers is difficult with severe class imbalance, but many ...
