Surprise sampling: improving and extending the local case-control sampling

07/06/2020
by Xinwei Shen, et al.

Fithian and Hastie (2014) proposed a sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency through a clever adjustment specific to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on the working principle that data points deserve a higher sampling probability if they contain more information or appear "surprising" in the sense of, for example, a large error of the pilot prediction or a large absolute score. Compared with the relevant existing sampling schemes, as reported in Fithian and Hastie (2014) and Ai, et al. (2018), the proposed scheme has several advantages. It adaptively yields the optimal sampling forms for a variety of objectives, and it includes the LCC sampling and the sampling of Ai, et al. (2018) as special cases. Under the same model specifications, the proposed estimator performs no worse than those in the literature. The estimation procedure is valid even if the model is misspecified and/or the pilot estimator is inconsistent or depends on the full data. We present theoretical justifications of the claimed advantages and of the optimality of the estimation and the sampling design. Unlike that of Ai, et al. (2018), our large-sample theory is population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning, since it essentially requires only a specific loss function and no response-covariate structure in the data. Numerical studies are carried out, and their results support the theory.
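
To make the idea of sampling "surprising" points concrete, the following is a minimal sketch of the LCC special case as described by Fithian and Hastie (2014), not of the general scheme proposed in this paper. Each point is kept with probability |y - p̃(x)|, where p̃ is the pilot prediction; a logistic regression is fitted on the kept points, and the pilot coefficients are added back. The simulated data, the uniform pilot subsample, and the helper names `sigmoid`, `fit_logistic`, and `lcc_subsample_fit` are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Plain Newton-Raphson logistic regression (X already has an intercept column)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # A tiny ridge term keeps the Hessian invertible; it is a numerical safeguard only.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta

def lcc_subsample_fit(X, y, theta_pilot, rng):
    """LCC sketch: accept each point with probability |y - pilot prediction|."""
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)        # large when the pilot is "surprised"
    keep = rng.random(len(y)) < accept_prob  # independent accept/reject draws
    theta_sub = fit_logistic(X[keep], y[keep])
    # Post-hoc adjustment specific to the logistic model: add the pilot back.
    return theta_sub + theta_pilot, keep

# Simulated imbalanced data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 100_000
theta_true = np.array([-3.0, 1.0, -0.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Pilot estimate from a small uniform subsample (one simple choice among many).
pilot_idx = rng.choice(n, size=5_000, replace=False)
theta_pilot = fit_logistic(X[pilot_idx], y[pilot_idx])

theta_hat, keep = lcc_subsample_fit(X, y, theta_pilot, rng)
print("kept fraction:", keep.mean())
print("LCC estimate :", theta_hat)
```

In this sketch the acceptance probability is exactly the "error of pilot prediction" mentioned above, so correctly predicted majority-class points are mostly discarded while rare or poorly predicted points are mostly retained.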


research
02/08/2018

More Efficient Estimation for Logistic Regression with Optimal Subsample

Facing large amounts of data, subsampling is a practical technique to ex...
research
06/16/2013

Local case-control sampling: Efficient subsampling in imbalanced data sets

For classification problems with significant class imbalance, subsamplin...
research
10/08/2022

Unweighted estimation based on optimal sample under measurement constraints

To tackle massive data, subsampling is a practical approach to select th...
research
06/08/2021

Coresets for Classification – Simplified and Strengthened

We give relative error coresets for training linear classifiers with a b...
research
09/10/2019

Robust Multivariate Estimation Based On Statistical Data Depth Filters

In the classical contamination models, such as the gross-error (Huber an...
research
01/31/2022

L-SVRG and L-Katyusha with Adaptive Sampling

Stochastic gradient-based optimization methods, such as L-SVRG and its a...
research
08/05/2022

Kendall's tau estimator for bivariate zero-inflated count data

In this paper, we extend the work of Pimentel et al. (2015) and propose ...
