Optimal Subsampling for Large Sample Logistic Regression

02/03/2017
by   HaiYing Wang, et al.

For massive data, subsampling algorithms are a popular way to reduce data volume and computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where statistical leverage scores are often used to define the subsampling probabilities. In this paper, we propose fast subsampling algorithms that efficiently approximate the maximum likelihood estimate in logistic regression. We first establish consistency and asymptotic normality of the estimator from a general subsampling algorithm, and then derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resulting estimator. An alternative minimization criterion is also proposed to further reduce the computational cost. Because the optimal subsampling probabilities depend on the full data estimate, we develop a two-step algorithm to approximate the optimal subsampling procedure. This algorithm is computationally efficient and substantially reduces computing time compared to the full data approach. Consistency and asymptotic normality of the estimator from the two-step algorithm are also established. Synthetic and real data sets are used to evaluate the practical performance of the proposed method.
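The two-step idea in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption, not taken from the abstract: the function names `weighted_logistic_mle` and `two_step_subsample` are hypothetical, the pilot step uses uniform sampling, and the second-step probabilities are taken proportional to |y_i - p_i| * ||x_i||, which is one inexpensive choice in the spirit of the "alternative minimization criterion" rather than a confirmed formula. The final fit uses inverse-probability weights on the combined subsample.

```python
import numpy as np


def weighted_logistic_mle(X, y, w, n_iter=50, tol=1e-8):
    """Maximize a weighted logistic log-likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))                        # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X   # weighted information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta


def two_step_subsample(X, y, r0, r, rng):
    """Hypothetical two-step subsampling estimator (illustrative only).

    Step 1: uniform pilot subsample of size r0 gives a rough estimate.
    Step 2: use the pilot estimate to build non-uniform probabilities,
            draw r more points, and refit with inverse-probability weights.
    """
    n = X.shape[0]

    # Step 1: pilot estimate from a uniform subsample (with replacement).
    idx0 = rng.choice(n, size=r0, replace=True)
    beta0 = weighted_logistic_mle(X[idx0], y[idx0], np.ones(r0))

    # Step 2: assumed probabilities proportional to |y_i - p_i| * ||x_i||.
    p = 1.0 / (1.0 + np.exp(-(X @ beta0)))
    score = np.abs(y - p) * np.linalg.norm(X, axis=1)
    pi = score / score.sum()
    idx1 = rng.choice(n, size=r, replace=True, p=pi)

    # Combine both subsamples; weights are inverse sampling probabilities
    # (1/n for the uniform pilot draws), rescaled for numerical stability.
    idx = np.concatenate([idx0, idx1])
    w = np.concatenate([np.full(r0, float(n)), 1.0 / pi[idx1]])
    w = w / w.mean()
    return weighted_logistic_mle(X[idx], y[idx], w)


# Small synthetic usage example (data and sizes are arbitrary).
rng = np.random.default_rng(0)
n, d = 20000, 3
X = rng.normal(size=(n, d))
beta_true = np.array([0.5, -0.5, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

beta_hat = two_step_subsample(X, y, r0=500, r=2000, rng=rng)
print("subsample estimate:", beta_hat)
```

Only the second step touches all n rows, and it does so with a single pass to compute the probabilities; both Newton fits run on subsamples of size r0 and r0 + r, which is where the computational saving in this sketch comes from.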


research
06/18/2018

Optimal Subsampling Algorithms for Big Data Generalized Linear Models

To fast approximate the maximum likelihood estimator with massive data, ...
research
04/10/2022

Optimal Subsampling for Large Sample Ridge Regression

Subsampling is a popular approach to alleviating the computational burde...
research
09/17/2015

Optimal Subsampling Approaches for Large Sample Linear Regression

A significant hurdle for analyzing large sample data is the lack of effe...
research
09/07/2015

Poisson Subsampling Algorithms for Large Sample Linear Regression in Massive Data

Large sample size brings the computation bottleneck for modern data anal...
research
04/10/2018

Subsampled Optimization: Statistical Guarantees, Mean Squared Error Approximation, and Sampling Method

For optimization on large-scale data, exactly calculating its solution m...
research
05/21/2020

Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators with Massive Data

Nonuniform subsampling methods are effective to reduce computational bur...
research
02/24/2020

Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms

The statistical analysis of Randomized Numerical Linear Algebra (RandNLA...
