Logistic Regression for Massive Data with Rare Events

06/01/2020
by HaiYing Wang, et al.

This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the other class, often called controls). We first derive the asymptotic distribution of the maximum likelihood estimator (MLE) of the unknown parameter, which shows that the asymptotic variance converges to zero at the rate of the inverse of the number of events instead of the inverse of the full data sample size. This indicates that the available information in rare events data is at the scale of the number of events instead of the full data sample size. Furthermore, we prove that when a small proportion of the nonevents is under-sampled, the resulting under-sampled estimator may have the same asymptotic distribution as the full data MLE. This demonstrates the advantage of under-sampling nonevents for rare events data, because this procedure may significantly reduce the computation and/or data collection costs. Another common practice in analyzing rare events data is to over-sample (replicate) the events, which has a higher computational cost. We show that this procedure may even result in efficiency loss in terms of parameter estimation.
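The under-sampling idea in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the paper's implementation: the simulated data, the true parameter values (-6, 1), the sampling rate `rho`, and the helper names `fit_logistic` and `demo` are all assumptions made for illustration. The abstract does not specify how the under-sampled estimator is constructed; this sketch assumes inverse-probability weighting of the retained nonevents, which is one standard way to make the under-sampled fit target the full data parameter.

```python
import numpy as np


def fit_logistic(X, y, w=None, n_iter=100, tol=1e-10):
    """Weighted logistic regression MLE via Newton-Raphson.

    X must contain an intercept column of ones as its first column.
    """
    n, d = X.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    beta = np.zeros(d)
    # Start the intercept at the weighted log odds, which helps when events are rare.
    ybar = np.sum(w * y) / np.sum(w)
    beta[0] = np.log(ybar / (1.0 - ybar))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))                       # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def demo(seed=0, n=200_000, rho=0.05):
    """Compare the full data MLE with an under-sampled, weighted fit."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    beta_true = np.array([-6.0, 1.0])  # hypothetical values giving rare events
    p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
    y = (rng.random(n) < p).astype(float)

    beta_full = fit_logistic(X, y)

    # Keep every event; keep each nonevent independently with probability rho.
    keep = (y == 1.0) | (rng.random(n) < rho)
    # Assumption: weight retained nonevents by 1 / rho, events by 1.
    w = np.where(y[keep] == 1.0, 1.0, 1.0 / rho)
    beta_sub = fit_logistic(X[keep], y[keep], w)
    return beta_full, beta_sub, n, int(keep.sum())


if __name__ == "__main__":
    beta_full, beta_sub, n_full, n_sub = demo()
    print("full data MLE:      ", beta_full, "rows used:", n_full)
    print("under-sampled fit:  ", beta_sub, "rows used:", n_sub)
```

In runs of this kind the two estimates should be close even though the under-sampled fit uses only a small fraction of the rows, which is consistent with the claim that the information in rare events data scales with the number of events rather than the full sample size.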

research
04/05/2023

Distributed Logistic Regression for Massive Data with Rare Events

Large-scale rare events data are commonly encountered in practice. To ta...
research
10/25/2021

Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data

We investigate the issue of parameter estimation with nonuniform negativ...
research
09/21/2021

Network meta-analysis of rare events using penalized likelihood regression

Network meta-analysis (NMA) of rare events has attracted little attentio...
research
05/30/2023

Predicting Rare Events by Shrinking Towards Proportional Odds

Training classifiers is difficult with severe class imbalance, but many ...
research
04/06/2021

A new weighting method when not all the events are selected as cases in a nested case-control study

Nested case-control (NCC) is a sampling method widely used for developin...
research
07/09/2023

On the sample complexity of estimation in logistic regression

The logistic regression model is one of the most popular data generation...
research
09/17/2020

Variational Disentanglement for Rare Event Modeling

Combining the increasing availability and abundance of healthcare data a...
