A modern maximum-likelihood theory for high-dimensional logistic regression

03/19/2018
by Pragya Sur, et al.

Every student in statistics or data science learns early on that when the sample size greatly exceeds the number of variables, fitting a logistic model produces estimates that are approximately unbiased. Every student also learns that there are formulas to predict the variability of these estimates, which are used for statistical inference; for instance, to produce p-values for testing the significance of regression coefficients. Although these formulas come from large-sample asymptotics, we are often told that we are on reasonably safe ground when n is large enough that n > 5p or n > 10p. This paper shows that this is absolutely not the case. Consequently, inferences routinely produced by common software packages are unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. Then we show that (1) the MLE is biased, (2) the variability of the MLE is far greater than classically predicted, and (3) the commonly used likelihood-ratio test (LRT) is not distributed as a chi-square. The bias of the MLE is extremely problematic, as it yields completely wrong predictions for the probability of a case based on observed values of the covariates. We develop a new theory that asymptotically predicts (1) the bias of the MLE, (2) the variability of the MLE, and (3) the distribution of the LRT. We also demonstrate empirically that these predictions are extremely accurate in finite samples. Further, an appealing feature is that these novel predictions depend on the unknown sequence of regression coefficients only through a single scalar, the overall strength of the signal. This suggests very concrete procedures to adjust inference; we describe one such procedure, which learns a single parameter from data and produces accurate inference.
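The bias claim can be explored with a small simulation. The sketch below is illustrative only and is not the paper's code or its adjustment procedure: it assumes i.i.d. standard normal features, no intercept, n = 5p, and a hypothetical signal in which 50 coefficients are equal and the rest are zero, scaled so that the variance of the linear predictor is 5. The helper `fit_logistic_mle` is a plain Newton solver written for this example. Under these assumptions, the average ratio of the fitted to the true coefficient on the signal coordinates should come out noticeably above 1.

```python
import numpy as np

def fit_logistic_mle(X, y, n_iter=50, tol=1e-8):
    """Unpenalized logistic MLE (no intercept) via Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)   # guard against overflow in exp
        mu = 1.0 / (1.0 + np.exp(-eta))        # fitted probabilities
        grad = X.T @ (y - mu)                  # score vector
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None])  # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
n, p, k = 2000, 400, 50       # n = 5p; k nonzero coefficients (assumed design)
gamma2 = 5.0                  # assumed signal strength Var(x' beta)
beta = np.zeros(p)
beta[:k] = np.sqrt(gamma2 / k)

X = rng.standard_normal((n, p))
prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = (rng.random(n) < prob).astype(float)

beta_hat = fit_logistic_mle(X, y)
# Average fitted-to-true ratio over the nonzero coordinates;
# an unbiased estimator would give a value close to 1.
inflation = beta_hat[:k].mean() / beta[0]
print(f"average MLE / true coefficient on signal coordinates: {inflation:.2f}")
```

Averaging over the 50 signal coordinates is a design choice to reduce noise from a single fit; repeating the experiment over many seeds would give a steadier picture of the inflation factor.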

research
06/05/2017

The Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square

Logistic regression is used thousands of times a day to fit data, predic...
research
03/23/2021

SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression

Logistic regression remains one of the most widely used tools in applied...
research
01/25/2020

The Asymptotic Distribution of the MLE in High-dimensional Logistic Models: Arbitrary Covariance

We study the distribution of the maximum likelihood estimate (MLE) in hi...
research
01/19/2021

Firth's logistic regression with rare events: accurate effect estimates AND predictions?

Firth-type logistic regression has become a standard approach for the an...
research
09/08/2019

Inference In General Single-Index Models Under High-dimensional Symmetric Designs

We consider the problem of statistical inference for a finite number of ...
research
01/27/2021

To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets

For finite samples with binary outcomes penalized logistic regression su...
research
01/14/2015

Quantifying Prosodic Variability in Middle English Alliterative Poetry

Interest in the mathematical structure of poetry dates back to at least ...
