Computationally efficient univariate filtering for massive data

02/11/2020
by M. Tsagris, et al.

The growing availability of large-scale, massive and big data has increased the computational cost of data analysis. One such case is univariate filtering, which typically involves fitting many univariate regression models and is an essential step in numerous variable selection algorithms, where it reduces the number of predictor variables. This paper shows how to dramatically reduce that cost by employing the score test or the simple Pearson correlation (or the t-test for binary responses). Extensive Monte Carlo simulation studies demonstrate their advantages and disadvantages compared to the log-likelihood ratio test, and examples with real data illustrate the performance of the score test and the log-likelihood ratio test under realistic scenarios. Depending on the regression model used, the score test is 30 to 60,000 times faster than the log-likelihood ratio test and produces nearly the same results. Hence this paper strongly recommends replacing the log-likelihood ratio test with the score test when coping with large-scale data, massive data, big data, or even data whose sample size is on the order of a few tens of thousands or higher.
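To illustrate why the score test is cheap, here is a minimal sketch for one case only: a binary response with a univariate logistic regression. It is not the authors' implementation, and the function names `score_test_binary` and `univariate_filter` are hypothetical. The score test needs only the null (intercept-only) fit, which is the sample mean of the response, so no iterative model fitting is required per predictor. In this particular case the statistic also equals n times the squared Pearson correlation between predictor and response.

```python
import math


def score_test_binary(x, y):
    """Score statistic and p-value for H0: slope = 0 in a univariate
    logistic regression of a 0/1 response y on x (intercept included).

    Illustrative sketch; names and interface are assumptions."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Score for the slope at the null fit, where every fitted probability is y_bar.
    u = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Null variance of the score: y_bar * (1 - y_bar) * sum((x_i - x_bar)^2).
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    var_u = y_bar * (1.0 - y_bar) * sxx
    if var_u == 0.0:
        # Constant predictor or constant response: no evidence either way.
        return 0.0, 1.0
    stat = u * u / var_u
    # Upper tail of chi-square with 1 degree of freedom:
    # P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def univariate_filter(columns, y, alpha=0.05):
    """Return the indices of the predictors whose score-test p-value is below alpha."""
    return [j for j, x in enumerate(columns)
            if score_test_binary(x, y)[1] < alpha]


if __name__ == "__main__":
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    related = [1, 2, 3, 4, 5, 6, 7, 8]        # increases with y
    unrelated = [1, -1, 1, -1, 1, -1, 1, -1]  # balanced within each class
    print(univariate_filter([related, unrelated], y))
```

Each predictor costs a single pass over the data, whereas a log-likelihood ratio test would need an iteratively fitted logistic model per predictor. For large predictor matrices the same sums would normally be computed with vectorised matrix operations rather than Python loops.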


