Orthogonal Subsampling for Big Data Linear Regression

05/30/2021
by   Lin Wang, et al.

The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal subsampling (OSS) approach for big data with a focus on linear regression models. The approach is inspired by the fact that an orthogonal array of two levels provides the best experimental design for linear regression models in the sense that it minimizes the average variance of the estimated parameters and provides the best predictions. The merits of OSS are three-fold: (i) it is easy to implement and fast; (ii) it is suitable for distributed parallel computing and ensures the subsamples selected in different batches have no common data points; and (iii) it outperforms existing methods in minimizing the mean squared errors of the estimated parameters and maximizing the efficiencies of the selected subsamples. Theoretical results and extensive numerical results show that the OSS approach is superior to existing subsampling approaches. It is also more robust to the presence of interactions among covariates and, when they do exist, OSS provides more precise estimates of the interaction effects than existing methods. The advantages of OSS are also illustrated through analysis of real data.
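The design fact that motivates OSS can be checked numerically on a toy case. The sketch below is not the OSS algorithm from the paper. It only compares a full two-level factorial design (a simple orthogonal array) against the same number of points drawn uniformly from [-1, 1]^3, for a first-order linear model with intercept. The helper name `average_variance`, the choice of three covariates, and the random seed are assumptions made for this illustration.

```python
import itertools

import numpy as np


def average_variance(design):
    """Average variance of the OLS coefficients, in units of sigma^2,
    for a first-order model with intercept: trace((X'X)^-1) / (p + 1)."""
    n, p = design.shape
    X = np.column_stack([np.ones(n), design])  # prepend intercept column
    return np.trace(np.linalg.inv(X.T @ X)) / (p + 1)


# Full 2^3 factorial: every combination of -1/+1 over three covariates.
# Its columns are mutually orthogonal, so X'X = n * I.
oa = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))

# Same number of runs, but covariates drawn uniformly from [-1, 1]
# (hypothetical comparison design, seed fixed for reproducibility).
rng = np.random.default_rng(0)
random_design = rng.uniform(-1.0, 1.0, size=oa.shape)

oa_avg = average_variance(oa)
rand_avg = average_variance(random_design)

print(f"orthogonal array: {oa_avg:.4f}")
print(f"random design:    {rand_avg:.4f}")
```

For the orthogonal array, X'X equals n times the identity, so the average variance is exactly 1/n (here 1/8). Any design confined to [-1, 1] has diagonal entries of X'X no larger than n, which is why it cannot do better under this criterion.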

research
12/23/2022

Balanced Subsampling for Big Data with Categorical Covariates

The use and analysis of massive data are challenging due to the high sto...
research
08/24/2023

An Efficient Data Analysis Method for Big Data using Multiple-Model Linear Regression

This paper introduces a new data analysis method for big data using a ne...
research
03/03/2019

Multiple Learning for Regression in big data

Regression problems that have closed-form solutions are well understood ...
research
04/29/2023

Subdata selection for big data regression: an improved approach

In the big data era researchers face a series of problems. Even standard...
research
05/02/2023

On the selection of optimal subdata for big data regression based on leverage scores

Regression can be really difficult in case of big datasets, since we hav...
research
10/23/2015

On the complexity of switching linear regression

This technical note extends recent results on the computational complexi...