Least Squares Approximation for a Distributed System

08/14/2019
by   Xuening Zhu, et al.

In this work we develop a distributed least squares approximation (DLSA) method that can solve a large family of regression problems (e.g., linear regression, logistic regression, Cox's model) on a distributed system. By approximating the local objective function with a local quadratic form, we obtain a combined estimator as a weighted average of the local estimators. The resulting estimator is proved to be statistically as efficient as the global estimator, while requiring only one round of communication. We further conduct shrinkage estimation based on the DLSA estimator using an adaptive Lasso approach. The solution can be obtained easily by running the LARS algorithm on the master node. We show theoretically that the resulting estimator enjoys the oracle property and is selection consistent when a newly designed distributed Bayesian Information Criterion (DBIC) is used. The finite sample performance and the computational efficiency are further illustrated by an extensive numerical study and an airline dataset, which is 52GB in memory size. The entire methodology has been implemented in Python for Spark, a de facto standard system. Using the proposed DLSA algorithm on the Spark system, it takes 26 minutes to obtain a logistic regression estimator, whereas a full likelihood algorithm takes 15 hours to reach an inferior result.
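The one-round weighted-average step described in the abstract can be illustrated with a minimal sketch for the linear regression case. The abstract does not spell out the weights, so this sketch assumes each worker's weight is its local Hessian `X_k.T @ X_k` (the curvature of its local quadratic form); the helper names `local_fit` and `combine` are hypothetical and are not taken from the authors' Spark implementation.

```python
import numpy as np

def local_fit(X, y):
    """Worker side (hypothetical helper): local OLS estimate and local Hessian X'X."""
    H = X.T @ X                          # curvature of the local quadratic form
    theta = np.linalg.solve(H, X.T @ y)  # local least squares estimate
    return theta, H

def combine(local_results):
    """Master side (hypothetical helper): Hessian-weighted average of local estimates."""
    H_sum = sum(H for _, H in local_results)
    rhs = sum(H @ theta for theta, H in local_results)
    return np.linalg.solve(H_sum, rhs)

# Simulated data, split across K workers (assumed setup, for illustration only).
rng = np.random.default_rng(0)
N, p, K = 3000, 5, 4
X = rng.normal(size=(N, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.normal(size=N)

# Each worker sends (theta_k, H_k) to the master exactly once.
shards = zip(np.array_split(X, K), np.array_split(y, K))
theta_dlsa = combine([local_fit(Xk, yk) for Xk, yk in shards])

# Global estimator computed on the full data, for comparison.
theta_global = np.linalg.solve(X.T @ X, X.T @ y)
```

For linear regression the local objectives are exactly quadratic, so under these assumed weights the combined estimate coincides with the global least squares estimate up to floating-point error. For models such as logistic regression the quadratic form is only an approximation, and the agreement with the global estimator would be approximate rather than exact.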

research
06/05/2022

A weighted average distributed estimator for high dimensional parameter

In this paper, a new weighted average estimator (WAVE) is proposed to en...
research
10/22/2012

Strong oracle optimality of folded concave penalized estimation

Folded concave penalization methods have been shown to enjoy the strong ...
research
07/01/2020

Smooth Lasso Estimator for the Function-on-Function Linear Regression Model

A new estimator, named as S-LASSO, is proposed for the coefficient funct...
research
04/06/2020

Efficient Estimation for Generalized Linear Models on a Distributed System with Nonrandomly Distributed Data

Distributed systems have been widely used in practice to accomplish data...
research
12/20/2014

Competing with the Empirical Risk Minimizer in a Single Pass

In many estimation problems, e.g. linear and logistic regression, we wis...
research
04/05/2023

Distributed Logistic Regression for Massive Data with Rare Events

Large-scale rare events data are commonly encountered in practice. To ta...
research
09/17/2015

Optimal Subsampling Approaches for Large Sample Linear Regression

A significant hurdle for analyzing large sample data is the lack of effe...
