Median Selection Subset Aggregation for Parallel Inference

10/24/2014
by Xiangyu Wang, et al.

For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines while minimizing communication costs. Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but it remains challenging to define an algorithm that combines low communication, theoretical guarantees, and excellent practical performance in general settings. We propose a MEdian Selection Subset AGgregation Estimator (message) algorithm, which is designed to address these challenges. The algorithm applies feature selection in parallel for each subset using Lasso or another method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in both sample size and feature size, and has theoretical guarantees. In particular, we show model selection consistency and coefficient estimation efficiency. Extensive experiments show excellent performance in variable selection, estimation, prediction, and computation time relative to the usual competitors.
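
The four steps described in the abstract (per-subset selection, median inclusion, per-subset estimation, averaging) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `lasso_cd` and `message_fit`, the penalty `lam`, the number of subsets, the use of a plain coordinate-descent Lasso for selection, ordinary least squares for the refit, and the tie rule "selected by at least half of the subsets" are all assumptions made for illustration. The loop over subsets runs serially here; in a distributed setting each subset would be handled on its own machine.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Hypothetical helper: coordinate-descent Lasso for
    (1/2n)*||y - X b||^2 + lam*||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)  # residual y - X @ beta, with beta = 0 at start
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coordinate j.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def message_fit(X, y, n_subsets=5, lam=0.1):
    """Sketch of the select / median / refit / average pipeline."""
    n, p = X.shape
    # Split the rows (samples) into subsets, one per hypothetical machine.
    row_blocks = np.array_split(np.arange(n), n_subsets)

    # Step 1: feature selection on each subset (assumed: Lasso support).
    inclusion = np.array(
        [lasso_cd(X[idx], y[idx], lam) != 0 for idx in row_blocks]
    )

    # Step 2: median of the 0/1 inclusion vectors across subsets.
    # Assumed tie rule: keep a feature chosen by at least half the subsets.
    selected = np.median(inclusion.astype(float), axis=0) >= 0.5

    # Steps 3 and 4: least-squares refit on the selected features within
    # each subset, then average the per-subset coefficient estimates.
    beta = np.zeros(p)
    if selected.any():
        fits = [
            np.linalg.lstsq(X[idx][:, selected], y[idx], rcond=None)[0]
            for idx in row_blocks
        ]
        beta[selected] = np.mean(fits, axis=0)
    return selected, beta
```

In this sketch only the binary inclusion vectors and the per-subset coefficient vectors would need to be exchanged between machines, which is consistent with the low-communication design described above. An odd number of subsets avoids ties in the median step.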


research
10/19/2016

Robust and Parallel Bayesian Model Selection

Effective and accurate model selection is an important problem in modern...
research
02/08/2016

DECOrrelated feature space partitioning for distributed sparse regression

Fitting statistical models is computationally challenging when the sampl...
research
10/15/2016

Communication-efficient Distributed Sparse Linear Discriminant Analysis

We propose a communication-efficient distributed estimation method for s...
research
07/21/2020

ADAGES: adaptive aggregation with stability for distributed feature selection

In this era of "big" data, not only the large amount of data keeps motiv...
research
08/16/2019

NUQSGD: Improved Communication Efficiency for Data-parallel SGD via Nonuniform Quantization

As the size and complexity of models and datasets grow, so does the need...
research
08/01/2023

Best-Subset Selection in Generalized Linear Models: A Fast and Consistent Algorithm via Splicing Technique

In high-dimensional generalized linear models, it is crucial to identify...
research
05/30/2017

Forward-Backward Selection with Early Dropping

Forward-backward selection is one of the most basic and commonly-used fe...
