Communication-Efficient Algorithms for Statistical Optimization

09/19/2012
by   Yuchen Zhang, et al.

We analyze two communication-efficient algorithms for distributed statistical optimization on large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average-mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^-1 + (N/m)^-2). Whenever m < √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^-1 + (N/m)^-3), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^-1 + (N/m)^-3/2), easing computation at the expense of a penalty in the rate of convergence. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates.
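The average-mixture idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses least-squares regression (with a closed-form local solve) as a stand-in for the paper's more general empirical risk minimization, and the function names and synthetic data are assumptions for demonstration only.

```python
import numpy as np

def local_least_squares(X, y):
    # Local empirical risk minimization on one machine's subset;
    # for least squares this is the closed-form solution of min ||X theta - y||^2.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def averaged_estimate(X, y, m):
    # Split the N samples evenly across m machines, solve separately, average.
    thetas = [local_least_squares(Xi, yi)
              for Xi, yi in zip(np.array_split(X, m), np.array_split(y, m))]
    return np.mean(thetas, axis=0)

# Synthetic demo: N samples, d covariates, noisy linear model.
rng = np.random.default_rng(0)
N, d, m = 10_000, 5, 10
theta_star = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ theta_star + 0.1 * rng.normal(size=N)

theta_avg = averaged_estimate(X, y, m)
theta_central = local_least_squares(X, y)  # centralized solve on all N samples
```

With m = 10 and N = 10,000 the condition m < √N holds, and the averaged estimate lands very close to both the centralized solution and the true parameter, consistent with the O(N^-1 + (N/m)^-2) guarantee.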


