Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning

06/21/2019
by Robin Vogel, et al.

The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is separable across individual data points, such as classification and regression. In contrast, statistical learning tasks involving pairs (or more generally tuples) of data points, such as metric learning, clustering, or ranking, do not lend themselves as easily to data-parallelism and in-memory computing. In this paper, we investigate how to balance statistical performance and computational efficiency in such distributed tuplewise statistical problems. We first propose a simple strategy based on occasionally repartitioning data across workers between parallel computation stages, where the number of repartitioning steps governs the trade-off between accuracy and runtime. We then present theoretical results highlighting the benefits of the proposed method in terms of variance reduction, and extend our results to design distributed stochastic gradient descent algorithms for tuplewise empirical risk minimization. Our results are supported by numerical experiments in pairwise statistical estimation and learning on synthetic and real-world datasets.
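To make the repartitioning idea concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes a pairwise (degree-two) statistic with a hypothetical kernel `kernel(x, y) = (x - y)^2 / 2`, whose average over all pairs is the unbiased sample variance. Workers are simulated as blocks of a shuffled list, each worker averages the kernel over its local pairs only, and the estimate is averaged over several random repartitions. The function names, the kernel, and the simple averaging scheme are all illustrative assumptions.

```python
import random
from itertools import combinations


def kernel(x, y):
    # Hypothetical pairwise kernel: its average over all pairs
    # equals the unbiased sample variance.
    return 0.5 * (x - y) ** 2


def complete_u_statistic(points):
    # Average the kernel over every pair: the reference estimator,
    # which needs all points in one place (quadratic cost).
    pairs = list(combinations(points, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def partitioned_estimate(points, n_workers, rng):
    # Shuffle, split the data into n_workers blocks, and average the
    # local statistics. Pairs that straddle two blocks are never seen.
    # Assumes every block receives at least two points.
    shuffled = points[:]
    rng.shuffle(shuffled)
    blocks = [shuffled[i::n_workers] for i in range(n_workers)]
    return sum(complete_u_statistic(b) for b in blocks) / n_workers


def repartitioned_estimate(points, n_workers, n_rounds, rng):
    # Repeat the partitioned estimate over n_rounds fresh random
    # partitions and average, so more distinct pairs contribute.
    total = sum(partitioned_estimate(points, n_workers, rng)
                for _ in range(n_rounds))
    return total / n_rounds


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(200)]
    print("all pairs:      ", complete_u_statistic(data))
    print("1 partition:    ", partitioned_estimate(data, 10, rng))
    print("20 repartitions:", repartitioned_estimate(data, 10, 20, rng))
```

In this sketch, `n_rounds` plays the role of the number of repartitioning steps: each extra round costs another shuffle (a stand-in for network communication in a real cluster) in exchange for covering pairs that a single partition never compares.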


