High-Performance Distributed ML at Scale through Parameter Server Consistency Models

10/29/2014
by Wei Dai, et al.

As Machine Learning (ML) applications increase in data size and model complexity, practitioners turn to distributed clusters to satisfy the increased computational and memory demands. Unfortunately, effective use of clusters for ML requires considerable expertise in writing distributed code, while highly-abstracted frameworks like Hadoop have not, in practice, approached the performance seen in specialized ML implementations. The recent Parameter Server (PS) paradigm is a middle ground between these extremes, allowing easy conversion of single-machine parallel ML applications into distributed ones, while maintaining high throughput through relaxed "consistency models" that allow inconsistent parameter reads. However, due to insufficient theoretical study, it is not clear which of these consistency models can really ensure correct ML algorithm output; at the same time, there remain many theoretically-motivated but undiscovered opportunities to maximize computational throughput. Motivated by this challenge, we study both the theoretical guarantees and empirical behavior of iterative-convergent ML algorithms in existing PS consistency models. We then use the gleaned insights to improve a consistency model using an "eager" PS communication mechanism, and implement it as a new PS system that enables ML algorithms to reach their solution more quickly.
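The relaxed consistency the abstract refers to is typified by bounded-staleness schemes, in which a worker may read parameters that are out of date by at most a fixed number of iterations. The following is a minimal sketch of such a bounded-staleness parameter store; the class name, methods, and single-process structure are illustrative assumptions for exposition, not the system described in the paper.

```python
class BoundedStalenessStore:
    """Sketch of a parameter store with bounded-staleness (relaxed) reads.

    Each worker advances a per-worker iteration clock. A worker may read
    the shared parameters only if it is at most `staleness` iterations
    ahead of the slowest worker; otherwise it must wait for stragglers.
    This is an illustrative single-process model, not a real distributed
    implementation.
    """

    def __init__(self, num_workers, staleness, dim):
        self.staleness = staleness
        self.clocks = [0] * num_workers  # per-worker iteration clocks
        self.params = [0.0] * dim        # shared model parameters

    def push(self, worker, update):
        # Apply an additive update (e.g., a gradient step) to the
        # shared parameters.
        for i, u in enumerate(update):
            self.params[i] += u

    def clock(self, worker):
        # Worker signals completion of one iteration.
        self.clocks[worker] += 1

    def can_read(self, worker):
        # Bounded-staleness condition: the reader may be at most
        # `staleness` clocks ahead of the slowest worker.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def pull(self, worker):
        if not self.can_read(worker):
            raise RuntimeError("worker must wait for stragglers")
        return list(self.params)
```

With staleness 1 and two workers, a worker that races two iterations ahead is blocked from reading until the slower worker completes an iteration; this is the mechanism that lets such systems trade read freshness for throughput while still bounding how inconsistent a read can be.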


Related research

- Consistent Bounded-Asynchronous Parameter Servers for Distributed ML (12/30/2013)
  In distributed ML applications, shared parameters are usually replicated...

- Empirical Study of Straggler Problem in Parameter Server on Iterative Convergent Distributed Machine Learning (07/28/2023)
  The purpose of this study is to test the effectiveness of current stragg...

- Strategies and Principles of Distributed Machine Learning on Big Data (12/31/2015)
  The rise of Big Data has led to new demands for Machine Learning (ML) sy...

- LightLDA: Big Topic Models on Modest Compute Clusters (12/04/2014)
  When building large-scale machine learning (ML) programs, such as big to...

- Structure-Aware Dynamic Scheduler for Parallel Machine Learning (12/19/2013)
  Training large machine learning (ML) models with many variables or param...

- Tensor Relational Algebra for Machine Learning System Design (09/01/2020)
  Machine learning (ML) systems have to support various tensor operations....

- Learnability of Learning Performance and Its Application to Data Valuation (07/13/2021)
  For most machine learning (ML) tasks, evaluating learning performance on...
