Consistent Bounded-Asynchronous Parameter Servers for Distributed ML

12/30/2013
by Jinliang Wei, et al.

In distributed ML applications, shared parameters are usually replicated among computing nodes to minimize network overhead. A consistency model must therefore be chosen carefully to ensure the algorithm's correctness while providing high throughput. Existing consistency models, whether from general-purpose databases or modern distributed ML systems, are either too loose to guarantee the correctness of ML algorithms or too strict to fully exploit the computing power of the underlying distributed system. Many ML algorithms are iterative convergent algorithms: they start from a randomly chosen initial point and converge to an optimum by repeating a set of procedures iteratively. We have found that many such algorithms are robust to a bounded amount of inconsistency and still converge correctly. This property allows distributed ML systems to relax strict consistency models and improve performance while still theoretically guaranteeing algorithmic correctness. In this paper, we present several relaxed consistency models for asynchronous parallel computation and theoretically prove their algorithmic correctness. The proposed consistency models are implemented in a distributed parameter server and evaluated in the context of a popular ML application: topic modeling.
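
As a concrete illustration of the bounded-inconsistency idea, the sketch below shows a bounded-staleness ("stale synchronous parallel"-style) parameter cache in Python. This is a minimal sketch under assumed semantics, not the paper's actual system: the class ParamCache, the methods get/inc/clock, and the staleness bound s are hypothetical names introduced here for illustration.

```python
# Minimal sketch of a bounded-staleness parameter cache, illustrating how a
# relaxed consistency model can bound the inconsistency a reader observes.
# All names (ParamCache, get, inc, clock) are hypothetical; they are not the
# interface of the paper's system.

import threading


class ParamCache:
    """Shared-parameter view with a staleness bound s.

    A read by a worker at clock t must reflect all updates made up to
    clock (t - s); fresher updates may or may not be visible. If the
    slowest worker lags more than s clocks behind the reader, the read
    blocks until that worker catches up.
    """

    def __init__(self, staleness, num_workers):
        self.s = staleness
        self.values = {}                      # parameter id -> value
        self.worker_clock = [0] * num_workers
        self.cond = threading.Condition()

    def inc(self, key, delta):
        # Updates are commutative increments, so applying them in any
        # order within the staleness window does not change the result.
        with self.cond:
            self.values[key] = self.values.get(key, 0.0) + delta

    def get(self, key, my_clock):
        # Enforce the staleness bound: block while the slowest worker
        # is more than s clocks behind this reader.
        with self.cond:
            while min(self.worker_clock) < my_clock - self.s:
                self.cond.wait()
            return self.values.get(key, 0.0)

    def clock(self, worker_id):
        # A worker signals the end of one iteration; wake blocked readers.
        with self.cond:
            self.worker_clock[worker_id] += 1
            self.cond.notify_all()
```

As a usage sketch, in a topic-modeling workload like the paper's evaluation, each worker could call inc() to push word-topic count deltas during sampling and clock() after each sweep over its documents; the bound s then trades parameter freshness for throughput.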

Related Research

High-Performance Distributed ML at Scale through Parameter Server Consistency Models (10/29/2014)
As Machine Learning (ML) applications increase in data size and model co...

Structure-Aware Dynamic Scheduler for Parallel Machine Learning (12/19/2013)
Training large machine learning (ML) models with many variables or param...

Global Stabilization for Causally Consistent Partial Replication (03/15/2018)
Causally consistent distributed storage systems have received significan...

Parallax: Automatic Data-Parallel Training of Deep Neural Networks (08/08/2018)
The employment of high-performance servers and GPU accelerators for trai...

Just-Right Consistency: reconciling availability and safety (01/19/2018)
By the CAP Theorem, a distributed data storage system can ensure either ...

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training (03/09/2023)
Geo-distributed ML training can benefit many emerging ML scenarios (e.g....

Data Consistency in Transactional Storage Systems: a Centralised Approach (01/29/2019)
Modern distributed databases weaken data consistency guarantees to allow...
