Network-accelerated Distributed Machine Learning Using MLFabric

06/30/2019
by   Raajay Viswanathan, et al.
0

Existing distributed machine learning (DML) systems focus on improving the computational efficiency of distributed learning, whereas communication aspects have received less attention. Many DML systems treat the network as a blackbox. Thus, DML algorithms' performance is impeded by network bottlenecks, and DML systems end up sacrificing important algorithmic and system-level benefits. We present MLfabric, a communication library that manages all network transfers in a DML system, and holistically determines the communication pattern of a DML algorithm at any point in time. This allows MLfabric to carefully order transfers (i.e., gradient updates) to improve convergence, opportunistically aggregate updates in-network to improve efficiency, and proactively replicate some of them to support new notions of fault tolerance. We empirically find that MLfabric achieves up to 3X speed-up in training large deep learning models in realistic dynamic cluster settings.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/16/2021

A Survey on Fault-tolerance in Distributed Optimization and Machine Learning

The robustness of distributed optimization is an emerging field of study...
research
08/27/2019

Distributed Consistent Network Updates in SDNs: Local Verification for Global Guarantees

While SDNs enable more flexible and adaptive network operations, (logica...
research
02/27/2023

Architecting Peer-to-Peer Serverless Distributed Machine Learning Training for Improved Fault Tolerance

Distributed Machine Learning refers to the practice of training a model ...
research
12/14/2020

Quantizing data for distributed learning

We consider machine learning applications that train a model by leveragi...
research
03/16/2019

SLSGD: Secure and Efficient Distributed On-device Machine Learning

We consider distributed on-device learning with limited communication an...
research
06/22/2021

Communication Pattern Models: An Extension of Action Models for Dynamic-Network Distributed Systems

Halpern and Moses were the first to recognize, in 1984, the importance o...
research
06/28/2019

Asymptotic Network Independence in Distributed Optimization for Machine Learning

We provide a discussion of several recent results which have overcome a ...

Please sign up or login with your details

Forgot password? Click here to reset