Communication Scheduling as a First-Class Citizen in Distributed Machine Learning Systems

03/08/2018
by Sayed Hadi Hashemi, et al.

State-of-the-art machine learning systems rely on graph-based models, and distributed training of these models is the norm in AI-powered production pipelines. The performance of these communication-heavy systems depends on how effectively communication overlaps with computation. While the overlap challenge has been addressed in systems with simpler model representations, it remains an open problem in graph-based models. In this work, we develop a system for communication scheduling that realizes near-optimal overlap of communication and computation in graph-based models. Our system is implemented over TensorFlow and requires no changes to the model or developer inputs. Our system improves throughput by up to 82% in inference and 20% in training, while also reducing the straggler effect by up to 2.8x. Part of our implementation has already been merged into the TensorFlow codebase; the rest is publicly available.
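To make the overlap idea concrete, the following is a minimal toy sketch, not the authors' system. It assumes a single worker whose parameters arrive back to back over one link, and a chain of ops that each wait for their own parameter. The function `iteration_time`, the parameter names `w1` to `w3`, and all timings are hypothetical and chosen only for illustration.

```python
def iteration_time(transfer_order, transfer_time, compute_time, op_order):
    """Finish time of one iteration on a single worker (toy model).

    Assumptions: transfers share one link and run back to back in
    `transfer_order`; ops run sequentially in `op_order`, and each op
    can start only after its own parameter has arrived.
    """
    # Arrival time of each parameter under the chosen transfer order.
    arrival = {}
    clock = 0.0
    for name in transfer_order:
        clock += transfer_time[name]
        arrival[name] = clock

    # Each op starts once the previous op is done and its parameter is present.
    finish = 0.0
    for name in op_order:
        start = max(finish, arrival[name])
        finish = start + compute_time[name]
    return finish


# Hypothetical timings (arbitrary units).
transfer_time = {"w1": 2.0, "w2": 2.0, "w3": 2.0}
compute_time = {"w1": 3.0, "w2": 3.0, "w3": 3.0}
op_order = ["w1", "w2", "w3"]

# Transfers in the order the ops consume them: later transfers hide behind compute.
good = iteration_time(["w1", "w2", "w3"], transfer_time, compute_time, op_order)

# Reverse order: the first op stalls until the last transfer completes.
bad = iteration_time(["w3", "w2", "w1"], transfer_time, compute_time, op_order)

print(good, bad)
```

In this toy model the same transfers and the same ops finish sooner when parameters arrive in the order they are consumed, which is the kind of gap a communication scheduler aims to close.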

research
03/08/2018

TicTac: Accelerating Distributed Deep Learning with Communication Scheduling

State-of-the-art deep learning systems rely on iterative distributed tra...
research
03/14/2016

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorFlow is an interface for expressing machine learning algorithms, a...
research
10/18/2018

Private Machine Learning in TensorFlow using Secure Computation

We present a framework for experimenting with secure multi-party computa...
research
10/16/2018

AutoGraph: Imperative-style Coding with Graph-based Performance

There is a perceived trade-off between machine learning code that is eas...
research
05/04/2018

Dynamic Control Flow in Large-Scale Machine Learning

Many recent machine learning models rely on fine-grained dynamic control...
research
03/07/2023

Probing Graph Representations

Today we have a good theoretical understanding of the representational p...
research
04/12/2021

Distributed Learning Systems with First-order Methods

Scalable and efficient distributed learning is one of the main driving f...
