Caramel: Accelerating Decentralized Distributed Deep Learning with Computation Scheduling

04/29/2020
by Sayed Hadi Hashemi, et al.

The method of choice for parameter aggregation in Deep Neural Network (DNN) training, a network-intensive task, is shifting from the Parameter Server model to decentralized aggregation schemes (AllReduce), inspired by theoretical guarantees of better performance. However, current implementations of AllReduce overlook the interdependence of communication and computation, resulting in significant performance degradation. In this paper, we develop Caramel, a system that accelerates decentralized distributed deep learning through model-aware computation scheduling and communication optimizations for AllReduce. Caramel achieves this goal through (a) computation DAG scheduling that expands the feasible window of transfer for each parameter (transfer boundaries), and (b) network optimizations that smooth the communication load, including adaptive batching and pipelining of parameter transfers. Caramel maintains the correctness of the dataflow model, is hardware-independent, and does not require any user-level or framework-level changes. We implement Caramel over TensorFlow and show that the iteration time of DNN training can be improved by up to 3.62x in a cloud environment.
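To make the batching idea concrete, the sketch below illustrates one plausible form of adaptive batching of parameter transfers: small gradient tensors are grouped into buckets of roughly a target byte size, and one collective transfer is issued per bucket so that each AllReduce call moves enough data to amortize per-transfer overhead. This is only an illustrative approximation, not Caramel's implementation; the names `batched_allreduce`, `fake_allreduce`, and `bucket_bytes` are hypothetical, and the AllReduce itself is stubbed out to keep the example self-contained.

```python
# Illustrative sketch of gradient bucketing before AllReduce.
# NOT Caramel's implementation; names and thresholds are hypothetical.
import numpy as np

def fake_allreduce(flat_buffer):
    """Stand-in for a real AllReduce (e.g. an MPI/NCCL collective);
    returns the buffer unchanged so the sketch runs without a cluster."""
    return flat_buffer

def batched_allreduce(named_grads, bucket_bytes=1 << 20):
    """Group (name, ndarray) gradients into ~bucket_bytes buckets,
    flatten each bucket into one buffer, reduce it, then unpack."""
    reduced = {}
    bucket, bucket_size = [], 0

    def flush():
        nonlocal bucket, bucket_size
        if not bucket:
            return
        flat = np.concatenate([g.ravel() for _, g in bucket])
        flat = fake_allreduce(flat)           # one transfer per bucket
        offset = 0
        for name, g in bucket:                # unpack into original shapes
            n = g.size
            reduced[name] = flat[offset:offset + n].reshape(g.shape)
            offset += n
        bucket, bucket_size = [], 0

    for name, grad in named_grads:            # gradients arrive in DAG order
        bucket.append((name, grad))
        bucket_size += grad.nbytes
        if bucket_size >= bucket_bytes:
            flush()                           # bucket full: start transfer now
    flush()                                   # flush the remaining tail bucket
    return reduced

# Example: several same-sized layer gradients batched into 256 KiB buckets.
grads = [(f"layer{i}/w", np.random.randn(256, 64).astype(np.float32))
         for i in range(8)]
out = batched_allreduce(grads, bucket_bytes=256 * 1024)
print({k: v.shape for k, v in out.items()})
```

In a real deployment the stub would be replaced by a framework collective, and buckets would be flushed as soon as their gradients become available during the backward pass, which is where the paper's DAG scheduling and pipelining of transfers come in.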

Related research

03/08/2018  TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
State-of-the-art deep learning systems rely on iterative distributed tra...

03/11/2023  OCCL: a Deadlock-free Library for GPU Collective Communication
Various distributed deep neural network (DNN) training technologies lead...

03/06/2020  Communication Optimization Strategies for Distributed Deep Learning: A Survey
Recent trends in high-performance computing and deep learning lead to a ...

11/08/2021  BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning
Decentralized algorithm is a form of computation that achieves a global ...

05/21/2018  Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
Distributed deep neural network (DDNN) training constitutes an increasin...

01/30/2018  Parameter Hub: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training
Most work in the deep learning systems community has focused on faster i...

05/31/2020  Cheetah: Optimizations and Methods for Privacy-Preserving Inference via Homomorphic Encryption
As the application of deep learning continues to grow, so does the amoun...
