RPC Considered Harmful: Fast Distributed Deep Learning on RDMA

05/22/2018
by Jilong Xue et al.

Deep learning has emerged as an important new resource-intensive workload and has been successfully applied in computer vision, speech, natural language processing, and other domains. Distributed deep learning is becoming a necessity to cope with growing data and model sizes. Its computation is typically characterized by a simple tensor data abstraction to model multi-dimensional matrices, a data-flow graph to model computation, and iterative executions with relatively frequent synchronizations, thereby making it substantially different from Map/Reduce-style distributed big-data computation. RPC, commonly used as the communication primitive, has been adopted by popular deep learning frameworks such as TensorFlow, which uses gRPC. We show that RPC is sub-optimal for distributed deep learning computation, especially on an RDMA-capable network. The tensor abstraction and data-flow graph, coupled with an RDMA network, offer the opportunity to reduce unnecessary overhead (e.g., memory copy) without sacrificing programmability and generality. In particular, from a data access point of view, a remote machine is abstracted simply as a "device" on an RDMA channel, with a simple memory interface for allocating, reading, and writing memory regions. Our graph analyzer examines both the data-flow graph and the tensors to optimize memory allocation and remote data access through this interface. The result is up to a 25x speedup in representative deep learning benchmarks against the standard gRPC in TensorFlow, and up to a 169% improvement even against an RPC implementation optimized for RDMA, leading to faster convergence in the training process.
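As a rough illustration of the "device on an RDMA channel" idea described above, the sketch below models a remote machine as an object exposing only allocate, read, and write operations over memory regions. This is a minimal, hypothetical sketch, not the paper's actual interface: the names RdmaDevice, RemoteRegion, Alloc, Read, and Write are assumptions, and a local buffer stands in for remote memory so the example is self-contained; a real implementation would back these calls with RDMA verbs (memory registration and one-sided RDMA READ/WRITE).

```cpp
// Hypothetical sketch of a "remote machine as a device" memory interface.
// A local byte buffer stands in for the remote side so this compiles and
// runs; a real implementation would use RDMA verbs underneath.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Handle to a memory region allocated on the remote "device".
struct RemoteRegion {
  uint64_t addr;  // remote address (here: offset into the stand-in buffer)
  size_t size;    // region length in bytes
};

// Minimal allocate/read/write interface over an RDMA channel.
class RdmaDevice {
 public:
  // Allocate (and, in a real system, register) a remote memory region.
  RemoteRegion Alloc(size_t size) {
    size_t off = storage_.size();
    storage_.resize(off + size);
    return RemoteRegion{off, size};
  }
  // One-sided write: push local bytes into the remote region, no RPC.
  void Write(const RemoteRegion& r, const void* src, size_t n) {
    std::memcpy(storage_.data() + r.addr, src, n);  // bounds checks omitted
  }
  // One-sided read: pull remote bytes into a local buffer.
  void Read(const RemoteRegion& r, void* dst, size_t n) {
    std::memcpy(dst, storage_.data() + r.addr, n);
  }

 private:
  std::vector<uint8_t> storage_;  // stand-in for remote memory
};

int main() {
  RdmaDevice dev;
  // A graph analyzer that knows a tensor's size statically could
  // pre-allocate its region, letting the producer write it in place and
  // the consumer read it directly, skipping RPC serialization and copies.
  float tensor[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  RemoteRegion r = dev.Alloc(sizeof(tensor));
  dev.Write(r, tensor, sizeof(tensor));

  float out[4] = {};
  dev.Read(r, out, sizeof(out));
  std::cout << out[0] << " " << out[3] << "\n";  // prints: 1 4
}
```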


Related research

04/03/2018 - Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences
Remote procedure call (RPC) is the backbone of many modern distributed s...

10/28/2021 - OneFlow: Redesign the Distributed Deep Learning Framework from Scratch
Deep learning frameworks such as TensorFlow and PyTorch provide a produc...

03/08/2018 - TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
State-of-the-art deep learning systems rely on iterative distributed tra...

02/24/2023 - Decoupling the All-Reduce Primitive for Accelerating Distributed Deep Learning
Communication scheduling has been shown to be effective in accelerating ...

09/22/2022 - Deep Lake: a Lakehouse for Deep Learning
Traditional data lakes provide critical data infrastructure for analytic...

04/30/2021 - Memory Reduction using a Ring Abstraction over GPU RDMA for Distributed Quantum Monte Carlo Solver
Scientific applications that run on leadership computing facilities ofte...

03/08/2019 - Auto-Vectorizing TensorFlow Graphs: Jacobians, Auto-Batching And Beyond
We propose a static loop vectorization optimization on top of high level...
