Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences

04/03/2018
by Rajarshi Biswas, et al.

Remote procedure call (RPC) is the backbone of many modern distributed systems. Google's gRPC is one of the most popular open-source RPC frameworks and serves as the main communication engine for TensorFlow, Google's deep learning framework. TensorFlow primarily uses gRPC to communicate tensors and perform administrative tasks among different processes. Tensor updates during the training phase are communication intensive, so TensorFlow's performance depends heavily on the underlying network and the efficacy of the communication engine. Training deep learning models on TensorFlow can take anywhere from several minutes to several hours or even days, so system researchers must devote a lot of time to understanding the impact of communication on overall performance. Yet there is a lack of benchmarks available to system researchers for this purpose. We therefore propose the TF-gRPC-Bench micro-benchmark suite, which enables system researchers to quickly understand the impact of the underlying network and communication runtime on deep learning workloads. To achieve this, we first analyze the characteristics of the TensorFlow workload over gRPC by training popular deep learning models. We then propose three micro-benchmarks that take these workload characteristics into account. In addition, we comprehensively evaluate gRPC with the TF-gRPC-Bench micro-benchmark suite on different clusters over Ethernet, IPoIB, and RDMA, and present the results.

