Horovod: fast and easy distributed deep learning in TensorFlow

02/15/2018
by Alexander Sergeev, et al.

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod.
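The "few lines of modification" claim refers to Horovod's wrapper-based API. The sketch below follows the usage pattern documented for Horovod's TensorFlow (1.x-era) integration; the toy linear-regression model, learning rate, and step count are illustrative placeholders standing in for a user's existing training code, not part of the library.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod's communication layer.
hvd.init()

# Pin each process to one GPU, indexed by its local rank on the host.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model: least-squares regression on synthetic data.
x = tf.random_normal([32, 10])
y = tf.reduce_sum(x, axis=1, keepdims=True)
w = tf.get_variable('w', shape=[10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# Scale the learning rate by the number of workers, then wrap the
# optimizer so gradients are averaged across workers via allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

# Broadcast rank 0's initial variables so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=1000)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

Everything Horovod-specific here is the init call, the GPU pinning, the learning-rate scaling, the optimizer wrapper, and the broadcast hook; the rest is unchanged single-GPU TensorFlow. The script is launched once per GPU, e.g. with `mpirun -np 4 python train.py`.

The ring reduction mentioned in the abstract is the ring-allreduce pattern, in which each of p workers exchanges tensor chunks only with its two ring neighbors, so per-worker traffic is roughly 2(p-1)/p of the tensor size regardless of p. The single-process NumPy simulation below is not Horovod code; it is only a sketch of the two phases (scatter-reduce, then allgather) to make the communication pattern concrete.

```python
import numpy as np

def ring_allreduce_sim(tensors):
    """Simulate ring-allreduce: tensors[r] is worker r's gradient vector.
    Each vector is split into p chunks; every chunk travels around the
    ring twice, once being reduced and once being redistributed."""
    p = len(tensors)
    chunks = [np.array_split(np.asarray(t, dtype=float), p) for t in tensors]

    # Phase 1: scatter-reduce. In step s, worker r sends chunk (r - s) % p
    # to worker (r + 1) % p, which adds it to its own copy. After p - 1
    # steps, worker r holds the fully reduced chunk (r + 1) % p.
    for s in range(p - 1):
        for r in range(p):
            c = (r - s) % p
            chunks[(r + 1) % p][c] = chunks[(r + 1) % p][c] + chunks[r][c]

    # Phase 2: allgather. In step s, worker r forwards its completed chunk
    # (r + 1 - s) % p to worker (r + 1) % p, overwriting the stale copy.
    for s in range(p - 1):
        for r in range(p):
            c = (r + 1 - s) % p
            chunks[(r + 1) % p][c] = chunks[r][c]

    return [np.concatenate(cs) for cs in chunks]

# Every worker ends up with the elementwise sum of all three inputs.
out = ring_allreduce_sim([np.arange(6), np.ones(6), 2 * np.ones(6)])
assert all(np.allclose(o, np.arange(6) + 3) for o in out)
```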


Related research

06/11/2017 · Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
Deep learning models can take weeks to train on a single GPU-equipped ma...

08/07/2017 · PowerAI DDL
As deep neural networks become more complex and input datasets grow larg...

12/29/2020 · TensorX: Extensible API for Neural Network Model Design and Deployment
TensorX is a Python library for prototyping, design, and deployment of c...

04/30/2019 · AdaNet: A Scalable and Flexible Framework for Automatically Learning Ensembles
AdaNet is a lightweight TensorFlow-based (Abadi et al., 2015) framework ...

06/17/2021 · DeepLab2: A TensorFlow Library for Deep Labeling
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a ...

01/01/2023 · MIGPerf: A Comprehensive Benchmark for Deep Learning Training and Inference Workloads on Multi-Instance GPUs
New architecture GPUs like A100 are now equipped with multi-instance GPU...

01/30/2020 · Non-Determinism in TensorFlow ResNets
We show that the stochasticity in training ResNets for image classificat...
