Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines

12/19/2015
by   Hao Zhang, et al.
0

Deep learning (DL) has achieved notable successes in many machine learning tasks. A number of frameworks have been developed to expedite the process of designing and training deep neural networks (DNNs), such as Caffe, Torch and Theano. Currently they can harness multiple GPUs on a single machine, but are unable to use GPUs that are distributed across multiple machines; as even average-sized DNNs can take days to train on a single GPU with 100s of GBs to TBs of data, distributed GPUs present a prime opportunity for scaling up DL. However, the limited bandwidth available on commodity Ethernet networks presents a bottleneck to distributed GPU training, and prevents its trivial realization. To investigate how to adapt existing frameworks to efficiently support distributed GPUs, we propose Poseidon, a scalable system architecture for distributed inter-machine communication in existing DL frameworks. We integrate Poseidon with Caffe and evaluate its performance at training DNNs for object recognition. Poseidon features three key contributions that accelerate DNN training on clusters: (1) a three-level hybrid architecture that allows Poseidon to support both CPU-only and GPU-equipped clusters, (2) a distributed wait-free backpropagation (DWBP) algorithm to improve GPU utilization and to balance communication, and (3) a structure-aware communication protocol (SACP) to minimize communication overheads. We empirically show that Poseidon converges to same objectives as a single machine, and achieves state-of-art training speedup across multiple models and well-established datasets using a commodity GPU cluster of 8 nodes (e.g. 4.5x speedup on AlexNet, 4x on GoogLeNet, 4x on CIFAR-10). On the much larger ImageNet22K dataset, Poseidon with 8 nodes achieves better speedup and competitive accuracy to recent CPU-based distributed systems such as Adam and Le et al., which use 10s to 1000s of nodes.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/11/2017

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Deep learning models can take weeks to train on a single GPU-equipped ma...
research
10/20/2020

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Distributed training techniques have been widely deployed in large-scale...
research
08/08/2018

Parallax: Automatic Data-Parallel Training of Deep Neural Networks

The employment of high-performance servers and GPU accelerators for trai...
research
11/30/2019

Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters

In this paper, we evaluate training of deep recurrent neural networks wi...
research
07/14/2023

DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training

Memory-based Temporal Graph Neural Networks are powerful tools in dynami...
research
03/11/2023

OCCL: a Deadlock-free Library for GPU Collective Communication

Various distributed deep neural network (DNN) training technologies lead...
research
04/05/2020

Reducing Data Motion to Accelerate the Training of Deep Neural Networks

This paper reduces the cost of DNNs training by decreasing the amount of...

Please sign up or login with your details

Forgot password? Click here to reset