1. Distributed DNN training is communication bound
The goal of this work is to accelerate distributed DNN training in cloud environments. We focus on data parallelism, where workers process different samples and share the same model. A training iteration in this paradigm has two main components: computation-heavy forward and backward passes, and a communication-heavy model update step. As DNN models grow larger and faster accelerators emerge, the performance bottleneck of distributed DNN training has shifted from computation to communication.
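The structure of one data-parallel iteration can be sketched in a few lines of plain NumPy (this is an illustrative toy, not any framework's API; the worker callables are hypothetical):

```python
import numpy as np

def train_iteration(workers, model, lr=0.01):
    """One data-parallel iteration: every worker computes a gradient on its
    own mini-batch, gradients are averaged across workers, and each replica
    applies the same update, keeping all copies of the model in sync."""
    grads = [compute_grad(model) for compute_grad in workers]  # compute-heavy
    avg_grad = np.mean(grads, axis=0)                          # communication-heavy
    return model - lr * avg_grad                               # model update

# Toy run: two "workers" whose gradients are fixed vectors.
workers = [lambda m: np.ones(4), lambda m: 3 * np.ones(4)]
model = train_iteration(workers, np.zeros(4))
# average gradient is [2, 2, 2, 2], so the model becomes [-0.02] * 4
```

The averaging step is where all inter-worker communication happens, which is why it dominates once the forward and backward passes become fast.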
Larger DNN models require more gradient communication per iteration. GPU throughput on ResNet, a recent DNN, has increased by 35x across recent GPU generations (Figure 0(a)), which effectively demands a similar increase in network bandwidth at a fixed batch size. However, network bandwidth in compute instances on major cloud providers such as EC2 has not improved across generational upgrades (ec2, [n. d.]). Further, existing parameter exchange mechanisms struggle to scale total throughput on a standard cloud network stack (Table 1). The compound effect of these factors dramatically increases communication overhead during DDNN training.
Figure 0(b) summarizes the throughput of modest-scale DNN training with 8 workers and 8 colocated PSs on EC2 with 10Gbps links and a per-GPU batch size of 4 (maximizing GPU memory usage on the GRID K520): although modern DNN training frameworks can overlap backward passes with model updates, they can no longer hide the latency of communication behind the now-faster computation. One workaround is to increase the per-GPU batch size, but with a fixed number of GPUs this raises the global batch size, and large global batch sizes hurt statistical efficiency (Goyal et al., 2017; IBM, [n. d.]; Keskar et al., 2016); GPUs also have limited memory. Techniques such as (Chen et al., 2016) can alleviate the memory pressure, but at a higher computational cost.
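A back-of-envelope calculation shows why the communication phase is hard to hide. Assuming a ResNet-50-sized model (roughly 25.6M parameters) with fp32 gradients on a 10Gbps cloud link:

```python
# Rough communication model; the parameter count is the commonly cited
# figure for ResNet-50, the link speed matches the EC2 setting above.
PARAMS = 25.6e6          # ~25.6M parameters
BYTES_PER_PARAM = 4      # fp32 gradients
LINK_GBPS = 10

grad_bytes = PARAMS * BYTES_PER_PARAM               # ~102 MB per worker
push_time_s = grad_bytes * 8 / (LINK_GBPS * 1e9)    # one direction, one worker

print(f"{grad_bytes / 1e6:.0f} MB of gradients, "
      f"{push_time_s * 1e3:.0f} ms just to push them at 10 Gbps")
```

Roughly 80ms per direction per iteration, before any aggregation or pull traffic, which is comparable to or larger than the backward pass on a fast GPU.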
Communication overhead will likely worsen as the gap between computation and communication capability widens. New accelerators continue to reduce computation time, but networks are not getting faster at the same rate. Over the last 5 years, 100 Gbps networks have become available, but they pose high cost and have limited deployment.
These observations suggest that DDNN training has shifted from a compute-bound problem to one that also has a significant network-bound component. It is critical to perform model updates efficiently.
2. PHub: optimized parameter server
[Table 1: training throughput of each framework, run locally and with 2, 4, and 8 workers; data rows not recovered]
Model updates are usually performed in a parameter server (PS), a key-value store for the current model (Smola and Narayanamurthy, 2010; Li et al., 2014a; Li et al., 2014b; Zhang and Ré, 2014). We base our work on MXNet, a widely used, state-of-the-art DDNN training framework that is known to be fast (Table 1, (Cha, [n. d.]; Shi et al., 2016)) and supports native distributed training. Our profiling of MXNet reveals two problems: (1) insufficient network bandwidth (more so with a colocated PS, where the PS process and the worker training process reside in the same machine, than with non-colocated servers) and (2) an inefficient PS software stack. We found that data copy, the aggregator, and the optimizer are the main bottlenecks in the model update process: they prevent the PS from scaling to higher throughput on high-bandwidth networks.
We propose PHub, a high performance PS design for DDNN training. We briefly summarize the main optimizations in PHub.
Network Stack: Optimized InfiniBand support for lower network overhead, with one-shot registration, zero copy, and minimized metadata, so that virtually all bandwidth is dedicated to gradient payload.
Aggregation and Optimization: Fine-grained key chunking (32KB) to maximize overlap of gradient processing with network transfer and to balance load across processor cores, plus a locality-preserving, vectorized implementation of the aggregator and optimizer.
Gradient Memory Layout: A NUMA-aware, balanced scheme for assigning each key chunk to a processor core, via a series of load-balanced, locality-preserving assignments of queue pairs, interfaces, and completion queues to cores (Figure 1(a)). PHub incurs zero synchronization between cores or between NUMA domains.
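As a rough illustration of these two ideas together (the 32KB chunk size and the 5-interfaces-per-domain figure come from the text; the core counts and the modulo assignment rule are hypothetical):

```python
import numpy as np

CHUNK_BYTES = 32 * 1024   # PHub's fine-grained chunk size
NUMA_DOMAINS = 2          # hypothetical machine layout
CORES_PER_DOMAIN = 8
IFACES_PER_DOMAIN = 5     # as in the PBox prototype

def chunk_key(grad: np.ndarray, chunk_bytes: int = CHUNK_BYTES):
    """Split one gradient tensor (a PS key) into fixed-size chunks so that
    aggregating early chunks overlaps with transferring later ones."""
    flat = grad.ravel()
    elems = chunk_bytes // flat.itemsize
    return [flat[i:i + elems] for i in range(0, flat.size, elems)]

def owner_core(chunk_id: int, iface_id: int) -> int:
    """Statically assign each chunk to one core in the NUMA domain of the
    interface it arrives on; a single owner per chunk means no locking and
    no cross-domain memory traffic."""
    domain = iface_id // IFACES_PER_DOMAIN
    return domain * CORES_PER_DOMAIN + chunk_id % CORES_PER_DOMAIN

g = np.zeros(100_000, dtype=np.float32)   # a 400 KB gradient key
chunks = chunk_key(g)                     # 8192 fp32 elements per chunk
owners = {owner_core(c, 7) for c in range(len(chunks))}
# chunks arriving on interface 7 (second NUMA domain) land on cores 8-15 only
```

Because ownership is a pure function of chunk and interface IDs, every core can aggregate its chunks independently, which is what makes the zero-synchronization property possible.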
These software optimizations benefit both centralized and sharded PS configurations (a sharded PS runs multiple PS processes, each in charge of a partition of keys). However, to scale up a central PS, software alone is not sufficient: the hardware in a typical server is unbalanced, with significantly more computing resources than network resources, and a single interface/connection in the server must typically handle traffic for all participating workers. We propose PBox, a new server architecture that balances IO, memory, and network bandwidth. Our prototype PBox is built from a server with ten 56Gbps network interfaces, 5 per NUMA domain (Figure 1(b)). PBox takes full advantage of the PHub software and essentially forms micro-shards inside a single box. We integrated PHub with MXNet; Figure 3 shows the speedup of PHub over an MXNet colocated sharded baseline when training ImageNet-winning networks with 8 workers. PHub on PBox provides up to 3.8x speedup over the state of the art on a cloud-like 10Gbps network (all interfaces negotiated a speed of 10Gbps with the switch in this experiment). The speedup with 56Gbps links is similar, ranging from 2x to 7x depending on the DNN being trained.
PHub shows linear scaling on our 8-worker cluster across all workloads, and provides higher throughput than other parameter exchange patterns using MPI or collectives, because PHub uses only one round of communication and minimal total data transfer per iteration. To understand its limits, we used a special ZeroComputeEngine that simulates infinitely fast computation in MXNet, performing only parameter exchange operations. We found that PBox performance is limited only by the bandwidth between the PCIe controller and the memory system (Figure 5). This limit is hard to hit in real training: we estimate that a single PBox can support up to 120 workers training ResNet-50 with a batch size of 32 per GPU. If each worker includes 4 GPUs, that translates to a global batch size of 15K, surpassing the maximum suggested in (Goyal et al., 2017). Recent work suggests such large batch sizes may impede training (Goyal et al., 2017; IBM, [n. d.]; Keskar et al., 2016; LeCun et al., [n. d.]); if higher scalability is desired, sharding or new platforms with more PCIe throughput (e.g., (AMD, [n. d.])) would let PHub provide higher throughput.
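The global batch size quoted above follows directly from the capacity estimate, using only figures stated in the text:

```python
workers = 120          # estimated capacity of one PBox for ResNet-50
gpus_per_worker = 4
batch_per_gpu = 32

global_batch = workers * gpus_per_worker * batch_per_gpu
print(global_batch)    # 15360, i.e. the ~15K global batch size quoted above
```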
3. In-network Aggregation
The PBox results show the benefit of a high-bandwidth centralized PS. Recent programmable switches (Liu et al., 2017; Li et al., 2016b; Li et al., 2016a) enable a new way to build centralized PS designs by allowing gradient aggregation to be performed in the network. Figure 5 shows how PHub running in a top-of-rack (ToR) switch can exploit the full bisection bandwidth inside a server rack to aggregate gradients centrally within the rack; only a single aggregated stream must then be sent to higher-level switches for further aggregation across racks. This hybrid synchronization reduces bandwidth usage and localizes data movement in the data center.
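The hierarchical scheme can be sketched as follows (a functional toy with NumPy standing in for switch hardware; the two-level layout matches the text, the rack sizes are arbitrary):

```python
import numpy as np

def rack_aggregate(worker_grads):
    """In-network step: the ToR switch sums the gradients of the workers
    in its rack, so only one gradient-sized stream leaves the rack."""
    return np.sum(worker_grads, axis=0)

def hierarchical_aggregate(racks):
    """Cross-rack step: aggregate the per-rack partial sums at a higher
    level. `racks` is a list of lists of per-worker gradient arrays."""
    partials = [rack_aggregate(r) for r in racks]
    return np.sum(partials, axis=0)

# Two racks of two workers each; traffic to the core switch is one message
# per rack instead of one per worker.
racks = [[np.ones(3), np.ones(3)],
         [np.full(3, 2.0), np.full(3, 2.0)]]
total = hierarchical_aggregate(racks)   # -> [6., 6., 6.]
```

With k workers per rack, cross-rack traffic shrinks by a factor of k, which is the bandwidth saving the hybrid synchronization exploits.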
Current switches have limited computational capabilities: most can perform only integer operations, offer little on-switch storage, and can operate on only a small region of each packet. Our emulation-based experiments show that these limits lead to unsatisfactory throughput on current switches; our future work includes exploring the hardware requirements for efficient in-network DDNN training.
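One standard way to work around integer-only switch arithmetic is fixed-point quantization of gradients at the end hosts; the sketch below illustrates the idea (the scale factor and the 32-bit payload width are hypothetical choices, not something the text prescribes):

```python
import numpy as np

SCALE = 1 << 16  # hypothetical fixed-point scale factor

def to_fixed(grad: np.ndarray) -> np.ndarray:
    """Quantize fp32 gradients to integers before they enter the switch,
    since programmable switches can typically only add integers."""
    return np.round(grad * SCALE).astype(np.int64)

def from_fixed(acc: np.ndarray) -> np.ndarray:
    """Rescale the switch-aggregated integer sum back to floating point."""
    return acc.astype(np.float64) / SCALE

# The switch sums the integer payloads; the end host rescales the result.
g1 = np.array([0.5, -0.25])
g2 = np.array([0.25, 0.75])
summed = from_fixed(to_fixed(g1) + to_fixed(g2))   # ≈ [0.75, 0.5]
```

The scale factor trades precision against overflow headroom in the switch's accumulator, which is one of the hardware constraints the future work above would need to quantify.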
- AMD ([n. d.]) [n. d.]. AMD EPYC. http://www.amd.com/en/products/epyc. ([n. d.]).
- ec2 ([n. d.]) [n. d.]. EC2Instances.info Easy Amazon EC2 Instance Comparison. https://www.ec2instances.info/?region=us-west-2. ([n. d.]).
- IBM ([n. d.]) [n. d.]. IBM Research achieves record deep learning performance with new software technology. https://www.ibm.com/blogs/research/2017/08/distributed-deep-learning/. ([n. d.]).
- Cha ([n. d.]) [n. d.]. Performance of Distributed Deep Learning using ChainerMN. https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html. ([n. d.]).
- Chen et al. (2016) Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
- Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677 (2017).
- Keskar et al. (2016) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016).
- LeCun et al. ([n. d.]) Y LeCun, L Bottou, and G Orr. [n. d.]. Efficient BackProp in Neural Networks: Tricks of the Trade (Orr, G. and Müller, K., eds.). Lecture Notes in Computer Science 1524 ([n. d.]).
- Li et al. (2016a) Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, and Dan R. K. Ports. 2016a. Just Say No to Paxos Overhead: Replacing Consensus with Network Ordering. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). USENIX Association, Berkeley, CA, USA, 467–483. http://dl.acm.org/citation.cfm?id=3026877.3026914
- Li et al. (2014a) Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014a. Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, USA, 583–598. http://dl.acm.org/citation.cfm?id=2685048.2685095
- Li et al. (2014b) Mu Li, David G. Andersen, Alexander Smola, and Kai Yu. 2014b. Communication Efficient Distributed Machine Learning with the Parameter Server. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14). MIT Press, Cambridge, MA, USA, 19–27. http://dl.acm.org/citation.cfm?id=2968826.2968829
- Li et al. (2016b) Xiaozhou Li, Raghav Sethi, Michael Kaminsky, David G. Andersen, and Michael J. Freedman. 2016b. Be Fast, Cheap and in Control with SwitchKV. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, Santa Clara, CA, 31–44. https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/li-xiaozhou
- Liu et al. (2017) Ming Liu, Liang Luo, Jacob Nelson, Luis Ceze, Arvind Krishnamurthy, and Kishore Atreya. 2017. IncBricks: Toward In-Network Computation with an In-Network Cache. SIGOPS Oper. Syst. Rev. 51, 2 (April 2017), 795–809. https://doi.org/10.1145/3093315.3037731
- Shi et al. (2016) Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. 2016. Benchmarking State-of-the-Art Deep Learning Software Tools. (2016). arXiv:arXiv:1608.07249
- Smola and Narayanamurthy (2010) Alexander Smola and Shravan Narayanamurthy. 2010. An Architecture for Parallel Topic Models. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 703–710. https://doi.org/10.14778/1920841.1920931
- Zhang and Ré (2014) Ce Zhang and Christopher Ré. 2014. DimmWitted: A Study of Main-memory Statistical Analytics. Proc. VLDB Endow. 7, 12 (Aug. 2014), 1283–1294. https://doi.org/10.14778/2732977.2733001