Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

10/09/2021
by   Saeed Rashidi, et al.

The continuous growth in both model size and training data for modern Deep Neural Network (DNN) models has led to training tasks taking days or even months. Distributed training reduces training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs). However, distributed training adds communication overhead between the NPUs to synchronize the gradients and/or activations, depending on the parallelization strategy. In today's datacenters, for training at scale, NPUs are connected through multi-dimensional interconnects whose links differ in bandwidth and latency. Hence, as this work identifies, keeping all network dimensions busy and maximizing network BW utilization is a challenging task in such a hybrid network environment. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication load across all dimensions, further improving network BW utilization. Our results show that, on average, Themis improves the network BW utilization of a single All-Reduce by 1.88x (2.92x max), and improves the end-to-end training iteration performance of real workloads such as ResNet-50, GNMT, DLRM, and Transformer-1T by 1.49x (1.96x max), 1.41x (1.81x max), 1.42x (1.80x max), and 1.35x (1.78x max), respectively.
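To make the load-balancing idea concrete, below is a minimal, illustrative Python sketch (not the authors' implementation) of a bandwidth-aware chunk scheduler in the spirit described by the abstract: a collective is split into chunks, and each chunk is greedily assigned to the network dimension with the least accumulated busy time, so fast and slow dimensions finish at roughly the same time. The dimension names, bandwidths, and chunk sizes are hypothetical.

```python
# Illustrative sketch of bandwidth-aware chunk scheduling (assumed, simplified model).
import heapq
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str
    bandwidth_gbps: float      # link bandwidth of this network dimension (hypothetical)
    busy_time_s: float = 0.0   # communication time scheduled on this dimension so far


def schedule_chunks(chunk_sizes_gb, dimensions):
    """Greedily assign each chunk to the currently least-loaded dimension."""
    # Min-heap keyed by accumulated busy time: the load-balancing criterion.
    heap = [(d.busy_time_s, i) for i, d in enumerate(dimensions)]
    heapq.heapify(heap)
    assignment = []
    for size_gb in chunk_sizes_gb:
        _, i = heapq.heappop(heap)
        dim = dimensions[i]
        transfer_time = size_gb * 8 / dim.bandwidth_gbps  # GB -> Gb, then divide by Gbps
        dim.busy_time_s += transfer_time
        assignment.append((size_gb, dim.name))
        heapq.heappush(heap, (dim.busy_time_s, i))
    return assignment


if __name__ == "__main__":
    dims = [Dimension("intra-package", 400.0),
            Dimension("intra-node", 100.0),
            Dimension("scale-out", 25.0)]
    chunks = [0.5] * 16  # an All-Reduce split into 16 chunks of 0.5 GB each (assumed)
    for size, name in schedule_chunks(chunks, dims):
        print(f"{size} GB chunk -> {name}")
    for d in dims:
        print(f"{d.name}: {d.busy_time_s:.2f} s of scheduled traffic")
```

The greedy min-heap criterion is only one way to balance per-dimension load; it is meant to illustrate the scheduling objective, not the specific policy evaluated in the paper.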
