Efficient Communication Acceleration for Next-Gen Scale-up Deep Learning Training Platforms

06/30/2020
by   Saeed Rashidi, et al.

Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects. As the size of DL models and the compute efficiency of the accelerators have continued to increase, there has been a corresponding steady increase in the bandwidth of these interconnects. Systems today provide hundreds of gigabytes per second (GB/s) of interconnect bandwidth via a mix of solutions, such as multi-chip packaging modules (MCM) and proprietary interconnects (e.g., NVLink), that together form the scale-up network of accelerators. However, as we identify in this work, a significant portion of this bandwidth goes under-utilized. This is because (i) using compute cores to execute collective operations such as all-reduce decreases overall compute efficiency, (ii) there is memory bandwidth contention between accesses for arithmetic operations and those for collectives, and (iii) significant internal bus congestion increases the latency of communication operations. To address this challenge, we propose a novel microarchitecture, called the Accelerator Collectives Engine (ACE), for DL collective communication offload. ACE is a smart network interface (NIC) tuned to the high-bandwidth, low-latency requirements of scale-up networks and is able to efficiently drive the various scale-up network systems (e.g., switch-based or point-to-point topologies). We evaluate the benefits of ACE with micro-benchmarks (e.g., single-collective performance) and popular DL models using an end-to-end DL training simulator. For modern DL workloads, ACE increases network bandwidth utilization by 1.97X on average, resulting in 2.71X and 1.44X speedups in iteration time for ResNet-50 and GNMT, respectively.
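To make the offloaded operation concrete, here is a minimal software sketch of the ring all-reduce that a collective engine like ACE would drive in hardware. This is an illustration under simplifying assumptions (a handful of simulated workers in one process, NumPy buffers instead of network transfers), not the paper's implementation; the function and variable names are invented for the example.

```python
# Illustrative sketch only: a single-process simulation of ring all-reduce
# over N "workers". ACE itself is a hardware offload engine; this just shows
# the collective's dataflow. All names and sizes here are assumptions.
import numpy as np


def ring_all_reduce(grads):
    """Sum per-worker gradient buffers so every worker ends with the same
    fully reduced result, via reduce-scatter + all-gather on a logical ring
    (2*(N-1) steps, each moving roughly 1/N of the data per worker)."""
    n = len(grads)
    bufs = [g.astype(np.float64, copy=True) for g in grads]
    # Same chunk boundaries on every worker; chunks are views into bufs.
    chunks = [np.array_split(b, n) for b in bufs]

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            send_id = (i - step) % n              # chunk worker i forwards
            chunks[(i + 1) % n][send_id] += chunks[i][send_id]

    # Phase 2: all-gather. Circulate each fully reduced chunk around the
    # ring so every worker ends up with the complete summed gradient.
    for step in range(n - 1):
        for i in range(n):
            send_id = (i + 1 - step) % n
            chunks[(i + 1) % n][send_id][:] = chunks[i][send_id]

    return bufs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    local = [rng.standard_normal(16) for _ in range(4)]   # 4 simulated workers
    reduced = ring_all_reduce(local)
    assert all(np.allclose(r, sum(local)) for r in reduced)
```

Each of the 2*(N-1) steps exchanges only 1/N of the gradient buffer between ring neighbors; on a real platform, driving those per-step chunk exchanges from the accelerator's compute cores is what contends with the arithmetic and memory traffic of training, which is the contention an ACE-style offload engine is meant to remove.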


