FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems

04/22/2022
by Rui Ma, et al.

Rapid advances in artificial intelligence (AI) technology have led to significant accuracy improvements in a myriad of application domains at the cost of larger and more compute-intensive models. Training such models on massive amounts of data typically requires scaling to many compute nodes and relies heavily on collective communication algorithms, such as all-reduce, to exchange the weight gradients between different nodes. The overhead of these collective communication operations can bottleneck the performance of a distributed AI training system, with more pronounced effects as the number of nodes increases. In this paper, we first characterize the all-reduce operation overhead by profiling distributed AI training. Then, we propose a new smart network interface card (NIC) for distributed AI training systems that uses field-programmable gate arrays (FPGAs) to accelerate all-reduce operations and optimize network bandwidth utilization via data compression. The AI smart NIC frees up the system's compute resources to perform the more compute-intensive tensor operations and increases the overall node-to-node communication efficiency. We perform real measurements on a prototype distributed AI training system comprising 6 compute nodes to evaluate the performance gains of our proposed FPGA-based AI smart NIC compared to a baseline system with regular NICs. We also use these measurements to validate an analytical model that we formulate to predict performance when scaling to larger systems. Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
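The all-reduce operation the abstract profiles is commonly implemented as a ring: each node splits its gradient vector into one segment per node, a reduce-scatter phase circulates and accumulates segments around the ring, and an all-gather phase distributes the fully reduced segments back to every node. The following is a minimal, purely illustrative Python simulation of that ring algorithm; it is not the paper's FPGA implementation, and the function name and structure are assumptions for illustration only.

```python
def ring_all_reduce(grads):
    """Simulate a ring all-reduce (sum) over per-node gradient vectors.

    grads: list of n equal-length lists, one gradient vector per node.
    Returns the vectors every node holds afterwards (all equal to the sum).
    """
    n = len(grads)
    size = len(grads[0])
    # Segment boundaries: segment i covers indices [bounds[i][0], bounds[i][1]).
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]
    data = [list(g) for g in grads]

    # Reduce-scatter: in step s, node r sends segment (r - s) mod n to its
    # right neighbor, which accumulates it. After n-1 steps, node r holds
    # the fully reduced segment (r + 1) mod n.
    for s in range(n - 1):
        snapshot = [list(d) for d in data]  # all nodes send simultaneously
        for r in range(n):
            lo, hi = bounds[(r - s) % n]
            dst = (r + 1) % n
            for i in range(lo, hi):
                data[dst][i] += snapshot[r][i]

    # All-gather: in step s, node r forwards the fully reduced segment
    # (r + 1 - s) mod n to its right neighbor, which overwrites its copy.
    for s in range(n - 1):
        snapshot = [list(d) for d in data]
        for r in range(n):
            lo, hi = bounds[(r + 1 - s) % n]
            dst = (r + 1) % n
            data[dst][lo:hi] = snapshot[r][lo:hi]

    return data
```

Each node transfers roughly 2(n-1)/n of the gradient size in total, so per-node bandwidth demand stays nearly constant while latency grows with n, which is consistent with the abstract's observation that all-reduce overhead becomes more pronounced as the node count increases.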
