Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads

05/12/2021
by Abhinav Jangda, et al.

The recent trend toward increasingly large machine learning models requires both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in deep learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction with a holistic consideration can unlock many optimizations and provide performance improvements in distributed workloads. Manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is time consuming and error-prone. Therefore, we present CoCoNeT, with a DSL to express a program with both computation and communication. CoCoNeT contains several machine-learning-aware transformations to optimize a program and a compiler to generate high-performance kernels. Providing both computation and communication as first-class constructs allows users to work at a high level of abstraction and apply powerful optimizations, such as fusion or overlapping of computation and communication. CoCoNeT enables us to optimize data-, model-, and pipeline-parallel workloads in large language models with only a few lines of code. Experiments show that CoCoNeT significantly outperforms state-of-the-art distributed machine learning implementations.
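To make the idea of overlapping communication with computation concrete, the sketch below shows the kind of transformation the abstract describes, written out manually with PyTorch's asynchronous collectives rather than CoCoNeT's DSL (which is not reproduced here). Per-parameter gradient all-reduces are launched asynchronously so that the reduction of one parameter can overlap with the optimizer update of another, instead of serializing a full all-reduce before any update. The function and variable names (`data_parallel_step`, `model`, `loss_fn`) are hypothetical placeholders, and a process group is assumed to have been initialized with `torch.distributed.init_process_group`.

```python
# Illustrative sketch only, not CoCoNeT's API: manual overlap of
# gradient communication with the parameter update in data parallelism.
import torch
import torch.distributed as dist

def data_parallel_step(model, batch, targets, loss_fn, lr=1e-3):
    # Forward and backward passes produce per-parameter gradients locally.
    loss = loss_fn(model(batch), targets)
    loss.backward()

    # Launch one asynchronous all-reduce per parameter; each returns a
    # Work handle instead of blocking, so communication proceeds in the
    # background while we continue issuing work.
    handles = []
    for p in model.parameters():
        if p.grad is not None:
            handles.append((p, dist.all_reduce(p.grad, async_op=True)))

    # Apply the SGD update for each parameter as soon as its gradient
    # has been reduced, overlapping later reductions with earlier updates.
    with torch.no_grad():
        for p, work in handles:
            work.wait()                       # this parameter's gradient is ready
            p.grad.div_(dist.get_world_size())
            p.add_(p.grad, alpha=-lr)         # update immediately after its comm
            p.grad = None
    return loss.item()
```

CoCoNeT's contribution, as described in the abstract, is to express such computation and communication in a single program and derive optimizations like this fusion/overlap automatically, rather than requiring hand-written interleavings per scenario as in the sketch above.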
