RAF: Holistic Compilation for Deep Learning Model Training

03/08/2023
by Cody Hao Yu, et al.

As deep learning is pervasive in modern applications, many deep learning frameworks have been developed so that practitioners can build and train DNN models rapidly. Meanwhile, as training large deep learning models has become a trend in recent years, training throughput and memory footprint are increasingly critical. Accordingly, optimizing training workloads with compiler optimizations is inevitable and is attracting more and more attention. However, existing deep learning compilers (DLCs) mainly target inference and do not incorporate holistic optimizations, such as automatic differentiation and automatic mixed precision, for training workloads. In this paper, we present RAF, a deep learning compiler for training. Unlike existing DLCs, RAF accepts a forward model and in-house generates a training graph. Accordingly, RAF is able to systematically consolidate graph optimizations for performance, memory, and distributed training. In addition, to match the state-of-the-art performance of hand-crafted kernel libraries as well as tensor compilers, RAF proposes an operator dialect mechanism to seamlessly integrate all possible kernel implementations. We demonstrate that, with in-house training graph generation and the operator dialect mechanism, RAF is able to perform holistic optimizations and achieve either better training throughput or a larger batch size than PyTorch (eager and TorchScript modes), XLA, and DeepSpeed for popular transformer models on GPUs.
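
To make the "in-house training graph generation" idea concrete, below is a minimal, self-contained Python sketch of how a compiler can derive the backward pass and the optimizer update from a forward-only graph, so that the entire training iteration is visible to graph-level optimizations. The Node IR, GRAD_RULES table, and generate_training_graph pass are hypothetical illustrations for this sketch, not RAF's actual data structures or API.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)               # identity-based hashing so nodes can key dicts
class Node:
    """One operator in a toy dataflow graph (hypothetical IR, not RAF's)."""
    op: str
    inputs: list = field(default_factory=list)


# Per-operator gradient rules the AD pass consults: given a node and the
# gradient of the loss w.r.t. its output, emit gradient nodes for its inputs.
GRAD_RULES = {
    "relu": lambda n, dy: [Node("relu_grad", [n.inputs[0], dy])],
    "matmul": lambda n, dy: [
        Node("matmul", [dy, Node("transpose", [n.inputs[1]])]),  # d loss / d x
        Node("matmul", [Node("transpose", [n.inputs[0]]), dy]),  # d loss / d w
    ],
    "mse": lambda n, dy: [Node("mse_grad", [n.inputs[0], n.inputs[1], dy]), None],
}


def generate_training_graph(loss, params, lr=0.01):
    """Derive backward nodes and SGD update nodes from a forward-only graph."""
    order, seen = [], set()

    def visit(n):                  # post-order DFS gives a topological order
        if n not in seen:
            seen.add(n)
            for i in n.inputs:
                visit(i)
            order.append(n)

    visit(loss)
    grads = {loss: Node("ones_like", [loss])}
    for node in reversed(order):   # reverse mode: visit consumers before producers
        dy = grads.get(node)
        if dy is None or node.op not in GRAD_RULES:
            continue
        # Single-consumer toy graph, so no gradient accumulation is needed here.
        for inp, g in zip(node.inputs, GRAD_RULES[node.op](node, dy)):
            if g is not None:
                grads[inp] = g
    # Emit the optimizer step as graph nodes too, so fusion, rematerialization,
    # and scheduling passes can see the whole training iteration at once.
    updates = {p: Node("sub", [p, Node("mul", [Node("const", [lr]), grads[p]])])
               for p in params if p in grads}
    return grads, updates


# The user only writes the forward model; the "compiler" derives the rest.
x, w, label = Node("input"), Node("param"), Node("label")
loss = Node("mse", [Node("relu", [Node("matmul", [x, w])]), label])
_, updates = generate_training_graph(loss, params=[w])
print(updates[w].op)   # "sub": w <- w - lr * grad_w
```

RAF applies this principle on its real IR; the resulting training graph can then be optimized as a whole, and the operator dialect mechanism described in the abstract lets each operator be lowered to whichever kernel implementation (hand-crafted library or tensor-compiler-generated) is available and performs best.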

