Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models

02/06/2023
by   Yuliang Liu, et al.

In recent years, large-scale models have demonstrated state-of-the-art performance across various domains. Training such models, however, requires a range of techniques to work around the limited compute and memory of devices such as GPUs; pipeline parallelism, tensor parallelism, and activation checkpointing are among the most common. While existing work has focused on finding efficient distributed execution plans (Zheng et al. 2022) and activation checkpoint schedules (Herrmann et al. 2019; Beaumont et al. 2021), no method has been proposed to optimize these two plans jointly. Moreover, ahead-of-time compilation relies heavily on accurate memory and compute-overhead estimation, which is often time-consuming and misleading. Existing training systems and machine learning pipelines either physically execute each operator or estimate memory usage with a scaled input tensor. To address these challenges, we introduce a system that jointly optimizes distributed execution and gradient checkpointing plans. Additionally, we provide an easy-to-use symbolic profiler that generates memory and compute statistics for any PyTorch model at minimal time cost. Our approach lets users parallelize model training on their given hardware with minimal code changes. The source code is publicly available at https://github.com/hpcaitech/ColossalAI.
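The abstract does not spell out how the symbolic profiler avoids physical execution, but the general technique is to trace the model and propagate only shapes and dtypes. Below is a minimal sketch of that idea in stock PyTorch (torch.fx tracing plus the "meta" device, assuming PyTorch 2.x); the model `MLP` and the helper `estimate_param_memory` are illustrative and are not Colossal-AI's actual API:

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

def estimate_param_memory(model: nn.Module) -> int:
    # Shapes and dtypes are known on meta tensors, so memory can be
    # computed without allocating or touching any real device memory.
    return sum(p.numel() * p.element_size() for p in model.parameters())

class MLP(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.fc1 = nn.Linear(dim, 4 * dim)
        self.fc2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Instantiate directly on the "meta" device: no real storage is allocated.
with torch.device("meta"):
    model = MLP()

gm = symbolic_trace(model)  # record the forward graph without running kernels
print(gm.graph)             # one node per operator, usable for cost estimation
print(f"parameter memory: {estimate_param_memory(model) / 2**20:.1f} MiB")
```

A fuller profiler along these lines would also walk `gm.graph` to accumulate per-node activation sizes and FLOP counts, which is the kind of statistic a joint plan search consumes.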
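Activation checkpointing, one half of the joint search space, is already exposed in stock PyTorch. As a point of reference, a hand-written checkpointing plan (the kind of decision Colossal-Auto is described as making automatically) might look like the sketch below; checkpointing every other block is an arbitrary illustrative policy, not the paper's algorithm:

```python
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    [torch.nn.Linear(1024, 1024) for _ in range(4)]
)

def forward(x):
    for i, block in enumerate(blocks):
        if i % 2 == 0:
            # Drop this block's activations in the forward pass and
            # recompute them during backward, trading FLOPs for memory.
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x

x = torch.randn(8, 1024, requires_grad=True)
forward(x).sum().backward()
```

The paper's claim is that this per-block recomputation decision and the distributed execution plan are searched jointly, rather than tuned as two independent heuristics.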


