Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training

11/10/2021
by   Can Karakus, et al.

With deep learning models rapidly growing in size, systems-level solutions for large-model training are required. We present Amazon SageMaker model parallelism, a software library that integrates with PyTorch and enables easy training of large models using model parallelism and other memory-saving features. In contrast to existing solutions, the SageMaker library is more generic and flexible: it can automatically partition and run pipeline parallelism over arbitrary model architectures with minimal code change, and it offers a general, extensible framework for tensor parallelism that supports a wider range of use cases and is modular enough to be easily applied to new training scripts. The library also preserves the native PyTorch user experience to a much larger degree, supporting module re-use and dynamic graphs while giving the user full control over the details of the training step. We evaluate performance on GPT-3, RoBERTa, BERT, and neural collaborative filtering, and demonstrate competitive performance against existing solutions.
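As context for the "minimal code change" claim, the sketch below shows roughly what adapting a standard PyTorch training loop to the library can look like, following the documented smdistributed.modelparallel.torch usage pattern (smp.init, smp.DistributedModel, smp.DistributedOptimizer, and the @smp.step decorator). The SimpleNet model, synthetic data, and hyperparameters are placeholders, and the script assumes it is launched as a SageMaker training job with the library installed and model parallelism enabled in the job configuration; it is an illustrative sketch, not the paper's reference implementation.

```python
# Minimal sketch (not from the paper): adapting a PyTorch training script to the
# SageMaker model parallelism library, following its documented usage pattern.
import torch
import torch.nn as nn
import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp


class SimpleNet(nn.Module):
    # Placeholder model; the library partitions arbitrary nn.Module graphs.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1024, 4096)
        self.fc2 = nn.Linear(4096, 10)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))


@smp.step
def train_step(model, data, target):
    # Runs per microbatch; the library handles microbatch splitting and
    # pipeline scheduling around this user-defined training step.
    output = model(data)
    loss = F.cross_entropy(output, target)
    model.backward(loss)  # replaces loss.backward() under model parallelism
    return loss


def main():
    smp.init()  # initialize the model-parallel runtime
    torch.cuda.set_device(smp.local_rank())
    device = torch.device("cuda")

    # Wrap the model and optimizer; the library auto-partitions the model.
    model = smp.DistributedModel(SimpleNet())
    optimizer = smp.DistributedOptimizer(
        torch.optim.Adam(model.parameters(), lr=1e-4)
    )

    # Synthetic data stands in for a real DataLoader.
    for _ in range(10):
        data = torch.randn(32, 1024, device=device)
        target = torch.randint(0, 10, (32,), device=device)

        optimizer.zero_grad()
        loss_mb = train_step(model, data, target)  # per-microbatch outputs
        loss = loss_mb.reduce_mean()               # average across microbatches
        optimizer.step()

        if smp.rank() == 0:
            print(f"loss: {loss.item():.4f}")


if __name__ == "__main__":
    main()
```

The decorated train_step is what lets the user keep full control over the contents of the training step, while the library transparently handles partition placement, microbatching, and pipelined execution around it.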


Related research

07/25/2022 - Dive into Big Model Training
The increasing scale of model size and continuous improvement of perform...

06/10/2022 - Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
Foundation models are becoming the dominant deep learning technologies. ...

04/21/2020 - torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models
We design and implement a ready-to-use library in PyTorch for performing...

03/30/2021 - Automatic Graph Partitioning for Very Large-scale Deep Learning
This work proposes RaNNC (Rapid Neural Network Connector) as middleware ...

08/30/2023 - Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency
Large-scale language models have become increasingly challenging and exp...

01/20/2023 - ATP: Adaptive Tensor Parallelism for Foundation Models
Foundation models have impressive performance and generalization capabil...

12/06/2021 - Automap: Towards Ergonomic Automated Parallelism for ML Models
The rapid rise in demand for training large neural network architectures...
