PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

04/21/2023
by Yanli Zhao, et al.

It is widely acknowledged that large models can deliver superior performance across a broad range of domains. Despite remarkable progress in machine learning systems research, which has enabled the development and exploration of large models, these capabilities remain confined to a small group of advanced users and industry leaders, creating an implicit technical barrier that keeps the wider community from accessing and leveraging these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components, including the Tensor implementation, the dispatcher system, and the CUDA memory caching allocator, to provide a non-intrusive user experience and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. Experimental results demonstrate that FSDP achieves performance comparable to Distributed Data Parallel while supporting significantly larger models with near-linear scalability in terms of TFLOPS.
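To illustrate the non-intrusive user experience the abstract describes, the sketch below wraps a toy model with the public `torch.distributed.fsdp` API. The model definition, hyperparameters, and `torchrun` launch setup are illustrative assumptions, not taken from the paper; only the `FullyShardedDataParallel` wrapper itself is the API under discussion.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Minimal sketch: launch with e.g. `torchrun --nproc_per_node=8 train.py`,
# which sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# A toy model standing in for a large network (illustrative, not from the paper).
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Wrapping in FSDP shards the parameters across the ranks in the process group;
# full parameters are gathered on the fly around each forward/backward pass.
model = FSDP(model.cuda())

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
inp = torch.randn(8, 1024, device="cuda")
loss = model(inp).sum()
loss.backward()  # gradients are reduce-scattered to the owning shards
optim.step()     # each rank updates only its local parameter shard

dist.destroy_process_group()
```

Many of the techniques and settings the paper refers to surface as constructor options on the wrapper (for example, sharding strategies and auto-wrapping policies); the FSDP documentation covers which settings suit a given hardware configuration.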


