Saturn: An Optimized Data System for Large Model Deep Learning Workloads

09/03/2023
by Kabir Nagrecha, et al.

Large language models such as GPT-3 and ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of "model parallelism" techniques and tools. Navigating such parallelism choices, however, is a new burden for end users of DL such as data scientists and domain scientists, who may lack the necessary systems know-how. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and SchedulE. We propose a new information system architecture to tackle the SPASE problem holistically, representing a key step toward enabling wider adoption of large DL models. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as a mixed-integer linear program (MILP). We find that direct use of an MILP solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than typical current DL practice.
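To make the SPASE idea concrete, the sketch below shows a joint parallelism-selection and resource-allocation MILP written with PuLP. It is not Saturn's actual formulation: it drops the scheduling dimension by assuming all models run concurrently on a fixed GPU pool, and the model names, parallelism labels, GPU counts, and profiled runtimes are made-up placeholders standing in for the automated profiler's output.

```python
# Illustrative MILP in the spirit of SPASE (simplified; not Saturn's formulation).
# Each model picks one (parallelism, GPU count) configuration; all models run
# concurrently on a shared pool, and we minimize the longest runtime (makespan).
import pulp

GPUS = 8                                   # total GPUs in the cluster (assumed)
models = ["bert-large", "gpt2-xl"]         # jobs produced by model selection
parallelisms = ["fsdp", "pipeline"]        # candidate parallelism techniques
gpu_options = [2, 4, 8]                    # allowed GPU apportionments

# Placeholder "profiler" output: estimated runtime (minutes) per configuration.
runtime = {
    ("bert-large", "fsdp", 2): 90, ("bert-large", "fsdp", 4): 50,
    ("bert-large", "fsdp", 8): 35, ("bert-large", "pipeline", 2): 80,
    ("bert-large", "pipeline", 4): 55, ("bert-large", "pipeline", 8): 40,
    ("gpt2-xl", "fsdp", 2): 200, ("gpt2-xl", "fsdp", 4): 120,
    ("gpt2-xl", "fsdp", 8): 75, ("gpt2-xl", "pipeline", 2): 180,
    ("gpt2-xl", "pipeline", 4): 110, ("gpt2-xl", "pipeline", 8): 70,
}

prob = pulp.LpProblem("spase_sketch", pulp.LpMinimize)

# x[(m, p, g)] = 1 if model m trains with parallelism p on g GPUs.
x = pulp.LpVariable.dicts(
    "x",
    [(m, p, g) for m in models for p in parallelisms for g in gpu_options],
    cat=pulp.LpBinary,
)
makespan = pulp.LpVariable("makespan", lowBound=0)

prob += makespan  # objective: minimize the longest-running model

for m in models:
    # each model picks exactly one (parallelism, GPU count) configuration
    prob += pulp.lpSum(x[(m, p, g)] for p in parallelisms for g in gpu_options) == 1
    # the makespan must cover the chosen configuration's estimated runtime
    prob += makespan >= pulp.lpSum(
        runtime[(m, p, g)] * x[(m, p, g)] for p in parallelisms for g in gpu_options
    )

# all models run concurrently, so their GPU apportionments must fit the pool
prob += pulp.lpSum(
    g * x[(m, p, g)] for m in models for p in parallelisms for g in gpu_options
) <= GPUS

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (m, p, g), var in x.items():
    if var.value() is not None and var.value() > 0.5:
        print(f"{m}: {p} parallelism on {g} GPUs")
print("estimated makespan (min):", pulp.value(makespan))
```

The full problem described in the abstract additionally decides when each model runs, which is what makes an off-the-shelf MILP solver and the introspective scheduling optimization relevant; this sketch only conveys the flavor of jointly choosing parallelisms and GPU apportionments under a shared resource budget.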


Related research

Hydra: A System for Large Multi-Model Deep Learning (10/16/2021)
In many deep learning (DL) applications, the desire for ever higher accu...

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism (11/25/2022)
Transformer models have achieved state-of-the-art performance on various...

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning (03/15/2023)
In recent years, the training requirements of many state-of-the-art Deep...

LAMP: Large Deep Nets with Automated Model Parallelism for Image Segmentation (06/22/2020)
Deep Learning (DL) models are becoming larger, because the increase in m...

Decoupled Model Schedule for Deep Learning Training (02/16/2023)
Recent years have seen an increase in the development of large deep lear...

Measuring Discrimination to Boost Comparative Testing for Multiple Deep Learning Models (03/07/2021)
The boom of DL technology leads to massive DL models built and shared, w...

FfDL: A Flexible Multi-tenant Deep Learning Platform (09/14/2019)
Deep learning (DL) is becoming increasingly popular in several applicati...
