Hydra: A System for Large Multi-Model Deep Learning

10/16/2021
by Kabir Nagrecha, et al.

In many deep learning (DL) applications, the desire for ever-higher accuracy and the new ubiquity of transfer learning have led to a marked increase in the size and depth of model architectures. Thus, the memory capacity of GPUs is often a bottleneck for DL practitioners. Existing techniques that partition the model architecture across a network of GPUs suffer from substantial underutilization and busy waiting due to the sequential dependencies in most large-scale model architectures (Transformers, CNNs). We observe that almost all such prior large-model systems focus on training only one model at a time, but in reality DL practitioners often train many models in bulk due to model selection needs, e.g., hyper-parameter tuning, architecture fine-tuning, etc. This gap leads to significant system inefficiency. We approach this problem from first principles and propose a new information system architecture for scalable multi-model training that adapts and blends ideas from classical RDBMS design with task parallelism from the ML world. We propose a suite of techniques to optimize system efficiency holistically, including a highly general parameter-spilling design that enables large models to be trained even with a single GPU, a novel multi-query optimization scheme that blends model execution schedules efficiently and maximizes GPU utilization, and a double-buffering idea to hide latency. We prototype our ideas on top of PyTorch to build a system we call Hydra. Experiments with real benchmark large-scale multi-model DL workloads show that Hydra is over 7x faster than regular model parallelism and 1.8-4.5x faster than state-of-the-art industrial tools for large-scale model training.
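The abstract names three mechanisms: parameter spilling to host memory so that a model larger than GPU RAM can still be executed on a single GPU, schedule blending across the many models of a model-selection workload, and double buffering to hide transfer latency. Since the prototype is built on PyTorch, the following is a minimal, hypothetical forward-pass sketch of the spilling plus double-buffering idea; SpilledSequential, its shard layout, and the stream handling are illustrative assumptions rather than Hydra's actual API, and a full training system would additionally have to manage the backward pass, gradient placement, and the multi-model scheduler.

import torch
import torch.nn as nn

# Hypothetical sketch: model shards live in host (CPU) DRAM and are copied to
# the GPU only when needed; the next shard is prefetched on a separate CUDA
# stream so the copy overlaps with compute (double buffering). Async H2D
# copies only truly overlap when the source tensors are in pinned memory.
class SpilledSequential(nn.Module):
    def __init__(self, shards, device="cuda"):
        super().__init__()
        self.shards = nn.ModuleList(shards)      # all shards start on CPU
        self.device = device
        self.copy_stream = torch.cuda.Stream()   # stream used for prefetching

    def _prefetch(self, shard):
        # Asynchronously move one shard's parameters to the GPU.
        with torch.cuda.stream(self.copy_stream):
            shard.to(self.device, non_blocking=True)

    def forward(self, x):
        x = x.to(self.device)
        self._prefetch(self.shards[0])
        for i, shard in enumerate(self.shards):
            # Make sure the prefetch of the current shard has finished.
            torch.cuda.current_stream().wait_stream(self.copy_stream)
            if i + 1 < len(self.shards):
                self._prefetch(self.shards[i + 1])   # overlap copy with compute
            x = shard(x)
            shard.to("cpu")                          # spill the shard back to DRAM
        return x

# Example: a "large" model expressed as CPU-resident shards.
if torch.cuda.is_available():
    shards = [nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()) for _ in range(8)]
    model = SpilledSequential(shards)
    out = model(torch.randn(16, 4096))

Because only one shard (plus the one being prefetched) is resident on the GPU at a time, peak device memory scales with the largest shard rather than with the whole model, which is the property that lets a single GPU train models that exceed its memory capacity.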


