Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

12/04/2020
by Woosuk Kwon, et al.

Deep learning (DL) frameworks take advantage of GPUs to speed up DL inference and training. Ideally, a DL framework should fully utilize the computation power of GPUs so that the running time depends only on the amount of computation assigned to the GPU. Yet we observe that, in scheduling GPU tasks, existing DL frameworks suffer from inefficiencies such as large scheduling overhead and unnecessary serial execution. To this end, we propose Nimble, a DL execution engine that runs GPU tasks in parallel with minimal scheduling overhead. Nimble introduces a novel technique called ahead-of-time (AoT) scheduling: the scheduling procedure finishes before the GPU kernels are executed, thereby removing most of the scheduling overhead at run time. Furthermore, Nimble automatically parallelizes the execution of GPU tasks by exploiting multiple streams within a single GPU. Evaluation on a variety of neural networks shows that, compared to PyTorch, Nimble speeds up inference and training by up to 22.34× and 3.61×, respectively. Moreover, Nimble outperforms the state-of-the-art inference systems TensorRT and TVM by up to 2.81× and 1.70×, respectively.
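
The paper's AoT scheduling is Nimble's own mechanism, but PyTorch's CUDA Graphs API gives a rough feel for the idea: record the kernel-launch trace once ahead of time, then replay it so the CPU-side scheduler does almost no work per run. The sketch below is an analogy under that assumption, not Nimble's implementation; the model and shapes are made up for illustration.

```python
import torch

# Analogy to AoT scheduling using PyTorch's CUDA Graphs (torch >= 1.10):
# capture the kernel launch sequence once, then replay it with near-zero
# CPU-side scheduling cost per run. Not Nimble's actual API.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
).cuda().eval()
static_in = torch.randn(8, 512, device="cuda")

# Warm up on a side stream, as the CUDA Graphs docs require before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture: the kernels are recorded into the graph, not executed.
g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_out = model(static_in)

# Run time: copy new data into the static input buffer and replay the
# whole pre-scheduled kernel sequence in a single call.
static_in.copy_(torch.randn(8, 512, device="cuda"))
g.replay()
print(static_out.sum().item())
```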
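
Likewise, the stream-level parallelism Nimble automates can be sketched by hand with PyTorch's stream API: two independent GPU tasks are enqueued on separate CUDA streams so the hardware is free to overlap them. A minimal sketch with made-up tensors and branches; Nimble derives such stream assignments automatically.

```python
import torch

# Minimal sketch of multi-stream parallelism on a single GPU: two
# independent GPU tasks on separate CUDA streams may overlap in time.
x = torch.randn(2048, 2048, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s1):
    y1 = x @ x          # branch 1, enqueued on stream s1
with torch.cuda.stream(s2):
    y2 = torch.relu(x)  # branch 2 on stream s2, potentially overlapping

# Join both branches back before using the results on the default stream.
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
out = y1 + y2
```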


