Accelerating Multi-Model Inference by Merging DNNs of Different Weights

09/28/2020
by   Joo Seong Jeong, et al.

Standardized DNN models that have been proven to perform well on machine learning tasks are widely used and often adopted as-is to solve downstream tasks, forming the transfer learning paradigm. However, when serving multiple instances of such DNN models from a cluster of GPU servers, existing techniques for improving GPU utilization, such as batching, are inapplicable because the models usually do not share weights after fine-tuning. We propose NetFuse, a technique for merging multiple DNN models that share the same architecture but have different weights and different inputs. NetFuse works by replacing operations with more general counterparts that associate each set of weights with only a certain set of inputs. Experiments on ResNet-50, ResNeXt-50, BERT, and XLNet show that NetFuse can speed up DNN inference by up to 3.6x on an NVIDIA V100 GPU and up to 3.0x on a TITAN Xp GPU when merging 32 model instances, while consuming only a small amount of additional GPU memory.
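To illustrate the idea described in the abstract, the following is a minimal sketch (not the authors' implementation) of how a dense layer can be replaced by a more general, batched counterpart so that each model instance keeps its own weights while all instances execute as one operation. All names (merged_linear, n_models, etc.) and the use of NumPy einsum are illustrative assumptions, not details from the paper.

```python
# Sketch: merging N fine-tuned instances of the same layer.
# Ordinary batching needs one shared weight tensor for all inputs;
# a batched matmul instead associates weight set i with input i only,
# so all instances can run in a single kernel launch.

import numpy as np

n_models, d_in, d_out = 4, 8, 16

# One weight matrix and bias per fine-tuned model instance.
weights = np.random.randn(n_models, d_in, d_out).astype(np.float32)
biases = np.random.randn(n_models, d_out).astype(np.float32)

# One input vector per model instance.
inputs = np.random.randn(n_models, d_in).astype(np.float32)

def per_model_linear(weights, biases, inputs):
    """Baseline: run each instance's layer separately, one call per model."""
    return np.stack([x @ w + b for x, w, b in zip(inputs, weights, biases)])

def merged_linear(weights, biases, inputs):
    """Merged: one batched matmul; weight set i touches only input i."""
    return np.einsum("ni,nio->no", inputs, weights) + biases

# Both formulations produce the same outputs.
assert np.allclose(per_model_linear(weights, biases, inputs),
                   merged_linear(weights, biases, inputs), atol=1e-5)
```

On a GPU, the merged form amortizes kernel-launch and memory-access overhead across instances, which is the source of the speedups reported in the abstract; the extra memory cost is only the per-instance weight copies.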


Related research:

08/08/2020 · Spatial Sharing of GPU for Autotuning DNN models
GPUs are used for training, inference, and tuning the machine learning m...

06/02/2023 · Resolving Interference When Merging Models
Transfer learning - i.e., further fine-tuning a pre-trained model on a d...

09/18/2021 · Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem
Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA A100 GPUs...

02/02/2022 · Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers
Deep neural networks (DNNs) have grown exponentially in complexity and s...

11/28/2021 · Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU
With the fast development of deep neural networks (DNNs), many real-worl...

01/19/2022 · GEMEL: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge
Video analytics pipelines have steadily shifted to edge deployments to r...

07/24/2018 · Supporting Very Large Models using Automatic Dataflow Graph Partitioning
There is a trend towards using very large deep neural networks (DNN) to ...
