
Accelerating Multi-Model Inference by Merging DNNs of Different Weights

by Joo Seong Jeong, et al.

Standardized DNN models that have been proven to perform well on machine learning tasks are widely used and often adopted as-is to solve downstream tasks, forming the transfer learning paradigm. However, when serving multiple instances of such DNN models from a cluster of GPU servers, existing techniques for improving GPU utilization, such as batching, are inapplicable because the models usually do not share weights after fine-tuning. We propose NetFuse, a technique for merging multiple DNN models that share the same architecture but have different weights and different inputs. NetFuse is made possible by replacing operations with more general counterparts that allow a set of weights to be associated with only a certain set of inputs. Experiments on ResNet-50, ResNeXt-50, BERT, and XLNet show that NetFuse can speed up DNN inference by up to 3.6x on an NVIDIA V100 GPU and up to 3.0x on a TITAN Xp GPU when merging 32 model instances, while using only a small amount of additional GPU memory.
