Accelerating Multi-Model Inference by Merging DNNs of Different Weights

by Joo Seong Jeong, et al.

Standardized DNN models that have been proven to perform well on machine learning tasks are widely used and often adopted as-is for downstream tasks, forming the transfer learning paradigm. However, when serving multiple instances of such models from a cluster of GPU servers, existing techniques for improving GPU utilization, such as batching, are inapplicable because fine-tuning leaves the models with distinct weights. We propose NetFuse, a technique for merging multiple DNN models that share the same architecture but have different weights and different inputs. NetFuse works by replacing operations with more general counterparts that allow each set of weights to be associated with only a certain set of inputs. Experiments on ResNet-50, ResNeXt-50, BERT, and XLNet show that NetFuse can speed up DNN inference by up to 3.6x on an NVIDIA V100 GPU and up to 3.0x on a TITAN Xp GPU when merging 32 model instances, while using only a small amount of additional GPU memory.
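The abstract's core idea — replacing N per-model operations with a single, more general operation in which each weight set touches only its own inputs — can be sketched with a batched matrix multiply. This is an illustrative NumPy sketch, not the paper's implementation; the function name and shapes are assumptions:

```python
import numpy as np

def fused_linear(inputs, weights, biases):
    """Apply N different weight matrices to N different input batches
    in one batched matmul, instead of launching N separate GEMMs.

    inputs:  (N, B, D_in)   -- one input batch per model instance
    weights: (N, D_in, D_out)
    biases:  (N, D_out)
    returns: (N, B, D_out)
    """
    # einsum 'nbi,nio->nbo' is a batched matmul: model n's weights are
    # applied only to model n's inputs, so one kernel serves all N models.
    return np.einsum("nbi,nio->nbo", inputs, weights) + biases[:, None, :]

rng = np.random.default_rng(0)
N, B, D_in, D_out = 4, 2, 8, 5
x = rng.standard_normal((N, B, D_in))
W = rng.standard_normal((N, D_in, D_out))  # different weights per instance
b = rng.standard_normal((N, D_out))

fused = fused_linear(x, W, b)
# Sanity check: the fused result matches running each model separately.
separate = np.stack([x[i] @ W[i] + b[i] for i in range(N)])
assert np.allclose(fused, separate)
```

The speedup in the paper comes from the GPU executing one large kernel instead of many small ones; the sketch above only demonstrates the correctness-preserving transformation, not the scheduling or memory-layout details.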


