TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments

11/24/2018 ∙ by Abdul Dakkak, et al.

Deep neural networks (DNNs) have become core computation components within low latency Function as a Service (FaaS) prediction pipelines, including image recognition, object detection, natural language processing, speech synthesis, and personalized recommendation pipelines. Cloud computing, as the de-facto backbone of modern computing infrastructure for both enterprise and consumer applications, has to be able to handle user-defined pipelines of diverse DNN inference workloads while maintaining isolation and latency guarantees and minimizing resource waste. The current solution for guaranteeing isolation within FaaS is suboptimal: it suffers from "cold start" latency. A major cause of this inefficiency is the need to move large amounts of model data within and across servers. We propose TrIMS as a novel solution to address these issues. TrIMS consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy; an efficient resource management layer that provides isolation; and a succinct set of application APIs and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework, achieving up to 24x speedup in latency for image classification models and up to 210x speedup for large models, along with up to 8x system throughput improvement.




1 Introduction

Figure 1: Percentage of time spent in model loading, inference computation, and image preprocessing for “cold start” online DL inference using CPU and GPU for MXNet, Caffe, Caffe2, and TensorFlow on an IBM S822LC with Pascal GPUs. The speedup of using GPU over CPU for the inference compute alone is shown between the pie charts. Inference time for all frameworks is dominated by model loading except for small models, such as SqueezeNet, where the model size is a few megabytes. For TensorFlow, high GPU initialization overhead impacts the end-to-end time and the achieved speedup.

Recent trends in computing show a confluence between artificial intelligence, driven primarily by deep learning (DL), and cloud computing, with both gaining traction within enterprise and consumer applications. Key to this trend is the superior performance, accessibility, and accuracy of deep neural networks (DNNs) in a wide array of intelligent tasks such as image recognition, object detection, natural language understanding, speech synthesis, and personalized recommendation.

Today, many business-logic and consumer applications rely on DL inference as core components within their prediction pipelines. These pipelines tend to be deployed to the cloud through Function as a Service (FaaS) platforms [8, 1, 5, 10], since they abstract away low-level details such as system setup, dev-ops, and monitoring, promising service isolation, decentralization, and scalability while still being more cost-effective than dedicated servers. Since FaaS services execute arbitrary user pipelines, FaaS systems must execute code in isolation, through virtual machines (VMs) or containers.

Current off-the-shelf DL inference [12, 2, 7, 4, 11] is performed through HTTP APIs and uses pre-built general models (model catalogs) deployed by the cloud provider or user-defined models deployed by the user. Within the FaaS pipelines, users interact with these models using the HTTP inference APIs and construct their prediction pipelines by defining glue code that parses the input, performs the model prediction, and processes the output. There are two ways to perform model inference: batch prediction and online prediction. Batch prediction is performed offline on a large set of inputs, while online prediction is usually performed in real time on a one-by-one basis [15, 18]. In this paper we focus on online prediction within a latency sensitive FaaS prediction pipeline.

DL service providers are aware of the “cold start” cost of inference, and therefore eagerly persist models within their catalog, keeping the models in memory (“warm”) to guarantee the promised latency. For example, Amazon ML attempts to respond to most real-time prediction requests within a fixed latency target [16]. Without model persistence, network overhead contributes a significant portion of the end-to-end inference latency. For user deployed models, the inference latency is not only affected by the network, but is also dominated by the model inference “cold start”. The “cold start” latency can be seconds to minutes depending on the model size and the deployment setup. To avoid the “cold start” overhead, users have to pay [13, 3] an hourly cost to persist their models.

FaaS can be used to express latency sensitive prediction pipelines that leverage a chain or ensemble of models. However, the current practice of integrating FaaS with model catalogs is inefficient for this usage: the network latency associated with the inference limits how complex or intelligent a pipeline can be, putting these pipelines out of reach for all but the cloud giants. For example, Google Translate targets a per-sentence end-to-end latency budget to avoid user-visible degradation of service [35]. To meet the latency requirement, Google implements a monolithic in-house pipeline that uses fast intranet interconnects. Current FaaS users cannot express such a complex pipeline using modular DL inferences and achieve comparable latency.

Figure 2: The DL inference graph for AlexNet [47]. The input dimensions and the memory footprint are shown in Table 1.

Cloud computing, as the de-facto backbone of modern computing infrastructure, has to be able to enable this scenario in a cost-effective way. We envision a future FaaS infrastructure that avoids the network overhead, making it feasible to build complex latency sensitive pipelines from modular DL inference components while better leveraging the hardware resources. This enables the development of complex applications based on FaaS; e.g. users can build a personal assistant (similar to Amazon’s Alexa or Apple’s Siri) by employing off-the-shelf DL inference components and still achieve latency comparable to that of the complex monolithic applications from cloud giants. To achieve this goal, we advocate for collocating prediction pipelines with model serving within FaaS, effectively bringing the compute nearer to the model and circumventing the network latency.

The idea of collocating compute with data is not new and has been explored in other domains such as databases and near memory acceleration. This paper does not deal with the mechanics of collocating compute and data, since they have been explored elsewhere [29, 76, 50, 36, 49, 37]. Instead, we tackle the challenge faced by collocating model serving and user code within FaaS: the current method of user code isolation incurs a high “cold start” latency for each invocation of the DL inference in the pipeline.

We observe that for “cold start” model inference, model loading (I/O, data structure deserialization, GPU data movement) is the main source of “cold start” latency. Figure 1 shows the “cold start” inference time breakdown for popular DL frameworks: Caffe [46], Caffe2 [45], MXNet [22], and TensorFlow [19]. For GPU inference, data movement is another contributing factor making GPU less attractive for accelerating inference — even though GPUs offer a significant compute speed advantage, as shown in Figure 1.

We also observe that in a cloud setting DL models are shared extensively across user FaaS pipelines. For example, Google reported that natural translation models can accommodate a large fraction of their translation requests [7]. Because model parameters are constant, we can leverage model sharing across pipelines by persisting model parameters in GPU and/or CPU memory, hence eliminating the model loading overhead, decreasing the end-to-end latency, and reducing the memory footprint for DL inferences.

In this paper, we propose a Transparent and Isolated Model Sharing (TrIMS) scheme to address the “cold start” latency challenge faced by collocating user code with model catalogs within FaaS — it does so while maintaining the isolation constraints, minimizing model-loading overhead, and increasing hardware resource utilization. We describe TrIMS’s model resource manager (MRM) which offers a multi-tiered cache for DL models to be shared across user pipelines. By decreasing model loading and data movement overhead, TrIMS decreases latency of end-to-end model inference, making inference on GPU a viable target. TrIMS also increases memory efficiency for cloud data centers while maintaining accuracy.

Specifically, this paper makes the following contributions:

  • We characterize the “cold start” overhead for online DL model inference across popular DL frameworks, such as Caffe, Caffe2, MXNet, and TensorFlow, on both CPUs and GPUs and identify model loading as the bottleneck.

  • We propose TrIMS to mitigate the model loading overhead faced by collocating user code with model catalogs within FaaS, and increase the model serving efficiency by sharing DL models across all levels of the memory hierarchy in the cloud environment — GPU, CPU, local storage, and remote storage. To our knowledge, this work is the first to propose sharing DL models across isolated online prediction pipelines while increasing hardware efficiency and decreasing latency.

  • We implement TrIMS within Apache MXNet [22] and evaluate the impact on online inference performance for a representative set of models and systems. We show that TrIMS provides up to 24x speedup on small (less than 600MB) models and up to 210x speedup on large (up to 6GB) models, comes close to the ideal speedup (with ideal meaning that model loading and data movement take no time, i.e. the same as persisting the model), and gives up to 8x system throughput improvement without loss of accuracy.

  • TrIMS eliminates a substantial part of the non-compute components of the end-to-end latency, making DL model inference on GPUs and other novel compute accelerators more viable. We identify the remaining latency components for inference, motivating future microarchitecture techniques for further inference latency improvements.

  • We architect TrIMS so that it can be easily integrated with existing frameworks without user code changes. The method is designed to be compatible with existing framework usage patterns, and requires minimal modifications for framework developers.

The rest of this paper is organized as follows: Sections 2 and 3 describe current overheads and practice for inference serving. Sections 4 and 5 detail our design and implementation. Section 6 describes our evaluation setup and experiment results. Section 7 outlines related work. Section 8 concludes.

Figure 3: An example of using DL inference in the cloud. Application code calls functions from the user's deployed model or a model catalog. The code is then provisioned by the cloud provider onto a container running on a server. The code performs API calls to perform AlexNet inference and to the scene understanding API. AlexNet is deployed by users through the cloud provider’s cloud deployment mechanism.

2 Deep Learning Inference Overhead

Index Name Dim MF (MB)
1 conv1_bias
2 conv1_weight
3 conv2_weight
4 conv2_bias
5 conv3_weight
6 conv3_bias
7 conv4_bias
8 conv4_weight
9 conv5_weight
10 conv5_bias
11 fc6_bias
12 fc6_weight
13 fc7_weight
14 fc7_bias
15 fc8_bias
16 fc8_weight
Table 1: Memory footprint (MF) for each layer in Figure 2.

A single DL inference is much less computationally intensive than training, making it more sensitive to the data loading and deserialization overhead. A DL inference compute graph is a DAG composed of a set of network layers. Each computational layer is parameterized through weights and constants. The model parameters along with the compute topology identify the model. (Throughout this paper, sharing a layer means that we are sharing both the weights and constants that parameterize the layer.) Each layer operator is a function of the incoming edges in the graph and the weights/constants. An inference pass iterates through the layers of a compute graph and applies the layer operators to its input. Figure 2 shows the inference compute graph for AlexNet [47] and Table 1 lists the dimension and memory footprint for each layer.
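As a rough illustration of how a layer's memory footprint follows from its dimensions, the sketch below multiplies the dimensions of a parameter tensor by the element size. The function names and the 32-bit float default are assumptions made for this example; they are not taken from the paper or from any framework.

```cpp
#include <cstdint>
#include <vector>

// Bytes needed to store a dense tensor with the given dimensions.
// elem_size defaults to 4 bytes (32-bit float), an assumption of this sketch.
inline std::uint64_t tensor_bytes(const std::vector<std::uint64_t>& dims,
                                  std::uint64_t elem_size = 4) {
  std::uint64_t count = 1;
  for (std::uint64_t d : dims) count *= d;
  return count * elem_size;
}

// Total footprint, in bytes, of a set of layer parameters.
inline std::uint64_t model_bytes(
    const std::vector<std::vector<std::uint64_t>>& layers,
    std::uint64_t elem_size = 4) {
  std::uint64_t total = 0;
  for (const auto& dims : layers) total += tensor_bytes(dims, elem_size);
  return total;
}
```

Using the commonly published AlexNet shapes (not values read from Table 1), `tensor_bytes({96, 3, 11, 11})` gives 139,392 bytes for conv1_weight and `tensor_bytes({4096, 9216})` gives 150,994,944 bytes (144 MiB) for fc6_weight, which shows why the fully connected layers dominate the footprint.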

For GPUs, the compute graph and associated weights are loaded and copied to GPU memory ahead of the computation. Memory for intermediate layer outputs also needs to be allocated. AlexNet, for example, requires extra GPU memory to store the intermediate results during the inference process. These intermediate outputs are not constant and cannot be shared, since they depend on the user’s input. However, layer weights are constant and can be shared across processes. For AlexNet, this means the constant weight data listed in Table 1 can be shared.

When compute is optimized, the overhead of model loading is magnified. Figure 1 shows that the GPU outperforms the CPU in terms of compute, thus making model loading a bottleneck for end-to-end inference. Without data transfer overhead, the NVIDIA Tesla V100 GPU using Tensor Cores can achieve higher throughput on both CNNs and RNNs compared to a high-end CPU server [14]. Reducing the data movement overhead makes the GPU a more appealing option for DL inference.

To mitigate the model loading overhead, cloud services and previous work [54, 21, 27] persist model catalogs in memory or perform inference in batches. These strategies require knowledge of the model requests, have potential resource waste since the system resources are persisted within processes for models even when they are not used, or increase the latency of requests if batching the inferences.

3 Current Prediction Pipelines in FaaS

Function as a Service (FaaS) is a cost-effective way for users to deploy functions or pipelines that are executed within the cloud. Users define prediction pipelines that use models they deployed or ones found within the model catalog. The pipelines are then mapped to a fabric of containers — used to maintain software stack separation, virtualize system resources, and provide isolation — that run on physical machines. Unlike traditional cloud execution, the functions executed in a FaaS are short lived and are priced on a per-invocation basis (with function execution time and resource utilization being the main cost factors). Because cloud providers use a per-API call and per-resource utilization price model, resource waste affects the cloud user’s total cost of ownership.

To motivate our work, we use an image to scene description pipeline deployed within FaaS as an example, illustrated in Figure 3. The pipeline takes an image input and outputs a textual description, leveraging a deployed AlexNet and an off-the-shelf scene understanding model from the cloud provider’s model catalog. Both the AlexNet model inference and the scene understanding API are called within User Function 3. Cloud providers then provision the function to run within a container on a cloud server. When user code is triggered, both the deployed AlexNet model and the scene understanding endpoints are called through HTTP REST API calls. Meeting the latency requirements for this application is challenging because of the multiple over-the-network requests.


To avoid the network latency, a common practice is to collocate the model within the deployed functions or the application pipelines. However, such embedding requires a copy of the model to be loaded privately for each function or application pipeline. For example, two user functions that both use AlexNet have to load AlexNet twice on the same machine, wasting memory resources. The private loads also introduce latency overhead, since the model needs to be loaded for the first function invocation. Since isolation must be guaranteed in FaaS, the persistence schemes mentioned in Section 2 are not a solution. Similarly, batching does not apply to low latency inference.

Figure 4: Multiple processes can perform IPC requests to the TrIMS Model Resource Manager (MRM) server; for example, several clients may be performing an Open request while another is performing a Close request. TrIMS’s MRM is responsible for loading and managing the placement of the models in GPU memory, CPU memory, or local disk.

In a cloud setting DL models are shared extensively across user functions, for example between the user functions shown in Figure 3. Based on this observation, we propose TrIMS to eliminate such model loading overhead and hardware resource waste, while maintaining resource utilization efficiency and decreasing inference latency in user processes. TrIMS achieves this by folding “private copies” of the model into a shared copy under the hood. This is done by decoupling the model persistence from the user-code execution, which enables model sharing, isolation, and low latency inference.

4 TrIMS Design

TrIMS consists of two components: a Model Resource Manager (MRM) server and framework clients. MRM manages the model resources resident in the system memory and abstracts away the model loading from framework clients. Each framework client communicates with MRM through inter-process communication (IPC), as shown in Figure 4. Since TrIMS follows the original DL framework’s API and semantics — returning the same data structures as the unmodified framework — user code can leverage TrIMS transparently without any code modification.

Figure 5: When user code loads a model using the original framework API, instead of loading the model directly from disk, the corresponding TrIMS client sends an Open request with ModelRequest structure to the MRM, and receives a response of type ModelHandle, from which it constructs the compute graph with model weights. When user code unloads a model, then instead of directly destroying the allocated memory, the TrIMS client sends out a Close request with ModelHandle and TrIMS MRM does the housekeeping.

4.1 TrIMS Model Resource Manager (MRM)

TrIMS’s MRM is a model server daemon that performs model management and placement. MRM maintains a database of models, addressing them using namespaces, with framework as well as model name and version being used to distinguish frameworks and models. Figure 4 shows MRM managing models for the MXNet and Caffe2 DL frameworks as well as word vector embedding models for FastText and Glove.

The MRM placement manager then maps the models into GPU memory, CPU memory, local storage, or cloud storage. The four levels are analogous to the traditional CPU cache hierarchy. Because of this, we will simply refer to these four levels of the memory hierarchy as “cache” in the rest of this paper whenever there is no ambiguity.

Figure 6: Cloud providers can use TrIMS MRM as a container plugin to provision running untrusted user functions while still leveraging model sharing. User code is executed within isolated containers and can get the benefits of TrIMS without code modifications. Sharing occurs when the users utilize the same models as their peers, which is not uncommon in cloud settings using cloud provided APIs.

After system cold boot, initial model requests miss the GPU, CPU, and local storage caches, causing the model to be downloaded from the cloud storage and loaded into the “caches” to serve both the current request and future requests. When one of the caches becomes full, one or more models are evicted from the cache.

For inter-process communication, TrIMS uses gRPC [38] to send and receive messages between the MRM and its clients. TrIMS leverages the CUDA runtime’s cudaIpc* functions to share GPU memory across processes. MRM abstracts away the model management, exposing two API functions to be used by the clients: trims::open and trims::close, to load and close a model, respectively. MRM maintains a reference count for each model to determine the number of users currently using the shared model. The API is shown in Figure 5.

Figure 7: The logic for caching models on both GPU and CPU. The TrIMS client initiates the load model call to TrIMS MRM and gets back a pointer to GPU memory.

4.1.1 Loading Models

When loading a model, MRM performs shape inference on the model to estimate its memory footprint when running on GPU. Shape inference is a simple arithmetic computation performed by any framework to determine the amount of internal memory to allocate for a model. After shape inference, MRM follows the state diagram shown in Figure 7 and needs to handle three cases:

GPU cache hit — Model is persistent in GPU memory

MRM increments the model’s reference count and creates a shared memory handle from the device memory owned by MRM. The handle is then returned to the framework client. Model eviction is triggered when the intermediate results for a model are larger than the available free memory.

GPU cache miss / CPU cache hit — model is persistent in CPU memory

The server queries the current memory utilization of the GPU to see if the model can be copied to GPU memory. If it can, then GPU memory is allocated and copied; if not, then some memory needs to be reclaimed — entering the memory reclamation procedure.

CPU and GPU cache miss — model is not persistent in memory

If the data is not on local storage, then MRM downloads the model from the cloud. If the data is on disk, then MRM loads the data from disk using the framework’s serializer. Pinned memory is allocated on the CPU and the model weights are copied to it. MRM then follows the same logic as when the data is persistent in CPU memory.
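The three cases above can be summarized as a lookup that walks the cache levels from fastest to slowest. The sketch below models only that decision, assuming a downloaded model is written to local storage before it is deserialized; `ModelLocation`, `Step`, and `plan_load` are hypothetical names and do not correspond to the TrIMS source.

```cpp
#include <vector>

// Where the MRM currently holds a copy of a model (hypothetical bookkeeping).
struct ModelLocation {
  bool in_gpu = false;
  bool in_cpu = false;
  bool on_disk = false;
};

enum class Step { DownloadFromCloud, LoadFromDisk, CopyCpuToGpu, ShareGpuHandle };

// Returns the ordered steps needed to serve an Open request for a model.
inline std::vector<Step> plan_load(const ModelLocation& loc) {
  std::vector<Step> steps;
  if (!loc.in_gpu) {
    if (!loc.in_cpu) {
      if (!loc.on_disk) steps.push_back(Step::DownloadFromCloud);
      steps.push_back(Step::LoadFromDisk);  // deserialize into pinned CPU memory
    }
    steps.push_back(Step::CopyCpuToGpu);    // may first require memory reclamation
  }
  steps.push_back(Step::ShareGpuHandle);    // bump the reference count, return a handle
  return steps;
}
```

A GPU cache hit yields a single step, a CPU cache hit yields two, and a request after a cold boot walks all four.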

4.1.2 Reclaiming Memory and Evicting Models

Memory reclamation is performed when the memory space for MRM at a specific cache level is full. Which model to evict to reclaim memory is determined by the eviction policy. TrIMS supports a pluggable set of common eviction policies such as least recently used (LRU) and least commonly used (LCU). For the CPU and GPU level caches, one needs to make sure that eviction does not interfere with the user’s code. Models within the MRM database are not candidates for reclamation if they are in use, i.e. if the reference count of the model is non-zero. Evicting a model that is currently being used (effectively freeing GPU memory that is in use) causes undefined behavior in the user’s code.
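A minimal sketch of an LRU policy that respects the reference-count rule is shown below. It assumes the cache is kept as a list ordered from least to most recently used; `CachedModel` and `evict_lru` are illustrative names, not part of TrIMS.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

struct CachedModel {
  std::string name;
  std::size_t bytes;
  int ref_count;  // number of clients currently using the model
};

// `cache` is ordered from least to most recently used (front = LRU).
// Evicts idle models until at least `needed` bytes are freed or no
// candidate remains, and returns the names of the evicted models.
inline std::vector<std::string> evict_lru(std::list<CachedModel>& cache,
                                          std::size_t needed) {
  std::vector<std::string> evicted;
  std::size_t freed = 0;
  for (auto it = cache.begin(); it != cache.end() && freed < needed;) {
    if (it->ref_count == 0) {  // only idle models may be reclaimed
      freed += it->bytes;
      evicted.push_back(it->name);
      it = cache.erase(it);
    } else {
      ++it;                    // in use: never free memory a client still holds
    }
  }
  return evicted;
}
```

If every cached model is in use, the function returns an empty list; a caller would then have to report failure or fall back to a slower cache level rather than evict a live model.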

4.1.3 Unloading Models

When a TrIMS framework client unloads a model (or the user process exits), a model unload request is sent to MRM. MRM looks up the model in the database and decrements its reference count. By default MRM does not free resources for models that have a zero reference count (not currently used), but MRM can be configured to eagerly reclaim these models.
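The open/close bookkeeping described in this section can be sketched as a reference-counted registry keyed by framework, model name, and version. This is a simplified model of the behavior only: `ModelRegistry`, `ModelKey`, and the `eager_reclaim` flag are hypothetical and are not the trims::open / trims::close signatures.

```cpp
#include <map>
#include <string>
#include <tuple>

// Key mirrors the namespacing in the text: framework, model name, version.
using ModelKey = std::tuple<std::string, std::string, std::string>;

class ModelRegistry {
 public:
  explicit ModelRegistry(bool eager_reclaim = false)
      : eager_reclaim_(eager_reclaim) {}

  // Open: register one more user of the model and return the new count.
  int open(const ModelKey& key) { return ++ref_counts_[key]; }

  // Close: drop one user. Returns the remaining count, or -1 if the model
  // has no active users. With eager reclamation an idle entry is removed
  // immediately; otherwise it stays cached until eviction needs the space.
  int close(const ModelKey& key) {
    auto it = ref_counts_.find(key);
    if (it == ref_counts_.end() || it->second == 0) return -1;
    int remaining = --it->second;
    if (remaining == 0 && eager_reclaim_) ref_counts_.erase(it);
    return remaining;
  }

  bool is_cached(const ModelKey& key) const {
    return ref_counts_.count(key) > 0;
  }

 private:
  bool eager_reclaim_;
  std::map<ModelKey, int> ref_counts_;
};
```

Keeping idle entries by default trades memory for latency: a later Open for the same model becomes a cache hit instead of a reload.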

4.2 TrIMS Frameworks

MRM can handle requests from multiple TrIMS-enabled frameworks, managing their weights (which have different data layouts) in separate namespaces. As shown in Figure 5, when a TrIMS framework performs a model load request, the framework’s name and version are sent along with the request. The server can then perform the model unmarshaling from disk using the format supported by the framework.

To enable TrIMS in a framework, the functions that load and unload models need to be modified to perform requests to MRM. Since each framework may have its own serialization format, MRM also needs support for that format in order to unmarshal the data from disk to memory. With these changes, any type of network supported by the framework (CNN, RNN, Word2Vec, etc.) and any compute pattern is automatically supported by TrIMS.

User application rewriting overhead

Since MRM does not modify the framework’s API, code that is linked with a TrIMS-enabled framework does not require any change. TrIMS works within Python, Java, or R. This is an attractive feature, since the benefits of TrIMS can be leveraged by cloud providers transparently to the user.

Sharing Granularity


TrIMS supports fixed-size block, layer, and model level sharing granularity. Sub-model level sharing granularity is useful when sharing layers or memory across models. For example, models trained using transfer learning [65] share the frozen layer weights. Block level granularity can also be used to share fixed-size buffers.

Multi-GPU and Multi-Node Support

Multi-GPU is usually used when performing batched inference [17, 21]. TrIMS inherently supports multi-GPU systems by leveraging Unified Memory (UM) [6]. Multi-GPU sharing can also be supported without relying on UM by making the TrIMS framework client query the device ID of the current GPU context when a model is loaded. The framework client can then send the device ID along with the request. TrIMS MRM would then load the model into the GPU with that device ID. When a request loads a model on a GPU and the requested model is persistent on another GPU, MRM will perform a GPU peer-to-peer memory copy if supported.

Multiple independent instances of TrIMS MRM can be loaded for multi-node support, and off-the-shelf task scheduling and load balancing middleware can be used to route and load balance inference requests. TrIMS can be set up to advertise the models that have already been loaded by users and the current system load to the load balancer.

4.3 Inference Isolation and Fairness

To enable seamless container isolation, TrIMS provides a Docker [51] volume plugin that allows service providers to provision the container with a communication link to the TrIMS MRM. The TrIMS MRM process runs in the host system with a link for frameworks to communicate with it across container boundaries. Figure 6 shows how untrusted user code can be run on a multi-tenant system while maintaining isolation. The code shows how users can use DL models, provided by the cloud provider, to create an image to audio pipeline. The user uses the cloud provided vision, text, and audio models via a library that is part of a model catalog. All user code executes within a container that communicates with the MRM via the container’s IPC mechanism.

5 Implementation

The experiments reported in this paper are based on an implementation of TrIMS on top of Apache MXNet, a popular machine learning framework; the source code for TrIMS is open source. The TrIMS MRM includes serialization code from MXNet to unmarshal MXNet models from disk. We also modify the MXNet framework to integrate it with TrIMS, keeping the MXNet APIs unchanged. Communication between the MXNet framework client and the MRM uses Google’s gRPC [38] with the packets encoded using Protocol Buffers [9].

To validate the efficiency and generality of our proposal, we follow a few principles throughout our implementation — even if disregarding some would have given us better speedup:

Backward Compatible

The implementation needs to work with the existing framework’s code base and language bindings, i.e. we should be able to run preexisting MXNet code written in Python or Scala with no modifications.

Simple and Minimal

The implementation needs to be simple and modify the framework code as little as possible. Our modifications add only a small amount of code relative to the MXNet code base, split between the server and the client, and are self contained.


Configurable

The implementation has knobs to tweak everything from the eviction strategy for memory sharing to the amount of memory that can be used, whether to enable TrIMS, and the levels of cache to enable.

Fast, Concurrent and Scalable

We communicate using gRPC and use efficient data structures [41] for the MRM database to make the serving fast and concurrent. The memory sharing strategy in TrIMS is scalable and can handle a large amount of load.

5.1 TrIMS Apache MXNet Framework

We implement the TrIMS client on top of the Apache MXNet framework by modifying the implementations of MXPredCreate and MXPredFree in the MXNet C predict API. When TrIMS is enabled, trims::open and trims::close are called as part of the predictor creation and deletion. Listing 1 shows the main modification to the original MXNet code.

MXAPIPredictor *MXPredCreate(MXPredParams *p) {
  MXAPIPredictor *ret = new MXAPIPredictor();
  {...} // load in the symbol and model parameters
  {... shapes = infer_model_shapes(p); ... }
  if (trims::ENABLED) {
    // ask the MRM for the shared model instead of loading a private copy
    auto tinfo      = trims::open(...);
    ret->handle_id  = std::get<0>(tinfo);
    ret->model_id   = std::get<1>(tinfo);
  } else {
    // original model loading
    dmlc::MemoryStream fi(p->buf, p->size);
    NDArray::Load(&fi, &data, &names);
  }
  return ret;
}

void MXPredFree(PredictorHandle handle) {
  auto pred = static_cast<MXAPIPredictor *>(handle);
  // tell the MRM this client no longer uses the shared model
  if (trims::ENABLED) trims::close(pred);
  delete pred;
}
Listing 1: To integrate TrIMS with MXNet we modify both the MXPredCreate and MXPredFree functions. MXPredCreate loads the model and initializes the compute graph to perform inference. If TrIMS is enabled, we call trims::open instead of NDArray::Load, which loads the model from disk. To correctly free the models, we modify the MXPredFree function to call trims::close. MXPredFree is called in the Predictor destructor or at process exit.

Like most open-source DL frameworks, MXNet is optimized for training and not inference. We apply a set of optimizations to the original MXNet to improve the inference latency. The optimizations avoid eager initialization of CUDA resources, remove cuDNN algorithm selection for backward propagation, and simplify the random resource generation. With our optimizations, MXNet is faster for inference on average than vanilla MXNet for the suite of models we use. We use the modified MXNet as our baseline for evaluation.

5.2 GPU Memory Sharing

We perform GPU memory sharing using CUDA’s cudaIpc* runtime functions. For pre-Volta GPUs, the CUDA IPC mechanism utilizes CUDA MPS, an intermediate user process where the memory allocations are performed. This means that all CUDA operations end up serialized and executed within the same CUDA MPS context, enabling different processes to share the same GPU virtual address space (VAS). For Volta GPUs, NVIDIA introduced a new feature that allows contexts to share page-table mappings. This makes it possible for user processes to run using different contexts while still sharing memory. For CUDA 9.2, CUDA MPS is still invoked to keep shared allocations and communicate across them, but, with the exception of a handful of functions, most CUDA operations are performed without IPC communication.

Because sharing may serialize through CUDA MPS, one slight disadvantage of the CUDA IPC functions is that they have a measurable overhead, which can become a bottleneck. When sharing models at layer granularity, networks with a large number of layers, such as ResNet269-v2, incur high overhead. We remedy this by sharing at the granularity of a group of layers or of a whole model.

The CUDA IPC overhead is measurable, and we can statically quantify whether using TrIMS is beneficial with an empirical formula over the following quantities: the number of objects to share (1 when the sharing granularity is at the model level; the number of layers when it is at the layer level); the overhead of sharing CUDA memory via CUDA IPC; the overhead of obtaining a CUDA device pointer from a shared CUDA IPC handle; the number of bytes the model occupies on disk; and the disk I/O bandwidth. These constants can be computed once at system startup and cached to be used by TrIMS. If the result of the formula is positive, then its magnitude is correlated to the speedup one gets using TrIMS. This formula can be used within the TrIMS framework to determine at runtime whether to call TrIMS to share a model and at what granularity to share it.
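One plausible form of such a formula, inferred from the quantities listed above rather than taken verbatim from the paper, is the time to read the model from disk minus the total sharing overhead: bytes on disk divided by disk bandwidth, minus the number of shared objects times the sum of the two per-object CUDA IPC overheads. The sketch below encodes that assumed form; `SharingCosts`, `estimated_benefit_s`, and `should_share` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical container for the constants measured once at startup.
struct SharingCosts {
  double share_overhead_s;    // exporting one object via CUDA IPC, in seconds
  double open_overhead_s;     // obtaining a device pointer from a handle, in seconds
  double disk_bandwidth_Bps;  // disk I/O bandwidth, in bytes per second
};

// Estimated time saved (seconds) by sharing instead of loading from disk:
//   model_bytes / disk_bandwidth - num_objects * (share + open overhead)
// This form is an assumption inferred from the quantities the text lists.
inline double estimated_benefit_s(std::size_t num_objects, double model_bytes,
                                  const SharingCosts& c) {
  assert(c.disk_bandwidth_Bps > 0.0);
  const double load_time = model_bytes / c.disk_bandwidth_Bps;
  const double share_time = static_cast<double>(num_objects) *
                            (c.share_overhead_s + c.open_overhead_s);
  return load_time - share_time;
}

// Share only when the estimate is positive.
inline bool should_share(std::size_t num_objects, double model_bytes,
                         const SharingCosts& c) {
  return estimated_benefit_s(num_objects, model_bytes, c) > 0.0;
}
```

Evaluating the estimate at several candidate granularities (one object for the whole model, one per group of layers, one per layer) and picking the largest positive value is one way the granularity choice described above could be made; the overhead values themselves must be measured on the target system.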

6 Evaluation

Name | CPU | GPU | Memory | GPU Memory | Cached Reads | Buffered Disk Reads
System 1 | Intel Core i9-7900X | TITAN Xp P110 | 32 GB | 12 GB | 8 GB/sec | 193.30 MB/sec
System 2 | Intel Xeon E5-2698 v4 | Tesla V100-PCIE | 256 GB | 16 GB | 10 GB/sec | 421.30 MB/sec
System 3 | IBM S822LC Power8 w/ NVLink | Tesla P100-SXM2 | 512 GB | 16 GB | 27 GB/sec | 521.32 MB/sec
Table 2: We evaluate TrIMS on 3 systems which represent both cloud offerings and consumer desktop system configurations currently used for DL inference. We use the Linux hdparm tool to measure the cached disk reads.

We evaluate TrIMS on 3 systems (shown in Table 2) using 37 pre-trained small models (shown in Table 3) and 8 large models (shown in Table 4). The systems selected represent different types of instances that are currently provisioned in the cloud. System 3 uses the NVLink bus [32, 63] which allows up to transfer between CPU and GPU. System 3 is used as a proxy for understanding our proposed method’s behavior on high-end cloud instances and next-generation interconnects currently being deployed on HPC and cloud systems [68, 64]. Multi-GPU results are similar to the single-GPU results shown below and are omitted for simplicity.

Figure 8: A representative sample of the models shown in Table 3 are chosen and are run on the systems in Table 2 to achieve (a) the best case end-to-end time — when the model has been pre-loaded in GPU memory — and (b) the worst case end-to-end time — when the model misses both the CPU and GPU persistence and needs to be loaded from disk. The speedups are normalized to end-to-end running time of the model without TrIMS. The yellow dots show the ideal speedup; the speedup achieved by removing any I/O and data-transfer overhead — keeping only the framework initialization and compute. For models 33 and 36, the achieved speedup is shown on the bar (white) and the ideal speedup is shown on top of the bar (black).
Figure 9: Detailed normalized times of operations with and without TrIMS on System 3 using the models in Table 3. The duration for TrIMS is normalized to the end-to-end time of not using TrIMS. Model initialization is the time spent setting up the CUDA contexts for the model, initializing the compute state, and (in the case of not using TrIMS) copying the weights to GPU memory. Compute is the time spent performing inference computation. Model sharing is the overhead introduced by using TrIMS and includes the gRPC communication and sharing GPU data using CUDA IPC. Through TrIMS we effectively eliminate model loading and data movement.
Figure 10: Large models in Table 4 are run to achieve the best case end-to-end time — when the model has been pre-loaded in GPU memory. The speedups are normalized to end-to-end running time of the model without TrIMS. The red dots show the percentage of time spent performing the compute. We see linear speedup until the inference becomes compute bound.

We used image processing models as a representative workload because these are currently the most plentiful in FaaS pipelines. TrIMS is agnostic to the compute patterns of a network, and the analysis would apply to other types of networks such as RNNs, word embeddings, or matrix factorization. The pre-trained image processing models shown in Table 3 were selected based on their popularity in both research and usage. Some of the networks have variants. These are used to simulate user-trained models, where the same network structure can have different weights. Large models are used to show how TrIMS performs with increasing model sizes.

Throughout this section we compare our performance within a FaaS setting against the ideal (where model loading and data movement take no time, so the ideal is faster than model persistence) and use end-to-end “cold-start” inference as the baseline, since that is what is currently employed by FaaS environments.

ID Name # Layers ILS MWMF
1 AlexNet [47] 16 516 238
2 GoogLeNet [61] 116 111 27
3 CaffeNet [47] 16 512 233
4 RCNN-ILSVRC13 [33] 16 479 221
5 DPN68 [24] 361 122 49
6 DPN92 [24] 481 340 145
7 Inception-v3 [62] 472 257 92
8 Inception-v4 [60] 747 399 164
9 InceptionBN-v2 [43] 416 313 129
10 InceptionBN-v3 [62] 416 142 44
11 Inception-ResNet-v2 [60] 1102 493 214
12 LocationNet [69] 514 666 285
13 NIN [48] 24 131 29
14 ResNet101 [40] 526 423 170
15 ResNet101-v2 [40] 522 428 171
16 ResNet152 [40] 777 548 231
17 ResNet152-11k [40] 769 721 311
18 ResNet152-v2 [40] 761 340 231
19 ResNet18-v2 [40] 99 154 45
20 ResNet200-v2 [40] 1009 589 248
21 ResNet269-v2 [40] 1346 889 391
22 ResNet34-v2 [40] 179 222 84
23 ResNet50 [40] 268 270 98
24 ResNet50-v2 [40] 259 275 98
25 ResNeXt101 [71] 526 375 170
26 ResNeXt101-32x4d [71] 522 378 170
27 ResNeXt26-32x4d [71] 147 147 59
28 ResNeXt50 [71] 271 222 96
29 ResNeXt50-32x4d [71] 267 224 96
30 SqueezeNet-v1.0 [42] 52 34 4.8
31 SqueezeNet-v1.1 [42] 52 28 4.8
32 VGG16 [58] 32 1228 528
33 VGG16-SOD [75] 32 1198 514
34 VGG16-SOS [74] 32 1195 513
35 VGG19 [58] 38 1270 549
36 WRN50-v2 [73] 267 758 264
37 Xception [25] 236 244 88
Table 3: The small models are popular in the literature and serve as proxy models that offer a wide variety of sizes and computational complexity. Image classification models are used since they are the most commonly used. Both internal layer sizes (ILS) and the model weights memory footprint (MWMF) are shown in megabytes. The number of models is chosen so that their total footprint is 2x larger than the available 16 GB GPU memory on Systems 2 and 3.

6.1 Latency Improvement

We measure the end-to-end “cold-start” inference of MXNet with and without TrIMS – for the sake of clarity we omit the input processing time. Figure 8 shows the achieved speedup on a representative set of the models compared against MXNet that does not utilize TrIMS. We show two cases: (a) our best case (when there is a GPU cache hit) and (b) our worst case (when the cache misses both the CPU and GPU).

For the best case analysis (Figure 8a), the server needs to create the CUDA IPC handles and the framework client needs to embed the GPU device pointers within the framework’s container. This introduces a slight overhead; however, it is within of the ideal, where the ideal is defined as the time for inference with model loading and deserialization times set to zero. We see that latency speedup improves proportionally to the model size, the system’s data movement bandwidth, the system’s compute resources, and the model’s compute complexity.

For small models, where the I/O overhead is very low, for example SqueezeNet (which has a 4.8 MB memory footprint), we observe only marginal speedup (). These models are designed to have a small footprint, targeting edge devices, and are rarely used within the cloud. For state-of-the-art networks, such as VGG16-SOD, we observe speedup on System 1. Even with fast disk and the NVLink interconnect, which mitigates I/O overhead by offering greater data movement bandwidth, System 3 achieves speedup for VGG16-SOD.

For the worst case analysis (Figure 8b), the MRM needs to load the data from disk, persist the model on the CPU, copy the data to the GPU, and send the GPU memory handles to the client. Although we get a slowdown, this case assumes there is no model sharing across pipelines, and is therefore uncommon in a cloud setting.

6.2 Speedup Breakdown

To understand where the new bottlenecks are for the inference using TrIMS, we look at System 3 where we achieve the lowest speedup and measure the (a) time to perform inference computation, (b) time to initialize the model (this includes copying the data to the GPU when not using TrIMS), (c) model loading from disk, and (d) model sharing overhead introduced by TrIMS. As can be seen in Figure 9, without using TrIMS an average of of the time is spent loading and initializing the model while only is spent performing computation. When using TrIMS we eliminate the model loading from disk and remove the need to perform memory copies to the GPU. Even though we introduce overhead, we still gain a geometric mean speedup.

6.3 Large Model Evaluation

We evaluate our method using large models, which are common in medical image analysis, NLP, and time series modeling. We generated the large models by starting with the regular AlexNet and VGG16 networks, keeping their compute graph topology, and rescaling the input dimensions to generate enlarged models. Table 4 shows the models selected for evaluation, their memory footprint, and their input sizes.

ID Name Input Dims MWMF
1 AlexNet-S1 [47] 238
2 AlexNet-S2 [47] 770
3 AlexNet-S3 [47] 1694
4 AlexNet-S4 [47] 3010
5 VGG16-S1 [58] 528
6 VGG16-S2 [58] 1704
7 VGG16-S3 [58] 3664
8 VGG16-S4 [58] 6408
Table 4: Large models were used to evaluate our method. The models were generated by taking AlexNet and VGG16 and scaling the number of input features. Large models arise in either medical image analysis, NLP, or time series analysis where down-sampling decreases the accuracy or the network requires a large window of features to give accurate results.

Figure 10 shows that by removing model loading overhead, inference on large models becomes compute bound and gives an advantage to faster GPUs. This is why System 1 achieves less speedup than System 2 for the more compute intensive VGG16 network (for example for model 7), since model inference computation accounts for on System 1 and on System 2. We expect this to be a more pronounced bottleneck for lower end GPUs and less of an issue for specialized low latency inference accelerators.

We also observe that TrIMS increases the memory efficiency of the GPU. Without TrIMS, two inferences using model 8 cannot be run concurrently, since they overrun the GPU memory. TrIMS avoids multiple private copies of the model on the same machine, enabling concurrent runs of large models.

Figure 11: We vary the percentage of models run (from Table 3) and we sample them following a Pareto distribution (with and ). We also vary the concurrency level (number of inferences performed concurrently) ranging it from 1 to 10. The iso-curves show the geometric mean of the speedups for Systems 1, 2, and 3.

6.4 Workload Modeling

Finally, we perform workload modeling to understand the behavior of TrIMS on a multi-tenant oversubscribed system. The workload is selected from the 37 small models shown in Table 3 following a Pareto distribution. Since the models cannot all be resident on the GPU at the same time (in total having GPU memory footprint), the TrIMS MRM needs to exercise the model reclamation and eviction procedure. Because of limited space, we only present the results for the LRU eviction strategy, but our observations are valid for other eviction strategies.
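To make the oversubscribed workload concrete, the sketch below pairs a heavy-tailed request sampler with a capacity-bound LRU cache of resident models. It is an illustrative simulation only, not the TrIMS MRM: the class `LRUModelCache`, the function `sample_workload`, the Pareto shape parameter, and the use of MWMF alone as a model's GPU footprint are all assumptions made for this example.

```python
import random
from collections import OrderedDict
from typing import List, Sequence


class LRUModelCache:
    """Tracks which models are resident, evicting the least recently used first."""

    def __init__(self, capacity_mb: float) -> None:
        self.capacity_mb = capacity_mb
        self.used_mb = 0.0
        self._resident: "OrderedDict[str, float]" = OrderedDict()  # oldest entry first

    def request(self, name: str, size_mb: float) -> bool:
        """Return True on a cache hit; on a miss, make the model resident if it fits."""
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return True
        if size_mb > self.capacity_mb:
            return False  # can never be resident; served without caching
        while self.used_mb + size_mb > self.capacity_mb:
            _, evicted_mb = self._resident.popitem(last=False)  # evict LRU entry
            self.used_mb -= evicted_mb
        self._resident[name] = size_mb
        self.used_mb += size_mb
        return False

    def resident(self) -> List[str]:
        return list(self._resident)


def sample_workload(models: Sequence[str], num_requests: int,
                    alpha: float, seed: int) -> List[str]:
    """Draw model names with Pareto-distributed ranks (alpha is a hypothetical choice)."""
    rng = random.Random(seed)
    picks = []
    for _ in range(num_requests):
        rank = int(rng.paretovariate(alpha)) - 1  # paretovariate returns values >= 1
        picks.append(models[min(rank, len(models) - 1)])
    return picks
```

Replaying a sampled request list through `request` and counting hits gives a quick estimate of how often a given capacity forces evictions for a given popularity skew.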

Figure 11 shows the level iso-efficiency curves for the geometric mean speedup (measured as the geometric mean across the latency speedup of each model) as we vary the concurrency level and the number of models to run. We can see that even in an oversubscribed setting, we can still service clients concurrently, reduce the overall batch execution time (by up to ), while incurring only a latency penalty for each request. This slowdown is due to the cost of evicting models to accommodate the larger memory footprint, which causes subsequent uses of an evicted model to miss the GPU cache.
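The aggregate metric used here, the geometric mean of per-model latency speedups, can be computed as in the following sketch. The function name and the input layout (parallel lists of baseline and TrIMS latencies in seconds) are assumptions for illustration, and the sample latencies are made-up values.

```python
import math
from typing import Sequence


def geometric_mean_speedup(baseline_s: Sequence[float], trims_s: Sequence[float]) -> float:
    """Geometric mean of per-model speedups (baseline latency / TrIMS latency)."""
    if len(baseline_s) != len(trims_s) or not baseline_s:
        raise ValueError("need two non-empty latency lists of equal length")
    # Averaging in log space avoids overflow and treats speedups multiplicatively.
    log_sum = sum(math.log(b / t) for b, t in zip(baseline_s, trims_s))
    return math.exp(log_sum / len(baseline_s))


# Made-up latencies: per-model speedups of 2x and 8x combine to a 4x geometric mean.
example = geometric_mean_speedup([2.0, 8.0], [1.0, 1.0])
```

The geometric mean is the usual choice for averaging ratios, since a single model with a very large speedup does not dominate the result the way it would in an arithmetic mean.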

For all three systems, we can observe an over-subscription sweet spot, where the percentage of models and the number of concurrent requests can be increased while the batch execution is preserved to a speedup of . All systems show a sweet spot when of models are actively being requested. For Systems 1 and 3, the number of concurrent requests can be increased to 4, and for System 2 it can be increased to 6. The difference in the over-subscription sweet spot can be explained by the different compute capabilities of the systems. Systems 1 and 3 are provisioned with Pascal generation GPUs while System 2 has the latest Volta generation. Essentially, because we are successful in moving the inference bottleneck from I/O to compute, the sweet spot is determined by the available computing resources. In practice, cloud providers can perform sensitivity analysis to determine the number of models hosted on each server and the number of concurrent requests to service based on the service’s target requirements.

By removing model loading overhead, our speedup is bounded by the framework’s inference pipeline. Frameworks that are optimized for inference garner greater benefits. For older generation or lower end GPUs, compute would likely dominate inference. Therefore, if cloud providers are only interested in maintaining latency, they can utilize these older or lower end GPUs, which have a lower initial cost of ownership.

7 Related Works

Recent related work has explored techniques to enable model serving at cloud scale. TensorFlow-Serving [54] provides soft model isolation to guard against concurrently running requests interfering with each other's performance. TFX [21] uses a dedicated thread pool to hide model-loading overhead and provide thread-level user isolation. Clipper [27] combines concurrent streams of DL requests into batches to better utilize the GPU at the cost of longer latency. All of these techniques suffer from their inability to provide user isolation or handle low latency inference.

Recent work [56, 44, 55, 31] leverages CUDA IPC to improve various intra-node and inter-node MPI collectives of a single process or application, and thus to facilitate the porting of HPC applications to GPUs and improve their performance. MVAPICH2 [53], for instance, supports the use of MPI calls directly over GPU memory. Unlike these works, TrIMS leverages CUDA IPC to persist data structures across processes and thus actively seeks to improve I/O and memory footprint, rather than multi-GPU coordination.

To reduce DL model inference latency and memory requirements, a large body of recent work has focused on compacting and accelerating convolutional neural networks (CNNs). Quantization [34, 70, 66, 39, 26] reduces the number of bits required to represent each weight by rescaling the weights to a domain smaller than the 32 bits required for floating point representation (usually 8 bits). Network pruning and sharing [59, 39, 23, 77] reduces redundant parameters to which the performance is not sensitive. Although these model optimization techniques can make the I/O vs. compute problem less severe, they have drawbacks and limited application scope. Quantized model inference suffers from accuracy loss, while pruned networks can significantly increase the computation intensity due to sparsity, especially on GPUs. Moreover, these techniques currently only work with convolutional layers or fully connected layers and do not apply to other types of model inference, such as the fully connected layers used in word embedding. Model optimizations could also leverage TrIMS, enabling the sharing of optimized CNN models.

Although various CPU/GPU virtualization techniques [52, 67, 30] and GPU multi-tenancy [28, 57, 72, 20] can improve system utilization and throughput through time sharing or parallel sharing of the CPU or GPU, they do not help solve the model loading overhead within inference processes. TrIMS is orthogonal to these techniques and can be integrated into containers as a plugin. Also, in the same way that NVIDIA Volta added the capability of effectively sharing memory across user processes without the need for a proxy process (the CUDA MPS server), the ability to share memory across different VMs (using a third level of virtual memory translation, as CPUs do) would enable TrIMS to work across VMs.

8 Conclusion

Collocating compute with model serving within FaaS overcomes the network barrier but suffers from high “cold start” latency. We propose TrIMS to mitigate the major source of “cold start” latency, the model loading overhead, and make building complex latency-sensitive pipelines with modular DL components feasible. We do so by decoupling compute from model persistence and leveraging model sharing across user pipelines. TrIMS moves the bottleneck of DL model inference to compute, thus making GPU acceleration more appealing and making specialized novel inference architectures more tractable.

TrIMS was evaluated on three systems that represent current cloud system offerings. We used 45 DL models and show a speedup of up to 24x for small models and up to 210x for large models. When running concurrent inference, we can increase the overall system throughput by up to 8x. Our methodology, when applied to DL frameworks, offers advantages to both cloud providers and users. The isolation, along with the significant memory reduction through model sharing, enables cloud providers to over-provision hardware resources, thus decreasing the total cost of ownership. The benefits of TrIMS to the cloud providers can be passed down to the users in the form of reduced latency or cost of inference.

TrIMS is a generic memory sharing technique that can be used when computation requires a large number of constant parameters to be in situ on the CPU or GPU, while still maintaining isolation between users. As such, the proposed method can be easily generalized to any application or algorithm that spans multiple processes and requires a large amount of constant data resources. While we motivated our work with deep learning, other types of applications such as image processing, physical simulation, or in-memory databases can benefit from our approach.