The recent trend of computing sees a confluence between artificial intelligence, driven primarily by deep learning (DL), and cloud computing with both gaining traction within enterprise and consumer applications.
Key to this trend is the superior performance, accessibility, and accuracy of deep neural networks (DNNs) in a wide array of intelligent tasks such as: image recognition, object detection, natural language understanding, speech synthesis, and personalized recommendation.
Today, many business-logic and consumer applications rely on DL inference as core components within their prediction pipelines. These pipelines tend to be deployed to the cloud through Function as a Service (FaaS) platforms [8, 1, 5, 10], since they abstract away low-level details such as system setup, dev-ops, and monitoring — promising service isolation, decentralization, and scalability, while still being more cost-effective than dedicated servers. Since FaaS services execute arbitrary user pipelines, FaaS systems must execute code in isolation — through virtual machines (VMs) or containers.
Current off-the-shelf DL inference [12, 2, 7, 4, 11] is performed through HTTP APIs and uses pre-built general models (model catalogs) deployed by the cloud provider or user-defined models deployed by the user. Within the FaaS pipelines, users interact with these models using the HTTP inference APIs and construct their prediction pipelines by defining glue code that parses the input, performs the model prediction, and processes the output. There are two ways to perform model inference: batch prediction and online prediction. Batch prediction is performed offline on a large set of inputs, while online prediction is usually performed in real-time on a one-by-one basis [15, 18]. In this paper we focus on online prediction within a latency sensitive FaaS prediction pipeline.
DL service providers are aware of the “cold start” cost of inference, and therefore eagerly persist models within their catalog — keeping the models in memory (“warm”) to guarantee the promised latency. For example, Amazon ML attempts to respond to most real-time prediction requests within . Without model persistence, network overhead contributes a significant portion of the end-to-end inference latency. As for user-deployed models, the inference latency is not only affected by the network, but is also dominated by the model inference “cold start”. The “cold start” latency can be seconds to minutes depending on the model size and the deployment setup. To avoid the “cold start” overhead, users have to pay [13, 3] an hourly cost to persist their models.
FaaS can be used to express latency sensitive prediction pipelines that leverage a chain or ensemble of models. However, the current practice of integrating FaaS with model catalogs is inefficient for this usage — the network latency associated with each inference limits how complex or intelligent a pipeline can be — placing these pipelines out of reach for all but the cloud giants. For example, Google Translate targets a per-sentence end-to-end latency to avoid user-visible degradation of service. To meet the latency requirement, Google implements a monolithic in-house pipeline that uses fast intranet interconnects. Current FaaS users cannot express such a complex pipeline using modular DL inferences and achieve comparable latency.
Cloud computing, as the de-facto backbone of modern computing infrastructure, has to be able to enable this scenario in a cost-effective way. We envision a future FaaS infrastructure that avoids the network overhead — making it feasible to build complex latency sensitive pipelines with modular DL inference components — while better leveraging hardware resources. This enables the development of complex applications based on FaaS; e.g. users can build a personal assistant (similar to Amazon’s Alexa or Apple’s Siri) by employing off-the-shelf DL inference components and still achieve latency comparable to the complex monolithic applications of the cloud giants. To achieve this goal, we advocate collocating prediction pipelines with model serving within FaaS, effectively bringing the compute nearer to the model and circumventing the network latency.
The idea of collocating compute with data is not new and has been explored in other domains such as: databases and near memory acceleration. This paper does not deal with the mechanics of collocating compute and data, since they have been explored elsewhere [29, 76, 50, 36, 49, 37]. Instead, we tackle the challenge faced by collocating model serving and user code within FaaS — the current method of user code isolation incurs a high “cold start” latency for each invocation of the DL inference in the pipeline.
We observe that for “cold start” model inference, model loading (I/O, data structure deserialization, GPU data movement) is the main source of “cold start” latency. Figure 1 shows the “cold start” inference time breakdown for popular DL frameworks: Caffe, Caffe2, MXNet, and TensorFlow. For GPU inference, data movement is another contributing factor, making GPUs less attractive for accelerating inference — even though GPUs offer a significant compute speed advantage, as shown in Figure 1.
We also observe that in a cloud setting DL models are shared extensively across user FaaS pipelines. For example, Google reported that natural translation models can accommodate over of their translation requests in . Because model parameters are constant, we can leverage model sharing across pipelines by persisting model parameters in GPU and/or CPU memory, hence eliminating the model loading overhead, decreasing the end-to-end latency, and reducing the memory footprint for DL inferences.
In this paper, we propose a Transparent and Isolated Model Sharing (TrIMS) scheme to address the “cold start” latency challenge faced by collocating user code with model catalogs within FaaS — it does so while maintaining the isolation constraints, minimizing model-loading overhead, and increasing hardware resource utilization. We describe TrIMS’s model resource manager (MRM) which offers a multi-tiered cache for DL models to be shared across user pipelines. By decreasing model loading and data movement overhead, TrIMS decreases latency of end-to-end model inference, making inference on GPU a viable target. TrIMS also increases memory efficiency for cloud data centers while maintaining accuracy.
Specifically, this paper makes the following contributions:
We characterize the “cold start” overhead for online DL model inference across popular DL frameworks, such as Caffe, Caffe2, MXNet, and TensorFlow, on both CPUs and GPUs and identify model loading as the bottleneck.
We propose TrIMS to mitigate the model loading overhead faced by collocating user code with model catalogs within FaaS, and increase the model serving efficiency by sharing DL models across all levels of the memory hierarchy in the cloud environment — GPU, CPU, local storage, and remote storage. To our knowledge, this work is the first to propose sharing DL models across isolated online prediction pipelines while increasing hardware efficiency and decreasing latency.
We implement TrIMS within Apache MXNet and evaluate the impact on online inference performance for a representative set of models and systems. We show that TrIMS provides – speedup on small (less than 600MB) models and – speedup on large (up to 6GB) models, is within of the ideal speedup (with ideal being that model loading and data movement take no time — i.e. the same as persisting the model), and gives system throughput improvement without loss of accuracy.
TrIMS eliminates a substantial part of the non-compute components of the end-to-end latency, making DL model inference on GPUs and other novel compute accelerators more viable. We identify the remaining latency components for inference, motivating future microarchitecture techniques for further inference latency improvements.
We architect TrIMS so that it can be easily integrated with existing frameworks without user code changes. The method is designed to be compatible with existing framework usage patterns, and requires minimal modifications for framework developers.
The rest of this paper is organized as follows: Sections 2 and 3 describe current overheads and practice for inference serving. Sections 4 and 5 detail our design and implementation. Section 6 describes our evaluation setup and experiment results. Section 7 outlines related work. Section 8 concludes.
2 Deep Learning Inference Overhead
A single DL inference is much less computationally intensive than training, making it more sensitive to data loading and deserialization overhead. A DL inference compute graph is a DAG composed of a set of network layers. Each computational layer is parameterized through weights and constants. The model parameters along with the compute topology identify the model. (Throughout this paper, sharing a layer means that we are sharing both the weights and constants that parameterize the layer.) Each layer operator is a function of the incoming edges in the graph and the weights/constants. An inference pass iterates through the layers of a compute graph and applies the layer operators to its input. Figure 2 shows the inference compute graph for AlexNet and Table 1 lists the dimension and memory footprint for each layer.
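To make the layer iteration concrete, the following is a minimal, hypothetical sketch of an inference pass over a topologically ordered layer list — the toy operators and names are illustrative, not any framework's API:

```python
# Hypothetical sketch: an inference pass walks the compute graph in
# topological order, applying each layer operator to its input.
# Layer weights/constants are the constant, shareable part; activations
# (intermediate outputs) depend on the user's input and cannot be shared.

def inference_pass(layers, x):
    """layers: list of (operator, weights) pairs in topological order."""
    for op, weights in layers:
        x = op(x, weights)  # activation depends on x, so it is per-request
    return x

# Toy example: two "layers" that scale and shift a scalar input.
scale = (lambda x, w: x * w, 2.0)  # the weight 2.0 is a shareable constant
shift = (lambda x, w: x + w, 1.0)
print(inference_pass([scale, shift], 3.0))  # 3*2 + 1 = 7.0
```

A real compute graph is a DAG rather than a chain, but the same principle holds: only the `weights` part of each layer is constant across requests.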
For GPUs, the compute graph and associated weights are loaded and copied to GPU memory ahead of the computation. Memory for intermediate layer outputs also needs to be allocated. AlexNet, for example, requires of extra GPU memory to store the intermediate results during the inference process. These intermediate outputs are not constant and cannot be shared, since they depend on the user’s input. However, layer weights are constant and can be shared across processes. For AlexNet, this results in sharing of constant data.
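The split between shareable weights and per-request intermediate outputs can be illustrated with a small footprint estimate for a chain of dense layers. The dimensions and helper name below are illustrative, not AlexNet's actual layout:

```python
# Hypothetical sketch of the kind of arithmetic behind a memory estimate:
# propagate the input dimension through each layer and tally bytes for
# weights (constant, shareable across processes) and activations
# (per-request, not shareable).

def estimate_footprint(input_dim, layer_dims, batch=1, dtype_bytes=4):
    weights_bytes = 0
    activation_bytes = 0
    d = input_dim
    for out_dim in layer_dims:
        weights_bytes += d * out_dim * dtype_bytes        # weight matrix
        activation_bytes += batch * out_dim * dtype_bytes  # layer output
        d = out_dim
    return weights_bytes, activation_bytes

w, a = estimate_footprint(input_dim=4096, layer_dims=[4096, 1000])
print(w, a)  # weights dominate; they are the shareable constant part
```

For fully connected layers the weights dwarf the activations, which is why sharing constant data across processes recovers most of a model's memory footprint.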
When compute is optimized, the overhead of model loading is magnified.
Figure 1 shows that the GPU outperforms the CPU in terms of compute, thus making model loading a bottleneck for end-to-end inference. Without data transfer overhead, the NVIDIA Tesla V100 GPU using Tensor Cores can achieve higher throughput on CNNs and higher throughput on RNNs compared to a high-end CPU server. Reducing the data movement overhead makes the GPU a more appealing option for DL inference.
To mitigate the model loading overhead, cloud services and previous work [54, 21, 27] persist model catalogs in memory or perform inference in batches. These strategies require advance knowledge of the model requests, potentially waste resources — since system resources are persisted within processes for models even when they are not used — or increase request latency when batching inferences.
3 Current Prediction Pipelines in FaaS
Function as a Service (FaaS) is a cost-effective way for users to deploy functions or pipelines that are executed within the cloud. Users define prediction pipelines that use models they deployed or ones found within the model catalog. The pipelines are then mapped to a fabric of containers — used to maintain software stack separation, virtualize system resources, and provide isolation — that run on physical machines. Unlike traditional cloud execution, the functions executed in a FaaS are short lived and are priced on a per-invocation basis (with function execution time and resource utilization being the main cost factors). Because cloud providers use a per-API call and per-resource utilization price model, resource waste affects the cloud user’s total cost of ownership.
To motivate our work, we use an image-to-scene-description pipeline deployed within FaaS as an example — illustrated in Figure 3. The pipeline takes an image as input and outputs a textual description, leveraging a deployed AlexNet and an off-the-shelf scene understanding model from the cloud provider’s model catalog. Both the AlexNet model inference and the scene understanding API are called within User Function 3. Cloud providers then provision the function to run within a container on a cloud server. When user code is triggered, both the deployed AlexNet model and the scene understanding endpoints are called through HTTP REST API calls. Meeting the latency requirements for this application is challenging because of the multiple over-the-network requests.
To avoid the network latency, a common practice is to collocate the model within the deployed functions or the application pipelines. However, such embedding requires a copy of the model to be loaded privately for each function or application pipeline. For example, two functions on the same machine each have to load their own copy of AlexNet — wasting memory resources. The private loads also introduce latency overhead, since the model needs to be loaded on the first invocation of each function. Since isolation must be guaranteed in FaaS, the persistence schemes mentioned in Section 2 are not a solution. Similarly, batching does not apply for low latency inference.
In a cloud setting DL models are shared extensively across user functions — for example, between the user functions shown in Figure 3. Based on this observation, we propose TrIMS to eliminate such model loading overhead and hardware resource waste, while maintaining resource utilization efficiency and decreasing inference latency in user processes. TrIMS achieves this by folding “private copies” of the model into a shared copy under the hood. This is performed by decoupling the model persistence from the user-code execution — enabling model sharing, isolation, and low latency inference.
4 TrIMS Design
TrIMS consists of two components: a Model Resource Manager (MRM) server and framework clients. MRM manages the model resources resident in the system memory and abstracts away the model loading from framework clients. Each framework client communicates with MRM through inter-process communication (IPC), as shown in Figure 4. Since TrIMS follows the original DL framework’s API and semantics — returning the same data structures as the unmodified framework — user code can leverage TrIMS transparently without any code modification.
4.1 TrIMS Model Resource Manager (MRM)
TrIMS’s MRM is a model server daemon that performs model management and placement.
MRM maintains a database of models, addressing them using namespaces, with the framework as well as the model name and version being used to distinguish frameworks and models.
Figure 4 shows MRM managing models for the MXNet and Caffe2 DL frameworks as well as word vector embedding models for FastText and GloVe.
The MRM placement manager then maps the models into GPU memory, CPU memory, local storage, or cloud storage. The four levels are analogous to the traditional CPU cache hierarchy. Because of this, we will simply refer to these four levels of the memory hierarchy as “caches” in the rest of this paper whenever there is no ambiguity.
After a system cold boot, initial model requests miss the GPU, CPU, and local storage caches, causing the model to be downloaded from cloud storage and loaded into the “caches” to serve both the current request and future requests. When one of the caches becomes full, one or more models are evicted from the cache.
For inter-process communication, TrIMS uses gRPC to send and receive messages between the MRM and its clients. TrIMS leverages the CUDA runtime’s cudaIpc* functions to share GPU memory across processes. MRM abstracts away the model management, exposing two API functions to be used by clients: trims::open and trims::close, which load and close a model, respectively. MRM maintains a reference count for each model to determine the number of users currently using the shared model. The API is shown in Figure 5.
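As a rough illustration of the open/close bookkeeping, the sketch below models a namespaced, reference-counted model database. The class, method names, and the string "handle" are invented for the sketch — the real MRM returns CUDA IPC or shared-memory handles over gRPC:

```python
# Hypothetical sketch of MRM bookkeeping for trims::open / trims::close:
# models are keyed by a (framework, name, version) namespace, and each
# entry carries a reference count of the clients sharing it.

class ModelResourceManager:
    def __init__(self):
        # (framework, name, version) -> {"refcount": int, "handle": str}
        self.db = {}

    def open(self, framework, name, version):
        key = (framework, name, version)
        entry = self.db.get(key)
        if entry is None:  # first load: materialize and share the model
            entry = {"refcount": 0,
                     "handle": f"shm://{framework}/{name}/{version}"}
            self.db[key] = entry
        entry["refcount"] += 1   # one more client uses the shared copy
        return entry["handle"]   # stand-in for a CUDA IPC / shm handle

    def close(self, framework, name, version):
        entry = self.db[(framework, name, version)]
        entry["refcount"] -= 1   # model stays cached even at refcount 0
        return entry["refcount"]

mrm = ModelResourceManager()
h1 = mrm.open("mxnet", "alexnet", "1.0")
h2 = mrm.open("mxnet", "alexnet", "1.0")  # a second pipeline shares the copy
print(h1 == h2)          # both clients receive the same shared handle
print(mrm.close("mxnet", "alexnet", "1.0"))  # 1 — one client still attached
```

The reference count is what later lets the eviction policy distinguish idle cached models from models pinned by running inferences.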
4.1.1 Loading Models
When loading a model, MRM performs shape inference on the model to estimate its memory footprint when running on GPU. Shape inference is a simple arithmetic computation performed by any framework to determine the amount of internal memory to allocate for a model. After shape inference, MRM follows the state diagram shown in Figure 7 and needs to handle three cases:
GPU cache hit — Model is persistent in GPU memory
MRM increments the model’s reference count and creates a shared memory handle from the device memory owned by MRM. The handle is then returned to the framework client. Model eviction is triggered when the intermediate results for a model are greater than the available free memory.
GPU cache miss / CPU cache hit — model is persistent in CPU memory
The server queries the current memory utilization of the GPU to see if the model can be copied to GPU memory. If it can, then GPU memory is allocated and copied; if not, then some memory needs to be reclaimed — entering the memory reclamation procedure.
CPU and GPU cache miss — model is not persistent in memory
If the data is not on local storage, then MRM downloads the model from the cloud. If the data is on disk, then MRM loads it from disk using the framework’s serializer. Pinned memory is allocated on the CPU and the model weights are copied to it. MRM then follows the same logic as when the data is persistent in CPU memory.
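The three cases above can be sketched as a single lookup cascading through the cache tiers. The tier containers and helper names below are illustrative; the real MRM also performs shape inference, pinned-memory allocation, and memory reclamation along this path:

```python
# Hypothetical sketch of MRM's load path across the cache tiers:
# GPU cache -> CPU cache -> local disk -> cloud storage.

def load_model(name, gpu_cache, cpu_cache, disk, download):
    path = []                                  # records which tiers we touch
    if name in gpu_cache:                      # case 1: GPU cache hit
        path.append("gpu-hit")
        return gpu_cache[name], path
    if name not in cpu_cache:                  # case 3: CPU and GPU miss
        if name not in disk:
            disk[name] = download(name)        # fetch from cloud storage
            path.append("download")
        cpu_cache[name] = disk[name]           # deserialize into CPU memory
        path.append("cpu-fill")
    else:                                      # case 2: CPU cache hit
        path.append("cpu-hit")
    gpu_cache[name] = cpu_cache[name]          # copy weights up to the GPU
    path.append("gpu-fill")
    return gpu_cache[name], path

gpu, cpu, disk = {}, {}, {}
_, p1 = load_model("alexnet", gpu, cpu, disk, lambda n: f"weights:{n}")
_, p2 = load_model("alexnet", gpu, cpu, disk, lambda n: f"weights:{n}")
print(p1)  # first request misses everywhere and fills every tier
print(p2)  # subsequent requests hit the GPU cache directly
```

Only the first request pays the full download/deserialize/copy cost; every later request from any pipeline resolves at the GPU tier.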
4.1.2 Reclaiming Memory and Evicting Models
Memory reclamation is performed when the memory space for MRM at a specific cache level is full. Which model to evict to reclaim memory is determined by the eviction policy. TrIMS supports a pluggable set of common eviction policies such as least recently used (LRU) and least commonly used (LCU). For the CPU and GPU level caches, one needs to make sure that eviction does not interfere with user code. Models within the MRM database are not candidates for reclamation while they are in use; i.e. while the reference count of a model is non-zero. Evicting a model that is currently being used (effectively freeing GPU memory that is in use) would cause undefined behavior in the user’s code.
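A minimal sketch of reference-count-aware LRU victim selection follows; the function and container names are illustrative, not TrIMS's implementation:

```python
# Hypothetical sketch: pick the least-recently-used model whose reference
# count is zero. Models still in use are never evicted, since freeing GPU
# memory under a running inference would cause undefined behavior.

from collections import OrderedDict

def pick_victim(lru, refcounts):
    """lru: OrderedDict of model -> size, ordered oldest first."""
    for model in lru:                      # least recently used first
        if refcounts.get(model, 0) == 0:
            return model                   # idle: safe to evict
    return None                            # everything is pinned by users

lru = OrderedDict([("resnet", 200), ("alexnet", 240), ("vgg16", 500)])
refcounts = {"resnet": 2, "alexnet": 0, "vgg16": 1}
print(pick_victim(lru, refcounts))  # alexnet: the oldest unpinned model
```

Note that the oldest model (resnet) is skipped because two clients still hold it open; the policy only considers idle entries.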
4.1.3 Unloading Models
When a TrIMS framework client unloads a model (or the user process exits), a model unload request is sent to MRM. MRM looks up the model in the database and decrements its reference count. By default, MRM does not free resources for models that have a zero reference count (i.e. are not currently used), but MRM can be configured to eagerly reclaim these models.
4.2 TrIMS Frameworks
MRM can handle requests from multiple TrIMS-enabled frameworks, managing their weights (which have different data layouts) in separate namespaces. As shown in Figure 5, when a TrIMS framework client performs a model load request, the framework’s name and version are sent along with the request. The server can then perform the model unmarshaling from disk using the format supported by that framework.
To enable TrIMS in a framework, the functions to load and unload models need to be modified to perform requests to MRM. Since each framework may have its own serialization format, support for the model format — to enable unmarshaling the data from disk to memory — needs to be added to MRM. With these changes, any type of network supported by the framework (CNN, RNN, Word2Vec, etc.) and any compute pattern is automatically supported by TrIMS.
User application rewriting overhead
— Since MRM does not modify the framework’s API, code that is linked with a TrIMS-enabled framework does not require any change. TrIMS works within Python, Java, or R. This is an attractive feature, since the benefits of TrIMS can be leveraged by the cloud provider transparently to the user.
TrIMS supports fixed-size block, layer, and model level sharing granularity. Sub-model sharing granularity is interesting when considering sharing layers or memory across models. For example, models trained using transfer learning share the frozen layer weights. Block level granularity can also be used to share fixed-size buffers.
Multi-GPU and Multi-Node Support
— Multiple GPUs are usually used when performing batched inference [17, 21]. TrIMS inherently supports multiple GPUs by leveraging Unified Memory (UM). Multi-GPU sharing can also be supported without relying on UM by having the TrIMS framework client query the device ID of the current GPU context when a model is loaded. The framework client then sends the device ID along with the request, and TrIMS MRM loads the model into the GPU with that device ID. When a request loads a model on one GPU and the requested model is persistent on another GPU, MRM performs a GPU peer-to-peer memory copy if supported.
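The device-ID-based routing just described might be sketched as follows; the helper names are invented, and real peer-to-peer copies happen through the CUDA runtime rather than a dictionary assignment:

```python
# Hypothetical sketch of multi-GPU placement without Unified Memory:
# the client reports its GPU context's device ID, and the GPU cache is
# keyed per device. If the model is resident on another GPU, a
# peer-to-peer copy stands in for the full CPU/disk load path.

def open_on_device(gpu_caches, device, model, load_fn):
    if model in gpu_caches.setdefault(device, {}):
        return "hit"                                   # already on this GPU
    for other, cache in gpu_caches.items():
        if other != device and model in cache:
            gpu_caches[device][model] = cache[model]   # peer-to-peer copy
            return "p2p-copy"
    gpu_caches[device][model] = load_fn(model)         # full load path
    return "load"

caches = {}
print(open_on_device(caches, 0, "alexnet", lambda m: "weights"))  # load
print(open_on_device(caches, 1, "alexnet", lambda m: "weights"))  # p2p-copy
print(open_on_device(caches, 1, "alexnet", lambda m: "weights"))  # hit
```

The peer-to-peer branch matters because a GPU-to-GPU copy is far cheaper than re-reading and re-deserializing the model from CPU memory or disk.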
Multiple independent instances of TrIMS MRM can be run for multi-node support, and off-the-shelf task scheduling and load balancing middleware can be used to route and load-balance inference requests. TrIMS can be set up to advertise to the load balancer the models that users have already loaded and the current system load.
4.3 Inference Isolation and Fairness
To enable seamless container isolation, TrIMS provides a Docker  volume plugin that allows service providers to provision the container with a communication link to the TrIMS MRM. The TrIMS MRM process runs in the host system with a link for frameworks to communicate with it across container boundaries. Figure 6 shows how untrusted user code can be run on a multi-tenant system while maintaining isolation. The code shows how users can use DL models, provided by the cloud provider, to create an image to audio pipeline. The user uses the cloud provided vision, text, and audio models via a library that is part of a model catalog. All user code executes within a container that communicates with the MRM via the container’s IPC mechanism.
5 Implementation

The experiments reported in this paper are based on an implementation of TrIMS on top of Apache MXNet — a popular machine learning framework. (The source code for TrIMS is open source and can be found at http://github.com/REMOVED/DURING/REVIEW.) The TrIMS MRM includes serialization code from MXNet to unmarshal MXNet models from disk. We also modify the MXNet framework to integrate it with TrIMS — keeping the MXNet APIs unchanged. Communication between the MXNet framework client and the MRM uses Google’s gRPC with the packets encoded using Protocol Buffers.
To validate the efficiency and generality of our proposal, we follow a few principles throughout our implementation — even if disregarding some would have given us better speedup:
— The implementation needs to work with the existing framework’s code base and language bindings, i.e. we should be able to run preexisting MXNet codes written in Python or Scala with no modifications.
Simple and Minimal
— The implementation needs to be simple and modify the framework code as little as possible. Our modifications add only lines of code (less than of the MXNet code base) to the framework ( lines for the server and lines for the client) and are self contained.
— The implementation has knobs to tweak everything from the eviction strategy of memory sharing, the amount of memory that can be used, whether to enable TrIMS, the levels of cache to enable, etc…
Fast, Concurrent and Scalable
— We communicate using gRPC and use efficient data structures for the MRM database to make the serving fast and concurrent. The memory sharing strategy in TrIMS is scalable and can handle a large amount of load.
5.1 TrIMS Apache MXNet Framework
We implement TrIMS on top of the Apache MXNet framework client by modifying the MXPredCreate and MXPredFree in the MXNet C predict API’s implementations. When TrIMS is enabled, trims::open and trims::close are called as part of the predictor creation and deletion. Listing 1 shows the main modification to the original MXNet code.
Like most open-source DL frameworks, MXNet is optimized for training rather than inference. We apply a set of optimizations to the original MXNet to improve inference latency. The optimizations avoid eager initialization of CUDA resources, remove cuDNN algorithm selection for backward propagation, and simplify random resource generation. With our optimizations, MXNet is faster for inference on average than vanilla MXNet for the suite of models we use. We use the modified MXNet as our baseline for evaluation.
5.2 GPU Memory Sharing
We perform GPU memory sharing using CUDA’s cudaIpc* runtime functions. For pre-Volta GPUs, the CUDA IPC mechanism utilizes CUDA MPS — an intermediate user process where the memory allocations are performed. This means that all CUDA operations end up serialized and executed within the same CUDA MPS context — enabling different processes to share the same GPU virtual address space (VAS). For Volta GPUs, NVIDIA introduced a new feature that allows contexts to share page-table mappings, making it possible for user processes to run in different contexts while still sharing memory. As of CUDA 9.2, CUDA MPS is still invoked to keep shared allocations and communicate across them, but, with the exception of a handful of functions, most CUDA operations are performed without IPC communication.
Because sharing may serialize through CUDA MPS, one slight disadvantage of the CUDA IPC functions is that they have a measurable overhead, which can become a bottleneck. When sharing models at layer granularity, networks with a large number of layers, such as ResNet269-v2, incur high overhead. We remedy this by sharing at a per-group-of-layers or whole-model granularity.
The CUDA IPC overhead is measurable, and we can statically quantify whether using TrIMS is beneficial using the empirical formula T = M/B − N(c_s + c_g), where N is the number of objects to share (when the sharing granularity is at the model level, this value is 1; when the granularity is at the layer level, this value is the number of layers); c_s is the overhead of sharing CUDA memory via CUDA IPC and c_g is the overhead of obtaining a CUDA device pointer from a shared CUDA IPC handle; M is the number of bytes the model occupies on disk; and B is the disk I/O bandwidth. These constants can be computed once at system startup and cached to be used by TrIMS. If T is positive, then its magnitude is correlated with the speedup one gets by using TrIMS. This formula can be used within the TrIMS framework to determine at runtime whether to share a model and at what granularity.
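Under the symbol names assumed in the text (M, B, N, c_s, c_g — the original symbols were lost in extraction), the heuristic can be expressed directly. The numeric constants in the example are illustrative, not measured values from the paper:

```python
# Sketch of the sharing-benefit heuristic: time saved by skipping the
# disk load (M / B) minus the per-object CUDA IPC costs (N * (c_s + c_g)).
# The IPC constants would be measured once at system startup and cached.

def sharing_benefit(model_bytes, disk_bw, num_objects, c_share, c_open):
    """Positive result => sharing via TrIMS beats loading from disk."""
    return model_bytes / disk_bw - num_objects * (c_share + c_open)

# Illustrative numbers: a 600 MB model over a ~200 MB/s disk, shared at
# model granularity (N = 1) with sub-millisecond IPC costs.
t = sharing_benefit(600e6, 200e6, num_objects=1, c_share=5e-4, c_open=5e-4)
print(t > 0)  # True: seconds of disk I/O saved vs ~1 ms of IPC overhead
```

The same function shows why layer granularity can lose for tiny models with many layers: N grows while M/B shrinks, driving T negative.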
| Name     | CPU                         | GPU             | Memory | GPU Memory | Cached Reads | Buffered Disk Reads |
|----------|-----------------------------|-----------------|--------|------------|--------------|---------------------|
| System 1 | Intel Core i9-7900X         | TITAN Xp P110   | 32 GB  | 12 GB      | 8 GB/sec     | 193.30 MB/sec       |
| System 2 | Intel Xeon E5-2698 v4       | Tesla V100-PCIE | 256 GB | 16 GB      | 10 GB/sec    | 421.30 MB/sec       |
| System 3 | IBM S822LC Power8 w/ NVLink | Tesla P100-SXM2 | 512 GB | 16 GB      | 27 GB/sec    | 521.32 MB/sec       |
We evaluate TrIMS on 3 systems (shown in Table 2) using 37 pre-trained small models (shown in Table 3) and large models (shown in Table 4). The systems selected represent different types of instances that are currently provisioned in the cloud. System 3 uses the NVLink bus [32, 63], which allows up to transfer between CPU and GPU. System 3 is used as a proxy for understanding our proposed method’s behavior on high end cloud instances and next generation interconnects currently being deployed on HPC and cloud systems [68, 64]. Multi-GPU results are similar to the single-GPU results shown below and for simplicity are omitted.
We use image processing models as a representative workload because these are currently the most plentiful in FaaS pipelines. TrIMS is agnostic to the compute patterns of a network, and the analysis applies to other types of networks such as: RNNs, word embeddings, or matrix factorization. The pre-trained image processing models, shown in Table 3, were selected based on their popularity in both research and usage. Some of the networks have variants; these are used to simulate user trained models — the same network structure can have different weights. Large models are used to show how TrIMS performs with increasing model sizes.
Throughout this section we compare our performance within a FaaS setting against the ideal case (where model loading and data movement take no time — i.e. the same as persisting the model) and use end-to-end “cold-start” inference as the baseline, since that is what is currently employed by FaaS environments.
6.1 Latency Improvement
We measure the end-to-end “cold-start” inference of MXNet with and without TrIMS – for the sake of clarity we omit the input processing time. Figure 8 shows the achieved speedup on a representative set of the models compared against MXNet that does not utilize TrIMS. We show two cases: (a) our best case (when there is a GPU cache hit) and (b) our worst case (when the cache misses both the CPU and GPU).
For the best case analysis (Figure 8a), the server needs to create the CUDA IPC handles and the framework client needs to embed the GPU device pointers within the framework’s container. This introduces a slight overhead, however it is within of the ideal — ideal being defined as the inference time with model loading and deserialization times set to zero. We see that the latency speedup improves proportionally to the model size, the system’s data movement bandwidth, the system’s compute resources, and the model’s compute complexity.
For small models, where the I/O overhead is very low, for example SqueezeNet (which has a memory footprint), we observe only marginal speedup (). These models are designed to have a small footprint — targeting edge devices — and are rarely used within the cloud. For state-of-the-art networks, such as VGG16-SOD, we observe speedup on System 1. Even with fast disk and the NVLink interconnect, which mitigates I/O overhead by offering greater data movement bandwidth, System 3 achieves speedup for VGG16-SOD.
For the worst case analysis (Figure 8b), the MRM needs to load the data from disk, persist the model on the CPU, copy the data to the GPU, and send the GPU memory handles to the client. Although we get a slowdown, this case assumes there is no model sharing across pipelines, and is therefore uncommon in a cloud setting.
6.2 Speedup Breakdown
To understand where the new bottlenecks are for inference using TrIMS, we look at System 3, where we achieve the lowest speedup, and measure the (a) time to perform the inference computation, (b) time to initialize the model (this includes copying the data to the GPU when not using TrIMS), (c) model loading from disk, and (d) model sharing overhead introduced by TrIMS. As can be seen in Figure 9, without TrIMS an average of of the time is spent loading and initializing the model while only is spent performing computation. With TrIMS we eliminate the model loading from disk and remove the need to perform memory copies to the GPU. Even though we introduce overhead, we still gain a geometric mean speedup.
6.3 Large Model Evaluation
We evaluate our method using large models, which are common for medical image analysis, NLP, and time series modeling. We generated the large models by starting with the regular AlexNet and VGG16 networks, keeping their compute graph topology, and rescaling the input dimensions to generate enlarged models. Table 4 shows the models selected for evaluation, their memory footprints, and their input sizes.
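As a sketch of how rescaling input dimensions enlarges such a network, the function below estimates the parameter count of a VGG16-like model; the layer shapes are illustrative, not the exact models of Table 4. The convolutional layers are input-size agnostic, but the first fully connected layer grows with the spatial size of the final feature map.

```python
# Sketch: parameter count of a VGG16-like network as input dimensions grow.
# Layer shapes are illustrative, not the exact enlarged models in Table 4.

def vgg16_like_params(input_hw):
    conv_params = 14_710_464           # conv-layer weights, fixed w.r.t. input size
    pooled = input_hw // 32            # five 2x2 poolings shrink H and W by 32x
    fc1 = (512 * pooled * pooled) * 4096   # first FC layer grows quadratically
    fc_rest = 4096 * 4096 + 4096 * 1000    # remaining FC layers, fixed
    return conv_params + fc1 + fc_rest

base = vgg16_like_params(224)
large = vgg16_like_params(448)   # doubling input dims roughly quadruples fc1
```

Because the first fully connected layer dominates, doubling the input dimensions roughly quadruples its weights, which is why the enlarged models quickly exceed single-inference GPU memory budgets.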
Figure 10 shows that by removing the model loading overhead, inference on large models becomes compute bound, which gives an advantage to faster GPUs. This is why System 1 achieves less speedup than System 2 for the more compute-intensive VGG16 network (for example, for model 7), since model inference computation accounts for of the time on System 1 and on System 2. We expect this to be a more pronounced bottleneck for lower-end GPUs and less of an issue for specialized low-latency inference accelerators.
We also observe that TrIMS increases the memory efficiency of the GPU. Without TrIMS, two inferences using model 8 cannot run concurrently, since together they overrun the GPU memory. TrIMS avoids multiple private copies of the model on the same machine, enabling concurrent runs of large models.
6.4 Workload Modeling
Finally, we perform workload modeling to understand the behavior of TrIMS on a multi-tenant, oversubscribed system. The workload is drawn from the 37 small models shown in Table 3 following a Pareto distribution. Since the models cannot all be resident on the GPU at the same time (in total they have a GPU memory footprint), the TrIMS MRM needs to exercise its model reclamation and eviction procedure. Because of limited space, we only present results for the LRU eviction strategy, but our observations hold for other eviction strategies.
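The reclamation and eviction procedure can be pictured as an LRU cache over GPU memory. The class below is a minimal illustration under a hypothetical memory budget and hypothetical model sizes; it is not TrIMS's actual MRM data structure.

```python
from collections import OrderedDict

# Minimal LRU model cache with a fixed GPU-memory budget, in the spirit of
# the MRM's reclamation/eviction procedure. Budget and sizes are illustrative.

class ModelCache:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.cache = OrderedDict()   # model_id -> size, least recently used first

    def request(self, model_id, size):
        if model_id in self.cache:
            self.cache.move_to_end(model_id)   # refresh recency on a GPU cache hit
            return "hit"
        # Evict least-recently-used models until the new model fits.
        while self.used + size > self.budget and self.cache:
            _, evicted_size = self.cache.popitem(last=False)
            self.used -= evicted_size
        self.cache[model_id] = size
        self.used += size
        return "miss"

mrm = ModelCache(budget_bytes=1000)
mrm.request("alexnet", 600)   # miss: loaded into GPU memory
mrm.request("vgg", 600)       # miss: evicts alexnet to make room
status = mrm.request("alexnet", 600)   # miss again: it was evicted
```

The final request illustrates the latency penalty discussed below: once an oversubscribed workload forces an eviction, the next use of the evicted model pays the reload cost again.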
Figure 11 shows the level iso-efficiency curves for the geometric mean speedup (we measure the speedup using the geometric mean across the latency speedup of each model) as we vary the concurrency level and the number of models to run. We can see that even in an oversubscribed setting, we can still service clients concurrently and reduce the overall batch execution time (by up to ), while incurring only a latency penalty for each request. This slowdown is due to the cost of evicting models to accommodate the larger memory footprint, causing subsequent usage of those models to miss the GPU cache.
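The aggregate metric used here, the geometric mean of the per-model latency speedups, can be computed as follows; the sample speedups are hypothetical, chosen only to show that one outlier model does not dominate the aggregate as it would under an arithmetic mean.

```python
from math import prod

# Geometric mean of per-model latency speedups, the aggregate reported above.
# The sample values are hypothetical, not measured results.

def geomean(xs):
    return prod(xs) ** (1.0 / len(xs))

per_model_speedup = [1.1, 2.0, 8.5, 3.2]   # hypothetical latency speedups
overall = geomean(per_model_speedup)        # between min and max, damping outliers
```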
For all three systems, we observe an over-subscription sweet spot, where the percentage of models and the number of concurrent requests can be increased while the batch execution speedup is preserved at . All systems show a sweet spot when of the models are actively being requested. For Systems 1 and 3, the number of concurrent requests can be increased to 4, while for System 2 it improves to 6. The difference in the over-subscription sweet spot can be explained by the systems' different compute capabilities: Systems 1 and 3 are provisioned with Pascal-generation GPUs, while System 2 has the newer Volta generation. Essentially, because we successfully move the inference bottleneck from I/O to compute, the sweet spot is determined by the available compute resources. In practice, cloud providers can perform a sensitivity analysis to determine the number of models hosted on each server and the number of concurrent requests to service, based on the service's target requirements.
By removing the model loading overhead, our speedup is bounded by the framework's inference pipeline; frameworks that are optimized for inference garner greater benefits. For older-generation or lower-end GPUs, compute would likely dominate inference. Therefore, if cloud providers are only interested in maintaining latency, they can utilize these older or lower-end GPUs, which have a lower initial cost of ownership.
7 Related Work
Recent related work has explored techniques to enable model serving at cloud scale. TensorFlow-Serving provides soft model isolation to guard against concurrently running requests interfering with each other's performance. TFX uses a dedicated thread pool to hide model-loading overhead and provide thread-level user isolation. Clipper combines concurrent streams of DL requests into batches to better utilize the GPU, at the cost of longer latency. All of these techniques suffer from an inability to provide user isolation or to handle low-latency inference.
Recent work [56, 44, 55, 31] leverages CUDA IPC to improve various intra-node and inter-node MPI collectives within a single process/application, and thus facilitates porting HPC applications to GPUs and improving their performance. MVAPICH2, for instance, supports MPI calls directly over GPU memory. Unlike these works, TrIMS leverages CUDA IPC to persist data structures across processes, actively seeking to reduce I/O and memory footprint rather than to coordinate multiple GPUs.
To reduce DL model inference latency and memory requirements, a large body of recent work has focused on compacting and accelerating convolutional neural networks (CNNs). Quantization [34, 70, 66, 39, 26] reduces the number of bits required to represent each weight by rescaling the weights to a domain smaller than the 32 bits required for floating-point representation (usually 8 bits). Network pruning and sharing [59, 39, 23, 77] remove redundant parameters that are not sensitive to performance. Although these model optimization techniques can make the I/O-versus-compute problem less severe, they have drawbacks and a limited application scope. Quantized model inference suffers from accuracy loss, while pruned networks can significantly increase computation intensity due to sparsity, especially on GPUs. Moreover, these techniques currently only work with convolutional or fully connected layers, and do not apply to other types of model inference, such as the embedding layers used in word embedding. Model optimizations could also leverage TrIMS, enabling the sharing of optimized CNN models.
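As an illustration of the quantization idea, the sketch below linearly rescales float32 weights into the 8-bit domain and dequantizes them back. This is a minimal symmetric scheme for illustration; production quantizers add per-channel scales, zero points, and calibration.

```python
# Minimal sketch of symmetric linear 8-bit weight quantization: rescale
# float weights into [-127, 127], then dequantize for use. Illustrative only.

def quantize(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]   # int8-representable values
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.003, 1.0]
q, s = quantize(w)
w_hat = dequantize(q, s)
# w_hat approximates w; storage drops from 32 bits to 8 bits per weight,
# at the cost of a bounded rounding error of at most scale/2 per weight.
```

The accuracy loss mentioned above comes precisely from this rounding: weights much smaller than the scale (like the 0.003 above) collapse toward zero.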
Although various CPU/GPU virtualization techniques [52, 67, 30] and GPU multi-tenancy [28, 57, 72, 20] can improve system utilization and throughput through time sharing or parallel sharing of the CPU or GPU, they do not address the model loading overhead within inference processes. TrIMS is orthogonal to these techniques and can be integrated into containers as a plugin. Also, in the same way that NVIDIA Volta added the capability of effectively sharing memory across user processes without the need for a proxy process (the CUDA MPS server), the ability to share memory across different VMs (using a third level of virtual memory translation, as CPUs do) would enable TrIMS to work across VMs.
8 Conclusion

Collocating compute with model serving within FaaS overcomes the network barrier but suffers from high "cold start" latency. We propose TrIMS to mitigate the major source of "cold start" latency — the model loading overhead — and make building complex latency-sensitive pipelines with modular DL components feasible. We do so by decoupling compute from model persistence and leveraging model sharing across user pipelines. TrIMS moves the bottleneck of DL model inference to compute, thus making GPU acceleration more appealing and making specialized novel inference architectures more tractable.
TrIMS was evaluated on three systems that represent current cloud offerings. We used 45 DL models and showed a speedup of up to for small models and up to for large models. When running concurrent inference, we increased the overall system throughput by up to . Our methodology, when applied to DL frameworks, offers advantages to both cloud providers and users. The isolation, along with the significant memory reduction through model sharing, enables cloud providers to over-provision hardware resources, thus decreasing the total cost of ownership. The benefits of TrIMS to cloud providers can be passed down to users in the form of reduced latency or cost of inference.
TrIMS is a generic memory sharing technique that can be used when a computation requires a large number of constant parameters to be in situ on the CPU or GPU, while still maintaining isolation between users. As such, the proposed method generalizes to any application or algorithm that spans multiple processes and requires a large amount of constant data. While we motivated our work with deep learning, other types of applications, such as image processing, physical simulation, or in-memory databases, can benefit from our approach.
References

-  Amazon Lambda. http://aws.amazon.com/lambda. Accessed: 2018-8-04.
-  Amazon Rekognition. https://aws.amazon.com/rekognition. Accessed: 2018-8-04.
-  Amazon SageMaker. https://aws.amazon.com/machine-learning. Accessed: 2018-8-04.
-  Azure Cognitive Services. https://azure.microsoft.com/en-us/services/cognitive-services. Accessed: 2018-8-04.
-  Azure Functions. https://azure.microsoft.com/en-us/services/functions. Accessed: 2018-8-04.
-  CUDA Unified Memory. https://devblogs.nvidia.com/tag/unified-memory. Accessed: 2018-8-04.
-  Google Cloud AI. https://cloud.google.com/products/machine-learning. Accessed: 2018-8-04.
-  Google Cloud Functions. https://cloud.google.com/functions. Accessed: 2018-8-04.
-  Google Protocol Buffers. https://developers.google.com/protocol-buffers. Accessed: 2018-8-04.
-  IBM OpenWhisk. http://www.ibm.com/cloud-computing/bluemix/openwhisk. Accessed: 2018-8-04.
-  IBM Watson. https://www.ibm.com/watson. Accessed: 2018-8-04.
-  Machine Learning on AWS. https://aws.amazon.com/machine-learning. Accessed: 2018-8-04.
-  Node allocation for online prediction. https://cloud.google.com/ml-engine/docs/tensorflow/prediction-overview#node-allocation. Accessed: 2018-8-04.
-  Nvidia Inference Technical Overview. https://images.nvidia.com/content/pdf/inference-technical-overview.pdf. Accessed: 2018-8-04.
-  Online versus Batch Prediction. https://cloud.google.com/ml-engine/docs/tensorflow/online-vs-batch-prediction. Accessed: 2018-8-04.
-  Requesting Real-time Predictions. https://docs.aws.amazon.com/machine-learning/latest/dg/requesting-real-time-predictions.html. Accessed: 2018-8-04.
-  TensorFlow Serving. https://www.tensorflow.org/serving. Accessed: 2018-8-04.
-  Using the Model to Make Predictions. https://docs.aws.amazon.com/machine-learning/latest/dg/using-the-model-to-make-predictions.html. Accessed: 2018-8-04.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.
-  Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J Rossbach, and Onur Mutlu. Mask: Redesigning the gpu memory hierarchy to support multi-application concurrency. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 503–518. ACM, 2018.
-  Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, et al. Tfx: A tensorflow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1387–1395. ACM, 2017.
-  Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
-  Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.
-  Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In Advances in Neural Information Processing Systems, pages 4470–4478, 2017.
-  François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, 2016.
-  Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
-  Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. Clipper: A low-latency online prediction serving system. In NSDI, pages 613–627, 2017.
-  CUDA Multi-Process Service (MPS). https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf. Accessed: 2017-3-30.
-  Thomas W Dinsmore. In-memory analytics. In Disruptive Analytics, pages 97–116. Springer, 2016.
-  Jose Duato, Antonio J Pena, Federico Silla, Juan C Fernandez, Rafael Mayo, and Enrique S Quintana-Orti. Enabling cuda acceleration within virtual machines using rcuda. In High Performance Computing (HiPC), 2011 18th International Conference on, pages 1–10. IEEE, 2011.
-  Iman Faraji and Ahmad Afsahi. Hyper-q aware intranode mpi collectives on the gpu. In Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware, pages 47–50. ACM, 2015.
-  Denis Foley and John Danskin. Ultra-performance pascal gpu and nvlink interconnect. IEEE Micro, 37(2):7–17, 2017.
-  Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
-  Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
-  Google Translate: Breaking language barriers in emerging markets. https://goo.gl/TkffQq. Accessed: 2017-18-04.
-  Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. Do the hard stuff first: Scheduling dependent computations in data-analytics clusters. arXiv preprint arXiv:1604.07371, 2016.
-  Robert Grandl, Arjun Singhvi, and Aditya Akella. F2: Separating compute from data in cluster computing. arXiv preprint arXiv:1703.10272, 2017.
-  gRPC. https://www.grpc.io. Accessed: 2018-8-04.
-  Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Maurice Herlihy, Nir Shavit, and Moran Tzafrir. Hopscotch hashing. In International Symposium on Distributed Computing, pages 350–364. Springer, 2008.
-  Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Feng Ji, Ashwin M Aji, James Dinan, Darius Buntinas, Pavan Balaji, Rajeev Thakur, Wu-chun Feng, and Xiaosong Ma. Dma-assisted, intranode communication in gpu accelerated systems. In High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on, pages 461–468. IEEE, 2012.
-  Yangqing Jia. Caffe2. https://www.caffe2.ai, 2017.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
-  Ji Liu, Esther Pacitti, and Patrick Valduriez. A survey of scheduling frameworks in big data systems. International Journal of Cloud Computing, pages 1–27, 2017.
-  Lei Lu, Hui Zhang, Evgenia Smirni, Guofei Jiang, and Kenji Yoshihira. Predictive vm consolidation on multiple resources: Beyond load balancing. In Quality of Service (IWQoS), 2013 IEEE/ACM 21st International Symposium on, pages 1–10. IEEE, 2013.
-  Dirk Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239):2, 2014.
-  Roberto Morabito, Jimmy Kjällman, and Miika Komu. Hypervisors vs. lightweight virtualization: a performance comparison. In Cloud Engineering (IC2E), 2015 IEEE International Conference on, pages 386–393. IEEE, 2015.
-  MVAPICH2. http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2. Accessed: 2017-3-30.
-  Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen, Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan Soyke. Tensorflow-serving: Flexible, high-performance ml serving. arXiv preprint arXiv:1712.06139, 2017.
-  Antonio J Pena and Sadaf R Alam. Evaluation of inter-and intra-node data transfer efficiencies between gpu devices and their impact on scalable applications. In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, pages 144–151. IEEE, 2013.
-  Sreeram Potluri, Hao Wang, Devendar Bureddy, Ashish Kumar Singh, Carlos Rosales, and Dhabaleswar K Panda. Optimizing mpi communication on multi-gpu systems using cuda inter-process communication. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, pages 1848–1857. IEEE, 2012.
-  Dipanjan Sengupta, Raghavendra Belapure, and Karsten Schwan. Multi-tenancy on gpgpu-based servers. In Proceedings of the 7th international workshop on Virtualization technologies in distributed computing, pages 3–10. ACM, 2013.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
-  Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
-  Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
-  Nathan R Tallent, Nitin A Gawande, Charles Siegel, Abhinav Vishnu, and Adolfy Hoisie. Evaluating on-node gpu interconnects for deep learning workloads. In International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, pages 3–21. Springer, 2017.
-  Arnold Tharrington, Wael R Elwasif, and Don Maxwell. Experiences evaluating functionality and performance of ibm power8+ systems. In High Performance Computing: ISC High Performance 2017 International Workshops, DRBSD, ExaComm, HCPM, HPC-IODC, IWOPH, IXPUG, P^ 3MA, VHPC, Visualization at Scale, WOPSSS, Frankfurt, Germany, June 18-22, 2017, Revised Selected Papers, volume 10524, page 254. Springer, 2017.
-  Lisa Torrey and Jude Shavlik. Transfer learning. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, 1:242, 2009.
-  Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4. Citeseer, 2011.
-  Virtual gpu. https://www.nvidia.com/en-us/design-visualization/technologies/virtual-gpu. Accessed: 2017-3-30.
-  RL Vogt, PR Kotta, and CN Meissner. Science and technology review march 2017. Technical report, Lawrence Livermore National Laboratory (LLNL), Livermore, CA, 2017.
-  Tobias Weyand, Ilya Kostrikov, and James Philbin. Planet-photo geolocation with convolutional neural networks. In European Conference on Computer Vision, pages 37–55. Springer, 2016.
-  Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
-  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
-  Tsung Tai Yeh, Amit Sabne, Putt Sakdhnagool, Rudolf Eigenmann, and Timothy G Rogers. Pagoda: Fine-grained gpu resource virtualization for narrow tasks. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 221–234. ACM, 2017.
-  Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
-  Jianming Zhang, Shugao Ma, Mehrnoosh Sameki, Stan Sclaroff, Margrit Betke, Zhe Lin, Xiaohui Shen, Brian Price, and Radomir Mech. Salient object subitizing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4045–4054, 2015.
-  Jianming Zhang, Stan Sclaroff, Zhe Lin, Xiaohui Shen, Brian Price, and Radomir Mech. Unconstrained salient object detection via proposal subset optimization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5733–5742, 2016.
-  Michael Zheludkov, Timur Isachenko, et al. High Performance in-memory computing with Apache Ignite. Lulu.com, 2017.
-  Hao Zhou, Jose M Alvarez, and Fatih Porikli. Less is more: Towards compact cnns. In European Conference on Computer Vision, pages 662–677. Springer, 2016.