Datacenters host latency-critical user-facing applications, such as web search and web services. These applications have strict Quality of Service (QoS) requirements in terms of tail latency, and require frequent bug fixes and feature updates. To meet these requirements, service design has shifted from a monolithic architecture to a microservice architecture, where a complex user-facing service is decomposed into multiple loosely coupled microservices, each providing a specialized functionality. A microservice-based application involves the interoperation of multiple microservices, each of which can be implemented, deployed, and updated independently without compromising the application's integrity. Such independence improves the application's scalability, portability, and availability. Considering these advantages, the microservice architecture has become a widely accepted and employed software architecture at Internet giants such as Netflix, Amazon, Apple, and eBay [26, 46, 31].
GPU-accelerated user-facing applications (e.g., deep learning) are also shifting towards the microservice architecture (referred to as "GPU microservices"). Figure 1 shows an example of deploying an application with three microservice stages on a GPU. In the figure, multiple microservices run concurrently on a spatial multitasking GPU, since the current Volta Multi-Process Service (MPS) allows multiple applications to share GPU computational resources for better resource efficiency. As observed in the figure, the back-pressure effects caused by the dependencies between the microservices result in expensive overhead: cascading QoS violations quickly propagate through the entire service, leading to even worse QoS violations. Therefore, even though the quality-of-service (QoS) requirements of user-facing applications are similar for microservices and monoliths, the tail latency required of each individual microservice is much stricter than for traditional monolithic applications.
Besides guaranteeing the QoS of microservices, it is cost-efficient to maximize the peak load supported by a user-facing application with limited resources, and to minimize the resource usage of a service with varying load. There is prior research on characterizing and managing resources for CPU microservices [45, 3, 15, 26]. Benefiting from the containerized deployment pattern of CPU microservices, such interference can be encapsulated and resolved at the container level: a container can impose limits on the CPU and memory resources consumed by a microservice.
However, prior analyses and resource management policies do not apply to GPU microservices. While CPU microservices contend for CPU and memory bandwidth, GPU microservices contend for SMs, global memory capacity and bandwidth, and PCIe bandwidth (as shown in Figure 1). In addition, there is no containerized environment that enables fine-grained resource sharing on spatial multitasking GPUs. Balancing the throughput of the microservices to improve pipeline efficiency and eliminate the back-pressure effect is therefore challenging on GPUs. Laius is the state-of-the-art work on managing resources on spatial multitasking GPUs. It improves GPU utilization by co-locating user-facing applications and batch applications while ensuring the QoS of the user-facing applications. However, Laius cannot handle GPU microservices with complex dependencies, because it assumes the co-located tasks are independent.
We find that the communication overhead between microservices, the pipeline efficiency (determined by the number of SMs allocated to each microservice and the number of instances in each microservice stage), and the global memory bandwidth contention together determine the tail latencies of GPU microservices. We have two insights: 1) the communication between GPU microservices results in long end-to-end latencies due to the limited PCIe bandwidth; 2) the global memory capacity of a GPU becomes one of the main limitations for microservice co-location, because each microservice occupies a large global memory space.
Since there is no standard GPU microservice benchmark, we first develop Camelot suite, a benchmark suite that includes both real and artifact GPU microservices. The real-system workloads include end-to-end services that cover natural language processing (NLP), deep neural networks (DNN), and image processing. We use cutting-edge models such as LSTM, Bert, VGG, and DC-GAN to build the real-system benchmarks, and the benchmarks are programmed in Python, C/C++, and CUDA. The artifact benchmark is comprised of compute-intensive, memory-intensive, and PCIe-intensive microservices; we can emulate various end-to-end services using the artifact benchmark.
Because the load of a user-facing service varies (diurnal load pattern) and the contention scenario is only known at runtime, an online method is required to manage GPU microservices. We therefore propose a runtime system named Camelot to manage GPU resources online. In Camelot, a global memory-based communication mechanism enables fast data transfer between microservices on the same GPU, and two contention-aware resource allocation policies identify the GPU resource allocations that minimize resource usage or maximize throughput while ensuring the required QoS. The allocation decisions are made based on the pipeline behavior of the microservices and the runtime contention. To enable effective resource allocation, we also propose a performance predictor that precisely predicts the global memory bandwidth usage, duration, and throughput of each microservice under various resource configurations. This paper makes three main contributions.
Comprehensive characterization of GPU microservices.
The characterization reveals the challenges in managing GPU microservices. We will open-source both the benchmark suite and the runtime system. (The source code is available at GitHub; the link is currently hidden for double-blind review but is available on request.)
A global memory-based communication mechanism for GPU microservices. Adopting the mechanism, the microservices on the same GPU communicate directly without the expensive CPU-GPU data copies.
A lightweight GPU resource allocation policy. The policy considers communication overhead, global memory capacity, shared resource contention, and pipeline stall when managing the GPU resources.
We implement Camelot and evaluate it on a GPU server with two Nvidia 2080Ti GPUs, and on a DGX-2 machine with Nvidia V100 GPUs. According to our experimental results, Camelot effectively increases the supported peak load by up to 73.9% and 64.5% compared with equal allocation (EA) and Laius, respectively, and at low load reduces GPU resource usage by 46.5% compared with equal allocation and 35% compared with Laius while ensuring the required QoS.
II Related Work
There have been some efforts on related topics: resource management and scheduling in datacenters, benchmark suites for user-facing services, and microservice architecture.
Microservice Architecture. Yu et al. proposed a microservice benchmark suite DeathStarBench, and used it to study the architectural characteristics of microservices. Li et al. presented a data flow-driven approach to semi-automatically decompose cloud services into microservices. Zhou et al. identified the gap between existing benchmarks and industrial microservices, and proposed a medium-size microservice benchmark system. There also exist some efforts on measuring the performance of microservice-based applications [47, 1, 21, 3]. Gribaudo et al. provided a simulation-based approach to explore the impact of microservice-based architectures on performance and dependability for a given configuration. However, these studies target CPU microservices and are not applicable to GPU microservices.
Resource scheduling for CPU microservices. There has been a large amount of prior work on improving utilization while avoiding QoS violations for CPU microservices. Bao et al. analyzed the performance degradation of microservices from the perspective of service overhead and developed a workflow-based scheduler that minimizes end-to-end latency and improves utilization. Based on workload characteristics, HyScale and ATOM [18, 24] designed hybrid resource controllers that combine horizontal and vertical scaling to dynamically divide resources and improve the response time of microservices. Considering the complexity of performance prediction, Seer proposed an online performance prediction system.
Resource management on GPUs. DART employed a pipeline-based scheduling architecture with data parallelism, where heterogeneous CPUs and GPUs are arranged into nodes with different parallelism levels. Laius allocated computational resources to co-located applications to maximize the throughput of batch applications while guaranteeing the required QoS of user-facing services. Baymax reorders GPU kernels to ensure QoS at co-location on time-sharing accelerators. However, none of them considers the dependencies between microservices as Camelot does. Ignoring the characteristics of the microservice architecture results in low resource utilization compared with Camelot.
III Representative Microservices
In this section, we describe Camelot suite, a GPU microservice benchmark suite that includes four representative end-to-end user-facing applications built from GPU microservices. Besides, we design an artifact benchmark comprised of compute-, memory-, and PCIe-intensive microservices for extensive evaluation. We build Camelot suite based on four guidelines:
Functional integrity - The benchmarks should reflect real-world requirements, show full functionality, and be deployable on real systems.
Programming Heterogeneity - The benchmarks should allow programming language and framework heterogeneity, with each tier developed in the most suitable language, only requiring a well-designed API for microservices to communicate with each other.
Modularity - According to Conway’s Third Law , the artifact benchmarks should be independent and modularized. This modularity prevents vague boundaries and sets up the “inter-operate, not integrate” relationship between the artifact benchmarks.
Representativeness - The computational parts of the microservices should come from popular open-source applications and state-of-the-art approaches used in academia and industry.
III-A Real-System GPU Microservices
Following the above guidelines, we choose user-facing applications that use common deep learning techniques and implement them in the microservice architecture. Table I lists the end-to-end user-facing applications, which cover a wide spectrum of real applications based on GPU microservices.
| Application  | Microservice stages (model)                              | Languages     |
| Img-to-img   | Face recognition (FR-API); Image enhancement (FSRCNN)    | Python & CUDA |
| Img-to-text  | Image feature extraction (VGG); Language model (LSTM)    | C++ & CUDA    |
| Text-to-img  | Semantic understanding (LSTM); Image generation (DC-GAN) | C++ & CUDA    |
| Text-to-text | Text summarization (BERT); Text translation (OpenNMT)    | Python & CUDA |
Figure 2 illustrates the tiered view of Camelot suite spanning the query taxonomy it supports, and the end-to-end applications in Camelot suite. They are widely used in natural language processing (text-to-text), image processing (img-to-img), image generation (text-to-img), and image captioning (img-to-text). Their functionalities are described as follows.
Natural language processing applications belong to the text-to-text class [38, 28] and consist of two GPU microservices. The first microservice is a text summarization task, using Bert. Text summarization turns text or collections of text into short summaries that contain the key information. The second microservice is sentence translation, which translates the text summary output by the first stage into another language.
Image processing applications belong to the img-to-img class [2, 50, 12]. The first stage of the application is a face recognition service based on an open-source project named "face-recognition". The second stage is an image enhancement service implemented with the FSRCNN model. This is an image-based microservice application where users send requests and upload images to the microservices. The face recognition service first locates the faces in the image and cuts out the tiny faces; the image enhancement service then processes each tiny face to generate a high-resolution (64×64) face image.
Image generation applications generate new images according to text, and belong to the text-to-img class. For example, if the input to the neural network is "flowers of pink petals", the output will be an image containing these elements. The task consists of two parts [39, 16, 55, 52, 17]: (1) use natural language processing, with an LSTM, to understand the input description; (2) use a generative network, a deep convolutional generative adversarial network (DC-GAN), to output an image that accurately expresses the text.
Image captioning applications belong to the img-to-text class. The benchmark involves two models: (1) The feature extraction model. Given an image, it extracts significant features, usually represented by a fixed-length vector; VGG is commonly used for feature extraction. (2) The language model. For image description, a language model predicts a sequence of words in a description based on the extracted features. A common method is to use a recurrent neural network, such as a Long Short-Term Memory (LSTM) network, as the language model.
III-B Artifact Benchmarks for Extensive Study
The artifact benchmarks are ported from three PCIe-intensive, compute-intensive, and memory-intensive workloads in Rodinia. By connecting the artifact benchmarks as needed, we are able to build various end-to-end GPU microservices. The arithmetic intensities of the compute-intensive microservice and the memory-intensive microservice can be configured accordingly. Figure 5 shows the scalability of the microservices with different compute and memory intensities. In the figure, one microservice is configured to be more compute-intensive than the others, and another is configured to be more memory-intensive than the others. The two microservices are sensitive to the resource allocation and are thus suitable for studying resource management for general GPU microservices.
IV Investigating GPU Microservices
We use the real-system benchmarks in Camelot suite to investigate the effectiveness of current service deployment methods for GPU microservices. Specifically, we seek to answer two research questions. 1) Can the current deployment methods effectively utilize GPU resources? 2) If not, what are the main factors that cause the inefficiency?
IV-A Inefficient Microservice Pipeline
We use two Nvidia RTX 2080Ti GPUs as the experimental platform for this investigation. Because our study does not rely on any specific feature of the 2080Ti, it applies to other spatial multitasking GPUs.
The standalone deployment policy deploys each microservice on a standalone GPU, and relies on cross-GPU data copies for the communication between the microservices. In this experiment, we gradually increase the load of each benchmark until its 99%-ile latency reaches the QoS target, and report the peak throughput (i.e., queries per second, QPS) of the benchmark in Figure (a). Let T1 and T2 denote the time spent by a user query in the two microservice stages when the latency of the query reaches the QoS target. The bar "Total" in the figure shows the peak throughput of the benchmark; "Stage1" and "Stage2" show the achievable throughputs of the two microservice stages while keeping their processing times shorter than T1 and T2, respectively.
As shown in Figure (a), the peak supported throughput of a benchmark is determined by the microservice stage with the lowest throughput. For instance, the peak throughputs of img-to-img and img-to-text are determined by the first and the second microservice stage, respectively. The standalone deployment policy therefore results in low peak throughput for GPU microservices due to the inefficient microservice pipeline, because it does not account for the differences between the microservices.
The balanced deployment policy is designed based on the fact that a user-facing application achieves its highest throughput when the throughputs of its microservice stages are identical. The policy accordingly allocates computational resources (i.e., SMs) to the microservices in a fine-grained manner, enabled by the Nvidia Volta MPS technique. To achieve the balanced deployment, for each benchmark the throughput and processing time of each microservice stage are profiled offline, and the SM allocation is carefully tuned so that the throughputs of the two stages are identical while the aggregated processing time remains shorter than the QoS target. For instance, if some of the SMs allocated to Stage2 of the img-to-img benchmark can be reallocated to Stage1, the peak throughput of img-to-img improves.
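The balanced split above can be sketched as a small offline search; the throughput curves below are hypothetical stand-ins for profiled values, not numbers from the paper.

```python
# Hypothetical offline profiles: stage throughput (QPS) as a function of the
# SM fraction allocated to that stage, monotonically increasing.
def stage1_qps(sm_frac):
    return 400.0 * sm_frac            # stage 1 scales well with SMs

def stage2_qps(sm_frac):
    return 150.0 + 100.0 * sm_frac    # stage 2 is less SM-sensitive

def balanced_split(step=0.01):
    """Search the SM split maximizing the end-to-end (min-stage) throughput;
    at the optimum the two stage throughputs are (nearly) identical."""
    best_qps, best_frac = 0.0, 0.0
    for i in range(1, int(1 / step)):
        frac = i * step               # SM fraction given to stage 1
        qps = min(stage1_qps(frac), stage2_qps(1.0 - frac))
        if qps > best_qps:
            best_qps, best_frac = qps, frac
    return best_qps, best_frac

peak_qps, stage1_frac = balanced_split()
```

With these illustrative curves the search settles where the two stage throughputs meet, which is exactly the balance condition the policy tunes for.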
Figure (b) shows the QoS violations of the benchmarks under the balanced deployment policy. In the figure, the stars represent the normalized 99%-ile latencies of the benchmarks (right y-axis). The bars "stage1 (offline)", "stage2 (offline)", "stage1 (co-located)", and "stage2 (co-located)" represent the offline-profiled processing times and the actual processing times of the first and second microservice stages, respectively (left y-axis).
We make two observations from Figure (b). First, all the benchmarks suffer QoS violations with the balanced deployment policy. This is mainly because the microservices on the same GPU contend for PCIe bandwidth and global memory bandwidth (Figure 1), even though the SMs are explicitly partitioned; the unstable runtime contention results in long tail latency. Second, the actual processing times of both stages are longer than their offline-profiled processing times due to shared resource contention. The unbalanced performance degradation caused by the contention in turn results in an inefficient pipeline. Our evaluation in Section VIII-D also verifies the necessity of managing global memory bandwidth contention for GPU microservices.
Without tuning the SM allocation online based on runtime contention behaviors, the current service deployment policies result in low peak throughput or QoS violations for GPU microservices due to the inefficient microservice pipeline.
IV-B Large Communication Overhead
Besides the inefficient pipeline, the communication overhead between microservices contributes to the long end-to-end latency. As shown in Figure 1, microservices communicate through the main memory. When a microservice M1 sends its result to the next microservice M2, the data is first transferred from the global memory used by M1 to the main memory, and then transferred back to the global memory used by M2, even if M1 and M2 are on the same GPU. This is because M2 is not allowed to access M1's data directly.
Figure 13 shows the breakdown of the end-to-end latencies of the queries in the benchmarks. As shown in the figure, the communication time takes a large percentage of the end-to-end latency for all the real applications: the data transfer time (host-to-device/device-to-host) takes 32.4% to 46.9% of the end-to-end latency. If the long communication time were eliminated, the end-to-end latency of user queries would be greatly reduced; the supported peak load could then be further increased, and fewer GPU resources would be needed to support a low load.
IV-C Limited Global Memory Space
Current machine learning models often use large batch sizes to improve throughput, and the models are large in capacity. In this scenario, microservices are hard to co-locate on the same GPU due to the limited global memory space. As an example, Figure 14 shows the global memory usage and the corresponding GPU utilization when the first microservice of img-to-img uses different batch sizes. As shown in the figure, the global memory of a GPU can only host the microservice with a batch size smaller than 256, while the GPU utilization stays below 25%. In this scenario, we are unable to allocate the remaining free computational resources of the GPU to other microservices.
The unified memory technique, which automatically swaps data between the main memory and the global memory, could enable such reallocation. However, it incurs heavy data transfer over the PCIe bus, which significantly slows down the communication between microservices (discussed in Section VI). The limited global memory space of GPUs also contributes to the inefficiency of microservice pipelines.
Besides the computational resources on each GPU, resource allocation for improving the pipeline efficiency of GPU microservices has to treat the global memory space as one of the main constraints.
V The Camelot Methodology
In this section, we propose Camelot, a runtime system that maximizes the supported peak load of GPU microservices with limited GPUs and minimizes resource usage at low load while ensuring the QoS requirements.
V-A Design Principles of Camelot
Based on the investigation in Section IV, we design Camelot following three principles.
(1) Camelot should minimize the communication overhead between microservices. The CPU-GPU data transfer between microservices results in long end-to-end latency. In addition, PCIe bandwidth contention between microservice instances also increases communication overhead and latency.
(2) Camelot should maximize pipeline efficiency while achieving the required QoS online. The pipeline efficiency is affected by both the percentage of SM resources allocated to each microservice and the runtime contention behaviors, since the microservices on the same GPU contend for the shared resources (e.g., global memory bandwidth).
(3) Camelot should schedule microservices across multiple GPUs considering the limited global memory space. Since the global memory space is one of the resource bottlenecks for GPU microservices, Camelot should be able to use multiple GPUs to host an end-to-end microservice-based application. Like the SMs, the GPU memory space is one of the main constraints when scheduling the microservices.
V-B Overview of Camelot
Figure 15 shows the design overview of Camelot. As shown in the figure, Camelot adopts a global memory-based communication mechanism to reduce the communication overhead between microservices on the same GPU. For Camelot, we propose two contention-aware resource allocation policies that respectively maximize the supported peak load of an end-to-end microservice-based application with limited GPUs and minimize the resource usage at low load, while ensuring the desired 99%-ile latency target.
The global memory-based communication eliminates the back-and-forth data transfers between CPU memory and GPU global memory for microservice communication (Section VI). It achieves this by passing only a handle to the data, which stays in global memory, to the receiver. The mechanism removes the long communication overhead that leads to long end-to-end latency.
The two resource allocation policies allocate GPU computational resources (i.e., SMs) to the microservices based on the performance of each microservice under various resource configurations, and on the runtime contention behaviors (Section VII). The challenge here is that Camelot needs to constrain the degradation due to runtime contention; otherwise the user-facing service would suffer QoS violations. By considering both the performance and the contention, the two policies address the inefficient pipeline and the shared resource contention.
Specifically, when a user query is submitted, it is processed in the following steps. 1) The query is pushed into a wait queue and waits to be issued to the GPU. 2) Once enough queries are received, or the first query in the queue tends toward a QoS violation, the queries are batched and issued. 3) According to the batch size, Camelot calculates the GFLOPs (floating point operations) of the batch of queries, and predicts the global memory usage, global memory bandwidth usage, processing time, and throughput (executed requests per second) of each microservice stage under various computational resource allocations. The prediction is based on an offline-trained performance model. 4) Based on the prediction, Camelot identifies the percentage of computational resources that should be allocated to each microservice and determines the number of instances for each microservice stage. 5) When co-locating these microservice instances, Camelot considers the reduced communication overhead with the global memory-based communication, the contention on global memory bandwidth, and the limited global memory space on each GPU. Camelot uses the process pool technique proposed in Laius to realize dynamic SM allocation.
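The issue decision in step 2 can be sketched as follows; the batch size limit, QoS target, and estimated processing time are hypothetical parameters, not values from the paper.

```python
# A minimal sketch of the batching decision: issue when the batch is full, or
# when the oldest query would risk a QoS violation if it waited any longer.
def should_issue(arrivals_ms, now_ms, max_batch=32,
                 qos_ms=200.0, est_proc_ms=80.0):
    """arrivals_ms holds the arrival timestamps of pending queries (oldest
    first); est_proc_ms is the predicted processing time of the batch."""
    if not arrivals_ms:
        return False
    if len(arrivals_ms) >= max_batch:
        return True                              # enough queries received
    slack = qos_ms - (now_ms - arrivals_ms[0]) - est_proc_ms
    return slack <= 0.0                          # head query near its deadline

# Head query arrived at t=0; at t=130 only 70 ms of budget remain, which is
# less than the 80 ms still needed, so the partial batch is issued early.
issue_now = should_issue([0.0, 10.0, 20.0], now_ms=130.0)
```

The same trade-off governs real batching runtimes: waiting grows the batch (better GPU efficiency) but eats into the head query's latency budget.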
VI Reducing Communication Time
In this section, we present a global memory-based communication mechanism that enables fast communication between microservices on the same GPU.
VI-A Characterizing the Contention on the PCIe Bus
Figure 18 compares the traditional main memory-based communication and the proposed global memory-based communication mechanism between microservices. During the execution of GPU microservices, since the input of the next stage depends on the output of the previous stage, the results of a microservice stage must be transferred to the next stage. As shown in Figure 18(a), adjacent microservices in the pipeline communicate with each other by copying data back and forth between GPU global memory and CPU memory. This default communication mechanism results in long communication latency and low data transfer bandwidth, especially when multiple microservices co-run on the same GPU.
To show the impact of the default communication mechanism, we run multiple instances of a PCIe-intensive microservice P concurrently on a GPU. The functionality of P is copying 5GB of data from the main memory to the global memory. In the experiment, each instance of P is allocated only 10% of the computational resources to eliminate the impact of contention on the SMs. Figure 19 shows the data transfer time over the PCIe bus for one instance of P; the x-axis shows the number of instances of P on the GPU.
As shown in Figure 19, the data transfer time increases when more than three instances are co-located. The increase is due to contention on the PCIe bandwidth: the theoretical peak bandwidth of the 16x PCIe 3.0 bus in our platform is 15,800MB/s, the effective bandwidth is 12,160MB/s, and a single memcpy task uses 3,150MB/s of PCIe bandwidth according to our measurement. If the memcpy task transfers data from pinned memory, a single such task can consume all the PCIe bandwidth.
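The onset of contention in Figure 19 is consistent with simple arithmetic on these measurements; the sketch below assumes the bus is shared equally once saturated, which is an idealization.

```python
# Back-of-the-envelope check of the PCIe saturation point from the measured
# numbers above: 12,160 MB/s effective bus bandwidth, ~3,150 MB/s per
# pageable-memory memcpy task.
effective_bw = 12160.0   # MB/s, effective 16x PCIe 3.0 bandwidth
per_task_bw = 3150.0     # MB/s, one pageable-memory memcpy task

# Largest instance count the bus can serve at full per-task speed.
unsaturated_instances = int(effective_bw // per_task_bw)

def transfer_time_s(data_mb, n_instances):
    """Per-instance transfer time, assuming the saturated bus is shared
    equally among the concurrent transfers."""
    per_instance_bw = min(per_task_bw, effective_bw / n_instances)
    return data_mb / per_instance_bw

t3 = transfer_time_s(5 * 1024, 3)   # three instances: no slowdown yet
t6 = transfer_time_s(5 * 1024, 6)   # six instances: transfers stretch out
```

Three concurrent transfers fit under the effective bandwidth, while a fourth pushes the aggregate demand past it, matching the knee observed in the figure.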
In this scenario, if a GPU hosts several PCIe-intensive microservice instances, the microservices contend for the limited PCIe bandwidth and suffer long communication times, which in turn lengthen the end-to-end latency of user queries.
VI-B Global Memory-Based Communication
As observed in Figure 18(a), the data that should be passed from a microservice M1 to a microservice M2 is already in M1's global memory space, although it is not accessible to M2. If M1 could share the data with M2, the expensive memcpy operations (device-to-host and host-to-device) could be eliminated. We design a global memory-based communication mechanism, shown in Figure 18(b), to achieve this. With this mechanism, the result of a microservice is kept in global memory, and the next microservice accesses the data from global memory directly, without copying data back and forth between the global memory and the main memory.
Figure 20 illustrates the design of the global memory-based communication mechanism. As shown in the figure, when a microservice M1 needs to pass its result to a microservice M2 on the same GPU, M1's host process passes a global memory handle (8 bytes) to M2's process on the CPU side. Once M2 gets the handle, it can directly access the data in global memory. We implement the mechanism with the CUDA IPC (inter-process communication) technique provided by Nvidia: the sender process obtains the IPC handle for a given global memory pointer using cudaIpcGetMemHandle(), passes it to the receiver process using standard IPC mechanisms on the host side, and the receiver process uses cudaIpcOpenMemHandle() to retrieve the device pointer from the IPC handle.
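CUDA IPC itself requires a GPU, but the handle-passing pattern can be sketched with a CPU-side analogy using Python's multiprocessing.shared_memory: only a short name (the analogue of the 8-byte handle obtained via cudaIpcGetMemHandle()) crosses the process boundary, while the payload stays in place, mirroring the cudaIpcOpenMemHandle() attach on the receiver side.

```python
from multiprocessing import shared_memory

# "Sender" side: the microservice's result already lives in a shared segment.
payload = b"microservice result"
src = shared_memory.SharedMemory(create=True, size=len(payload))
src.buf[:len(payload)] = payload

handle = src.name            # only this short handle is passed to the receiver

# "Receiver" side: attach by handle; the payload itself is never copied
# through an intermediate buffer (analogue of cudaIpcOpenMemHandle()).
dst = shared_memory.SharedMemory(name=handle)
received = bytes(dst.buf[:len(payload)])

dst.close()
src.close()
src.unlink()
```

As in the GPU mechanism, the cost of the transfer is the size of the handle, not the size of the data.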
Figure 21 shows the communication time between two microservices on the same GPU using the default and the global memory-based mechanisms; in this experiment the two microservices do not contend for PCIe bandwidth. As observed in the figure, the global memory-based mechanism greatly reduces the communication time when the data to be passed is larger than 0.02MB, and the larger the transferred data, the larger the performance gain. If the data transferred between two microservices is small (e.g., only 2 bytes), the traditional main memory-based mechanism shows shorter time. This is mainly because CUDA IPC incurs a small fixed overhead for probing, transferring, and decoding the IPC handle in the global memory-based communication mechanism.
Besides reducing the communication time, the mechanism also reduces the global memory usage of the microservices. With the traditional mechanism, the sender and the receiver each store a copy of the transferred data. With the global memory-based mechanism, only the sender stores a single copy, while the sender and the receiver each hold an 8-byte IPC handle. Since the data transferred between microservices is usually much larger than 8 bytes, the global memory-based mechanism does not consume extra global memory space; instead, it reduces global memory usage.
It is worth noting that the microservices on different GPUs are not able to communicate through the global memory-based mechanism. Therefore, the microservices that require heavy communication should be placed on the same GPU.
VII Allocating GPU Resources
In this section, we present two contention-aware resource allocation policies for GPU microservices. The first policy maximizes the supported peak load of GPU microservices with limited GPUs while avoiding QoS violations. The second policy minimizes GPU resource usage while ensuring the QoS when the load of a service is low.
VII-A Low-Overhead Performance Prediction
Camelot predicts the processing duration, the global memory bandwidth usage, and the throughput of each microservice to support the two resource allocation policies. The throughput represents the number of queries that can be processed per second by a microservice. For each microservice, we train a performance model that predicts its processing duration, global memory bandwidth usage, and throughput.
The model for a microservice takes its input batch size and its percentage of computational resources as input features, as they strongly affect the microservice's performance. The input batch size reflects the workload of a query, and the percentage of computational resources reflects the computational capacity used to process the query. These features can be collected with profiling tools such as Nsight Compute provided by Nvidia. To collect training samples for a microservice, we submit queries with different batch sizes, execute them with different computational resource quotas, and record the corresponding durations. During profiling, queries are executed in solo-run mode to avoid interference from shared resource contention.
Since the QoS target of a user query is hundreds of milliseconds to support smooth user interaction, it is crucial to choose a modeling technique that offers both high accuracy and low complexity. We evaluate a spectrum of broadly used low-latency algorithms for microservice performance prediction: Linear Regression (LR), Decision Tree (DT), and Random Forest (RF).
To evaluate the accuracy of the three modeling techniques, we use 70% of the collected samples to train the models and the rest for testing. Figure 22 presents the prediction errors of the duration, global memory bandwidth usage, and throughput of each microservice in Camelot suite with LR, DT, and RF. In general, DT and RF show high accuracy for predicting microservice performance. Besides accuracy, we also measure the execution time of the prediction models. A prediction with DT completes in less than 1 ms, while the RF model takes more than 5 ms. We therefore choose DT as the modeling technique for the performance models. Besides, Camelot also predicts the FLOPs (floating point operations) and the required global memory space of the microservices under different workloads; since these metrics grow linearly with the workload, LR precisely captures the relationship.
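As an illustration, a per-microservice model of this form might be trained as sketched below. The feature layout and the sample values are assumptions for illustration, not Camelot's actual profiling data.

```python
# Sketch of a per-microservice performance model: features are
# (batch size, % of SMs); targets are duration (ms), global memory
# bandwidth (GB/s), and throughput (queries/s). All numbers are made up.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[8, 25], [8, 50], [16, 25], [16, 50], [32, 50], [32, 100]])
y = np.array([[4.1, 90, 1950], [2.6, 140, 3070], [6.8, 110, 2350],
              [4.0, 180, 4000], [6.9, 210, 4630], [4.2, 340, 7600]])

# A multi-output decision tree predicts all three metrics at once.
model = DecisionTreeRegressor(max_depth=8).fit(X, y)

# Predict duration, bandwidth usage, and throughput for an unseen config.
duration_ms, bw_gbps, qps = model.predict([[16, 100]])[0]
```

A depth-bounded tree keeps the per-query prediction cost well under the sub-millisecond budget mentioned above.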
We do not use black-box methods, such as Reinforcement Learning or Bayesian optimization, to predict the performance of microservices online, because in-production GPUs lack the ability to obtain runtime statistics online with low overhead. In a datacenter, it is acceptable to profile a service and build a model before running it permanently. Similar to prior work on datacenters, the profiling is done offline, so it does not incur runtime overhead.
VII-B Case 1: Maximizing Peak Load
The peak load of an end-to-end service is determined by the smallest peak load among its microservices. Therefore, the design principle here is to maximize the smallest throughput of the microservices in an end-to-end service, while keeping the end-to-end latency shorter than the QoS target. Camelot tunes the number of microservice instances for each microservice stage, and the SM resource quota for each instance. Other resources (such as global memory bandwidth) cannot be explicitly allocated.
We formalize the above problem as a single-objective optimization problem, where the objective is to maximize the smallest throughput of the microservices, and the constraints are the global memory capacity, the global memory bandwidth, the computational resources on the GPUs, and the QoS target of the service. The number of instances for each microservice stage and the resource quota allocated to each process are then derived from the feasible solutions of the optimization problem.
The constraints in the optimization problem are as follows. First, to avoid global memory bandwidth contention, the accumulated global memory bandwidth required by all the microservices on a GPU should be less than the available global memory bandwidth of the GPU. Second, the accumulated computational resource quotas allocated to concurrent instances should not exceed the overall available computational resources. Third, the number of microservice instances on a GPU should not exceed 48 (Volta MPS allows at most 48 client-server connections per device). Fourth, the end-to-end latency of the user-facing application should be shorter than the QoS target.
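The four constraints above can be sketched as a feasibility check on a candidate placement. The per-instance data layout below is an assumption for illustration; in Camelot the predicted values come from the performance models of Section VII-A.

```python
# Hedged sketch of the four feasibility constraints for one GPU.
MAX_MPS_CLIENTS = 48  # Volta MPS client-connection limit per device

def feasible(instances, gpu_bw_gbps, qos_ms):
    """instances: dicts with predicted 'bw' (GB/s), 'sm_pct', 'duration' (ms)."""
    total_bw = sum(i["bw"] for i in instances)
    total_sm = sum(i["sm_pct"] for i in instances)
    end_to_end = sum(i["duration"] for i in instances)  # pipeline latency
    return (total_bw <= gpu_bw_gbps                  # 1: memory bandwidth
            and total_sm <= 100                      # 2: SM quota
            and len(instances) <= MAX_MPS_CLIENTS    # 3: MPS connections
            and end_to_end <= qos_ms)                # 4: QoS target
```

Summing stage durations is a simplification of the end-to-end latency constraint; the point is that every candidate allocation is screened against all four limits.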
| Variable | Description | Provided by |
|---|---|---|
| | the i-th stage of the microservice pipeline | Benchmarks |
| p_i | the computational resource quota allocated to the i-th microservice | Section VII-B |
| s | the batch size of the microservice | scheduler |
| | the total number of GPUs | Section VII-C |
| | the available global memory bandwidth | Nvprof |
| | the maximal client CUDA contexts supported by the Volta MPS server per device | Volta MPS |
| | the overall computational resources | Nvprof |
| n_i | the number of the i-th microservice's processes | scheduler |
| | the predicted throughput of the microservice | Section VII-A |
| | the predicted global memory bandwidth usage of the microservice | Section VII-A |
| M(i,s) | the global memory footprint of microservice i with batch size s | Section VII-A |
| C(i,s) | the amount of calculations of microservice i with batch size s | Section VII-A |
| | the GFLOPS of the used GPU | Nsight Compute |
| | the global memory capacity of the used GPU | Nvprof |
Assume a user-facing GPU application has N microservice stages. Equation 1 shows the objective and the constraints of the optimization problem. Table II lists the variables used to maximize the supported peak load by solving the optimization problem in Equation 1.
VII-C Case 2: Minimizing Resource Usage
In this policy, Camelot first minimizes the number of GPUs required to support the low load, and then minimizes the resource usage on each of the GPUs. This design choice reduces the search space for solving the optimization problem described later.
To determine the minimum number of GPUs required, Camelot relies on the predicted number of floating point operations and the global memory footprint of the microservices under different loads (C(i,s) and M(i,s) in Table II). Based on the designed GFLOPS (giga floating-point operations per second) and the global memory capacity of a GPU, Equation 2 calculates the minimum number of GPUs required under the constraints of both the computational ability and the global memory space.
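A minimal reading of Equation 2 can be sketched as follows: the GPU count must satisfy the compute constraint and the memory-capacity constraint simultaneously, so the larger of the two lower bounds is taken. The parameter names are assumptions for illustration.

```python
# Minimum number of GPUs under both the compute and the memory constraint.
import math

def min_gpus(required_gflops, required_mem_gb, gpu_gflops, gpu_mem_gb):
    """required_gflops / required_mem_gb: totals over all microservices
    at the given load; gpu_gflops / gpu_mem_gb: per-GPU capabilities."""
    by_compute = math.ceil(required_gflops / gpu_gflops)
    by_memory = math.ceil(required_mem_gb / gpu_mem_gb)
    return max(by_compute, by_memory)
```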
Equation 3 shows the objective and the constraints that further reduce the resource usage on the GPUs. When choosing the batch size to run the microservices, Camelot considers the global memory footprint of different batch sizes (M(i,s)). When global memory is scarce, an excessive batch size puts pressure on the global memory space. Therefore, the batch size is also a variable when determining the resource allocation in the next stage.
By solving the two optimization problems, Camelot finds the resource quota and the number of instances for each microservice stage. We currently adopt the Simulated Annealing algorithm to solve the optimization problems.
In more detail, the state is a vector of length 2N, [n1, ..., nN, p1, ..., pN], where N is the number of microservice stages. For microservice stage i, Camelot deploys ni instances and allocates pi percent of the computing resources to each instance, where the computing resources of an entire GPU amount to 100%.
Similar to the traditional simulated annealing algorithm, Camelot iterates continuously to search for an optimal state. In each iteration, the current state randomly moves in one direction to produce a new candidate state. Camelot then checks whether the new state meets the constraints, such as the memory bandwidth constraint (Equation 3). If the new state is valid, Camelot calculates its throughput and compares it with the global optimum. If the new state's throughput is higher, Camelot updates the global optimum. Otherwise, Camelot still accepts the new state with a certain probability (the acceptance probability), which decreases with more iterations.
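The loop described above can be sketched as follows. The neighbour move, the cooling schedule, and the throughput/constraint callbacks are assumptions for illustration, not Camelot's exact implementation.

```python
# Simulated-annealing search over the state vector [n1..nN, p1..pN].
import math
import random

def anneal(state, throughput, valid, steps=10000, t0=1.0, cooling=0.999):
    """throughput(state) -> score to maximize; valid(state) -> bool."""
    best, best_tp = list(state), throughput(state)
    cur, cur_tp, temp = list(best), best_tp, t0
    for _ in range(steps):
        cand = list(cur)
        i = random.randrange(len(cand))
        cand[i] = max(1, cand[i] + random.choice([-1, 1]))  # random move
        if not valid(cand):            # e.g. bandwidth/SM/QoS constraints
            continue
        tp = throughput(cand)
        # Accept better states, or worse ones with decaying probability.
        if tp > cur_tp or random.random() < math.exp((tp - cur_tp) / temp):
            cur, cur_tp = cand, tp
        if cur_tp > best_tp:
            best, best_tp = list(cur), cur_tp
        temp *= cooling
    return best, best_tp
```

On a toy one-dimensional objective such as `lambda s: -(s[0] - 5) ** 2`, the search converges to the optimum within a few thousand iterations.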
VII-D Deployment Scheme Across Multiple GPUs
Distributing microservice instances to multiple GPUs involves two steps. The first step searches for the number of instances for each microservice stage and the computing resource quota allocated to each instance. The second step finds a deployment scheme according to the results of the first step. However, it is impractical to search exhaustively for the optimal deployment scheme for all instances. To speed up the search, we use a specific deployment strategy to quickly find a reasonable deployment scheme, as shown in Figure 23.
A GPU has multiple resource dimensions, including computing resources, global memory capacity, global memory bandwidth, and PCIe bandwidth. When deploying the instances of a microservice stage, we sort the remaining GPUs according to their available resources. The ordering of resource dimensions during GPU sorting depends on the characteristics of the microservice. The experiments in Section IV show that the global memory capacity is the major resource bottleneck for GPU microservices. Therefore, Camelot gives the global memory capacity the highest priority in the deployment scheme. For example, for applications that occupy a large amount of global memory space, the GPUs are sorted by their remaining global memory; ties are broken by the other resource dimensions.
GPUs with fewer remaining resources have higher priority, and Camelot tries to deploy instances on the GPU with the highest priority first. In this way, Camelot avoids excessive fragmentation of the resources in the resource pool. In addition, deploying instances of the same stage on the same GPU as much as possible allows the instances to share models, reducing the consumption of GPU global memory, which is often the most stressed resource during allocation.
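The priority-based placement can be sketched as a tightest-fit heuristic over a simplified two-dimensional resource vector per GPU; the field names are assumptions for illustration.

```python
# Place each instance on the GPU with the fewest remaining resources
# that still fits it, to limit fragmentation of the resource pool.
def place(instances, gpus):
    """instances: list of (name, mem_gb); gpus: dicts with
    'free_mem' and 'free_sm'. Returns {instance name: gpu index}."""
    placement = {}
    for name, mem in instances:
        # Sort GPUs: scarcest global memory first, then other dimensions.
        order = sorted(range(len(gpus)),
                       key=lambda g: (gpus[g]["free_mem"], gpus[g]["free_sm"]))
        for g in order:
            if gpus[g]["free_mem"] >= mem:
                gpus[g]["free_mem"] -= mem
                placement[name] = g
                break
    return placement
```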
VIII Evaluation of Camelot
In this section, we evaluate the effectiveness of Camelot in maximizing the supported peak load and minimizing resource usage at low load, while ensuring the required QoS.
We evaluate Camelot on a machine equipped with two Nvidia RTX 2080Ti GPUs and a DGX-2 machine equipped with 16 Nvidia V100 GPUs. Table III summarizes the detailed software and hardware experimental configurations. Camelot does not rely on any special hardware features of the 2080Ti or V100, and is easy to set up on other GPUs with the Volta or Turing architecture. The peak global memory bandwidths of the 2080Ti and V100 GPUs are 616 GB/s and 897 GB/s, respectively; they are used as constraints in the resource allocation policies. We use both the real-system benchmarks and the artifact benchmarks in Camelot suite as user-facing GPU microservices. Except for the large-scale evaluation in Section VIII-E, we report the experimental results on the machine equipped with two 2080Ti GPUs.
| Category | Configuration |
|---|---|
| Hardware | Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, two Nvidia GeForce RTX 2080Ti |
| | Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz, NVIDIA DGX-2 with 16 Tesla V100-SXM3 |
| Software | Ubuntu 16.04.5 LTS with kernel 4.15.0-43-generic |
| | CUDA Driver 410.78, CUDA SDK 10.0, CUDNN 7.4.2 |
Since we find no prior work on resource management for GPU microservices, we compare Camelot with the even allocation ("EA" for short) policy and with Laius, which was proposed for managing applications co-located on spatial multitasking GPUs. EA evenly allocates all the GPU resources to the microservices in a user-facing application. On a spatial multitasking GPU, Laius predicts the computational resources required by a user-facing query and dynamically reallocates the remaining computational resources to batch applications to maximize their throughputs. Because Laius is designed for a single GPU, we schedule the microservices of a benchmark on a single GPU with Laius; the total throughput of the benchmark with Laius is calculated by aggregating the throughputs on all the GPUs.
VIII-A Maximizing the Supported Peak Load
In this subsection, we evaluate Camelot in maximizing the supported peak load while ensuring the required QoS with a given number of GPUs.
Figure 28 shows the supported peak loads of the benchmarks normalized to their QoS targets with EA, Laius and Camelot, while ensuring the 99%-ile latency target. In the figure, the x-axis shows the batch size of processing user queries. Camelot increases the supported peak loads of the benchmarks by 12% to 73.9% compared with EA, and by 10% to 64.5% compared with Laius.
EA results in low peak loads because it does not consider the pipeline effect of the microservices. While the peak load of a benchmark is determined by the peak load of the microservice stage with the lowest throughput, EA's resource allocation does not balance the throughputs of the microservice stages. The benchmarks achieve slightly higher peak loads with Laius than with EA, mainly because we optimized Laius to balance the throughputs of the microservice stages. However, Laius still performs worse than Camelot, because it does not schedule microservice instances across multiple GPUs as Camelot does; the microservices therefore suffer from higher contention with Laius. In addition, the benchmarks suffer from long communication overhead with EA and Laius, which lack the global memory-based communication.
In more detail, Figure 31 shows the number of instances in each microservice stage and the percentage of SMs allocated to each microservice instance with Camelot. The 16 test cases in Figure 28 are referred to as 1-16 for simplicity in Figure 31. Observed from this figure, for a microservice stage with a long processing time (e.g., stage 1 of img-to-img), Camelot automatically creates more instances to increase its total throughput. In this way, Camelot improves the pipeline efficiency of GPU microservices.
VIII-B Minimizing Resource Usage
Figure 32 shows the normalized GPU resource usage of the benchmarks at low load and the corresponding 99%-ile latency with Camelot and Laius. We use 30% of the peak load as the low load in the experiment, as reported by Google's research. In this figure, the resource usage is normalized to the scenario in which each microservice stage uses an individual GPU. Experiments with other loads show similar results.
Observed from Figure 32, Camelot reduces the GPU resource usage by 46.5% on average while ensuring the QoS of all the benchmarks. Camelot is effective in this scenario because it precisely predicts the duration of a microservice with different GPU resource configurations, and schedules microservice instances considering the runtime shared resource contention (global memory bandwidth and PCIe bandwidth). Laius also reduces the resource usage compared with the naive deployment, by 20.2% on average. However, because it neither mitigates the inter-microservice contention nor adjusts the number of instances for each microservice stage, it requires more resources than Camelot to ensure the QoS of user-facing applications. Compared with Laius, Camelot reduces the GPU resource usage by a further 35%, while Laius results in slight QoS violations for 3 out of the 4 benchmarks.
VIII-C Adapting to Different Loads
In this subsection, we evaluate Camelot in adapting to different loads. For each benchmark, we report its resource usage and the corresponding 99%-ile latency under four different loads with Camelot in Figure 33. In the figure, a higher load level indicates a higher load.
Observed from this figure, Camelot reduces more resource usage when the load is lower, and always guarantees the QoS of the benchmarks. Camelot fine-tunes the GPU resource allocation based on the load and the contention between the microservices on the same GPU.
VIII-D Effectiveness of Constraining Global Memory Bandwidth Contention
Camelot predicts the global memory bandwidth usage of all the microservices, and makes sure that the accumulated bandwidth usage of the concurrent tasks is smaller than the peak global memory bandwidth of the GPU. To show the effectiveness of this constraint, we implement Camelot-NC, a system that disables the constraint in Camelot.
Figure 33 also shows the 99%-ile latency of the benchmarks with Camelot-NC. Observed from this figure, the user-facing services in 10 out of the 16 test cases suffer from QoS violations with Camelot-NC. For instance, the 99%-ile latency of img-to-img is up to 1.55X of its QoS target with Camelot-NC. The QoS violations are due to the unmanaged global memory bandwidth contention.
VIII-E Generalizing for Complex Microservices
Besides the real-system benchmarks, we create more benchmarks using the artifact benchmarks in Camelot suite (3 microservices with different compute intensities, 3 with different memory access intensities, and 3 with different PCIe intensities) to evaluate Camelot on complex microservices. The microservices are denoted by c1, c2, c3, m1, m2, m3, p1, p2, and p3, respectively; ci (mi, pi) is more compute- (memory-, PCIe-) intensive than cj (mj, pj) if i > j.
Figure 34 shows the supported peak loads of the 27 artifact benchmarks with EA, Laius, and Camelot. In the figure, "p+c+m" denotes a benchmark built by pipelining a PCIe-intensive microservice p, a compute-intensive microservice c, and a memory-intensive microservice m. Observed from this figure, on average, Camelot improves the supported peak load of the 27 benchmarks by 44.91% compared with EA, and by 39.72% compared with Laius.
Corresponding to Figure 34, Figure 40 shows the resource allocation with Camelot for the 27 benchmarks. Observed from this figure, Camelot launches different numbers of instances for different microservice stages, and allocates different percentages of the SMs to the microservices. For instance, Camelot launches 1 instance of Microservice-1, 2 instances of Microservice-2, and 5 instances of Microservice-3 in the first benchmark. In addition, Camelot allocates different percentages of the SMs to the same microservice when it appears in different benchmarks. This reveals that Camelot automatically adjusts the resource allocation based on the features of the microservices.
Figure 41 shows the resource usages and the corresponding 99%-ile latencies of the 27 benchmarks at low load with Camelot. Camelot significantly reduces the resource usage, by 61.6% on average. In addition, the GPU resource allocations vary across the 27 benchmarks, because Camelot adjusts the resource allocation based on the characteristics of the pipelined microservices. To conclude, Camelot generalizes to complex microservices.
VIII-F Large Scale Evaluation on DGX-2
We also evaluate Camelot on a large-scale DGX-2 machine in maximizing the supported peak load. We omit the results of minimizing the resource usage here because they are the same as on the RTX 2080Ti.
Figure 39 shows the supported peak loads of the benchmarks normalized to their QoS targets with EA, Laius and Camelot, while ensuring the 99%-ile latency target. In the figure, the x-axis shows the batch sizes of processing user queries. Observed from this figure, Camelot increases the supported peak load by 50.1% on average across all the benchmarks compared with EA, while guaranteeing their 99%-ile latency within the required QoS target. Camelot is therefore scalable on large-scale GPU machines.
VIII-G Overhead of Camelot
Offline overhead. The overhead of training the performance models offline is acceptable. We collect the training samples of all the microservices within a single day using a single GPU, and the collection can be further sped up with multiple GPUs. As for online prediction, each prediction completes in 1 ms, which is much shorter than the QoS target of a service.

Resource allocation overhead. As stated in Section VII, Camelot solves the optimization problem with the simulated annealing algorithm to identify an appropriate resource allocation. Our measurement shows that this operation completes in 5 ms.

Communication overhead. Camelot needs to set up the global memory-based communication for microservices that transfer data. The setup, based on the CUDA IPC technique, is done only once for a pair of microservices when the end-to-end service is launched, and completes in 1 ms.

To conclude, the overhead of Camelot is acceptable for real-system deployment.
IX Conclusion

For GPU microservices, the main memory-based communication between the microservices, the pipeline inefficiency, and the global memory bandwidth contention result in poor performance. To this end, we propose Camelot, a runtime system that manages GPU resources online. Camelot uses a global memory-based communication mechanism to eliminate the large communication overhead, and two contention-aware resource allocation policies that consider the pipeline efficiency and the shared resource contention. Experimental results show that Camelot increases the supported peak load by up to 64.5% and reduces the resource usage by 35% at low load, while achieving the desired 99%-ile latency target, compared with state-of-the-art work.
-  (2015) Performance evaluation of microservices architectures using containers. In the 14th International Symposium on Network Computing and Applications, pp. 27–34. Cited by: §II.
-  (2018) Finding tiny faces in the wild with generative adversarial network. In , pp. 21–30. Cited by: §III-A, TABLE I.
-  (2019) Performance modeling and workflow scheduling of microservice-based applications in clouds. IEEE Transactions on Parallel and Distributed Systems. Cited by: §I, §II, §II.
-  (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synthesis lectures on computer architecture 4 (1), pp. 1–108. Cited by: §I, §I, §VIII-B.
-  (2016) A random forest guided tour. Test 25 (2), pp. 197–227. Cited by: §VII-A.
-  (2002) A taxonomy of web search. In ACM Sigir forum, Vol. 36, pp. 3–10. Cited by: §I.
-  (2009) Rodinia: a benchmark suite for heterogeneous computing. In International Symposium on Workload Characterization (IISWC), pp. 44–54. Cited by: §III-B.
-  (2016) Baymax: qos awareness and increased utilization for non-preemptive accelerators in warehouse scale computers. ACM SIGPLAN Notices 51 (4), pp. 681–696. Cited by: §I, §II.
-  Conway’s law.. Note: http://www.melconway.com/Home/Conways_Law.html. Cited by: 3rd item.
-  (2013) The tail at scale. Communications of the ACM 56 (2), pp. 74–80. Cited by: §VII-A.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I, TABLE I.
-  (2016) . In European conference on computer vision (ECCV), pp. 391–407. Cited by: §III-A, TABLE I.
-  Facial recognition api for python. Note: https://github.com/ageitgey/face_recognition Cited by: TABLE I.
-  (2019) An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 3–18. Cited by: §I, §I, §II.
-  (2019) Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices. In the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 19–33. Cited by: §I, §II.
-  (2019) Perceptual pyramid adversarial networks for text-to-image synthesis. Cited by: §III-A.
-  (2019) ATOM: model-driven autoscaling for microservices. In the 39th International Conference on Distributed Computing Systems (ICDCS), pp. 1994–2004. Cited by: §II.
-  (2008) Understanding performance of pci express systems. Xilinx WP350, Sept 4. Cited by: §VI-A.
-  (2017) Performance evaluation of massively distributed microservices based applications. In the 31st European Conference on Modelling and Simulation (ECMS), pp. 598–604. Cited by: §II.
-  (2016) Microservices for scalability: keynote talk abstract. In the 7th ACM/SPEC on International Conference on Performance Engineering, pp. 133–134. Cited by: §II.
-  Reinforcement learning: a survey. Journal of Artificial Intelligence Research 4, pp. 237–285. Cited by: §VII-A.
-  (2015) Deep visual-semantic alignments for generating image descriptions. In the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128–3137. Cited by: §III-A.
-  (2019) HyScale: hybrid and network scaling of dockerized microservices in cloud data centres. In the 39th International Conference on Distributed Computing Systems (ICDCS), pp. 80–90. Cited by: §II.
-  (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §I.
-  (2019) A dataflow-driven approach to identifying microservices from monolithic applications. Journal of Systems and Software 157, pp. 110380. Cited by: §I, §I, §II.
-  (2015) An evaluation of unified memory technology on nvidia gpus. In the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 1092–1098. Cited by: §IV-C.
-  (2019) Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345. Cited by: §III-A.
-  (2010) Pregel: a system for large-scale graph processing. In the ACM International Conference on Management of data (SIGMOD), pp. 135–146. Cited by: §I.
-  (2001) Semantic web services. IEEE intelligent systems 16 (2), pp. 46–53. Cited by: §I.
-  Microservices workshop: why, what,and how to get there.. Note: http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference Cited by: §I.
-  Nvidia Nsight Compute. Note: https://docs.nvidia.com/nsight-compute/NsightCompute/index.html Cited by: §VII-A.
-  (2017) NVIDIA tesla v100 gpu architecture.. Note: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Cited by: §VIII.
-  (2015) Multi-process service. Note: https://docs.nvidia.com/deploy/mps/index.html#topic_6_1_2 Cited by: §I, §IV-A.
-  (2019) NVIDIA dgx-2 system user guide.. Note: https://docs.nvidia.com/dgx/dgx2-user-guide/index.html Cited by: §VIII.
-  OpenNMT: an open source neural machine translation system. Note: https://opennmt.net/ Cited by: TABLE I.
-  (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §I, TABLE I.
-  Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §III-A, TABLE I.
-  (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396. Cited by: §III-A, TABLE I.
-  A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics 21 (3), pp. 660–674. Cited by: §VII-A.
-  (2014) Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In the Fifteenth annual conference of the international speech communication association, Cited by: §I, TABLE I.
-  Linear regression analysis. Vol. 329, John Wiley & Sons. Cited by: §VII-A.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §I, TABLE I.
-  (2012) Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959. Cited by: §VII-A.
-  (2019) Softsku: optimizing server architectures for microservice diversity@ scale. In the 46th International Symposium on Computer Architecture (ISCA), pp. 513–526. Cited by: §I.
-  The evolution of microservices.. Note: https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference Cited by: §I.
-  (2016) Workload characterization for microservices. In the international symposium on workload characterization (IISWC), pp. 1–10. Cited by: §II.
-  (1987) Simulated annealing. In Simulated annealing: Theory and applications, pp. 7–15. Cited by: §VII-C.
-  (2015) Show and tell: a neural image caption generator. In the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164. Cited by: §III-A, TABLE I.
-  (2018) The devil of face recognition is in the noise. In the European Conference on Computer Vision (ECCV), pp. 765–780. Cited by: §III-A.
-  (2019) Pipelined data-parallel cpu/gpu scheduling for multi-dnn real-time inference. In Real-Time Systems Symposium (RTSS), pp. 392–405. Cited by: §II.
-  (2019) Semantics disentangling for text-to-image generation. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2327–2336. Cited by: §III-A.
-  (2019) Laius: towards latency awareness and improved utilization of spatial multitasking accelerators in datacenters. In Proceedings of the ACM International Conference on Supercomputing (ICS), pp. 58–68. Cited by: §I, §II, §V-B, §VII-A, §VIII.
-  (2018) Poster: benchmarking microservice systems for software engineering research. In the 40th International Conference on Software Engineering: Companion (ICSE-Companion), pp. 323–324. Cited by: §II.
-  (2019) Dm-gan: dynamic memory generative adversarial networks for text-to-image synthesis. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5802–5810. Cited by: §III-A.