The potential benefits of the edge computing paradigm and related distributed system solutions have been particularly linked with breakthroughs in the fast-growing field of deep learning (DL), whose techniques are designed to boost automation across application domains. With that in mind, this vibrant research area has increasingly focused on integrating edge computing with deep learning and on the associated challenges due to resource constraints [14, 2]. Recent hardware developments are making it increasingly possible to run highly computationally demanding algorithms at the edge.
Among the myriad open research issues, models for machine learning (ML) inference latency and ML model selection optimization in edge computing, along with the related task placement, are of particular importance. This is because such models developed for cloud computing cannot be directly applied to edge computing. The DNN placement problem at the edge needs to consider, in particular, the communication delay between nodes and the hardware heterogeneity of the devices. To the best of our knowledge, there has been no study of the DNN application selection, placement, and inference-serving problem in the context of edge computing. This paper presents the first study of DNN Model Variant Selection and Placement (MVSP) in edge computing networks. We provide a mathematical formulation of the problems of ML placement and inference serving, considering the inference latency of different model-variants, the communication latency between nodes, and the utilization cost of edge computing nodes (resources). Our model also includes a discussion of the potential effects of hardware sharing, with GPU edge computing nodes shared between different model-variants, on inference latency.
An illustration of the DNN application placement problem is presented in Figure 1, with inference requests arriving from IoT nodes at the edge computing layer. IoT nodes are assumed to be devices with processing and sensing capabilities, but not enough to run DNN models. In this system abstraction, the edge computing layer, consisting of edge nodes with GPUs for running ML models, serves as an inference service system for the requests from IoT nodes. For the illustrated system, we focus on designing a placement strategy for ML models, taking into account the different possible model-variants, and on how to forward the requests coming from IoT nodes.
II. System Model
II-A. Reference Edge Computing Network Model
In order to analyze the MVSP problem in an edge computing network, we define a system model that considers the inference latency of different model-variants with shared and unshared access to GPUs, the node communication latency, and the utilization cost. The considered system consists of IoT nodes, e.g., smartphones, security cameras, and smart-car cameras, and edge nodes, e.g., access points. Let and denote the sets of indexes of IoT nodes and edge nodes, respectively. Edge nodes are able to host various ML applications designated to serve the inference requests coming from IoT nodes. Every edge node has a computing unit specific to inference-serving tasks, e.g., a CPU, GPU, or TPU, as well as a memory capacity . We assume that we have
different ML models that can be used for different tasks, such as face recognition and object detection. Each ML model can have variants with different sizes and inference latencies per request and can be deployed via a VM or a container. We denote by the set of ML models and the set of variants of model . Each model variant () has a minimum memory requirement to be loaded and can process at most with a stable performance. Each IoT node can define its own latency requirement for each inferred model, as well as the number of inference requests . The notations used in this paper are summarized in Table I. We introduce a binary variable to indicate the forwarding decision of requests of model-variant () from IoT node to edge node . The placement decision of model-variant () in an edge node is defined by an integer variable , which indicates the number of deployed instances.
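To make the two decision variables concrete, the following is a minimal sketch of how they could be represented in code. The variable names, index tuples, and example values are ours, not the paper's; the paper's symbols were lost in extraction.

```python
# Hypothetical sketch of the MVSP decision variables (names are illustrative).
# x[(i, k, v, e)] = 1 if requests from IoT node i for variant v of model k
#                   are forwarded to edge node e (binary forwarding decision).
# y[(k, v, e)]    = number of instances of variant v of model k deployed on
#                   edge node e (integer placement decision).

x = {}  # binary forwarding decisions
y = {}  # integer placement decisions

# Example: IoT node 0 sends its inception_v2 requests to edge node 1,
# which hosts one instance of that model-variant.
x[(0, "inception", "v2", 1)] = 1
y[("inception", "v2", 1)] = 1

# Consistency check: requests can only be forwarded to a node that
# actually hosts at least one instance of the requested model-variant.
for (i, k, v, e), forwarded in x.items():
    if forwarded:
        assert y.get((k, v, e), 0) >= 1
```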
Figure 2 shows an illustrative example of a network of IoT nodes () and edge nodes () for the system described above. In this example, each edge node stores the 3 ML models and can instantiate them by loading various model-variants, changing the batch-size parameter, which affects the instance size and throughput. The figure shows how, from this set of 3 models, the optimally selected model-variants would be placed in the edge nodes after the placement decisions have been made, along with the served inference requests. For example, after placement, edge node E1 has two loaded models, inception_v2 and inception_v4, and serves five inference requests: four for inception_v2 and one for inception_v4. Inference requests are served by assigning them to the appropriate edge node, considering the latency requirements, the capacity, and the cost of using the servers. We assume that each IoT node consumes these 3 different ML models with different request rates.
II-B. Latency Model
We consider two types of latencies: the communication latency between IoT and edge nodes, and the inference latency of model-variants in edge nodes. We denote by the communication latency between IoT node and edge node . The inference latency of a model-variant () running exclusively on edge node is denoted by . We define as the inference latency of a model-variant () running on edge node . In the mathematical model of this latency, we include the effects that sharing with other model-variants can have, as well as the case with unshared access to the GPU. With that in mind, we assume in our formulation that an edge node can be shared by at most model-variants. The average latency per request is given by:
The communication latency between node and node is the sum of the delay on each link in the shortest path in both directions (sending the request and receiving the response). The delay on each link is assumed to have a random value with an average , including all possible delays on the link, i.e., transmission, queuing, propagation, and processing. We denote by the set of links in the shortest path between node and .
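As a sketch of this round-trip communication latency, one can sum the average per-link delays along the shortest path and count the path in both directions. The topology, node names, and delay values below are illustrative, not from the paper.

```python
import heapq

def shortest_path_delay(graph, src, dst):
    """Dijkstra over average link delays; returns the one-way delay src -> dst."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, delay in graph.get(u, []):
            nd = d + delay
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

# Toy topology: IoT node "i0" reaches edge node "e1" via "e0";
# edge weights are average link delays in milliseconds.
graph = {
    "i0": [("e0", 2.0)],
    "e0": [("i0", 2.0), ("e1", 3.0)],
    "e1": [("e0", 3.0)],
}
one_way = shortest_path_delay(graph, "i0", "e1")  # 2.0 + 3.0 = 5.0 ms
round_trip = 2 * one_way                          # request + response directions
```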
We model the inference latency of a model-variant () running on edge node such that the inference latency of a model-variant increases linearly with the latency of co-located model-variants. A discussion on resource sharing is provided in subsection II-G. The expression of the inference latency is given by:
where the inference latency is the sum of the inference latency of a model-variant running exclusively on an edge node (), the additional latency created by replication, and the additional latency created by co-locating a different model on the same node.
IoT nodes (users) are assumed to express their latency requirements for the inference of a model through the following constraint:
This constraint ensures that the round-trip time (RTT) cannot exceed the maximum latency given by the user as a requirement . In our case, the RTT is the sum of the communication delay (cumulative delay along the path) and the processing delay of the inference request in the edge node.
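A minimal sketch of this feasibility check follows; the function name and the numeric values are ours, chosen only to illustrate the RTT bound (communication round trip plus inference latency must stay within the user's requirement).

```python
# Feasibility check behind the latency-requirement constraint: an assignment
# of an IoT node to an edge node is feasible only if the round-trip
# communication delay plus the inference latency of the chosen model-variant
# stays within the user-specified bound. All values are illustrative (ms).

def satisfies_latency(comm_rtt_ms, inference_ms, requirement_ms):
    return comm_rtt_ms + inference_ms <= requirement_ms

assert satisfies_latency(comm_rtt_ms=10.0, inference_ms=18.0, requirement_ms=30.0)
assert not satisfies_latency(comm_rtt_ms=10.0, inference_ms=25.0, requirement_ms=30.0)
```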
II-C. Utilization Cost Model
The utilization cost model is an abstract formulation of all the costs induced by the utilization of edge resources, assuming that such a cost increases with resource utilization. As an example, the utilization cost can represent power consumption and energy efficiency measurements in units of power (Watts), considering different hardware components such as the CPU, GPU, memory, and I/O.  shows some inference benchmarks of several DNN model-variants on the Nvidia Jetson AGX Xavier GPU.
For the sake of generality, we define a continuous variable denoting the utilization cost of a node . The average utilization cost of all edge nodes is given as:
The cost of every edge node is related to its memory utilization . Similarly to , the utilization cost follows an exponential function of the utilization. We denote by a set of linear functions tangent to . Using this set of linear functions, we approximate the utilization cost as follows:
The following constraints ensure that the variable takes a value approximately equal to .
Constraint (8) defines an edge node's utilization as the sum of the utilization of all possible model-variants, i.e., the required memory per loaded model, divided by the memory capacity of the node (mainly GPU memory).
II-D. Loading and Scaling Model
A model-variant can be loaded in a specific node to serve requests coming from users (IoT nodes). When the request load increases, the deployed model-variant may no longer be able to serve all users. In this scenario, the system may replicate the model instance on top of a new VM or container, which scales up the throughput (based on container technologies), or use a different model-variant with a bigger batch size (i.e., a higher fps).
The following constraints ensure that the load on a specific model-variant on a specific node cannot exceed the maximum load . Moreover, by respecting the maximum load, these constraints can scale up the number of model-variant replicas (called variant replication in ), or choose another model-variant with lower inference latency (called variant upgrading in ), with respect to the minimization of the average latency (Eq. 1).
The following constraints ensure that the number of model-variant instances is greater than only if at least one node is sending inference requests.
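The scaling logic implied by these two constraints can be sketched numerically: the total request rate assigned to a model-variant on a node must not exceed the instance count times the per-instance maximum load, and no instance is deployed when no node sends requests. The function name and the numbers are illustrative.

```python
import math

def min_instances(assigned_load, max_load_per_instance):
    """Smallest instance count satisfying instances * max_load >= assigned_load."""
    if assigned_load <= 0:
        return 0  # no node sends requests, so no instance is deployed
    return math.ceil(assigned_load / max_load_per_instance)

assert min_instances(0, 30) == 0    # no demand, no placement
assert min_instances(25, 30) == 1   # one instance suffices
assert min_instances(65, 30) == 3   # scale up replicas to cover the load
```

Variant upgrading corresponds to the alternative branch: instead of raising the instance count, the optimizer may pick a variant with a larger `max_load_per_instance`, keeping the count low at the price of more memory.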
The number of instances shared by a node can be at most :
The memory capacity constraints can be defined as follows:
II-E. Problem Formulation
where denotes the weight of the average latency in the objective function. The first constraint in problem (13) ensures that a request from a specific IoT node can be processed by only one edge node. Constraint (4) ensures that the RTT cannot exceed the maximum tolerated latency. Constraints (7) and (8) are used to compute the utilization cost per edge node. Constraint (9) ensures that the load assigned to a specific model-variant deployed in a specific edge node does not exceed its maximum processing capacity. Constraints (10) and (11) define the domains of the binary and integer variables used in the model. Constraint (12) ensures the satisfaction of the memory capacity per edge node.
II-F. Complexity Analysis
The MVSP problem is NP-hard.
MVSP is a mixed integer program with quadratic terms in the objective and in the constraints, which is complex to solve. The quadratic terms can be linearized using the standard linearization techniques presented in  to obtain a solvable MILP. MVSP is NP-hard because it combines two NP-hard problems: the model-variant allocation problem and the inference assignment problem. The model-variant allocation problem can be obtained by a relaxation of the model that minimizes the cost under the capacity constraint (12). This problem is equivalent to a two-dimensional bin-packing problem , where the edge nodes are the bins and the DNN model-variants are the objects to pack. The inference assignment problem can be obtained by relaxing the model : we keep constraint (9), remove the variables , and minimize the average latency. This problem is equivalent to the Generalized Assignment Problem, which is NP-hard . ∎
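For concreteness, the standard linearization of a product of two binary variables (the technique used in the 0-1 polynomial-to-linear reduction cited above) replaces each quadratic term with an auxiliary variable and three linear constraints. The symbols here are generic, not the paper's notation:

```latex
% Replace each quadratic term z = x_1 x_2, with x_1, x_2 binary, by:
\begin{align}
  z &\le x_1, \\
  z &\le x_2, \\
  z &\ge x_1 + x_2 - 1, \\
  z &\in \{0, 1\}.
\end{align}
```

The first two constraints force $z = 0$ whenever either factor is 0, and the third forces $z = 1$ when both factors are 1, so $z$ equals the product exactly at every feasible binary point.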
|Set of IoT nodes|
|Set of Edge nodes|
|Set of models|
|Set of variants of model .|
||Inference Latency of a request on variant of model in node|
|The set of links in the shortest path between the node and|
|Communication Latency from node to node|
|Interference Coefficient of variant of model co-located with other model-variants.|
|request rate from node on model|
|Maximum load on model variant|
|Inference Latency requirement of requests on model from node|
|Memory required for loading the variant of model|
|Memory capacity of node|
|Number of deployed instances of model variant in node|
|Utilization cost of node|
|Maximum number of model-variants per edge node|
|Weight of the average latency in the objective function|
II-G. Discussion on Resource Sharing
For inference-serving systems that deploy ML models, devices like GPUs, TPUs, and dedicated accelerators are used due to their high performance, and as of now most of them work exclusively for one ML model at a time . Several recent works have addressed this. Google Research has adapted DNN inference to run on top of mobile GPUs . Similarly, Amazon Web Services proposed a solution to run inference models on integrated GPUs at the edge . But for these kinds of applications, besides running inference models on GPU accelerators, it is necessary to consider GPU sharing as well in order to allocate resources efficiently, opening another area of research. This approach is set to improve on the low utilization and poor scaling of unshared access to a GPU. The idea of GPU sharing is promising, as seen in , where the authors studied the performance of temporal and spatial GPU sharing, and in , which presented a GPU cluster manager enabling GPU sharing for DL jobs.
Resource sharing, such as the previously mentioned GPU sharing, impacts the inference analysis. The resource allocation required to deploy ML algorithms is a complex task, especially in edge computing. To this end, emerging container-based lightweight virtualization technologies allow separating the model instances that run in parallel on the same machine. In general, this means that resource management systems can scale allocated resources up and down based on the load variation using these virtualization technologies. How to effectively share resources across various ML models is an open issue, not only in the context of scalability but also due to the additional latency in ML inference. For example, studying the impact of GPU sharing on the performance of ML models is highly important, especially for deciding how to scale resources up and down and how to choose the best model-variant. Because a full GPU analysis would require a detailed study of numerous existing benchmarks with different ML models, different batch sizes, and GPU memory limitations for our interference model, in this paper we use only a simplified analysis of the effects that replication and co-location of model-variants can have on the inference latency.
To this end, we propose one scenario for calculating the inference latency under resource sharing. We assume parallel usage of hardware in terms of resource sharing; this approach remains unexplored in edge computing. For model simplicity, we consider that the inference latency of a model-variant increases linearly with the latency of co-located model-variants.
where is the inference latency of model-variant in the presence of ,…, and are the inference latencies of and running exclusively on the device, respectively. The coefficient , called the interference coefficient of model-variant in the presence of model-variant , is introduced to estimate the latency of co-located models in terms of the latency of the models running exclusively on the hardware.
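The linear interference model can be sketched as follows: the latency of a variant under sharing is its exclusive latency plus, for each co-located variant, an interference coefficient times that variant's exclusive latency. The function name, the coefficient value, and the latency figures are illustrative, not measurements from the paper.

```python
def shared_latency(m, colocated, exclusive, alpha):
    """Inference latency of variant m when co-located with the variants in `colocated`.

    exclusive[m]   : latency of m running alone on the device (ms)
    alpha[m][mp]   : interference coefficient of m in the presence of mp
    """
    extra = sum(alpha[m][mp] * exclusive[mp] for mp in colocated)
    return exclusive[m] + extra

# Illustrative exclusive latencies (ms) and interference coefficient.
exclusive = {"inception_v2": 12.0, "inception_v4": 30.0}
alpha = {"inception_v2": {"inception_v4": 0.2}}

# inception_v2 co-located with inception_v4: 12.0 + 0.2 * 30.0 = 18.0 ms
lat = shared_latency("inception_v2", ["inception_v4"], exclusive, alpha)
```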
III. Numerical Analysis
In this section, we evaluate our optimization model using two problem instances, and , based on the MANIAC mobile ad hoc network. Table II shows the topology of each studied problem. For each problem, we choose the DNN models randomly from a pre-defined list. The communication latency is obtained from , where it was estimated to have a random value with an average per link. The inference latency of each model-variant was measured on a GTX 1050 Ti GPU using the TensorFlow framework. As described in subsection II-G, we consider the case in which the inference latency increases linearly with the latency of co-located model-variants.
We set the weight equal to , the node memory capacity , and the interference coefficient . We test our optimization model using different request rates, which correspond to the average number of requests per node (e.g., where denotes the random variable of request rates). Table III shows the optimization results of latency and cost for the two problems with GPU sharing and =2 co-located model-variants. By modifying the parameters reported in this table, we observe the response of the analyzed network in terms of latency, cost, and utilization. This includes latency measurements for different numbers of co-locations, as well as the impact of the inference load on latency, cost, and utilization, reported in the following subsections.
|(, , , )|
|(, , , )|
III-A. Impact of the Number of Co-locations
In our model, we set the maximum number of co-located DNN model-variants ( means that no GPU sharing is allowed). We set the configuration parameters similar to the previous experiment of , and we vary the number of co-locations from 1 to 4. Figure 3 shows that the average latency decreases as increases, until it converges. For our use case, the studied network is small, which allows the optimization to converge when is equal to 4. Increasing the value further in this network does not improve the results, but it would be interesting to test in a larger network. For low load, increasing model co-location decreases the millisecond-scale average latency per request by 33%, and for high load, by 21%. This result shows that GPU sharing can improve the average latency of inference requests. Decreasing the inference latency by optimally managing the DNN model placement is an interesting result because it allows the system to satisfy latency-critical applications like augmented reality and online gaming.
In this paper, we used a simplified analysis of the effects that replication and co-location of model-variants can have on the inference latency. This will be extended in future work to include different scenarios of how the inference latency could behave as a result of resource sharing.
III-B. Impact of Inference Load
We set the configuration parameters similar to the experiment of in III, , and we vary the average load per node. Figure 4 shows that the average latency varies slightly for low loads while the cost increases linearly: when the average load per node increases from 5.5 to 22 (a 300% increase), the average latency increases from 18 ms to 19 ms (a 5.5% increase), while the cost increases from 0.05 to 0.20 (a 300% increase) and the utilization from 38% to 63% (a 65% increase). This means that the optimization tends to keep the allocation decision of DNN models while upgrading their variant types to bigger ones, which have higher throughput and larger memory sizes. Then, when the load is high, the average latency starts to increase sharply: when the average load per node increases from 22 to 33 (a 50% increase), the average latency increases from 19 ms to 27 ms (a 42% increase), the cost increases from 0.2 to 0.4 (a 100% increase), and the utilization from 63% to 76% (a 20% increase). These results mean that the optimization tends to satisfy inference requests by allocating new model-variants in distant edge nodes that have enough capacity to host the instances.
III-C. Trade-off between the Average Latency and Cost Function
We set the configuration parameters similar to the previous experiment, setting the average request load to 27.5, and we alter the weight of the latency and cost in the objective function to evaluate the trade-off between the two objectives. Figure 5 shows the opposite behaviors of the average latency and the average utilization cost. When we consider only the cost (), the optimization tends to allocate the smallest possible number of instances of each model that can satisfy the inference requests. This results in a high latency due to the assignment of IoT nodes to distant edge nodes: the average latency is 31 ms, the average utilization is 67%, and the average cost is 0.23. When we increase the value of , the latency counts for more in the decision making. An increase in causes an increase in the cost and a reduction in the average latency, until a maximum value at which the two objectives converge (). The intersection of the Pareto optimal curves, i.e., the curves of the two goals of average latency and average cost, happens when is equal to 0.04, with a latency of 22.7 ms, a cost of 0.3, and a utilization of 70%. Setting the configuration at the intersection point of the goals decreases the cost by 40% compared to a configuration with a slightly higher value of . It is worth mentioning that we show the average utilization curve in the figure because it represents a significant metric, but it cannot replace the average cost, as the intersection between the goals is different from the intersection between latency and utilization. The optimization tends to allocate multiple instances of the same model on different edge nodes (possibly with a different variant type in each edge node, depending on the available capacity).
In this paper, we studied DNN Model Variant Selection and Placement (MVSP) in edge computing networks. A mathematical model was proposed to formulate the problem, considering the inference latency of different model-variants, the communication latency between nodes, and the utilization cost of edge computing nodes (resources). We also considered the effects of hardware sharing on inference latency for GPU edge computing nodes shared between different model-variants. We studied the placement results of the optimization and their effect on the average latency and cost, and showed that GPU sharing is a valuable approach to handle increasing inference request rates. Results show that increasing model co-location decreases the millisecond-scale average latency per request by 33% for low load and by 21% for high load. We plan to extend our model to consider more GPU sharing scenarios, to analyze parameters such as multiple frameworks and multiple hardware devices, and to implement heuristic solutions.
-  (2018) Balancing the migration of virtual network functions with replications in data centers. In NOMS 2018 IEEE/IFIP Network Operations and Management Symposium, pp. 1–8. Cited by: §II-C.
-  (2018-06) A configurable cloud-scale DNN processor for real-time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 1–14. Cited by: §I.
-  Converting the 0-1 polynomial programming problem to a 0-1 linear program. Operations Research 22 (1), pp. 180–182. Cited by: §II-F.
-  (2019) Tiresias: a GPU cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pp. 485–500. Cited by: §II-G.
-  (2019) Convergence of edge computing and deep learning: a comprehensive survey. CoRR abs/1907.08349. Cited by: §I.
-  (2018) Dynamic space-time scheduling for GPU inference. arXiv preprint arXiv:1901.00041. Cited by: §II-G.
-  Jetson AGX Xavier: deep learning inference benchmarks. Note: https://developer.nvidia.com/embedded/jetson-agx-xavier-dl-inference-benchmarks. Accessed: 2020-01-15. Cited by: §II-C.
-  (2019) On-device neural net inference with mobile GPUs. arXiv preprint arXiv:1907.01989. Cited by: §II-G.
-  (2007) Using decomposition techniques and constraint programming for solving the two-dimensional bin-packing problem. INFORMS Journal on Computing 19 (1), pp. 36–51. Cited by: §II-F.
-  (2018) SDN controller placement with delay-overhead balancing in wireless edge networks. IEEE Transactions on Network and Service Management 15 (4), pp. 1446–1459. Cited by: §III.
-  (2018-10) Boosting edge computing performance through heterogeneous manycore systems. In 2018 International Conference on Information and Communication Technology Convergence (ICTC), pp. 922–924. Cited by: §I.
-  (2019) INFaaS: managed & model-less inference serving. arXiv preprint arXiv:1905.13348. Cited by: §I, §II-D, §II-G.
-  (2019) A unified optimization approach for CNN model inference on integrated GPUs. arXiv preprint arXiv:1907.02154. Cited by: §II-G.
-  (2019) Machine learning at Facebook: understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 331–344. Cited by: §I.
-  The generalized assignment problem and its generalizations. Tech. Rep., St. Mary's College of Maryland, St. Mary's City, MD, USA. [Online]. Available: http://faculty.smcm.edu/acjamieson/f12/GAP.pdf. Cited by: §II-F.