Container-based virtualization offers a lightweight mechanism to host and manage large-scale distributed applications for big data processing, edge computing, and stream processing, among others. Multiple tenants encapsulate applications’ environments in containers, abstracting away details of operating systems, library versions, and server configurations. With containers, data center (DC) management becomes application-oriented, in contrast to the server-oriented management used with virtual machines. Several technologies are used to provide connections between containers, such as virtual switches, bridges, and overlay networks. Yet, containers are a catalyst for network management complexity. Network segmentation, bandwidth reservation, and latency control are essential requirements to support distributed applications, but container management frameworks still lack appropriate tools to support Quality-of-Service (QoS) requirements for network provisioning.
We argue that container networking must address at least three communication scenarios, regardless of the orchestration framework used by the DC: highly-coupled container-to-container communication, group-to-group communication, and containers-to-service communication. Google’s Kubernetes offers a viable solution to group network-intensive or highly coupled containers by using pods. A pod is a group of one or more containers with shared storage and network, and each pod must be provisioned on a single host. Because the host bus conducts all data transfers within a pod, communication latency is more stable and network throughput is higher than with default network switching technologies. However, for large-scale distributed applications, multiple pods must be provisioned and eventually allocated on distinct servers.
This paper advances the field of network-aware container scheduling, a primary management task on container-based DCs, by jointly allocating compute and communication resources to host network-aware requests. Network-aware scheduling is analogous to the virtual network embedding (VNE) problem. Given two graphs, the first representing the user-requested containers and all corresponding communication requirements, and the second denoting the DC hosting candidates (servers, virtual machines, links, and paths), one must map each vertex and edge of the request graph to a corresponding vertex and edge of the DC graph. Vertices and edges carry weights representing processing and bandwidth constraints. The combined scheduling of containers and network QoS requires a multi-criteria decision, based on conflicting constraints and objectives.
In this paper, we formally define the scheduling problem encompassing network QoS as a Mixed Integer Linear Program (MILP). We then propose two Graphics Processing Unit (GPU)-accelerated multi-criteria algorithms to process large-scale requests and DC topologies.
The paper is organized as follows. §II describes the problem formulation, while §III defines an optimal MILP for the joint allocation of containers and network QoS requirements. Next, §IV presents the evaluation of the proposed MILP, highlighting the efficiency and limitations of network-aware scheduling. Then, §V describes the implementation of two GPU-accelerated algorithms to speed up the scheduling process, and both algorithms are compared with traditional approaches in §VI. Related work is reviewed in §VII, and §VIII concludes.
II Problem Formulation
II-A DC Resources and Tenants’ Requests
Data center resources (bare metal or virtualized) are represented by a graph whose vertices denote the physical servers and whose edges contain all physical links composing the network topology. A vector is associated with each physical server, representing its available capacity for each resource type, such as RAM and CPU. In addition, an available bandwidth is associated with each pair of physical servers. Similarly, a tenant request is given by a graph, with a set of containers as vertices and the communication requirements among them as edges. Also, as in Kubernetes, each container is associated with a pod.
Containers from a pod must be hosted by the same physical server (sharing the IP address and port space). A group of pods is defined in a tenant’s request, and each container is connected to one pod group. Instead of requesting a fixed configuration for each QoS requirement, container requirements are specified as intervals bounded by minimum and maximum values. For a container, a minimum and a maximum value are defined for every resource type. The same rationale is applied to the interconnections between containers: each one carries a minimum and a maximum bandwidth requirement.
A container orchestration framework has to determine whether or not to accept a tenant request. The allocation of containers onto a DC is decomposed into node and link assignments: one map assigns containers to nodes, while a second map assigns the networking links between containers to paths. Table I summarizes the notation used in this paper.
|DC graph composed of servers and links.|
|Resource capacity vector of a server.|
|All direct paths (physical and logical) on the DC topology.|
|Bandwidth capacity between a pair of servers.|
|Request, composed of containers and links.|
|Minimum and maximum resource capacities for a container.|
|Minimum and maximum bandwidth requirement between a pair of containers.|
|Set of containers composing a pod.|
Energy consumption. To reduce energy consumption, we pack containers into as few nodes as possible, allowing the unused ones to be powered off. We call this technique consolidation, and we reach it by minimizing the DC fragmentation, defined as the ratio of the number of active servers (i.e., those hosting containers) to the total number of DC resources. Server fragmentation is therefore the number of active servers divided by the total number of servers, and the same rationale is applied to links: link fragmentation is the number of active links divided by the total number of links.
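The fragmentation ratios described above can be sketched in a few lines of Python. This is an illustration only: the function names, the dictionary-based placement, and the representation of a path as a list of server pairs are assumptions made for the example, not the paper's implementation.

```python
def fragmentation(active: int, total: int) -> float:
    """Ratio of active resources to all resources of the same kind."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= active <= total:
        raise ValueError("active must lie in [0, total]")
    return active / total


def dc_fragmentation(placement, paths, n_servers, n_links):
    """Return (server fragmentation, link fragmentation).

    placement: container name -> hosting server (hypothetical layout)
    paths: list of paths, each a list of (u, v) undirected links
    """
    active_servers = set(placement.values())
    # frozenset makes (u, v) and (v, u) count as the same undirected link
    active_links = {frozenset(link) for path in paths for link in path}
    return (fragmentation(len(active_servers), n_servers),
            fragmentation(len(active_links), n_links))


if __name__ == "__main__":
    placement = {"c1": 0, "c2": 0, "c3": 2}
    paths = [[(0, 1), (1, 2)]]
    print(dc_fragmentation(placement, paths, n_servers=4, n_links=4))
```

Lower values mean better consolidation: fewer servers and links need to stay powered on for the same set of containers.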
Quality-of-Service. A container can be successfully executed with any capacity configuration within the intervals specified by its minimum and maximum values. However, optimal performance is reached when the maximum values are used. In this sense, utility functions can be applied to represent the improvement in a container’s configuration. In short, the goal is to maximize Eqs. (1) and (2) for each container, computed from the capacity effectively allocated to vertices and edges, respectively.
III Optimal MILP for Joint Container and Network QoS Allocation
III-A Variables and Objective Function
A set of variables (Table II) is proposed to find a solution for the joint allocation of containers and bandwidth requirements, as well as to obtain the node and link maps. A binary variable accounts for the mapping of containers onto servers, and the connectivity between containers applies the same rationale. To identify the amount of resources allocated to a container, a float vector is introduced. Bandwidth allocation follows the same principle and is accounted for by a float variable.
|Bool|Container is mapped on a server.|
|Bool|Connection is mapped on a link.|
|Float|Resource capacity vector allocated to a container on a server.|
|Float|Bandwidth allocated to a connection on a link.|
|Bool|Server is hosting at least one container.|
|Bool|Link is hosting at least one connection.|
The objectives (§II-B) are reached by the minimization of Eq. (3). Two additional binary variables are used to identify whether DC resources are hosting at least one container or link (Table II). The server variable is set only for active servers, i.e., those hosting at least one container, and the variable for physical links follows the same idea. Finally, the importance level of each term is defined by a weighting parameter.
DC Capacity, QoS Constraints and Integrity of Pods. A DC server must support all hosted containers, as indicated by Eq. (4), while the bandwidth of a link must support all container transfers allocated to it, as given by Eq. (5). Eq. (6) guarantees that the resource capacities allocated to a container lie within its requested minimum-maximum interval. The same rationale is applied to bandwidth in Eq. (7).
Finally, containers are optionally organized in pods. To guarantee the integrity of pod specifications, Eq. (8) indicates that all containers from a pod must be hosted by the same server.
Binary and Allocation Constraints. A container must be hosted by a single server, while each virtual connection between containers is mapped to a path between the resources hosting its source and destination. However, on large-scale DC topologies, servers are interconnected by multiple paths composed of at least one switch hop. To keep the model realistic with respect to current DCs, we rely on network management techniques, such as SDN, to control the usage of physical links and to populate the set of DC paths with updated information on the available paths.
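To make the constraint families above concrete, the following Python sketch checks a candidate allocation against them: server capacity, link capacity, min-max intervals for resources and bandwidth, and pod integrity. It is a verifier for a given solution, not the MILP itself, and the data layout (dictionaries keyed by hypothetical names such as `"min"`, `"max"`, `"pod"`, `"bw"`, and `"path"`) is an assumption made for illustration.

```python
from collections import defaultdict


def check_allocation(servers, containers, placement, links=None, flows=None):
    """Return a list of violated constraints (an empty list means feasible).

    servers:    server -> {resource: capacity}
    containers: name -> {"min": {...}, "max": {...}, "pod": id or None}
    placement:  name -> (server, {resource: allocated amount})
    links:      frozenset({u, v}) -> bandwidth capacity
    flows:      (a, b) -> {"min": x, "max": y, "bw": z, "path": [(u, v), ...]}
    """
    errors = []
    used = defaultdict(lambda: defaultdict(float))
    pod_host = {}
    for name, spec in containers.items():
        if name not in placement:
            errors.append(f"{name}: not placed")
            continue
        server, alloc = placement[name]  # one entry per container: single host
        if server not in servers:
            errors.append(f"{name}: unknown server {server}")
            continue
        for res, lo in spec["min"].items():
            got = alloc.get(res, 0.0)
            if not lo <= got <= spec["max"][res]:  # min-max interval
                errors.append(f"{name}: {res}={got} outside requested interval")
            used[server][res] += got
        pod = spec.get("pod")
        if pod is not None and pod_host.setdefault(pod, server) != server:
            errors.append(f"pod {pod}: containers split across servers")
    for server, per_res in used.items():  # server capacity
        for res, total in per_res.items():
            if total > servers[server].get(res, 0.0):
                errors.append(f"{server}: {res} over capacity")
    link_used = defaultdict(float)
    for pair, flow in (flows or {}).items():
        if not flow["min"] <= flow["bw"] <= flow["max"]:  # bandwidth interval
            errors.append(f"{pair}: bandwidth outside requested interval")
        for u, v in flow["path"]:
            link_used[frozenset((u, v))] += flow["bw"]
    for link, total in link_used.items():  # link capacity
        if total > (links or {}).get(link, 0.0):
            errors.append(f"{sorted(link)}: bandwidth over capacity")
    return errors
```

A solver would search for a placement that makes this function return an empty list while minimizing the objective; the checker only documents what "feasible" means in this sketch.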
IV Evaluation of the Optimal MILP for Network-Aware Containers Scheduling
The MILP scheduler and a discrete event simulator were implemented in Python using the CPLEX optimizer. The baseline is composed of the native algorithms offered by container orchestrators, Best Fit (BF) (binpacking) and Worst Fit (WF) (spread). As BF and WF natively ignore the network requirements, we included a shortest-path search after the allocation of servers to host containers in order to conduct a fair comparison.
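As a rough illustration of this baseline, the sketch below selects a server with a Best Fit or Worst Fit policy and then runs a hop-count shortest-path search between two hosts. The function names, the leftover score (a plain sum of residual resources, which mixes units), and the tie-breaking rule are assumptions for the example; orchestrators implement these policies with their own scoring.

```python
from collections import deque


def pick_server(free, demand, policy="best"):
    """Choose a feasible server under a Best Fit or Worst Fit policy.

    free:   server -> {resource: free capacity}
    demand: {resource: requested amount}
    """
    feasible = [s for s, cap in free.items()
                if all(cap.get(r, 0) >= d for r, d in demand.items())]
    if not feasible:
        return None

    def leftover(server):
        # Crude residual score: sums resources with different units.
        return sum(free[server][r] - demand[r] for r in demand)

    sign = 1 if policy == "best" else -1  # best: least leftover, worst: most
    return min(feasible, key=lambda s: (sign * leftover(s), s))


def shortest_path(adjacency, src, dst):
    """Breadth-first search returning a minimum-hop path, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```

Because the server is chosen before the path, this baseline can pick hosts whose connecting links lack bandwidth, which is the weakness a joint scheduler is meant to avoid.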
IV-A Metrics and MILP Parametrization
The MILP objective function, Eq. (3), is composed of terms representing the tenant’s perspective (the utility of network allocation and the queue waiting time) and the DC fragmentation (the provider’s perspective). Although a minimum value is requested for each container parameter, the optimal utility is obtained when the maximum values are allocated. The MILP-based scheduler is guided by a weighting parameter that defines the importance of each term composing the objective function. To demonstrate the impact of this parameter, we evaluated several configurations. The two extreme configurations define the baseline for comparisons: with one, the MILP optimizes the problem from the fragmentation perspective only, while the other represents the opposite, giving more importance to container and network utilities.
IV-B Experimental Scenarios
IV-B1 DC Configuration
A Clos-based topology (termed Fat-Tree) is used to represent the DC [2, 10]. A single factor guides the topology, indicating the number of switches, links, and hosts used to compose the DC: the number of servers a fat-tree supports is bounded by the port count of the switches it is built with. The DC is composed of homogeneous servers, each equipped with the same number of cores and the same amount of RAM, while all links have the same bandwidth capacity (in Gbps).
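The sizing of such a topology can be sketched as follows. The code uses the standard properties of a three-tier fat-tree built from identical switches; the parameter name `k` for the switch port count and the function name are assumptions for the example, and the value used in the experiments is not reproduced here.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts of a standard three-tier fat-tree of k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    hosts = k * half * half              # k pods * (k/2 edge switches) * (k/2 hosts)
    edge = aggregation = k * half        # k/2 of each kind in every pod
    core = half * half
    # Each of the three tiers of cabling carries k^3/4 links:
    # host-edge, edge-aggregation, aggregation-core.
    links = 3 * hosts
    return {
        "pods": k,
        "hosts": hosts,
        "edge_switches": edge,
        "aggregation_switches": aggregation,
        "core_switches": core,
        "switches": edge + aggregation + core,
        "links": links,
    }


if __name__ == "__main__":
    for ports in (4, 8, 16):
        print(ports, fat_tree_size(ports))
```

The cubic growth in hosts is what makes the port count a convenient single knob for scaling the simulated DC.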
A set of requests is submitted, with resource specifications based on uniform distributions for container capacities, submission time, and duration. Each request is composed of containers whose running time is bounded relative to the complete execution, both measured in simulation events. To compose the pods, a bounded share of the containers from a single request is grouped in pods. For the network, two classes of requests are used: network-intensive requests, with a high upper limit (in Mbps) on the bandwidth requirement between a pair of containers, and requests with a low bandwidth requirement (in Mbps), representing applications without burdensome network requirements. The values for CPU and RAM configuration are uniformly distributed up to fixed limits.
IV-C Results and Discussion
The BF and WF algorithms have a well-defined pattern for the network utility metric. For requests with low network requirements, both algorithms tend to allocate the maximum requested value for network QoS. An exception is observed for BF with network-intensive requests, as the algorithm gives priority to the minimum requested values in order to consolidate requests on DC resources. Regarding the network-aware MILP scheduler, even when configured to focus on decreasing the DC fragmentation, the scheduler allocated maximum values for network requests, following the BF and WF algorithms. However, the impact of the parametrization is perceived for network-intensive requests. The MILP configured with an intermediate weight shows that the algorithm can jointly consider request utility and DC fragmentation. The results in Figure 1(b) show that scheduling network-intensive requests increases the DC network fragmentation. The provisioning delays (Figure 1(a)) explain this fact: the MILP scheduler decreases the queue waiting time for network-intensive requests when compared to BF and WF.
In summary, network QoS must be considered by the scheduler to decrease the queue waiting time and to accommodate the dynamic (minimum-maximum) configurations behind the utility functions. Moreover, the results obtained with the intermediate MILP configuration demonstrate the real trade-off between fragmentation and utility or, in other words, between the provider’s and the tenant’s perspectives.
V GPU-Accelerated Heuristics
Although the MILP is effective for modeling the problem and highlights the impact of network-aware scheduling, solving it is known to be computationally intractable and practically infeasible for large-scale scenarios. Therefore, we developed two GPU-accelerated multi-criteria algorithms to speed up the joint scheduling of containers and network with QoS requirements. We selected two multi-criteria algorithms, the Analytic Hierarchy Process (AHP) and the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), chosen for their multidimensional analysis, which allows several servers to be evaluated simultaneously. Also, AHP and TOPSIS provide a structured method to decompose the problem and to consider trade-offs among conflicting criteria. Following the notation used to express the MILP (Table I), both algorithms analyze the same set of criteria for a given server. In addition, the sum of all bandwidth capacity with source on that server and the current server fragmentation are included in the capacity vector. The multi-criteria algorithms analyze all variables described in Section II-B as attributes.
V-A Weights Distribution
The AHP and TOPSIS algorithms are guided by a weighting vector that defines the importance of each criterion. While the MILP indicates the importance level of each term in the objective function with a single parameter, the multi-criteria function decomposes it into a vector of weights. Table IV presents the different compositions and their correspondence to the MILP objective.
The multi-criteria analysis with the clustering configuration optimizes the problem aiming at DC consolidation (equivalent to the fragmentation-only setting of the MILP formulation) by assigning a high importance level to the fragmentation criterion, while the other criteria equally share the remaining weight. On the other hand, in the network configuration (the utility-oriented setting of the MILP formulation), the bandwidth criterion receives the higher importance level, while the other criteria equally share the remaining weight. This configuration makes the scheduler select the servers that have the highest residual bandwidth. Finally, the flat configuration sets the same importance weight for all criteria (following the rationale of the intermediate MILP setting).
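The three weighting schemas can be expressed with one small helper, sketched below. The criterion names and the favored weight of 0.5 are hypothetical values chosen for illustration; the actual importance levels are those of Table IV.

```python
def weight_vector(criteria, favored=None, favored_weight=0.0):
    """Build a weight per criterion that sums to 1.

    With no favored criterion the weights are flat; otherwise the favored
    criterion receives `favored_weight` and the others share the remainder
    equally.
    """
    criteria = list(criteria)
    if not criteria:
        raise ValueError("at least one criterion is required")
    if favored is None:
        return {c: 1.0 / len(criteria) for c in criteria}
    if favored not in criteria:
        raise ValueError(f"unknown criterion: {favored}")
    if not 0.0 <= favored_weight <= 1.0:
        raise ValueError("favored_weight must lie in [0, 1]")
    if len(criteria) == 1:
        return {favored: 1.0}
    rest = (1.0 - favored_weight) / (len(criteria) - 1)
    return {c: favored_weight if c == favored else rest for c in criteria}


CRITERIA = ("cpu", "ram", "bandwidth", "fragmentation")  # illustrative names

flat = weight_vector(CRITERIA)
clustering = weight_vector(CRITERIA, "fragmentation", 0.5)  # 0.5 is hypothetical
network = weight_vector(CRITERIA, "bandwidth", 0.5)         # 0.5 is hypothetical
```

Keeping the weights normalized to 1 lets the same vector feed either ranking algorithm without rescaling.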
The AHP is a multi-criteria algorithm that hierarchically decomposes the problem to reduce its complexity and performs pairwise comparisons to rank all alternatives. In short, the hierarchical organization is composed of three main levels. The objective of the problem is placed at the top of the hierarchy, the set of criteria is placed in the second level, and the third level represents all the viable alternatives to solve the problem.
In our context, the selection of the most suitable DC server to host a container is performed in steps. In the first step, two vectors are built combining all criteria and alternatives (the second and third levels of the AHP hierarchy), applying the weights defined in Table IV. The vector-based representation was chosen to exploit the Single Instruction Multiple Data (SIMD) parallelism of GPUs. Next, the pairwise comparison is applied to all elements in the hierarchy, setting each cell according to the relation between the two compared elements; the same rationale is applied to the second vector. Both vectors are then normalized. At this point, the algorithm calculates the local rank of each element in the hierarchy, as described in Eqs. (9) and (10). Finally, the global priority of the alternatives is computed to guide the host selection.
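For readers unfamiliar with AHP, the following sketch shows the textbook flow on the CPU: pairwise comparison per criterion, column normalization, local priorities, and a weighted global priority. It assumes ratio-based pairwise judgments and treats every criterion as "higher is better"; both are simplifications for illustration and are not claimed to match the comparison rule or the vectorized GPU layout used by the authors.

```python
def local_priorities(values):
    """Local priority of each alternative for one criterion.

    Builds the pairwise comparison matrix a[i][j] = values[i] / values[j],
    normalizes each column, and averages each row.
    """
    if any(v <= 0 for v in values):
        raise ValueError("ratio-based comparison needs positive values")
    n = len(values)
    matrix = [[values[i] / values[j] for j in range(n)] for i in range(n)]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [sum(matrix[i][j] / col_sums[j] for j in range(n)) / n
            for i in range(n)]


def ahp_rank(alternatives, weights):
    """Rank alternatives by global priority.

    alternatives: name -> {criterion: value}
    weights:      criterion -> importance (expected to sum to 1)
    Returns (names sorted by decreasing priority, name -> priority).
    """
    names = list(alternatives)
    score = dict.fromkeys(names, 0.0)
    for criterion, weight in weights.items():
        local = local_priorities([alternatives[n][criterion] for n in names])
        for name, priority in zip(names, local):
            score[name] += weight * priority
    return sorted(names, key=lambda n: score[n], reverse=True), score


if __name__ == "__main__":
    servers = {"s1": {"cpu": 8, "bandwidth": 10},
               "s2": {"cpu": 4, "bandwidth": 30}}
    print(ahp_rank(servers, {"cpu": 0.5, "bandwidth": 0.5}))
```

With equal weights the bandwidth-rich server ranks first in this toy input, while shifting most of the weight to CPU reverses the order, which is the behavior the weighting schemas rely on.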
The Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) is based on the shortest Euclidean distance from an alternative to the ideal solution. The benefits of this algorithm are three-fold: (i) it can handle a large number of criteria and alternatives; (ii) it requires a small number of qualitative inputs when compared to AHP; and (iii) it is a compensatory method, allowing the analysis of trade-offs among criteria.
The ranking of the DC candidates is performed in steps. Initially, an evaluation vector correlates the DC resources with the criteria elements and is then normalized. The next step applies the weighting schema to the normalized values. From the weighted values, two vectors are then composed with the maximum and minimum values for each criterion, representing the upper bound of solution quality and the lower bound, respectively. TOPSIS then calculates the Euclidean distances between each weighted alternative and the upper and lower bounds. Finally, a closeness coefficient is computed for every DC server, and the resulting array is sorted in decreasing order, indicating the selected candidates.
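The steps above correspond to the standard TOPSIS procedure, sketched here sequentially in Python. The sketch assumes vector (Euclidean) normalization and treats every criterion as a benefit criterion, following the description of the upper and lower bounds as per-criterion maxima and minima; the function name and the row-per-server matrix layout are choices made for the example.

```python
import math


def topsis(matrix, weights):
    """Rank alternatives with TOPSIS.

    matrix:  one row per alternative (server), one column per criterion
    weights: one weight per criterion
    Returns (indices sorted by decreasing closeness, closeness per row).
    """
    n_crit = len(weights)
    # Step 1: vector normalization of every criterion column.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    # Step 2: apply the weighting schema.
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    # Step 3: upper and lower bounds of solution quality.
    best = [max(row[j] for row in weighted) for j in range(n_crit)]
    worst = [min(row[j] for row in weighted) for j in range(n_crit)]
    # Step 4: Euclidean distances and closeness coefficient.
    closeness = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        closeness.append(d_worst / total if total else 0.0)
    # Step 5: sort in decreasing order of closeness.
    order = sorted(range(len(matrix)), key=lambda i: closeness[i], reverse=True)
    return order, closeness


if __name__ == "__main__":
    candidates = [[8, 10], [4, 30], [8, 30]]  # columns: cpu, bandwidth
    print(topsis(candidates, [0.5, 0.5]))
```

Every step is a per-column or per-row operation followed by a reduction, which is why the method maps naturally onto data-parallel hardware.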
V-D GPU Implementation
The AHP and TOPSIS algorithms are decomposed into GPU-tailored kernels following a pipeline execution. The first kernel is in charge of acquiring the DC state and the network-aware container requests, while the remaining kernels perform the comparisons using the parallel reduction technique. The selection of physical paths to host the interconnections between containers requires a special explanation. After the most suitable server is selected for each pod present in the tenant’s request, the virtual links between the containers must be set. A modified Dijkstra algorithm is used to compute the shortest path that has the maximum available bandwidth between the hosting servers. The modified Dijkstra is implemented as a single kernel to allow multiple executions, where each thread calculates a different source and destination pair. As the links between every two nodes in the DC are undirected, the GPU implementation uses a specific array representation to reduce the total space needed. The main principle of this data structure is that the path from a source to a destination and the path in the reverse direction are the same, so only one of them needs to be stored.
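The two ideas in this paragraph, a packed array for undirected links and a bandwidth-aware Dijkstra, can be sketched sequentially as follows. The sketch interprets "shortest path with maximum available bandwidth" as the minimum-hop path whose bottleneck bandwidth is largest, which is an assumption; `tri_index` and `widest_shortest_path` are hypothetical names, and the code runs on the CPU rather than as a CUDA kernel with one thread per source-destination pair.

```python
import heapq


def tri_index(u: int, v: int) -> int:
    """Index of the unordered pair {u, v} in a packed lower triangle.

    (u, v) and (v, u) map to the same slot, so n nodes need n*(n-1)/2 cells.
    """
    if u == v:
        raise ValueError("no self-links")
    hi, lo = (u, v) if u > v else (v, u)
    return hi * (hi - 1) // 2 + lo


def widest_shortest_path(n, bandwidth, src, dst):
    """Minimum-hop path with the largest bottleneck bandwidth.

    bandwidth: packed array indexed by tri_index; 0 means "no link".
    Returns (path, bottleneck), or (None, 0.0) if dst is unreachable.
    """
    best = {src: (0, float("inf"))}       # node -> (hops, bottleneck)
    parent = {}
    heap = [(0, -float("inf"), src)]      # fewest hops first, then widest
    while heap:
        hops, neg_width, u = heapq.heappop(heap)
        if (hops, -neg_width) != best.get(u):
            continue                      # stale queue entry
        if u == dst:
            break
        for v in range(n):
            if v == u:
                continue
            cap = bandwidth[tri_index(u, v)]
            if cap <= 0:
                continue
            cand = (hops + 1, min(-neg_width, cap))
            old = best.get(v)
            if old is None or cand[0] < old[0] or (
                    cand[0] == old[0] and cand[1] > old[1]):
                best[v] = cand
                parent[v] = u
                heapq.heappush(heap, (cand[0], -cand[1], v))
    if dst not in best:
        return None, 0.0
    path = [dst]
    while path[-1] != src:
        path.append(parent[path[-1]])
    return path[::-1], best[dst][1]
```

Halving the storage matters on a GPU because the bandwidth array is read by every thread; the packed layout also guarantees that a reservation made for one direction is visible to the other.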
VI Evaluation of GPU-Accelerated Heuristics
The GPU-accelerated scheduler and a discrete event simulator were implemented in C++, using the GCC compiler and the CUDA framework.
VI-A Experimental Scenarios
The evaluation considers a DC composed of homogeneous servers, each equipped with the same number of cores and the same amount of RAM, interconnected by a Fat-Tree topology with the same bandwidth capacity (in Gbps) for all links. A set of requests was submitted to be scheduled, each composed of containers whose running time is bounded relative to the complete execution, both measured in simulation events. To compose the requests, a bounded share of the containers from a single request is grouped in pods, while the bandwidth requirement between a pair of containers is configured as a heavy network requirement (in Mbps).
VI-B Results and Discussion
Results are summarized in Table V and Figures 2(a) and 2(b), which show the runtime, the utility of network and container requests, and the provisioning delays correlated with the DC fragmentation and the DC network fragmentation, respectively.
Figure 2(a) shows that the multi-criteria algorithms have a small variation in request delay, with the data grouped at high fragmentation percentages, while WF induces delay in requests regardless of the DC fragmentation. In turn, the BF algorithm imposes a higher delay on requests, resulting in a small network fragmentation percentage. WF and BF generate a long request queue, which directly impacts the total computational time needed to schedule all the tenants’ requests.
Regarding the containers’ utility (Table V), the multi-criteria algorithms give priority to scheduling requests with a mix of maximum and minimum requirements, increasing the number of containers in the DC. WF tends to allocate the maximum values of the requests, while BF tends to allocate the minimum values. While the multi-criteria algorithms increase the number of containers in the DC and reduce the total delay, their network fragmentation behaves similarly to that of the WF algorithm, as shown in Figure 2(b). Meanwhile, BF keeps the network fragmentation small due to the long delays that it imposes on the requests. The multi-criteria algorithms present better consolidation results than the WF and BF algorithms because they allocate more requests in the DC while keeping the fragmentation similar to WF. We conclude that the network weighting schema is essential to perform a joint scheduling of container and network requirements. Notably, the GPU-accelerated algorithms can schedule requests with bandwidth requirements atop a large-scale DC in a few seconds. Specifically, TOPSIS outperformed BF, WF, and AHP.
VII Related Work
The problem complexity and search space often create opportunities for heuristic-based solutions.
Guerrero proposed a scheduler for container-based micro-services. The container workload and the networking details were analyzed to perform the DC load balancing. Guo proposed a scheduler that optimizes load balancing and workload through neighborhood division in a micro-service method. Both proposals were analyzed on small-scale DCs, as the problem complexity imposes a barrier to real-scale use. The GenPack scheduler employs monitoring information to define the appropriate group for a container based on its resource usage, avoiding resource disputes among containers. A security-concerned scheduler based on bin-packing with a BF approach has also been proposed. GPU-accelerated algorithms can be applied to speed up these heuristics and reach large-scale DCs.
A joint scheduler based on a priority queue, AHP, and Particle Swarm Optimization (PSO) has been proposed. The requests are sorted by their priority level and waiting time, and the tasks are then ranked by AHP, with the ranking serving as input to PSO. The results show a reduction in makespan when compared to PSO. In addition, a VM scheduler based on TOPSIS and PSO has been proposed. The scheduler was compared with meta-heuristics using the metrics makespan, transmission time, cost, and resource utilization, achieving an improvement of up to 75% when compared to traditional schedulers. Although many multi-criteria solutions appear in the literature, we were unable to find schedulers dealing with containers, pods, and their virtual networks.
Network requirements are disregarded or only partially addressed by most of the reviewed schedulers. Even well-known orchestrators (e.g., Kubernetes) treat the network as a second-level, non-critical parameter. Containers are used to model large-scale distributed applications, and network allocation can clearly impact application performance.
VIII Conclusion
We investigated the joint scheduling of network QoS and containers on multi-tenant DCs. A MILP formulation and experimental analysis reveal that a network-aware scheduler can decrease DC network fragmentation and processing delays. However, solving a MILP is known to be computationally intractable and practically infeasible for large-scale scenarios. We then developed two GPU-accelerated multi-criteria algorithms, AHP and TOPSIS, to schedule requests on a large-scale DC. Both network-aware algorithms outperformed the traditional schedulers with regard to DC and tenant perspectives. Future work includes the scheduling of batch requests and a distributed implementation for increasing the fault tolerance.
The research leading to the results presented in this paper has received funding from UDESC and FAPESC, and from the European Union’s Horizon 2020 research and innovation programme under the LEGaTO Project (legato-project.eu), grant agreement No 780681.
- (2016) An efficient dynamic priority-queue algorithm based on AHP and PSO for task scheduling in cloud computing. In HIS, pp. 134–143. Cited by: §VII.
- (2015) Jupiter rising: a decade of Clos topologies and centralized control in Google’s datacenter network. In SIGCOMM ’15. Cited by: §IV-B1.
- (2016) Borg, Omega, and Kubernetes. Queue 14 (1), pp. 10:70–10:93. Cited by: §I.
- (2019) QVIA-SDN: towards QoS-aware virtual infrastructure allocation on SDN-based clouds. Journal of Grid Computing. Cited by: §III-B, §VII.
- (2018) Genetic algorithm for multi-objective optimization of container allocation in cloud architecture. Journal of Grid Computing 16 (1), pp. 113–135. Cited by: §VII.
- (2018) A container scheduling strategy based on neighborhood division in micro service. In NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symp., pp. 1–6. Cited by: §VII.
- (2017) GENPACK: a generational scheduler for cloud data centers. In IEEE Int. Conf. on Cloud Engineering (IC2E), pp. 95–104. Cited by: §VII.
- (1981) Multiple attribute decision making. Lecture Notes in Economics and Mathematical Systems, Springer. Cited by: §V-C.
- (2018) Tackling virtual infrastructure allocation in cloud data centers: a GPU-accelerated framework. In 2018 14th Int. Conf. on Network and Service Management (CNSM), pp. 191–197. Cited by: §VII.
- (2009) PortLand: a scalable fault-tolerant layer 2 data center network fabric. SIGCOMM Comput. Commun. Rev. 39, pp. 39–50. Cited by: §IV-B1.
- TOPSIS–PSO inspired non-preemptive tasks scheduling algorithm in cloud environment. Cluster Computing, pp. 1–18. Cited by: §VII.
- (2019) Parametrized complexity of virtual network embeddings: dynamic & linear programming approximations. SIGCOMM Comput. Commun. Rev. 49 (1), pp. 3–10. Cited by: §I, §V, §VII.
- (2005) Making and validating complex decisions with the AHP/ANP. Journal of Systems Science and Systems Engineering 14 (1), pp. 1–36. Cited by: §V-B.
- (2018) An analysis and empirical study of container networks. In IEEE INFOCOM 2018-IEEE Conf. on Computer Communications, pp. 189–197. Cited by: §I, §VII.
- (2018) SGX-aware container orchestration for heterogeneous clusters. In IEEE 38th Int. Conf. on Distributed Comp. Systems. Cited by: §VII.