
Optimal Job Scheduling and Bandwidth Augmentation in Hybrid Data Center Networks

Optimizing data transfers is critical for improving job performance in data-parallel frameworks. In hybrid data centers with both wired and wireless links, reconfigurable wireless links can provide additional bandwidth to speed up job execution. However, this requires the scheduler and transceivers to make joint decisions under coupled constraints. In this work, we identify that the joint job scheduling and bandwidth augmentation problem is a complex mixed integer nonlinear problem, which is not solvable by existing optimization methods. To address this bottleneck, we transform it into an equivalent problem based on its heuristic bounds, a revised data transfer representation, and the decoupling and reformulation of its non-linear constraints, such that the optimal solution can be efficiently acquired by the Branch and Bound method. Based on the proposed method, the performance of job scheduling with and without bandwidth augmentation is studied. Experiments show that the performance gain depends on multiple factors, especially the data size. Compared with existing solutions, our method can reduce the average job completion time by up to 10%.


I Introduction

Data transfer has a significant impact on application performance in data-parallel computing frameworks such as MapReduce [dean2008mapreduce], Pregel [malewicz2010pregel] and Spark [zaharia2012resilient]. These computing frameworks all implement a data partitioning model, in which jobs are decomposed into finer-grained tasks, and massive amounts of intermediate data between their computation stages need to be transferred through the network before generating the final results. For many applications in production environments, data transfers account for more than 50% of the job completion time [chowdhury2011managing]. With the rapid growth of processed data sizes, network resources have become an increasingly significant bottleneck for cloud computing performance.

Traditional data center networks (DCNs), which consist of copper and optical fiber cables, provision the link capacity between racks in a fixed manner. During a job’s execution, however, data flows tend to be bursty when multiple tasks are ready for data transmission and hence exhibit dynamic patterns. When the traffic between two racks exceeds the provisioned capacity, congestion will occur. Such static link capacity allocation restricts parallel data transfers and therefore slows down the execution of subsequent tasks during the job execution.

To support the dynamic allocation of network resources, many efforts have recently been made to deploy wireless communication technologies in wired DCNs to enable dynamic bandwidth augmentation, such as mmWave links [terzi202160] and free-space optics (FSO) [celik2019optical]. 60 GHz antennas and FSO transceivers can provide Gigabit transmission capability with low switching latency. By leveraging mmWave MIMO beamforming, a large number of beams can be scheduled with extremely small switching delay [abari2016millimeter], and the reconfiguration delay of FSO was shown to be only 12 μs while supporting 18,432 fanouts [ghobadi2016projector]. As a result, these reconfigurable wireless technologies demonstrate the potential for providing additional bandwidth by dynamically establishing wireless links on demand to offload traffic and reduce the job completion time.

In order to intuitively show both the advantages and challenges of using wireless transmission to reduce job completion time, an example job consisting of five tasks is presented in Fig. 1. Assume the transmission capacity of all wired links and wireless transceivers between racks is 10 Gbps. With only wired links, the intermediate data during each stage must be transmitted sequentially, resulting in a prolonged job completion time. By using dynamically established wireless links to carry part of the intermediate data in parallel, a considerable fraction of the job completion time can be reduced. Thus, an appropriate wireless bandwidth augmentation scheme can greatly speed up the job execution. However, it also requires the job scheduler and transceivers to make joint decisions under coupled computing and communication constraints.

Fig. 1: An example to illustrate the advantages and challenges of using dynamically established wireless links to reduce job completion time.

In our previous work [luo2019energy], a flow routing and antenna scheduling scheme was proposed for hybrid DCNs without considering computing tasks. Many important works focus on enhancing flow scheduling performance using wireless technologies, with the aim of minimizing network congestion [han2015rush], relieving hotspots [halperin2011augmenting], enhancing network flow throughput [cui2013dynamic], or reducing the length of flow paths [li2019energy]. However, these studies assume that the computing tasks have already been assigned and hence the endpoints of flows are predetermined, without jointly scheduling the computation and communication. The work most related to ours is [ao2021joint], which studies the joint wireless link scheduling and computing task assignment problem and obtains substantial performance gains. However, its model assumes tasks are independent and can be processed simultaneously, without considering dependency constraints between adjacent tasks.

In this work, we aim to jointly schedule dependency-constrained tasks and wireless transceivers in hybrid DCNs. We identify that such a problem is a complex mixed integer non-linear programming problem, which is not solvable by existing optimization methods. To overcome this, we transform it into an equivalent problem based on its bounds, a revised data transfer model, and the reformulation of its non-linear constraints, such that the optimal solution can be acquired efficiently by the Branch and Bound method. Through numerical experiments, we find that the performance gain introduced by wireless augmentation depends on multiple factors, especially the data size. Compared with existing solutions, our method can reduce the average job completion time by up to 10% under a production-scenario setting.

II System Model

Consider a hybrid DCN consisting of a set of racks. Each rack is composed of a number of servers for computation and storage, and is equipped with a reconfigurable wireless transceiver for bandwidth augmentation. The racks are connected by both wired links with fixed capacity and dynamically established wireless links. We assume that orthogonal channel allocation and progressive directional antennas are used, such that the total wireless bandwidth is shared by wireless links among racks via FDMA without interference.

In this work, we consider periodic jobs, which are loaded every day and whose detailed knowledge can be profiled from historical logs (according to [ren2012workload], periodic jobs can be optimized and account for a large share of the workload in the Hadoop cluster at Taobao). Each job is described by a directed acyclic graph (DAG) G = (V, E), as in job scheduling systems like Fuxi [zhang2014fuxi], where V is the set of computing tasks and E is the set of directed edges representing the dependencies between adjacent tasks. Each task specifies its unit size of resources, e.g., {1 core CPU, 1 GB memory}, so its processing time can be profiled accordingly. Each edge (u, v) specifies the size of the intermediate data transferred from task u to task v, and the bandwidth required to transmit data across racks is also specified.
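To make the job model concrete, the following is a minimal Python sketch of such a DAG job profile. The class layout, field names, and the example values loosely mirroring the five-task job of Fig. 1 are illustrative assumptions, not data structures taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    tasks: dict                                # task id -> profiled processing time
    edges: dict                                # (u, v) -> intermediate data size (e.g., in GB)
    deps: dict = field(default_factory=dict)   # task id -> list of predecessor task ids

    def __post_init__(self):
        # Derive the dependency map from the edge set.
        for (u, v) in self.edges:
            self.deps.setdefault(v, []).append(u)

# Example: a five-task job in the spirit of Fig. 1 (values are made up).
job = Job(
    tasks={1: 10, 2: 20, 3: 15, 4: 20, 5: 10},
    edges={(1, 3): 50, (2, 3): 80, (3, 4): 40, (3, 5): 60},
)
```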

Upon receiving the job, the job scheduling system checks the free resources among racks and tries to allocate computing and bandwidth resources that meet the job’s resource requirements; the racks that can satisfy these requirements form the set of feasible racks. The allocated wired bandwidth between each pair of racks must be guaranteed. For wireless resources, the available wireless bandwidth is divided into multiple orthogonal subchannels, each with a fixed bandwidth. With the allocated bandwidth resources, the transfer time of the data on an edge through wired links is calculated as the data size divided by the allocated wired bandwidth, and the transfer time through a wireless subchannel is calculated analogously using the subchannel bandwidth. Otherwise, if adjacent tasks are assigned to the same rack, the data is transferred locally with a constant delay.
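As a rough illustration of this transfer-time model, the helper below computes the delay of moving one edge's data under the three cases. The units, default bandwidths, and the constant local delay are assumptions for the sketch, not values taken from the paper.

```python
def transfer_time(data_size_gb, same_rack, use_wireless=False,
                  wired_bw_gbps=10.0, subchannel_bw_gbps=10.0, local_delay_s=0.1):
    """Return the time to move one edge's intermediate data.

    local_delay_s is a placeholder for the constant in-rack transfer delay;
    the bandwidth defaults are illustrative only.
    """
    if same_rack:
        return local_delay_s                    # data stays within the rack
    bw = subchannel_bw_gbps if use_wireless else wired_bw_gbps
    return data_size_gb * 8.0 / bw              # GB -> Gb, divided by Gbps
```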

III Problem Formulation

III-A Common Constraints for Computing Task Assignment

We define a binary assignment variable and a continuous start-time variable for each task. Specifically, the assignment variable indicates whether a task is assigned to a given rack, and the start-time variable denotes the task's start time. Inherently, the following constraints must be satisfied:

III-A1 Non-repetition Constraints

Each task must be assigned to one rack and processed only once, namely,

(1)

III-A2 Non-preemption Constraints

To prevent computing resource overload, each rack is allowed to process only one of the job’s tasks at a time. Once started, a task cannot be interrupted by any other task until its completion. For any two tasks assigned to the same rack,

(2)

The two expressions in constraint (2) represent the racks selected for the two tasks, respectively. Constraint (2) thus guarantees that if two computing tasks are assigned to the same rack, there is no resource competition between them.

III-A3 Precedence Constraints

A computing task only starts after the completion of all its precedent tasks, namely,

(3)

Remark 1: Note that constraint (3) is relatively loose, as it ignores the data transfer time between adjacent tasks, which will be discussed in the next subsection.

III-B Constraints for Intermediate Data Transfers

Coupled with the assignment decisions of computing tasks, the intermediate data between tasks may be transferred locally without occupying cross-rack links, or transmitted externally through either wired or wireless links. For clarity, we define a binary locality indicator for each edge, namely,

(4)

where a value of one means the two adjacent tasks are assigned to the same rack. In this case, the data on the edge will be transferred locally (i.e., within a rack) with a constant delay, namely,

(5)

Otherwise, the two adjacent tasks are assigned to different racks, and a network flow between racks will occur.

Heterogeneous network flow scheduling constraints: We define binary variables indicating, respectively, whether the data on an edge is transferred via wired links and whether it is assigned to a particular wireless subchannel. The start time of the data transmission on each edge is denoted by a continuous variable. Firstly, the data on an edge can only start to be transmitted after the completion of its source computing task, namely,

(6)

III-B1 Data Transmitted Through Wired Links

If the data on an edge is transmitted through wired links, the subsequent task can only start after it receives all the data, namely,

(7)

To prevent congestion, for each pair of different network flows transferred via wired links, there is

(8)

where both flows are required to be transferred via wired links.

III-B2 Data Transmitted Through Wireless Links

Similarly, if the data is transmitted through a wireless subchannel, the subsequent task must wait until the data transfer ends, namely,

(9)

To prevent wireless interference, each subchannel is allowed to carry at most one network flow at any time, and once started, a data transmission cannot be interrupted until its completion. For any two flows assigned to the same subchannel,

(10)

The two expressions in constraint (10) indicate the subchannels selected for the two data transfers, respectively. Constraint (10) therefore guarantees that if two network flows are transferred over the same subchannel, there is no interference during their transmission.

III-C Problem Formulation

The objective is to minimize the job completion time. Thus the original problem can be formulated as follows,

where the decision variables comprise the task assignments, task start times, and data transfer decisions defined above.

It is observed that OP is a complex Mixed Integer Non-linear Programming (MINLP) problem with a large number of coupled constraints, which is not directly solvable by existing optimization methods. Exhaustive search for the optimal solution is intractable, due to the huge solution space imposed by the logical and disjunctive constraints. Even for a common-scale OP (e.g., job sizes found in production cases [ren2012workload]), searching for the optimal solution is non-trivial, and the time complexity is unacceptable. In the next section, we will transform OP into an equivalent problem through a combination of multiple steps, which paves the way for adopting sophisticated optimization methods to acquire its optimal solution efficiently.

IV The Optimal Job Scheduling and Bandwidth Augmentation Scheme

To make it possible to solve OP within a reasonable time, we first use heuristics to estimate its upper and lower bounds. Next, we introduce the generalized data transfer model to linearize the coupled constraints between task assignment decisions and data transfers. Then, a disjunctive reformulation technique combined with multiple auxiliary variables is adopted to convert the resource constraints of problem OP into their linearized forms, which allows us to acquire its optimal solution using the Branch and Bound method.

IV-A Heuristic-based Bounds Estimation

Upper Bound: For any given job, a feasible scheduling scheme can be obtained by assigning all of its tasks to a single rack. In this case, tasks are processed in a topological sort order without cross-rack data transmission, and the resulting job completion time can be calculated directly from this sequential schedule. We take this value as the upper bound of OP, assuming any reasonable scheduling scheme cannot be worse than this single-rack scheme.
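A minimal sketch of this single-rack upper bound, reusing the Job class sketched earlier and assuming local transfer delays simply add to the sequential processing time (the paper's exact expression is not shown in the extracted text):

```python
def upper_bound(job, local_delay_s=0.1):
    # All tasks on one rack, executed sequentially: no cross-rack transfers,
    # only a constant local delay per intermediate data item (assumed additive).
    return sum(job.tasks.values()) + local_delay_s * len(job.edges)
```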

Lower Bound: The lower bound of OP can be obtained by summing up the processing times of computing tasks and the local data transfer delays along the longest branch of the given job. For simplicity of illustration, we present an example in Fig. 2. Fig. 2(a) is an example DAG job graph, while Fig. 2(b) is the converted cost graph obtained by transforming each node’s cost into the costs of its outgoing edges. Then the longest path algorithm can be used to calculate the distance from the start node to each task (i.e., the earliest start time of that task). Finally, the longest branch length, which serves as the lower bound of OP, can be obtained. The detailed procedure is presented in Algorithm 1.

Fig. 2: An example for calculating the longest branch of DAG job graph.

Input: Job DAG G = (V, E) with task processing times and local transfer delays.
Output: The longest branch length of the job.

1: Define the cost of each edge as its source task's processing time plus the local transfer delay on that edge.
2: for each task v in V do
3:     Initialize dist(v) as the distance from the start node to v.
4:     for each outgoing edge (v, u) of task v do
5:         Set the cost of edge (v, u) as defined in line 1.
6: Topologically sort the tasks in V.
7: for each task v in topological sort order do
8:     Update dist(u) = max(dist(u), dist(v) + cost(v, u)) for each outgoing edge (v, u).
9: return the maximum of dist(v) plus the processing time of v over all tasks v.
Algorithm 1 The Longest Branch Algorithm
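For reference, a small executable version of Algorithm 1 is sketched below, again using the Job class from Section II. The edge-cost definition (source processing time plus local transfer delay) and the handling of sink tasks are assumptions inferred from the description above, not guaranteed to match the paper's exact formulas.

```python
from graphlib import TopologicalSorter

def longest_branch(job, local_delay_s=0.1):
    ts = TopologicalSorter()
    for v in job.tasks:
        ts.add(v, *job.deps.get(v, []))          # v depends on all of its predecessors
    dist = {v: 0.0 for v in job.tasks}           # earliest start time of each task
    for v in ts.static_order():                  # relax outgoing edges in topological order
        for (u, w) in job.edges:
            if u == v:
                cost = job.tasks[u] + local_delay_s   # converted edge cost
                dist[w] = max(dist[w], dist[v] + cost)
    # The branch length also includes the last task's own processing time.
    return max(dist[v] + job.tasks[v] for v in job.tasks)
```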

IV-B Generalized Representation of Data Transfer

Fig. 3: Illustration of generalized data transfer model.

Depending on the assignment decisions of adjacent tasks, the intermediate data on each edge is either available in local disks or transferred through wired or wireless links. To eliminate the logical constraints associated with the locality variables and cover the different cases of data transfers, we devise a generalized data transfer model by introducing a virtual channel with infinite bandwidth alongside the wired channel and the wireless subchannels, in which each edge is associated with a "single-pole triple-throw switch". As illustrated in Fig. 3, the case of locally available data is viewed as data transmitted over an infinite-bandwidth channel without resource conflict but with a constant delay, since no cross-rack data transmission is needed when the adjacent tasks are assigned to the same rack. Thus, each piece of intermediate data will be transferred through one channel drawn from these three types of network resources. By adopting the generalized data transfer model, the intermediate data on each edge must be transferred on exactly one of the communication channels, namely,

(11)

where the indicator equals one if the intermediate data on the edge is transferred via the corresponding communication channel.

Therefore, the data on an edge is transferred either through wired links, through one of the wireless subchannels, or through the local virtual channel (i.e., the local disk) when its adjacent tasks share a rack.
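The channel set of this generalized model can be pictured as below; the channel names and the final comment mirroring constraint (11) are illustrative assumptions rather than the paper's notation.

```python
def build_channels(num_subchannels):
    # One wired channel, K wireless subchannels, plus a virtual "local"
    # channel with unlimited capacity and no contention.
    channels = ["wired"]
    channels += [f"sub{k}" for k in range(num_subchannels)]
    channels.append("local")
    return channels

# Each edge then selects exactly one channel, mirroring constraint (11):
# for every edge e: sum over c in channels of select[e, c] == 1
```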

IV-C Constraints Decoupling and Reformulation

With the bounds and the generalized data transfer model, we can linearize OP based on the disjunctive reformulation technique. We define an auxiliary variable for each task that couples its rack assignment with its start time, and, similarly, an auxiliary variable for each edge that couples the channel assignment of its intermediate data with the data transfer start time. The following constraints bind these auxiliary variables to the original decision variables:

(12)
(13)

Here, a big-M constant and a small positive constant are used; the latter is commonly employed in the logical constraint reformulation of MINLPs and can be set to 0.1 in practice.

Next, we introduce a task co-location indicator, which equals one if two tasks are assigned to the same rack, and a task precedence indicator, which equals one if one task starts no later than the other. Similarly, for data transmission, we define a binary contention indicator, which equals one if the data on two edges compete for the same network channel, and a flow precedence indicator, which equals one if one data transfer begins no later than the other. Eventually, the following constraints are required to construct these indicator variables.

(14)
(15)
(16)
(17)

IV-C1 Computing resource constraint reformulation

To ensure that the executions of any two computing tasks on the same rack do not overlap, the computing resource constraints can be linearized by utilizing the disjunctive programming formulation technique as follows:

(18)
(19)
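The flavor of such disjunctive big-M constraints can be sketched with Gurobi's Python interface, which the evaluation in Section V uses, as below. The variable names, the big-M value, and the exact form are assumptions for illustration and are not claimed to reproduce constraints (18)-(19) verbatim.

```python
import gurobipy as gp
from gurobipy import GRB

def add_no_overlap(m, s_u, p_u, s_v, p_v, same_rack, big_m=1e4):
    """Either task u finishes before task v starts, or vice versa,
    enforced only when both tasks share a rack (same_rack is a binary var/expr)."""
    order = m.addVar(vtype=GRB.BINARY, name="u_before_v")
    # If same_rack == 1 and order == 1: u must finish before v starts.
    m.addConstr(s_u + p_u <= s_v + big_m * (1 - order) + big_m * (1 - same_rack))
    # If same_rack == 1 and order == 0: v must finish before u starts.
    m.addConstr(s_v + p_v <= s_u + big_m * order + big_m * (1 - same_rack))
    return order

# Usage sketch:
# m = gp.Model("rp-sketch")
# s_u, s_v = m.addVar(name="s_u"), m.addVar(name="s_v")
# same = m.addVar(vtype=GRB.BINARY, name="same_rack")
# add_no_overlap(m, s_u, 10, s_v, 20, same)
```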

IV-C2 Communication resource constraint reformulation

Similarly, to ensure that data transmissions do not conflict over wired links or wireless subchannels, constraints (20)-(23) should be satisfied, i.e.,

(20)
(21)
(22)
(23)

IV-C3 Precedence constraints reformulation

To coordinate the computing task execution and bandwidth augmentation and maintain the consistency of task and data transfer decisions, each task or data transfer can only start after all of its precedent tasks are completed, i.e.,

(24)
(25)

where the selection of exactly one channel per edge is explicitly guaranteed earlier in constraint (11). Additionally, if the adjacent tasks of an edge are assigned to the same rack, the intermediate data will be transferred locally without occupying network resources. Thus, the coupling constraint between the assignment of tasks and the data transfer can be written as:

(26)

As such, all of the constraints in OP are linearized and the problem can be reconstructed as follows,

s.t.

As a result, we transform the MINLP into a linearized problem with the help of its bounds and the generalized data transfer model, so OP can be solved by solving RP. Note that OP and RP are equivalent, since the satisfaction of all constraints in RP indicates the satisfaction of those in OP, and vice versa. RP can be optimally solved by the Branch and Bound (B&B) algorithm [wolsey1999integer], making it possible to jointly schedule jobs and wireless transceivers efficiently.

IV-D Decomposition and Acceleration

To further speed up the B&B solving procedure, we decompose RP into multiple feasibility sub-problems. Each feasibility sub-problem is derived from RP by conditioning on a moving upper bound, which is formulated as

where the moving upper bound replaces the objective of RP. During each iteration, we assume that the sub-problem is feasible, and start with an interval which is known to contain the optimal objective value. We then solve the feasibility sub-problem at its midpoint to determine whether the optimum lies in the lower or upper half of the interval, and narrow the interval accordingly. The interval is bisected at each iteration, so its width is halved every time. We repeat this procedure until the width of the interval is small enough. Eventually, the optimal solution of OP can be acquired from the final feasible sub-problem.
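A minimal sketch of this bisection procedure; here `feasible(T)` stands for solving the feasibility sub-problem with the completion time bounded by T (e.g., by calling a MILP solver), and the function name and tolerance are illustrative assumptions.

```python
def bisect_makespan(feasible, lower, upper, tol=1.0):
    """Shrink [lower, upper] around the optimal completion time by bisection."""
    best = upper
    while upper - lower > tol:
        mid = (lower + upper) / 2.0
        if feasible(mid):            # a schedule finishing within mid exists
            best, upper = mid, mid   # optimum lies in the lower half
        else:
            lower = mid              # optimum lies in the upper half
    return best                      # interval width halves at every iteration
```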

V Simulation Results

We implemented the proposed method using Gurobi [gurobi] and evaluated the performance gain introduced by wireless links through numerical simulations. Similar to [giroire2019network], we randomly generated three types of jobs, i.e., simple MapReduce workflows, one-stage MapReduce workflows, and random workflows, with computing tasks whose processing times are uniformly chosen from [1, 100]. The network factor, defined as the ratio between the average data transfer time and the average processing time, is used to set the data transfer times; the larger the network factor, the larger the data size. As in [halperin2011augmenting, han2015rush, li2019energy], we assume both wired and wireless links have a transmission rate of 10 Gbps, and we focus on scenarios where each allocated wireless subchannel can fulfill the job's specified bandwidth requirement, as the wired links do.
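As a rough illustration of how such workloads could be generated, the sketch below draws processing times uniformly from [1, 100] and scales transfer times by a network factor; the chain-plus-random-edges graph structure and all names here are assumptions, not the generator used in the paper.

```python
import random

def random_workflow(num_tasks, network_factor, seed=0):
    rng = random.Random(seed)
    tasks = {v: rng.uniform(1, 100) for v in range(num_tasks)}
    avg_p = sum(tasks.values()) / num_tasks
    edges = {}
    for v in range(1, num_tasks):
        u = rng.randrange(v)                    # pick an earlier task, keeping the graph acyclic
        edges[(u, v)] = network_factor * avg_p  # stored as a transfer time, not a byte count
    return tasks, edges
```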

In Fig. 4, we compare our method with six different wired-links-only job scheduling baselines in terms of job completion time. Specifically, the Random Scheduling scheme distributes computing tasks randomly, while the List Scheduling scheme is from [rayward1987uet]. The Partition Scheduling, Generalized List (G-List) Scheduling and G-List-Master Scheduling schemes are from [giroire2019network]. The Optimal Scheduling scheme with only wired links is derived from our method by dropping the wireless resources. We fix the network factor to mimic the scenario where approximately half of the time is spent on data transfers, as reported in [chowdhury2011managing]. The task number of each job is chosen to align with the production job statistics from [ren2012workload], in which the majority of jobs contain a small number of tasks. It can be observed that when the racks (computing resources) are insufficient, the performance gain introduced by wireless links is relatively small. As the available rack number increases, adding wireless subchannels can substantially reduce the job completion time. However, adding more than one wireless subchannel contributes relatively little to job performance.


Fig. 4: Average job completion time with and without wireless subchannels as a function of the number of available racks for jobs with ten tasks.

Fig. 5: Average performance gain of adding wireless resources versus network factor ratio on jobs with different numbers of tasks.

In Fig. 5, we fix the available rack number and vary the network factor from 0.1 to 10 to show the impact of increased data size on the average performance gain. As seen from the figure, as the network factor increases, the performance gain first increases and then decreases. The reason is that when the data size is small, the benefit of optimizing data transfers is slight; as the data size increases, data transfers cause greater tardiness and thus wireless augmentation brings higher benefits. As the network factor continues to increase, the data transfer time becomes even longer than the task processing time; in this scenario, it might be better to assign all computing tasks of a job to a single rack to avoid data transfers altogether. Besides, with a fixed network factor (e.g., the red dashed vertical line), the larger the task number, the higher the performance gain achieved by wireless bandwidth augmentation, while adding more wireless resources brings diminishing gains.

VI Conclusion

In this work, we investigated joint job scheduling and bandwidth augmentation in hybrid data centers. We observed that the wireless-augmented job scheduling problem is an MINLP, which is not solvable by existing optimization methods. Thus, we linearized the original model with the help of its bounds, the revised data transfer model and the disjunctive reformulation technique, such that it can be solved optimally by the Branch and Bound method. Simulation results showed that jointly scheduling the tasks and wireless transceivers can significantly reduce the job completion time. In our future work, we will study job scheduling problems that involve more real-world constraints in online scenarios.

References