Coflow Scheduling in Data Centers: Routing and Bandwidth Allocation

12/17/2018 ∙ by Li Shi, et al. ∙ Stony Brook University

In distributed computing frameworks like MapReduce, Spark, and Dryad, a coflow is a set of flows transferring data between two stages of a job. The job cannot start its next stage unless all flows in the coflow finish. To improve the execution performance of such a job, it is crucial to reduce the completion time of a coflow, which can contribute more than 50% of the job completion time. While several schedulers have been proposed, we observe that routing, a factor that greatly impacts the Coflow Completion Time (CCT), has not been well considered. In this paper, we focus on the coflow scheduling problem and jointly consider routing and bandwidth allocation. We first provide an analytical solution to the problem of optimal bandwidth allocation with pre-determined routes. We then formulate the coflow scheduling problem as a Mixed Integer Non-linear Programming problem and present its relaxed convex optimization problem. We further propose two algorithms, CoRBA and its simplified version, CoRBA-fast, that jointly perform routing and bandwidth allocation for a given coflow while minimizing the CCT. Through both offline and online simulations, we demonstrate that CoRBA reduces the CCT by 40%-500% compared to the state-of-the-art algorithms. Simulation results also show that CoRBA-fast can be tens of times faster than all other algorithms with around 10% performance degradation compared to CoRBA, which makes CoRBA-fast very applicable in practice.


1 Introduction

In recent years, we have witnessed significant improvements in IT infrastructure, such as high performance computing systems, ultra high-speed networks, and large-scale storage systems. Benefiting from these improvements, our ability to collect, store, and process data has also been dramatically enhanced. In a data center owned by a big corporation like Google or Twitter, hundreds of terabytes of data can be transferred into its data storage system every day [1, 2] and processed and analyzed by computing frameworks such as MapReduce [3] and Spark [4]. By applying such data analysis to many different areas like physics, biology, medicine, manufacturing, and finance, we have greatly changed the world we live in. In such a context, one of the most important goals pursued by engineers and researchers is improving the execution performance of those data processing jobs.

To achieve this goal, a critical problem to solve is how to optimize the data transfer time. In many computing frameworks, jobs consist of a sequence of processing stages. Between two consecutive stages, there is usually a set of flows which move the output data of the previous stage to the nodes executing the later stage. A job cannot start its next stage until all flows in this set finish. We usually refer to such a set of flows as a coflow [5]. Since the transfer of a coflow can occupy more than 50% of the job completion time [6], optimizing the Coflow Completion Time (CCT) is important for improving the execution performance of jobs.

Fig. 1: Scheduling 3 flows in a network with 3 servers and 5 routers.

Recently many mechanisms [7, 8, 9] have been proposed to provide bandwidth guarantee to network flows. With such ability, we can guarantee the flow completion time (FCT), i.e., the completion time of a single flow. However, scheduling a coflow in a fashion that minimizes its completion time is still a complex problem which involves both routing and bandwidth allocation at the level of the whole set of flows.

To illustrate this problem, consider the scenario shown in Fig. 1, in which a coflow with 3 individual flows is waiting to be scheduled in a network with 3 servers and 5 routers. The goal is to minimize the CCT. To schedule this coflow, we need to determine a route for each flow and allocate a certain amount of bandwidth along each route. Fig. 2 shows three schedules generated by different scheduling strategies. Fig. 2(a) shows a schedule in which each flow is routed via the maximum capacity path and bandwidth is fairly allocated to flows using the same link, while Fig. 2(b) shows a schedule using the same routes but allocating bandwidth based on flow volume. We can see that by appropriately allocating bandwidth, the completion time of the largest flow is successfully reduced, which leads to a decrease in the CCT. However, the schedule shown in Fig. 2(b) is not the optimal one: the determined routes share the same link, which causes bandwidth competition and limits the CCT. This is because routing is performed at the level of individual flows rather than the whole set of flows in this schedule. In contrast, Fig. 2(c) shows the optimal schedule, in which we co-schedule all flows, reduce route overlap, and allocate more bandwidth.

(a) Schedule 1: Maximum capacity path + Fair sharing
(b) Schedule 2: Maximum capacity path + Volume-proportional sharing
(c) Schedule 3: Jointly consider routing and bandwidth allocation
Fig. 2: Three flow schedules generated by different strategies.

From this example, we can see that routing and bandwidth allocation together determine the CCT. We can obtain the minimum CCT only if we find the optimal solution for both routing and bandwidth allocation. However, in practice, we may need to schedule a large number of flows in a network with thousands or tens of thousands of nodes. In such cases, there exists a vast number of possible routing plans, and for each routing plan there are many ways to allocate bandwidth. Searching for the optimal solution in such a huge solution space is not easy.
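The bandwidth-allocation half of this example can be made concrete with a small numeric sketch. It assumes three flows that share a single bottleneck link; the volumes and the link capacity are hypothetical values chosen for illustration and are not the ones used in Fig. 1 or Fig. 2. The helper names are likewise illustrative.

```python
def completion_time(volumes, rates):
    """CCT: the finish time of the slowest flow (volume / allocated rate)."""
    return max(v / r for v, r in zip(volumes, rates))


def fair_share(volumes, capacity):
    """Split the link capacity equally among the flows."""
    return [capacity / len(volumes) for _ in volumes]


def volume_proportional_share(volumes, capacity):
    """Split the link capacity in proportion to each flow's volume."""
    total = sum(volumes)
    return [capacity * v / total for v in volumes]


# Hypothetical inputs: volumes in Gb, one shared link of 6 Gbps.
volumes = [10.0, 20.0, 30.0]
capacity = 6.0

cct_fair = completion_time(volumes, fair_share(volumes, capacity))
cct_prop = completion_time(volumes, volume_proportional_share(volumes, capacity))
print(cct_fair, cct_prop)
```

With these inputs, fair sharing gives every flow 2 Gbps, so the largest flow needs 15 seconds, whereas volume-proportional sharing lets all three flows finish together after 10 seconds. Routing is deliberately left out of this sketch.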

While several approaches have been proposed to schedule coflows [6, 5, 10, 11, 12, 13], they either do not jointly consider routing and bandwidth allocation [6, 5, 10, 11, 12] or perform routing based on a limited set of candidate paths [13]. In this paper, we focus on the coflow scheduling problem in which we jointly consider routing and bandwidth allocation for a given coflow. In summary, our main contributions are as follows:

  • We study the problem of optimal bandwidth allocation with pre-determined routes, formulate it as a convex optimization problem, and provide an analytical solution. By solving this problem, we essentially reduce the dimension of the coflow scheduling problem: For any routing plan, we can always optimally allocate bandwidth. (Section 4)

  • We formulate the Coflow Scheduling (CoS) problem as a Mixed Integer Non-linear Programming (NLMIP) problem incorporating both routing constraints and bandwidth allocation constraints. We then present a relaxation and an equivalent solvable convex optimization problem. (Section 5)

  • We propose an algorithm, called Coflow Routing and Bandwidth Allocation (CoRBA), that solves the CoS problem. With more practical consideration, we further propose CoRBA-fast, a simplified version of CoRBA. (Section 6)

  • We evaluate the performance of CoRBA and CoRBA-fast by comparing them with the state-of-the-art algorithms through both offline and online simulations. The simulation results show that CoRBA can achieve 40%-500% smaller CCT than the existing algorithms. The results also show that CoRBA-fast can be hundreds of times faster than CoRBA with up to 10% performance degradation, which makes it a good choice for practical use. (Section 7)

  • We further discuss the applicability of CoRBA-fast by comparing its running time with the transfer time of coflows in some typical MapReduce jobs. (Section 8)

2 Related Work

A significant amount of research has focused on the area of flow transmissions in data center networks. In this section, we discuss some of the works most relevant to our problems.

Coflow schedulers. The problem of coflow-aware flow scheduling has attracted significant attention [5, 6, 10, 11, 12, 13]. Orchestra [6] is a global control architecture that manages the transfer of a set of correlated flows. Chowdhury et al. [5] propose the concept of coflow, a network abstraction that describes the traffic pattern of prevalent data flows. Baraat [10] is a decentralized flow scheduler which groups the flows of a task and schedules them together with scheduling policies like FIFO-LM. Varys [11] is a coflow scheduler that addresses the inter-coflow scheduling problem, supports deadlines, and guarantees coflow completion time. Aalo [12] is a recently proposed coflow scheduler that schedules coflows without any prior knowledge and supports pipelining and dependencies in multi-stage DAGs. While the coflow schedulers introduced above focus on scheduling individual coflows or groups of coflows, they neglect routing, an important factor that impacts the coflow completion time.

RAPIER [13] is a recently proposed coflow-aware network scheduling framework that integrates both routing and bandwidth allocation. In RAPIER, scheduling a single coflow is formulated as a linear programming problem in which the route of each flow in the coflow is selected from a set of candidate paths given as input. In contrast to RAPIER, in this paper, we propose algorithms that consider the bandwidth availability of the whole network and route flows via the best paths instead of picking a path from a set of candidates. We show the superior performance of our algorithms by comparing them with the optimization algorithm used in RAPIER.

Flow schedulers. Much research has also been performed on reducing the average flow completion time [14, 15, 16, 17, 18, 19]. Rojas et al. [14] give a comprehensive survey of existing schemes for scheduling flows in data center networks. PDQ [16] is a flow scheduling protocol which utilizes explicit rate control to allocate bandwidth to flows and enables flow preemption. pFabric [17] is a datacenter transport design that decouples flow scheduling from rate control, in which flows are prioritized and switches implement a very simple priority-based scheduling/dropping mechanism. RepFlow [18] is a transport design that replicates each short flow. It transmits the replicated and original flows via different paths, which reduces the probability of experiencing long queueing delay and therefore decreases the flow completion time. PIAS [19] is an information-agnostic flow scheduling scheme that minimizes the FCT by mimicking the Shortest Job First strategy without any prior knowledge of incoming flows. While these existing schemes can reduce the FCT by using different strategies, they are not coflow-aware and therefore have a different optimization objective from our algorithms.

3 Problem Definition

In this paper, we consider scheduling a given coflow on a given data center network. We now introduce the input, output, and objective of this problem.

Input. The input contains a data center network and a coflow. The network is modeled as a graph $G = (V, E, B)$, where $V$ is the set of nodes, in which each server and router corresponds to one node; $E$ is the set of links, in which $e_{jk}$ represents the link between node $j$ and node $k$; and $B$ is the set of available bandwidth on links, in which $b_{jk}$ represents the available bandwidth of $e_{jk}$. Note that each link in $E$ is unique, i.e., the link between $j$ and $k$ is modeled as either $e_{jk}$ or $e_{kj}$.

The coflow is a set of flows. We denote it by $F$ and define each flow as $f_i = (s_i, d_i, v_i)$, where $s_i$/$d_i$ is the source/destination node of this flow and $v_i$ is the data volume, i.e., the total amount of data to be transferred. Like prior works [10, 11, 12, 13], we assume that the information of a coflow can be captured by upper layer applications [5] or by using existing prediction techniques [20].

Output. The output contains a set of routes, one for each flow in $F$, and a certain amount of bandwidth allocated to each route. We define the set of routes as $P$, in which $p_i$ is the route selected for flow $f_i$. We also define $a_i$ as the amount of bandwidth allocated to the route $p_i$.

Objective. Our objective is to minimize the CCT, that is, the completion time of the last finished flow. Let $t_i$ denote the completion time of flow $f_i$; then our objective is

$$\min \max_{f_i \in F} t_i \tag{1}$$

4 Optimal Bandwidth Allocation with Pre-determined Routes

We start from the problem of Optimal Bandwidth Allocation (OptBA) with pre-determined routes, in which the route of each flow has already been determined and we need to allocate bandwidth to these routes while minimizing the CCT. With the solution of this problem, we can optimally allocate bandwidth corresponding to any routing plan, which essentially reduces the dimension of the coflow scheduling problem.

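To model this problem, for each flow $f_i$ and each link $e_{jk}$, we define $\delta^i_{jk}$ as a binary constant which has the following value

$$\delta^i_{jk} = \begin{cases} 1, & \text{if the route of flow } f_i \text{ uses link } e_{jk} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

We formulate the OptBA problem as
OptBA
Objective:

$$\min \max_{f_i \in F} \frac{v_i}{a_i} \tag{3}$$

Subject to

$$\sum_{f_i \in F} \delta^i_{jk} \, a_i \le b_{jk}, \quad \forall e_{jk} \in E \tag{4}$$
$$a_i \ge 0, \quad \forall f_i \in F \tag{5}$$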

Remarks:

  • Constraints (4) are bandwidth availability constraints: for any link $e_{jk}$, the overall amount of bandwidth allocated to flows using this link cannot exceed the available bandwidth $b_{jk}$ of this link.

  • Constraints (5) are domain constraints ensuring that the bandwidth $a_i$ allocated to each flow is non-negative.

Convexity. We observe that the functions in constraints (4) and (5) are all affine in $a_i$ and thus convex. At the same time, the function $v_i / a_i$, where $v_i$ is the volume of flow $f_i$, is convex, because its second derivative, $2 v_i / a_i^3$, is non-negative when $a_i$ is larger than 0. Therefore, according to [21], the objective function, i.e., the pointwise maximum of the functions $v_i / a_i$, is also a convex function. As a result, the OptBA problem is a convex optimization problem.

4.1 Analytical Solution

While existing convex optimization algorithms can be used to solve the OptBA problem, we develop an analytical solution which is more efficient. Specifically, we define $a_i^{jk}$ as the amount of bandwidth allocated to flow $f_i$ on link $e_{jk}$ when we distribute the available bandwidth of link $e_{jk}$ to all flows that are using this link in proportion to their volumes, i.e.,

$$a_i^{jk} = \frac{v_i}{\sum_{f_m \in F} \delta^m_{jk} \, v_m} \, b_{jk} \tag{6}$$

Let $A^* = (a_1^*, \dots, a_n^*)$ be an $n$-vector, in which $n$ is the number of flows and $a_i^*$ is the minimum value of all existing $a_i^{jk}$, i.e.,

$$a_i^* = \min_{e_{jk} \in p_i} a_i^{jk} \tag{7}$$

in which $p_i$ is the set of links on the route of flow $f_i$. We then have that the vector $A^*$ is an optimal solution of the OptBA problem. To prove this, we first prove the following lemma.

Lemma 1.

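Assume that for the vector $A^*$, flow $f_m$ has the largest finish time, i.e., $t_m = \max_{f_i \in F} t_i$. Also assume that

$$a_m^* = a_m^{jk} \tag{8}$$

Then for every other flow $f_i$ using link $e_{jk}$, we have

$$a_i^* = a_i^{jk} \tag{9}$$
Proof.

To begin with, we assume that there exists a flow $f_q$ using link $e_{jk}$ but getting its allocated bandwidth when another link $e_{j'k'}$ is considered, i.e.,

$$a_q^* = a_q^{j'k'} < a_q^{jk} \tag{10}$$

Next, define $S_{jk}$ as $\sum_{f_i \in F} \delta^i_{jk} \, v_i$, the total volume of the flows using link $e_{jk}$, and we have

$$t_q = \frac{v_q}{a_q^*} > \frac{v_q}{a_q^{jk}} = \frac{S_{jk}}{b_{jk}} \tag{11}$$

On the other hand, based on Equation (8), we have

$$t_m = \frac{v_m}{a_m^{jk}} = \frac{S_{jk}}{b_{jk}} \tag{12}$$

Now using Inequality (11) and Equation (12), we get

$$t_q > t_m \tag{13}$$

which conflicts with the assumption that flow $f_m$ has the largest completion time. Therefore, there does not exist a flow satisfying Equation (10). This proves the lemma. ∎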

Based on Lemma 1, we have the following theorem.

Theorem 1.

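The vector $A^*$ defined in Equation (7) is an optimal solution of the OptBA problem.

Proof.

Assume that flow $f_m$ has the largest completion time and obtains its value $a_m^*$ when link $e_{jk}$ is considered. Then with Lemma 1, for every flow $f_i$ using $e_{jk}$, there exists

$$a_i^* = a_i^{jk} = \frac{v_i}{\sum_{f_l \in F} \delta^l_{jk} \, v_l} \, b_{jk} \tag{14}$$

Therefore, for each flow $f_i$ using $e_{jk}$, we have

$$\frac{v_i}{a_i^*} = \frac{v_m}{a_m^*} \tag{15}$$

Now assume that another vector $A' = (a'_1, \dots, a'_n)$ is the optimal solution, with a strictly smaller CCT, and flow $f_q$ has the largest completion time. We then have

$$\frac{v_q}{a'_q} < \frac{v_m}{a_m^*} \tag{16}$$

Naturally, we have $v_i / a'_i \le v_q / a'_q$ for every flow $f_i$, so Equations (15) and (16) give $a'_i > a_i^*$ for every flow using $e_{jk}$. On the other hand, from Equation (14), we can get

$$\sum_{f_i \in F} \delta^i_{jk} \, a_i^* = b_{jk} \tag{17}$$

Putting them together, we have

$$\sum_{f_i \in F} \delta^i_{jk} \, a'_i > \sum_{f_i \in F} \delta^i_{jk} \, a_i^* = b_{jk} \tag{18}$$

which is infeasible because it violates constraint (4). As a result, there does not exist a feasible solution better than $A^*$. Therefore, the vector $A^*$ defined in Equation (7) is an optimal solution of the OptBA problem. ∎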
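The analytical solution can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `optimal_bandwidth`, the link identifiers, and the example numbers are hypothetical. It also assumes that every flow has a non-empty route and that "proportionally" means in proportion to flow volume.

```python
def optimal_bandwidth(routes, volumes, capacity):
    """Per-flow bandwidth: on every link, share the available bandwidth in
    proportion to flow volume, then take the minimum share along the route."""
    # Total volume of the flows that use each link.
    load = {}
    for route, vol in zip(routes, volumes):
        for link in route:
            load[link] = load.get(link, 0.0) + vol
    # A flow is limited by its smallest proportional share over its route.
    return [min(capacity[link] * vol / load[link] for link in route)
            for route, vol in zip(routes, volumes)]


# Hypothetical example: two links, three flows with pre-determined routes.
routes = [["A"], ["A", "B"], ["B"]]
volumes = [6.0, 4.0, 4.0]
capacity = {"A": 10.0, "B": 4.0}

rates = optimal_bandwidth(routes, volumes, capacity)
cct = max(v / r for v, r in zip(volumes, rates))
print(rates, cct)
```

In this example the allocation is 6, 2, and 2 units of bandwidth, which gives a CCT of 2. Link B is the bottleneck: both flows crossing it finish at the same time and together use its full capacity, which is the situation described by Lemma 1.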

5 The Coflow Scheduling (CoS) Problem

In this section, we formulate the CoS problem as an NLMIP problem, present a non-linear relaxation, and further transform the relaxation into a solvable convex optimization problem.

We start with defining variable as an integer variable with three possible values (-1, 0, and 1):

(19)

Note that there exists either or , corresponding to the existence of either or . We then have
CoS
Variables

  • : bandwidth allocated to the path selected for .

  • : an integer variable defined by Equation (19).

  • : the completion time of flow .

Constants

  • : the number of flows in the set .

  • : the available bandwidth of the link .

  • : the set of neighbor nodes of node .

Objective:

(20)

Subject to

(21)
(22)
(23)
(24)
(25)
(26)

Remarks:

  • Constraints (21) ensure that data is sent out from the source of any flow through only one link, because a positive value of and a negative value of mean that data is going out from the source node via link or . Constraints (22) ensure that data is transferred into any destination node through only one link.

  • Constraints (23) ensure flow conservation, i.e., for any flow and any intermediate node , the number of links via which data is transferred into node should be equal to the number of links via which data is sent out. Note that constraints (21)-(23) together enforce that only one path is selected to transfer data for each flow.

  • Constraints (24) enforce that the overall bandwidth allocated to flows in coflow on any link does not exceed the total available bandwidth of that link.

  • Constraints (25) and (26) are domain constraints.

5.1 A Relaxation of the CoS Problem and An Equivalent Convex Optimization Problem

To solve the CoS problem, we first consider its relaxation in which the integer variable is relaxed to a real variable, i.e., changing the constraint (26) to

(27)

We name the relaxed problem as CoS-Relax.

Subsequently, we transform the CoS-Relax problem into an equivalent convex optimization problem. We start from defining variable as the CCT, i.e.,

The objective function (20) is transformed to

(28)

We further define variable as

(29)

By substituting into constraint (24), we have

(30)

Next, we define variable as

(31)

By substituting and into constraint (28), (30) and all other constraints, we get an equivalent problem of CoS-Relax, named CoS-Relax-Cvx, as shown below.
CoS-Relax-Cvx

(32)

Subject to

(33)
(34)
(35)
(36)
(37)
(38)

All constraints in the CoS-Relax-Cvx problem are affine, except the constraints (37) which are convex constraints, because the absolute value of is a convex function and the sum of convex functions is still convex. Therefore, the CoS-Relax-Cvx problem is a convex optimization problem which can be solved by using convex optimization algorithms [21].

6 The CoRBA Algorithm and Its Simplified Version: CoRBA-fast

To solve the CoS problem, we propose an algorithm, called Coflow Routing and Bandwidth Allocation (CoRBA).

The CoRBA algorithm has three phases: (I) Obtaining a solution of the CoS-Relax problem by solving the CoS-Relax-Cvx problem; (II) Determining an initial solution of the CoS problem based on the solution of the CoS-Relax problem; (III) Utilizing a local search procedure to further optimize the obtained initial solution. Algorithm 1 shows the pseudocode of CoRBA. We now introduce the details of each phase.
Phase I: Solve the relaxed problem. To begin with, CoRBA solves the CoS-Relax-Cvx problem by using convex algorithms [21]. Let the solution be , , and .

Subsequently, CoRBA calculates the solution of CoS-Relax, denoted by and , using Equations (29) and (31).

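function CoRBA(network, coflow)
Phase I:
   Solve CoS-Relax-Cvx;
   Calculate the solution of CoS-Relax using Equations (29) and (31);
Phase II:
   for each flow in the coflow do
      route of the flow ← max capacity path, with link capacities taken from the CoS-Relax solution;
      Set the routing variable of the flow to 1 on each link of this route and to 0 otherwise;
   end for
   Calculate the bandwidth allocated to each flow using Equation (40);
   CCT ← largest completion time among all flows;
Phase III:
   Update the available bandwidth of each link according to the routes and the allocated bandwidth;
   while true do
      slowest set ← all flows with the largest completion time;
      for each flow in the slowest set do
         full links ← links on the route of the flow that have no available bandwidth;
         Add the bandwidth of the flow back to the available bandwidth along the route of the flow;
         Set each link in full links as unavailable;
         route of the flow ← new max capacity path;
         Reset the routing variable of the flow according to the new route;
         Re-calculate the allocated bandwidth, the completion times, and CCT;
         if the completion time of the flow is reduced then break;
         else Reverse all changes made for the flow;
      end for
      if no completion time is reduced then break;
   end while
   return the routes, the allocated bandwidth, and CCT;
end function
Algorithm 1 The CoRBA Algorithm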

Phase II: Obtain an initial solution of the CoS problem. In this phase, the CoRBA algorithm obtains an initial solution of the CoS problem in two steps:

  • First, it determines the value of , i.e., obtaining a route for each flow. Specifically, for each flow , CoRBA uses as the capacity of and finds the max capacity path [22] (denoted by ) between the source and destination of this flow. CoRBA then sets as 1 for each link on that path and sets its value as 0 for all other links, i.e.,

    (39)
  • Second, CoRBA determines the value of , i.e., the amount of bandwidth allocated to each flow. Because the route of each flow is already determined, the CoS problem is naturally reduced to the OptBA problem which is already solved in section 4. Therefore, according to Equation (7), CoRBA calculates the value of as

    (40)
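The max capacity path used in the first step is the path whose smallest link capacity is as large as possible. One common way to compute it is a Dijkstra-style search that maximizes the bottleneck instead of minimizing the distance; the sketch below shows that idea. It is an illustration under assumptions, not the routine from [22]: the adjacency format, the helper `add_link`, and the example capacities are hypothetical, and the capacity values that CoRBA derives from the relaxed solution are not modeled here. Node names must be mutually comparable (e.g., all strings) because they are used to break ties in the heap.

```python
import heapq


def add_link(adj, u, v, cap):
    """Register an undirected link with the given capacity."""
    adj.setdefault(u, {})[v] = cap
    adj.setdefault(v, {})[u] = cap


def max_capacity_path(adj, src, dst):
    """Return (path, bottleneck) of a widest src-dst path, or (None, 0.0)."""
    best = {src: float("inf")}   # best bottleneck found so far per node
    prev = {}
    heap = [(-best[src], src)]   # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        if -neg_width < best[node]:
            continue             # stale heap entry
        for nxt, cap in adj.get(node, {}).items():
            width = min(-neg_width, cap)
            if width > best.get(nxt, 0.0):   # zero-capacity links are unusable
                best[nxt] = width
                prev[nxt] = node
                heapq.heappush(heap, (-width, nxt))
    if dst not in best:
        return None, 0.0
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], best[dst]


# Hypothetical network: two candidate paths between s and t.
adj = {}
add_link(adj, "s", "a", 5.0)
add_link(adj, "a", "t", 3.0)
add_link(adj, "s", "b", 4.0)
add_link(adj, "b", "t", 4.0)

path, width = max_capacity_path(adj, "s", "t")
print(path, width)
```

Here the path through `b` is selected because its bottleneck (4) is larger than the bottleneck of the path through `a` (3), even though the first hop towards `a` has more capacity.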

Phase III: Local search. In this phase, CoRBA utilizes a local search procedure to further improve the initial solution.

In this local search procedure, CoRBA runs in iterations. In each iteration, it starts with identifying all flows with the largest completion time and putting them into a set named . Subsequently, CoRBA iteratively considers each flow in the set . For each flow in , the CoRBA algorithm puts all links that are on the route of this flow and that do not have any available bandwidth (due to the initial bandwidth allocation to flows) into set ; it then sets all links in as unavailable and finds a new max capacity route for flow ; next, it temporarily changes the route of flow to the new route and re-calculates and CCT for the current routing plan. If the completion time of flow is reduced in the new solution, the CoRBA algorithm stops considering all other flows in the set and starts a new iteration; otherwise, it reverts all temporary changes that it has made and moves to the next flow in the set .

If CoRBA cannot improve the completion time of any flow in in some iteration, it stops and outputs the current solution as the final solution, as it cannot improve the CCT anymore.
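The local search loop can be sketched as follows. This is a simplified illustration, not the authors' code, and it makes several assumptions that should be read as such: links are undirected and keyed by the hypothetical helper `link`; the new route is searched on the original capacities with the saturated links removed (the residual-bandwidth bookkeeping of Algorithm 1 is omitted); bandwidth is re-allocated with the volume-proportional rule of Section 4; all capacities are positive and every flow has distinct endpoints; and `max_rounds` is a safeguard added only for this sketch.

```python
import heapq


def link(u, v):
    """Canonical key for the undirected link between u and v."""
    return (u, v) if u <= v else (v, u)


def allocate(routes, volumes, cap):
    """Volume-proportional share on every link, then the minimum along each route."""
    load = {}
    for route, vol in zip(routes, volumes):
        for e in route:
            load[e] = load.get(e, 0.0) + vol
    return [min(cap[e] * vol / load[e] for e in route)
            for route, vol in zip(routes, volumes)]


def widest_route(cap, src, dst, banned):
    """Max capacity route (list of links) from src to dst, skipping banned links."""
    adj = {}
    for (u, v), c in cap.items():
        if (u, v) not in banned and c > 0:
            adj.setdefault(u, []).append((v, c))
            adj.setdefault(v, []).append((u, c))
    best, prev = {src: float("inf")}, {}
    heap = [(-best[src], src)]
    while heap:
        neg_width, node = heapq.heappop(heap)
        if -neg_width < best[node]:
            continue  # stale heap entry
        for nxt, c in adj.get(node, []):
            width = min(-neg_width, c)
            if width > best.get(nxt, 0.0):
                best[nxt], prev[nxt] = width, node
                heapq.heappush(heap, (-width, nxt))
    if dst not in best:
        return None
    route, node = [], dst
    while node != src:
        route.append(link(prev[node], node))
        node = prev[node]
    return route[::-1]


def local_search(flows, routes, cap, max_rounds=100, eps=1e-9):
    """flows: list of (src, dst, volume); routes: one list of links per flow."""
    volumes = [f[2] for f in flows]
    rates = allocate(routes, volumes, cap)
    times = [v / r for v, r in zip(volumes, rates)]
    for _ in range(max_rounds):  # safeguard added for this sketch
        improved = False
        cct = max(times)
        for i in [j for j, t in enumerate(times) if t >= cct - eps]:
            used = {}
            for route, rate in zip(routes, rates):
                for e in route:
                    used[e] = used.get(e, 0.0) + rate
            # Links on the slowest flow's route with no bandwidth left.
            full = {e for e in routes[i] if used[e] >= cap[e] - eps}
            new_route = widest_route(cap, flows[i][0], flows[i][1], full)
            if new_route is None:
                continue
            trial = routes[:i] + [new_route] + routes[i + 1:]
            trial_rates = allocate(trial, volumes, cap)
            trial_times = [v / r for v, r in zip(volumes, trial_rates)]
            if trial_times[i] < times[i] - eps:  # keep the change, restart
                routes, rates, times = trial, trial_rates, trial_times
                improved = True
                break
        if not improved:
            break
    return routes, rates, max(times)


# Hypothetical example: two equal flows initially squeezed onto the same path.
cap = {link("s", "a"): 10.0, link("a", "t"): 10.0,
       link("s", "b"): 10.0, link("b", "t"): 10.0}
flows = [("s", "t", 10.0), ("s", "t", 10.0)]
via_a = [link("s", "a"), link("a", "t")]
routes, rates, cct = local_search(flows, [via_a, list(via_a)], cap)
print(routes, rates, cct)
```

In this example the search moves one flow to the unused path, so the two routes no longer overlap and the CCT drops from 2 to 1.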

6.1 CoRBA-fast: A Simplified Version of CoRBA

The CoRBA algorithm begins with solving the CoS-Relax-Cvx problem, which is a convex optimization problem. When the problem's scale is large enough, solving this problem can be time-consuming. On the other hand, we observe that the local search procedure in the CoRBA algorithm can be used to optimize any feasible coflow schedule. Based on these observations, we propose CoRBA-fast, a simplified version of CoRBA with lower time complexity.

CoRBA-fast has two phases: First, it obtains an initial solution; Second, it utilizes the local search procedure used in the CoRBA algorithm to optimize the initial solution.

To obtain an initial solution, for each flow , the CoRBA-fast algorithm sets the route of this flow as the shortest maximum capacity path (i.e., the shortest path with the maximum capacity). Subsequently, CoRBA-fast calculates and CCT based on the determined routes.
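One possible reading of "shortest maximum capacity path" is: among all paths that achieve the largest possible bottleneck capacity, pick one with the fewest hops. The sketch below implements that reading in two steps, and the interpretation itself is an assumption. It first computes the best bottleneck value, then runs a breadth-first search restricted to links whose capacity is at least that value. The function names, the adjacency format, and the example network are hypothetical.

```python
import heapq
from collections import deque


def add_link(adj, u, v, cap):
    """Register an undirected link with the given capacity."""
    adj.setdefault(u, {})[v] = cap
    adj.setdefault(v, {})[u] = cap


def max_bottleneck(adj, src, dst):
    """Largest bottleneck capacity over all src-dst paths (0.0 if unreachable)."""
    best = {src: float("inf")}
    heap = [(-best[src], src)]
    while heap:
        neg_width, node = heapq.heappop(heap)
        if -neg_width < best[node]:
            continue  # stale heap entry
        for nxt, cap in adj.get(node, {}).items():
            width = min(-neg_width, cap)
            if width > best.get(nxt, 0.0):
                best[nxt] = width
                heapq.heappush(heap, (-width, nxt))
    return best.get(dst, 0.0)


def shortest_max_capacity_path(adj, src, dst):
    """Fewest-hop path among the paths that achieve the maximum bottleneck."""
    width = max_bottleneck(adj, src, dst)
    if width <= 0.0:
        return None
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt, cap in adj.get(node, {}).items():
            # Only links that can carry the maximum bottleneck are allowed.
            if cap >= width and nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


# Hypothetical network: a narrow direct link, a 2-hop and a 3-hop wide path.
adj = {}
add_link(adj, "s", "t", 2.0)
add_link(adj, "s", "a", 5.0)
add_link(adj, "a", "t", 5.0)
add_link(adj, "s", "b", 5.0)
add_link(adj, "b", "c", 5.0)
add_link(adj, "c", "t", 5.0)

print(shortest_max_capacity_path(adj, "s", "t"))
```

The direct link is the shortest path but has a bottleneck of only 2, so it is skipped; of the two paths with bottleneck 5, the two-hop path through `a` is returned.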

7 Performance Evaluation

7.1 Performance Evaluation through Offline Simulations

In this section, we evaluate CoRBA and CoRBA-fast through offline simulations.

7.1.1 Simulation Setup

In a single run of the offline simulation, the proposed algorithms are called to schedule a random coflow on a data center network.
Data center network. For the network, we use a modified FatTree [23] architecture. A $k$-ary FatTree network has $k$ pods, where each pod has $k/2$ Top-of-Rack (ToR) switches and $k/2$ aggregation switches. $(k/2)^2$ core switches are used to connect the aggregation switches, and each ToR switch connects a set of $k/2$ hosts. To better evaluate our algorithms, we increase the oversubscription ratio and therefore introduce more competition for bandwidth at the aggregation and core levels. To achieve this goal, we multiply the number of hosts in a rack by a factor $\alpha$. By doing so, a $k$-ary modified FatTree now contains $\alpha k^3 / 4$ hosts. In our simulations, we set $\alpha$ as 2 and set each link's capacity as 10 Gbps.
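The size of such a topology follows directly from the description above. The short sketch below counts the nodes of a modified FatTree; the function name and the `host_factor` parameter (the multiplier applied to the number of hosts per rack) are names chosen for this illustration.

```python
def modified_fattree_size(k, host_factor=1):
    """Node counts of a k-ary FatTree whose racks hold host_factor * k/2 hosts."""
    if k % 2 != 0:
        raise ValueError("k must be even")
    half = k // 2
    counts = {
        "core": half * half,                     # (k/2)^2 core switches
        "aggregation": k * half,                 # k pods, k/2 each
        "tor": k * half,                         # k pods, k/2 each
        "hosts": k * half * half * host_factor,  # k/2 hosts per ToR, scaled
    }
    counts["total"] = sum(counts.values())
    return counts


for k in (4, 10, 20):
    print(k, modified_fattree_size(k, host_factor=2)["total"])
```

With a host factor of 2, this yields 52, 625, and 4500 nodes for 4, 10, and 20 pods, which matches the network sizes quoted in the evaluation.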

Noise flows. We introduce noise flows to simulate the complex traffic conditions in a real data center. Specifically, in each simulation, a set of noise flows is generated. The source and destination of each noise flow are randomly selected and the duration follows a uniform distribution in . We randomly select a shortest path as its route and allocate a random amount of bandwidth along that route.
Coflow. We randomly generate a coflow with flows. For each flow, we randomly select two hosts as its source and destination. We further set the maximum possible volume of a flow (denoted by ) as 1000 Gb and determine the volume of a flow (i.e., ) by using a uniform distribution in . In our simulations, we set as 0.7.

Each data point in our simulation results is an average of 20 simulations performed on an Intel 2.5 GHz processor.

(a) Coflow Completion Time
(b) Allocated bandwidth
(c) Average length of flow route
(d) Running time
Fig. 3: Performance of CoRBA and CoRBA-fast when scheduling different numbers of flows in a FatTree network with 1617 nodes.
(a) Coflow Completion Time
(b) Allocated bandwidth
(c) Average length of flow route
(d) Running time
Fig. 4: Performance of CoRBA and CoRBA-fast when scheduling a coflow with 50 flows in FatTree networks with different number of nodes.

7.1.2 Evaluation Metrics

We use four evaluation metrics.

Coflow Completion Time. This metric is the objective of the CoS problem. Hence, it is the most important metric.

Allocated bandwidth. This metric is the overall bandwidth allocated to the coflow, which equals the sum of the bandwidth allocated to each of its flows. It demonstrates how an algorithm performs in terms of bandwidth allocation.

Average length of flow route. This metric is the average number of hops on the routes of the flows in the coflow. It gives a sense of how an algorithm performs in terms of routing.

Running time. Running time of an algorithm is also important. It gives a sense of the scalability of that algorithm.

7.1.3 Comparison Algorithms

As a comparison algorithm of CoRBA, we use a modified version of the MinimizeCCT algorithm used in RAPIER [13]. Given a coflow, MinimizeCCT selects a route from a set of candidates for each flow and allocates bandwidth to these flows while minimizing the CCT. Although MinimizeCCT looks similar to the CoRBA algorithm, it does not perform true routing for the coflow.

We modify MinimizeCCT to generate paths as candidates for each flow before scheduling the coflow. We name this modified version as MinCCT. We further propose three variants of MinCCT in which the potential paths are (i) shortest paths, (ii) maximum capacity paths, and (iii) shortest maximum capacity paths. We denote them by “MinCCT-S”, “MinCCT-M”, and “MinCCT-SM” respectively. In our simulations, we set the value of as 5.

7.1.4 Evaluation Results of CoRBA

Performance with Increasing Size of the Coflow. In this simulation, we study how CoRBA and CoRBA-fast perform as the number of flows in the coflow increases from 10 to 100. We use a 16-ary modified FatTree with , which contains 1617 nodes.

Fig. 3(a) shows the CCT generated by each algorithm. While the CCT generally increases along with the expansion of the coflow, CoRBA generates the smallest CCT. When the coflow contains 100 flows, the CCT of CoRBA is 300% smaller than that of MinCCT-M and MinCCT-S, and 40% smaller than that of MinCCT-SM. Meanwhile, CoRBA-fast generates almost the same CCT compared to CoRBA.

Fig. 3(b) shows the total bandwidth allocated to the coflow. As can be seen, CoRBA allocates about 110%-500% more bandwidth than the MinCCT algorithms but about 25% less bandwidth than CoRBA-fast. Fig. 3(c) shows the average length of flow routes. We observe that CoRBA and MinCCT-M generate the longest flow routes, followed by CoRBA-fast, MinCCT-SM, and MinCCT-S.

Putting Figs. 3(a), 3(b), and 3(c) together, we can see that when the size of the coflow increases, CoRBA and CoRBA-fast are able to allocate more bandwidth to the coflow and thereby keep the growth of the CCT relatively flat, as observed in Fig. 3(a). We attribute this to their ability to route based on the bandwidth availability of the whole network. When the coflow expands, CoRBA and CoRBA-fast are able to route flows via different paths and utilize more bandwidth. In contrast, the MinCCT algorithms are restricted by the limited number of candidate paths and cannot sufficiently utilize the available resources in the network, which leads to the rapid increase of the CCT.

Finally, Fig. 3(d) shows the algorithm running time. When the coflow contains 100 flows, the running time of CoRBA is about 1100 seconds, while that of CoRBA-fast, MinCCT-S, MinCCT-M, and MinCCT-SM is 5 seconds, 70 seconds, 220 seconds, and 80 seconds respectively.

(a) Coflow Completion Time
(b) Ratio CCT to CoRBA
(c) Running time
Fig. 5: Performance of CoRBA and CoRBA-fast in online simulations with increasing number of flows in the coflow.

Performance with Increasing Size of the Network. In this simulation, we demonstrate how our algorithms perform as the number of nodes in the FatTree network increases from 52 to 4500, i.e., the number of pods increases from 4 to 20. We fix the number of flows in the coflow at 50.

Fig. 4(a) shows that the CCT of CoRBA is similar to that of CoRBA-fast, 30% smaller than that of MinCCT-SM, and 200% smaller than that of MinCCT-S and MinCCT-M. Fig. 4(b) shows that CoRBA allocates 100% more bandwidth than MinCCT-SM and 300% more bandwidth than MinCCT-M and MinCCT-S, but a similar amount to CoRBA-fast.

We observe that the bandwidth allocated by CoRBA increases dramatically when the network just starts expanding and grows more smoothly thereafter. We attribute this to the limit on how much bandwidth the algorithm can find to utilize. When the network is small, the number of good paths (with large available bandwidth) is also small. Consequently, CoRBA may schedule some flows to use paths with less bandwidth, which limits the CCT. At this stage, an expansion of the network generates more good paths and therefore CoRBA can allocate more bandwidth. However, as the expansion continues, the number of good paths becomes more than enough and CoRBA is able to schedule most of the flows on these paths. At this stage, the quality of paths barely improves. As a result, the increase of the total allocated bandwidth becomes small. This trend in turn drives the trend of the CCT, which decreases steeply at the beginning but becomes relatively flat thereafter, as shown in Fig. 4(a).

Fig. 4(c) shows the average length of flow routes. MinCCT-M and CoRBA have the longest route length, followed by CoRBA-fast, MinCCT-S, and MinCCT-SM. Finally, Fig. 4(d) shows the algorithm running time. When the network contains 4500 nodes, the running time of CoRBA is about 2500 seconds, while that of CoRBA-fast, MinCCT-S, MinCCT-M, and MinCCT-SM is 10 seconds, 140 seconds, 600 seconds, and 160 seconds respectively.

Summary. In offline simulations, we examined the performance of CoRBA and CoRBA-fast when scheduling a single coflow. Benefiting from the ability to find paths with less overlap and to allocate more bandwidth, CoRBA and CoRBA-fast outperform MinCCT-SM by a factor of 1.3-1.5 and outperform MinCCT-S and MinCCT-M by a factor of 3-5. Moreover, CoRBA-fast performs similarly to CoRBA but utilizes the available bandwidth less efficiently.

(a) Coflow Completion Time
(b) Ratio CCT to CoRBA
(c) Running time
Fig. 6: Performance of CoRBA and CoRBA-fast in online simulations with increasing arriving rate of noise flows.

7.2 Performance Evaluation through Online Simulations

In this section, we examine long-term performance of the proposed algorithms through online simulations in which we simulate the execution of a flow scheduler using the state-of-the-art scheduling policy.

7.2.1 Simulation Setup

A simulation begins with a 10-ary (625 nodes) modified FatTree network without any flows. Subsequently, coflows start to arrive at a rate following a Poisson distribution with , and stop arriving at 1800 seconds. The whole simulation finishes once all coflows are scheduled. Each data point in the results is an average of 10 simulations performed on an Intel 2.5 GHz processor.
Scheduling Policy. We use the scheduling policy proposed in RAPIER [13] to schedule arrived coflows. When a coflow arrives, the scheduling algorithm is called to calculate the CCT of every existing coflow based on its current residual volume. The coflows are then scheduled in ascending order of their CCT: each time a coflow is scheduled, the CCTs of all remaining coflows are recalculated based on the updated bandwidth availability, and the selection procedure is repeated until all existing coflows have been addressed. An existing coflow may be preempted by a newly arrived coflow. If scheduling a coflow fails, the coflow is added to a waiting queue. A coflow that has been waiting for more than 100 seconds gets the privilege of being scheduled regardless of its calculated CCT. Finally, once a coflow finishes, the released bandwidth is distributed among the existing coflows to speed up flow completion.
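The following Python sketch illustrates one scheduling round of the policy described above. It is only a minimal illustration under stated assumptions, not the RAPIER or CoRBA implementation: the names Coflow, schedule_round, estimate_cct, and try_schedule are hypothetical, the two callbacks stand in for whichever scheduling algorithm is being evaluated, and the choice to order long-waiting coflows among themselves by CCT is an assumption. Redistribution of bandwidth released by finished coflows is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

WAIT_LIMIT = 100.0  # seconds of waiting after which a coflow gains priority (from the text)


@dataclass(eq=False)  # identity-based equality, so list.remove() targets the exact object
class Coflow:
    name: str
    waiting_since: Optional[float] = None  # time of the first failed scheduling attempt


def schedule_round(
    coflows: List[Coflow],
    now: float,
    estimate_cct: Callable[[Coflow], float],  # hypothetical: CCT under current availability
    try_schedule: Callable[[Coflow], bool],   # hypothetical: reserve routes and bandwidth
) -> Tuple[List[Coflow], List[Coflow]]:
    """Run one round, triggered by a coflow arrival; return (scheduled, waiting)."""
    pending = list(coflows)
    scheduled: List[Coflow] = []
    waiting: List[Coflow] = []
    while pending:
        # Coflows that have waited longer than WAIT_LIMIT are considered first,
        # regardless of how their CCT compares with the other pending coflows.
        starved = [
            c for c in pending
            if c.waiting_since is not None and now - c.waiting_since > WAIT_LIMIT
        ]
        candidates = starved if starved else pending
        # CCTs are re-estimated on every iteration, because each successful
        # reservation changes the bandwidth that remains available.
        nxt = min(candidates, key=estimate_cct)
        pending.remove(nxt)
        if try_schedule(nxt):
            nxt.waiting_since = None
            scheduled.append(nxt)
        else:
            if nxt.waiting_since is None:
                nxt.waiting_since = now
            waiting.append(nxt)
    return scheduled, waiting
```

In this sketch, preemption is implicit: every arrival re-runs the round over all existing coflows, so a newly arrived coflow with a smaller estimated CCT is served before coflows that were scheduled earlier.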
Noise flows. Noise flows arrive following a Poisson distribution, and their durations follow a uniform distribution in . When a noise flow starts, we randomly select a shortest path as its route and allocate a random amount of bandwidth to it.
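As a rough illustration of how such noise traffic could be generated, the sketch below draws Poisson arrivals (exponentially distributed inter-arrival times) and uniformly distributed durations. The function name generate_noise_flows and all of its parameters are hypothetical, and every numeric value has to be supplied by the caller. Drawing the bandwidth uniformly from [0, max_bw] is also an assumption, since the text only says the amount is random, and a single list of candidate shortest paths is used for simplicity.

```python
import random
from typing import Dict, List, Sequence, Tuple


def generate_noise_flows(
    rate: float,                      # mean arrivals per second (assumed parameter)
    horizon: float,                   # stop generating arrivals at this time (seconds)
    dur_range: Tuple[float, float],   # (min, max) duration in seconds (assumed parameter)
    shortest_paths: Sequence[tuple],  # candidate shortest paths for the flow
    max_bw: float,                    # upper bound on the random bandwidth (assumption)
    seed: int = 0,
) -> List[Dict]:
    rng = random.Random(seed)
    flows: List[Dict] = []
    # A Poisson arrival process has exponentially distributed inter-arrival times.
    t = rng.expovariate(rate)
    while t < horizon:
        flows.append({
            "start": t,
            "duration": rng.uniform(*dur_range),
            "route": rng.choice(shortest_paths),    # random shortest path
            "bandwidth": rng.uniform(0.0, max_bw),  # random amount of bandwidth
        })
        t += rng.expovariate(rate)
    return flows
```

Using a seeded random.Random instance keeps repeated runs reproducible, which is convenient when results are averaged over several simulations.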

7.2.2 Evaluation Metrics

We focus on two metrics: (i) the average CCT of coflows and (ii) the algorithm running time.

7.2.3 Evaluation Results of CoRBA

Performance with Increasing Size of the Coflow. We first study how our algorithms perform when the number of flows in the coflow increases. We set the arrival rate of noise flows to follow a Poisson distribution with .

Fig. 5(a) shows the average CCT generated by each algorithm and Fig. 5(b) shows the ratio of the average CCT of the other algorithms to that of CoRBA. We observe that the average CCT generated by CoRBA-fast, MinCCT-SM, MinCCT-S, and MinCCT-M is about 10%, 40%, 60%-80%, and 100%-650% larger than that generated by CoRBA, respectively. Fig. 5(c) shows the running time of each algorithm. While CoRBA has the longest running time, CoRBA-fast has the shortest: it is 4-8 times faster than the three MinCCT algorithms and about 40 times faster than CoRBA.
Performance with Increasing Arrival Rate of Noise Flows. We further study how our algorithms perform when the arrival rate of noise flows increases. We set each coflow to contain 30 flows.

Fig. 6(a) shows the average CCT generated by each algorithm and Fig. 6(b) shows the ratio of the average CCT of the other algorithms to that of CoRBA. We can see that the average CCT generated by CoRBA-fast, MinCCT-SM, MinCCT-S, and MinCCT-M is about 10%, 40%, 60%-120%, and 400%-500% larger than that generated by CoRBA, respectively. Fig. 6(c) shows the running time of each algorithm. Again, CoRBA has the longest running time and CoRBA-fast has the shortest: it is 4-8 times faster than the three MinCCT algorithms and around 30 times faster than CoRBA.

Summary. In online simulations, we demonstrate that CoRBA and CoRBA-fast maintain their advantage over the MinCCT algorithms: they are at least 40% (and up to 500%) better than the comparison algorithms. We attribute this to their better performance when scheduling each individual coflow. Meanwhile, CoRBA-fast is much faster than CoRBA, but CoRBA performs about 10% better than CoRBA-fast.

8 Discussion

In simulations, CoRBA-fast has been shown to be tens of times faster than the other algorithms, which makes it a good choice in practice. In this section, we further discuss its applicability by comparing its running time with the transfer time of flows in some typical MapReduce jobs.

The shuffle stage in MapReduce jobs generates coflows. In shuffle-heavy jobs, like tera-sort and ranked-inverted-index, the shuffle volume can be more than 200 GB [24, 25]. Consider the simulation case of scheduling a coflow with 100 flows in a network with 1617 nodes: the running time of CoRBA-fast is about 6 seconds and the allocated bandwidth is around 60 Gb/s. In this case, if the coflow contains 200 GB of data, its transfer time is about 27 seconds, which is about 4.5 times the running time of CoRBA-fast. In contrast, the MinCCT algorithms allocate 10-20 Gb/s of bandwidth, and the coflow transfer time is 80-160 seconds. We can see that CoRBA-fast significantly reduces the coflow transfer time and that its running time is generally small compared to the reduction in transfer time. When examining other simulation cases, we get similar results. Therefore, we believe that CoRBA-fast is very applicable in practice.
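The transfer times quoted above follow from dividing the shuffle volume by the allocated bandwidth. The short Python sketch below reproduces that arithmetic, assuming decimal units (1 GB = 8 Gb); the helper name transfer_time_s and the constant names are illustrative only.

```python
def transfer_time_s(volume_gbytes: float, bandwidth_gbps: float) -> float:
    """Seconds needed to move volume_gbytes (GB) at bandwidth_gbps (Gb/s)."""
    return volume_gbytes * 8.0 / bandwidth_gbps  # assumes 1 GB = 8 Gb


SHUFFLE_GB = 200.0      # shuffle volume quoted in the text
CORBA_FAST_RUN_S = 6.0  # CoRBA-fast running time quoted in the text

t_fast = transfer_time_s(SHUFFLE_GB, 60.0)       # CoRBA-fast: about 60 Gb/s
t_min_best = transfer_time_s(SHUFFLE_GB, 20.0)   # MinCCT, upper end: 20 Gb/s
t_min_worst = transfer_time_s(SHUFFLE_GB, 10.0)  # MinCCT, lower end: 10 Gb/s

print(f"CoRBA-fast transfer time: {t_fast:.1f} s")
print(f"MinCCT transfer time: {t_min_best:.0f}-{t_min_worst:.0f} s")
print(f"Transfer time / running time: {t_fast / CORBA_FAST_RUN_S:.1f}")
```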

9 Conclusion

In this paper, we focused on how to schedule a coflow, that is, how to route it and allocate bandwidth to it, with the goal of minimizing its completion time. We first studied the problem of optimal bandwidth allocation with pre-determined routes and provided an analytical solution. Subsequently, we formulated the coflow scheduling problem as a NLMIP problem, presented a relaxation together with an equivalent convex optimization problem, and proposed an algorithm called CoRBA and a simplified version called CoRBA-fast to solve the problem.

We evaluated the CoRBA and CoRBA-fast algorithms by comparing them with the state-of-the-art algorithms in both offline and online simulations. In offline simulations, we demonstrated that when scheduling a single coflow, the comparison algorithms produce a CCT that is 30%-400% larger than that of CoRBA and CoRBA-fast. In online simulations, we simulated the execution of a flow scheduler and used the state-of-the-art scheduling policy. The results show that CoRBA and CoRBA-fast are at least 40% and up to 500% better than the comparison algorithms. The results also show that CoRBA-fast can be tens of times faster than all other algorithms at the cost of about 10% performance degradation compared to CoRBA, which makes CoRBA-fast very applicable in practice.

References

  • [1] J. Lin and D. Ryaboy, “Scaling big data mining infrastructure: the twitter experience,” ACM SIGKDD Explorations Newsletter, vol. 14, no. 2, pp. 6–19, 2013.
  • [2] A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Bannon, S. Boving, G. Desai, B. Felderman, P. Germano et al., “Jupiter rising: A decade of clos topologies and centralized control in google’s datacenter network,” in Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication.   ACM, 2015, pp. 183–197.
  • [3] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
  • [4] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets.” HotCloud, vol. 10, pp. 10–10, 2010.
  • [5] M. Chowdhury and I. Stoica, “Coflow: A networking abstraction for cluster applications,” in Proceedings of the 11th ACM Workshop on Hot Topics in Networks.   ACM, 2012, pp. 31–36.
  • [6] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, “Managing data transfers in computer clusters with orchestra,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 98–109, 2011.
  • [7] L. Popa, G. Kumar, M. Chowdhury, A. Krishnamurthy, S. Ratnasamy, and I. Stoica, “Faircloud: sharing the network in cloud computing,” in Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication.   ACM, 2012, pp. 187–198.
  • [8] L. Popa, P. Yalagandula, S. Banerjee, J. C. Mogul, Y. Turner, and J. R. Santos, “Elasticswitch: practical work-conserving bandwidth guarantees for cloud computing,” in Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM.   ACM, 2013, pp. 351–362.
  • [9] H. Ballani, K. Jang, T. Karagiannis, C. Kim, D. Gunawardena, and G. O’Shea, “Chatty tenants and the cloud network sharing problem.” in NSDI, 2013, pp. 171–184.
  • [10] F. R. Dogar, T. Karagiannis, H. Ballani, and A. Rowstron, “Decentralized task-aware scheduling for data center networks,” in ACM SIGCOMM Computer Communication Review, vol. 44, no. 4.   ACM, 2014, pp. 431–442.
  • [11] M. Chowdhury, Y. Zhong, and I. Stoica, “Efficient coflow scheduling with varys,” in Proceedings of the 2014 ACM conference on SIGCOMM.   ACM, 2014, pp. 443–454.
  • [12] M. Chowdhury and I. Stoica, “Efficient coflow scheduling without prior knowledge,” in Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication.   ACM, 2015, pp. 393–406.
  • [13] Y. Zhao, K. Chen, W. Bai, C. Tian, Y. Geng, Y. Zhang, D. Li, and S. Wang, “Rapier: Integrating routing and scheduling for coflow-aware data center networks,” in Proc. IEEE INFOCOM, 2015.
  • [14] R. Rojas-Cessa, Y. Kaymak, and Z. Dong, “Schemes for fast transmission of flows in data center networks,” Communications Surveys & Tutorials, IEEE, vol. 17, no. 3, pp. 1391–1422, 2015.
  • [15] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowstron, “Better never than late: Meeting deadlines in datacenter networks,” in ACM SIGCOMM Computer Communication Review, vol. 41, no. 4.   ACM, 2011, pp. 50–61.
  • [16] C.-Y. Hong, M. Caesar, and P. Godfrey, “Finishing flows quickly with preemptive scheduling,” ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, pp. 127–138, 2012.
  • [17] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker, “pfabric: Minimal near-optimal datacenter transport,” ACM SIGCOMM Computer Communication Review, vol. 43, no. 4, pp. 435–446, 2013.
  • [18] H. Xu and B. Li, “Repflow: Minimizing flow completion times with replicated flows in data centers,” INFOCOM, 2014 Proceedings IEEE, pp. 1581–1589, 2014.
  • [19] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and H. Wang, “Information-agnostic flow scheduling for commodity data centers,” in 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), 2015, pp. 455–468.
  • [20] Y. Peng, K. Chen, G. Wang, W. Bai, Z. Ma, and L. Gu, “Hadoopwatch: A first step towards comprehensive traffic forecasting in cloud computing,” in INFOCOM, 2014 Proceedings IEEE.   IEEE, 2014, pp. 19–27.
  • [21] S. Boyd and L. Vandenberghe, Convex optimization.   Cambridge university press, 2004.
  • [22] A. P. Punnen, “A linear time algorithm for the maximum capacity path problem,” European Journal of Operational Research, vol. 53, no. 3, pp. 402–404, 1991.
  • [23] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” in ACM SIGCOMM Computer Communication Review, vol. 38, no. 4.   ACM, 2008, pp. 63–74.
  • [24] Y. Guo, J. Rao, and X. Zhou, “ishuffle: Improving hadoop performance with shuffle-on-write.” in ICAC.   Citeseer, 2013, pp. 107–117.
  • [25] M. Li, L. Zeng, S. Meng, J. Tan, L. Zhang, A. R. Butt, and N. Fuller, “Mronline: Mapreduce online performance tuning,” in Proceedings of the 23rd international symposium on High-performance parallel and distributed computing.   ACM, 2014, pp. 165–176.