With the advent of software-defined networking (SDN) and the OpenFlow switch protocol, routing and scheduling in modern data center networks are increasingly performed at the level of flows. A flow is a set of application traffic between two endpoints that receives the same forwarding decisions. As a consequence of the shift towards centralized flow-based control, efficient algorithms for scheduling and routing of flows and their variants have gained prominent importance [chowdhury+kpyy:coflow, QiuSteinZhong2015, sdn, sdn2, sdnbook].
It is common to model the datacenter network carrying the flows as one non-blocking switch (see Figure 1) interconnecting all machines [alizadeh, ballani, kang, QiuSteinZhong2015]. This simple model is attractive because of advances in full-bisection bandwidth topologies [vl2, Niran]. In this model, every input (ingress) port is connected to every output (egress) port. Bandwidth limits are at the ports, and the interconnections are assumed to have unlimited bandwidth. We model the datacenter network as a general bipartite graph (which includes the full-bisection switch as a special case) with capacities at each vertex (port).
In the context of scheduling and client-server applications, response time (also known as flow time or sojourn time) is a very natural and important objective. Indeed, response time is directly related to the quality of service experienced by clients [bansal_thesis, whyresponsetime]. In the job scheduling literature, metrics related to response times have been extensively studied in diverse frameworks, including approximation algorithms [Bansal2015flowtime, batra+gk:flow, chekuri+kz:schedule, feige+kl:flow, kellerer+tw:schedule], competitive analysis [bansal+chan, kalyanasundaram+p:schedule_nonclairvoyant, Mastrolilli2003], and queuing-theoretic analysis [biersack+su:schedule, grosof+sh:srpt]. For flow scheduling, however, response time optimization is not as well-understood as completion time optimization; to the best of our knowledge, there is no prior work on approximation algorithms for flow scheduling to optimize response time metrics. In this paper, we study the problem of scheduling flows on a switch network to minimize average response time and maximum response time.
We present the first approximation algorithms for flow scheduling on a bipartite switch network with the objective of minimizing response time metrics.
We present an $O(1 + (\log N)/k)$-approximation algorithm, running in polynomial time, for scheduling unit flows under the average response time metric, for any given positive integer $k$, where $N$ is the number of flows; that is, our algorithm achieves an average response time of $O(1 + (\log N)/k)$ times the optimal, assuming it is allowed port capacity $O(k)$ times that of the original. Our results on average response time appear in Section 3.
We show that it is NP-hard to attain an approximation factor smaller than $4/3$ for the maximum response time metric. We next present a polynomial-time algorithm that achieves optimal maximum response time, assuming it is allowed port capacity that is at most $2d_{\max} - 1$ more than that of the optimal, where $d_{\max}$ is the maximum demand of any flow request. For the special case of unit demands, note that this is best possible, given the hardness result. Our results on maximum response time appear in Section 4.
Both of our algorithms are based on rounding a suitable linear programming relaxation of the associated problem. The algorithm for average response time uses the iterative rounding paradigm, along the lines of previous work on scheduling jobs on unrelated machines [Bansal2015flowtime]. A challenge we need to address is that a "job" in flow scheduling uses two different capacitated "resources" (ports) simultaneously. We are able to overcome this challenge in the presence of resource augmentation. An important open problem is to determine whether polylogarithmic or better approximations for average response time are achievable without resource augmentation.
For maximum response time, our hardness reduction is from the classic Timetable problem studied in [timetable]. Our approximation algorithm is achieved by applying a rounding theorem of Karp et al. [Karp87globalwire], and in fact extends to the more general problem in which we must meet distinct deadlines for individual flows, as opposed to a uniform maximum response time.
Both the algorithms above are offline approximations. In Section 5, we initiate a study of online algorithms for response time metrics.
We present preliminary theoretical results including a resource-augmented constant-factor competitive algorithm for maximum response time, which builds on our offline algorithm. We next present experimental evaluations of natural online heuristics for average and maximum response time metrics.
Our work leaves some intriguing open problems and several directions for future research, which are highlighted in Section 6.
1.2 Related Work
While there is considerable work on scheduling flows on non-blocking switch networks as well as more general topologies, no offline approximation algorithms or online competitive algorithms are known for response time metrics. There is extensive literature on scheduling matchings over high-speed crossbar switches; these studies largely adopt a queuing theoretic framework (e.g., see [giaccone+ps:switch, gong+tlyx:switch, shah+s:schedule]). In [ChowdhuryZhongStoica2014], Chowdhury et al. present effective heuristics for scheduling generalizations of flows, called co-flows, without release times on a non-blocking switch network. Approximation algorithms for average completion time of co-flows on a non-blocking switch are given in [Ahmadi2017, chowdhury+kpyy:coflow, KM-coflow-SPAA16, shafiee+g:coflow, DBLP:conf/spaa/QiuSZ15]. Scheduling over general network topologies is studied in [Jahanjou_spaa, rapier], including approximation algorithms for average completion time.
For response-time minimization, all previous approximation algorithms are in the setting of machine scheduling. In what follows, we use the three-field notation $\alpha|\beta|\gamma$, also known as Graham's classification [Graham1979287]. In this notation, the first field, $\alpha$, specifies the machine configuration. Common cases are $1$ for a single machine, $P$ for identical parallel machines, $Q$ for related parallel machines, and $R$ for unrelated machines. The second field, $\beta$, specifies job properties. For instance, $r_j$ denotes the presence of release times and $\mathrm{pmtn}$ denotes preemptive schedules. The third field, $\gamma$, specifies the objective function to be minimized.
Average response time
The single-machine preemptive case with release times, $1|r_j, \mathrm{pmtn}|\sum F_j$, is solvable in polynomial time using the shortest remaining processing time (SRPT) rule [baker:schedule]. Without preemption, $1||\sum F_j$ is solvable using the shortest processing time (SPT) rule; but $1|r_j|\sum F_j$ is hard to approximate within a factor of $n^{1/2 - \epsilon}$ for all $\epsilon > 0$ [kellerer+tw:schedule]. For two machines or more, $P2|r_j, \mathrm{pmtn}|\sum F_j$ is NP-hard [du+ly:schedule]. Leonardi and Raz show that SRPT is an $O(\log \min(n/m, P))$-competitive algorithm for the problem $P|r_j, \mathrm{pmtn}|\sum F_j$, where $P$ is the ratio between the largest and the smallest job processing times [leonardi+r:schedule]. From a technical standpoint, a related paper for our work is that of Garg and Kumar, who consider the problem of minimizing total response time on related machines ($Q|r_j, \mathrm{pmtn}|\sum F_j$) and present an offline $O(\log P)$-approximation algorithm and an online $O(\log^2 P)$-competitive algorithm [Garg2006]. In a later paper, the same authors consider the problem of minimizing total response time on multiple identical machines where each job can be assigned only to a specified subset of machines. They give an $O(\log P)$-approximation algorithm as well as an $\Omega(\log P / \log\log P)$ lower bound [Garg2007]. The same ideas were used to get an $O(k)$-approximation algorithm for the unrelated case ($R|r_j, \mathrm{pmtn}|\sum F_j$) when there are $k$ different processing times [Garg2008]. In the same paper, the authors showed an $\Omega(\log P / \log\log P)$ hardness of approximation for $R|r_j, \mathrm{pmtn}|\sum F_j$. More recently, Bansal and Kulkarni designed an elegant $O(\min(\log^2 n, \log n \log P))$-approximation algorithm for $R|r_j, \mathrm{pmtn}|\sum F_j$, which provides a basis for our algorithm for average response time [Bansal2015flowtime].
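As a concrete illustration of the SRPT rule that underlies several of the results above, the following sketch (function name and conventions are ours, not from the paper) simulates preemptive SRPT on a single machine with unit-size rounds and reports the total response time:

```python
import heapq

def srpt_total_response_time(jobs):
    """Simulate preemptive SRPT on a single machine, one unit of work
    per round. jobs: list of (release_time, processing_time) pairs.
    Returns the sum over jobs of (completion_time - release_time)."""
    jobs = sorted(jobs)              # order by release time
    heap, total, t, i = [], 0, 0, 0
    while i < len(jobs) or heap:
        if not heap and t < jobs[i][0]:
            t = jobs[i][0]           # machine idles until the next release
        while i < len(jobs) and jobs[i][0] <= t:
            r, p = jobs[i]
            heapq.heappush(heap, (p, r))   # keyed by remaining work
            i += 1
        p, r = heapq.heappop(heap)
        t += 1                       # run the shortest job for one round
        if p > 1:
            heapq.heappush(heap, (p - 1, r))
        else:
            total += t - r           # job finished: add its response time
    return total
```

On the instance with jobs (0, 3) and (1, 1), SRPT preempts the long job to finish the short one first, which is optimal for total response time.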
Maximum response time
The problem of minimizing maximum response time has not been studied extensively. $1|r_j|F_{\max}$ is polynomial-time solvable [Lawler1978]. The first-in first-out heuristic (FIFO) is known to be a $(3 - 2/m)$-competitive algorithm for $P|r_j|F_{\max}$ and $P|r_j, \mathrm{pmtn}|F_{\max}$ [Mastrolilli2003, bender+cm:schedule]. On the other hand, Ambühl and Mastrolilli give an optimal online algorithm for $P2|r_j, \mathrm{pmtn}|F_{\max}$ and show that FIFO achieves the best possible competitive ratio on two identical machines when preemption is not allowed [AMBUHL2005597]. There is also an $O(\log n)$-approximation algorithm for $R|r_j|F_{\max}$ [Bansal2015flowtime].
2 Problem Definitions and Notation
We consider two scheduling problems in which flows arrive over a sequence of discrete rounds on a non-blocking switch. In this model, we are given a switch $G = (I \cup O, E)$, where $I$ is the set of input ports and $O$ the set of output ports, and each port $v$ has a corresponding capacity $c_v$. $F$ is a set of flows, each flow $f$ with one input port $i_f \in I$ and one output port $o_f \in O$. Each flow $f$ has a corresponding demand $d_f$ and release time $r_f$. We assume throughout that for any flow $f$, $d_f \leq \min(c_{i_f}, c_{o_f})$.
For a given instance $(G, F)$, we consider functions of the form $S : F \times \mathbb{Z}^+ \to \mathbb{R}_{\geq 0}$. We say that $S$ schedules flow $f$ in round $t$ if $S(f, t) > 0$ (for ease of notation, we use $S_t(f)$ to denote $S(f, t)$). A function $S$ is a schedule of $(G, F)$ if the following conditions are met: every flow $f$ is entirely scheduled across all rounds (i.e. $\sum_t S_t(f) = d_f$); every flow is scheduled only in rounds at or after its release time (i.e. $S_t(f) = 0$ for all $t < r_f$); and, for all ports, the total size of all flows scheduled on port $v$ in a given round is no more than $v$'s capacity (i.e. for all rounds $t$ and ports $v$, $\sum_{f : v \in \{i_f, o_f\}} S_t(f) \leq c_v$). For a given flow $f$ and schedule $S$, the response time $F_f$ is the difference between its completion time and its release time, i.e. $F_f = C_f - r_f$, where $C_f = \max\{t : S_t(f) > 0\}$.
The first problem we study in this model is Flow Scheduling to Minimize Average Response Time (FS-ART), in which we seek to minimize $\frac{1}{|F|} \sum_{f \in F} F_f$. The second problem we study in this model is Flow Scheduling to Minimize Maximum Response Time (FS-MRT), in which we seek to minimize $\max_{f \in F} F_f$.
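The conditions defining a schedule, and the response times it induces, can be checked mechanically. The sketch below (all identifiers are ours) represents a schedule as a map from (flow, round) pairs to scheduled amounts, following the definitions above:

```python
def is_valid_schedule(flows, caps, S):
    """Verify the three conditions for S to be a schedule.
    flows: dict f -> (in_port, out_port, demand, release_round)
    caps:  dict port -> capacity
    S:     dict (f, round) -> amount of f scheduled in that round"""
    # (1) every flow is entirely scheduled across all rounds
    for f, (_, _, d, _) in flows.items():
        if sum(a for (g, _), a in S.items() if g == f) != d:
            return False
    # (2) no flow is scheduled before its release round
    if any(a > 0 and t < flows[f][3] for (f, t), a in S.items()):
        return False
    # (3) the load on each port in each round is within its capacity
    load = {}
    for (f, t), a in S.items():
        for v in flows[f][:2]:           # both endpoints of the flow
            load[(v, t)] = load.get((v, t), 0) + a
    return all(load[(v, t)] <= caps[v] for (v, t) in load)

def response_times(flows, S):
    """F_f = C_f - r_f, where C_f is the last round in which f is scheduled."""
    return {f: max(t for (g, t), a in S.items() if g == f and a > 0) - r
            for f, (_, _, _, r) in flows.items()}
```

Averaging or maximizing over `response_times(...)` gives the FS-ART and FS-MRT objectives, respectively.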
Throughout the paper we use $(i, o)$ to denote a flow (directed edge) from input port $i$ to output port $o$. We use $[k]$ to denote the set of positive integers less than or equal to $k$. For a given problem instance, if the numbers of input and output ports are identical, we refer to the instance as an $n \times n$ switch. The main notation is given in the table below.
$r_f$ : flow $f$'s release time
$F_f$ : flow $f$'s response time
$C_f$ : flow $f$'s completion time
3 Average Response Time
In this section, we study Flow Scheduling to Minimize Average Response Time (FS-ART). Here, we assume that all instances have identical numbers of input and output ports. Specifically, we assume each instance is an $n \times n$ switch $G$.
From a complexity viewpoint, FS-ART generalizes various classic scheduling problems. The special case of FS-ART with arbitrary demands, unit capacities, and a single input and output port is equivalent to preemptive single-machine scheduling with release times, which is strongly NP-hard when the objective function is the weighted sum of completion times ($1|r_j, \mathrm{pmtn}|\sum w_j C_j$). Note that $1|r_j, \mathrm{pmtn}|\sum C_j$ is polynomial-time solvable, while the approximability of the weighted response time objective $1|r_j, \mathrm{pmtn}|\sum w_j F_j$ is still open.
For $n \geq 2$, FS-ART instances incur coupling issues, even for unit demands. Each flow requires resources at two ports simultaneously, which makes the problem harder in a different way. In [chromatic-sched], the authors consider the following closely related biprocessor scheduling problem: there are identical machines and unit-sized jobs, each of which requires simultaneous use of two pre-specified (dedicated) machines. The objective is to minimize the total completion time of jobs. The hardness of this problem is related to the graph that arises from the pre-specified machine pairs (machines correspond to nodes and edges to jobs). The authors in [chromatic-sched] prove that the problem is strongly NP-hard if the graph is cubic. Furthermore, the problem is shown to remain NP-hard even if the graph is bipartite and subcubic (i.e., with maximum degree 3), which implies that FS-ART is NP-hard even for unit demands and unit capacities and identical release times for all flows. While constant-factor approximations [chromatic-sched, kub-kraw] are known for the makespan and average completion time metrics, no results are known for response time metrics.
Section 3.1 presents a linear programming approach based on iterative rounding, building on prior work on unrelated machines. Section 3.2 uses this approach to establish the main approximation result of this section.
3.1 A linear-programming approach
In this section, we investigate linear programming approaches used in the context of machine scheduling and adapt them to our setting. On a conceptual level, our problem is harder than parallel/related/unrelated machine scheduling in the sense that we have to deal with simultaneous use of ports, but is easier in the sense that we do not have to worry about the assignment of flows/jobs to machines as each flow specifies its source and destination ports.
Our starting point is the following linear program, similar to the one used by Garg and Kumar [Garg2006]:

minimize $\sum_{f \in F} \sum_{t \geq r_f} \frac{t - r_f}{d_f} \, x_{f,t}$  (1)

subject to $\sum_{t \geq r_f} x_{f,t} = d_f$ for all $f \in F$  (2)

$\sum_{f : v \in \{i_f, o_f\}} x_{f,t} \leq c_v$ for all ports $v$ and rounds $t$  (3)

$x_{f,t} \geq 0$ for all $f \in F$ and rounds $t \geq r_f$  (4)

Informally, the variable $x_{f,t}$ gives the amount of flow $f$ that is scheduled in round $t$. Constraint (2) ensures that each flow is completed across all rounds. Constraint (3) ensures that no port is overloaded in any round. We can rewrite the objective function as $\sum_{f \in F} \bar{F}_f$ where $\bar{F}_f = \sum_{t \geq r_f} \frac{t - r_f}{d_f} x_{f,t}$ is the fractional response time of flow $f$.
Suppose that the completion time of flow $f$ in schedule $S$ is $C_f$. Then the response time of $f$ is $C_f - r_f$. Notice that
That is, $\bar{F}_f$ is maximized when as much of flow $f$ as possible is scheduled in each round, subject to $f$ completing in round $C_f$. But,
which completes the proof. ∎
We note that an optimal solution to (1)-(4) yields a non-integral schedule which optimizes the average fractional response time. Importantly, the solution already takes care of the resource coupling issue (between ports) for us. Unfortunately, it is not clear what the gap is between the LP's objective function and the true total response time.
We now consider another linear programming formulation first used by Bansal and Kulkarni [Bansal2015flowtime] for the problem of job scheduling on unrelated machines. The authors use an iterative rounding scheme to get a tentative schedule with low additive overload for any interval of time. We do the same. This linear program and the subsequent ones, used in iterative rounding, are all interval-based. In the initial LP, the interval size is 4. However, as we will see, in the subsequent relaxations, the interval size can grow.
minimize $\sum_{f \in F} \sum_{t \geq r_f} \frac{t - r_f}{d_f} \, x_{f,t}$  (5)

subject to $\sum_{t \geq r_f} x_{f,t} = d_f$ for all $f \in F$  (6)

$\sum_{f : v \in \{i_f, o_f\}} \sum_{t' = 4j+1}^{4j+4} x_{f,t'} \leq 4 c_v$ for all ports $v$ and windows $j \geq 0$  (7)

$x_{f,t} \geq 0$ for all $f \in F$ and rounds $t \geq r_f$  (8)

As before, the real variable $x_{f,t}$ is the amount of flow $f$ scheduled in round $t$. Constraint (7) ensures that the total sum of flows scheduled on a given port $v$ in each window of four consecutive rounds is no more than four times the capacity of $v$. Clearly, this new LP is a relaxation of the previous one; consequently, the value of an optimal solution to this LP is a lower bound on the response time of any integral schedule.
Following [Bansal2015flowtime], we use an iterative rounding scheme to get the following result.
There exists a solution $\bar{x}$ satisfying the following properties.
For each flow $f$, there is exactly one round $t$ for which $\bar{x}_{f,t} > 0$.
The cost of $\bar{x}$ is at most that of an optimal solution to the LP.
For any port $v$ and any time interval $[t_1, t_2]$, $\sum_{f : v \in \{i_f, o_f\}} \sum_{t = t_1}^{t_2} \bar{x}_{f,t} \leq c_v (t_2 - t_1 + 1) + O(c_v \log |F|)$.
Before going further, let's consider a solution $\bar{x}$ satisfying the three properties in the lemma. Such a solution can be regarded as a sequence of bipartite graphs $G_1, G_2, \ldots$ such that for any given (time) interval $I$, the degree of any vertex $v$ in the "combined" graph $\bigcup_{t \in I} G_t$ is at most $c_v$ times the length of the interval, plus an additive term of $O(c_v \log |F|)$. In order to get a valid schedule, this sequence must be converted to a sequence of matchings. We will examine this step in the next section.
To establish Lemma 3.3, we iteratively relax variable assignments with a sequence of linear programs which we denote by $LP^i$ for $i \geq 0$, where $LP^0$ is the initial linear program (5)-(8). We denote the set of flows that appear in $LP^i$ by $F^i$ and an optimal solution to $LP^i$ by $x^i$. Let $N^i$ be the set of variables in $x^i$ with non-zero assignments. Let $A^i$ be the set of flows $f$ such that, for all $t$, $x^i_{f,t}$ is integral. Let $T^i$ be the set of tight capacity constraints (11) in $LP^i$ given $x^i$. Let $F^{i+1} = F^i \setminus A^i$. See Figure 2 for a high-level picture of the iterative procedure.
In each iteration $i \geq 0$, we construct $LP^{i+1}$ as follows.
Find an optimal (basic) solution $x^i$ to $LP^i$.
Eliminate zero variables. In other words, the variables in $LP^{i+1}$ are only defined for variables in $N^i$, the support of $x^i$.
Fix integral assignments. For all $f \in A^i$, permanently assign $f$ to the rounds $t$ for which $x^i_{f,t} > 0$ (i.e. set $\bar{x}_{f,t} = x^i_{f,t}$) and drop all variables associated with $f$. We also update $F^{i+1} = F^i \setminus A^i$.
Define intervals for the current iteration as follows. Fix a port $v$ and consider the flows in $F^{i+1}$ incident to $v$. Sort all the variables $x_{f,t}$ in increasing order of $t$, breaking ties lexicographically. Next, iteratively group variables into groups as follows. To construct a group, start from the earliest non-grouped variable and greedily group consecutive variables until their sum under $x^i$ first exceeds $4c_v$. Each group forms an interval which we denote by $I$. The size of the interval is $\mathrm{size}(I) = \sum_{(f,t) \in I} x^i_{f,t}$.
Note that $4c_v < \mathrm{size}(I) \leq 4c_v + d_{\max} \leq 5c_v$. Importantly, the length of $I$ in time (i.e. as a time interval) can be much larger than its size. On the other hand, for $i = 0$, all intervals are of size $4c_v$, as evident in the initial LP.
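The greedy grouping step above can be sketched as follows; the closing threshold (a multiple of the port capacity $c_v$ in the analysis) is left as a parameter, and all identifiers are our own:

```python
def group_into_intervals(vars_sorted, threshold):
    """Greedily group consecutive (key, value) pairs: each group is
    closed as soon as its running sum of values first exceeds
    `threshold`, so every full group has sum in
    (threshold, threshold + max_value].

    vars_sorted: list of ((f, t), x) pairs, already sorted by round t.
    Returns a list of groups (lists of pairs); a trailing partial group
    is folded into the previous group, one convention for handling it.
    """
    groups, cur, s = [], [], 0.0
    for kv in vars_sorted:
        cur.append(kv)
        s += kv[1]
        if s > threshold:
            groups.append(cur)
            cur, s = [], 0.0
    if cur:                      # leftover variables, if any
        if groups:
            groups[-1].extend(cur)
        else:
            groups.append(cur)
    return groups
```

Each returned group corresponds to one interval $I$, and the sum of its values is $\mathrm{size}(I)$.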
The linear program relaxation for iteration $i+1$ is

minimize $\sum_{f \in F^{i+1}} \sum_{t \geq r_f} \frac{t - r_f}{d_f} \, x_{f,t}$  (9)

subject to $\sum_{t \geq r_f} x_{f,t} = d_f$ for all $f \in F^{i+1}$  (10)

$\sum_{(f,t) \in I} x_{f,t} \leq \mathrm{size}(I)$ for all ports $v$ and intervals $I$ of $v$  (11)

$x_{f,t} \geq 0$  (12)
It is not difficult to see that $LP^{i+1}$ is a relaxation of $LP^i$. Consequently, the second requirement of Lemma 3.3 is satisfied. Also, by construction of the iterative procedure, the sequence of iterations results in an integral assignment of all flows, and so the first requirement of Lemma 3.3 is satisfied. It remains to bound the number of iterations and calculate the backlog.
Recall that $F^i$ is the set of flows whose variables appear in $LP^i$. Note that, for $i \geq 1$, these are the non-zero variables which correspond to non-integrally-assigned flows after solving $LP^{i-1}$.
For all $i \geq 0$, $|F^{i+1}| \leq |F^i| / 2$.
Consider a linearly independent set of tight constraints in $LP^i$ corresponding to a basic optimal solution $x^i$. Since a tight non-negativity constraint (12) results in a zero variable, the number of non-zero variables, $|N^i|$, is at most the number of tight flow constraints (10) plus the number of tight capacity constraints (11). That is, $|N^i| \leq |F^i| + |T^i|$,
since $|F^i|$ is the number of flow constraints.
Now, each flow which is not integrally assigned by $x^i$ (i.e. not in $A^i$) contributes at least two non-zero variables to $N^i$, while each integrally assigned flow contributes at least one. Thus, $|F^i| + |F^{i+1}| \leq |N^i| \leq |F^i| + |T^i|$, and so $|F^{i+1}| \leq |T^i|$.
Next, we show that $|T^i| \leq |F^i| / 2$, which completes the proof. This is accomplished by a simple combinatorial argument. Let's give 2 tokens to every flow in $F^i$. Now, each flow $f$, for each variable $x^i_{f,t}$, gives $x^i_{f,t}/d_f$ tokens to each of the two intervals (one per port of $f$) that contain $(f,t)$. This token distribution is valid since $2 \sum_t x^i_{f,t}/d_f = 2$,
where we have used the fact that each flow appears in exactly two port constraints. At the same time, each tight capacity constraint for port $v$ receives at least 4 tokens, since interval sizes are at least $4c_v$ by definition and $d_f \leq c_v$ by assumption. Now, as each flow distributes exactly 2 tokens and each tight port constraint receives at least 4, we conclude that $|T^i| \leq |F^i| / 2$. ∎
Lemma 3.5 shows that the number of iterations needed before arriving at an integral solution is at most $\log_2 |F|$. What remains is to bound the amount of extra load that any interval has taken on. Recall that $A^i$ denotes the set of flows which are integrally assigned by the optimal solution to $LP^i$. Let $A^i(v, I)$ be the set of flows which are integrally assigned to port $v$ in the interval $I$ by the optimal solution of $LP^i$. Furthermore, we define $\mathrm{vol}^i(v, I) = \sum_{f \in A^i(v, I)} d_f + \sum_{(f,t) \in I : f \in F^{i+1}} x^i_{f,t}$,
which is the total size of flows assigned to port $v$ in the interval $I$, either integrally or fractionally, by $x^i$. The following lemma states that the amount of extra load taken on by any port in any interval is no more than a constant additive term over the load in the previous iteration.
For any iteration $i$, any port $v$, and any interval $I$,
Fix an interval $I$ and a port $v$. In each iteration $i$, "extra" load in this interval can be introduced only if two intervals of iteration $i$ overlap with the boundaries of $I$.
Consider a maximal set of contiguous intervals $I_j, I_{j+1}, \ldots, I_k$ of iteration $i$ that cover $I$. Note that $j$ is the smallest index such that $I_j$ contains some $(f, t)$ with $t \in I$. Similarly, $k$ is the largest index such that $I_k$ contains some $(f, t)$ with $t \in I$. Since each interval is of size at most $5c_v$,
where the first inequality follows from the port capacity constraints (3), and the second equality follows from the definition of $\mathrm{vol}$. Consequently, we have that
where the last step uses the fact that each interval is of size at most $5c_v$.
This completes the proof since, by definition, the above is equivalent to the claimed inequality (15).
We now establish a bound on the total "extra" load in any interval for the final assignment. Recall that $\bar{x}$ is the final, integral assignment derived from the iterative procedure above.
For any interval $I$ and port $v$, $\mathrm{vol}(v, I) \leq \mathrm{vol}^0(v, I) + O(c_v \log |F|)$.
We fix the interval $I$ and the port $v$. By construction of $\bar{x}$, we need only show that, for all iterations $i$,
We prove inequality (17) by induction on $i$.
We now have all the necessary ingredients to prove Lemma 3.3.
Proof of Lemma 3.3.
In the final solution $\bar{x}$, all flows are integrally assigned. Furthermore, the cost of the final solution is at most that of an optimal solution to the initial linear program (since, in each iteration, we relax the previous linear program). Finally, by Lemma 3.7, for any time interval $I$ and port $v$, the total volume of assigned flows is at most $c_v |I| + O(c_v \log |F|)$. ∎
3.2 Getting a valid schedule
What we obtain from Lemma 3.3 is, unfortunately, not a valid schedule but what could be called a pseudo-schedule; as noted in Remark 3.4, the total amount of flow passing through a port $v$ during a time interval $I$ could be as much as $O(c_v \log |F|)$ more than $c_v |I|$, the maximum allowed by the capacity of the port. In this section we show that we can convert the pseudo-schedule given by Lemma 3.3 into a valid schedule using resource augmentation, i.e., assuming the algorithm is allowed more port capacity than the optimal schedule. It is immediate from Lemma 3.3 that if we augment the capacity of every port by a factor of $O(\log |F|)$, then we obtain a valid resource-augmented schedule with optimal average response time. In the following, we show that we can achieve logarithmic-approximate average response time with a small constant blowup in port capacity, for the case of unit-demand flows (and arbitrary port capacities).
For any positive integer $k$, there exists a polynomial-time algorithm that, given a set $F$ of unit flows over an $n \times n$ switch, computes an $O(1 + (\log |F|)/k)$-approximation for the average response time of unit-size flows, while incurring a blowup in capacity by a factor of $O(k)$.
Given a set $F$ of flows over an $n \times n$ switch, by Lemma 3.3, there exists a pseudo-schedule which assigns flows to time slots such that the total response time is at most the cost of an optimal solution to the initial linear program, and, for any given time interval $I$ and for any port $v$, the total volume of flows assigned to $v$ during the interval is at most $c_v |I| + O(c_v \log |F|)$.
We first prove the desired claim for unit capacities, and later show how to extend it to arbitrary capacities. The pseudo-schedule can be regarded as a sequence of bipartite graphs $G_1, G_2, \ldots$ such that in any given interval $I$, the degree of each vertex in the combined graph $\bigcup_{t \in I} G_t$ is at most $|I| + \gamma \log |F|$ for some constant $\gamma$. Next, we convert this sequence into a sequence of bipartite matchings. To this end, we divide the timeline into consecutive intervals $I_1, I_2, \ldots$, each of size $\ell = \lceil (\gamma \log |F|)/k \rceil$. Now, starting from the beginning, we schedule the flows in each interval before going to the next one. Consider an interval $I_j$: the degree of each vertex in the combined graph $G_{I_j}$ is at most $\ell + \gamma \log |F| \leq (k+1)\ell$. Applying the Birkhoff-von Neumann Theorem [birkhoff], $G_{I_j}$ can be decomposed into at most $(k+1)\ell$ matchings in polynomial time. By increasing the capacity (bandwidth) of each port to $k+1$, we can execute these matchings in the next available spots (with respect to release times) in at most $\ell$ time steps. Since each flow is delayed by at most $2\ell$ steps, the total response time of this schedule is no more than $\mathrm{OPT} + 2\ell |F| = O(1 + (\log |F|)/k) \cdot \mathrm{OPT}$,
where the inequality follows from the fact that the number of flows is a lower bound on the total response time.
We now show that the above algorithm and argument can be extended to general capacities, using the notion of $b$-matchings (a $b$-matching of a bipartite graph, for a given function $b$ from the graph's vertex set to the nonnegative integers, is a subgraph in which the degree of each vertex $v$ is at most $b(v)$; e.g., see [gerards:matching]) and a standard transformation between $b$-matchings and matchings [gerards:matching]. In the general case, the pseudo-schedule can be regarded as a sequence of bipartite graphs such that in any given interval $I$, the degree of port $v$ in the combined graph is at most $c_v (|I| + \gamma \log |F|)$ for some constant $\gamma$. Similar to the unit capacity case, we convert this sequence into a sequence of bipartite $b$-matchings, where the function $b$ corresponds to the port capacities. To this end, we divide the timeline into consecutive intervals $I_1, I_2, \ldots$, each of size $\ell = \lceil (\gamma \log |F|)/k \rceil$. Now, starting from the beginning, we schedule the flows in each interval before going to the next one. For each interval $I_j$, we construct a bipartite graph $H_j$ as follows. We replicate each port $v$ $c_v$ times, and process the edges of the combined graph in sequence: for edge $(u, v)$, we add an edge to $H_j$ between a copy of $u$ and a copy of $v$, each of which is chosen in a round-robin manner among the copies of $u$ and $v$, respectively. This ensures that the degree of any vertex in $H_j$ is at most $(k+1)\ell$. Now, applying the Birkhoff-von Neumann Theorem [birkhoff], $H_j$ can be decomposed into at most $(k+1)\ell$ matchings in polynomial time. By increasing the capacity of each port replica to $k+1$, and hence increasing the capacity of each port by a factor of $k+1$, we can execute these matchings in the next available spots (with respect to release times) in at most $\ell$ time steps, with an $O(1 + (\log |F|)/k)$-factor increase in average response time. ∎
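The decomposition step above can be made concrete. For bipartite graphs, partitioning the edges of a maximum-degree-$\Delta$ graph into $\Delta$ matchings is König's edge-coloring theorem (the graph analogue of the Birkhoff-von Neumann decomposition invoked in the proof). The sketch below (identifiers ours) colors edges one at a time, flipping an alternating path whenever the two endpoints disagree on a free color:

```python
def edge_color_bipartite(edges, delta):
    """Partition the edges of a bipartite multigraph with maximum degree
    `delta` into at most `delta` matchings (Konig's edge-coloring
    theorem). Left and right vertex labels must be disjoint.
    Returns color[i] in range(delta) per edge; edges that share an
    endpoint receive distinct colors."""
    used = {}                        # (vertex, color) -> edge index
    color = [None] * len(edges)

    def free(x):                     # smallest color unused at vertex x
        return next(c for c in range(delta) if (x, c) not in used)

    for i, (u, v) in enumerate(edges):
        a, b = free(u), free(v)
        if a != b:
            # walk the a/b alternating path starting from v's a-edge;
            # it cannot reach u (bipartite parity argument)
            path, x, c = [], v, a
            while (x, c) in used:
                j = used[(x, c)]
                path.append(j)
                uj, vj = edges[j]
                x = vj if x == uj else uj
                c = a + b - c
            for j in path:           # uncolor the whole path first ...
                uj, vj = edges[j]
                del used[(uj, color[j])]
                del used[(vj, color[j])]
            for j in path:           # ... then re-insert, colors swapped
                color[j] = a + b - color[j]
                uj, vj = edges[j]
                used[(uj, color[j])] = j
                used[(vj, color[j])] = j
        color[i] = a                 # now a is free at both endpoints
        used[(u, a)] = used[(v, a)] = i
    return color
```

Each color class is a matching, which can then be executed in its own time step.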
4 Maximum Response Time
In this section, we consider the problem of Flow Scheduling to Minimize Maximum Response Time (FS-MRT). More formally, for a given instance of FS-MRT, our goal is to find the minimum such that there exists a schedule of with maximum response time . Section 4.1 establishes that solving FS-MRT is NP-hard. Section 4.2 provides a tight approximation to FS-MRT via a linear programming relaxation and rounding of a more general problem.
4.1 Maximum Response Time Hardness
In this section, we prove that FS-MRT is NP-hard. This result motivates our approximation algorithm for FS-MRT in the following section.
There is no polynomial-time algorithm that solves Flow Scheduling to Minimize Maximum Response Time to within a factor of $4/3$ of optimal, assuming $P \neq NP$.
We prove Theorem 2 via a reduction from the Restricted Timetable (RTT) problem, which is shown to be NP-hard in [timetable]. We restate RTT here for completeness.
Definition 4.1 (Restricted Timetable (RTT) problem).
Given the following data:
a set of teachers $\{1, \ldots, n\}$ and a set of classes $\{1, \ldots, m\}$, where each teacher $i$ has a set $A_i \subseteq \{1, 2, 3\}$ of available hours, and
a requirement function $R : [n] \times [m] \to \{0, 1\}$ such that $\sum_{j} R(i, j) = |A_i|$ for all $i \in [n]$,
determine if there is a meeting function $f : [n] \times [m] \times [3] \to \{0, 1\}$ such that
$f(i, j, h) = 1$ only if $h \in A_i$ and $R(i, j) = 1$, for all $i \in [n]$, $j \in [m]$, and $h \in [3]$;
$\sum_{h} f(i, j, h) = R(i, j)$ for all $i \in [n]$ and $j \in [m]$;
$\sum_{j} f(i, j, h) \leq 1$ for all $i \in [n]$ and $h \in [3]$; and $\sum_{i} f(i, j, h) \leq 1$ for all $j \in [m]$ and $h \in [3]$.
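The four conditions can be verified mechanically for a candidate meeting function. The sketch below follows the notation of Definition 4.1 as stated here (the dictionary-based representation is our own convention):

```python
def is_valid_rtt_solution(avail, req, f, hours=3):
    """Check conditions 1-4 of the RTT definition.
    avail: dict teacher -> set of available hours (subset of 1..hours)
    req:   dict (teacher, class) -> 0/1 requirement
    f:     dict (teacher, class, hour) -> 0/1 meeting function"""
    teachers = {i for i, _ in req}
    classes = {j for _, j in req}
    H = range(1, hours + 1)
    get = lambda i, j, h: f.get((i, j, h), 0)
    # condition 1: meetings occur only in the teacher's available hours
    if any(get(i, j, h) and h not in avail[i]
           for i in teachers for j in classes for h in H):
        return False
    # condition 2: every requirement is met exactly
    if any(sum(get(i, j, h) for h in H) != req.get((i, j), 0)
           for i in teachers for j in classes):
        return False
    # condition 3: each teacher meets at most one class per hour
    if any(sum(get(i, j, h) for j in classes) > 1
           for i in teachers for h in H):
        return False
    # condition 4: each class meets at most one teacher per hour
    if any(sum(get(i, j, h) for i in teachers) > 1
           for j in classes for h in H):
        return False
    return True
```

A "yes" instance of RTT is one admitting some f for which this check passes.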
Proof of Theorem 2.
We reduce RTT to the feasibility version of FS-MRT in which we are given a switch, a flow set, and a response time bound $T$, and our goal is to check whether or not there exists a schedule with maximum response time at most $T$. Let $R$ be an arbitrary instance of the RTT problem consisting of the availability sets $A_i$ and the requirement function $R(i, j)$. We reduce $R$ to an instance of FS-MRT consisting of a switch $G$ and a flow set $F$. In $G$, there are input ports $u_1, \ldots, u_n$, one per teacher, and output ports $w_1, \ldots, w_m$, one per class. All ports have capacity 1. We construct the set $F$ according to the following steps (in order).
For all $i \in [n]$ and $j \in [m]$, if $R(i, j) = 1$ then we include a unit flow from input port $u_i$ to output port $w_j$.
For each input port $u_i$, we take the minimum available hour $h_i = \min A_i$ and release all flows adjacent to $u_i$ in round $h_i$.
For all $j \in [m]$, we create three new input ports $a_j$, $b_j$, $c_j$. We include the flows $(a_j, w_j)$, $(b_j, w_j)$, and $(c_j, w_j)$ and release these flows in round 4.
For all $i$ such that $2 \notin A_i$, we create a new output port $y_i$ and three new input ports $p_i$, $q_i$, $s_i$. We include a flow $(u_i, y_i)$ and release it in round 2. We also include flows $(p_i, y_i)$, $(q_i, y_i)$, and $(s_i, y_i)$ and release them in round 3.
For all $i$ such that $3 \notin A_i$, we create a new output port $z_i$ and three new input ports $p'_i$, $q'_i$, $s'_i$. We include a flow $(u_i, z_i)$ and release it in round 3. We also include flows $(p'_i, z_i)$, $(q'_i, z_i)$, and $(s'_i, z_i)$ and release them in round 4.
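The construction can be summarized in code. The sketch below follows the five steps above as stated here; the gadget port labels are our own, and the instance is returned as a list of unit flows with release rounds:

```python
def build_fsmrt_instance(avail, req, hours=3):
    """Build the FS-MRT flow set from an RTT instance, following the
    five construction steps. Returns a list of (in_port, out_port,
    release_round) unit flows.
    avail: dict teacher i -> set of available hours in {1, .., hours}
    req:   dict (i, j) -> 0/1"""
    flows = []
    classes = {j for _, j in req}
    # steps 1 and 2: a teaching flow per requirement, released at the
    # teacher's minimum available hour
    for (i, j), r in req.items():
        if r:
            flows.append((f"u{i}", f"w{j}", min(avail[i])))
    # step 3: three blockers keep each class port busy in rounds 4-6
    for j in classes:
        for s in "abc":
            flows.append((f"{s}{j}", f"w{j}", 4))
    # steps 4 and 5: if teacher i is unavailable at hour h in {2, 3},
    # a gadget forces port u_i to be busy in round h: its output port
    # is blocked from round h+1 on by three more blocker flows
    for i, hrs in avail.items():
        for h in (2, 3):
            if h not in hrs:
                out = f"y{i}h{h}"
                flows.append((f"u{i}", out, h))
                for s in "pqs":
                    flows.append((f"{s}{i}h{h}", out, h + 1))
    return flows
```

All ports have capacity 1, so the blockers occupy their shared output port for three consecutive rounds.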
For the remainder of the argument, we refer to the set of ports added in step $s$ as $P_s$ for $s \in \{1, \ldots, 5\}$, with $P = P_3 \cup P_4 \cup P_5$. For a given function $f$ we construct a schedule $S$ as follows. For all flows $(u_i, w_j)$ with $R(i, j) = 1$, we have $S$ schedule $(u_i, w_j)$ in round $h$ if and only if $f(i, j, h) = 1$. For each $j \in [m]$, there are three flows $(a_j, w_j)$, $(b_j, w_j)$, and $(c_j, w_j)$, which $S$ schedules in rounds 4, 5, and 6, respectively. For all $i$ such that $2 \notin A_i$, there are four flows with ports among $u_i$, $y_i$, $p_i$, $q_i$, $s_i$. $S$ schedules $(u_i, y_i)$ in round 2 and schedules $(p_i, y_i)$, $(q_i, y_i)$, and $(s_i, y_i)$ in rounds 3, 4, and 5, respectively. For all $i$ such that $3 \notin A_i$, there are four flows with ports among $u_i$, $z_i$, $p'_i$, $q'_i$, $s'_i$. $S$ schedules $(u_i, z_i)$ in round 3 and schedules $(p'_i, z_i)$, $(q'_i, z_i)$, and $(s'_i, z_i)$ in rounds 4, 5, and 6, respectively.
Suppose $f$ satisfies conditions 1, 2, 3, and 4. We show that $S$ is a schedule of $F$ with maximum response time 3. By construction of $S$, all flows with a port in $P$ are scheduled within three rounds of their release, so we need only show that all flows $(u_i, w_j)$ with $R(i, j) = 1$ are scheduled and that there is at most one scheduled flow adjacent to every port in every round. The first follows from condition 2 of $f$. Suppose there is a port that has two adjacent scheduled flows in one round. By conditions 3 and 4 of $f$, one of these flows must be a teaching flow with left endpoint $u_i$ and the other a gadget flow $(u_i, y_i)$ or $(u_i, z_i)$, scheduled in round 2 or 3 with $2 \notin A_i$ or $3 \notin A_i$, respectively. However, all teaching flows at $u_i$ are scheduled in rounds of $A_i$, violating condition 1.
Suppose is a schedule of