I Introduction
Distributed computing networks (such as MapReduce [6], Spark [25], and Dryad [11]) have become increasingly popular for supporting data-intensive jobs. The key to processing a data-intensive job is to divide it into a group of small tasks that can be processed in parallel by multiple computing machines. Because this feature is common to those distributed computing networks, the term coflow was recently proposed in [3] to represent such a group of tasks belonging to the same job. It turns out that the coflow abstraction not only expresses data-parallel computing networks, but also provides new opportunities to reduce the completion time of jobs.
Moreover, because of the real-time nature of latency-intensive jobs (e.g., online retail [23]), a coflow may need to be completed by a deadline. To maximize the number of real-time coflows that meet their deadlines, a scheduling algorithm that allocates computing machines to coflows is needed.
Coflow scheduling has been a hot topic since the coflow abstraction was proposed. On one hand, several works developed coflow scheduling systems, e.g., [4, 5, 17]. On the other hand, numerous works established coflow scheduling theory, e.g., [14, 19, 2, 22, 16, 20, 10]; see the recent survey [21]. Almost all prior research focused on deterministic networks; in contrast, little attention has been given to stochastic networks. Note that a coflow can be randomly generated; moreover, a computing machine can be unreliable because of unpredictable events [1] like hardware failures. In this context, a scheduling algorithm for stochastic real-time coflows on unreliable computing machines is crucial.
Although scheduling for traditional packet-based stochastic networks has been extensively studied (e.g., [18, 9]), those solutions cannot be applied to coflow-based stochastic networks. That is because all tasks in a coflow are dependent, in the sense that a coflow is not completed until all its tasks are completed, whereas all packets or tasks in traditional packet-based stochastic networks are treated independently (as also stated in [13]). The most relevant works on scheduling stochastic coflows are [15, 13]. While [15] focused on homogeneous stochastic coflows, [13] extended it to a heterogeneous case. The fundamental difference between those works and ours is that we consider stochastic real-time coflows and unreliable computing machines.
In this paper, we consider a master machine and unreliable computing machines. The master machine runs multiple jobs, which stochastically generate real-time coflows with a hard deadline. Our main contribution lies in developing coflow scheduling algorithms with provable performance guarantees. Leveraging Lyapunov techniques, we propose a feasibility-optimal scheduling algorithm that maximizes the (feasibility) region of feasible requirements for the average number of completed coflows. However, the feasibility-optimal scheduling algorithm turns out to involve an NP-hard combinatorial optimization problem. To tackle this computational issue, we propose an approximate scheduling algorithm that is computationally tractable; furthermore, we prove that its feasibility region shrinks by at most from the largest one. More surprisingly, our simulation results show that the feasibility region of the approximate scheduling algorithm is close to the largest one.

II System overview
II-A Network model
Consider a distributed computing network consisting of a master machine and computing machines . The master machine runs jobs . Fig. 1 illustrates an example network with and . Suppose that data transfer between the master machine and the computing machines occurs instantaneously and without error.
Divide time into frames and index them by . At the beginning of each frame , each job stochastically generates a coflow, where a coflow is a collection of tasks that can be processed by the corresponding computing machines. Precisely, we use vector to represent the coflow generated by job in frame , where each element indicates whether the coflow has a task for computing machine : if , then the coflow has a task for computing machine ; otherwise, it does not. See Fig. 1 for an example. Each task is also stochastically generated, i.e., is a random variable for all , , and . By we denote the number of 1's in vector ; in particular, if , then job does not generate any coflow in frame . Suppose that the probability distribution of random variable is independently and identically distributed (i.i.d.) over frames , for all and . Suppose that all tasks generated by job for computing machine have the same size.^1 Moreover, we consider real-time coflows and suppose that the deadline of each real-time coflow is one frame; this real-time setting has been justified in the literature, e.g., see [9].

Consider a time-varying processing speed for each computing machine, and suppose that the processing speed of each computing machine is i.i.d. over frames. With this i.i.d. assumption and the constant task sizes, we can suppose that a task generated by job can be completed by machine (i.e., when ) with a constant probability over frames. At the end of each frame, each computing machine reports whether its task was completed in that frame. If any task of a real-time coflow cannot be completed in its arriving frame, the coflow expires and is removed from the job.

^1 For the case of time-varying task sizes, the methodology in this paper still applies if those sizes are i.i.d. over frames; we only need to revise the state defined in the proof of Theorem 5 to include those task sizes.
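As a concrete illustration of this stochastic model, the following Python sketch draws a coflow and checks whether it meets its one-frame deadline. The function names, the Bernoulli task-generation model, and the explicit probability vectors are our own illustrative assumptions (the paper's exact symbols were lost in extraction):

```python
import random

def generate_coflow(gen_probs, rng):
    """Draw one coflow for a job: machine m gets a task independently
    with probability gen_probs[m] (a hypothetical per-machine model).
    The result is a 0/1 indicator vector over the computing machines."""
    return [1 if rng.random() < p else 0 for p in gen_probs]

def coflow_completes(coflow, completion_probs, rng):
    """A real-time coflow completes within its one-frame deadline only
    if every one of its tasks completes on its (unreliable) machine,
    where machine m succeeds with probability completion_probs[m]."""
    return all(rng.random() < completion_probs[m]
               for m, needed in enumerate(coflow) if needed)
```

Averaging `coflow_completes` over many frames estimates the per-coflow completion probability, i.e., the product of the per-machine probabilities over the machines the coflow needs.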
Since the master machine is unaware of task completions at the beginning of each frame, we suppose that it assigns at most one task to each computing machine per frame. If two coflows and , for some and , need the same computing machine in frame , i.e., for some , then we say the two coflows interfere. For example, coflows and in Fig. 1 interfere.
As a result of the interference, the master machine has to decide a set of interference-free coflows for computing. Let be the set of interference-free coflows decided for computing in frame . A scheduling algorithm is a time sequence of these decisions over all frames.
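The interference check behind this decision can be sketched as follows, a minimal illustration assuming coflows are represented as 0/1 machine-indicator vectors (the helper names are hypothetical):

```python
from itertools import combinations

def interferes(coflow_a, coflow_b):
    """Two coflows interfere if some computing machine is required by
    both, i.e., their 0/1 indicator vectors share a 1 in some entry."""
    return any(a == 1 and b == 1 for a, b in zip(coflow_a, coflow_b))

def is_interference_free(schedule):
    """Check that no pair of scheduled coflows shares a machine, so the
    set is a valid per-frame decision for the master machine."""
    return all(not interferes(x, y) for x, y in combinations(schedule, 2))
```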
II-B Problem formulation
Let random variable indicate whether coflow is completed in frame under scheduling algorithm , where if all tasks of the coflow are completed by the corresponding computing machines in frame , and otherwise. The random variable depends on the random variables , the task completion probabilities , and a potentially randomized scheduling algorithm .
We define the average number of completed coflows for job under scheduling algorithm by
(1) 
The goal of this paper is to design a feasibility-optimal^2 scheduling algorithm, defined below.

^2 The feasibility-optimal scheduling defined in this paper is analogous to throughput-optimal scheduling (e.g., [18]) and timely-throughput-optimal scheduling (e.g., [9]).
We refer to vector as a feasible requirement under scheduling algorithm if for all . We define the feasibility region of a scheduling algorithm as follows.
Definition 1.
The feasibility region of scheduling algorithm is the region of all feasible requirements under the scheduling algorithm .
Next, we define the maximum feasibility region as follows.
Definition 2.
The maximum feasibility region is the region of all requirements that are feasible under some scheduling algorithm.
Unlike the feasibility region of scheduling algorithm , the requirements in the region can be achieved by various scheduling algorithms, i.e., .
We define an optimal scheduling algorithm as follows.
Definition 3.
A scheduling algorithm is called a feasibilityoptimal scheduling algorithm if its feasibility region is .
That is, for any requirement , a feasibilityoptimal scheduling algorithm can meet the requirement .
III Scheduling algorithm design
In this section, we develop a feasibility-optimal scheduling algorithm for managing stochastic real-time coflows on unreliable computing machines. To that end, we transform our coflow scheduling problem into a queue scheduling problem in a virtual queueing network in Section III-A. With this transformation, we propose a feasibility-optimal scheduling algorithm in Section III-B. However, the proposed feasibility-optimal scheduling algorithm involves a combinatorial optimization problem, which we show to be NP-hard. Thus, we develop a tractable approximate scheduling algorithm in Section III-C and establish its approximation ratio.
III-A Virtual queueing network
In this section, we propose a virtual queueing network for the original distributed computing network with a requirement . The virtual queueing network consists of queues and servers , operating under the same frame structure as in Section II-A. For example, Fig. 2 is the virtual queueing network for the distributed computing network in Fig. 1. Let be the queue size of queue at the beginning of frame , and let be the vector of all queue sizes at the beginning of frame .
At the beginning of each frame , there are packets arriving at queue . Then, the packet arrival rate of is , and can represent the arrival rate vector for the virtual queueing network. In each frame , if , then queue connects to server . See Fig. 2 for an example.
In each frame, a server can serve at most one queue; in particular, server can complete the service for queue with probability in each frame. At the end of each frame, if all servers connected to a queue complete their services for that queue, one packet is removed from that queue.
If two queues and , for some and , connect to the same server in frame , i.e., for some , then we say the two queues interfere in frame . For example, queues and in Fig. 2 interfere in frame 1. Because of the interference in the virtual queueing network, we redefine decision to be the set of queues served in frame . A scheduling algorithm is again defined by .
Let indicate whether one packet is removed from queue in frame under scheduling algorithm , where if one packet is removed from queue and otherwise. Then, defined in Eq. (1) represents the packet service rate of queue .
With the above interpretation of the virtual queueing network, the region consists of all arrival rate vectors such that each arrival rate is less than the corresponding service rate for all under scheduling algorithm . The condition for all implies that all queues can be stabilized by scheduling algorithm , i.e., is finite. The region is therefore called the stability region [7] of scheduling algorithm . Moreover, the region consists of all arrival rate vectors such that all queues can be stabilized by some scheduling algorithm, and is called the capacity region [7] of the virtual queueing network. If the stability region of scheduling algorithm is identical to the capacity region, i.e., , then the scheduling algorithm is called throughput-optimal [7] for the virtual queueing network. That is, for any arrival rate vector , a throughput-optimal scheduling algorithm can stabilize the virtual queueing network.
With this transformation, the feasibility-optimal scheduling problem for the distributed computing network becomes the throughput-optimal scheduling problem for the virtual queueing network. Hence, we focus on throughput-optimal scheduling design for the virtual queueing network. We emphasize that, unlike in traditional stochastic networks (e.g., [18, 9, 7]), a packet in our virtual queueing network can be removed only when all its connected servers complete their services in its arriving frame. Thus, throughput-optimal scheduling algorithms for traditional stochastic networks cannot solve our problem. Our paper extends to stochastic networks with multiple required servers; in particular, we develop a tractable approximate scheduling algorithm in Section III-C.
III-B Feasibility-optimal scheduling algorithm
(2) 
In this section, we propose a throughput-optimal scheduling algorithm for the virtual queueing network in Alg. 1; i.e., for any given arrival rate vector , Alg. 1 can stabilize all queues. Thus, the corresponding scheduling algorithm for the distributed computing network can meet the requirement .
At the beginning of each frame , Alg. 1 (in Line 1) updates each queue with the newly arriving packets; then, Alg. 1 (in Line 1) decides for that frame according to the current queue size vector . The decision is made to maximize the weighted sum of the queue sizes in Eq. (2). The term in Eq. (2) calculates the expected service rate of , i.e., the probability that one packet can be removed from , where the indicator function indicates whether queue connects to at least one server in frame ; if so, one packet can be removed from queue with probability . The underlying idea of Alg. 1 is to remove as many packets in expectation as possible.
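A brute-force sketch of this max-weight decision might look as follows. It is illustrative only: it assumes the weight of scheduling a queue is its size times the product of its machines' completion probabilities (matching the expected-service-rate description above), and it enumerates all subsets, which is exactly the exponential cost addressed in Section III-C:

```python
from itertools import chain, combinations

def expected_service_weight(q, coflow, p):
    """Weight of serving a queue: queue size times the probability that
    all needed machines complete, i.e., the expected service rate."""
    if not any(coflow):
        return 0.0  # an empty coflow contributes nothing
    w = q
    for m, needed in enumerate(coflow):
        if needed:
            w *= p[m]
    return w

def max_weight_schedule(queues, coflows, p):
    """Exhaustively search all interference-free subsets of coflows and
    return the subset maximizing the total expected service weight.
    Exponential in the number of coflows -- hence the need for an
    approximation algorithm."""
    n = len(coflows)
    best, best_set = 0.0, []
    for subset in chain.from_iterable(combinations(range(n), k)
                                      for k in range(n + 1)):
        # discard subsets in which two coflows need the same machine
        used = [0] * len(p)
        ok = True
        for i in subset:
            for m, needed in enumerate(coflows[i]):
                if needed:
                    used[m] += 1
                    if used[m] > 1:
                        ok = False
        if not ok:
            continue
        w = sum(expected_service_weight(queues[i], coflows[i], p)
                for i in subset)
        if w > best:
            best, best_set = w, list(subset)
    return best_set, best
```

For example, with two coflows that both need machine 1, only one can be served, and the heavier-weighted one wins.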
After performing decision , Alg. 1 (in Line 1) updates each at the end of frame : if is scheduled, connects to at least one server, and all its connected servers complete their services, then one packet is removed from .
Example 4.
Take Fig. 2 as an example. Suppose that and for all , and . According to Line 1, Alg. 1 calculates and . Thus, Alg. 1 decides to serve in frame 1 (i.e., coflow in Fig. 1 is chosen for computing in frame 1). If servers , , and complete their services for queue in frame 1 (i.e., coflow is completed), then one packet is removed from queue at the end of frame 1, i.e., queue has packet at the end of frame 1.
Theorem 5.
Alg. 1 is throughputoptimal for the virtual queueing network, i.e., the corresponding scheduling algorithm for the distributed computing network is feasibilityoptimal.
Proof.
Let vector represent the state of the virtual queueing network in frame . Note that the state changes over frames but its probability distribution is i.i.d., according to the assumption in Section II-A. With the i.i.d. property of the state, the proof follows the standard Lyapunov argument of [18]. ∎
III-C Tractable approximate scheduling algorithm
We show (in the next lemma) that the combinatorial optimization problem in Line 1 of Alg. 1 is NP-hard. Therefore, Alg. 1 is computationally intractable.
Lemma 6.
The combinatorial optimization problem in Alg. 1 in frame is NP-hard, for all .
(3) 
To study the NP-hard problem, we define two notions of approximation ratio as follows. While Def. 7 concerns the resulting value of Eq. (2), Def. 8 concerns the resulting stability region.
Definition 7.
Definition 8.
A scheduling algorithm is called a approximate scheduling algorithm to if, for any arrival rate vector , the arrival rate vector lies in the stability region of the scheduling algorithm .
In this paper, we propose an approximate scheduling algorithm in Alg. 2. The procedure of Alg. 2 is similar to that of Alg. 1; hence, we point out the key differences below.
Unlike Alg. 1, which solves the combinatorial optimization problem exactly, Alg. 2 (in Line 2) simply sorts all queues according to the values computed by Eq. (3). If a queue connects to at least one server, then its value from Eq. (3) is its value from Eq. (2) divided by the square root of the number of its connected servers. In Line 2, we use to denote the queues sorted in descending order of the values from Eq. (3). Moreover, we use (in Line 2) to indicate whether connects to server in frame .
The underlying idea of Alg. 2 is to add each queue, in order, to the decision set if that queue does not cause interference. Precisely, Alg. 2 uses a set to record (in Line 2) the available servers that have not yet been allocated, where is initialized to in Line 2. Then, at the -th iteration of Line 2, Alg. 2 checks whether queue satisfies the two conditions in Line 2: the first means that connects to at least one server, and the second means that all its connected servers are available. If meets both conditions, then it is scheduled as in Line 2, and set is updated as in Line 2 by removing the servers allocated to queue . After deciding , Alg. 2 performs the decision in Line 2 for frame , followed by updating the queue sizes in Line 2.
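Under the same illustrative 0/1-vector representation used above, the greedy rule of Alg. 2 can be sketched as follows. The square-root-scaled ranking follows the description of Eq. (3); the names and data layout are our assumptions:

```python
import math

def greedy_schedule(queues, coflows, p):
    """Greedy sketch of the approximate scheduler: rank each nonempty
    coflow by its expected service weight divided by the square root of
    the number of machines it needs, then admit coflows in that order
    whenever all of their machines are still available."""
    ranked = []
    for i, coflow in enumerate(coflows):
        machines = [m for m, needed in enumerate(coflow) if needed]
        if not machines:
            continue  # a coflow with no tasks is never scheduled
        w = queues[i]
        for m in machines:
            w *= p[m]  # expected service weight, as in Eq. (2)
        ranked.append((w / math.sqrt(len(machines)), i, machines))
    ranked.sort(reverse=True)  # descending order of Eq. (3) values

    available = set(range(len(p)))  # servers not yet allocated
    schedule = []
    for _, i, machines in ranked:
        if all(m in available for m in machines):
            schedule.append(i)
            available -= set(machines)
    return sorted(schedule)
```

A single sort plus one linear pass replaces the exhaustive subset search, which is what makes the algorithm tractable.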
Example 9.
Proof.
See Appendix B. ∎
Remark 11.
We remark that the approximation ratio of is the best possible for Eq. (2). That is because the combinatorial optimization problem in Alg. 1 is computationally at least as hard as the set packing problem (see Lemma 6), and the best approximation ratio for the set packing problem is the square root (see [8]).
Theorem 12.
Alg. 2 is a approximate scheduling algorithm to .
Proof.
See Appendix C. ∎
IV Numerical results
In this section, we evaluate Algs. 1 and 2 via computer simulations. First, we consider two jobs and two computing machines. Fig. 3 displays the feasibility regions of both scheduling algorithms for various task generation probabilities of job , when and for all , , and are fixed. Fig. 4 displays the feasibility regions of both scheduling algorithms for various task completion probabilities of computing machine , when and for all , , and are fixed. Each point marked in Fig. 3 or 4 is a requirement such that the average number of completed coflows over 10,000 frames for job is no less than and that for job is no less than . Both figures show that Alg. 2 is not only computationally efficient but can also fulfill almost all requirements .
Second, we consider more jobs and more computing machines in equal numbers, i.e., . Moreover, all task completion probabilities are fixed to 0.9, i.e., for all and . Fig. 5 then displays the maximum requirements achievable by Alg. 2 when all task generation probabilities are the same. From Fig. 5, the maximum requirement achievable by Alg. 2 appears to decrease superlinearly with the number of computing machines.
V Concluding remarks
In this paper, we provided a framework for studying real-time coflows on unreliable computing machines. In particular, we developed two algorithms for scheduling coflows on shared computing machines. While the proposed feasibility-optimal scheduling algorithm supports the largest region of jobs' requirements, it involves an NP-hard optimization problem. In contrast, the proposed approximate scheduling algorithm is not only simple, but also has a provable guarantee on its achievable requirement region. Moreover, we note that coding techniques have been exploited to mitigate stragglers in distributed computing networks, e.g., [12, 24]; incorporating coding design into our framework is thus a promising direction.
Appendix A Proof of Lemma 6
We show a reduction from the set packing problem [8]: given a collection of nonempty sets over a universal set , for some positive integers and , the objective is to identify a subcollection of disjoint sets in that collection such that the number of sets in the subcollection is maximized.
For the given instance of the set packing problem, we construct queues and servers in the virtual queueing network. Consider a fixed frame . In frame , for each element we connect queue to server . With this transformation, the set packing problem is equivalent to identifying a set of interference-free queues in frame such that the number of queues in that set is maximized.
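The construction can be made concrete with a small sketch: one queue per set, one server per element, with a queue connected to a server exactly when the element belongs to the set. The adjacency-matrix representation below is our own illustrative encoding of the reduction:

```python
def set_packing_to_network(sets, universe_size):
    """Map a set packing instance to the frame-t connectivity of the
    virtual queueing network: connectivity[i][e] == 1 iff element e is
    in set i, i.e., queue i connects to server e in that frame."""
    connectivity = [[0] * universe_size for _ in sets]
    for i, s in enumerate(sets):
        for e in s:
            connectivity[i][e] = 1
    return connectivity
```

Two sets are disjoint exactly when the corresponding queues share no server, so a maximum subcollection of disjoint sets is a maximum interference-free set of queues.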
Moreover, consider no connection between the queues and the servers until frame , identical arrival rates for all , and identical task completion probabilities for all and . In this context, Eq. (2) in frame becomes
(4) 
because , (due to the nonempty sets for all ), and . As a result of the constant in Eq. (4), the objective of the combinatorial optimization problem in Alg. 1 in frame becomes identifying a set of interference-free queues such that the number of queues in that set is maximized.
Suppose there exists an algorithm that solves the combinatorial optimization problem in Alg. 1 in frame in polynomial time. Then, this polynomial-time algorithm identifies a set maximizing the value in Eq. (4) and, in turn, solves the set packing problem. That contradicts the NP-hardness of the set packing problem.
Because the above argument holds for all frames , we conclude that the combinatorial optimization problem in Alg. 1 in frame is NP-hard, for all .
Appendix B Proof of Lemma 10
Consider a fixed queue size vector in a fixed frame . Let be the value of Eq. (2) for queue . Without loss of generality, we assume that for all and further assume (by reordering the queue indices) that , i.e., Alg. 2 processes at the -th iteration of Line 2. Let be the decision of Alg. 2 for queue size vector . Then, we can express the value of Eq. (2) computed by Alg. 2 as
(5) 
Let be the decision of Alg. 1 for queue size vector . If the conditions in Line 2 of Alg. 2 hold for the -th iteration (i.e., ), then we let be a set of queues, where we use to represent the set of common servers for queues and . The set has the following properties:

For queue , we have
(6) since .

All queues in are interference-free, i.e., they connect to different servers, since . Moreover, queue connects to at least one of the servers of (i.e., ). Thus, we have
(7) 
Since all queues in connect to different servers, and there are servers in the virtual queueing network, we have
(8)
Appendix C Proof of Theorem 12
The proof of Theorem 12 needs the following technical lemma, whose proof follows the lines of [18], together with the i.i.d. property of the state (as discussed in the proof of Theorem 5) and the constant task completion probabilities .
Lemma 13.
If there exists a scheduling algorithm (that can depend on history) to stabilize all queues in the virtual queueing network, then there exists a stationary scheduling algorithm (i.e., decision depends on the state in frame only) to stabilize all queues.
With Lemma 13, we can focus on the stability regions of stationary scheduling algorithms. To analyze the stability region of Alg. 2, we leverage Lyapunov theory [18], as stated in the following lemma, where we consider the Lyapunov function .
Lemma 14.
Given arrival rate vector , if there exist constants and such that
for all frames under scheduling algorithm , then all queues are stable under the scheduling algorithm , i.e., .
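The displayed drift inequality of this lemma did not survive extraction. In the standard notation of [18] (a reconstruction, under the assumption that the Lyapunov function is the usual quadratic one over the queue sizes), the condition is the familiar Foster-Lyapunov drift criterion:

```latex
\mathbb{E}\Big[ L\big(\mathbf{q}(t+1)\big) - L\big(\mathbf{q}(t)\big) \,\Big|\, \mathbf{q}(t) \Big]
\;\le\; B - \epsilon \sum_{k=1}^{K} q_k(t),
\qquad \text{with } L\big(\mathbf{q}(t)\big) = \tfrac{1}{2} \sum_{k=1}^{K} q_k(t)^2,
```

so that whenever the total queue size is large, the expected drift is negative, which yields bounded time-average queue sizes.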
Then, we are ready to prove Theorem 12. Suppose that arrival rate vector lies in . According to Lemma 13, under arrival rate vector , all queues can be stabilized by a stationary scheduling algorithm. We denote that stationary scheduling algorithm by . Consider arrival rate vector where for all . Next, applying Lemma 14 to Alg. 2, we conclude that Alg. 2 stabilizes all queues under arrival rate vector because
where (a) follows from [18] with some constant ; (b) holds because and the approximation ratio of Alg. 2 with respect to Eq. (2) is (as stated in Lemma 10); (c) holds because Alg. 1 (in Line 1) maximizes the value of over all possible scheduling algorithms ; (d) holds because the decision under stationary scheduling algorithm depends only on the state (regardless of the queue sizes), and the state is i.i.d. over frames, yielding for all and ; (e) holds because for all , i.e., there exists an such that for all .
References
 [1] (2013) Effective Straggler Mitigation: Attack of the Clones. Proc. of NSDI, pp. 185–198. Cited by: §I.
 [2] (2019) Near Optimal Coflow Scheduling in Networks. Proc. of ACM SPAA, pp. 123–134. Cited by: §I.
 [3] (2012) Coflow: A Networking Abstraction for Cluster Applications. Proc. of ACM HotNets, pp. 31–36. Cited by: §I.
 [4] (2015) Efficient Coflow Scheduling without Prior Knowledge. Proc. of ACM SIGCOMM 45 (4), pp. 393–406. Cited by: §I.
 [5] (2014) Efficient Coflow Scheduling with Varys. Proc. of ACM SIGCOMM 44 (4), pp. 443–454. Cited by: §I.
 [6] (2008) MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM 51 (1), pp. 107–113. Cited by: §I.
 [7] (2006) Resource Allocation and Cross-Layer Control in Wireless Networks. Vol. 1, Now Publishers, Inc. Cited by: §III-A, §III-A.
 [8] (1998) Independent Sets with Domination Constraints. Proc. of ICALP, pp. 176–187. Cited by: Appendix A, §III-C, Remark 11.
 [9] (2013) Packets with Deadlines: A Framework for Real-Time Wireless Networks. Vol. 6, Morgan & Claypool Publishers. Cited by: §I, §II-A, §III-A, footnote 2.
 [10] (2019) Matroid Coflow Scheduling. Proc. of ICALP, pp. 1–14. Cited by: §I.
 [11] (2007) Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. ACM SIGOPS Operating Systems Review 41 (3), pp. 59–72. Cited by: §I.

 [12] (2017) Speeding Up Distributed Machine Learning Using Codes. IEEE Transactions on Information Theory 64 (3), pp. 1514–1529. Cited by: §V.
 [13] (2018) Efficient Scheduling for Synchronized Demands in Stochastic Networks. Proc. of IEEE WiOpt, pp. 1–8. Cited by: §I.
 [14] (2016) Efficient Online Coflow Routing and Scheduling. Proc. of ACM MobiHoc, pp. 161–170. Cited by: §I.
 [15] (2017) Coflow Scheduling in Input-Queued Switches: Optimal Delay Scaling and Algorithms. Proc. of IEEE INFOCOM, pp. 1–9. Cited by: §I.
 [16] (2016) Decentralized Deadline-Aware Coflow Scheduling for Datacenter Networks. Proc. of IEEE ICC, pp. 1–6. Cited by: §I.
 [17] (2016) Chronos: Meeting Coflow Deadlines in Data Center Networks. Proc. of IEEE ICC, pp. 1–6. Cited by: §I.
 [18] (2010) Stochastic Network Optimization with Application to Communication and Queueing Systems. Vol. 3, Morgan & Claypool Publishers. Cited by: Appendix C, Appendix C, Appendix C, §I, §III-A, §III-B, §III-B, footnote 2.
 [19] (2018) An Improved Bound for Minimizing the Total Weighted Completion Time of Coflows in Datacenters. IEEE/ACM Transactions on Networking 26 (4), pp. 1674–1687. Cited by: §I.
 [20] (2018) Coflow Deadline Scheduling via Network-Aware Optimization. Proc. of Allerton, pp. 829–833. Cited by: §I.
 [21] (2018) A Survey of Coflow Scheduling Schemes for Data Center Networks. IEEE Communications Magazine 56 (6), pp. 179–185. Cited by: §I.
 [22] (2019) Efficient Scheduling of Weighted Coflows in Data Centers. IEEE Transactions on Parallel and Distributed Systems 30 (9), pp. 2003–2017. Cited by: §I.
 [23] (2011) Better Never than Late: Meeting Deadlines in Datacenter Networks. Proc. of ACM SIGCOMM 41 (4), pp. 50–61. Cited by: §I.
 [24] (2019) Timely-Throughput Optimal Coded Computing over Cloud Networks. Proc. of ACM MobiHoc, pp. 301–310. Cited by: §V.
 [25] (2012) Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Proc. of NSDI, pp. 2–2. Cited by: §I.