A Case for Sampling Based Learning Techniques in Coflow Scheduling

by   Akshay Jajoo, et al.

Coflow scheduling improves data-intensive application performance by improving their networking performance. State-of-the-art online coflow schedulers in essence approximate the classic Shortest-Job-First (SJF) scheduling by learning the coflow size online. In particular, they use multiple priority queues to simultaneously accomplish two goals: to sieve long coflows from short coflows, and to schedule short coflows with high priorities. Such a mechanism pays high overhead in learning the coflow size: moving a large coflow across the queues delays small and other large coflows, and moving similar-sized coflows across the queues results in inadvertent round-robin scheduling. We propose Philae, a new online coflow scheduler that exploits the spatial dimension of coflows, i.e., a coflow has many flows, to drastically reduce the overhead of coflow size learning. Philae pre-schedules sampled flows of each coflow and uses their sizes to estimate the average flow size of the coflow. It then resorts to Shortest Coflow First, where the notion of shortest is determined using the learned coflow sizes and coflow contention. We show that the sampling-based learning is robust to flow size skew and has the added benefit of much improved scalability from reduced coordinator-local agent interactions. Our evaluation using an Azure testbed, a publicly available production cluster trace from Facebook shows that compared to the prior art Aalo, Philae reduces the coflow completion time (CCT) in average (P90) cases by 1.50x (8.00x) on a 150-node testbed and 2.72x (9.78x) on a 900-node testbed. Evaluation using additional traces further demonstrates Philae's robustness to flow size skew.



There are no comments yet.


page 1

page 2

page 3

page 4


Saath: Speeding up CoFlows by Exploiting the Spatial Dimension

Coflow scheduling improves data-intensive application performance by imp...

The Case for Task Sampling based Learning for Cluster Job Scheduling

The ability to accurately estimate job runtime properties allows a sched...

Scheduling With Inexact Job Sizes: The Merits of Shortest Processing Time First

It is well known that size-based scheduling policies, which take into ac...

QCluster: Clustering Packets for FlowScheduling

Flow scheduling is crucial in data centers, as it directly influences us...

An Adaptable and Agnostic Flow Scheduling Approach for Data Center Networks

Cloud applications have reshaped the model of services and infrastructur...

Size-based scheduling vs fairness for datacenter flows: a queuing perspective

Contrary to the conclusions of a recent body of work where approximate s...

Online Starvation Mitigation to Balance Average Flow Time and Fairness

In job scheduling, it is well known that Shortest Remaining Processing T...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Background and Problem Statement

We start with a brief review of the coflow abstraction and the need for non-clairvoyant coflow scheduling and state the network model. We then give an overview of existing online coflow schedulers and formally state the problem.

Coflow abstraction

In data-parallel applications such as Hadoop [1] and Spark [2], the job completion time heavily depends on the completion time of the communication stage [13, 21]. The coflow abstraction [20] was proposed to speed up the communication stage to improve application performance. A coflow is defined as a set of flows between several nodes that accomplish a common task. For example, in map-reduce jobs, the set of all flows from all map to all reduce tasks in a single job forms a typical coflow. The coflow completion time (CCT) is defined as the time duration between when the first flow arrives and the last flow completes. In such applications, improving CCT is more important than improving individual flows’ completion time (FCT) for improving the application performance [19, 22, 25, 32, 33].

Non-clairvoyant coflows

Data-parallel directed acyclic graphs (DAGs) typically have multiple stages which are represented as multiple coflows with dependencies between them. Recent systems (e.g., [23, 3, 31, 41]) employ optimizations that pipeline the consecutive computation stages which removes the barrier at the end of each coflow, making knowing flow sizes of each coflow beforehand difficult. A recent study [26] further shows various other reasons why it is not very plausible to learn flow sizes from applications, for example, learning flow sizes from applications requires changing either the network stack or the applications. Thus in this paper, we focus on non-clairvoyant coflow scheduling which do not assume knowledge about coflow characteristics such as flow sizes upon coflow arrival.

Non-blocking network fabric

We assume the same non-blocking network fabric model in recent network designs for coflows [22, 19, 32, 33, 8], where the datacenter network fabric is abstracted as a single non-blocking switch that interconnects all the servers, and each server (computing node) is abstracted as a network port that sends and receives flows. In such a model, the ports, i.e., server uplinks and downlinks, are the only source of contention as the network core is assumed to be able to sustain all traffic injected into the network. We note that the abstraction is to simplify our description and analysis, and is not required or enforced in our evaluation.

1.1 Prior-art on non-clairvoyant coflow scheduling

A classic approach to reduce average CCT is Shortest Coflow First (SCF) [19] (derived from classic SJF), where the coflow size is loosely defined as the total bytes of the coflow, i.e., sum of length of all its flows. However, using SCF online is not practical as it requires prior knowledge about the coflow sizes. This is further complicated as coflows arrive and exit dynamically and by other cluster dynamics such as failures and stragglers.

Aalo [19] was proposed to schedule coflows online without any prior knowledge. The key idea in Aalo is to approximate SCF by learning Coflow length using discrete priority queues. In particular, it starts a newly arrived coflow in the highest priority queue and gradually moves it to the lower priority queues when the total data sent by the coflow exceeds the per-queue thresholds.

The above idea of learning the order of jobs in a priority queues was originally applied to scheduling jobs on a single server. To apply it to scheduling coflows with many constituent flows over many network ports, i.e., in a distributed setting, Aalo uses a global coordinator to assign coflows to logical priority queues, and uses the total bytes sent by all flows of a coflow as its logical “length” in moving coflows across the queues. The logical priority queues are mapped to local priority queues at each port, and the individual local ports act independently in scheduling flows in its local priority queues, e.g., by enumerating flows from the highest to lowest priority queues and using FIFO to order the flows within each queue.

Generally speaking, using multiple priority queues in Aalo in this way has three effects: (1) Coflow segregation: It segregates long coflows (who will move to low priority queues) from short coflows who will finish while in high priority queues; (2) Finishing short coflows sooner: Since high priority queues receive more bandwidth allocation, short coflows will finish sooner (than longer ones); (3) Starvation avoidance: Using the FIFO policy for intra-queue scheduling provides starvation avoidance, since at every scheduling slot, each queue at each port receives a fixed bandwidth allocation and FIFO ensures that every Coflow (its flow) in each queue is never moved back.

Similar to Aalo [19], Graviton [32] also uses a logical priority queue structure. Unlike Aalo, Graviton uses sophisticated policies to sort coflows within a queue based on their width (total number of ports that a coflow is present on). Saath [33] is another priority queue based online coflow scheduler that improves over Aalo with three high-level design priniciples: (1) it schedules flows of a coflow in an all-or-none fashion to prevent flows of a coflow from going out-of-sync; (2) it incorporates contention, i.e., with how many other coflows a coflow is sharing ports with, into the metric for sorting coflows within a queue; (3) instead of using the total coflow size, it uses the length of the longest flow of a coflow to determine transition across priority queues, which helps in deciding the correct priority queue of a coflow faster.

1.2 Problem statement

Our goal is to develop an efficient non-clairvoyant coflow scheduler that optimizes the communication performance, in particular the average CCT, of data-intensive applications without prior knowledge, while guaranteeing starvation freedom and work conservation and being resilient to the network dynamics. The problem of non-clairvoyant coflow scheduling is NP-hard because coflow scheduling even assuming all coflows arrive at time 0 and their size are known in advance is already NP-hard [22]. Thus practical non-clairvoyant coflow schedulers are approximation algorithms. Our approach is to dynamically prioritize coflows by efficiently learning their flow sizes online.

2 Key Idea

Our new non-clairvoyant coflow scheduler design, Philae, is based on a key observation about coflows that a coflow has a spatial dimension, i.e., it typically consists of many flows. We thus propose to explicitly learn coflow sizes online by using sampling, a highly effective technique used in large-scale surveys [39]. In particular, Philae preschedules sampled flows, called pilot flows, of each coflow and uses their measured sizes to estimate the coflow size. It then resorts to SJF or variations using the estimated coflow sizes.

Developing a complete non-clairvoyant coflow scheduler based on the simple sampling idea raises three questions:

(1) Why is sampling more efficient than the priority-queue-based coflow size learning? Would scheduling the remaining flows after sampled pilot flows are completed adversely affect the coflow completion time?

(2) Will sampling be effective in the presence of skew of flow sizes?

(3) How to design the complete scheduler architecture? We answer the first two questions below, and present the complete architecture design in §LABEL:sec:design.

2.1 Why is sampling-based learning more efficient than priority-queue-based learning?

Scheduling pilot flows first before the rest of the flows can potentially incur two sources of overhead. First, scheduling pilot flows of a newly arriving coflow consumes port bandwidth which can delay other coflows (with already estimated sizes). However, compared to the multi-queue based approach, the overhead is much smaller for two reasons: (1) Philae schedules only a small subset of the flows (e.g., fewer than 1% for coflows with many flows). (2) Since the CCT of a coflow depends on the completion of its last flow, some of its earlier finishing flows could be delayed without affecting the CCT. Philae

exploits this observation and schedules pilot flows on the least-busy ports to increase the odds that it only affects earlier finishing flows of other coflows.

Second, scheduling pilot flows first may elongate the CCT of the newly arriving coflow itself whose other flows cannot start until the pilot flows finish. This is again typically insignificant for two reasons: (1) A coflow (e.g., from a MapReduce job) typically consists of flows from all sending ports to all receiving ports. Conceptually, pre-scheduling one out of multiple flows from each sender may not delay the coflow progress at that port, because all flows at that port have to be sent anyway. (2) Coflow scheduling is of high relevance in a busy cluster (when there is a backlog of coflows in the network), in which case the CCT of coflow is expected to be much higher than if it were the only coflow in the network, and hence the piloting overhead is further dwarfed by a coflow’s actual CCT.

2.2 Why is sampling effective in the presence of skew?

The flow sizes within a coflow may vary (skew). In this paper we measure skew as . Other papers like Varys[22] have used metrics like coefficient of variation to measure the skew. We used the ratio because it allows us to analyze the learning error without assuming the specific distribution of flow-sizes.

Intuitively, if the skew across flow sizes is small, sampling even a small number of pilot flows will be sufficient to yield an accurate estimate. Interestingly, even if the skew across flow sizes is large, our experiment indicates that sampling is still highly effective. In the following, we give both the intuition and theoretical underpinning for why sampling is effective.

Consider, for example, two coflows and the simple setting where both coflows share the same set of ports. In order to improve the average CCT, we wish to schedule the shorter coflow ahead of the longer coflow. If the total sizes of the two coflows are very different, then even a moderate amount of estimation error of the coflow sizes will not alter their ordering. On the other hand, if the total sizes of the two coflows are close to each other, then indeed the estimation errors will likely alter their ordering. However, in this case since their sizes are not very different anyway, switching the order of these two coflows will not significantly affect the average CCT.

Analytic results.

To illustrate the above effect, we show that the gap between the CCT based on sampling and assuming perfect knowledge is small, even under general flow size distributions. Specifically, coflows and have and flows, respectively. Here, we assume that and are fixed constants. Thus, by taking to be larger, we will be able to consider wider coflows.

Assume that each flow of (correspondingly, ) has a size that is distributed within a bounded interval [] ([]) with mean (), i.i.d. across flows. Let be the total completion time when the exact flow sizes are known in advance. Let be the average CCT by sampling and flows from and , respectively. Without loss of generality, we assume that .

Then, using Hoeffding’s Inequality, we can show that (see §LABEL:sec:appendix for detailed proof)


(Note that here also we have used the fact that, since both coflows share the same set of ports and is large, the CCT is asymptotically proportional to the coflow size.)

Equation (1) can be interpreted as follows. First, due to the first exponential term, the relative gap between and decreases as and decrease. In other words, as the skew of each coflow decreases, sampling becomes more effective. Second, when and are fixed, if is large (i.e., the two coflow sizes are very different), the value of the exponential function will be small. On the other hand, if is close to zero (i.e., the two coflow sizes are close to each other), the numerator on the second term on the right hand side will be small. In both cases, the relative gap between and will also be small, which is consistent with the intuition explained earlier. The largest gap occurs when is on the same order as . Finally, although these analytical results assume that both coflows share the same set of ports, similar conclusions on the impact of estimation errors due to sampling also apply under more general settings.

The above analytical results suggest that, when is large, the relative performance gap for CCT is a function of the number of pilot flows sampled for each coflow, but is independent of the total number of flows in each coflow. In practice, large coflows will dominate the total CCT in the system. Thus, these results partly explain that, while in our experiments the number of pilot flows is never larger than of the total number of flows, the performance of our proposed approach is already very good.

Finally, the above analytical results do not directly tell us how to choose the number of pilot flows, which likely depends on the probability distribution of the flow size. In practice, we do not know such distribution ahead of time. Further, while choosing a larger number of pilot flows reduces the estimation errors, it also incurs higher overhead and delay. Therefore, our design (§

LABEL:sec:design) needs to have practical solutions that carefully address these issues.

Error-correction. Readers familiar with the online learning and multi-armed bandit (MAB) literature [28, 16, 36, 14] will notice that our key idea above does not attempt to correct the errors in the initial sampling step. In particular, we did not use an iterative procedure to refine the initial estimates based on additional samples, e.g., as in the classical UCB (upper-confidence-bound) algorithm [14]. The reason is because, from our preliminary investigation (details are available in our online technical report [7].), we have found that straight-forward ways of applying UCB-type of algorithms do not work well for minimizing the total completion time. For instance, consider two coflows whose sizes are nearly identical. In order to identify which coflow is smaller, UCB-type of algorithms tend to alternately sample both coflows, which leads to nearly round-robin scheduling. While this is desirable for maximizing payoff (as in typical MAB problems), for minimizing completion time it becomes sub-optimal: Indeed, we should have instead let either one of coflows run to completion first, before the other coflow starts. Consistent with this intuition, our preliminary simulation results show that adding UCB-type of iterative error-correction actually degrades the performance of Philae: the average CCT improvement over Aalo reduces from 1.51 to 0.95. (Further details are available in our online technical report [7].) While these preliminary results do not preclude the possibilities that other iterative sampling algorithms may outperform Philae, it does illustrate that straight-forward extensions of UCB-type of ideas may not work well. How to find iterative sampling algorithms outperforming Philae remains an interesting direction for future work.

2.3 Scalability analysis

Compared to learning coflow sizes using priority queues (PQ-based) [19, 33], learning coflow sizes by sampling Philae not only reduces the learning overhead as discussed in §2.1 and shown in §LABEL:sec:sim:probeOverhead, but also significantly reduces the amount of interactions between the coordinator and local agents and thus makes the coordinator highly scalable, as summarized in Table 1.

First, PQ-based learning requires much more frequent update from local agents. PQ-based learning estimates coflow sizes by incrementally moving coflows across priority queues according to the data sent by them so far. As such, the scheduler needs frequent updates (every ms) of data sent per coflow from the local agents. In contrast, Philae directly estimates a coflow’s size upon the completion of all its pilot flows. The only updates Philae needs from the local agents are about the flow completion which is needed for updating contentions and removing flows from active consideration..

Second, PQ-based learning results in much more frequent rate allocation. In sampling-based approach, since coflow sizes are estimated only once, coflows are re-ordered only upon coflow completion or arrival events or in the case of contention based policies only when contention changes, which is triggered by completion of all the flows of a coflow at a port. In contrast, in PQ-based learning, at every interval, coflow data sent are updated and coflow priority may get updated, which will trigger new rate assignment.

Our scalability experiments in §4.3 confirms that Philae achieves much higher scalability than Aalo.

Update Update of Rate
of data sent flow completion calculation
Philae No Yes Event triggered
Aalo Periodic () Yes Periodic ()
Table 1: Comparison of frequency of interactions between the coordinator and local agents.

3 Implementation

We implemented both Philae and Aalo scheduling policies in the same framework consisting of the global coordinator and local agents (Fig. LABEL:fig:design:arch), in 5.2 KLoC in C++.

Coordinator: The coordinator schedules the coflows based on the operations received from the registering framework. The key implementation challenge for the coordinator is that it needs to be fast in computing and updating the schedules. The Philae coordinator is optimized for speed using a variety of techniques including pipelining, process affinity, and concurrency whenever possible.

Local agents: The local agents update the global coordinator only upon completion of a flow, along with its length if it is a pilot flow. Local agents schedule the coflows based on the last schedule received from the coordinator. They comply to the last schedule until a new schedule is received. To intercept the packets from the flows, local agents require the compute framework to replace datasend(), datarecv() APIs with the corresponding Philae APIs, which incurs very small overhead.

Coflow operations: The global coordinator runs independently from, and is not coupled to, any compute framework, which makes it general enough to be used with any framework. It provides RESTful APIs to the frameworks for coflow operations: (a) register() for registering a new coflow when it enters, (b) deregister() for removing a coflow when it exits, and (c) update() for updating coflow status whenever there is a change in the coflow structure, e.g., during task migration and restarts after node failures.

4 Testbed Evaluation

Next, we deployed Philae in a 150-machine Azure cluster and a 900-machine cluster to evaluate its performance and scalability.

Testbed setup: We rerun the FB trace on a Spark-like framework on a 150-node cluster in Microsoft Azure [5]. The coordinator runs on a Standard DS15 v2 server with 20-core 2.4 GHz Intel Xeon E5-2673 v3 (Haswell) processor and 140GB memory. The local agents run on D2v2 with the same processor as the coordinator with 2-core and 7GB memory. The machines on which local agents run have 1 Gbps network bandwidth. Similarly as in simulations, our testbed evaluation keeps the same flow lengths and flow ports in trace replay. All the experiments use default parameters and the default pilot flow selection policy.

4.1 CCT Improvement

In this experiment, we measure CCT improvements of Philae compared to Aalo. Fig. LABEL:fig:eval:jct-cct-speedUp shows the CDF of the CCT speedup of individual coflows under Philae compared to under Aalo. The average CCT improvement is 1.50 which is similar to the results in the simulation experiments. We also observe 1.63 P50 speedup and 8.00 P90 speedup.

We also evaluated Philae using the Wide-coflow-only trace. Table 2 shows that Philae achieves 1.52 improvement in average CCT over Aalo, similar to that using the full FB trace. This is because the improvement in average CCT is dominated by large coflows, Philae is speeding up large coflows, and the Wide-coflow-only trace consists of mostly large coflows.

P50 P90 Avg. CCT
FB Trace 1.63 8.00 1.50
Wide-coflow-only 1.05 2.14 1.49
Table 2: [Testbed] CCT improvement in Philae as compared to Aalo.

4.2 Job Completion Time

Next, we evaluate how the improvement in CCT affects the job completion time (JCT). In data clusters, different jobs spend different fractions of their total job time in data shuffle. In this experiment, we used 526 jobs, each corresponding to one coflow in the FB trace. The fraction of time that the jobs spent in the shuffle phase follows the same distribution used in Aalo [19], i.e., 61% jobs spent less than 25% of their total time in shuffle, 13% jobs spent 25-49%, another 14% jobs spent 50-74%, and the remaining spent over 75% of their total time in shuffle. Fig. LABEL:fig:eval:jct-cct-speedUp shows the CDF of individual speedups in JCT. Across all jobs, Philae reduces the job completion time by 1.16 in the median case and 7.87 in the 90 percentile. This shows that improved CCT translates into better job completion time. As expected, the improvement in job completion time is smaller than the improvement in CCT because job completion time depends on the time spent in both compute and shuffle (communication) stages, and Philae improves only the communication stage.

4.3 Scalability

Rate Calc. New Rate Send Update Recv. Total
Philae 2.99 (5.35) 4.90 (11.25) 6.89 (17.78) 14.80 (28.84)
Aalo 4.28 (4.14) 17.65 (20.90) 10.97 (19.98) 32.90 (34.09)
Table 3:

[Testbed] Average (standard deviation) coordinator CPU time (ms) per scheduling interval in 900-port runs.

Philae did not have to calculate and send new rates in 66% of intervals, which contributes to its low average.
150 ports 900 ports
Philae 1% 10%
Aalo 16% 37%
Table 4: [Testbed] Percentage of scheduling intervals where synchronization and rate calculation took more than for 150-port and for 900-port runs.

Finally, we evaluate the scalability of Philae by comparing its performance with Aalo on a 900-node cluster. To drive the evaluation, we derive a 900-port trace by replicating the FB trace 6 times across ports, i.e., we replicated each job 6 times, keeping the arrival time for each copy the same but assigning sending and receiving ports in increments of 150 (the cluster size for the original trace). We also increased the scheduling interval by 6 times to = 6.

Philae achieved 2.72 (9.78) speedup in average (P90) CCT over Aalo. The higher speedup compared to the 150-node runs (1.50) comes from higher scalability of Philae. In 900-node runs, Aalo was not able to finish receiving updates, calculating new rates and updating local agents of new rates within in 37% of the intervals, whereas Philae only missed the deadline in 10% of the intervals. For 150-node runs these values are 16% for Aalo and 1% for Philae. The 21% increase in missed scheduling intervals in 900-node runs in Aalo resulted in local agents executing more frequently with outdated rates. As a result, Philae achieved even higher speedup in 900-node runs.

As discussed in §2.3, Aalo’s poorer coordinator scalability comes from more frequent updates from local agents and more frequent rate allocation, which result in longer coordinator CPU time in each scheduling interval. Table 3 shows the average coordinator CPU usage per interval and its breakdown. We see that (1) on average Philae spends much less time than Aalo in receiving updates from local agents, because Philae does not need updates from local agents at every interval – on average in every scheduling interval Philae receives updates from 49 local agents whereas Aalo receives from 429 local agents, and (2) on average Philae spends much less time calculating new rates and send new rates. This is because rate calculation in Philae is triggered by events and Philae did not have to flush rates in 66% of the intervals.

4.4 Robustness to network error

As discussed in §3, unlike Aalo, Philae’s coordinator does not need constant updates from local agents to sort coflows in priority queues. This simplifies Philae’s design and makes it robust to network error. To evaluate the benefit of this property of Philae, we evaluated Aalo and Philae 5 times with the same configuration using the FB trace. Table 5 shows the mean-normalized standard deviation in the 10, 50, 90 percentile and the average CCT across the 5 runs. The lower values for Philae indicates that it is more robust to network dynamics than Aalo.

P10 P50 P90 Avg. CCT
Philae 6.1% 2.3% 0.1% 0.1%
Aalo 7.1% 4.4% 2.7% 1.6%
Table 5: [Testbed] Mean normalised standard deviation in CCT among Philae and Aalo.

4.5 Resource Utilization

Finally, we evaluate, for both Philae and Aalo, the resource utilization at the coordinator and the local agents (Table 6) in terms of CPU and memory usage. We measure the overheads in two cases: (1) Overall: average during the entire execution of the trace, (2) Busy: the 90-th percentile utilization indicating the performance during busy periods due to a large number of coflows arriving. As shown in Table 6, Philae agents have similar utilization as Aalo at the local nodes, where the CPU and memory utilization are minimal even during busy times. The global coordinator of Philae consumes much lower server resources than Aalo – the CPU utilization is 3.4 lower than Aalo on average, and 2.6 than Aalo during busy periods. This is due to Philae’s event triggered communication and sampling-based learning, which significantly lowers its communication frequency with local agents when compared to Aalo. The lower resource utilization of the global coordinator enables Philae to scale to a lager cluster size than Aalo.

Philae Aalo
Overall Busy Overall Busy
Coordi- CPU (%) 5.0 10.4 17.0 27.2
nator Memory (MB) 212 218 318 427
Local CPU (%) 4.3 4.6 4.5 4.6
node Memory (MB) 1.65 1.70 1.64 1.70
Table 6: [Testbed] Resource usage in Philae and Aalo for 150 ports experiment.

5 Related Work

Coflow scheduling: In this paper, we have shown Philae outperforms prior-art non-clairvoyant coflow scheduler Aalo from more efficient learning of coflow sizes online. In [19], Aalo was shown to outperform previous non-clairvoyant coflow schedulers Baraat [25] by using global coordination, and Orchestra [21] by avoiding head-of-line blocking.

Clairvoyant coflow schedulers such as Varys [22] and Sincronia [8]

assume prior knowledge of coflows upon arrival. Varys runs a shortest-effective-bottleneck-first heuristic for inter-coflow scheduling and performs per-flow rate allocation at the coordinator. Sincronia improves the scalability of the centralized coordinator of Varys by only calculating the coflow ordering at the coordinator (by solving an LP) and offloading flow rate allocation to individual local agents. Sincronia is orthogonal to

Philae; once coflow sizes are learned through sampling, ideas from Sincronia can be adopted in Philae to order coflows and offload rate allocation to local ports. Prioritized work conservation in Sincronia helps in mitigating the effect of starvation, which arises as a result of ordering coflows in the optimal order. However, it still does not guarantee complete freedom from starvation. CODA [48] tackles an orthogonal problem of identifying flows of individual coflows online.

However, recent studies [26, 19] have shown various reasons why it is not very plausible to learn flow sizes from applications beforehand. For example, many applications stream data as soon as data are generated and thus the application does not know the flow sizes until flow completion, and learning flow sizes from applications requires changing either the network stack or the applications.

Flow scheduling: There exist a rich body of prior work on flow scheduling. Efforts to minimize flow completion time (FCT), both with prior information (pFabric [10]) and without prior information (e.g., Fastpass [40]), fall short in minimizing CCTs which depend on the completion of the last flow [22]. Similarly, Hedera [9] and MicroTE [15] schedule the flows with the goal of reducing the overall FCT, which again is different from reducing the overall CCT of coflows.

Speculative scheduling Recent works [17, 38] use the idea of online requirement estimation for scheduling in datacenter. In [35], recurring big data analytics jobs are scheduled using their history.

Job scheduling: There have been much work on scheduling in analytic systems and storage at scale by improving speculative tasks [47, 13, 12], improving locality [45, 11], and end-point flexibility [18, 43]. The coflow abstraction is complimentary to these work, and can benefit from them. Combining coflow them remains a future work.

Scheduling in parallel processors: Coflow scheduling by exploiting the spatial dimension bears similarity to scheduling processes on parallel processors and multi-cores, where many variations of FIFO [42], FIFO with backfilling [37] and gang scheduling [27] have been proposed.

6 Conclusion

State-of-the-art online coflow schedulers approximate the classic SJF by implicitly learning coflow sizes and pay a high penalty for large coflows. We propose the novel idea of sampling in the spatial dimension of coflows to explicitly and efficiently learn coflow sizes online to enable efficient online SJF scheduling. Our extensive simulation and testbed experiments show the new design offers significant performance improvement over prior art. Further, the sampling-in-spatial-dimension technique can be generalized to other distributed scheduling problems such as cluster job scheduling. We have made our simulator publicly available at https://github.com/coflowPhilae/simulator [6].


  • [1] Apache hadoop. http://hadoop.apache.org.
  • [2] Apache spark. http://spark.apache.org.
  • [3] Apache tez. http://tez.apache.org.
  • [4] Coflow trace from facebook datacenter. https://github.com/coflow/coflow-benchmark.
  • [5] Microsoft azure. http://azure.microsoft.com.
  • [6] Philae simulator. https://github.com/coflowPhilae/simulator.
  • [7] Philae tech report. https://github.com/coflowPhilae/techreport-ton.
  • [8] Saksham Agarwal, Shijin Rajakrishnan, Akshay Narayan, Rachit Agarwal, David Shmoys, and Amin Vahdat. Sincronia: Near-optimal network design for coflows. SIGCOMM ’18, pages 16–29. ACM, 2018.
  • [9] Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. Hedera: Dynamic flow scheduling for data center networks. NSDI’10, pages 19–19. USENIX, 2010.
  • [10] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. pfabric: Minimal near-optimal datacenter transport. SIGCOMM ’13, pages 435–446. ACM.
  • [11] Ganesh Ananthanarayanan, Sameer Agarwal, Srikanth Kandula, Albert Greenberg, Ion Stoica, Duke Harlan, and Ed Harris. Scarlett: Coping with skewed content popularity in mapreduce clusters. EuroSys ’11, pages 287–300, New York, NY, USA, 2011. ACM.
  • [12] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. nsdi’13, pages 185–198, Berkeley, CA, USA, 2013. USENIX.
  • [13] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris.

    Reining in the outliers in map-reduce clusters using mantri.

    OSDI’10, pages 265–278, Berkeley, CA, USA, 2010. USENIX Association.
  • [14] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • [15] Theophilus Benson, Ashok Anand, Aditya Akella, and Ming Zhang. Microte: Fine grained traffic engineering for data centers. CoNEXT ’11, pages 8:1–8:12, New York, NY, USA, 2011.
  • [16] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • [17] Inho Cho, Keon Jang, and Dongsu Han. Credit-scheduled delay-bounded congestion control for datacenters. SIGCOMM ’17, pages 239–252, New York, NY, USA, 2017. ACM.
  • [18] Mosharaf Chowdhury, Srikanth Kandula, and Ion Stoica. Leveraging endpoint flexibility in data-intensive clusters. SIGCOMM ’13, pages 231–242, New York, NY, USA, 2013. ACM.
  • [19] Mosharaf Chowdhury and Ion Stoica. Efficient coflow scheduling without prior knowledge. SIGCOMM ’15, pages 393–406. ACM.
  • [20] Mosharaf Chowdhury and Ion Stoica. Coflow: A networking abstraction for cluster applications. HotNets-XI, pages 31–36, New York, 2012.
  • [21] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, and Ion Stoica. Managing data transfers in computer clusters with orchestra. SIGCOMM ’11, pages 98–109. ACM, 2011.
  • [22] Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. Efficient coflow scheduling with varys. SIGCOMM ’14, pages 443–454, 2014.
  • [23] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. Mapreduce online. NSDI’10, pages 21–21, Berkeley, CA, USA, 2010. USENIX.
  • [24] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In Commun. ACM, volume 51, pages 107–113. ACM, January 2008.
  • [25] Fahad R. Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony Rowstron. Decentralized task-aware scheduling for data center networks. SIGCOMM ’14, pages 431–442, New York, NY, USA, 2014. ACM.
  • [26] Vojislav Dukic, Sangeetha Abdu Jyothi, Bojan Karlas, Muhsen Owaida, Ce Zhang, and Ankit Singla. Is advance knowledge of flow sizes a plausible assumption? NSDI, pages 565–580, Boston, MA, 2019. USENIX.
  • [27] Dror G. Feitelson and Morris A. Jette. Improved utilization and responsiveness with gang scheduling. IPPS ’97, pages 238–261, London, UK, UK, 1997. Springer-Verlag.
  • [28] John Gittins, Kevin Glazebrook, and Richard Weber. Multi-armed bandit allocation indices. John Wiley & Sons, 2011.
  • [29] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • [30] Xin Sunny Huang, Xiaoye Steven Sun, and T.S. Eugene Ng. Sunflow: Efficient optical circuit scheduling for coflows. CoNEXT ’16. ACM.
  • [31] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. EuroSys ’07, pages 59–72. ACM, 2007.
  • [32] Akshay Jajoo, Rohan Gandhi, and Y. Charlie Hu. Graviton: Twisting space and time to speed-up coflows. In 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16), Denver, CO, 2016. USENIX.
  • [33] Akshay Jajoo, Rohan Gandhi, Y. Charlie Hu, and Cheng-Kok Koh. Saath: Speeding up coflows by exploiting the spatial dimension. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies, CoNEXT ’17, pages 439–450. ACM.
  • [34] Akshay Jajoo, Y. Charlie Hu, and Xiaojun Lin. Your coflow has many flows: Sampling them for fun and speed. In 2019 USENIX Annual Technical Conference, pages 833–848, Renton, WA.
  • [35] Virajith Jalaparti, Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, and Matthew Caesar. Network-aware scheduling for data-parallel jobs: Plan when you can. SIGCOMM ’15, pages 407–420. ACM.
  • [36] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • [37] David A. Lifka. The anl/ibm sp scheduling system. IPPS ’95, pages 295–303, London, UK, UK, 1995. Springer-Verlag.
  • [38] Masoud Moshref, Minlan Yu, Ramesh Govindan, and Amin Vahdat. Trumpet: Timely and precise triggers in data centers. SIGCOMM ’16, pages 129–143, New York, NY, USA, 2016. ACM.
  • [39] Stanley Lemeshow Paul S. Levy. Sampling of Populations: Methods and Applications. Wiley, 4 edition, Jun 2012.
  • [40] Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah, and Hans Fugal. Fastpass: A centralized ”zero-queue” datacenter network. SIGCOMM ’14, pages 307–318. ACM, 2014.
  • [41] Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. Dandelion: A compiler and runtime for heterogeneous systems. SOSP ’13, pages 49–68, New York, 2013.
  • [42] Uwe Schwiegelshohn and Ramin Yahyapour. Analysis of first-come-first-serve parallel job scheduling. SODA ’98, pages 629–638, Philadelphia, PA, USA, 1998. Society for Industrial and Applied Mathematics.
  • [43] David Shue, Michael J. Freedman, and Anees Shaikh. Performance isolation and fairness for multi-tenant cloud storage. OSDI’12, pages 349–362, Berkeley, CA, USA, 2012.
  • [44] Abraham Silberschatz, Peter B. Galvin, and Greg Gagne. Process Scheduling. Operating System Concepts. John Wiley & Sons, 8 edition.
  • [45] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. EuroSys ’10, pages 265–278, New York, NY, USA, 2010. ACM.
  • [46] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud’10, pages 10–10, Berkeley, CA, USA, 2010.
  • [47] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving mapreduce performance in heterogeneous environments. OSDI’08, pages 29–42, Berkeley, CA, USA, 2008. USENIX.
  • [48] Hong Zhang, Li Chen, Bairen Yi, Kai Chen, Mosharaf Chowdhury, and Yanhui Geng. Coda: Toward automatically identifying and scheduling coflows in the dark. SIGCOMM ’16, pages 160–173. ACM.