BoPF: Mitigating the Burstiness-Fairness Tradeoff in Multi-Resource Clusters

by Tan N. Le, et al.

Simultaneously supporting latency- and throughput-sensitive workloads in a shared environment is an increasingly common challenge in big data clusters. Despite many advances, existing cluster schedulers force the same performance goal - fairness in most cases - on all jobs. Latency-sensitive jobs suffer, while throughput-sensitive ones thrive. Using prioritization does the opposite: it opens up a path for latency-sensitive jobs to dominate. In this paper, we tackle the challenges in supporting both short-term performance and long-term fairness simultaneously with high resource utilization by proposing Bounded Priority Fairness (BoPF). BoPF provides short-term resource guarantees to latency-sensitive jobs and maintains long-term fairness for throughput-sensitive jobs. BoPF is the first scheduler that can provide long-term fairness, burst guarantee, and Pareto efficiency in a strategyproof manner for multi-resource scheduling. Deployments and large-scale simulations show that BoPF closely approximates the performance of Strict Priority as well as the fairness characteristics of DRF. In deployments, BoPF speeds up latency-sensitive jobs by 5.38 times compared to DRF, while still maintaining long-term fairness. In the meantime, BoPF improves the average completion times of throughput-sensitive jobs by up to 3.05 times compared to Strict Priority.



1 Introduction

Cloud computing infrastructures are increasingly being shared between diverse workloads with heterogeneous resource requirements. In particular, throughput-sensitive batch processing systems [28, 45, 58, 11] are often complemented by latency-sensitive interactive analytics [77, 2, 63] and online stream processing systems [78, 70, 3, 62, 17]. Simultaneously supporting these workloads is a balancing act between distinct performance metrics. For example, Figure 1 shows a cluster scheduling a mix of jobs from a throughput-sensitive queue (TQ) and a latency-sensitive queue (LQ). Batch processing workloads such as indexing [28] and log processing [10, 77] may submit hours-long large jobs via TQ. The average amount of resources received over a certain period of time is critical for these jobs. In contrast, interactive [77, 63] and online streaming [78, 70] workloads respectively submit on-demand and periodic smaller jobs via LQ. Therefore, receiving enough resources immediately for an LQ job is more important than the average resources received over longer time intervals.

Figure 1: Users and automated processes submit throughput-sensitive (TQ) and latency-sensitive (LQ) jobs to the same cluster.

To address the diverse goals, today’s schedulers are becoming more and more complex. They are multi-resource [36, 39, 51, 24, 8], DAG-aware [23, 39, 77], and allow a variety of constraints [79, 46, 7, 37, 76]. Given all these inputs, they optimize for objectives such as fairness [36, 47, 35, 13], performance [34], efficiency [39], or different combinations of the three [40, 41]. However, most existing schedulers have one shortcoming in common: they force the same performance goal on all jobs while jobs may have distinct goals; therefore, they fail to provide performance guarantee in the presence of multiple types of workloads with different performance metrics. In fact, the performance of existing schedulers can be arbitrarily bad for some workloads.

(a) DRF ensures instantaneous fairness, but increases the completion times of latency-sensitive jobs.
(b) SP decreases completion times of latency-sensitive jobs, but batch jobs do not receive their fair shares.
(c) The ideal solution allows the first two latency-sensitive jobs to finish as quickly as possible, but protects batch jobs from later LQ jobs by ensuring long-term fairness.
Figure 2: Need for bounded priority and long-term fairness in a shared multi-resource cluster with latency-sensitive (LQ: blue/dark) and throughput-sensitive (TQ: orange/light) jobs. The blank part on the top is due to resource fragmentation and overheads in Apache YARN. Although we focus only on memory allocations here, similar observations hold in multi-resource scenarios.

Consider the simple example in Figure 2 that illustrates the inefficiencies of existing resource allocation policies, in particular, DRF [36] and Strict Priority (SP) [54] for coexisting workloads with different requirements. In this example, we run memory-bound jobs in a 40-node cluster, where each node has 32 CPU cores and 64 GB RAM. There are two queues, where each queue contains a number of jobs with the same performance goals. Apache Hadoop YARN [75] is set up to manage the resource allocation among the queues. The first queue is for Spark streaming, which submits a simple MapReduce-like job every 10 minutes. We call it the latency-sensitive queue (LQ) because it aims to finish each of the jobs as quickly as possible. The second queue is a throughput-sensitive batch-job queue (TQ) formed by jobs generated from the BigBench workload [14] and queued up at the beginning. TQ cares more about the long-term average of the resources it receives, e.g., every 10 minutes. For simplicity of exposition, all jobs are memory-bound. We consider two common classes of policies – priority-based and fairness-based allocation – in this example, where the former is optimized for latency and the latter for fairness. The memory resource consumption under these two policies is depicted in Figures 2(b) and 2(a), respectively. We defer the discussion of other policies to Section 2.3.

SP gives LQ the whole cluster’s resources (high priority) whenever it has jobs to run; hence, it provides the lowest possible response time. For the first two arrivals, the average response time is 130 seconds. A detrimental side effect of SP, however, is that there is no resource isolation – TQ jobs may not receive any resources at all! In particular, LQ is incentivized to increase its arrival rate – e.g., for more accurate sampling and more iterations in training neural networks – without any punishment. As it does so from the third job arrival, TQ no longer receives its fair share. In the worst case, LQ can take all the system resources and starve TQ. In summary, SP provides the best response time for LQ, but no performance isolation for TQ at all. In addition, SP is incapable of handling multiple LQs.

In contrast, DRF enforces instantaneous fair allocation of resources at all times. During the burst of LQ, LQ and TQ share the bottleneck resource (memory) evenly until the jobs from LQ complete; then TQ gets all resources before the next burst of LQ. Clearly, TQ is happy at the cost of longer completion times of LQ’s jobs, whose response time increases by 1.6 times. In short, DRF provides the best performance isolation for TQ, but no performance consideration for LQ. When there are many TQs, the response time of LQ can be very large.

Clearly, it is impossible to achieve the best response time under instantaneous fairness. In other words, there is a hard tradeoff between providing instantaneous fairness for TQs and minimizing the response time of LQs. Consequently, we aim to answer the following fundamental question in this paper: how well can we simultaneously accommodate multiple classes of workloads with performance guarantees, in particular, performance isolation for TQs and low response times for LQs?

Figure 3: BoPF in the cluster scheduling design space.

We answer this question by designing BoPF: the first multi-resource scheduler that achieves both performance isolation for TQs in terms of long-term fairness and response time guarantees for LQs, and is strategyproof. It is simple to implement and provides significant performance improvements even in the presence of uncertainties. The key idea is “bounded” priority for LQs: as long as a burst is not large enough to hurt the long-term fair share of TQs, LQs are given higher priority so that their jobs can be completed as quickly as possible. Figure 3 shows BoPF in the context of the cluster scheduling landscape.

We make the following contributions in this paper.

Algorithm design. We develop BoPF with the rigorously proven properties of strategyproofness, short-term burst guarantees, long-term fairness, and high system utilization (§3). When LQs have different demands for each arrival, we further design a mechanism to handle the uncertainties.

Design and implementation. We have implemented BoPF on Apache YARN [71] (§4). Any framework that runs on YARN can take advantage of BoPF. The BoPF scheduler is implemented as a new scheduler in the Resource Manager that runs on the master node. The scheduling overheads for admitting queues or allocating resources are negligible: less than 1 ms for 20,000 queues.

Evaluation on both testbed experiments and large-scale simulations. In deployments, BoPF speeds up LQ jobs by up to 5.38 times compared to DRF, while maintaining the same long-term fairness (§5.2). At the same time, BoPF allocates resources to TQ jobs more fairly than SP, improving their average completion times by up to 3.05 times.

2 Motivation

2.1 Benefits of Temporal Co-Scheduling

Consider the example in Figure 2 again. Recall that SP and DRF are two extreme cases in trading off performance and fairness: SP provides the best performance (for LQs) with no fairness consideration (for TQs); DRF ensures the best isolation (for TQs) with poor performance (of LQs). However, it is still possible for LQs and TQs to share the cluster by thoughtful co-scheduling over time.

The ideal allocation is depicted in Figure 2(c). The key idea is “bounded” priority for LQs as we discussed in the previous section. In particular, before 1,400 seconds, LQ’s bursts are small, so it gets higher priority, which is similar to SP. After LQ increases its demand, only a fraction of its demand can be satisfied with the entire system’s resources. Then it has to give resources back to TQ to ensure long-term fairness.

2.2 Desired Properties

We restrict our attention in this paper to the following, important properties: burst guarantee for LQs, long-term fairness for TQs, strategyproofness, and Pareto efficiency to improve cluster utilization.

Burst guarantee (BG) provides a performance guarantee for LQs by allocating a guaranteed amount of resources during their bursts. In particular, an LQ requests the minimum resources required for its bursts to satisfy its service level agreements, e.g., percentiles of response time.

Long-term fairness (LF) provides every queue in the system the same amount of resources over a (long) period, e.g., 10 minutes. Overall, it ensures that TQs progress no slower than any LQ in the long run. LF implies sharing incentive, which requires that each queue should be better off sharing the cluster than exclusively using its own static share of the cluster. If there are $N$ queues, each queue cannot exceed $1/N$ of all resources under static sharing. (For simplicity of presentation, we consider queues with the same weights; the analysis can be easily extended to queues with different weights.)

Strategyproofness (SPF) ensures that queues cannot benefit by lying about their resource demands. This provides incentive compatibility, as a queue cannot improve its allocation by lying.

Pareto efficiency (PE) is about the optimal utilization of the system. A resource allocation is Pareto efficient if it is impossible to increase the allocation/utility of a queue without hurting at least one other queue.

2.3 Analysis of Existing Policies

Property SP DRF M-BVT BoPF
Burst Guarantee (BG) ✓* ✓*
Long-Term Fairness (LF)
Strategyproofness (SPF)
Pareto Efficiency (PE)
Single Resource Fairness
Bottleneck Fairness
Population Monotonicity
Table 1: Properties of existing policies and BoPF. ✓* means that the property holds when there is only one LQ.

Strict Priority (SP): SP is employed to provide performance guarantees for LQs. As the name suggests, an SP scheduler always prioritizes LQs. Therefore, when there is only one LQ, SP provides the best possible performance guarantee. However, when there is more than one LQ, it is impossible to give all of them the highest priority. Meanwhile, TQs may not receive enough resources, which violates long-term fairness. As the LQs may request more resources than what they actually need, strategyproofness is not enforced, and therefore the system may waste some resources – i.e., it is not Pareto efficient.


DRF: DRF is an extension of max-min fairness to the multi-resource environment, where the dominant share is used to map the resource allocation (as a vector) to a scalar value. It provides instantaneous fairness, strategyproofness, and Pareto efficiency. However, because DRF is an instantaneous allocation policy without any memory, it cannot prioritize jobs with more urgent deadlines. In particular, no burst guarantee is provided. Even assigning queues different weights does not help, because the weights are homogeneous over time and cannot provide the burst guarantee needed. In addition, there is no admission control. Therefore, as the number of queues increases, no queue’s performance can be guaranteed.
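To make the role of the dominant share concrete, the following is a minimal Python sketch of DRF by progressive filling. The function names and the task-granular loop are illustrative assumptions, not code from any scheduler; the sketch also assumes every queue has a non-zero per-task demand. It shows how a vector allocation is reduced to a scalar and how the queue with the smallest scalar is served next.

```python
from typing import Dict, List


def dominant_share(allocation: List[float], capacity: List[float]) -> float:
    """Map an allocation vector to a scalar: its largest per-resource share."""
    return max(a / c for a, c in zip(allocation, capacity))


def drf_progressive_filling(task_demand: Dict[str, List[float]],
                            capacity: List[float]) -> Dict[str, int]:
    """Repeatedly grant one task to the queue with the smallest dominant
    share; stop when that queue's next task no longer fits."""
    if not task_demand:
        return {}
    used = [0.0] * len(capacity)
    alloc = {q: [0.0] * len(capacity) for q in task_demand}
    tasks = {q: 0 for q in task_demand}
    while True:
        q = min(task_demand, key=lambda name: dominant_share(alloc[name], capacity))
        need = task_demand[q]
        if any(u + n > c for u, n, c in zip(used, need, capacity)):
            return tasks  # cluster is full for the neediest queue
        used = [u + n for u, n in zip(used, need)]
        alloc[q] = [a + n for a, n in zip(alloc[q], need)]
        tasks[q] += 1


# Illustrative input: 9 CPUs and 18 GB; queue A needs <1 CPU, 4 GB> per task,
# queue B needs <3 CPUs, 1 GB> per task.
result = drf_progressive_filling({"A": [1.0, 4.0], "B": [3.0, 1.0]}, [9.0, 18.0])
```

With these inputs both queues end at a dominant share of 2/3 (A on memory, B on CPU). Note that nothing in the loop depends on past allocations, which is why DRF on its own cannot favor a burst.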

M-BVT: BVT [31] is a thread-based CPU scheduler for a mix of real-time and best-effort tasks. The idea is that for real-time tasks, BVT allows them to borrow some virtual time (and therefore resources) from the future and be prioritized for a period without increasing their long-term shares.

Since BVT was designed for a single-resource environment, we extend the idea of BVT to M-BVT for multiple resources. Under the M-BVT policy, LQ-$i$ is assigned a virtual time warp parameter $W_i$, which represents the urgency of the queue. Upon an arrival of its burst at $A_i$, an effective virtual time $E_i = A_i - W_i$ is calculated. This is used as the priority (smaller $E_i$ means higher priority) for scheduling. When LQ-$i$ has the only smallest $E_i$, it may use the whole system’s resources and its $E_i$ increases at the rate of its progress calculated by DRF. Eventually, its $E_i$ is no longer the only smallest. Then resources are shared in a DRF-fashion among queues with the smallest virtual times.
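The priority rule can be sketched in a few lines of Python. This is only an illustration under the assumption that the effective virtual time is the actual virtual time minus the warp, as in BVT; the function names and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def effective_virtual_time(actual_vt: float, warp: float) -> float:
    """Effective virtual time: actual virtual time minus the warp."""
    return actual_vt - warp


def queues_to_serve(queues: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the queues holding the smallest effective virtual time.

    `queues` maps a queue name to (actual virtual time, warp). A single
    winner may use the whole cluster; several winners would share it
    DRF-style.
    """
    evt = {name: effective_virtual_time(a, w) for name, (a, w) in queues.items()}
    smallest = min(evt.values())
    return sorted(name for name, e in evt.items() if e == smallest)


# A warped LQ is served alone until its virtual time catches up with the TQ.
during_burst = queues_to_serve({"lq": (100.0, 30.0), "tq": (90.0, 0.0)})
after_progress = queues_to_serve({"lq": (120.0, 30.0), "tq": (90.0, 0.0)})
```

A larger warp simply moves a queue ahead of the others, which is the lever a newly arriving queue can use against existing LQs.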

M-BVT has some good properties. For instance, the DRF component ensures long-term fairness, and the BVT component strives for performance. Pareto efficiency follows from the work conservation of the policy.

However, it does not provide general burst guarantees as any new arriving queue (with larger virtual time warp parameter) may occupy the resources of existing LQs or share resources with them, thus hurting their completion time. In addition, it is not strategyproof because queues can lie about their needs in order to get a larger virtual time warp.

Other policies such as CEEI [61] provide fewer of the desired properties.

2.4 Summary of the Tradeoffs

As listed in Table 1, no prior policy can simultaneously provide all the desired properties of fairness/isolation for TQs while providing burst guarantees for all the LQs with strategyproofness. In particular, if strict priority is provided to an LQ without any restriction for its best performance (e.g., SP), there is no isolation protection for TQs’ performance. On the other hand, if strictly instantaneous fairness is enforced (e.g., DRF), there is no room to prioritize short-term bursts. While the idea in M-BVT is reasonable, it is not strategyproof and cannot provide burst guarantees.

The key question of the paper is, therefore, how to allocate system resources in a near-optimal way; meaning, satisfying all the critical properties in Table 1.

3 BoPF: A Scheduler With Memory

In this section, we first present the problem setting (§3.1) and then formally model the problem (§3.2). BoPF achieves the desired properties by admission control, guaranteed resource provision, and spare resource allocation (§3.3). Finally, we prove that BoPF satisfies all the properties in Table 1 (§3.4).

3.1 Problem Settings

We consider a system with $K$ resources. The capacity of resource $k$ is denoted by $C^k$. The system resource capacity is therefore a vector $\vec{C} = \langle C^1, C^2, \dots, C^K \rangle$. For ease of exposition, we assume $\vec{C}$ is a constant over time, but our methodology applies directly to the cases with time-varying $\vec{C}(t)$, e.g., with estimations of $\vec{C}(t)$ at the beginning and leveraging stochastic optimization [64] and online algorithm design [48].

We restrict our attention to LQs for interactive sessions and streaming applications, and TQs for batch jobs.

LQ-$i$’s demand comes from a series of bursts, each consisting of a number of jobs. We denote by $T_i(n)$ the arrival time of the $n$-th burst, which must be finished within $t_i(n)$. Therefore, its $n$-th burst needs to be completed by $T_i(n) + t_i(n)$ (i.e., the deadline). Denote the demand of its $n$-th arrival by a vector $\vec{d_i}(n) = \langle d_i^1(n), \dots, d_i^K(n) \rangle$, where $d_i^k(n)$ is the demand on resource-$k$.

In practice, inter-arrival time between consecutive bursts can be fixed for some applications such as Spark Streaming [78], or it may vary for interactive user sessions. In general, the duration is quite short, e.g., several minutes. Similarly, the demand vector may contain some uncertainties, and we assume that queues have their own estimations. Therefore, our approach has to be strategyproof so that queues report their estimated demand, as well as their true deadlines.

To enforce the long-term fairness, the total demand of LQ-$i$’s $n$-th arrival, $\vec{d_i}(n)$, should not exceed its fair share, which can be calculated by a simple fair scheduler – i.e., $\frac{\vec{C}\,(T_i(n+1) - T_i(n))}{N}$, when there are $N$ queues admitted by BoPF – or a more complicated one such as DRF. We adopt the former in analysis because it provides a more conservative evaluation of the improvements brought by BoPF.

In contrast, TQ’s jobs are queued at the beginning with much larger demand than each burst of LQs.

Notation        Description
$\mathbb{H}$    Admitted LQs with hard guarantee
$\mathbb{S}$    Admitted LQs with soft guarantee
$\mathbb{E}$    Admitted TQs and LQs with fair share only
Table 2: Important notations

3.2 Modeling the Problem

Completion time: Let us denote by $R_i(n)$ the (last) completion time of jobs during LQ-$i$’s $n$-th arrival. If LQ-$i$ is admitted with hard guarantee, we ensure that a large fraction $\alpha_i$ of arrivals are completed before deadlines ($\alpha_i$ can be 95% or 99% depending on the SLA); i.e., $\sum_{n=1}^{N_i} \mathbf{1}\{R_i(n) \le T_i(n) + t_i(n)\} \ge \alpha_i N_i$, where $\mathbf{1}\{\cdot\}$ is the indicator function which equals 1 if the condition is satisfied and 0 otherwise, and $N_i$ is the number of arrivals of LQ-$i$. A more general function is left as future work. If LQ-$i$ is admitted with only soft/best-effort guarantee, we maximize the fraction of arrivals completed on time.
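As a small illustration of the hard-guarantee requirement, the Python sketch below counts the arrivals that finish by their deadlines and compares the fraction with the target. It assumes completion times are absolute and the deadline of an arrival is its arrival time plus its allowed period; the function names and sample values are hypothetical.

```python
from typing import Sequence


def on_time_fraction(completions: Sequence[float],
                     arrivals: Sequence[float],
                     periods: Sequence[float]) -> float:
    """Fraction of arrivals whose last job completes by arrival + period."""
    met = sum(1 for r, big_t, t in zip(completions, arrivals, periods)
              if r <= big_t + t)
    return met / len(arrivals)


def meets_hard_guarantee(completions: Sequence[float],
                         arrivals: Sequence[float],
                         periods: Sequence[float],
                         alpha: float) -> bool:
    """True when at least a fraction `alpha` of the arrivals is on time."""
    return on_time_fraction(completions, arrivals, periods) >= alpha


# Four bursts arriving every 600 s, each allowed 120 s; the third one is late.
arrivals = [0.0, 600.0, 1200.0, 1800.0]
periods = [120.0] * 4
completions = [100.0, 715.0, 1330.0, 1915.0]
fraction = on_time_fraction(completions, arrivals, periods)
```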

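Long-term fairness: Denote by $\vec{a_i}(t)$ and $\vec{e_j}(t)$ the resources allocated for LQ-$i$ and TQ-$j$ at time $t$, respectively. For a possibly long evaluation interval $[t, t+T]$ during which there is no new admission or exit, the average resource guarantees received are calculated as $\frac{1}{T}\int_{t}^{t+T} \vec{a_i}(\tau)\,d\tau$ and $\frac{1}{T}\int_{t}^{t+T} \vec{e_j}(\tau)\,d\tau$. We require that the allocated dominant resource, i.e., the largest amount of resource allocated across all resource types, received by any TQ is no smaller than that received by any LQ. Formally, $\max_{k} \frac{1}{T}\int_{t}^{t+T} a_i^k(\tau)\,d\tau \le \max_{k} \frac{1}{T}\int_{t}^{t+T} e_j^k(\tau)\,d\tau$ for all $i \in \mathbb{A}$ and $j \in \mathbb{B}$, where $\mathbb{A}$ and $\mathbb{B}$ are the sets of admitted LQs and TQs, respectively, and $a_i^k(t)$ and $e_j^k(t)$ are the type-$k$ resources allocated to LQ-$i$ and TQ-$j$ at time $t$, respectively. This condition provides long-term protection for admitted TQs.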
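The long-term fairness condition can be checked on sampled allocations, as in the Python sketch below. It approximates the time average by the mean over equally spaced samples, and it normalizes each resource by its capacity so that different resource types are comparable; that normalization is an assumption of this sketch (pass a capacity of 1 per resource if allocations are already expressed as fractions). All names are illustrative.

```python
from typing import Dict, List

Samples = List[List[float]]  # one allocation vector per equally spaced time step


def average_allocation(samples: Samples) -> List[float]:
    """Per-resource mean over the evaluation interval."""
    steps = len(samples)
    return [sum(column) / steps for column in zip(*samples)]


def long_term_dominant_share(samples: Samples, capacity: List[float]) -> float:
    """Largest capacity-normalized average allocation across resource types."""
    avg = average_allocation(samples)
    return max(a / c for a, c in zip(avg, capacity))


def is_long_term_fair(lq_samples: Dict[str, Samples],
                      tq_samples: Dict[str, Samples],
                      capacity: List[float]) -> bool:
    """Every TQ must receive at least as much as every LQ over the interval."""
    lq_max = max((long_term_dominant_share(s, capacity)
                  for s in lq_samples.values()), default=0.0)
    tq_min = min((long_term_dominant_share(s, capacity)
                  for s in tq_samples.values()), default=float("inf"))
    return tq_min >= lq_max


capacity = [10.0, 10.0]
bursty_lq = {"lq1": [[10.0, 2.0], [0.0, 0.0]]}   # whole cluster, then idle
greedy_lq = {"lq1": [[10.0, 0.0], [10.0, 0.0]]}  # never yields
batch_tq = {"tq1": [[0.0, 0.0], [10.0, 4.0]]}
```

The bursty LQ takes the whole cluster in the first step yet stays fair over the interval, which is exactly the slack that bounded priority exploits.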

The optimization problem: We would like to maximize the arrivals completed before the deadlines for admitted LQs with soft guarantee while meeting the specified fraction of deadlines of admitted LQs with hard guarantees and keeping the long-term fairness.

The decisions to be made are (i) admission control, which decides the set of admitted LQs ($\mathbb{H}$, $\mathbb{S}$) and the set of admitted TQs ($\mathbb{E}$); and (ii) resources allocated to admitted queues LQ-$i$ and TQ-$j$ ($\vec{a_i}(t)$ and $\vec{e_j}(t)$, respectively) over time. If there are some unused/unallocated resources, queues with unsatisfied demand can share them.

3.3 Solution Approach

Our solution BoPF consists of three major components: an admission control procedure to decide $\mathbb{H}$, $\mathbb{S}$, and $\mathbb{E}$, a guaranteed resource provisioning procedure for $\mathbb{H}$ and $\mathbb{S}$, and a spare resource allocation procedure.

Admission control procedure: BoPF admits queues into the following three classes:

  • $\mathbb{H}$: LQs admitted with hard resource guarantee.

  • $\mathbb{S}$: LQs admitted with soft resource guarantee. Similar to hard guarantee, but they need to wait when some LQs with hard guarantee are occupying system resources.

  • $\mathbb{E}$: Elastic queues that can be either LQs or TQs. There is no burst guarantee, but long-term fair share is provided.

procedure periodicSchedule()
    if there are new LQs $Q$ then
        {$\mathbb{H}$, $\mathbb{S}$, $\mathbb{E}$} = LQAdmit($Q$)
    if there are new TQs $Q$ then
        {$\mathbb{E}$} = TQAdmit($Q$)
    allocate($\mathbb{H}$, $\mathbb{S}$, $\mathbb{E}$)

function LQAdmit(LQs $Q$)
    for all LQ $i \in Q$ do
        if safety condition (1) satisfied then
            if fairness condition (2) satisfied then
                if resource condition (3) satisfied then
                    Admit $i$ to hard guarantee $\mathbb{H}$
                else
                    Admit $i$ to soft guarantee $\mathbb{S}$
            else
                Admit $i$ to elastic $\mathbb{E}$ with long-term fair share
        else
            Reject $i$
    return {$\mathbb{H}$, $\mathbb{S}$, $\mathbb{E}$}

function TQAdmit(queue $Q$)
    for all TQ $j \in Q$ do
        if safety condition (1) satisfied then
            Admit $j$ to elastic $\mathbb{E}$ with long-term fair share
        else
            Reject $j$
    return {$\mathbb{E}$}

function allocate($\mathbb{H}$, $\mathbb{S}$, $\mathbb{E}$)
    for all LQ $i \in \mathbb{H}$ do
        $\vec{a_i}(t) = \vec{d_i}(n) / t_i(n)$ for $t \in [T_i(n), T_i(n) + t_i(n)]$
    for all LQ $i \in \mathbb{S}$ do
        allocate based on SRPT until each LQ-$i$’s allocation reaches $\vec{d_i}(n)$ or the deadline arrives.
    Obtain the remaining resources $\vec{L}$
    DRF($\mathbb{E}$, $\vec{L}$)
Algorithm 1 BoPF Scheduler

The system expects to admit at least $N_{\min}$ queues. Before admitting LQ-$i$, BoPF checks if admitting it invalidates any resource guarantees committed for LQs in $\mathbb{H} \cup \mathbb{S}$, i.e., the following safety condition needs to be satisfied:


where $N$ is the number of already admitted queues. If (1) is not satisfied, LQ-$i$ is rejected. Otherwise, it is safe to admit LQ-$i$ and the next step is to decide which of the three classes it should be added to.

For LQ-$i$ to have some resource guarantee, either hard or soft, its own total demand should not exceed its long-term fair share. Formally, the fairness condition is


If only condition (1) is satisfied but (2) is not, LQ-$i$ is added to $\mathbb{E}$. If both conditions (1) and (2) are satisfied, it is safe to admit LQ-$i$ to $\mathbb{H}$ or $\mathbb{S}$. If there are enough uncommitted resources (resource condition (3)), LQ-$i$ is admitted to $\mathbb{H}$. Otherwise it is added to $\mathbb{S}$.


For TQ-$j$, BoPF simply checks the safety condition (1). If it is satisfied, TQ-$j$ is added to $\mathbb{E}$. Otherwise TQ-$j$ is rejected.
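The admission decision described above is a small decision tree, sketched here in Python. The three conditions are taken as already-evaluated booleans so that the sketch makes no assumption about their exact formulas; the enum and function names are hypothetical.

```python
from enum import Enum


class QueueClass(Enum):
    HARD = "hard guarantee"
    SOFT = "soft guarantee"
    ELASTIC = "elastic (long-term fair share only)"
    REJECTED = "rejected"


def admit_lq(safety: bool, fairness: bool, resource: bool) -> QueueClass:
    """Classify a new LQ from the outcomes of conditions (1), (2), and (3)."""
    if not safety:
        return QueueClass.REJECTED   # would break committed guarantees
    if not fairness:
        return QueueClass.ELASTIC    # demand exceeds the long-term fair share
    return QueueClass.HARD if resource else QueueClass.SOFT


def admit_tq(safety: bool) -> QueueClass:
    """A TQ only needs the safety condition (1)."""
    return QueueClass.ELASTIC if safety else QueueClass.REJECTED
```

The ordering matters: safety is checked first because it protects queues that are already admitted, while the other two conditions only determine how strong a guarantee the newcomer receives.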

Guaranteed resource provisioning procedure: For each LQ-$i$ in $\mathbb{H}$, during $[T_i(n), T_i(n) + t_i(n)]$, BoPF allocates resources at the constant rate $\vec{a_i}(t) = \vec{d_i}(n) / t_i(n)$ to fulfill its demand. LQs in $\mathbb{S}$ share the uncommitted resources based on SRPT (Shortest Remaining Processing Time) [12] until each LQ-$i$’s consumption reaches $\vec{d_i}(n)$ or the deadline arrives. That is, BoPF prioritizes the LQs that are closest to reaching $\vec{d_i}(n)$ before the deadline.

After every LQ in $\mathbb{H}$ and $\mathbb{S}$ is allocated, remaining resources are allocated to queues in $\mathbb{E}$ using DRF [36].
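One scheduling step of this procedure might look like the Python sketch below. It is a simplification under stated assumptions: hard-guarantee rates are assumed to fit in the cluster (which admission control is meant to ensure), soft-guarantee queues are served in order of smallest remaining work and scaled down when resources run short, and the leftover is split equally among elastic queues as a stand-in for DRF. All names and numbers are illustrative.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def allocate_step(capacity: Vector,
                  hard_rates: Dict[str, Vector],
                  soft_requests: Dict[str, Tuple[float, Vector]],
                  elastic: List[str]) -> Dict[str, Vector]:
    """Grant resources in the order hard -> soft (SRPT) -> elastic.

    soft_requests maps a queue to (remaining work, requested rate vector).
    """
    free = list(capacity)
    grants: Dict[str, Vector] = {}
    # 1) Hard guarantee: constant committed rates come first.
    for queue, rate in hard_rates.items():
        grants[queue] = list(rate)
        free = [f - r for f, r in zip(free, rate)]
    # 2) Soft guarantee: smallest remaining work first, scaled to what is left.
    for queue, (_, rate) in sorted(soft_requests.items(), key=lambda kv: kv[1][0]):
        scale = min([1.0] + [f / r for f, r in zip(free, rate) if r > 0])
        grant = [r * max(scale, 0.0) for r in rate]
        grants[queue] = grant
        free = [f - g for f, g in zip(free, grant)]
    # 3) Elastic: equal split of the remainder (placeholder for DRF).
    for queue in elastic:
        grants[queue] = [f / len(elastic) for f in free]
    return grants


grants = allocate_step(
    capacity=[100.0, 100.0],
    hard_rates={"h": [40.0, 20.0]},
    soft_requests={"s1": (50.0, [40.0, 40.0]), "s2": (10.0, [30.0, 30.0])},
    elastic=["t1", "t2"],
)
```

In the example, `s2` is served before `s1` because it has less remaining work, and `s1` is scaled to 75% of its request once only 30 units of the first resource remain.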

Spare resource allocation procedure: If some allocated resources are not used, they are further shared by TQs and LQs with unsatisfied demand. This maximizes system utilization.

3.4 Properties of BoPF 

First, we argue that BoPF ensures long-term fairness, burst guarantee, and Pareto efficiency.

The safety condition and fairness condition ensure the long-term fairness for all TQs.

LQs in $\mathbb{H}$ have a hard resource guarantee and therefore can meet their SLA. LQs in $\mathbb{S}$ have a resource guarantee whenever possible, and only need to wait behind LQs in $\mathbb{H}$ when there is a conflict. Therefore, their performance is much better than if they were under fair allocation policies.

The addition of $\mathbb{S}$ allows more LQs to be admitted with resource guarantee and therefore increases the resources utilized by LQs. Finally, we fill spare resources with TQ jobs, so system utilization is maximized, reaching Pareto efficiency.

In addition, we prove that BoPF is weak-strategyproof, meaning that users have limited incentive to lie about their demand. The details of the proof are in Appendix 9.1.

3.5 Handling Uncertainties

In practice, arrivals of LQ-$i$ may have different sizes, i.e., $\vec{d_i}(n)$ is not deterministic but instead follows some probability distribution. Here we extend BoPF to handle this case.

We assume LQ-$i$ knows its distributions, e.g., from historical data. In particular, it knows the cumulative probability distribution of each resource $k$, denoted by $F_i^k(\cdot)$. If these distributions on multiple resources are independent, the requirement regarding $\alpha_i$ can be converted into $\prod_{k=1}^{K} F_i^k(r_i^k) \ge \alpha_i$, where $r_i^k$ is the request demand on resource $k$. This gives $F_i^k(r_i^k) = \sqrt[K]{\alpha_i}$. Finally, the request on resource-$k$ is $r_i^k = (F_i^k)^{-1}(\sqrt[K]{\alpha_i})$. We call this the $\alpha$-strategy.

When distributions of multiple resources are correlated, we only have the general form $F_i(r_i^1, \dots, r_i^K) \ge \alpha_i$, where $F_i$ is the joint distribution on all resources. We have the following properties in this case.

When the distributions are pairwise positively correlated, the $\alpha$-strategy over-provisions resources. If they are pairwise negatively correlated, the $\alpha$-strategy under-provisions resources. (The proof is omitted due to space limits.) Numerical approaches can be applied to adjust the $\alpha$-strategy accordingly.

Taking the correlations on multiple resources into consideration is important. In particular, when these distributions are perfectly correlated, $F_i^k(r_i^k) = \sqrt[K]{\alpha_i}$ can be reduced to $F_i^k(r_i^k) = \alpha_i$. When the standard deviation is large (e.g., 40% of the mean for Normal distribution), the demand can be reduced by 10% with three resources. This increases the chance of LQ-$i$ being admitted.
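The size of this effect can be checked numerically with the Python standard library. The sketch below assumes three identically distributed Normal demands with mean 1 and standard deviation 0.4, and a 95% target; these numbers and the function names are assumptions chosen to match the example in the text, not values from the evaluation.

```python
from statistics import NormalDist
from typing import List


def alpha_strategy(dists: List[NormalDist], alpha: float) -> List[float]:
    """Independent resources: request the alpha**(1/K) quantile of each one,
    so that the product of the per-resource probabilities equals alpha."""
    per_resource = alpha ** (1.0 / len(dists))
    return [d.inv_cdf(per_resource) for d in dists]


def perfectly_correlated_request(dists: List[NormalDist], alpha: float) -> List[float]:
    """Perfect correlation: the alpha quantile of each resource suffices."""
    return [d.inv_cdf(alpha) for d in dists]


dists = [NormalDist(mu=1.0, sigma=0.4)] * 3
independent = alpha_strategy(dists, 0.95)
correlated = perfectly_correlated_request(dists, 0.95)

# Joint probability of covering all three demands under independence.
joint = 1.0
for dist, request in zip(dists, independent):
    joint *= dist.cdf(request)

# Relative reduction in the requested amount per resource.
reduction = 1.0 - correlated[0] / independent[0]
```

Under these assumptions the reduction comes out at roughly a tenth of the request, in line with the figure quoted above.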

4 Design Details

In this section, we describe how we have implemented BoPF on Apache YARN, how we use standard techniques for demand estimation, and additional details related to our implementation.

Figure 4: Enabling bounded prioritization with long-term fairness in a multi-resource cluster. BoPF-related changes are shown in orange.

4.1 Enabling BoPF in Cluster Managers

Enabling bounded prioritization with long-term fairness requires implementing the BoPF scheduler itself along with an additional admission control module in cluster managers, and it takes additional information on demand characteristics from the users. A key benefit of BoPF  is its simplicity of implementation: we have implemented it in YARN. In the following, we describe how and where we have made the necessary changes.

Primer on Data-Parallel Cluster Scheduling A modern cluster manager typically includes three components: job manager or application master (AM), node manager (NM), and resource manager (RM).

One NM runs on each server and is responsible for managing resource containers on that server. A container is a unit of allocation and is used to run specific tasks.

For each application, a job manager or AM interacts with the RM to request job demands and receive allocation and progress updates. It can run on any server in the cluster. AM manages and monitors job demands (memory and CPU) and job status (PENDING, IN_PROGRESS, or FINISHED).

The RM is the most important part in terms of scheduling. It receives requests from AMs and then schedules resources using an operator-selected scheduling policy. It asks NMs to prepare resource containers for the various tasks of the submitted jobs.

BoPF Implementation We made three changes for taking user input, performing admission control, and calculating resource shares – all in the RM. We do not modify NM and AM. Our implementation also requires more input parameters from the users regarding the demand characteristics of their job queues. Figure 4 depicts our design.

User Input Users submit their jobs to their queues. In our system, there are two queue types, i.e., LQs and TQs. We do not need additional parameters for TQs because they are the same as the conventional queues. Hence, we assume that TQs are already available in the system. However, the BoPF scheduler needs additional parameters for LQs; namely, arrival times and demands. Since LQs prefer to have resource guarantees, it is necessary for them to report their own demands. The demand can be estimated by using an off-line estimator like Ernest [72]. The estimation is not necessarily accurate. Nonetheless, we show in Section 5.3.1 that our system is robust to large estimation errors.

A user submits a request containing the parameters of the new LQ. After receiving the parameters, the RM sets up a new LQ for the user. Users can also ask the cluster administrator to set up the parameters.

Admission Control YARN does not support admission control. We implement an admission control module to classify LQs and TQs into Hard Guarantee, Soft Guarantee, and Elastic classes. A new queue is rejected if it cannot meet the safety condition (1), because admitting it would invalidate the committed performance. If it is a TQ, it is added into the Elastic class. If the new LQ does not satisfy the fairness condition (2), it is also admitted to the Elastic class. If the new LQ meets the fairness condition (2), but fails the resource condition (3), it is put in the Soft Guarantee class. If the new LQ meets all three conditions, i.e., safety, fairness, and resource, it is admitted to the Hard Guarantee class.

BoPF Scheduler We implement BoPF as a new scheduling policy to achieve our joint goals of bounded priority with long-term fairness. Upon registering the queues, users submit their jobs to their LQs or TQs. Thanks to admission control, LQs and TQs are classified into Hard Guarantee, Soft Guarantee, and Elastic classes. Note that resource sharing policies are implemented across queues in YARN, whereas jobs in the same queue are scheduled in a FIFO manner. Hence, BoPF only sets the share at the individual queue level.

The BoPF scheduler periodically sets the share levels for all LQs in the Hard Guarantee and Soft Guarantee classes. These share levels are upper bounds on the resource allocation that an LQ can receive from the cluster. Based on the real demand of LQs, BoPF allocates resources until it meets the share levels.

The BoPF scheduler allocates resources to the three classes in the following priority order: (1) Hard Guarantee class, (2) Soft Guarantee class, and (3) Elastic class. The LQs in the Hard Guarantee class are allocated first. Then, BoPF allocates resources to the LQs in the Soft Guarantee class. The queues in the Elastic class are allocated the leftover resources using DRF [36].

4.2 Demand Estimation

BoPF requires users to provide accurate estimates of the resource demands and durations of their LQ jobs. These estimations can be done by using well-known techniques. For example, users can use the history of prior runs [1, 32, 39] with the assumption that resource requirements for the tasks in the same stage are similar [36, 8, 60]. The progress of pipelining jobs like SQL queries can be estimated by using the number of completed calls versus the total number of calls [19, 59, 55]. For distributed jobs, their completion times given resource allocations can be estimated using machine learning techniques [74, 72, 4]. We do not make any new contributions on demand estimation in this paper. When LQs have bursty arrivals of different sizes, BoPF with the $\alpha$-strategy ensures the performance while the average usage remains similar (§5.3). We consider a more thorough study to be important future work.

4.3 Operational Issues

Container Reuse Container reuse is a well-known technique that is used in some application frameworks, such as Apache Tez. The objective of container reuse is to reduce the overheads of allocating and releasing containers. The downside is that it wastes resources if the container to be reused is larger than the real demand of the new task. Furthermore, container reuse is not possible if the new task requires more resources than existing containers provide. For our implementation and deployment, we do not enable container reuse because BoPF periodically prefers more free resources for LQ jobs, so the drawbacks of container reuse outweigh its benefits.

Preemption Preemption is a recently introduced setting in the YARN Fair Scheduler [75]; it kills running containers of one job to create free containers for another. By default, preemption is not enabled in YARN. For BoPF, using preemption can help in providing guarantees for LQs. However, killing the tasks of running jobs often results in failures and significant delays. We do not use preemption in our system throughout this paper.

5 Evaluation

We evaluated BoPF using three widely used big data benchmarks – BigBench (BB), TPC-DS, and TPC-H. We ran experiments on a 40-node CloudLab cluster [25]. We set up Tez atop YARN for the experiments. Furthermore, to understand performance at a larger scale, we used a trace-driven simulator to replay jobs from the same traces. Our key findings are:

  • BoPF can closely approximate the LQ performance of Strict Priority (§5.2.2) and the long-term fairness for TQs of DRF (§5.2.3).

  • BoPF handles multiple LQs to accommodate bounded priority and fairness (§5.2.5).

  • BoPF can provide similar benefits in the large-scale setting (§5.3).

  • When LQs have bursty arrivals of different sizes, BoPF with the -strategy ensures the performance while the average usage remains similar (§5.3).

  • Sensitivity analysis shows that BoPF is robust to estimation errors (§5.3.1).

5.1 Experimental Setup


Our workloads consist of jobs from public benchmark traces – BigBench (BB) [14], TPC-DS [68], and TPC-H [69]. A job has multiple stages. A new stage can be executed once its prerequisite stages have finished. The tasks of a stage are equivalent in terms of resource demands and durations. The cumulative distribution functions (CDFs) of task durations across the three benchmarks are presented in Figure 5. In each experiment run, we chose the LQ jobs from one of the traces such that their shortest completion times are less than 30 seconds. We scale these jobs so that their instantaneous demands reach the maximum capacity of a single resource. The TQ jobs are randomly chosen from one of the traces. Each TQ job lasts from tens of seconds to tens of minutes. Each cluster experiment has 100 TQ jobs, and each simulation experiment has 500 TQ jobs. Throughout the evaluation, all the TQ jobs are submitted at the beginning while the LQ jobs arrive sequentially. Our default experimental setup has a single LQ and 8 TQs.
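The stage-dependency rule stated above (a stage becomes runnable only after all of its prerequisite stages finish) can be sketched as follows. The function name `runnable_stages` and the stage names in the usage note are hypothetical and are not taken from the traces.

```python
from typing import Dict, List, Set


def runnable_stages(prereqs: Dict[str, List[str]], finished: Set[str]) -> List[str]:
    """Return the unfinished stages whose prerequisite stages have all finished.

    `prereqs` maps each stage name to the list of stages it depends on.
    """
    return sorted(
        stage
        for stage, deps in prereqs.items()
        if stage not in finished and all(dep in finished for dep in deps)
    )
```

With `prereqs = {"scan": [], "join": ["scan"], "agg": ["join"]}`, only `scan` is runnable at the start, and `join` becomes runnable once `scan` is in `finished`.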

Figure 5: CDFs of task durations across workloads.

User Input Since the traces give us the resource demands and durations of the job tasks, we can set an ON period (i.e., when an LQ job is active) equal to the shortest completion time of its corresponding LQ job. The average of the ON periods is 27 seconds across the traces. Without loss of generality, we assume that the LQ jobs arrive periodically. The case of aperiodic LQ jobs is similar to multiple LQs with different periods. Unless otherwise noted, the inter-arrival period of two LQ jobs is 300 seconds for the cluster experiments and 1000 seconds for the simulation experiments.

Experimental Cluster We set up Apache Hadoop 2.7.2 (YARN) on a cluster of 40 worker nodes on CloudLab [25] (the 40-node cluster). Each node has 32 CPU cores, 64 GB RAM, a 10 Gbps NIC, and runs Ubuntu 16.04. In total, the cluster has 1280 CPU cores and 2.5 TB of memory. The cluster also has a master node with the same specification running the resource manager (RM).

Trace-driven Simulator To obtain experimental results at a larger scale, we built a simulator that mimics a system such as Tez atop YARN. The simulator can replay directed acyclic graph (DAG) jobs (as Tez does) and simulate the fair scheduler of YARN at the queue level. For the jobs in the same queue, we allocate resources to them in a FIFO manner. Unlike YARN, the simulator supports 6 resources, i.e., CPU, memory, disk in/out throughputs, and network in/out throughputs.

Baselines We compare BoPF against the following:

  • Dominant Resource Fairness (DRF): The DRF algorithm is implemented in the YARN Fair Scheduler [75]. DRF uses the concept of the dominant resource to compare multi-dimensional resources [36]. The idea is that resource allocation should be determined by the dominant share of a queue, which is its maximum share of any resource (memory or CPU). Essentially, DRF seeks to maximize the minimum dominant share across all queues.

  • Strict Priority (SP): We use Strict Priority to provide the best performance for LQ jobs. We borrow the concept of “Strict Priority” from network traffic scheduling, where Strict Priority queues get bandwidth before other queues [54]. Similarly, we enable the LQs to receive all the resources they need first, and then allocate the remaining resources to the other queues. If there are conflicts among the LQs, we use DRF among them.

  • Naive-BoPF (N-BoPF): N-BoPF is a simple version of BoPF that can provide a bounded performance guarantee and fairness. However, N-BoPF does not support admission to the Soft Guarantee class. For the queues that satisfy the safety condition (1), N-BoPF admits them to the Hard Guarantee class if they meet the fairness condition (2) and the resource condition (3). Otherwise, it puts the queues into the Elastic class. We use N-BoPF as a baseline when there are multiple LQs (§5.2.5).
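To make the DRF baseline concrete, the sketch below uses a task-granularity progressive-filling loop: it repeatedly gives one task to the queue with the smallest dominant share until no queue's next task fits. It is an illustrative variant under stated assumptions, not the YARN Fair Scheduler code. The function name `drf_allocate` is hypothetical, capacities are assumed to be positive, and each queue is assumed to have a fixed per-task demand vector.

```python
from typing import Dict, List


def drf_allocate(capacity: List[float], demands: Dict[str, List[float]]) -> Dict[str, int]:
    """Return the number of tasks each queue receives under progressive filling.

    `demands` maps a queue name to its per-task demand vector; `capacity`
    holds the (positive) total amount of each resource.
    """
    used = [0.0] * len(capacity)
    tasks = {name: 0 for name in demands}

    def dominant_share(name: str) -> float:
        # Largest fraction of any single resource currently held by the queue.
        return max(tasks[name] * d / c for d, c in zip(demands[name], capacity))

    def fits(name: str) -> bool:
        demand = demands[name]
        # Ignore all-zero demands so the loop always terminates.
        return any(d > 0 for d in demand) and all(
            u + d <= c for u, d, c in zip(used, demand, capacity)
        )

    while True:
        candidates = [name for name in demands if fits(name)]
        if not candidates:
            return tasks
        chosen = min(candidates, key=dominant_share)
        tasks[chosen] += 1
        used = [u + d for u, d in zip(used, demands[chosen])]
```

On a cluster with 9 CPUs and 18 GB of memory, where queue A needs (1 CPU, 4 GB) per task and queue B needs (3 CPUs, 1 GB) per task, this loop gives A three tasks and B two tasks, so both queues end with a dominant share of 2/3.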

Overall, SP is the upper bound in terms of performance guarantee, and DRF is the upper bound of fairness guarantee for our proposed approach.

Metrics Our primary metric is the average completion time (avg. compl.) of LQ jobs or TQ jobs. To show the performance improvement, we use the average completion times of LQ jobs across the three approaches. On the other hand, we use the average completion times of TQ jobs to show that BoPF also protects the TQ jobs. Additionally, we use the factor of improvement to show how much BoPF speeds up the LQ jobs compared to DRF, defined as

factor of improvement = (avg. compl. of DRF) / (avg. compl. of BoPF).

5.2 BoPF in Testbed Experiments

5.2.1 BoPF in Practice

Figure 6: [Cluster] BoPF’s solution for the motivational problem (§2.1). The first two jobs of LQ finish quickly and the last two jobs are prevented from using too many resources. This solution is close to the optimal one.

Before diving into the details of our evaluation, recall the motivational problem from Section 2.1. Figure 6 depicts how BoPF solves it in the testbed. BoPF enables the first two jobs of LQ to finish quickly, in 141 and 180 seconds. For the two large jobs arriving at 1400 and 2000 seconds, the share stays very large for only roughly 335 seconds, after which it is cut down to return resources to the TQ.

5.2.2 Performance Guarantee

Figure 7: [Cluster] Average completion time of LQ jobs in a single LQ across the 3 schedulers when varying the number of TQs. BoPF and SP guarantee the average completion time of the LQ jobs, while DRF suffers significantly as the number of TQs increases.

Next, we focus on what happens when there is more than one TQ. Figure 7 shows the average completion time of LQ jobs in the 40-node cluster on the BB workload. In this setting, there is a single LQ and multiple TQs. The x-axis shows the number of TQs in the cluster.

When there are no TQs, the average completion times of LQ jobs across the three schedulers are the same (57 seconds). The completion times are greater than the average ON period (27 seconds) because of inefficient resource packing and allocation overheads. In practice, the resource demands of tasks cannot utilize all the resources of a node, which leaves large unallocated resources across the cluster. Hence, the LQ jobs cannot receive the whole cluster capacity as expected. More importantly, this delay is also caused by allocation overheads, such as waiting for containers to be allocated or launching containers.

As the number of TQs increases, the performance of DRF degrades significantly because DRF tends to allocate fewer resources to LQ jobs. DRF is the worst among the three schedulers. In contrast, BoPF and SP give the highest priority to LQs, which guarantees the performance of LQ jobs. The average completion times when TQs are present (1, 2, 4, and 8 TQs) are almost the same (65 seconds). These average completion times are still larger than in the case of no TQs because preemption is not used: the LQ jobs cannot receive the resources that are still used by running tasks.

Workload 1 TQ 2 TQs 4 TQs 8 TQs
BB 1.18 1.42 1.86 4.66
TPC-DS 1.35 1.61 2.29 5.38
TPC-H 1.10 1.37 2.01 5.12
Table 3: [Cluster] Factor of improvement by BoPF across various workload with respect to the number of TQs.

To understand how well BoPF performs on various workload traces, we carried out the same experiments on TPC-DS and TPC-H. As SP and BoPF achieve similar performance, we only present the factors of improvement of BoPF across the various workloads in Table 3. The numbers in the table show consistent improvements in terms of the average completion times of LQ jobs.

(a) 1 LQ & 4 TQs
(b) 1 LQ & 8 TQs
Figure 8: [Cluster] The completion time of LQ jobs is predictable using BoPF.

In addition to the average completion time, we evaluated the performance of individual LQ jobs. Figure 8 shows the cumulative distribution functions (CDFs) of the completion times across the 3 approaches. Figures 8(a) and 8(b) show the experimental results for the cases of 4 TQs and 8 TQs, respectively. We observe that the completion times of LQ jobs under DRF are not stable and vary a lot when the number of TQs becomes large, as in Figure 8(b). The variation is caused by the instantaneous fairness and the variance of total resource demand.

5.2.3 Fairness Guarantee

Figure 9: [Cluster] BoPF protects the batch jobs up to compared to SP.

Figure 9 shows the average completion time of TQ jobs when we scale up the number of tasks of LQ jobs by 1x, 2x, 4x, and 8x. In this experiment, there is one LQ and 8 TQs.

Since DRF is a fair scheduler, the average completion times of TQ jobs are almost unaffected by the size of LQ jobs. In contrast, SP allocates too many resources to LQ jobs, which significantly hurts TQ jobs: since SP gives the highest priority to the LQ jobs, the TQ jobs starve for resources. BoPF performs close to DRF. While DRF maintains instantaneous fairness, BoPF maintains long-term fairness among the queues.

5.2.4 Scheduling Overheads

Recall from Section 4 that the BoPF scheduler has three components: user input, admission control, and allocation. Compared to the default schedulers in YARN, our scheduler has additional scheduling overheads for admission control and additional computation in allocation.

Since we only implement our scheduler in the Resource Manager, the scheduling overheads occur at the master node. To measure the scheduling overheads, we run admission control for 10000 LQs and 10000 TQs on a master node – Intel Xeon E3 2.4 GHz (with 12 cores). Each LQ has 500 ON/OFF cycles. As the LQAdmit and TQAdmit functions in Algorithm 1 show, the admission overheads increase linearly with the number of queues. The total admission overheads are approximately 1 ms, which is significantly smaller than the default update interval in the YARN Fair Scheduler, i.e., 500 ms [75]. The additional computation time spent in allocation is also negligible (less than 1 ms).

5.2.5 Admission Control for Multiple LQs

(a) DRF: LQ-0, LQ-1, LQ-2 are unhappy with high latency.
(b) SP: TQ-0 is starving of resources.
(c) N-BoPF: Only LQ-0 and TQ-0 are happy.
(d) BoPF: LQ-0, LQ-1 and TQ-0 are happy.
Figure 10: [Cluster] DRF and SP fail to guarantee both performance and fairness simultaneously. BoPF gives the best performance to LQ-0, near-optimal performance to LQ-1, and maintains fairness among the 4 queues. LQ-2 requires too many resources, so its performance cannot be guaranteed.

To demonstrate how BoPF works with multiple LQs, we set up 3 LQs (LQ-0, LQ-1, and LQ-2) and a single TQ (TQ-0). The jobs of TQ-0 are queued up at the beginning, while LQ-0, LQ-1, and LQ-2 arrive at 50, 100, and 150 seconds, respectively. The periods of LQ-0, LQ-1, and LQ-2 are 150, 110, and 60 seconds. All the LQ jobs have identical demands and task durations. The TQ jobs are chosen from the BB benchmark. BoPF admits LQ-0 to the Hard Guarantee class, LQ-1 to the Soft Guarantee class, and LQ-2 to the Elastic class.

Figure 10 shows the resource usage of each queue under the four schedulers, i.e., DRF, SP, N-BoPF, and BoPF. As an instantaneously fair scheduler, DRF continuously maintains the fair share for all queues, as in Figure 10(a). Since LQ-2 requires a lot of resources, SP starves TQ-0 of resources (Figure 10(b)). N-BoPF provides LQ-0 with a resource guarantee and fairly shares the remaining resources among LQ-1, LQ-2, and TQ-0 (Figure 10(c)). BoPF provides a hard guarantee to LQ-0 and a soft guarantee to LQ-1, as in Figure 10(d). The soft guarantee allows LQ-1 to perform better than under N-BoPF. Since LQ-2 demands too many resources, BoPF treats it like TQ-0.

Figure 11 shows the average completion time of jobs in each queue across the four schedulers. The performance of DRF for LQ jobs is the worst among the four schedulers, but it is the best for TQ-0. The performance of SP is good for LQ jobs, but it is the worst for TQ jobs. N-BoPF provides the best performance for LQ-0 but not for LQ-1 and LQ-2. BoPF is the best overall among the four schedulers. Three of the four queues, i.e., LQ-0, LQ-1, and TQ-0, benefit significantly from BoPF. BoPF even outperforms SP for LQ-0 and LQ-1 jobs and does not hurt the TQ.

Figure 11: [Cluster] BoPF provides better performance for LQs than DRF and N-BoPF. Unlike SP, BoPF protects the performance of TQ jobs.
(a) Percentage of arrivals completed by the deadlines.
(b) Requested demand normalized by that under the vanilla BoPF .
(c) Resource consumption of LQ.
Figure 12: [Simulation] The proposed -strategy under =95% is robust against the uncertainties.

5.3 BoPF in Trace-Driven Simulations

To verify the correctness of the large-scale simulator, we replayed the BB trace logs from the cluster experiments in the simulator. Table 4 shows the factors of improvement in the completion times of LQ jobs from the simulator, which are consistent with those from our cluster experiments (Table 3).

Workload 1 TQ 2 TQs 4 TQs 8 TQs 16 TQs 32 TQs
BB 1.08 1.56 2.32 4.09 7.28 16.61
TPC-DS 1.06 1.38 1.66 2.93 5.16 10.40
TPC-H 1.01 1.28 1.92 3.04 5.50 11.35
Table 4: [Simulation] Factors of improvement by BoPF across various workloads w.r.t the number of TQs.

BoPF significantly improves over DRF when we have more TQs. We note that the factors of improvement for TPC-DS and TPC-H in the simulation are smaller than those in the cluster experiments. It turns out that DRF on TPC-DS and TPC-H suffers from allocation overheads that our simulation does not capture. The allocation overheads for the LQ jobs in TPC-DS and TPC-H are large because they have more stages than the LQ jobs in BB (only 2 stages).

5.3.1 Impact of Estimation Errors

BoPF requires users to report their estimated demand for LQ jobs. Demand estimation often results in estimation errors. To understand the impact of estimation errors on BoPF, we assume that estimation errors follow a normal distribution with zero mean. The standard deviation (std.) of estimation errors lies in . To incorporate the estimation errors, we update the task demand and durations of LQ jobs as .

Figure 13 shows the impact of estimation errors on the average completion time of LQ jobs. There are 1 LQ and 8 TQs. LQ jobs arrive every 350 seconds. BoPF is robust when the standard deviation of estimation errors varies from 0 to 20. The LQ jobs in BB suffer more from large estimation errors (std. ) than those in TPC-DS and TPC-H. The delays are caused by the underestimated jobs because the excess demand is not guaranteed by the system. Meanwhile, the overestimated jobs do not suffer any delays because the guaranteed resources exceed what they need. Although estimation errors result in performance degradation, the performance of LQ jobs is still much better than that under DRF (162 seconds).

Figure 13: [Simulation] BoPF’s performance degrades with larger estimation errors, yet is still significantly better than DRF (162 secs).

5.3.2 Performance of the -Strategy.

Figure 12 depicts the requested demand, performance, and resource usage under the vanilla BoPF and the one with the -strategy when arrivals have different sizes. In particular, as the variance increases, the vanilla BoPF can no longer complete arrivals before the deadline. Even with a 10% standard deviation, the percentage drops below 50%. In contrast, BoPF with the -strategy always satisfies the requirement. Even though the reported demand increases, the average resource usage does not change much, e.g., the TQ receives the same long-term share.

6 Related Work

Bursty Applications in Big Data Clusters Big data clusters experience burstiness from a variety of sources, including periodic jobs [32, 1, 6, 65], interactive user sessions [5], as well as streaming applications [78, 3, 70]. Some of them show predictability in terms of inter-arrival times between successive jobs (e.g., Spark Streaming [78] runs periodic mini batches at regular intervals), while some others follow different arrival processes (e.g., user interactions with large datasets [5]). Similarly, resource requirements of the successive jobs can sometimes be predictable, but often they are difficult to predict due to external load variations (e.g., time-of-day or similar patterns); the latter, without BoPF, can inadvertently hurt batch queues (§2).

Multi-Resource Job Schedulers Although early job schedulers dealt with a single resource [79, 7, 46], modern cluster resource managers, e.g., Mesos [43], YARN [71], and others [65, 73, 18], employ multi-resource schedulers [36, 34, 39, 40, 16, 53, 13] to handle multiple resources and optimize diverse objectives. These objectives can be fairness (e.g., DRF [36]), performance (e.g., shortest-job-first (SJF) [34]), efficiency (e.g., Tetris [39]), or different combinations of the three (e.g., Carbyne [40]). Hawk [29] focuses on reducing the overheads in scheduling a large number of small jobs. Chen et al. design a preemption algorithm to prioritize short jobs without resource guarantees [21]. However, all of these focus on instantaneous objectives, with instantaneous fairness being the most common goal. To the best of our knowledge, BoPF is the first multi-resource job scheduler with long-term memory.

Handling Burstiness Supporting multiple classes of traffic is a classic networking problem that, over the years, has arisen in local area networks [33, 66, 67, 15, 30], wide area networks [56, 49, 44], and datacenter networks [50, 42]. All of them employ some form of admission control to provide quality-of-service guarantees. They consider only a single link (i.e., a single resource). In contrast, BoPF considers multi-resource jobs and builds on top of this large body of literature.

BVT [31] is a thread-based CPU scheduler that was designed to work with both real-time and best-effort tasks. Although it prioritizes the real-time tasks, it cannot guarantee performance and fairness.

Collocating Mixed Workloads in Datacenters Latency-sensitive and best-effort workloads are often collocated. Heracles [57] and Parties [20] handle mixed workloads to increase the utilization of servers. Bistro [38] allows both data-intensive and online workloads to share the same infrastructure. Morpheus [52] reserves resources for periodic jobs ahead of time. All of them prioritize the latency-sensitive workloads to meet the quality of service requirement but do not provide both resource guarantee and fairness.

Expressing Burstiness Requirements BoPF is not the first system that allows users to express their time-varying resource requirements. Similar challenges have appeared in traditional networks [67], network calculus [26, 27], datacenters [50, 9], and wide-area networks [56]. Akin to them, BoPF requires users to explicitly provide their burst durations and sizes; BoPF tries to enforce those requirements in the short and long terms. Unlike them, however, BoPF explores how to allow users to express their requirements in a multi-dimensional space, where each dimension corresponds to an individual resource. One possible way to collapse the multi-dimensional interface to a single dimension is using the progress [22, 36]; however, progress only applies to scenarios where a user’s utility is captured using Leontief preferences.

7 Conclusion

To enable the coexistence of latency-sensitive LQs and TQs, we proposed BoPF (Bounded Priority Fairness). BoPF provides a bounded performance guarantee to LQs and maintains long-term fairness for TQs. BoPF classifies the queues into three classes: Hard Guarantee, Soft Guarantee, and Elastic. BoPF provides the best performance to LQs in the Hard Guarantee class and better performance to LQs in the Soft Guarantee class. The scheduling is executed in a strategyproof manner, which is critical for public clouds. The queues in the Elastic class share the leftover resources to maximize system utilization. In the deployments, we show that BoPF not only outperforms DRF by up to 5.38× for LQ jobs but also protects TQ jobs by up to 3.05× compared to Strict Priority. When an LQ’s arrivals have different sizes, adding the -strategy can satisfy the deadlines with similar resource utilization.

8 Acknowledgments

This research is supported by NSF grants CNS-1617773, 1730128, 1919752, 1717588, 1617698 and was partially funded by MSIT, Korea, IITP-2019-2011-1-00783.

9 Appendices

9.1 Proof of strategyproofness

Let the true demand and deadline of LQ- be for a particular arrival. Let the request parameter be . We first argue that holds with .

As , let . Define a new vector and let . Notice here has the same performance as , while for any . Therefore, may request no more resources than , which is more likely to be admitted. Hence, it is always better to request , which is proportional to , the true demand.

Regarding the demand , reporting a larger () still satisfies its demand but carries a higher risk of being rejected because it requests a higher demand, whereas reporting a smaller () does not make sense because the queue may receive fewer resources than it actually needs. Therefore, there is no incentive for LQ- to lie about its . The argument for the deadline is similar. Reporting a larger deadline does not make sense because the queue may receive fewer resources than it actually needs. On the other hand, reporting a tighter deadline still satisfies the deadline but carries a higher risk of being rejected because it requests a higher demand. Therefore, there is no incentive for LQ- to lie about its deadline, either.


  • [1] S. Agarwal, S. Kandula, N. Bruno, M. Wu, I. Stoica, and J. Zhou (2012) Re-optimizing data parallel computing. In NSDI, Cited by: §4.2, §6.
  • [2] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, Cited by: §1.
  • [3] T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle (2013) MillWheel: fault-tolerant stream processing at Internet scale. Cited by: §1, §6.
  • [4] O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and M. Zhang (2017) Cherrypick: adaptively unearthing the best cloud configurations for big data analytics. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 469–482. Cited by: §4.2.
  • [5] S. Alspaugh, B. Chen, J. Lin, A. Ganapathi, M. Hearst, and R. Katz (2014) Analyzing log analysis: an empirical study of user log mining. In LISA, Cited by: §6.
  • [6] G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica, D. Harlan, and E. Harris (2011) Scarlett: coping with skewed popularity content in MapReduce clusters. In EuroSys, Cited by: §6.
  • [7] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris (2010) Reining in the outliers in MapReduce clusters using Mantri. In OSDI, Cited by: §1, §6.
  • [8] G. Ananthanarayanan, A. Ghodsi, A. Warfield, D. Borthakur, S. Kandula, S. Shenker, and I. Stoica (2012) Pacman: coordinated memory caching for parallel jobs. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 267–280. Cited by: §1, §4.2.
  • [9] S. Angel, H. Ballani, T. Karagiannis, G. O’Shea, and E. Thereska (2014) End-to-end performance isolation through virtual datacenters. In OSDI, Cited by: §6.
  • [10] Apache Hadoop. Cited by: §1.
  • [11] (2017) Apache Tez. Cited by: §1.
  • [12] N. Bansal and M. Harchol-Balter (2001) Analysis of srpt scheduling: investigating unfairness. Vol. 29, ACM. Cited by: §3.3.
  • [13] A. A. Bhattacharya, D. Culler, E. Friedman, A. Ghodsi, S. Shenker, and I. Stoica (2013) Hierarchical scheduling for diverse datacenter workloads. In SoCC, Cited by: §1, §6.
  • [14] (2016) Big-Data-Benchmark-for-Big-Bench. Cited by: §1, §5.1.
  • [15] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss (1998-12) An Architecture for Differentiated Services. Request for Comments, IETF, Internet Engineering Task Force. Note: RFC 2475 (Informational), updated by RFC 3260. Cited by: §6.
  • [16] E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou (2014) Apollo: scalable and coordinated scheduling for cloud-scale computing. In OSDI, Cited by: §6.
  • [17] P. Carbone, S. Ewen, S. Haridi, A. Katsifodimos, V. Markl, and K. Tzoumas (2015) Apache Flink: stream and batch processing in a single engine. Data Engineering. Cited by: §1.
  • [18] R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou (2008) SCOPE: easy and efficient parallel processing of massive datasets. In VLDB, Cited by: §6.
  • [19] S. Chaudhuri, V. Narasayya, and R. Ramamurthy (2004) Estimating progress of execution for sql queries. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pp. 803–814. Cited by: §4.2.
  • [20] S. Chen, C. Delimitrou, and J. F. Martínez (2019) PARTIES: qos-aware resource partitioning for multiple interactive services. Cited by: §6.
  • [21] W. Chen, J. Rao, and X. Zhou (2017) Preemptive, low latency datacenter scheduling via lightweight virtualization. In 2017 USENIX Annual Technical Conference (USENIXATC 17), pp. 251–263. Cited by: §6.
  • [22] M. Chowdhury, Z. Liu, A. Ghodsi, and I. Stoica (2016) HUG: multi-resource fairness for correlated and elastic demands. In NSDI, Cited by: §6.
  • [23] M. Chowdhury and I. Stoica (2015) Efficient coflow scheduling without prior knowledge. In SIGCOMM, Cited by: §1.
  • [24] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica (2011) Managing data transfers in computer clusters with Orchestra. In SIGCOMM, Cited by: §1.
  • [25] (2017) Cloudlab. Cited by: §5.1, §5.
  • [26] R. Cruz (1991) A calculus for network delay, Part I: network elements in isolation. IEEE Transactions on Information Theory 37 (1), pp. 114–131. Cited by: §6.
  • [27] R. Cruz (1991) A calculus for network delay, Part II: network analysis. IEEE Transactions on Information Theory 37 (1), pp. 132–141. Cited by: §6.
  • [28] J. Dean and S. Ghemawat (2004) MapReduce: simplified data processing on large clusters. In OSDI, Cited by: §1.
  • [29] P. Delgado, F. Dinu, A. Kermarrec, and W. Zwaenepoel (2015) Hawk: hybrid datacenter scheduling. In 2015 USENIX Annual Technical Conference (USENIXATC 15), pp. 499–510. Cited by: §6.
  • [30] A. Demers, S. Keshav, and S. Shenker (1989) Analysis and simulation of a fair queueing algorithm. In SIGCOMM, Cited by: §6.
  • [31] K. J. Duda and D. R. Cheriton (1999) Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler. ACM SIGOPS Operating Systems Review 33 (5), pp. 261–276. Cited by: §2.3, §6.
  • [32] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca (2012) Jockey: guaranteed job latency in data parallel clusters. In Eurosys, Cited by: §4.2, §6.
  • [33] S. Floyd and V. Jacobson (1995) Link-sharing and resource management models for packet networks. IEEE/ACM Transactions on Networking 3 (4), pp. 365–386. Cited by: §6.
  • [34] M. R. Garey, D. S. Johnson, and R. Sethi (1976) The complexity of flowshop and jobshop scheduling. Mathematics of Operations Research 1 (2), pp. 117–129. Cited by: §1, §6.
  • [35] A. Ghodsi, V. Sekar, M. Zaharia, and I. Stoica (2012) Multi-resource fair queueing for packet processing. SIGCOMM. Cited by: §1.
  • [36] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica (2011) Dominant resource fairness: fair allocation of multiple resource types. In NSDI, Cited by: §1, §1, §3.3, §4.1, §4.2, 1st item, §6, §6.
  • [37] A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica (2013) Choosy: max-min fair sharing for datacenter jobs with constraints. In EuroSys, Cited by: §1.
  • [38] A. Goder, A. Spiridonov, and Y. Wang (2015) Bistro: scheduling data-parallel jobs against live production systems. In 2015 USENIX Annual Technical Conference (USENIXATC 15), pp. 459–471. Cited by: §6.
  • [39] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella (2014) Multi-resource packing for cluster schedulers. In SIGCOMM, Cited by: §1, §4.2, §6.
  • [40] R. Grandl, M. Chowdhury, A. Akella, and G. Ananthanarayanan (2016) Altruistic scheduling in multi-resource clusters. In OSDI, Cited by: §1, §6.
  • [41] R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni (2016) Graphene: packing and dependency-aware scheduling for data-parallel clusters. In OSDI, Cited by: §1.
  • [42] M. P. Grosvenor, M. Schwarzkopf, I. Gog, R. N. Watson, A. W. Moore, S. Hand, and J. Crowcroft (2015) Queues don’t matter when you can JUMP them!. In NSDI, Cited by: §6.
  • [43] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica (2011) Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In NSDI, Cited by: §6.
  • [44] C. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and R. Wattenhofer (2013) Achieving high utilization with software-driven WAN. In SIGCOMM, Cited by: §6.
  • [45] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly (2007) Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, Cited by: §1.
  • [46] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg (2009) Quincy: fair scheduling for distributed computing clusters. In SOSP, Cited by: §1, §6.
  • [47] J. M. Jaffe (1981) Bottleneck flow control. IEEE Transactions on Communications 29 (7), pp. 954–962. Cited by: §1.
  • [48] P. Jaillet and M. R. Wagner (2012) Online optimization. Springer Publishing Company, Incorporated. Cited by: §3.1.
  • [49] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, et al. (2013) B4: experience with a globally-deployed software defined WAN. In SIGCOMM, Cited by: §6.
  • [50] K. Jang, J. Sherry, H. Ballani, and T. Moncaster (2015) Silo: predictable message completion time in the cloud. In SIGCOMM, Cited by: §6.
  • [51] C. Joe-Wong, S. Sen, T. Lan, and M. Chiang (2012) Multi-resource allocation: fairness-efficiency tradeoffs in a unifying framework. In INFOCOM, Cited by: §1.
  • [52] S. A. Jyothi, C. Curino, I. Menache, S. M. Narayanamurthy, A. Tumanov, J. Yaniv, R. Mavlyutov, Í. Goiri, S. Krishnan, J. Kulkarni, et al. (2016) Morpheus: towards automated SLOs for enterprise clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 117–134. Cited by: §6.
  • [53] K. Karanasos, S. Rao, C. Curino, C. Douglas, K. Chaliparambil, G. Fumarola, S. Heddaya, R. Ramakrishnan, and S. Sakalanaga (2015) Mercury: hybrid centralized and distributed scheduling in large shared clusters. In USENIX ATC, Cited by: §6.
  • [54] L. Kleinrock and R. Gail (1996) Queueing systems: problems and solutions. Wiley. Cited by: §1, 2nd item.
  • [55] A. C. König, B. Ding, S. Chaudhuri, and V. Narasayya (2011) A statistical approach towards robust progress estimation. Proceedings of the VLDB Endowment 5 (4), pp. 382–393. Cited by: §4.2.
  • [56] A. Kumar, S. Jain, U. Naik, A. Raghuraman, N. Kasinadhuni, E. C. Zermeno, C. S. Gunn, J. Ai, B. Carlin, M. Amarandei-Stavila, et al. (2015) BwE: flexible, hierarchical bandwidth allocation for WAN distributed computing. In SIGCOMM, Cited by: §6.
  • [57] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis (2015) Heracles: improving resource efficiency at scale. In ACM SIGARCH Computer Architecture News, Vol. 43, pp. 450–462. Cited by: §6.
  • [58] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein (2010) GraphLab: a new framework for parallel machine learning. In UAI, Cited by: §1.
  • [59] G. Luo, J. F. Naughton, C. J. Ellmann, and M. W. Watzke (2004) Toward a progress indicator for database queries. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pp. 791–802. Cited by: §4.2.
  • [60] K. Morton, M. Balazinska, and D. Grossman (2010) ParaTimer: a progress indicator for MapReduce DAGs. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 507–518. Cited by: §4.2.
  • [61] H. Moulin (2014) Cooperative microeconomics: a game-theoretic introduction. Princeton University Press. Cited by: §2.3.
  • [62] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi (2013) Naiad: a timely dataflow system. In SOSP, Cited by: §1.
  • [63] (2016) Presto: Distributed SQL Query Engine for Big Data. Cited by: §1.
  • [64] J. Schneider and S. Kirkpatrick (2007) Stochastic optimization. Springer Science & Business Media. Cited by: §3.1.
  • [65] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes (2013) Omega: flexible, scalable schedulers for large compute clusters. In EuroSys, Cited by: §6.
  • [66] S. Shenker, D. D. Clark, and L. Zhang (1993) A scheduling service model and a scheduling architecture for an integrated services packet network. Technical report, Xerox PARC. Cited by: §6.
  • [67] I. Stoica, H. Zhang, and T. S. E. Ng (1997) A hierarchical fair service curve algorithm for link-sharing, real-time and priority service. In SIGCOMM, Cited by: §6.
  • [68] (2017) TPC Benchmark DS (TPC-DS). Cited by: §5.1.
  • [69] (2017) TPC Benchmark H (TPC-H). Cited by: §5.1.
  • [70] (2017) Trident: stateful stream processing on Storm. Cited by: §1, §6.
  • [71] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler (2013) Apache Hadoop YARN: yet another resource negotiator. In SoCC, Cited by: §1, §6.
  • [72] S. Venkataraman, Z. Yang, M. Franklin, B. Recht, and I. Stoica (2016) Ernest: efficient performance prediction for large-scale advanced analytics. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pp. 363–378. Cited by: §4.1, §4.2.
  • [73] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes (2015) Large-scale cluster management at Google with Borg. In EuroSys, Cited by: §6.
  • [74] N. J. Yadwadkar, B. Hariharan, J. E. Gonzalez, B. Smith, and R. H. Katz (2017) Selecting the best VM across multiple public clouds: a data-driven performance modeling approach. In Proceedings of the 2017 Symposium on Cloud Computing, pp. 452–465. Cited by: §4.2.
  • [75] (2014) YARN Fair Scheduler. Cited by: §1, §4.3, 1st item, §5.2.4.
  • [76] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica (2010) Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, Cited by: §1.
  • [77] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica (2012) Resilient Distributed Datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, Cited by: §1.
  • [78] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica (2013) Discretized streams: fault-tolerant stream computation at scale. In SOSP, Cited by: §1, §3.1, §6.
  • [79] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica (2008) Improving MapReduce performance in heterogeneous environments. In OSDI, Cited by: §1, §6.