Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

07/02/2019 ∙ by Kshiteej Mahajan, et al.

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads while ensuring overall cluster efficiency. We find that established cluster scheduling disciplines that provide instantaneous fair share of resources are a poor fit because of ML workloads' unique attributes. ML jobs are typically long running, have coarse-grained tasks that need to be gang-scheduled, and their performance is sensitive to tasks' relative placement. These properties cannot be captured by existing fair sharing schemes. We propose Themis, a new scheduling framework for ML training workloads. Its GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, Themis uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run by a central arbiter. Our auction design allocates GPUs to winning bids by trading off efficiency for fairness in the short term but compensating for finish-time fairness in the long term. Our evaluation on a number of machine learning models shows that Themis can ensure greater fairness while providing more efficient allocations compared to state-of-the-art schedulers.




1. Introduction

With the widespread success of machine learning (ML) models for tasks such as object detection, speech recognition, and machine translation, a number of enterprises are now incorporating ML models into their products. Training individual ML models is time- and resource-intensive with each training job typically executing in parallel on a number of GPUs.

With different groups in the same organization training ML models, it is beneficial for enterprises to consolidate GPU resources into a shared cluster. Similar to existing clusters used for large-scale data analytics, using a shared cluster for ML has a number of operational advantages, e.g., reduced development overheads and lower costs for maintaining GPUs. However, sharing an ML cluster becomes attractive to users only if they have the appropriate sharing incentive: if a total of $N$ users share a cluster $C$, every user's running time should be no worse than $N$ times her running time when using all of $C$ by herself. Providing such an incentive through fair scheduling mechanisms has been widely studied in prior cluster scheduling frameworks, e.g., Quincy (Isard et al., 2009), DRF (Ghodsi et al., 2011), and Carbyne (Grandl et al., 2016a).

ML workloads, however, have several unique characteristics that make existing fair allocation schemes insufficient (Section 2). Unlike batch analytics workloads, ML jobs have long-running tasks that need to be scheduled together, i.e., gang-scheduled. Further, each task in a job often runs for a number of iterations while synchronizing model updates at the end of each iteration. This frequent communication means that jobs are placement-sensitive, i.e., placing all the tasks for a job on the same machine or the same rack can lead to significant speedups. Having long-running tasks means that established schemes such as DRF that aim to provide instantaneous resource fairness are not suitable, because an incoming job may have to wait for a long time before resources become available (Section 2). Further, even if the number of resources allocated is fairly divided across jobs, placement sensitivity means that jobs with the same aggregate resources could have widely different performance, violating the sharing incentive.

We aim to design a fair scheduler for GPU clusters that execute ML workloads. Our goal is to provide fair sharing across ML applications (Section 2), or apps for short, where every app consists of one or more related ML jobs, each running with different hyperparameters, to train an accurate model for a given task. To capture the effect of long-running tasks and placement sensitivity, we define a new long-term objective of finish-time fairness, which is the ratio of an app's running time in a cluster shared with other apps to its running time alone in the entire cluster. Our goal is thus to minimize the maximum finish-time fairness across all ML apps while providing allocations that efficiently utilize all cluster GPUs. In our scheduler, Themis, we achieve this using two key ideas.

First, we develop a scheduling discipline that separates long-vs-short time horizons. From the app’s perspective, the effect of finish-time fairness is only apparent when it terminates. Thus, on finer minutes-long timescales, we can trade off fairness slightly to carefully account for apps’ GPU placement sensitivity while maximizing GPU allocation efficiency. The resulting unfairness can be compensated for in future GPU allocations. In effect, our scheduling discipline works as follows: on the time-scale of app completions, it aims to ensure finish-time fairness across all apps, while it focuses on efficient placement-sensitive allocation in the short term.

Second, we present a two-level scheduling design that contains a centralized inter-app scheduler at the bottom level, and an API to integrate with existing hyperparameter tuning frameworks at the top level. A number of existing frameworks such as Hyperdrive (Rasley et al., 2017) and HyperOpt (Bergstra et al., 2015) can intelligently apportion GPU resources between various jobs in a single app, and in some cases also terminate a job early if its progress is not promising. Thus it is beneficial to have a two-level design where apps can directly use existing hyperparameter tuning frameworks. Importantly, our two-level design uses a novel semi-optimistic auction-based approach that accounts for placement sensitivity, building on the first insight above. When GPUs are available on short time-scales, our scheduler makes all of them visible to a fraction of apps that are farthest in terms of their long-term finish-time fairness metrics. Each app has the opportunity to bid for subsets of these GPUs as a part of an auction; bid values reflect the app's new (placement-sensitive) finish-time fairness metric from acquiring different GPU subsets. A centralized arbiter then determines the global winning bids, which maximize the aggregate value across all bidding apps, and allocates resources appropriately. While a far-from-fair app may lose an auction, perhaps because it placed less ideally than another app, its bid values for subsequent auctions naturally increase, thereby improving the odds of it winning. Thus, our approach converges to fair allocations over the long term, while staying efficient and placement-sensitive in the short term.

Using auctions means that we need to ensure that apps are truthful when they bid for GPUs. We propose a partial allocation mechanism that incentivizes truth telling by design, is Pareto efficient, and allocates all GPUs maximally.

Overall, our design provides: good visibility, by offering GPUs to a set of apps; good efficiency, by using a centralized arbiter to avoid conflicts between apps and accommodating placement sensitivity; and fairness, by tracking finish-time metrics to determine which apps to offer GPUs to.

We implement Themis atop Apache YARN 3.2.0, which includes support for allocating GPU resources. We replay workloads from a large enterprise trace. Results from our prototype implementation show that Themis comes closer to ideal finish-time fairness than prior schemes, while also improving average app completion time compared to prior schemes. To further understand our scheduling decisions, we perform an event-driven simulation using the same trace, and our results show that Themis does better with an increasing fraction of network-intensive apps and with increasing cluster contention.

2. Motivation

Themis is a fair cluster scheduler for ML applications. We propose a long-term allocation mechanism rooted in a new finish-time fairness metric. To realize this, we present a novel semi-optimistic scheduling architecture that uses auctions as an abstraction.

In this section, we first motivate the need for a new mechanism and metric by highlighting the drawbacks of existing fair schemes in supporting the unique features of ML workloads. We then motivate the need for a new scheduling architecture by highlighting the drawbacks of existing schedulers.

2.1. Definitions

We define an ML app, or an app for simplicity, as a collection of one or more ML model training jobs. Each app corresponds to a user training a machine learning model for a high-level goal, such as speech recognition or object detection. Users train these models knowing the appropriate hyperparameters (in which case there is just a single job in the app), or they train a closely related set of models (multiple jobs) that explore hyperparameters such as learning rate and momentum (Rasley et al., 2017; Li et al., 2016) to identify and train the best target model for the activity at hand.

Each job’s constituent work is performed by a number of parallel tasks. At any given time, all of a job’s tasks collectively process a minibatch of training data; we assume that the size of the batch is fixed for the duration of a job. Each task typically processes a subset of the batch, and, starting from an initial version of the model, executes multiple iterations of the underlying learning algorithm to improve the model. We assume all jobs use the popular synchronous SGD (Chen et al., 2016).

We consider the finish time of an app to be when the best model and relevant hyper-parameters have been identified. Along the course of identifying such a model, the app may decide to terminate some of its constituent jobs early (Rasley et al., 2017; Bergstra et al., 2015); such jobs may be exploring hyper-parameters that are clearly sub-optimal (the jobs’ model accuracy improvement over iterations is significantly poorer than other jobs in the same app). For apps that contain a single job, finish time is the time taken to train this model to a target accuracy.

Goals. We seek to design a scheduler for fair and efficient allocation of cluster GPU resources across such apps.

GPUs are expensive to use and power-hungry. Thus, from a given app's perspective, a primary attribute of fair resource allocation is that the performance of the app, i.e., its finish time, when sharing a cluster $C$ among $N$ apps is no worse than $N$ times its finish time when using all of $C$ to itself. If the app's performance is worse, then the app is better off running in isolation on a $\frac{C}{N}$-sized cluster. We refer to this as the sharing incentive.

From the overall cluster’s perspective, the available GPU resources should be used as efficiently as possible, i.e., overall GPU-time for a given workload of apps should be small.

2.2. Why a new scheduling discipline?

Figure 1. Distribution of task durations for ML training jobs from an enterprise cluster
Figure 2. Effect of GPU resource allocation configuration on job throughput for different models.

Existing fair allocation schemes, e.g., DRF (Ghodsi et al., 2011), provide instantaneous resource fairness. At any instant that resources become available, a task from the app with the least "dominant resource share" is offered resources. These schemes impose two limitations that violate the sharing incentive for ML apps.

1. Short vs long tasks. Instantaneous resource fairness assumes that apps have tasks with infinitesimal durations. For big-data analytics, this is not a limiting assumption and still approximates resource fairness, as tasks are typically short-lived (Ousterhout et al., 2013) and frequent task completions serve as opportunities to redistribute resources. For the case of ML apps, however, this is a severely limiting assumption, as task durations are typically longer, as highlighted in Figure 1, and GPU clusters are heavily contended (Jeon et al., 2018a). Running ML tasks to completion could lead to newly arriving jobs waiting inordinately long for resources and thus lead to violation of the sharing incentive.

2. Placement sensitivity. Placement plays a key role in rendering existing instantaneous fairness schemes fundamentally ineffective. ML apps' finish times are highly sensitive to the relative placement of their tasks. Moreover, different ML models have different placement preferences. For example, as shown in Figure 2, VGG16 has a strict machine-local task placement preference while Resnet50 has no such placement preference. Existing instantaneous resource-fair allocation schemes (Ghodsi et al., 2011) do not offer apps the ability to express preferences over resources' relative placement. (DRF allows jobs to express task demands as a vector along resource dimensions, but to model placement of ML jobs we need to encode affinity across multiple tasks' demands.) This mismatch can lead to violation of the sharing incentive, as a seemingly resource-fair allocation can be very sub-optimal in terms of placement and lead to a higher finish time for an app. Waiting for optimally placed resources (Zaharia et al., 2010) is not an option, as ML apps have long task durations and this will aggravate waiting time for tasks.

Fairness metric. A key question in the design of a scheduler is what fairness metric the scheduling discipline should target. Two alternatives are resource-based and time-based fairness metrics. DRF and similar schemes target resource fairness; e.g., in DRF, contending apps' dominant-resource shares are equalized. But such resource-centric approaches are a poor fit, as argued above.

Time-based fairness is rooted in classic scheduling (Shreedhar and Varghese, 1996). In this class of schemes, GPUs could each be allocated for a given amount of time (“lease”). A recent variant (Gu et al., 2019) attempts to equalize “attained service”, which is the total amount of GPU time allocated to a job. When a lease expires, an available GPU is allocated to a new app, where apps are considered for allocation in a given order (e.g., round-robin (Shreedhar and Varghese, 1996), or least attained service (Gu et al., 2019)). Unfortunately, time-based approaches are not resource-placement sensitive. It is difficult to encode placement preferences into the attained service metric.

Efficiency. Fairness-centered schemes may leave resources fallow: GPUs available now may be a poor fit for the farthest-from-fairshare app, but may be a good fit for another app. Given the importance of efficient GPU utilization, our scheduling discipline cannot afford to leave GPUs unused.

Overall, we need a new scheduling discipline that is not tied to short task durations, is placement sensitive, does not rely on time- or resource-based fairness, and puts efficiency on an equal footing with fairness.

2.3. Why a new scheduler architecture?

ML apps that perform hyperparameter exploration use custom frameworks with internal schedulers (Li et al., 2016; Rasley et al., 2017) to manage allocation of resources across the apps' constituent jobs. The frameworks typically classify jobs into multiple classes, such as "good", "average" or "poor", and the internal scheduler uses custom logic to apportion resources among these classes.

One possible architecture for a scheduler is a monolithic design that combines both cross-app and intra-app resource allocation. Such a design is too complex and does not leave room for apps’ custom allocation logic for apportioning GPUs among many internal ML job classes.

Thus, we consider instead a two-level scheduler architecture. The top level scheduler is the app-specific one. The bottom level manages allocation across apps.

Two state-of-the-art design approaches for the bottom-level inter-app scheduler are pessimistic and optimistic. Neither architecture is a good fit in our context.

A pessimistic architecture (Hindman et al., 2011) means that available cluster GPU resources are offered to one specific app at a time. To enforce placement preferences, the chosen ML app may reject an offer and wait. However, the app may be forced to wait for many offers to get a placement-optimal allocation, and such delays can become unbounded due to ML tasks' long durations, leading to violation of the sharing incentive.

In optimistic architectures (Schwarzkopf et al., 2013), all the resources in the cluster are made visible and potentially available to all ML apps. Any app can claim resources in a lock-free manner. The optimistic architecture operates under the assumption that contention for the same set of cluster resources across apps is unlikely, and even in the case of a conflict, a simple machine-local conflict resolution mechanism (e.g., scalar priority-based) can resolve conflicts. However, all ML apps typically compete for GPU resources, and GPU clusters are heavily contended (Jeon et al., 2018b). This can lead to many conflicts in an optimistic architecture. Also, if we rely excessively on a machine-local conflict resolution mechanism, we are not guaranteed convergence to a globally optimal solution that ensures sharing incentive for all ML apps.

Thus, we need a two-level scheduler that is neither pessimistic nor optimistic.

3. Themis

We now describe the main ideas employed by the Themis scheduler and present a design overview. The scheduler attempts to simultaneously achieve fairness while accounting for placement sensitivity, and efficiency.

Fairness metric. Our scheduler uses a new metric called finish-time fairness. This is defined as $\rho = T_{sh} / T_{id}$, i.e., the ratio of the shared running time ($T_{sh}$) to the ideal running time ($T_{id}$). The former ($T_{sh}$) is the estimated running time of the ML app with its current share of GPU resources in a cluster shared with other apps. The latter ($T_{id}$) is the running time of the ML app in a dedicated (un-shared) cluster. Ideally, $\rho$ should be close to (and less than) $N$, the number of apps sharing the cluster.

Our scheduling policy (Sec. 4) minimizes the maximum finish-time fairness metric $\rho$ across apps, while attempting to keep each app's value close to ideal. We show that this policy ensures sharing incentive (Sec. 8).
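As a minimal sketch of the metric just described (the ratio of shared running time to ideal running time), the snippet below computes the ratio and checks it against the sharing-incentive bound. The function names and the example numbers are hypothetical and chosen only for illustration; `num_apps` stands for the number of apps sharing the cluster.

```python
def finish_time_fairness(t_shared: float, t_ideal: float) -> float:
    """Ratio of the estimated shared running time to the dedicated-cluster running time."""
    if t_ideal <= 0:
        raise ValueError("ideal running time must be positive")
    return t_shared / t_ideal


def has_sharing_incentive(rho: float, num_apps: int) -> bool:
    """Sharing is worthwhile if the app is at most num_apps times slower than running alone."""
    return rho <= num_apps


# Illustrative numbers only: an app needs 10 hours alone and an estimated
# 30 hours when the cluster is shared among 4 apps.
rho = finish_time_fairness(t_shared=30.0, t_ideal=10.0)
print(rho, has_sharing_incentive(rho, num_apps=4))
```

With four apps, a ratio of 3 satisfies the bound, while a ratio above 4 would mean the app is better off with its own quarter of the cluster.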

Semi-optimism and auctions. Given the constraints imposed by pessimistic and optimistic architectures, we adopt a middle ground. We use a semi-optimistic inter-app scheduler design that separates visibility from allocation. When GPU resources are available, our scheduler makes them, along with their locations, visible to a set of apps, as opposed to making resources visible to a single app. Crucially, the set of apps to whom available GPUs are visible are those whose finish-time fairness metrics are the worst. Each app in this set places bids on subsets of visible GPUs, reflecting the placement-sensitive improvement in fairness the app would derive from acquiring a given subset. Our scheduler runs an auction that picks a "winning" app for each disjoint resource subset. The set of winning apps maximizes an aggregate function of the improvements derived, where the function prioritizes efficiency.

Short-term efficiency and long-term fairness. Note that finish-time fairness reflects the long-term effect of fair allocation, which is only apparent when an app terminates. Our scheduler leverages this by enforcing that apps achieve their ideal finish-time fair metric by the time they finish. By doing so, our approach balances fairness with efficiency and placement sensitivity. In particular, on finer minutes-long timescales, when resource auctions occur, our scheduler may trade off fairness momentarily to allocate GPUs to apps that place well and can help maximize efficiency. The momentary deviation from fairness is made up for during future GPU auctions, as we explain in Section 5.

Figure 3. Themis Design. (a) Sequence of events in Themis, starting with a pool of resources becoming available and ending with resource allocations. (b) A typical bid valuation table that an App submits to the Arbiter. Each row in the table has a subset of the complete resource allocation and the value of the finish-time fairness metric $\rho$ with this additional allocation.

3.1. Design Overview

Figure 3 pictorially shows our architecture. The cross-app scheduler, i.e., Arbiter, is where the scheduling logic is implemented. The top level per-app schedulers are minimally modified to interact with the cross-app scheduler.

Each GPU in a Themis-managed cluster has a lease associated with it. The lease dictates how long an app can assume ownership of the GPU to run its workload. When a lease expires, the resource is made available for allocation. Themis’s Arbiter pools available resources periodically and makes them visible to ML apps. In particular, the following typical steps ensue (shown in Figure 3):


1. The Arbiter asks all apps for estimates of their current finish-time fairness metrics.

2. The Arbiter initiates auctions by offering available resources only to a fraction of ML apps with the worst finish-time fairness metrics. A fairness knob that sets this fraction enables trading fairness for efficiency. To minimize the changes an ML app scheduler needs in order to participate in auctions, Themis introduces an Agent that is co-located with each ML app scheduler. The Agent serves as an intermediary between the ML app and the Arbiter.

3. An app's Agent replies to each offer it receives with a bid. To enable preparation of bids, Themis implements a narrow API from the ML app scheduler to the Agent that enables propagation of app-specific information. In response to an offer, each Agent prepares a single bid. This bid contains a valuation function ($\rho(\cdot)$) that provides, for each resource subset, a "value", i.e., the Agent's estimate of the finish-time fairness metric the app will achieve with the allocation of the resource subset. The valuation is calculated based on app-specific information such as placement preferences, total work, current allocation, and max-parallelism of the app's constituent jobs (Section 5.2).

4. On receiving all the bids, the Arbiter picks winning bids according to a partial allocation algorithm (Section 5.1), and notifies each Agent of its winning allocation. The algorithm incentivizes truth-telling, is Pareto-efficient, and allocates all GPUs.

5. The Agent propagates the allocation to the ML app scheduler, which can then decide how to allocate the GPUs among its constituent jobs.

Thus the above steps reflect how our design achieves finish time fairness and cluster efficiency across ML applications.

In the following few sections, we first describe how the goal of cluster-efficiency with finish-time fairness can be translated into an allocation by the Arbiter. Following that we describe the necessary mechanisms to build a semi-optimistic two level scheduler corresponding to each step outlined above.

4. Finish-Time Fair Policy

To achieve sharing incentive on the finish-time fairness metric $\rho$, we propose a max-min fair allocation policy. This leads us to an allocation that corresponds to the solution of the following optimization program:

(2) $\min_{\{G_i\}} \; \max_{i} \; (\rho_i - N)$
subject to $\rho_i = f(G_i, W_i)$ for every app $i$

In the above optimization formulation, given $N$ apps, we minimize the maximum deviation from $N$, which is the ideal value of the finish-time fairness metric $\rho$. $G_i$ is the GPU allocation vector for an ML app $i$, where the entry $G_i[j, k]$ is set if GPU $k$ on machine $j$ is allocated to app $i$. The GPU resource allocation vector encodes the amount, the type, and the placement of GPUs allocated to an app. $W_i$ is the estimate of total work for an app $i$. The finish-time fairness metric ($\rho_i$) is typically a function, $f$, of the resource allocation ($G_i$) and the estimate of total work for all the constituent jobs ($W_i$). $f$ captures the placement sensitivity, $S_i$, of the running time for app $i$. Because of this property of $f$, the above formulation accounts for cluster efficiency in addition to fairness.
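To make the max-min idea concrete, here is a small brute-force sketch for a toy cluster of two machines with two GPUs each and two apps. Everything in it is an assumption made for illustration: the throughput model in `speed` (GPUs spread over more than one machine run at a `sensitivity` fraction of full speed), the app parameters, and the tie-break on the sum of the metrics. It is not the paper's solver.

```python
from itertools import product

MACHINES = [2, 2]  # GPUs per machine in the toy cluster (assumption)


def speed(alloc, sensitivity):
    """Hypothetical throughput: GPUs spread over several machines are slowed by `sensitivity`."""
    gpus = sum(alloc)
    machines_used = sum(1 for g in alloc if g > 0)
    return gpus * (sensitivity if machines_used > 1 else 1.0)


def rho(alloc, work, sensitivity):
    """Finish-time fairness: shared running time over running time on the whole cluster."""
    s = speed(alloc, sensitivity)
    if s == 0:
        return float("inf")  # no GPUs means the app never finishes
    t_shared = work / s
    t_ideal = work / speed(MACHINES, sensitivity)
    return t_shared / t_ideal


def max_min_allocation(apps):
    """apps: two (work, sensitivity) pairs. Enumerates every per-machine split."""
    per_machine = [
        [(a, b) for a in range(cap + 1) for b in range(cap + 1 - a)]
        for cap in MACHINES
    ]
    best = None
    for choice in product(*per_machine):
        allocs = [tuple(c[i] for c in choice) for i in range(2)]
        rhos = [rho(al, w, s) for al, (w, s) in zip(allocs, apps)]
        key = (max(rhos), sum(rhos))  # minimize the worst metric, then the total
        if best is None or key < best[0]:
            best = (key, allocs, rhos)
    return best[1], best[2]


# App 0 is placement sensitive, app 1 is not (illustrative values).
allocs, rhos = max_min_allocation([(100.0, 0.5), (200.0, 1.0)])
print(allocs, rhos)
```

In this toy instance the search keeps the placement-sensitive app on a single machine, which is the kind of placement-aware outcome the formulation is meant to favor.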

We first show that the solution to the above induces sharing incentive in the case where all apps start at the same time, and resources are apportioned offline prior to apps starting.

Observation: The above solution enforces sharing incentive.

Justification: Say there are $N$ ML apps, and let $\rho_i$ denote the finish-time fairness metric of app $i$. Divide the cluster $C$ into $N$ ideal shares, each corresponding to a $\frac{C}{N}$-sized dedicated cluster. By definition, this allocation with dedicated resources offers sharing incentive (assuming linear scaling in app running time when the app runs in a dedicated cluster), i.e., $\rho_i = N$ for all $i$, which implies $\max_i \rho_i = N$. Given this, the optimal solution to the above program is one with $\max_i \rho_i \le N$, which implies $\rho_i \le N$ for all $i$. Thus, the optimal solution offers sharing incentive.

In practice, we cannot assume all apps start at the same time. Also, the finish-time metric is time-varying. ML apps' placement sensitivity ($S_i$) is often not known accurately ahead of time, and becomes apparent only as jobs execute. Also, estimated total work ($W_i$) depends on convergence properties of the training algorithm and keeps changing with time, with the magnitude of changes diminishing over time. It also depends on how many constituent jobs the app's internal scheduler decides to keep alive or terminate; $W_i$ changes substantially when the app terminates a job that is exploring "poor" hyperparameter settings.

A Strawman. To address this, one could consider a practical "online" strawman wherein each resource allocation decision is binding for a "lease" duration, after which the resource is up for reallocation. Each app can send updated values of its finish-time fairness metric $\rho$ to the Arbiter just before a reallocation. The Arbiter can then use these updated values to reallocate resources to the app with the worst $\rho$. However, there are two issues in this strawman.

First, allocating to just one app can be inefficient. The Arbiter may end up allocating GPUs to the app in a placement insensitive manner. The new allocation may span across many machines or racks relative to the app’s currently held GPUs, and can lead to inefficient overall GPU use. Given the long-term focus of our fairness metric, this makes the case for relaxing strict tracking of finish-time fairness and increasing the visibility of available resources, because offering resource to a different app that places better on the available GPUs can be more efficient.

Second, the Arbiter trusts the apps to be truthful about their finish-time fairness metric. At each reallocation event, an adversarial app can obtain many of the available GPUs by over-reporting the value of its metric $\rho$, which can hurt honest apps' sharing incentives. This makes the case for incentivizing apps' truthfulness in reporting their finish-time fairness metric.
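The truthfulness problem in the strawman can be seen in a few lines. The sketch below assumes the simplest possible strawman rule, namely handing the freed GPUs to whichever app reports the largest metric; the app names and values are invented for illustration.

```python
from typing import Dict


def strawman_reallocate(reported_rho: Dict[str, float]) -> str:
    """Give the freed GPUs to the app that reports the worst (largest) metric."""
    return max(reported_rho, key=reported_rho.get)


true_rho = {"honest_a": 3.0, "honest_b": 2.5, "adversary": 1.2}
print(strawman_reallocate(true_rho))  # the truly worst-off app wins

# The adversary inflates its report and captures the GPUs instead.
inflated = dict(true_rho, adversary=10.0)
print(strawman_reallocate(inflated))
```

Because the rule depends only on self-reported values, nothing stops the inflated report from winning, which is the gap the auction design is meant to close.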

5. Semi-Optimistic Scheduler

Applications $\{A_i\}$  ▷ set of active apps
procedure offerResources($R$)  ▷ triggered when resources $R$ are available
       $A^{sort}$ = sort($\{A_i\}$) on metric($A_i$)  ▷ default metric is the finish-time fairness metric $\rho_i$
       $A^{filter}$ = get top fraction of apps from $A^{sort}$  ▷ the fraction is set by the fairness knob
       ▷ auction resources $R$ to all apps in $A^{filter}$
       bids = []
       for all $A_i \in A^{filter}$ do
             auction($R$) to $A_i$  ▷ send an offer to the Agent
             $b_i$ = $A_i$.prepareBids($R$)  ▷ get bid table $b_i$ = { ($G$, $\rho^{new}$) } from the Agent for different subsets $G$
             bids.append($b_i$)
       end for
       allocations = partialAllocation(bids)
       for all (app, job, resources) ∈ allocations do
             assign resources to job in app for the lease duration
             … = max(…, …)
       end for
end procedure
Pseudocode 1 Themis Arbiter

In Themis, we separate visibility from allocation using a semi-optimistic online scheduler at the Arbiter. The design helps us control both efficiency and fairness.

The Arbiter implements a generalized version of the online algorithm presented before (Pseudocode 1) with modifications to improve efficiency and incentivize truth-telling. After first asking all apps to report their current finish-time fairness metric $\rho$, the Arbiter makes any available GPU resource visible to a fraction of ML apps, taken from the top of the list sorted in decreasing order of the reported $\rho$ values; a fairness knob sets this fraction. Higher values of the knob lead to stronger guarantees on finish-time fairness. Lower values, which trade off fairness, allow the Arbiter to find placement-efficient allocations. We evaluate the sensitivity of Themis to the value of this knob in Section 8.
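The filtering and bidding loop of Pseudocode 1 can be sketched as follows. This is only an illustration under stated assumptions: `visible_fraction` is a hypothetical parameter that directly gives the share of apps that see the offer (the mapping from the paper's fairness knob to this share is not reproduced here), and `prepare_bids` and `partial_allocation` are placeholder callables standing in for the Agent and the auction.

```python
import math
from typing import Any, Callable, Dict, List


def select_bidders(current_rho: Dict[str, float], visible_fraction: float) -> List[str]:
    """Return the worst-off apps (largest metric first) that will see the offer."""
    if not 0.0 < visible_fraction <= 1.0:
        raise ValueError("visible_fraction must be in (0, 1]")
    ranked = sorted(current_rho, key=current_rho.get, reverse=True)
    count = max(1, math.ceil(visible_fraction * len(ranked)))
    return ranked[:count]


def offer_resources(
    current_rho: Dict[str, float],
    visible_fraction: float,
    prepare_bids: Callable[[str, Any], Any],
    partial_allocation: Callable[[Dict[str, Any], Any], Any],
    free_gpus: Any,
) -> Any:
    """One auction round: filter apps, collect one bid per app, run the auction."""
    bidders = select_bidders(current_rho, visible_fraction)
    bids = {app: prepare_bids(app, free_gpus) for app in bidders}
    return partial_allocation(bids, free_gpus)


reported = {"a": 1.5, "b": 4.0, "c": 2.0, "d": 3.0}
print(select_bidders(reported, visible_fraction=0.5))
```

Showing the offer to fewer apps tracks the worst-off apps more strictly, while showing it to more apps gives the auction more room to find well-placed allocations.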

The Arbiter computes a globally-optimal and conflict-avoiding allocation of GPUs. It does so by treating available GPUs as (bundled) goods to be auctioned to the apps. Multiple apps can bid for overlapping sets of GPUs. The auction allocates each GPU to exactly one app, thus avoiding any form of conflict resolution or concurrency control (Schwarzkopf et al., 2013). The set of app-GPU pairs is selected to optimize a "social objective": instead of uniformly spreading GPUs across the bidding apps, the Arbiter preferentially allocates to apps that pack well and collectively see the greatest aggregate improvement in their finish-time metrics.

We describe our auctions next. Truth-telling is supported by design.

5.1. Auctions

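1: Applications $\{A_i\}$  ▷ set of active apps
2: Bids $\{\rho_i(\cdot)\}$  ▷ valuation function for each app $i$
3: Resources $R$  ▷ resource set available for auction
4: Resource Allocations $\{G_i\}$  ▷ resource allocation for each app $i$
6: procedure partialAllocation
7:        $G^{pf}$ = proportional fair (pf) allocation per app
8:        $G^{pf}_{-i}$ = pf allocation per app without app $i$
9:        $c_i$ = collective valuation of the other apps under $G^{pf}$ divided by their collective valuation under $G^{pf}_{-i}$
10:        $G_i$ = $c_i$ * $G^{pf}_i$  ▷ allocation per app $i$
11: end procedure
Pseudocode 2 Partial Allocation Mechanism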

We first describe the inputs necessary to run the auction, then how the auction works given these inputs, and finally the fairness and efficiency it achieves. We defer discussion of how to gather inputs to Section 5.2.

Inputs: Resource offer, and bids. Let $R$ be the set of available GPUs that are offered to the fraction of far-from-fair apps (those with the worst finish-time fairness metric $\rho$). Each dimension in $R$ represents the number of unused GPUs in a given machine in the cluster.

Each app, on receiving an offer for $R$, replies with a single bid. Every ML app tries to win fractions of each of the machines' unused GPUs. In a machine with $k$ available GPUs, the permissible fractions are $\frac{1}{k}$, $\frac{2}{k}$, …, $\frac{k}{k}$.

App $i$'s bid is a valuation function, $\rho_i(\cdot)$, that maps different allocations, $G_i$, to app $i$'s new finish-time metric, $\rho_i^{new}$, that results from adding the GPUs in $G_i$ to the existing GPUs that app $i$ currently has, and assuming that all GPUs (existing and newly added) are used from this point on till app $i$'s completion. Each allocation $G_i$ identifies the fraction of each machine's free GPU resources desired by the app.

Given discrete subsets of resources, $\rho_i(\cdot)$ for app $i$ is a table, with a row for each choice of $G_i$. One of the rows covers the case where the app receives no new allocation ($G_i$ is all zeros), with the corresponding valuation being the app's current finish-time fairness metric.
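A bid table of the kind described above can be sketched as a dictionary from candidate allocations to the resulting metric. The sketch expresses allocations as GPU counts per machine rather than fractions, and relies on a hypothetical throughput model (`speed`, with a `sensitivity` slowdown when GPUs span machines) and invented parameters (`elapsed`, `remaining_work`, `t_ideal`); none of these come from the paper.

```python
from itertools import product
from typing import Dict, Sequence, Tuple


def speed(alloc: Sequence[int], sensitivity: float) -> float:
    """Hypothetical throughput: GPUs spread over several machines are slowed by `sensitivity`."""
    gpus = sum(alloc)
    machines_used = sum(1 for g in alloc if g > 0)
    return gpus * (sensitivity if machines_used > 1 else 1.0)


def bid_table(
    free: Sequence[int],      # unused GPUs per machine in the offer
    held: Sequence[int],      # GPUs the app already holds per machine
    elapsed: float,           # time the app has already run
    remaining_work: float,    # remaining work in GPU-time units
    t_ideal: float,           # running time on a dedicated cluster
    sensitivity: float,
) -> Dict[Tuple[int, ...], float]:
    """Map every subset of the offer to the app's new finish-time fairness metric."""
    table = {}
    for extra in product(*(range(f + 1) for f in free)):
        total = tuple(h + e for h, e in zip(held, extra))
        s = speed(total, sensitivity)
        t_shared = float("inf") if s == 0 else elapsed + remaining_work / s
        table[extra] = t_shared / t_ideal
    return table


table = bid_table(free=(1, 2), held=(1, 0), elapsed=2.0,
                  remaining_work=8.0, t_ideal=4.0, sensitivity=0.5)
for extra, value in sorted(table.items()):
    print(extra, round(value, 3))
```

The all-zeros row carries the app's current metric, and in this toy setting one extra GPU on the machine the app already uses is worth more than one on a different machine.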

Auction overview. Once all the bids are received by the Arbiter, it applies a partial allocation (PA) mechanism (Cole et al., 2013) to pick the winning resource allocations per app. Some auction-participating apps may receive no new allocations. Pseudocode 2 shows the PA mechanism.

PA starts with an initial proportional fair allocation, then it enforces truth-telling. It may leave GPUs unallocated, which we then assign in a work conserving manner. We describe these three aspects of the auction next. Then, we describe how our auction trades off fairness and efficiency in the short term, but ensures finish time fairness in the long term.

1. Initial allocation. PA starts by calculating an intrinsically proportionally fair allocation for each app (line 7 of Pseudocode 2) by maximizing the product of the apps' valuation functions. The resulting solution is also Pareto-efficient: no app's valuation can be improved without hurting some other app's valuation.

2. Incentivizing Truth Telling. To induce truthful reporting of the valuation functions, the PA mechanism allocates app $i$ only a fraction $c_i$ of $i$'s proportional fair allocation, and takes the remaining fraction $1 - c_i$ as a hidden payment (Pseudocode 2). The fraction $c_i$ is directly proportional to the decrease in collective valuation of the other bidding apps in a market with and without app $i$ (lines 8 and 9). This yields the final allocation $G_i$ for app $i$ (line 10).

In (Cole et al., 2013), it is shown that this mechanism incentivizes truth-telling of the valuation function, under the assumption that the valuation function is homogeneous of degree one, i.e., scaling an allocation by a factor $k$ scales the valuation by the same factor $k$. This assumption holds true in our setting as well: when we increase an app's allocation from $G$ to $k \cdot G$, the relative placement of allocated machines in $G$ and $k \cdot G$ does not change (because we allocate more GPUs on the same set of machines) and the app's valuation improves $k$ times.
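The hidden-payment step can be illustrated on a deliberately tiny market: a single pool of identical GPUs and linear valuations where higher is better. Both simplifications are assumptions made for this sketch (the paper's bids are placement-sensitive tables), and the proportional fair allocation is found by brute force rather than by a real solver. The names `linear`, `proportional_fair`, and `partial_allocation` are illustrative.

```python
from itertools import product
from math import prod
from typing import Callable, Dict

Valuation = Callable[[int], float]  # number of GPUs -> value, higher is better (assumption)


def linear(slope: float) -> Valuation:
    """A valuation that is homogeneous of degree one."""
    return lambda gpus: slope * gpus


def proportional_fair(valuations: Dict[str, Valuation], total_gpus: int) -> Dict[str, int]:
    """Brute-force the integer split that maximizes the product of valuations."""
    apps = list(valuations)
    best, best_score = None, -1.0
    for split in product(range(total_gpus + 1), repeat=len(apps)):
        if sum(split) > total_gpus:
            continue
        score = prod(valuations[a](g) for a, g in zip(apps, split))
        if score > best_score:
            best, best_score = split, score
    return dict(zip(apps, best))


def partial_allocation(valuations: Dict[str, Valuation], total_gpus: int) -> Dict[str, float]:
    """Scale each app's proportional fair share by c_i; the rest is the hidden payment."""
    pf = proportional_fair(valuations, total_gpus)
    final = {}
    for app in valuations:
        others = {a: v for a, v in valuations.items() if a != app}
        pf_without = proportional_fair(others, total_gpus)
        with_app = prod(v(pf[a]) for a, v in others.items())
        without_app = prod(v(pf_without[a]) for a, v in others.items())
        c = with_app / without_app  # how much the others lose because `app` is present
        final[app] = c * pf[app]
    return final


valuations = {"a": linear(1.0), "b": linear(2.0), "c": linear(3.0)}
final = partial_allocation(valuations, total_gpus=6)
print(final, 6 - sum(final.values()))
```

In this example each app keeps only part of its proportional fair share of two GPUs, so some GPUs remain unallocated after the auction; those are the leftovers that still need to be assigned.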

3. Leftover Allocation. Note that, is not a market-clearing allocation, and GPUs could be left unallocated due to hidden payments. The PA mechanism guarantees that at most worst-case fraction of total available resources are left over (in practice we find this to be much lower). At the end of the PA mechanism, our Arbiter allocates all leftover GPUs to apps not participating in the auction in a placement-sensitive manner. Leftover allocation is made one GPU at a time to these apps, such that each new GPU is placed on one of the machines that is already part of an existing allocation for that app. When many such candidate apps exist for a GPU, one of the apps is picked at random. This ensures that Themis allocates all available GPUs and is work conserving.

From the perspective of an app that did not participate in the auction, the set of leftover GPUs is essentially random. In particular, such an app has no way to be strategic in order to win leftover GPUs during an auction. Thus, we preserve the overall truthfulness and Pareto efficiency properties of our cluster scheduler.
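The leftover step can be sketched as follows. This is a minimal illustration under assumptions: `leftover` maps a machine name to its number of unallocated GPUs, `placements` maps each non-participating app to the machines it already occupies, and both names are hypothetical. The text does not say what happens to a GPU on a machine where no such app runs, so this sketch simply leaves it unassigned.

```python
import random


def assign_leftovers(leftover, placements, rng=None):
    """Hand out leftover GPUs one at a time, keeping allocations co-located.

    leftover:   {machine: number of unallocated GPUs}
    placements: {app: set of machines the app already holds GPUs on}
    Returns {app: [machine, ...]} with one entry per granted GPU.
    """
    rng = rng or random.Random(0)
    grants = {}
    for machine in sorted(leftover):
        # Candidates are apps that already have GPUs on this machine.
        candidates = sorted(a for a, ms in placements.items() if machine in ms)
        for _ in range(leftover[machine]):
            if not candidates:
                break  # not specified in the text; left unassigned here
            app = rng.choice(candidates)  # random pick among candidates
            grants.setdefault(app, []).append(machine)
    return grants
```

Because the winner of each GPU is drawn at random among co-located apps, an app gains nothing by misreporting in order to capture leftovers.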

Short-term efficiency, long-term fairness. We now return to how, via auctions, we are able to trade off fairness in the short term while the scheduler still achieves fairness in the long term. Consider a far-from-fair app that lost an auction. It will appear in future auctions with much greater likelihood than a less far-from-fair app that won the auction. This is because the winning app was allocated resources; as a result, its fairness metric will improve, and it will eventually drop out of the fraction of not-so-fairly-treated apps that participate in future auctions. In contrast, the losing app’s fairness metric stays the same, and thus it will continue to appear in future auctions. Further, an app that loses multiple auctions will eventually lose its lease on all resources and make no further progress, causing its finish-time fairness metric to become unbounded. The next auction the app participates in will likely see the app’s bid winning, because any non-zero GPU allocation to that app will lead to a huge improvement in the app’s valuation.
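The filtering effect described above can be sketched in a few lines. The sketch assumes that the apps invited to an auction are the worst-off (1 - f) fraction ranked by their finish-time fairness metric, where f is the fairness knob; the exact fraction, the tie-breaking by name, and the function name `auction_participants` are assumptions of this example.

```python
import math


def auction_participants(metric_by_app, f):
    """Pick the worst-off apps (largest metric first) for the next auction.

    metric_by_app: {app: current finish-time fairness metric}; an app with no
    GPUs can report math.inf, which always ranks first.
    f: fairness knob in [0, 1); assumed here to mean that the worst-off
       (1 - f) fraction of apps is invited to bid.
    """
    if not 0.0 <= f < 1.0:
        raise ValueError("f must be in [0, 1)")
    ranked = sorted(metric_by_app, key=lambda a: (-metric_by_app[a], a))
    k = max(1, math.ceil((1.0 - f) * len(ranked)))
    return ranked[:k]
```

An app that keeps losing auctions sees its metric grow and therefore stays in (or re-enters) the invited set, while a winner's metric improves and it eventually drops out.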

5.2. Interfaces between Agent and ML App Scheduler

As mentioned in Section 3, an Agent co-resident with an app helps the app participate in auctions. We now describe how Agents prepare bids based on inputs provided by apps, the API between an Agent and its app, and how Agents integrate with current hyper-parameter optimization schedulers.

Calculating the finish-time fair metric. The Arbiter probes all apps’ Agents prior to auctions for their current finish-time fair metrics. In addition, if an app is chosen to participate in an auction, its Agent needs to provide, as part of its valuation function, estimates of the new metric values resulting from potential new resource allocations from the Arbiter. We now describe how the Agent computes these estimates. For simplicity, we focus on how the Agent computes the new metric resulting from a possible new allocation; calculating and reporting the current metric upon being probed by the Arbiter is similar.

To calculate the metric resulting from being allocated a particular subset of GPUs in app ’s valuation function table, the Agent follows these steps:

  1. Calculate the aggregate () of the GPU allocation vector in the bid () and the already assigned GPU allocation vector for app : = .

  2. Obtain estimates of the app ’s total work, and total work left for each constituent job in , i.e., and , respectively (more on estimating these quantities shortly). Work is measured in GPU-hours.

  3. Obtain the placement sensitivity for each job in app , . We model as the slowdown observed when GPUs allocated for the job span different networking boundaries. We explain this below. indicates close-to-ideal placement.

    With an ideal placement, the running time of a job linearly scales with the number of GPUs () i.e. , where is serial running time with a single GPU. However, scaling is less than linear in proportion to the factor depending on the topological spread of GPUs i.e. , where depends on the spread of GPUs in the GPU allocation vector of the job. We typically have three values for , one each reflecting the case where GPUs span different slots in a machine; span multiple machines in a rack; and span racks. also depends on the overhead of communication during training and includes the model parameter sizes, typical parameter gradient sizes, and any training time communication optimizations (more on estimating shortly).

  4. The Agent estimates the shared running time for app as: ( + ), where is the allocation reserved for job in app . The in the calculation above reflects the fact that the job with the best hyper-parameters, i.e., the one that trains to a target accuracy fastest, is the one that determines the app’s finish time. Note that given the overall app-level allocation , the Agent computes the job-level allocation in a greedy manner, where GPUs are assigned to jobs in a placement sensitive manner.

  5. The Agent calculates the ideal running time as:
    (), where is the ideal GPU allocation vector. (Again, plays the same role as above.) Each job has an upper limit on the number of tasks it can parallelize across; the ideal GPU allocation vector is thus the one with this maximum parallelism and best placement sensitivity.

  6. The Agent computes the estimate of the finish-time fair metric for the app as the ratio of the shared running time (step 4) to the ideal running time (step 5).

  7. The Agent inserts , in the valuation function table that it provides to the Arbiter.
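The steps above can be sketched as follows. Several details are assumptions of this example rather than statements from the text: placement sensitivity is taken as a factor in (0, 1] with 1.0 meaning ideal placement, a job's remaining time is modelled as work left divided by (GPUs × sensitivity), the shared running time is taken to be the time already elapsed plus the smallest remaining time across jobs, and the ideal placement is given sensitivity 1.0. The `Job` fields and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Job:
    total_work: float      # GPU-hours for the whole job
    work_left: float       # GPU-hours still to do
    gpus: int              # GPUs the job would hold under this bid
    sensitivity: float     # assumed in (0, 1]; 1.0 = ideal placement
    max_parallelism: int   # upper limit on GPUs the job can use


def shared_running_time(jobs, elapsed_hours):
    """Elapsed time plus the fastest job's remaining time (assumed model)."""
    running = [j for j in jobs if j.gpus > 0 and j.sensitivity > 0]
    if not running:
        return math.inf  # no GPUs: the app makes no progress
    return elapsed_hours + min(
        j.work_left / (j.gpus * j.sensitivity) for j in running
    )


def ideal_running_time(jobs):
    """Fastest job at maximum parallelism and ideal placement."""
    return min(j.total_work / j.max_parallelism for j in jobs)


def finish_time_fairness(jobs, elapsed_hours):
    """Ratio of shared to ideal running time; lower is better."""
    return shared_running_time(jobs, elapsed_hours) / ideal_running_time(jobs)
```

An Agent following this sketch would evaluate `finish_time_fairness` once per row of its valuation table, each time with the job-level GPU counts implied by that row's aggregate allocation.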

ML App Scheduler to Agent API. Note that preparation of the bid requires the app-scheduler to provide the amount of total work and work left per job in the app ( and ), the placement sensitivity (), and the maximum parallelism assigned per job (). Note that this is not static information, as the placement sensitivity () is dependent on the model and the cluster environment (network technology and average interfering traffic). Also, the work left per-job depends on the convergence properties of the job. Thus, at the beginning of a bid preparation, the Agent pulls this information from the ML app scheduler.

Obtaining Estimates from App Schedulers. We now discuss how existing app schedulers can be easily modified to provide Agents with the above needed information such that Agents can calculate their bids. We focus on HyperBand (Li et al., 2016) and HyperDrive (Rasley et al., 2017), two popular ML app schedulers for hyperparameter optimization in model training.

App scheduler background: HyperBand launches several ML training jobs each with user-configured equal “priority” i.e. equal or equal maximum parallelism. HyperBand kills the bottom-half of jobs with poor convergence periodically after a fixed number of iterations until a single job remains.

HyperDrive launches several ML training jobs with user-configured equal priority i.e. equal maximum parallelism to begin with. It continually monitors the jobs’ loss convergence properties to classify jobs as good, promising, and poor. HyperDrive then gives varying execution priorities to different jobs by controlling the maximum parallelism for each constituent job, with higher priorities for good jobs and terminating a job as soon as it is classified as poor.
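The halving behaviour described for HyperBand can be illustrated with a short successive-halving loop. This is a simplified sketch, not HyperBand's actual code: `loss_after` is a hypothetical callback returning a job's loss after a given number of iterations, and with an odd number of jobs the sketch keeps the larger half.

```python
def successive_halving(jobs, loss_after, iters_per_round):
    """Repeatedly train all surviving jobs, then kill the worse half.

    jobs:            non-empty list of job identifiers
    loss_after:      callable (job, iterations_trained) -> loss (lower is better)
    iters_per_round: fixed number of iterations between culls
    Returns (winning job, list of survivor lists per round).
    """
    if not jobs:
        raise ValueError("need at least one job")
    alive = list(jobs)
    trained = 0
    history = [list(alive)]
    while len(alive) > 1:
        trained += iters_per_round
        ranked = sorted(alive, key=lambda j: loss_after(j, trained))
        alive = ranked[: (len(ranked) + 1) // 2]  # keep the better half
        history.append(list(alive))
    return alive[0], history
```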

Work estimation: In either case, deciding which jobs to kill/keep, or into which category to classify a job, is done by curve-fitting and estimating the number of iterations to completion of the job. Jobs with too many projected iterations to complete are killed (HyperBand) or classified as poor (HyperDrive). Thus, we minimally modify these schedulers to report their internally-tracked projected iterations to completion for each job as their estimates for the amount of work left.
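As an illustration of the curve-fitting idea, the sketch below fits a power law, loss ≈ a · t^(-b), to observed losses by least squares in log-log space and extrapolates the iteration at which a target loss is reached. The power-law form is an assumption chosen for simplicity; the text does not state which curve families HyperBand or HyperDrive fit, and the function names are hypothetical.

```python
import math


def fit_power_law(iters, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(t); returns (a, b)."""
    xs = [math.log(t) for t in iters]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def iterations_left(iters, losses, target_loss):
    """Projected iterations remaining until the fitted curve hits target_loss."""
    a, b = fit_power_law(iters, losses)
    if b <= 0:
        return math.inf  # loss is not decreasing; the job never converges
    total = (a / target_loss) ** (1.0 / b)
    return max(0.0, total - iters[-1])
```

Multiplying the projected iterations by a measured per-iteration GPU time would turn this into an estimate of work left in GPU-hours.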

Placement sensitivity: The only information that is typically missing from app schedulers is the placement sensitivity (). This requires profiling the iteration times for different placement options for an app’s jobs. Luckily, we find that all jobs within an app (especially those that share a model structure, but differ in other hyper-parameters) typically have correlated placement sensitivity. This helps with two practical simplifications: First, we can use a single for an entire app . Second, we can be opportunistic and exploit the diversity in placement across constituent jobs in the app during their initial few training iterations to measure the effect of different placements in the cluster. Thus, over a small period of time, we can get accurate measures of .
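One simple way to turn profiled iteration times into a per-app sensitivity estimate is sketched below. It assumes that sensitivity for a placement level is the ratio of the fastest observed median iteration time to that level's median iteration time, so the best placement scores 1.0; this definition, the level names, and `estimate_sensitivity` are assumptions of the example.

```python
from statistics import median


def estimate_sensitivity(iter_times):
    """Estimate placement sensitivity per placement level.

    iter_times: {placement level: [iteration times in seconds]}, pooled across
                the app's jobs during their first few iterations.
    Returns {placement level: value in (0, 1]}, 1.0 for the fastest level.
    """
    medians = {lvl: median(ts) for lvl, ts in iter_times.items() if ts}
    if not medians:
        raise ValueError("no profiling samples")
    fastest = min(medians.values())
    return {lvl: fastest / m for lvl, m in medians.items()}
```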

This concludes our discussion of how the Arbiter and Agent are designed and we next discuss how these mechanisms are implemented in Themis.

6. Discussion

Favoring Short Apps. Our scheduling discipline implicitly prioritizes apps with small values of : If such jobs don’t win initial auctions, then their s quickly deteriorate, making them likely to win allocations in future auctions. By virtue of their small overall work, such apps can then complete quickly. Note that this does not affect finish time fairness for apps with large . This property is particularly attractive for exploratory apps, where users want to quickly examine if a strawman model is worth improving, or for short apps that have just one job (with chosen optimal hyper-parameters).

Placement constraints. Some apps may have placement constraints due to strict GPU memory requirements for large models. Our auctions naturally account for app placement constraints: the valuation table entries for bids containing placement constraint-violating resource allocations would have infinite value because our placement sensitivity metric, , for such resource allocations is .

Early-terminating jobs. Jobs that are not going to lead to the best configuration are terminated by hyperparameter tuning frameworks such as HyperBand or HyperDrive. In this case, the vacated GPUs are reassigned by these app schedulers to other running jobs, causing the finish-time fairness metric for the app to drop rapidly. The app will then not be chosen in the following auctions. However, over time, as the app loses its lease on currently held GPUs, its finish-time fairness metric will increase, bringing the app back into play in auctions. This again highlights how our long-term fairness metric automatically accommodates the effect of early-terminating jobs.

Scheduling after failures. Themis may pack apps into GPUs that share a failure domain. For example, if we allocate all 8 GPUs for a job on a single machine, then a machine failure would mean the job loses all its resources, stalls in its progress, and has to be rescheduled immediately to start making progress again. This could trigger GPU reallocation away from other apps currently running on other machines toward the rescheduled app. We leave a systematic study of the effect of failures on scheduling for future work.

7. Implementation

We implement Themis on top of a recent release of Apache Hadoop YARN (Apache Hadoop, 2013) (version ), which includes Submarine (Apache Hadoop Submarine, 2019), a new framework in Hadoop for running TensorFlow (Abadi et al., 2016) ML training jobs atop YARN. We made changes to both Submarine and the YARN Resource Manager (RM).

We modify the Submarine client to support submitting a group of ML training jobs as required by hyperparameter exploration ML apps (Rasley et al., 2017; Li et al., 2016). Once an app is submitted, it is managed by a Submarine Application Master (AM) and we make changes to the Submarine AM to enable managing a group of jobs in an app instead of just a single job. These changes involve implementation of the ML app scheduler (we implemented Hyperband (Li et al., 2016)) and Themis’s Agent.

To prepare accurate bids, we implement a profiler in the AM that parses TensorFlow logs written to HDFS, and tracks iteration times and loss values for all the jobs in an app. The allocation of a job changes over time and iteration times are used to accurately estimate the placement sensitivity () from different GPU placements. Loss values are used in our HyperBand implementation to find a best-fit sub-linear or super-linear curve and thus estimate the amount of work left () per-job to reach target accuracy. Our Agent implementation uses profiler statistics to prepare bids.

Themis’s Arbiter is implemented as a separate module in YARN RM. We add gRPC-based interfaces between the Agent and the Arbiter to enable probes/offers from the Arbiter to the Agent, bids back from the Agent to the Arbiter, and the final winning allocations from the Arbiter to the Agent. Further, the Arbiter tracks GPU leases to offer reclaimed GPUs as a part of the offers.

All the jobs we use in our evaluation are TensorFlow programs with configurable hyperparameters, model architecture, and dataset type. To enable elastic resource allocations during the course of a job, the TensorFlow programs were modified to checkpoint model parameters to HDFS every few iterations. After any change in the GPU allocation, the program resumes training from the most recent checkpoint.
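The checkpoint-and-resume pattern can be illustrated without TensorFlow or HDFS. The sketch below is a stand-in under assumptions: a JSON file on local disk plays the role of the checkpoint, the "model" is a single number, and `stop_after` simulates the job being stopped when its GPU allocation changes. None of these names come from the Themis code.

```python
import json
import os


def train(ckpt_dir, total_iters, ckpt_every=5, stop_after=None):
    """Run a toy training loop that resumes from the latest checkpoint.

    stop_after: if set, stop after this many iterations in this invocation,
                simulating a change in GPU allocation.
    Returns the in-memory state when the loop stops.
    """
    path = os.path.join(ckpt_dir, "ckpt.json")
    state = {"step": 0, "weight": 0.0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume from the most recent checkpoint
    ran = 0
    while state["step"] < total_iters:
        if stop_after is not None and ran >= stop_after:
            break
        state["weight"] += 0.1  # placeholder for one training iteration
        state["step"] += 1
        ran += 1
        if state["step"] % ckpt_every == 0:
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, path)  # atomic swap so a crash leaves a valid file
    return state
```

Iterations run after the last checkpoint are lost on a restart, which is the trade-off behind checkpointing "every few iterations": more frequent checkpoints waste less work but add I/O overhead.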

8. Evaluation

We evaluate Themis under a variety of scenarios through both testbed-driven experiments on a 50-GPU cluster as well as through event-driven simulations. Since none of the state-of-the-art schemes are open-source, we benchmark Themis against them by emulating their behavior to fit into an auction-based fair market scheme as described below:

Gandiva (Xiao et al., 2018) is an ML cluster scheduler that aims to improve cluster utilization by introspectively profiling ML app execution, and migrating jobs based on the placement preferences inferred for that job. We model Gandiva by having all apps report the placement score for the resources offered, and running the same greedy placement algorithm at the end of each lease to maximize the placement scores for all apps. We do not model time-slicing of GPUs across jobs since both Themis and Gandiva would benefit the same amount from it.

Tiresias (Gu et al., 2019) also targets ML workloads but aims to improve average job completion time (JCT) using priority-based placements. We model Tiresias using bids by having all apps report their total GPU service. The Arbiter assigns resources to apps that have the least GPU service. This model represents a version of Least Acquired Service (LAS) used by Tiresias.

SLAQ (Zhang et al., 2017) proposes a quality driven scheduling framework that focuses on improving aggregate model quality across all jobs. We model SLAQ using bids by having all apps report their decrease in loss value given the resource allocation. The Arbiter assigns resources to apps so as to maximize the aggregate decrease in loss with the resource assignment.
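The Tiresias and SLAQ emulations can be sketched as two small allocation rules. Both are illustrative assumptions about how such an Arbiter could be written: `las_allocate` serves apps in order of least attained GPU service, and `slaq_allocate` hands out GPUs greedily by marginal predicted loss decrease, which maximizes the aggregate decrease only when each app's predictions show diminishing returns. The function names and input shapes are hypothetical.

```python
def las_allocate(service, demand, free_gpus):
    """Least Attained Service: apps with the least GPU service go first.

    service: {app: GPU service received so far}
    demand:  {app: GPUs requested}
    """
    alloc = {}
    for app in sorted(service, key=lambda a: (service[a], a)):
        if free_gpus == 0:
            break
        granted = min(demand[app], free_gpus)
        if granted > 0:
            alloc[app] = granted
            free_gpus -= granted
    return alloc


def slaq_allocate(loss_drop, free_gpus):
    """Greedy quality-driven allocation.

    loss_drop: {app: [d0, d1, ...]} where entry k is the predicted decrease
               in loss if the app receives k GPUs (d0 is normally 0).
    """
    alloc = {a: 0 for a in loss_drop}
    if not alloc:
        return alloc

    def gain(app):
        k, table = alloc[app], loss_drop[app]
        return table[k + 1] - table[k] if k + 1 < len(table) else float("-inf")

    for _ in range(free_gpus):
        best = max(sorted(alloc), key=gain)  # ties go to the first name
        if gain(best) <= 0:
            break  # no app benefits from another GPU
        alloc[best] += 1
    return alloc
```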

Figure 4. Sensitivity analysis on Fairness Knob and Lease Time

Our evaluation reveals the following key highlights:

  • Themis enables a trade-off between finish-time fairness in the long term and placement efficiency in the short term. This is dependent on the choice of fairness knob and lease time. Sensitivity analysis (Figure 4(a)-(c)) shows and a lease duration of min enables significant gains in placement efficiency without sacrificing finish-time fairness.

  • Themis is better than other schemes on finish-time fairness and offers better tail-app completion times, while imposing modest overhead (Figures 5-8).

  • Themis’s benefits compared to other schemes improve with increasing fraction of placement-sensitive apps and increasing contention in the cluster, and these improvements hold even with inaccuracies in finish-time fair metric predictions (Figures 9-11).

8.1. Experimental Setup and Metrics

Simulator: We developed an event-based simulator to evaluate Themis at large scale using real traces from a production setting at a large internet company. We use the traces to obtain a distribution of the number of hyperparameter exploration jobs per ML app. The number of tasks per application varies from 1 to 98 with the median as 23. Most tasks within the application require 4 GPUs, but a few of them require just 2 GPUs. Most tasks within an app have short running times (median 59 minutes), but there are a few long-running tasks (median 123 minutes). We proportionally scale down these times for the purposes of our experiments. Our workload consists of a mixture of placement-insensitive (e.g., the ResNet family of models) and placement-sensitive (e.g., the VGG family) applications (60:40 ratio). Since the trace did not permit actually observing the fidelity (good/promising/bad) of hyperparameter explorations, for our simulations we assume clairvoyance of the number of iterations run by each hyperparameter exploration job. Unless stated otherwise, we evaluate on a heterogeneously constructed 256-GPU cluster, which has a mixture of 4-GPU, 2-GPU, and 1-GPU machines spread across multiple racks. We model the inter-arrival times of ML training apps using a Poisson distribution with a mean inter-arrival time of 20 minutes. We adjust the contention for GPUs on the cluster by modifying the inter-arrival time.
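The arrival model can be sketched as below. The sketch reads "Poisson" in the usual sense of a Poisson arrival process, whose gaps between arrivals are exponentially distributed with the stated mean; that reading, the fixed seed, and the function name are assumptions of the example.

```python
import random


def poisson_arrivals(n_apps, mean_gap_min=20.0, seed=0):
    """Generate app arrival times (in minutes) for a Poisson arrival process.

    Gaps between consecutive arrivals are exponential with mean mean_gap_min;
    lowering mean_gap_min raises contention for GPUs.
    """
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_apps):
        t += rng.expovariate(1.0 / mean_gap_min)  # rate = 1 / mean gap
        times.append(t)
    return times
```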

Testbed Setup: We run our testbed driven experiments on Microsoft Azure. Our testbed consists of a cluster of 50 GPUs spread across 20 instances. We use NC-series and NV-series instances that have 1/2/4 GPUs in each instance. The GPUs we use include NVIDIA Tesla K80 and NVIDIA Tesla M60.

Metrics: We use a variety of metrics to evaluate Themis.

  • Max Fairness: The Max Fairness metric captures the worst finish time fairness across apps. Lower values of max fairness indicate a fairer allocation.

  • Jain’s Fairness: We use Jain’s Fairness to measure the variance of values across apps. Jain’s Fairness close to 1 indicates lower variance in and is better.

  • Placement Score: We define a placement score for each job in an app depending on the locality of GPUs assigned to it. We use a 4-level scoring scheme: slot locality, where all GPUs are connected by NVLink; machine locality, where GPUs are in the same machine connected over PCIe; rack locality, where GPUs are in the same rack; and no locality, for allocations that cross racks. Each successive level has lower network bandwidth and would lead to a slowdown (i.e., placement sensitivity ) based on application properties. A score of 1.0 indicates GPUs are tightly packed while lower scores imply GPUs that are spread out.

  • GPU Time: We use GPU Time as a measure of how efficiently the cluster is utilized. For each job across all apps , we measure the amount of time it runs on the GPU as and compute total GPU time as . For two scheduling regimes and that have GPU times and , utilizes the cluster more efficiently than if .
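Three of the metrics above reduce to short formulas, sketched here. Jain's index is computed with its standard definition, (Σx)² / (n · Σx²), which equals 1 when all values are identical; the input shapes and function names are assumptions of the example.

```python
def max_fairness(metrics):
    """Worst (largest) finish-time fairness value across apps."""
    return max(metrics)


def jains_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(values)
    total = sum(values)
    squares = sum(v * v for v in values)
    return (total * total) / (n * squares)


def total_gpu_time(gpu_hours_by_app):
    """Sum of GPU time over every job of every app.

    gpu_hours_by_app: {app: [GPU time of each job]}
    """
    return sum(sum(jobs) for jobs in gpu_hours_by_app.values())
```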

8.2. Sensitivity Analysis

In this section, we use simulations of a heterogeneous -GPU cluster to evaluate the sensitivity of fairness knob and the lease time to the two metrics that Themis balances: finish-time fairness and cluster utilization.

Figure 4(a) shows the impact on finish-time fairness as we vary the fairness knob which trades off fairness for efficiency. As expected (Section 5), max fairness decreases with an increase in . We also note that the gap between the minimum and maximum fairness values for a given value of reduces as increases. From the plot we also see that the median finish-time fairness increases slightly with and this is because, as discussed in Section 4, the objective for Themis is to minimize the maximum finish-time fairness, and this could come at the expense of median-fairness apps. Finally, the figure also shows that there are diminishing returns and the reduction in maximum finish-time fairness is negligible as we go from to .

Figure 4(b) shows the impact on GPU time versus . At higher values of , we observe a higher GPU time, implying that the cluster is being used less efficiently. This is because the visibility of how many apps can bid for an offer reduces as we increase leading to fewer opportunities for the Arbiter to pack jobs efficiently. To balance the two, we choose as in the rest of our experiments, since we would like to have better fairness guarantees in multi-tenant settings.

Figure 4(c) captures the variation of maximum finish-time fairness as the lease time varies. We observe that smaller lease times lead to better fairness. This can be attributed to two factors: (i) making decisions on resource allocations across apps at a finer granularity helps in making better allocations; (ii) having a shorter lease means that shorter apps do not have long wait times when they arrive. A shorter lease time, however, adds more overhead since applications would have to be checkpointed and swapped out more often, and many more auctions must be run. Based on this observation, we choose a lease time of minutes for the rest of our experiments.

8.3. Macrobenchmarks

(a) Comparison of Max. Fairness
(b) Comparison of Jain’s Index
Figure 5. Comparison of Finish Time Fairness across different scheduling schemes

We evaluate Themis against Gandiva, SLAQ, and Tiresias on a cluster of 50 GPUs (for our testbed experiments, we scaled down the job durations by a factor of 5 and retained the same inter-arrival distribution). Figure 5(a) shows the maximum finish-time fair metric for all apps. Themis outperforms all other schedulers. In our experiments, the workload resulted in a peak contention of times the number of available GPUs. An ideal scheduler would be able to achieve a maximum finish-time fairness of . Themis is away from this ideal value. In contrast, Gandiva, SLAQ, and Tiresias are , , and away from the ideal, respectively. This is expected as Gandiva and SLAQ do not guarantee fairness. Tiresias’s inefficacy arises from its focus on simple resource fairness which ignores placement sensitivity.

Figure 5(b) plots Jain’s fairness metric across apps. Tiresias comes closest and is worse-off than Themis. SLAQ and Gandiva are poorer on fairness as SLAQ demotes old, slowly-converging jobs whereas Gandiva greedily packs with no fairness guarantee.

Figure 6. Comparison of App Completion Times across schemes
Figure 7. CDF of Placement Score

Figure 6 shows the job completion times (JCTs) across scheduling schemes. Themis is , , and better than Gandiva, SLAQ, and Tiresias respectively on average app completion time. Themis is better for two reasons: (i) Themis has visibility into the shared running times of apps over time, and with its long-term focus, makes more frequent allocations to shorter apps than longer ones. (ii) Themis has visibility into different ML apps’ placement sensitivity and packs resources tightly only for the apps that require it.

8.3.1. Sources of Improvement

Figure 8. Timeline of GPU allocations

This section analyzes the reasons Themis is better. Figure 7 shows the distribution of placement scores. Themis outperforms other scheduling schemes with Gandiva coming closest. Gandiva is not as good as it greedily optimizes for better placement during introspection which does not lead to a good global placement. Tiresias and SLAQ are much worse as they are placement unaware.

Figure 8 shows a simplified timeline of GPU allocations for ML apps from our trace. For simplicity, we hand-pick two ML apps each having task. The apps vary 3X in their running times and have equal placement sensitivity. Both apps arrive at . The shorter app gets a larger GPU allocation as values are unbounded/large with no allocations and we break ties in favor of shorter apps. At , new apps arrive and the existing apps’ leases expire, and our scheme favors allocation to the new app whose is large. At , the shorter app gets GPUs and runs to completion. Finally, at , the longer app with lesser work remaining continually gets resources and runs to completion. This shows Themis preferentially completes apps with small , but at the same time it does not starve those with high .

8.3.2. System Overheads

We now evaluate system overheads in Themis.

From our profiling of the experiments above, we find that each Agent spends 29 (334) milliseconds to compute bids at the median (95th percentile). The 95th percentile is high because enumeration of possible bids needs to traverse a larger search space when the number of resources up for auction is high.

The Arbiter uses Gurobi (Gurobi Optimization, [n. d.]) to compute partial allocation of resources to apps based on bids. This computation takes 354 (1398) milliseconds at the median (95th percentile). The high tail is once again observed when both the number of offered resources and the number of apps bidding are high. However, the time is small relative to the lease time.

The network overhead for communication between the Arbiter and individual apps is negligible since we simply leverage the existing mechanisms used by Apache YARN.

Upon receiving new resource allocations, the Agent changes (adds/removes) the number of GPU containers available to its app. Our measurements indicate that this change takes about 35-50 seconds. Prior to relinquishing control over its resources, each application must checkpoint its set of parameters. We find that this is model dependent but takes about 5-10 seconds on average and is driven largely by the overhead of checkpointing to HDFS.

8.4. Microbenchmarks

This section evaluates the impact of specific characteristics of ML training workloads on finish-time fairness and cluster efficiency. Specifically, we evaluate the impact of placement sensitivity and contention in the cluster.

8.4.1. Effect of Placement Sensitivity

(a) Effect on Max. Fairness
(b) Effect on GPU time
Figure 9. Impact of placement sensitivity for varying compute-network job distributions

This experiment analyzes the effect on finish-time fairness and cluster efficiency as the fraction of network-intensive apps in our workload increases.

Figure 9(a) shows the factor of improvement in max fairness for Themis over Tiresias. With only compute-intensive apps, Themis is only better than Tiresias. As the percentage of network-intensive apps increases, Themis performs better than Tiresias. This stems from the fact that placement awareness becomes more important when the workload consists of more network-intensive ML apps.

Figure 9(b) shows the variation in GPU Time across schemes as we vary the fraction of network-intensive apps. With only compute-intensive apps, all scheduling schemes utilize the cluster with roughly the same level of efficiency. As the percentage of network intensive apps increases, Themis utilizes the cluster more efficiently than others.

8.4.2. Effect of Contention

Figure 10. Effect of contention on our scheme.
Figure 11. Impact of error in bid valuations on max fairness

This experiment analyzes the effect of contention on finish-time fairness. Contention is increased by decreasing the inter-arrival times between apps.

Figure 10 shows that Jain’s fairness index worsens more rapidly in Tiresias than in Themis. Actual ML training workloads consist of a mixture of medium- to long-sized apps (Section 2). Themis is better than the least attained service mechanism in Tiresias. In addition to being placement-insensitive, because Tiresias prioritizes short and long apps equally, it can lead to worse finish-time fairness for short apps. Themis with its long-term focus does better on finish-time fairness for both short and long apps.

8.4.3. Impact of Error on

This experiment varies the error in bid values in all valuation tables. The percentage error is sampled at random from [-, ] range. Apps can make errors (not willingly) in computing a new estimate of due to error in estimation of work () or placement-sensitivity ().
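The error-injection step can be sketched as follows, assuming each bid value is scaled independently by a percentage drawn uniformly from the symmetric range and that the same bound applies to every table. The function name, the table shape, and the use of a seeded generator are assumptions of the example.

```python
import random


def perturb_bids(table, max_error_pct, rng):
    """Return a copy of a valuation table with uniform percentage error added.

    table:         {allocation label: bid value}
    max_error_pct: bound of the symmetric error range, in percent
    rng:           random.Random instance (seeded for repeatability)
    """
    return {
        alloc: value * (1.0 + rng.uniform(-max_error_pct, max_error_pct) / 100.0)
        for alloc, value in table.items()
    }
```

Fairness would then be measured against the unperturbed values, so that the experiment captures only the effect of the Arbiter deciding on noisy bids.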

Figure 11 shows the changes in max finish-time fairness as changes. Note this max finish-time fairness is computed on accurate and values. Even with the change in max finish-time fairness is not significant.

9. Related Work

Cluster scheduling for ML workloads has been targeted by a number of recent works including SLAQ (Zhang et al., 2017), Gandiva (Xiao et al., 2018), Tiresias (Gu et al., 2019) and Optimus (Peng et al., 2018). These systems target different objectives and we compare against them in Section 8.

We build on rich literature on cluster scheduling disciplines (Ghodsi et al., 2011; Grandl et al., 2016b, a, 2015) and two-level schedulers (Hindman et al., 2011; Verma et al., 2015; Schwarzkopf et al., 2013). While those disciplines/schedulers don’t apply to our problem, we build upon some of their ideas, e.g., resource offers in (Hindman et al., 2011), and the framing of optimism vs. pessimism in (Schwarzkopf et al., 2013). Sharing incentive was outlined by DRF (Ghodsi et al., 2011), but we focus on long-term fairness with our finish-time metric. Tetris (Grandl et al., 2015) proposes resource-aware packing with an option to trade off for fairness and uses multi-dimensional bin-packing as the mechanism for achieving that. In Themis, we instead focus on fairness with an option to trade off for placement-aware packing and use auctions as our mechanism. We also build on a long list of theoretical work on mechanism design for multi-agent resource allocation (see (Chevaleyre et al., 2006)), from which we borrow and extend a technique for truthful proportionally-fair auctions (Cole et al., 2013).

Some earlier schemes (Grandl et al., 2016a, b) also attempted to emulate the long-term effects of fair allocation. On shorter time-scales they opportunistically reallocate unused resources that one job gives up to another job that needs them to improve completion times; for a given job, such unused resources may arise due to barriers, with typical jobs having 1-2 such barriers (Grandl et al., 2016a). Themis differs in many respects: First, earlier systems focus on batch analytics. Second, earlier schemes rely on instantaneous resource-fairness, which has issues with placement-insensitivity and not accounting for long tasks. Third, in our context there are no occasional barriers around which a job can unilaterally give up resources. While barriers do arise due to synchronization of parameters in ML jobs, they happen at every iteration. Also, resources unilaterally given up by a job may not be usable by another job due to placement sensitivity. Finally, the earlier schemes employ monolithic scheduler designs while we use a two-level design.

10. Conclusion

In this paper we presented Themis, a fair scheduling framework for ML training workloads. We showed how existing fair allocation schemes are insufficient to handle long-running tasks and placement sensitivity of ML workloads. To address these challenges we proposed a new long-term fairness objective, finish-time fairness. We then presented a two-level semi-optimistic scheduling architecture where ML apps can bid on resources offered in an auction. Our experiments show that Themis can improve fairness and efficiency compared to state-of-the-art schedulers.


  • Abadi et al. (2016) Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In OSDI.
  • Apache Hadoop (2013) Apache Hadoop 2013. Apache Hadoop NextGen MapReduce (YARN). Retrieved 9/24/2013, URL:
  • Apache Hadoop Submarine (2019) Apache Hadoop Submarine 2019. Apache Hadoop Submarine.
  • Bergstra et al. (2015) James Bergstra, Brent Komer, Chris Eliasmith, Dan Yamins, and David D Cox. 2015. Hyperopt: a python library for model selection and hyperparameter optimization. Computational Science & Discovery 8, 1 (2015).
  • Chen et al. (2016) Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 (2016).
  • Chevaleyre et al. (2006) Yann Chevaleyre, Paul E Dunne, Ulle Endriss, Jerome Lang, Michel Lemaitre, Nicolas Maudet, Julian Padget, Steve Phelps, Juan A Rodríguez-Aguilar, and Paulo Sousa. 2006. Issues in Multiagent Resource Allocation. Informatica 30 (2006), 3–31.
  • Cole et al. (2013) Richard Cole, Vasilis Gkatzelis, and Gagan Goel. 2013. Mechanism design for fair division: allocating divisible items without payments. In Proceedings of the fourteenth ACM conference on Electronic commerce. ACM, 251–268.
  • Ghodsi et al. (2011) Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. 2011. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In NSDI, Vol. 11. 24–24.
  • Grandl et al. (2015) Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, and Aditya Akella. 2015. Multi-resource packing for cluster schedulers. ACM SIGCOMM Computer Communication Review 44, 4 (2015), 455–466.
  • Grandl et al. (2016a) Robert Grandl, Mosharaf Chowdhury, Aditya Akella, and Ganesh Ananthanarayanan. 2016a. Altruistic scheduling in multi-resource clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 65–80.
  • Grandl et al. (2016b) Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. 2016b. GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 81–97.
  • Gu et al. (2019) Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 485–500.
  • Gurobi Optimization ([n. d.]) Gurobi Optimization [n. d.]. Gurobi Optimization.
  • Hindman et al. (2011) Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D Joseph, Randy H Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, Vol. 11. 22–22.
  • Isard et al. (2009) Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. 2009. Quincy: fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 261–276.
  • Jeon et al. (2018a) Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2018a. Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications. MSR-TR-2018-13 (2018).
  • Jeon et al. (2018b) Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2018b. Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications. Microsoft Research Technical Report (MSR-TR-2018-13) (2018).
  • Li et al. (2016) Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560 (2016).
  • Ousterhout et al. (2013) Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. 2013. Sparrow: distributed, low latency scheduling. In SOSP. 69–84.
  • Peng et al. (2018) Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. 2018. Optimus: an efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference. ACM, 3.
  • Rasley et al. (2017) Jeff Rasley, Yuxiong He, Feng Yan, Olatunji Ruwase, and Rodrigo Fonseca. 2017. Hyperdrive: Exploring hyperparameters with POP scheduling. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference. ACM, 1–13.
  • Schwarzkopf et al. (2013) Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 2013. Omega: flexible, scalable schedulers for large compute clusters. (2013).
  • Shreedhar and Varghese (1996) M. Shreedhar and G. Varghese. 1996. Efficient fair queuing using deficit round-robin. IEEE/ACM Transactions on Networking 4, 3 (June 1996), 375–385.
  • Verma et al. (2015) Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. 2015. Large-scale cluster management at Google with Borg. In EuroSys.
  • Xiao et al. (2018) Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, et al. 2018. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 595–610.
  • Zaharia et al. (2010) Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. 2010. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European conference on Computer systems. ACM, 265–278.
  • Zhang et al. (2017) Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J Freedman. 2017. SLAQ: quality-driven scheduling for distributed machine learning. In Proceedings of the 2017 Symposium on Cloud Computing. ACM, 390–404.