Modern data centers are tasked with processing astonishingly diverse workloads on a common set of shared servers (verma2015large). These jobs differ not only in their resource requirements on a single server, but also in how effectively they scale across multiple servers (delimitrou2014quasar). For instance, a simple client query may not be parallelizable, but it may complete in just milliseconds on a single server. Conversely, a data-intensive job may run for hours even when parallelized across dozens of servers (moritz2018ray). The challenge facing system architects is to build data centers which, in light of this heterogeneity, achieve low response time – the time from when a job enters the system until it is completed.
The state-of-the-art in many data centers is to allow users to specify their own server requirements, and then over-provision the system. By always ensuring that idle servers are available, system designers avoid having to make tough resource allocation decisions while users always receive the resources they request. Unfortunately, these over-provisioned systems are expensive to build and waste resources (gandhi2011data). Most large-scale data centers, for example, run at an average utilization of less than 20% (verma2015large).
To try to reduce this waste, many cluster scheduling systems have been proposed in the literature (hindman2011mesos; delimitrou2014quasar; peng2018optimus; lo2015heracles; mars2011bubble; moritz2018ray). These scheduling systems aim to maintain low response times without having to over-provision the system. One way to achieve this goal (delimitrou2014quasar; peng2018optimus) is to have the system scheduler determine resource allocations rather than allowing users to reserve resources. While these schedulers often work well in practice, none of them offer theoretical response time guarantees.
1.2. The Problem
We propose a simple model of heterogeneous traffic running in a multiserver data center. Our goal is to design a resource allocation policy which dynamically allocates servers to jobs in order to minimize the mean response time across jobs. We assume jobs are preemptible, and that an allocation policy can change a job’s server allocation over time. In particular, we will consider a system of $k$ servers which processes jobs that arrive over time from a workload consisting of two distinct job classes. The first class of jobs, which we call elastic, consists of jobs which can run on any set of servers at any moment in time. We assume that elastic jobs experience a speedup factor proportional to the number of servers they run on. That is, an elastic job which completes in 2 seconds on a single server would complete in 1 second on 2 servers, or 0.5 seconds on 4 servers. The second class of jobs, which we refer to as inelastic, consists of jobs which are not parallelizable. While an inelastic job can run on any server, it can only run on a single server at any moment in time. A resource allocation policy must determine, at every moment in time, how to allocate servers to each job in the system, both elastic and inelastic.
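The linear speedup assumption above can be captured in a one-line helper (an illustrative sketch of ours, not code from the paper; the function name is ours):

```python
def completion_time(size, servers, elastic):
    """Seconds to complete a job of the given size (measured in
    single-server seconds) when granted `servers` servers: an elastic
    job uses all of them, an inelastic job uses at most one."""
    used = servers if elastic else min(servers, 1)
    return size / used
```

For example, a size-2 elastic job finishes in 0.5 seconds on 4 servers, while an inelastic job of the same size still takes 2 seconds.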
In practice each job also has some amount of inherent work associated with it. This inherent work, which we call a job’s size, determines how long it would take to complete the job on a single server. We assume that job sizes in our model are unknown to the system, but are drawn independently for each job from an exponential distribution. To further model the heterogeneity of a workload, we allow elastic and inelastic job sizes to be drawn from two different exponential distributions, with rates $\mu_E$ and $\mu_I$ respectively.
Even given the simplicity of the model above, devising an optimal scheduling policy is non-trivial. For instance, consider the problem of dividing $k$ servers between one elastic job and one inelastic job which are both of size 1. On the one hand, we know that completing jobs quickly benefits mean response time, so one might think to run the elastic job on all $k$ servers before running the inelastic job. On the other hand, this schedule leaves $k-1$ servers idle while the inelastic job completes. We could thus have created a more efficient schedule by running the elastic and inelastic jobs simultaneously, giving $k-1$ servers to the elastic job and one server to the inelastic job. It turns out that the more efficient schedule is optimal in this case, but in general, a good scheduling policy must balance the trade-off between completing elastic jobs quickly and preventing long periods of low server utilization. This question becomes even more complex if the elastic and inelastic jobs have different sizes.
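The arithmetic behind this example can be made explicit (our own calculation, assuming $k$ servers and both jobs of size 1):

```latex
\textbf{Schedule 1 (elastic first).} The elastic job runs on all $k$ servers and
completes at time $1/k$; the inelastic job then runs on one server and completes
at time $1/k + 1$. The mean response time is
$\frac{1}{2}\left(\frac{1}{k} + \frac{1}{k} + 1\right) = \frac{1}{k} + \frac{1}{2}$.

\textbf{Schedule 2 (simultaneous).} The inelastic job runs on one server and
completes at time $1$; the elastic job runs on the remaining $k-1$ servers and
completes at time $1/(k-1)$. The mean response time is
$\frac{1}{2}\left(1 + \frac{1}{k-1}\right) = \frac{1}{2} + \frac{1}{2(k-1)}$.

Since $\frac{1}{2(k-1)} < \frac{1}{k}$ whenever $k > 2$, the simultaneous
schedule has strictly lower mean response time for $k \ge 3$ (the two schedules
tie at $k = 2$).
```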
1.3. Elastic and Inelastic Jobs in the Real World
It is common to find systems which use a shared set of servers to process both elastic and inelastic jobs. Typically in such settings the elastic jobs have more inherent work than the inelastic jobs. For example, consider a cluster which must process a stream of many MapReduce jobs (dean2008mapreduce). From the cluster’s point of view, this workload produces a stream of map stages and reduce stages. Map stages (elastic) are designed to be parallelized across any number of servers and do a large amount of processing. Reduce stages (inelastic) are inherently sequential and do much less total work than a map stage. As another example, modern machine learning frameworks (moritz2018ray)
advocate the use of a single platform for both the training and serving of models. Training jobs (elastic) are large, requiring large data sets and many training epochs. Distributed training methods such as distributed stochastic gradient descent are also designed to scale out across an arbitrary number of nodes (lian2017can). Once a model has been trained, serving the model (inelastic), which consists of feeding a computed model a single data point in order to retrieve a single prediction, is done sequentially and requires comparatively little processing power.
It is less common for elastic jobs to be smaller than inelastic jobs in practice, given the overhead involved in writing code that can be parallelized. If the amount of inherent work required for a job is small to begin with, system developers may not choose to add the additional data structures and synchronization mechanisms that would be required to make the job elastic. One exception is HPC workloads. In this setting, there are often both malleable jobs (elastic) (gupta2014towards) and jobs with hard requirements (inelastic). While malleable jobs are designed to run on any number of cores, jobs with hard requirements demand a fixed number of cores. In this case, it is unclear which class of jobs we would expect to involve more inherent work.
The model presented in this paper is flexible enough to capture all of the above examples.
1.4. Why stochastic analysis?
There has been a sizable amount of work that considers the problem of scheduling jobs onto parallel servers. The vast majority of this work has considered only inelastic jobs of known sizes, and has focused on worst-case analysis. Given the optimality of the Shortest-Remaining-Processing-Time (SRPT) policy in the degenerate case where $k = 1$ (smith1978new), one might hope that SRPT is also optimal in the multiserver case where $k > 1$. Specifically, one might consider a policy called SRPT-k (grosof2018srpt) which runs the $k$ jobs with the shortest remaining processing times at every moment in time. Unfortunately, (leonardi2007approximating) shows that SRPT-k can be arbitrarily far from optimal. In fact, SRPT-k has a competitive ratio of $\Theta(\log \min(n/k, \Delta))$, where $n$ is the number of jobs and $\Delta$ is the ratio of the maximum job size to the minimum job size. Additionally, (leonardi2007approximating) shows that this competitive ratio is a tight lower bound – no online algorithm can do better in the worst case. Using speed augmentation, SRPT-k is known to be constant competitive with speed $1 + \epsilon$ for any constant $\epsilon > 0$ (FoxM11; BussemaT06).
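As a concrete sketch (ours, not from the paper), SRPT-k's allocation rule is simply to serve the (up to) $k$ jobs with the smallest remaining processing times, one server each:

```python
def srpt_k_allocation(remaining_sizes, k):
    """Indices of the jobs SRPT-k serves: the (up to) k jobs with the
    smallest remaining processing times, one server each."""
    order = sorted(range(len(remaining_sizes)), key=lambda i: remaining_sizes[i])
    return order[:k]
```

With remaining sizes [5, 1, 3, 2] and k = 2, the rule serves jobs 1 and 3.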
More recently, some work has examined the case of scheduling parallelizable jobs of known sizes onto parallel servers. This work assumes that each job has an arbitrary speedup curve which dictates its running time as a function of the number of servers on which it runs. Again using worst-case analysis, (edmonds2009scalably) shows how to achieve an $O(1)$-competitive ratio using $(1+\epsilon)$-speed servers. Without using resource augmentation, (im2016competitively) provides an algorithm with a competitive ratio of $O(\log \Delta)$, where again $\Delta$ is the ratio of the largest job size to the smallest job size. This competitive ratio essentially matches the known worst-case lower bound for the problem.
The above results suggest that, without resource augmentation, there is little room to improve the worst-case performance of scheduling policies for parallelizable jobs. This is because the aforementioned lower bounds for worst-case scheduling directly apply to the case where jobs are given speedup curves. However, from the point of view of system designers, this problem remains unsolved! In particular, a competitive ratio of $O(\log \Delta)$ (im2016competitively) can be arbitrarily high when job sizes span a wide range, which is common in practice. Thus, an $O(\log \Delta)$-competitive algorithm could be impractical. Additionally, the results in (edmonds2009scalably) use an elegant algorithm that is interesting theoretically, but the algorithm is difficult to implement due to frequent context switches. (The algorithm is a generalization of Equipartition, splitting the system evenly among a fraction of the jobs in the system.) The problem is that results like (edmonds2009scalably; im2016competitively) and others (see Section 3) perform badly on adversarial cases which are uncommon in practice. We therefore propose shifting to stochastic analysis, which discounts the impact of these adversarial cases. By considering a stochastic analysis, there is the potential to reveal new algorithmic insights into the problem. It could even be possible to find online algorithms that are optimal in expectation.
There has been recent work aimed at allocating servers to parallelizable jobs in a stochastic setting in order to minimize mean response time (berg2018towards). However, this line of work is in an early stage. Specifically, (berg2018towards) only considers the case where all jobs are homogeneous with respect to job size and job speedup. While (berg2018towards) is able to derive the optimal policy in this simpler case, they explicitly note the complexity of handling even just two different classes of jobs. In particular, the problem of allocating servers to both elastic and inelastic jobs in a stochastic setting remains completely open. Although (berg2018towards) presents some approximate numerical analysis of the case where jobs are heterogeneous, the techniques used are computationally intensive and offer no guarantees of accuracy.
1.5. Our Contributions
This paper addresses the problem of allocating servers to both elastic and inelastic jobs. Section 2 introduces our stochastic model of elastic and inelastic jobs of unknown sizes which arrive over time to a system composed of $k$ servers. Using this model, we then present the following results:
We propose two natural server allocation policies which aim to minimize the mean response time across jobs. First, the Elastic-First policy gives strict preemptive priority to elastic jobs and aims to minimize mean response time by maximizing the rate at which jobs depart the system. Second, the Inelastic-First policy gives strict preemptive priority to inelastic jobs. By deferring elastic work for as long as possible, Inelastic-First maximizes system efficiency. It is not immediately obvious if either of these policies is optimal, or which policy is better.
We show in Section 4.1 that if elastic and inelastic jobs follow the same exponential size distribution, Inelastic-First is optimal with respect to mean response time. This argument uses precedence relations to show that deferring elastic work increases the long run efficiency of the system.
Next, in Section 4.2, we show that in the case where elastic jobs are larger on average than inelastic jobs, Inelastic-First is optimal with respect to mean response time. This requires the introduction of a novel sample path argument. Our key insight is that Inelastic-First minimizes the expected amount of inelastic work in the system as well as the expected total work in the system. As long as elastic jobs are larger than inelastic jobs on average, this suffices for minimizing mean response time.
In the case where elastic jobs are smaller on average than inelastic jobs, Inelastic-First is no longer optimal. We illustrate this via a counterexample in Section 4.3 which shows that Elastic-First can outperform Inelastic-First. In order to determine when Elastic-First outperforms Inelastic-First, we perform the first analysis of both the Elastic-First and Inelastic-First allocation policies in Section 5. This analysis leverages recent techniques for solving high-dimensional Markov chains. Our analytical results match simulation.
For the sake of completeness, we also consider the case where job sizes are known and all jobs arrive at time $0$. Here we use worst-case analysis. Using standard dual-fitting techniques (e.g. (AnandGK12; AngelopoulosLT19)), we show SRPT-k is a 4-approximation for the objective of minimizing mean response time. This demonstrates the need for stochastic modeling and analysis. Indeed, the stochastic setting yields optimality results without resorting to approximations. Due to lack of space, this final contribution is deferred to Appendix A.
2. Our Model
We consider a model where jobs arrive over time to a system of $k$ identical servers. Each job has an associated amount of inherent work which we refer to as the job size. We assume that each of the $k$ servers processes jobs with a rate of 1 unit of work per second. Hence, a job’s size is equal to its running time on a single server. We assume that job sizes are unknown to the system, and are drawn from exponential distributions.
Each job may be either elastic or inelastic. We assume that elastic jobs arrive according to a Poisson process with rate $\lambda_E$, and that elastic job sizes are drawn independently from an exponential distribution with rate $\mu_E$. Similarly, inelastic jobs arrive independently according to a Poisson process with rate $\lambda_I$, and inelastic job sizes are drawn independently from an exponential distribution with rate $\mu_I$. We let $S_E$ and $S_I$ be random variables representing the initial sizes of an elastic job or an inelastic job respectively.
Every elastic job can run on any number of servers at any moment in time. Because each server processes work at rate 1, $n$ servers process work at a rate of $n$ units of work per second. Hence, an elastic job of size $x$ completes in $x$ seconds on a single server but completes in $x/n$ seconds on $n$ servers.
By contrast, inelastic jobs can run on at most one server at any moment in time.
We note that all of the results presented in this paper hold equally if inelastic jobs can run on up to some fixed number of servers, $k'$. If $k' \ge k$, there is effectively no difference between elastic and inelastic jobs, since we can never allocate more than $k$ servers in total. If $k' < k$, we can simply renormalize our allocation policies to consider allocating in units of $k'$ servers. After renormalizing, inelastic jobs can once again receive up to one unit of allocation while elastic jobs can receive any number of units of allocation. While our results do not depend on the value of $k'$, we consider the case where $k' = 1$ for the sake of simplifying our notation.
An allocation policy, $\pi$, must determine how many servers to allocate to each job at any moment in time $t$. Specifically, $\pi$ can increase or decrease the allocation to a particular job as it runs. We assume that servers are capable of time sharing, and thus an allocation policy may allocate a fractional number of servers to any job. For any $0 \le a \le k$, we assume that an allocation of $a$ servers processes work at a rate of $a$ units of work per second. At any moment in time $t$, an allocation policy can allocate at most 1 server to each inelastic job, and at most $k$ servers in total.
We can model this system under any policy as a continuous time Markov chain where each state denotes the number of elastic and inelastic jobs currently in the system. That is, we define a continuous time Markov process $\{(N_I(t), N_E(t)) : t \ge 0\}$, where $N_I(t)$ is the number of inelastic jobs in the system at time $t$ and $N_E(t)$ is the number of elastic jobs in the system at time $t$. We therefore let the state $(n_I, n_E)$ denote that there are $n_I$ inelastic jobs and $n_E$ elastic jobs currently in the system.
Because job sizes are exponential and arrivals occur according to a Poisson process, at any moment in time $t$, the distributions of remaining job sizes and the distributions of times until the next arrival for each job class can be fully specified by the numbers of inelastic jobs and elastic jobs in the system. Hence, we will only consider policies which are stationary and deterministic, meaning the policy makes the same allocation decision at every time $t$, given that the system is in state $(n_I, n_E)$. Specifically, we define $a_I^{\pi}(n_I, n_E)$ to be the number of servers allocated to inelastic jobs in state $(n_I, n_E)$ under policy $\pi$, and we define $a_E^{\pi}(n_I, n_E)$ to be the number of servers allocated to elastic jobs in state $(n_I, n_E)$ under policy $\pi$. Note that
$a_I^{\pi}(n_I, n_E) + a_E^{\pi}(n_I, n_E) \le k.$
In general, $a_I^{\pi}(n_I, n_E) + a_E^{\pi}(n_I, n_E)$ could be less than $k$ if there are not a sufficient number of jobs to use all $k$ servers, or if $\pi$ chooses to idle servers instead of allocating them to an eligible job.
We refer to a policy $\pi$ as work conserving if and only if, in any state $(n_I, n_E)$,
$a_I^{\pi}(n_I, n_E) + a_E^{\pi}(n_I, n_E) = k \text{ whenever } n_E \ge 1, \quad \text{and} \quad a_I^{\pi}(n_I, 0) = \min(n_I, k).$
That is, $\pi$ never leaves servers idle if there is an eligible job in the system. In Appendix B we show that there exists an optimal policy which is also work conserving. It therefore suffices to only consider work conserving policies throughout our analysis.
We define the system load, $\rho$, to be
$\rho = \frac{1}{k}\left(\frac{\lambda_E}{\mu_E} + \frac{\lambda_I}{\mu_I}\right).$
In Appendix C we show that for any work conserving policy $\pi$, $\{(N_I(t), N_E(t))\}$ is an ergodic Markov chain if
$\rho < 1. \qquad (1)$
Because there exists an optimal work conserving policy, (1) is necessary for stability under any policy $\pi$. We therefore only consider the regime where $\rho < 1$.
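The stability condition can be sanity-checked numerically. Assuming the system load takes the standard M/M-type form $\rho = (\lambda_E/\mu_E + \lambda_I/\mu_I)/k$ (expected work arriving per second divided by service capacity) — our reading of the model, with illustrative rates of our own:

```python
def system_load(lam_e, mu_e, lam_i, mu_i, k):
    """Offered load rho: each class contributes lambda/mu units of work
    per second, and the k servers provide k units of capacity."""
    return (lam_e / mu_e + lam_i / mu_i) / k

# Elastic: 2 arrivals/sec of mean size 1; inelastic: 3 arrivals/sec of
# mean size 1/2; four servers.
rho = system_load(lam_e=2.0, mu_e=1.0, lam_i=3.0, mu_i=2.0, k=4)
# rho = (2.0 + 1.5) / 4 = 0.875 < 1, so a work-conserving policy is stable.
```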
We will track several stochastic quantities in our system. We define the total number of jobs in the system, $N(t)$, as
$N(t) = N_I(t) + N_E(t).$
We also define $W^{\pi}(t)$ to be the total work in the system under policy $\pi$ at time $t$, where total work is the sum of the remaining sizes of all jobs in the system. Similarly, we let $W_E^{\pi}(t)$ and $W_I^{\pi}(t)$ be the total elastic work and the total inelastic work in the system under policy $\pi$ at time $t$. These quantities are the sums of the remaining sizes of all elastic or inelastic jobs respectively. When referring to the corresponding steady-state quantities, we omit the argument $t$.
We define the random variable $T^{\pi}$ to be the response time of a job which arrives to the system in steady-state under policy $\pi$. Here, the response time of a job is the time from when the job arrives until it is completed (i.e. its remaining size is 0). Our goal is to find the policy $\pi^{*}$ which minimizes the mean response time, $\mathbb{E}[T^{\pi}]$.
We will investigate the performance of two allocation policies, Elastic-First (EF) and Inelastic-First (IF). EF gives strict preemptive priority to elastic jobs, and processes jobs in first-come-first-serve (FCFS) order within each job class. That is, in any state where $n_E \ge 1$, EF allocates all $k$ servers to the elastic job with the earliest arrival time. In any state where $n_E = 0$, EF allocates one server to each inelastic job, in FCFS order, until either all jobs have received a server or all $k$ servers have been allocated. By contrast, IF gives strict preemptive priority to inelastic jobs while processing jobs in FCFS order within each job class. Under IF, in any state where $n_I < k$, one server is allocated to each inelastic job and the remaining $k - n_I$ servers are allocated to the elastic job with the earliest arrival time, if there is one. In any state where $n_I \ge k$, all $k$ servers are allocated to the $k$ inelastic jobs with the earliest arrival times.
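Writing $n_I$ and $n_E$ for the numbers of inelastic and elastic jobs present, the two policies' server counts can be transcribed directly from the definitions above (a sketch of ours; the function names are ours):

```python
def ef_allocation(n_i, n_e, k):
    """Elastic-First: all k servers go to the earliest elastic job if one
    exists; otherwise one server per inelastic job, in FCFS order."""
    if n_e >= 1:
        return {"inelastic": 0, "elastic": k}
    return {"inelastic": min(n_i, k), "elastic": 0}

def if_allocation(n_i, n_e, k):
    """Inelastic-First: one server per inelastic job (at most k, FCFS);
    any leftover servers go to the earliest elastic job."""
    a_i = min(n_i, k)
    a_e = (k - a_i) if n_e >= 1 else 0
    return {"inelastic": a_i, "elastic": a_e}
```

For instance, with $k = 4$ servers, 2 inelastic jobs, and 1 elastic job, IF splits the servers 2/2 while EF gives all 4 to the elastic job.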
3. Prior Work
Although many real-world systems are tasked with allocating servers to heterogeneous workloads, these systems do not allocate servers optimally in order to minimize the mean response time across jobs. Most large-scale cluster schedulers allow users to explicitly reserve the number of servers they want (verma2015large; hindman2011mesos; mars2011bubble; lo2015heracles; moritz2018ray), only allowing the system to choose the placement of each job onto its requested number of servers. Some proposed systems instead allow the scheduler to determine the number of servers allocated to each job (delimitrou2014quasar; peng2018optimus; liaw2019hypersched)
in order to reduce response times. However, these systems rely on heuristics and do not make theoretical guarantees.
In the theoretical literature, the closest work to the results presented in this paper comes from the stochastic performance modeling community. In particular, (berg2018towards) develops a model of jobs whose sizes are drawn from an exponential distribution and which receive a sublinear speedup from being allocated additional servers. However, (berg2018towards) only provides optimality results when jobs are homogeneous, following a single speedup function and a single exponential size distribution. We emphasize that our paper is the first to consider more than one speed-up curve in the setting with stochastic arrivals over time and stochastic job sizes. Essentially all other work in the stochastic community has considered non-parallelizable inelastic jobs. Much of the prior work has been limited to scheduling jobs on a single server (conway2003theory). While there has certainly been work on scheduling in stochastic multiserver systems (e.g. (JACM02; grosof2018srpt; gupta2007insensitivity; BachmatSarfati08; AHW94; Sigmetrics09a)), this literature assumes that a job occupies at most one server at a time (that is, all jobs are inelastic). One notable model that considers jobs that run on multiple servers is the fork-join queueing model motivated by MapReduce (kim1989analysis; wang2019delay; nelson1988approximate). This work assumes that each job consists of a set of pieces that can be processed on different machines at the same time. These pieces can be processed in any order and, critically, a job only completes when all of its pieces have completed. This model can only be analyzed exactly when the number of servers is $k = 2$.
In the worst case setting, the problem of scheduling jobs on identical parallel servers was introduced in (mcnaughton1959scheduling) and has been considered extensively. However, in the classical version of the problem, all jobs are considered to be inelastic. Given inelastic jobs with known sizes and known release times, (leonardi2007approximating) shows a tight lower bound of $\Theta(\log \min(n/k, \Delta))$ on the competitive ratio, where $n$ is the number of jobs and $\Delta$ is the ratio of the maximum job size to the minimum job size. The policy which achieves the best competitive ratio is SRPT-k, which at every moment schedules the $k$ jobs with the smallest remaining processing times.
Several prior works have also considered scheduling parallelizable jobs in the worst-case setting. The speed-up curve model was first addressed by (Edmonds00). The best result for mean response time is (edmonds2009scalably), which gave a constant competitive algorithm with minimal speed augmentation. This paper introduced the influential LAPS scheduling algorithm that has been used in a variety of settings (GuptaIKMP12; EdmondsIM11). The work of (im2016competitively) considers the problem without speed augmentation and gives an $O(\log \Delta)$-competitive algorithm under mild assumptions on the speed-up curves. Recently, there has been a line of work on the Directed-Acyclic-Graph (DAG) model for parallelism. Here a constant competitive algorithm with speed augmentation is known (AgrawalLLM16). The work of (AgrawalL0LM19) gave a speed-augmented constant competitive algorithm for mean response time that is practical, using minimal preemptions. Note, however, that the best possible competitive ratio in any model with release times is still lower bounded by $\Omega(\log \min(n/k, \Delta))$, since all jobs could be inelastic in the worst case.
4. Optimality Results
The following sections establish two results. First, we show that if $\mu_E \le \mu_I$, then IF is optimal for minimizing mean response time. Second, we show that if $\mu_E > \mu_I$, then IF is not necessarily optimal.
In Section 4.1, we consider the special case where $\mu_E = \mu_I$. In this case, where we have homogeneous sizes, analysis is particularly easy. Unfortunately, the technique used to demonstrate optimality, which is based on the notion of precedence relations in continuous time Markov chains, does not extend to the case where $\mu_E \ne \mu_I$.
In Section 4.2, we consider the case where $\mu_E \le \mu_I$. Here, we introduce a novel sample path argument which allows us to demonstrate the optimality of IF.
Lastly, in Section 4.3, we consider the case where $\mu_E > \mu_I$. Here, we construct a very simple example demonstrating that IF is not optimal in this environment. Furthermore, in this example, we show the policy EF actually outperforms IF. We do not know what policy is optimal in this regime.
4.1. Optimality when $\mu_E = \mu_I$
We first consider the case where $\mu_E = \mu_I$. In this case, IF is optimal with respect to minimizing mean response time. As stated in Section 1.2, the optimal policy should balance the trade-off between completing jobs quickly and preserving system efficiency. When $\mu_E = \mu_I$, IF maximizes system efficiency without reducing the overall completion rate of jobs. We argue this formally in Theorem 1 by leveraging a result from (berg2018towards).
Theorem 1.
IF is optimal with respect to minimizing mean response time when $\mu_E = \mu_I$.
Consider the server allocations made by a policy $\pi$ in any state $(n_I, n_E)$. We define the total rate of departures under $\pi$ in the state $(n_I, n_E)$ to be
$\mu_I \, a_I^{\pi}(n_I, n_E) + \mu_E \, a_E^{\pi}(n_I, n_E).$
Following the terminology of (berg2018towards), we say that $\pi$ is in the class of GREEDY policies if, for every state $(n_I, n_E)$,
$\mu_I \, a_I^{\pi}(n_I, n_E) + \mu_E \, a_E^{\pi}(n_I, n_E) = \max_{\pi'} \left[ \mu_I \, a_I^{\pi'}(n_I, n_E) + \mu_E \, a_E^{\pi'}(n_I, n_E) \right].$
That is, a policy is in GREEDY if it achieves the maximal rate of departures in every state.
Furthermore, (berg2018towards) defines a class of policies called GREEDY*. A policy is said to be in GREEDY* if, in every state $(n_I, n_E)$, it minimizes the number of servers allocated to elastic jobs while still maximizing the total rate of departures. That is, a policy $\pi$ is in GREEDY* iff $\pi \in \text{GREEDY}$ and, for every state $(n_I, n_E)$,
$a_E^{\pi}(n_I, n_E) = \min_{\pi' \in \text{GREEDY}} a_E^{\pi'}(n_I, n_E).$
It is shown in (berg2018towards), using precedence relations, that for any policy $\pi \in \text{GREEDY}$,
$\mathbb{E}[T^{\text{GREEDY}^*}] \le \mathbb{E}[T^{\pi}]. \qquad (2)$
To leverage this result, we note that when $\mu_E = \mu_I$ in our model, a policy is in GREEDY if and only if it does not idle servers unnecessarily.
We now argue that IF, which is non-idling, must be in GREEDY*. In states where IF allocates zero servers to elastic jobs, $a_E^{\text{IF}}$ is clearly minimal. In any state where $a_E^{\text{IF}}(n_I, n_E) > 0$, servers cannot be reallocated from elastic jobs to inelastic jobs, since all inelastic jobs must already be in service. Hence, reducing $a_E$ in this case results in a policy which is not in GREEDY. $a_E^{\text{IF}}$ is therefore minimal amongst GREEDY policies in any state $(n_I, n_E)$, and IF is in GREEDY*.
We show in Appendix B that there exists an optimal policy which is non-idling. Hence, when $\mu_E = \mu_I$, there is an optimal policy in GREEDY. This implies that there must be an optimal policy in GREEDY* as well. Because any policy in GREEDY* has the same rate of departures of elastic and inelastic jobs in every state $(n_I, n_E)$, every policy in GREEDY* has the same mean response time. Thus, IF, which is in GREEDY*, is optimal with respect to mean response time.
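The argument can be double-checked by brute force on small states (our own check, not part of the paper's proof): restricting to integer allocations, IF should achieve the maximal departure rate in every state while using the fewest elastic servers among rate-maximizing allocations when $\mu_E = \mu_I$.

```python
from itertools import product

def departure_rate(a_i, a_e, mu_i, mu_e):
    # Each of the a_i servers holding an inelastic job completes it at
    # rate mu_i; elastic work completes at rate mu_e per allocated server
    # (exponential sizes make these rates memoryless).
    return mu_i * a_i + mu_e * a_e

def if_alloc(n_i, n_e, k):
    a_i = min(n_i, k)
    return a_i, (k - a_i if n_e >= 1 else 0)

def check_if_is_greedy_star(k, mu, states):
    """Check that, over integer allocations, IF maximizes the departure
    rate in every state and uses the fewest elastic servers among all
    rate-maximizing allocations (here mu_E = mu_I = mu)."""
    for n_i, n_e in states:
        feasible = [(a_i, a_e)
                    for a_i, a_e in product(range(k + 1), repeat=2)
                    if a_i <= n_i and a_i + a_e <= k and (n_e >= 1 or a_e == 0)]
        best = max(departure_rate(a_i, a_e, mu, mu) for a_i, a_e in feasible)
        min_ae = min(a_e for a_i, a_e in feasible
                     if departure_rate(a_i, a_e, mu, mu) == best)
        a_i, a_e = if_alloc(n_i, n_e, k)
        assert departure_rate(a_i, a_e, mu, mu) == best and a_e == min_ae
    return True
```

Running the check over a grid of small states with $k = 4$ and $\mu = 1$ confirms the claim on those states.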
Why the prior argument does not generalize
Unfortunately, the results of (berg2018towards) do not extend to the case where $\mu_E \ne \mu_I$. In particular, the proof of (2) uses a precedence relation between any two states $(n_I + 1, n_E)$ and $(n_I, n_E + 1)$. This claim essentially states that a policy in state $(n_I + 1, n_E + 1)$ would perform better by transitioning to state $(n_I, n_E + 1)$ than it would by transitioning to state $(n_I + 1, n_E)$. In the case where $\mu_E = \mu_I$, this makes perfect intuitive sense. In this case, both states $(n_I, n_E + 1)$ and $(n_I + 1, n_E)$ contain the same amount of expected total work. Hence, it is better to be in state $(n_I, n_E + 1)$, which benefits from having an additional elastic job. Consider how this intuition changes when $\mu_E < \mu_I$. In this case, state $(n_I + 1, n_E)$ has less expected total work, but state $(n_I, n_E + 1)$ has more expected elastic work. It turns out that the precedence relation shown in (berg2018towards) no longer holds when $\mu_E \ne \mu_I$. Moreover, even if the precedence relations were to hold when $\mu_E \ne \mu_I$, (berg2018towards) would yield that GREEDY* is optimal amongst GREEDY policies, not optimal amongst all policies. We must therefore devise a new argument to reason about the optimal allocation policy when elastic and inelastic jobs follow different size distributions.
4.2. Optimality when $\mu_E \le \mu_I$
We will show IF is optimal in the more general case of $\mu_E \le \mu_I$. While our goal is to minimize mean response time, we note that via Little’s Law (harchol2013performance), it suffices to minimize the mean total number of jobs in the system. (Little’s Law states that for any ergodic system with average total arrival rate $\lambda$, the mean response time, $\mathbb{E}[T]$, is related to the mean total number of jobs in system, $\mathbb{E}[N]$, via the formula $\mathbb{E}[N] = \lambda \mathbb{E}[T]$.)
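Little's Law can be sanity-checked on any finite trace in which every job completes by the end of the horizon; over such a trace the time-average number of jobs equals the observed arrival rate times the average response time exactly. The trace below is our own illustration:

```python
def time_average_jobs(arrivals, departures, horizon):
    """Time-average number of jobs in [0, horizon] for a trace where
    job i is present during [arrivals[i], departures[i]]."""
    area = sum(min(d, horizon) - a for a, d in zip(arrivals, departures))
    return area / horizon

# A fixed trace (ours): 3 jobs, all complete before time 10.
arrivals = [0.0, 1.0, 4.0]
departures = [2.0, 5.0, 9.0]
horizon = 10.0
lam = len(arrivals) / horizon                   # observed arrival rate
mean_t = sum(d - a for a, d in zip(arrivals, departures)) / len(arrivals)
# Little's Law, exactly, on this trace: E[N] = lambda * E[T].
assert abs(time_average_jobs(arrivals, departures, horizon) - lam * mean_t) < 1e-12
```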
We start by defining a class of policies, $\Pi$, which serve inelastic jobs on a first-come-first-serve (FCFS) basis; elastic jobs can be served in any order. In more detail, a policy $\pi$ is said to be in class $\Pi$ if the following holds true:
$\pi$ serves inelastic jobs in FCFS order. In particular, if $\pi$ allocates $a$ servers to inelastic jobs at time $t$ ($a$ may be fractional, and there may be more than $a$ inelastic jobs in the system), the allocation must give $\lfloor a \rfloor$ servers to the $\lfloor a \rfloor$ inelastic jobs with the earliest arrival times. If there is a remaining fraction of a server, it may then be allocated to the inelastic job with the next earliest arrival time.
Road map: Theorem 2 argues that we only need to compare IF to policies in $\Pi$. Specifically, $\Pi$ contains some optimal policy that minimizes the mean number of jobs in system and mean response time.
Next, in Theorem 3 we present a novel sample path argument which shows that IF has stochastically less work in the system than any policy in $\Pi$. We will directly leverage this fact to show that, out of all policies in $\Pi$, IF has the least expected inelastic work in system and also the least expected total work in system.
Finally, in Theorem 5 we show that, of all policies in $\Pi$, IF minimizes the expected number of jobs in system. Thus, by Little’s Law, IF is optimal with respect to mean response time.
Analysis. We now present Theorem 2.
Theorem 2.
The class $\Pi$ contains a policy which minimizes both mean response time and mean number of jobs in system. Specifically,
$\min_{\pi \in \Pi} \mathbb{E}[N^{\pi}] = \min_{\pi} \mathbb{E}[N^{\pi}] \quad \text{and} \quad \min_{\pi \in \Pi} \mathbb{E}[T^{\pi}] = \min_{\pi} \mathbb{E}[T^{\pi}],$
where $N^{\pi}$ is the total number of jobs in the system in steady-state under policy $\pi$, and $T^{\pi}$ is the response time of a job in the system under $\pi$ in steady-state.
Recall that we will consider only stationary, deterministic, work-conserving policies which make allocation decisions based on the current state $(n_I, n_E)$. Let $\pi$ be a stationary, deterministic, work-conserving policy with the minimal mean number of jobs in system. Figure 1 shows the transition rates out of state $(n_I, n_E)$ under $\pi$.
We see that the transition rates out of the current state under policy $\pi$ depend solely on the number of servers allocated to each type of job. Thus, neither the order in which we serve the jobs nor how many jobs of each type are running matter. In particular, we can construct a policy $\pi'$ such that, for any state $(n_I, n_E)$,
$a_I^{\pi'}(n_I, n_E) = a_I^{\pi}(n_I, n_E) \quad \text{and} \quad a_E^{\pi'}(n_I, n_E) = a_E^{\pi}(n_I, n_E),$
and $\pi'$ serves inelastic jobs in FCFS order. The policy $\pi'$ has the same Markov chain as $\pi$, so the expected numbers of jobs in system under $\pi$ and $\pi'$ are identical. Because $\pi$ is work-conserving, $\pi'$ is also work-conserving. Hence, $\pi'$ is in $\Pi$ and achieves the minimal mean number of jobs in system. ∎
The power of Theorem 2 is that, to show IF is optimal with respect to mean response time, it now suffices to show:
$\mathbb{E}[N^{\text{IF}}] \le \mathbb{E}[N^{\pi}] \quad \text{for all } \pi \in \Pi.$
However, it is hard to directly compare the numbers of jobs under different policies. We get around this roadblock by instead analyzing how the remaining work in the system under IF relates to the remaining work under other policies $\pi \in \Pi$. In particular, we obtain the following strong result.
Theorem 3.
For all policies $\pi \in \Pi$, if we assume that IF and $\pi$ begin with the same set of jobs at time $0$, then for all $t \ge 0$,
$W^{\text{IF}}(t) \le_{st} W^{\pi}(t) \quad \text{and} \quad W_I^{\text{IF}}(t) \le_{st} W_I^{\pi}(t),$
where $W^{\pi}(t)$ is the total remaining work under policy $\pi$ at time $t$, $W_I^{\pi}(t)$ is the remaining inelastic work under policy $\pi$ at time $t$, and $\le_{st}$ denotes stochastic dominance.
Fix an arbitrary policy $\pi \in \Pi$, and let us consider a fixed arrival sequence, that is, a fixed sequence of arrival times and job sizes. We couple $\pi$ and IF under this sequence. Here, it suffices to consider arrival sequences where the total number of job arrivals up to any time $t$ is finite, as this occurs with probability 1.
Recall that $W_I^\pi(t)$ and $W_E^\pi(t)$ are respectively the remaining inelastic and elastic work in the system at time $t$ under scheduling policy $\pi$. Furthermore, also recall that $W^\pi(t)$, the total work at time $t$, is given by:
$W^\pi(t) = W_I^\pi(t) + W_E^\pi(t)$.
In order to show the desired stochastic dominance relations, it will suffice to show that on any such arrival sequence,
$W_I^{IF}(t) \le W_I^\pi(t)$ and $W^{IF}(t) \le W^\pi(t)$ for all $t \ge 0$.
First, we see it is immediate that, under our arrival sequence, $W_I^{IF}(t) \le W_I^\pi(t)$ for all $t$. Since IF and $\pi$ process inelastic jobs in FCFS order, each inelastic job enters service at least as early under IF as it does under $\pi$. Furthermore, IF never preempts inelastic jobs. Hence, at each time $t$, the remaining size of each inelastic job that has arrived by time $t$ is no larger under IF than it is under $\pi$. Since the inelastic work in system is just the sum of the remaining sizes of inelastic jobs, the total inelastic work at time $t$ under IF is no larger than the total inelastic work at time $t$ under $\pi$.
It remains to show that $W^{IF}(t) \le W^\pi(t)$ for all $t \ge 0$.
We prove our claim by induction. For a base case, it is clear that $W^{IF}(0) = W^\pi(0)$, as the policies have the same set of jobs at time zero, and no work has been completed yet. For any time $t > 0$, we partition the interval $[0, t]$ into subintervals $[t_i, t_{i+1})$ (see Figure 2) such that either
(1) IF allocates all $n$ servers on $[t_i, t_{i+1})$, or
(2) IF allocates strictly fewer than $n$ servers on $[t_i, t_{i+1})$.
We now induct on $i$, and show that $W^{IF}(t_i) \le W^\pi(t_i)$ implies $W^{IF}(t_{i+1}) \le W^\pi(t_{i+1})$.
If the interval falls into case (1), IF is completing work at the maximal rate of any policy. In particular, IF completes exactly $n(t_{i+1} - t_i)$ work on $[t_i, t_{i+1})$. Let $C^\pi$ denote the work completed by $\pi$ on $[t_i, t_{i+1})$. Then, we must have $C^\pi \le n(t_{i+1} - t_i)$. Since IF and $\pi$ experience the same set of arrivals on this interval, letting $A$ denote the total work arriving on $[t_i, t_{i+1})$, we have:
$W^{IF}(t_{i+1}) = W^{IF}(t_i) + A - n(t_{i+1} - t_i) \le W^\pi(t_i) + A - C^\pi = W^\pi(t_{i+1})$.
Thus, we have $W^{IF}(t_{i+1}) \le W^\pi(t_{i+1})$, as desired.
If the interval falls into case (2), IF allocates strictly fewer than $n$ servers on $[t_i, t_{i+1})$. We aim to show that $W^{IF}(t_{i+1}) \le W^\pi(t_{i+1})$. Observe that IF can have no elastic jobs in its system on $[t_i, t_{i+1})$. This is because we have defined IF to be work-conserving. Hence, if there were an elastic job, IF would run it on all available servers.
Observe that, assuming no elastic job arrives at exactly time $t_{i+1}$,
$W^{IF}(t_{i+1}) = W_I^{IF}(t_{i+1})$.
Likewise, we know
$W^\pi(t_{i+1}) = W_I^\pi(t_{i+1}) + W_E^\pi(t_{i+1}) \ge W_I^\pi(t_{i+1})$.
We get the inequality above because $\pi$ cannot have negative elastic work at time $t_{i+1}$. Finally, we have
$W^{IF}(t_{i+1}) = W_I^{IF}(t_{i+1}) \le W_I^\pi(t_{i+1}) \le W^\pi(t_{i+1})$,
where the first inequality follows from the fact that $W_I^{IF}(t) \le W_I^\pi(t)$ for all $t$. Thus, we have $W^{IF}(t_{i+1}) \le W^\pi(t_{i+1})$.
As a side note, some elastic work could arrive at exactly time $t_{i+1}$. However, this increases the total work in both systems by the same amount and thus has no effect on the ordering of these quantities.
Thus, for any interval $[t_i, t_{i+1})$, if $W^{IF}(t_i) \le W^\pi(t_i)$, then we have $W^{IF}(t_{i+1}) \le W^\pi(t_{i+1})$. Since $W^{IF}(0) = W^\pi(0)$, it follows that this inequality holds at the end of the last subinterval. The end of this final subinterval is exactly time $t$. Thus, for any $t \ge 0$, we have $W^{IF}(t) \le W^\pi(t)$, as desired.
We have thus found a coupling of $\pi$ and IF such that the amount of total work and the amount of inelastic work in each system is ordered at every moment in time. This implies that
$W^{IF}(t) \le_{st} W^\pi(t)$ and $W_I^{IF}(t) \le_{st} W_I^\pi(t)$ for all $t \ge 0$. ∎
In other words, IF is the best policy in $\Pi$ for minimizing remaining inelastic and total work in the system. One possible explanation for this is that, by deferring parallelizable work, IF ensures that all servers are saturated with work for as long as possible.
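The coupling argument above can be exercised numerically. Below is a minimal discrete-time fluid simulation — our own illustration, not from the paper, with made-up job sizes, arrival times, and a 4-server setup — that tracks total remaining work under IF and under EF, where EF stands in for the arbitrary work-conserving comparison policy $\pi$:

```python
# Discrete-time fluid simulation (step DT) of n servers processing
# inelastic jobs (one server each) and elastic jobs (any number of servers).
# Illustrates Theorem 3 on one fixed arrival sequence: the total work under
# Inelastic-First (IF) never exceeds the total work under Elastic-First (EF).
# All job sizes and arrival times below are made-up illustrative values.

N_SERVERS = 4
DT = 0.001

# (arrival_time, size, kind): kind is 'I' (inelastic) or 'E' (elastic)
ARRIVALS = [
    (0.0, 2.0, 'I'), (0.0, 3.0, 'E'), (0.5, 1.0, 'I'),
    (1.0, 4.0, 'E'), (2.0, 2.5, 'I'), (2.0, 0.5, 'I'),
]

def step(jobs, policy):
    """Advance one time step, depleting remaining sizes in place.
    jobs: list of [remaining, kind]; policy: 'IF' or 'EF'."""
    inel = [j for j in jobs if j[1] == 'I' and j[0] > 0]
    elas = [j for j in jobs if j[1] == 'E' and j[0] > 0]
    free = N_SERVERS
    if policy == 'IF':
        # Each inelastic job (FCFS order) gets one server; leftover
        # servers all go to the elastic job at the head of the line.
        served = inel[:free]
        for j in served:
            j[0] = max(0.0, j[0] - DT)
        free -= len(served)
        if elas and free > 0:
            elas[0][0] = max(0.0, elas[0][0] - free * DT)
    else:  # EF: elastic work first on all servers, leftovers to inelastic
        if elas:
            elas[0][0] = max(0.0, elas[0][0] - free * DT)
            free = 0
        for j in inel[:free]:
            j[0] = max(0.0, j[0] - DT)

def simulate(policy, horizon=10.0):
    jobs, t, trace, pending = [], 0.0, [], sorted(ARRIVALS)
    while t < horizon:
        while pending and pending[0][0] <= t:
            a = pending.pop(0)
            jobs.append([a[1], a[2]])
        step(jobs, policy)
        t += DT
        trace.append(sum(j[0] for j in jobs))  # total remaining work
    return trace

work_if = simulate('IF')
work_ef = simulate('EF')
assert all(wi <= we + 1e-6 for wi, we in zip(work_if, work_ef))
```

On this sample path the total work under IF never exceeds that under EF at any step, matching the work dominance of Theorem 3.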
We now understand that, out of all policies in $\Pi$, IF is optimal with respect to minimizing both expected remaining inelastic work and expected remaining total work at any time $t$. We now establish a relationship between expected remaining work and expected number of jobs in system.
Lemma 4.
For any policy $\pi \in \Pi$, we have:
$E[W_I^\pi] = E[N_I^\pi] \cdot E[S_I]$ and $E[W_E^\pi] = E[N_E^\pi] \cdot E[S_E]$,
where $W_I^\pi$ and $N_I^\pi$ are respectively the inelastic work and number of inelastic jobs in the system in steady-state under policy $\pi$. Furthermore, $S_I$ is the size of an inelastic job, distributed as $\mathrm{Exp}(\mu_I)$. $W_E^\pi$, $N_E^\pi$, and $S_E$ are the analogous quantities for elastic jobs.
We do the proof for the inelastic relationship; the proof for the elastic relationship is identical. Let the random variable $N_I^\pi(t)$ denote the number of inelastic jobs in the system under policy $\pi$ at time $t$. Assume that $j$ is used as an index for the jobs which are in the system at time $t$, and define $S_j^\pi(t)$ as the remaining size of inelastic job $j$ under policy $\pi$ at time $t$.
Recall $W_I^\pi(t)$ is the remaining inelastic work in system at time $t$ under policy $\pi$. We have the following equivalence:
$W_I^\pi(t) = \sum_{j=1}^{N_I^\pi(t)} S_j^\pi(t)$.
By the memoryless property of the exponential distribution, the remaining sizes of jobs also follow an exponential distribution. Specifically, $S_j^\pi(t) \sim \mathrm{Exp}(\mu_I)$, regardless of the policy $\pi$ or the time $t$. Thus, $N_I^\pi(t)$ and the remaining sizes $S_j^\pi(t)$ are independent and we have that
$E[W_I^\pi(t)] = E[N_I^\pi(t)] \cdot E[S_I]$.
As shown in Appendix C, $E[N_I^\pi(t)]$ converges to $E[N_I^\pi]$ as $t \to \infty$. This implies the convergence of $E[W_I^\pi(t)]$. Thus, taking the limit as $t \to \infty$ yields:
$E[W_I^\pi] = E[N_I^\pi] \cdot E[S_I]$,
as desired.³ ∎
³ Technically, we have only proven that $E[W_I^\pi(t)]$ converges to some value, but not that it converges to $E[W_I^\pi]$. This would be sufficient for our subsequent results. It turns out that $E[W_I^\pi(t)]$ does converge to $E[W_I^\pi]$ as $t \to \infty$, but we omit this proof for brevity.
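As a sanity check on this work–number relationship, here is a small event-driven simulation — our own illustration, not from the paper — of the simplest single-class setting, an M/M/1 queue with made-up rates, verifying that the time-average work is close to the time-average number in system divided by the service rate:

```python
# Event-driven M/M/1 simulation checking E[W] = E[N] * E[S] = E[N] / mu.
# lam, mu, and the horizon are arbitrary illustrative choices.
import random

random.seed(1)

lam, mu = 0.5, 1.0
T_END = 500_000.0

t = 0.0
next_arrival = random.expovariate(lam)
jobs = []                  # remaining sizes, FCFS; server serves jobs[0]
area_N = area_W = 0.0      # time-integrals of N(t) and W(t)

while t < T_END:
    dt_dep = jobs[0] if jobs else float('inf')
    dt = min(next_arrival - t, dt_dep)
    W = sum(jobs)
    area_N += len(jobs) * dt
    # while the server is busy, W(t) decreases at rate 1 over the interval
    area_W += W * dt - (dt * dt / 2 if jobs else 0.0)
    t += dt
    if jobs:
        jobs[0] -= dt
        if jobs[0] <= 1e-12:
            jobs.pop(0)
    if t >= next_arrival - 1e-12:
        jobs.append(random.expovariate(mu))
        next_arrival = t + random.expovariate(lam)

avg_N, avg_W = area_N / t, area_W / t
assert abs(avg_W - avg_N / mu) < 0.1
assert abs(avg_N - lam / (mu - lam)) < 0.1   # M/M/1: E[N] = rho/(1-rho)
```

The long-run averages match the memorylessness-based identity up to simulation noise.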
We can now show that IF has the lowest expected number of jobs in system when $\mu_E \le \mu_I$.
Theorem 5.
For any policy $\pi \in \Pi$, if $\mu_E \le \mu_I$, we have:
$E[N^{IF}] \le E[N^\pi]$.
And via Little's Law, we have:
$E[T^{IF}] \le E[T^\pi]$.
Because there exists an optimal work-conserving policy in $\Pi$, it suffices to consider any policy $\pi \in \Pi$. We write the total work under $\pi$ as $W^\pi = W_I^\pi + W_E^\pi$. Likewise, we have the equality $N^\pi = N_I^\pi + N_E^\pi$. First, from Lemma 4, we have the following equalities:
$E[N^\pi] = E[N_I^\pi] + E[N_E^\pi] = \mu_I E[W_I^\pi] + \mu_E E[W_E^\pi] = (\mu_I - \mu_E) E[W_I^\pi] + \mu_E E[W^\pi]$.
Furthermore, by the stochastic dominance results of Theorem 3,
$E[W_I^{IF}] \le E[W_I^\pi]$ and $E[W^{IF}] \le E[W^\pi]$.
Thus, since $\mu_I - \mu_E \ge 0$, we have:
$E[N^{IF}] = (\mu_I - \mu_E) E[W_I^{IF}] + \mu_E E[W^{IF}] \le (\mu_I - \mu_E) E[W_I^\pi] + \mu_E E[W^\pi] = E[N^\pi]$. ∎
We have therefore established that IF is optimal with respect to mean response time when $\mu_E \le \mu_I$.
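The step from the number-in-system bound to the response-time bound uses Little's Law, $E[N] = \lambda E[T]$. As a quick self-contained check — with illustrative rates of our own choosing — the M/M/1 closed forms satisfy it exactly:

```python
# Little's Law check on M/M/1 closed forms (illustrative rates).
lam, mu = 0.5, 1.0
rho = lam / mu
E_N = rho / (1 - rho)    # mean number in system
E_T = 1 / (mu - lam)     # mean response time
assert abs(E_N - lam * E_T) < 1e-12
```

Since the arrival rate is the same under every policy, ordering the mean numbers in system orders the mean response times.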
4.3. Failure when $\mu_E > \mu_I$
Now, we consider the case when $\mu_E > \mu_I$. Here, we demonstrate that IF is not optimal for minimizing mean response time. In fact, IF is not even optimal in the simplified setting where there are only two servers and no arrivals. We construct our counterexample in Theorem 6 below.
Theorem 6.
In general, IF is not optimal for minimizing mean response time when $\mu_E > \mu_I$.
Assume we have $n = 2$ servers, $\mu_E > \mu_I$, and there are no arrivals. We show that, if the system starts with two inelastic jobs and one elastic job, and $\mu_E$ is sufficiently large relative to $\mu_I$, the policy EF outperforms IF.
We directly compute the mean response time for both policies, starting with IF. We let $T^{IF}$ denote response time under IF, and $T^{EF}$ denote response time under Elastic-First. Summing the expected lengths of the periods during which each number of jobs is present, and conditioning on which job completes first, the expected total response time over the three jobs under IF is
$E\big[\sum_j T_j^{IF}\big] = \frac{3}{2\mu_I} + \frac{2}{\mu_I+\mu_E} + \frac{\mu_I}{\mu_I+\mu_E}\cdot\frac{1}{2\mu_E} + \frac{\mu_E}{\mu_I+\mu_E}\cdot\frac{1}{\mu_I}.$
On the other hand, we see:
$E\big[\sum_j T_j^{EF}\big] = \frac{3}{2\mu_E} + \frac{2}{2\mu_I} + \frac{1}{\mu_I}.$
In particular, we have $E[T^{EF}] < E[T^{IF}]$ once $\mu_E$ is sufficiently large relative to $\mu_I$. Thus, in general, IF is not optimal when $\mu_E > \mu_I$: in this environment, we see EF outperform IF. ∎
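The counterexample can be checked numerically. The closed forms below are our own derivation for this three-job instance (conditioning on which job completes first, with the elastic job running at rate $2\mu_E$ when it has both servers); $\mu_I = 1$ and $\mu_E = 3$ are illustrative values:

```python
# Two servers, 2 inelastic jobs and 1 elastic job at time 0, no arrivals,
# exponential sizes with rates mu_I and mu_E. Each function returns the
# expected TOTAL response time over the three jobs (mean = total / 3, so
# the ordering is the same). Derivation is ours, not quoted from the paper.

def total_resp_IF(mu_I, mu_E):
    # Phase 1: both inelastic jobs in service (3 jobs in system).
    # Phase 2: remaining inelastic and the elastic job race (2 jobs).
    # Phase 3: whichever job remains (1 job).
    p_inel_first = mu_I / (mu_I + mu_E)
    return (3 / (2 * mu_I)
            + 2 / (mu_I + mu_E)
            + p_inel_first * 1 / (2 * mu_E)        # elastic left, 2 servers
            + (1 - p_inel_first) * 1 / mu_I)       # inelastic left, 1 server

def total_resp_EF(mu_I, mu_E):
    # Elastic on both servers, then both inelastic jobs, then one.
    return 3 / (2 * mu_E) + 2 / (2 * mu_I) + 1 / mu_I

# At mu_E = mu_I, IF is (weakly) better, consistent with Theorem 5.
assert total_resp_IF(1, 1) < total_resp_EF(1, 1)
# Once mu_E is sufficiently larger than mu_I, EF wins.
assert total_resp_EF(1, 3) < total_resp_IF(1, 3)
```

Note that EF does not win for every $\mu_E > \mu_I$ in this instance; under these formulas it overtakes IF only once the rate gap is large enough.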
5. Response Time Analysis Results
From the results of Section 4, we know that IF is optimal with respect to mean response time when $\mu_E \le \mu_I$. However, Section 4 also shows that EF can outperform IF when $\mu_E > \mu_I$. This raises the question of which allocation policy, IF or EF, performs better for given values of $\mu_I$ and $\mu_E$.
In this section we derive the mean response time for EF under a range of values of $\lambda_I$, $\lambda_E$, $\mu_I$, $\mu_E$, and $n$. The analysis for the IF policy is similar, and thus we defer it to Appendix D. We outline our approach here:
In Section 5.1 we present the Markov chain for EF. This Markov chain is 2D-infinite.
In Section 5.2 we present a technique from the stochastic literature called Busy Period Transitions (PerfEval06; Sigmetrics03b) which reduces the 2D-infinite chain to a 1D-infinite chain. Although the Busy Period Transitions approach produces an approximation, it is known to be highly accurate, with errors of less than 1% (PerfEval06; Sigmetrics03b; SPAA03; QUESTA05; ICDCS03).
In Section 5.3 we apply standard Matrix-Analytic methods to solve the 1D-infinite Markov chain, obtaining the stationary distribution and finally the mean response time of EF.
The results of our analysis for IF and EF are shown in Figures 4, 5, and 6. We compared our analysis with simulation, and all numbers agree within 1%. We note that (berg2018towards) used MDP-based techniques to analyze allocation policies in a similar model. These previous results required truncating the state space, and were computationally intensive. The techniques presented in this section do not require truncating the state space, can be tuned to arbitrary precision, and are comparatively efficient.
Figure 4 presents a high-level view of our results, showing only the relative performance of IF and EF as the system load is moved from (a) low load to (b) medium load to (c) high load. In every case, IF outperforms EF when $\mu_E \le \mu_I$, as expected from the optimality of IF in this region. When $\mu_E > \mu_I$, Figure 4 shows us that EF can outperform IF, and that the region where EF is better grows as the load increases.
Figure 5 shows the absolute mean response times under IF and EF as a function of $\mu_I$. We again examine the system under various fixed values of the load. The dotted lines mark the case where $\mu_E = \mu_I$. We therefore know that IF is optimal to the right of this line in every graph, while EF may dominate IF to the left of this line. We see that our choice of allocation policy has a major impact on mean response time.
While Figures 4 and 5 fix the number of servers, our analysis works equally well with any number of servers, $n$. Figure 6 shows how the mean response time under IF and EF changes as $n$ increases while the system load remains constant.
5.1. Markov Chains for IF and EF
Figure (a)a shows the Markov chain which exactly describes EF. The corresponding IF chain is given in Appendix D. Recall that the state $(N_I, N_E)$ denotes having $N_I$ inelastic jobs and $N_E$ elastic jobs in the system. This chain is infinite in two dimensions – the number of inelastic jobs and the number of elastic jobs. Because there is no general method for solving 2D-infinite Markov chains, we provide a technique for converting this chain to a 1D-infinite Markov chain in Section 5.2.
5.2. Converting From 2D-Infinite to 1D-Infinite
We start by describing how to reduce the dimensionality of the Markov chain for EF. To do this, we make three key observations about its structure.
Observation 1: Response time of elastic jobs is trivial. Under EF, elastic jobs have preemptive priority over inelastic jobs. Thus, their behavior is independent of the state of inelastic jobs in the system. We can therefore model the response time of elastic jobs as an M/M/1 queueing system with arrival rate $\lambda_E$ and service rate $n\mu_E$, which is well understood in the queueing literature (kleinrock1976queueing). What remains is to understand the response time of the inelastic jobs.
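Concretely, under this M/M/1 view — a sketch under the stated assumption that the elastic job in service occupies all $n$ servers, so the service rate is $n\mu_E$; the numbers are illustrative, not from the paper:

```python
# Mean response time of the elastic subsystem under EF, modeled as M/M/1
# with arrival rate lam_E and service rate n * mu_E.
# In M/M/1: E[T] = 1 / (service_rate - arrival_rate).

def elastic_mean_resp(n, lam_E, mu_E):
    assert lam_E < n * mu_E, "elastic subsystem must be stable"
    return 1.0 / (n * mu_E - lam_E)

assert elastic_mean_resp(n=4, lam_E=1.0, mu_E=0.5) == 1.0  # 1/(2-1)
```

The stability condition for this subsystem is $\lambda_E < n\mu_E$.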
Observation 2: The busy period transformation. Looking at Figure (a)a, we notice that the chain has a repeating structure when there is at least one elastic job in the system ($N_E \ge 1$). We leverage this repeating structure to reduce the Markov chain for EF to a 1D-infinite chain. Specifically, while there are elastic jobs in the system, EF does not process any inelastic jobs. The length of time during which EF is not processing any inelastic jobs can be viewed as an M/M/1 busy period. In an M/M/1 system, a busy period is defined to be the time from when a job arrives into an empty system until the system empties. In our case, this busy period is the time from when an elastic job arrives into a system with no elastic jobs until the system next has 0 elastic jobs. In Figure (b)b, we show how to replace the entire portion of the Markov chain where $N_E \ge 1$ with a set of special states which represent the duration of an M/M/1 busy period for the elastic jobs.
Observation 3: Creating a 1D chain for inelastic jobs. Looking at Figure (b)b, we note the bolded transition arrows (labeled “B”) emanating from the busy period states. Because the duration of an M/M/1 busy period is not exponentially distributed, we must replace these special transitions with a mixture of exponential states (a Coxian distribution) which accurately approximates the duration of a busy period. A technique for matching the first three moments of the busy period with a Coxian distribution is given in (PerfEval06). The 1D-infinite chain resulting from this technique is shown in Figure (c)c.
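The targets for the three-moment match are the busy-period moments. For reference, the first three moments of an M/M/1 busy period follow from the standard M/G/1 busy-period transform; the sketch below (with illustrative rates of our own choosing) computes them:

```python
# First three moments of an M/M/1 busy period (standard results, used as
# the target for the three-moment Coxian match). lam, mu are illustrative.

def busy_period_moments(lam, mu):
    rho = lam / mu
    assert rho < 1, "busy period must be finite in expectation"
    m1 = 1 / (mu - lam)                        # = (1/mu) / (1 - rho)
    m2 = 2 / (mu**2 * (1 - rho)**3)
    m3 = 6 * (1 + rho) / (mu**3 * (1 - rho)**5)
    return m1, m2, m3

m1, m2, m3 = busy_period_moments(0.5, 1.0)
assert (m1, m2, m3) == (2.0, 16.0, 288.0)
```

A Coxian distribution fit to these three moments then replaces the “B” transitions in the chain.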
We use the same three-step technique to make an analogous simplification of the Markov chain for IF (see Appendix D).
Given these 1D-infinite chains, we now apply standard matrix analytic techniques to solve for mean response time.
5.3. Matrix Analytic Method
We now explain how to analyze IF and EF using the 1D-infinite Markov chains developed in the previous section. We do this by applying matrix analytic methods (Neuts81; LatoucheRamaswami99; Neut89). Matrix analytic methods are iterative procedures which compute the stationary distribution of a repeating, 1D-infinite Markov chain.
Consider, for example, Figure (c)c, which shows the 1D-infinite chain for EF. Observe that each column of this chain, after the first column, has identical transitions. The idea of matrix analytic methods is to represent the stationary distribution of column $i+1$ as the product of the stationary distribution of column $i$ and some unknown matrix $R$. The matrix $R$ is determined iteratively through a numeric procedure (Neuts81; LatoucheRamaswami99; Neut89). This procedure yields the stationary distribution of the chain. Using the stationary distribution, we can easily determine the mean number of inelastic jobs, and hence the mean response time for inelastic jobs (recall that the response time for elastic jobs under EF is trivial).
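The iteration can be illustrated on the simplest possible repeating chain. Below — our own toy example, not the EF chain itself — M/M/1 is treated as a quasi-birth-death process with 1×1 blocks, so the block matrices $A_0$, $A_1$, $A_2$ and the matrix $R$ are scalars:

```python
# Matrix-analytic fixed-point iteration on M/M/1 viewed as a QBD with
# 1x1 blocks: A0 = upward (arrival) rate, A1 = local rate, A2 = downward
# (departure) rate. R solves A0 + R*A1 + R^2*A2 = 0, and pi_{i+1} = pi_i * R.

lam, mu = 1.0, 2.0
A0, A1, A2 = lam, -(lam + mu), mu

R = 0.0
for _ in range(200):
    R = -(A0 + R * R * A2) / A1      # R <- -(A0 + R^2 A2) * A1^{-1}

assert abs(R - lam / mu) < 1e-9      # converges to rho = 0.5

# The recovered geometric distribution matches M/M/1: pi_i = (1-rho) rho^i
pi0 = 1 - R
assert abs(pi0 * R**3 - (1 - 0.5) * 0.5**3) < 1e-9
```

In the real EF chain the blocks are genuine matrices (one entry per phase of the Coxian), but the fixed-point iteration for $R$ has exactly this shape.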
An analogous argument can be applied to solve the 1D-infinite chain for IF.
In this paper, we establish optimality results and provide the first analysis of policies for scheduling jobs which are heterogeneous with respect to their parallelizability. Specifically, we study a model where jobs are either inelastic or elastic: inelastic jobs can only run on a single server and elastic jobs parallelize linearly across many servers. We prove that the policy Inelastic-First (IF), which gives inelastic jobs preemptive priority over elastic jobs, is optimal for minimizing the mean response time across jobs in the common case where elastic jobs are larger on average than inelastic jobs. We then provide analysis of mean response time under the Elastic-First (EF) and Inelastic-First (IF) policies. Our techniques include a novel sample path argument for proving stochastic dominance, and a method for solving 2D-infinite Markov chains.
There are many open questions in scheduling jobs which are heterogeneous with respect to their parallelizability. One immediate follow-up of our work is to find optimal policies when elastic jobs are smaller on average than inelastic jobs. We show in this paper that in this setting EF can outperform IF; however, it is not clear that EF is the optimal allocation policy. Furthermore, the model studied in this paper can be generalized in many ways to capture a broad range of application scenarios. For example, one can consider a model where the elastic jobs are not fully elastic as in this paper, but are elastic only up to a certain number of servers. More generally, we can have more than two classes of jobs with different levels of parallelizability and different job size distributions. The problem of finding optimal policies and providing analysis in these models is wide open.
Appendix A Approximation when Jobs Arrive at the Same Time
In this section we show that a generalization of SRPT-k is a 4-approximate algorithm for mean response time if all jobs arrive at the same time. This case is entirely deterministic. This result generalizes beyond elastic and inelastic jobs; in particular, it holds even in more general parallelizability settings where every job $j$ is parallelizable up to $k_j$ processors. That is, if job $j$ is given $p \le k_j$ processors, the rate at which it is processed is $p$.
To prove the theorem, we will use a dual fitting analysis. Consider the following LP relaxation of the problem. In the following, we use $p_j$ to denote the inherent size of job $j$. The variable $m_{j,t}$ is how much job $j$ is processed at time $t$.
It is easy to show that the above LP lower bounds the optimal flow time of a feasible schedule. This is essentially an LP for a single machine of speed $k$ plus the standard corrective term in the objective. See (ChadhaGKM09) for similar relaxations. The dual of the LP is as follows.
The algorithm that will be used is a natural generalization of SRPT-k to the case of parallelizable jobs. The algorithm sorts the jobs according to their inherent size in increasing order. For the rest of the analysis we assume that the jobs are indexed in this order, so that $p_1 \le p_2 \le \cdots \le p_m$, where $m$ is the total number of jobs. At any point in time, the algorithm gives the cores to the jobs in this priority order. Each job $j$ is assigned up to $k_j$ processors, and then the algorithm considers the next job in the list with the remaining processors. We let $V_j$ be the total amount of work strictly ahead of job $j$.
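The allocation step described above can be sketched as follows. The helper `allocate` and the job data are our own illustration; `cap` plays the role of the per-job parallelizability bound $k_j$:

```python
# Greedy priority allocation: jobs sorted by inherent size; each job takes
# up to its parallelizability cap, in order, until the k processors run out.

def allocate(jobs, k):
    """jobs: list of (size, cap) sorted by size ascending.
    Returns the number of processors given to each job."""
    alloc = []
    for size, cap in jobs:
        give = min(cap, k)
        alloc.append(give)
        k -= give
        if k == 0:
            break
    # jobs past the point where processors ran out get nothing
    return alloc + [0] * (len(jobs) - len(alloc))

# 8 processors, caps 4, 2, 4 -> allocations 4, 2, 2
assert allocate([(1.0, 4), (2.0, 2), (3.0, 4)], 8) == [4, 2, 2]
assert allocate([(1.0, 16)], 8) == [8]
```

Re-running this allocation whenever a job completes yields the full schedule, since with all arrivals at time 0 the priority order never changes.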
To analyze the algorithm, we will assume the processors the algorithm has are of speed $s \ge 1$; later we will set $s = 2$. That is, each processor completes $s$ units of work on a job in each timestep it works on the job. We compare to an optimal solution with speed-$1$ processors. The following lemma allows us to compare to the slower optimal solution with minimal loss in the approximation ratio.
Lemma 1 (GuptaMUX17).
Let $\mathrm{OPT}_s$ denote the value of the total response time of the optimal algorithm when the optimal algorithm has processors of speed $s$. Then for any $s \ge 1$, $\mathrm{OPT}_1 \le s \cdot \mathrm{OPT}_s$.
We now define the dual variables. Let $U(t)$ denote the set of jobs released and unsatisfied at time $t$ in the algorithm's schedule. The dual variables $\alpha_j$ and $\beta_t$ are set in terms of the algorithm's schedule. Our main claim is the following.
Lemma 2.
Let ALG denote the algorithm's total completion time. (Since all jobs arrive at time 0, this equals the total response time.) The dual objective under $(\alpha, \beta)$ is at least a constant fraction of ALG. Moreover, $(\alpha, \beta)$ corresponds to a feasible dual solution when $s = 2$.
The majority of the section will be devoted to proving this lemma. We first observe that this is sufficient to prove our theorem.
Theorem 3.
The SRPT-k algorithm is a $4$-approximation for mean response time when all jobs arrive at time $0$.
We now return to proving Lemma 2. We begin by establishing the value of the objective function.
Lemma 4.
First notice that the contribution of the dual variables $\alpha_j$ is precisely the algorithm's objective, ALG. Thus, it is sufficient to bound the remaining term. To do so, we show that the stated quantity is an upper bound on job $j$'s response time. Indeed, we know that while job $j$ is unsatisfied, either all processors are working with speed $s$ on work that is ahead of job $j$, or job $j$ itself is being worked on with $k_j$ processors of speed $s$. ∎
Next we will show that this setting of the dual variables corresponds to a feasible dual solution.
Lemma 5.
The dual solution is feasible when $s = 2$.
We need to show that the dual constraint holds for all jobs $j$ and times $t$:
Consider the left-hand side for a fixed job $j$ and time $t$. Let $p_j(t)$ be the remaining work left on job $j$ at time $t$, and let $q_j(t)$ be the amount of job $j$ that has been processed up to time $t$.