Distributed computing frameworks (e.g., MapReduce, Spark, Hive) have enabled processing of very large data sets across a cluster of servers. The processing is typically done by executing a set of jobs or tasks on the servers. A key component of such systems is the resource manager (scheduler), which assigns incoming jobs to servers and reserves the requested resources (e.g., CPU, memory) on the servers for running jobs. For example, in Hadoop, the resource manager reserves the requested resources by launching resource containers on servers. Jobs of various applications arrive at the cluster, often with very diverse resource requirements. Hence, to improve throughput and delay, a scheduler should pack as many jobs (containers) as possible into the servers, while meeting their resource requirements and not exceeding the servers' capacities.
A salient feature of resource demand is that it is hard to predict and cannot be easily classified into a small or moderate number of resource profiles or “types”. This is amplified by the increasing complexity of workloads, i.e., from traditional batch jobs to queries, graph processing, streaming, machine learning jobs, etc., which rely on multiple computation frameworks and all need to share the same cluster. For example, Figure 1 shows the statistics of the memory and CPU resource requirements requested by jobs in a Google cluster, over the first day in the trace. If jobs were to be divided into types according to their memory requirement alone, there would be more than types. Moreover, the statistics change over time and these types are not sufficient to model all the job requirements in a month, which are more than . We can make a similar observation for CPU requirements, which take more than discrete types. Analyzing the joint CPU and memory requirements, there would be more than distinct types. Building a low-complexity scheduler that can provide high performance in such a high-dimensional regime is extremely challenging, as learning the demand for all types is infeasible, and finding the optimal packing of jobs in servers, even when the demand is known, is a hard combinatorial problem (related to the Bin Packing and Knapsack problems).
Despite the vast literature on scheduling algorithms, their theoretical study in such a high-dimensional setting is very limited. The majority of the past work relies on the crucial assumption that there is a predefined finite set of discrete types, e.g. [6, 7, 8, 9, 10, 11]. Although we can consider every possible resource profile as a type, the number of such types could be formidably large. Applying scheduling algorithms in such a setting, even those with polynomial complexity in the number of types, is impractical. A natural solution could be to divide the resource requests into a smaller number of types. Such a scheduler can be strictly suboptimal, since, as a result of mapping to a smaller number of types, jobs may underutilize or overutilize the resource compared to what they actually require. Moreover, in the absence of any prior knowledge about the resource demand statistics, it is not clear how the partitioning of the resource axis into a small number of types should actually be done.
Our work addresses one of the key deficiencies of the past work in the modeling and analysis of scheduling algorithms for distributed server systems. Our model allows a very large or, in the extreme case, even infinite number of job types, i.e., when the jobs’ resource requirements follow a probability distribution over a continuous support. To the best of our knowledge, there is no past work on characterizing the optimal throughput and what can be achieved when there are no discrete job types. Our goal is to characterize this throughput and design algorithms that: (1) have low complexity, and (2) can provide provable throughput guarantees without the knowledge of the traffic or the resource requirement statistics.
I-A Related Work
Existing algorithms for scheduling jobs in distributed computing platforms can be organized in two categories.
In the first category, we have algorithms that do not provide any throughput guarantees, but perform well empirically or focus on other performance metrics such as fairness and makespan. These algorithms include slot-based schedulers that divide servers into a predefined number of slots for placing tasks [12, 13], resource packing approaches such as [14, 15], fair resource sharing approaches such as [16, 17], and Hadoop’s default schedulers such as FIFO, Fair scheduler, and Capacity scheduler.
In the second category, we have schedulers with throughput guarantees, e.g., [6, 8, 9, 10, 11]. They work under the assumption that there is a finite number of discrete job types. This assumption naturally lends itself to MaxWeight algorithms, where each server schedules jobs according to a maximum weight configuration chosen from a finite set of configurations. The number of configurations, however, grows exponentially with the number of types, which makes these algorithms impractical. Further, their technique cannot be applied to our setting, which can include an infinite number of job types.
There is also literature on the classical bin packing problem, where, given a list of objects of various sizes and an infinite number of unit-capacity bins, the goal is to use the minimum number of bins to pack the objects. Many algorithms have been proposed for this problem with approximation ratios for the optimal number of bins or waste, e.g. [23, 24, 25]. There is also work in a setting of bin packing with queues, e.g. [26, 27, 28], under the model that an empty bin arrives at each time, then some jobs from the queue are packed in the bin at that time, and the bin cannot be reused in the future. Our model is fundamentally different from these lines of work, as the number of servers (bins) in our setting is fixed and we need to reuse the servers to schedule further jobs from the queue when jobs depart from servers.
I-B Main Contributions
Our main contributions can be summarized as follows:
Characterization of Maximum Achievable Throughput. We characterize the maximum throughput (maximum supportable workload) that can be theoretically achieved by any scheduling algorithm in the setting where the jobs’ resource requirements follow a general probability distribution over possibly infinitely many job types. The construction of optimal schedulers to approach this maximum throughput relies on a careful partition of jobs into a sufficiently large number of types, using the complete knowledge of the resource probability distribution.
Oblivious Scheduling Algorithms. We introduce scheduling algorithms based on “Best-Fit” packing and “universal partitioning” of resource requirements into types, without the knowledge of the resource probability distribution. The algorithms have low complexity and can provably achieve at least and of the maximum throughput, respectively. Further, we show that is tight in the sense that no oblivious scheduling algorithm that maps the resource requirements into a finite number of types can achieve better than of the maximum throughput for all general resource distributions.
Empirical Evaluation. We evaluate the throughput and queueing delay performance of all algorithms empirically using both synthetic and real traffic traces.
II System Model and Definitions
We consider a collection of servers denoted by the set . For simplicity, we consider a single resource (e.g. memory) and assume that the servers have the same resource capacity. While job resource requirements are in general multi-dimensional (e.g. CPU, memory), it has been observed that memory is typically the bottleneck resource [20, 29]. Without loss of generality, we assume that each server’s capacity is normalized to one.
Jobs arrive over time, and the -th job, , requires an amount of the (normalized) resource for the duration of its service. The resource requirements are i.i.d. random variables with a general cumulative distribution function (cdf), with average . Note that each job should be served by one server and its resource requirement cannot be fragmented among multiple servers. In the rest of the paper, we use the terms job size and job resource requirement interchangeably.
We assume time is divided into time slots . At the beginning of each time slot , a set of jobs arrives to the system. We use to denote the cardinality of . The process , , is assumed to be i.i.d. with a finite mean and a finite second moment.
There is a queue that contains the jobs that have arrived up to time slot and have not been served by any servers yet. At each time slot, the scheduler can select a set of jobs from and place each job in a server that has enough available resource to accommodate it. Specifically, define , where is the set of existing jobs in server at time . At any time, the total size of the jobs packed in server cannot exceed its capacity, i.e.,
Note that jobs may be scheduled out of the order that they arrived, depending on the resource availability of servers. Let denote the cardinality of and denote the cardinality of (the number of jobs in the queue). Then the queue and its size evolve as
Once a job is placed in a server, it completes its service after a geometrically distributed amount of time with mean, after which it releases its reserved resource. This assumption is made to simplify the analysis, and the results can be extended to more general service time distributions (see Section VIII for a discussion).
Stability and Maximum Supportable Workload
The system state is given by which evolves as a Markov process over an uncountably infinite state space (the state space can be equivalently represented in a complete separable metric space, as we show in Section -B). We investigate the stability of the system in terms of the average queue size, i.e., the system is called stable if . Given a job size distribution , a workload is called supportable if there exists a scheduling policy that can stabilize the system for the job arrival rate and the mean service duration .
Maximum supportable workload is a workload such that any can be stabilized by some scheduling policy, which possibly uses the knowledge of the job size distribution , but no can be stabilized by any scheduling policy.
III Characterization of Maximum Supportable Workload
In this section, we provide a framework to characterize the maximum supportable workload given a job resource distribution . We start with an overview of the results for a system with a finite set of discrete job types.
III-A Finite-type System
It is easy to characterize the maximum supportable workload when jobs belong to a finite set of discrete types. In this case, it is well known that the supportable workload region is the sum of convex hull of feasible configurations of servers, e.g. [6, 8, 9, 10, 11], which are defined as follows.
Definition 1 (Feasible configuration).
Suppose there is a finite set of job types, with job sizes . An integer-valued vector is a feasible configuration for a server if it is possible to simultaneously pack jobs of type , jobs of type , , and jobs of type in the server, without exceeding its capacity. Assuming a normalized server capacity, any feasible configuration must therefore satisfy , , . We use to denote the (finite) set of all feasible configurations.
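To make Definition 1 concrete, the following Python sketch enumerates the feasible configurations of a unit-capacity server by brute force. The function name `feasible_configurations`, the tolerance `eps`, and the example sizes are assumptions made for this illustration and do not come from the original text.

```python
from itertools import product
from math import floor


def feasible_configurations(sizes, capacity=1.0, eps=1e-9):
    """Enumerate integer vectors k with sum(k[j] * sizes[j]) <= capacity.

    `sizes` lists the job size of each type; `eps` is a small tolerance
    (an assumption of this sketch) that guards against rounding error.
    """
    # A type of size s can appear at most floor(capacity / s) times.
    max_counts = [int(floor(capacity / s + eps)) for s in sizes]
    configs = []
    for k in product(*(range(m + 1) for m in max_counts)):
        used = sum(kj * sj for kj, sj in zip(k, sizes))
        if used <= capacity + eps:
            configs.append(k)
    return configs


# Hypothetical example: two job types of sizes 0.5 and 0.25.
configs = feasible_configurations([0.5, 0.25])
print(len(configs))  # 9, counting the empty configuration (0, 0)
```

Even in this two-type example there are nine configurations; the count grows quickly with the number of types, which is the complexity issue discussed in the text.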
We define to be the probability that size of an arriving job is , to be the vector of such arrival probabilities, and to be the workload. We also refer to as the workload vector. As shown in [6, 8, 9, 10], the maximum supportable workload is
where is the convex hull operator, and the vector inequality is component-wise. Also (or ) denotes supremum (or infimum). Hence any is supportable by some scheduling algorithm, while no can be supported by any scheduling algorithm.
The optimal or near-optimal scheduling policies then basically follow the well-known MaxWeight algorithm. Let be the number of type- jobs waiting in queue at time . At any time for each server , the algorithm maintains a feasible configuration that has the “maximum weight” [8, 9] (or a fraction of the maximum weight), among all the feasible configurations . The weight of a configuration is formally defined below.
Definition 2 (Weight of a configuration).
Given a queue size vector , the weight of a feasible configuration is defined as the inner product
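The weight of Definition 2 and the MaxWeight selection rule can be sketched in a few lines of Python. The helper names and the three example configurations (the maximal configurations for hypothetical job sizes 0.5 and 0.25) are assumptions for illustration only.

```python
def weight(config, queue_sizes):
    """Inner product of a configuration with the queue-size vector."""
    return sum(k * q for k, q in zip(config, queue_sizes))


def max_weight_configuration(configs, queue_sizes):
    """Return a configuration of largest weight (first one found on ties)."""
    return max(configs, key=lambda cfg: weight(cfg, queue_sizes))


# Hypothetical example: job sizes 0.5 and 0.25 on a unit-capacity server.
maximal_configs = [(2, 0), (1, 2), (0, 4)]
print(max_weight_configuration(maximal_configs, (3, 1)))  # (2, 0)
print(max_weight_configuration(maximal_configs, (1, 5)))  # (0, 4)
```

The selection favors the job types with the longest queues, which is what drives the stability argument in the finite-type setting.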
III-B Infinite-type System
In general, the support of the job size distribution can span an infinite number of types (e.g., can be a continuous function over ). We introduce the notion of virtual queue which is used to characterize the supportable workload for any general distribution .
Definition 3 (Partition and Virtual Queues (VQs)).
Define a partition of interval as a finite collection of disjoint subsets , , such that . If the size of an arriving job belongs to , we say it is a type- job. For each type , we consider a virtual queue which contains the type- jobs waiting in the queue for service.
As in the finite-type system, given a partition , we can define the probability that a type- job arrives as , the arrival probability vector as , and the workload vector as . However, under this definition, it is not clear what configurations are feasible, since the jobs in the same virtual queue can have different sizes, even though they are called of the same type. Hence we make the following definition.
Definition 4 (Rounded VQs).
We call “upper-rounded ”, if the sizes of type- jobs are assumed to be , . Similarly, we call them “lower-rounded ”, if the sizes of type- jobs are assumed to be , .
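As a rough illustration of Definitions 3 and 4, the sketch below maps a job size to its type under an interval partition and returns its rounded size. It assumes, for simplicity, that the subsets are intervals of the form (b_{j-1}, b_j] given by sorted breakpoints, and that the upper-rounded (lower-rounded) size is the right (left) endpoint of the interval; the names `job_type`, `upper_rounded`, `lower_rounded`, and the breakpoints are hypothetical.

```python
from bisect import bisect_left


def job_type(size, breakpoints):
    """Return j such that breakpoints[j-1] < size <= breakpoints[j]."""
    if not 0 < size <= breakpoints[-1]:
        raise ValueError("job size must lie in (0, 1]")
    return bisect_left(breakpoints, size)


def upper_rounded(size, breakpoints):
    """Size assumed for the job in the upper-rounded system."""
    return breakpoints[job_type(size, breakpoints)]


def lower_rounded(size, breakpoints):
    """Size assumed for the job in the lower-rounded system."""
    return breakpoints[job_type(size, breakpoints) - 1]


# Hypothetical partition of (0, 1] into (0, 0.25], (0.25, 0.5], (0.5, 1].
breakpoints = [0.0, 0.25, 0.5, 1.0]
print(job_type(0.3, breakpoints))       # 2
print(upper_rounded(0.3, breakpoints))  # 0.5
print(lower_rounded(0.3, breakpoints))  # 0.25
```

Under this reading, a packing that is feasible for the upper-rounded sizes is also feasible for the actual sizes, since every job is treated as at least as large as it really is.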
Given a partition , let and be respectively the maximum workload under which the system with upper-rounded virtual queues and the system with the lower-rounded virtual queues can be stabilized. Since these systems have finite types, these quantities can be described by (4) applied to the corresponding finite-type system with workload vector .
Let also and where the supremum and infimum are over all possible partitions of interval . The next theorem states the existence of the maximum supportable workload.
Theorem 1. Consider any general (continuous or discontinuous) probability distribution of job sizes with cdf . Then there exists a unique such that . Further, given any , there is a partition such that the associated upper-rounded virtual queueing system (and hence the original system) can be stabilized.
The proof of Theorem 1 has two steps. First, we show that for any partition . Second, we construct a sequence of partitions, that depend on the job size distribution , and become increasingly finer, such that the difference between the two bounds vanishes in the limit.
Full proof can be found in Appendix -A. ∎
Theorem 1 implies that there is a way of mapping the job sizes to a finite number of types using partitions, such that by using finite-type scheduling algorithms, the achievable workload approaches the optimal workload as partitions become finer. However, the construction of the partition crucially relies on the knowledge of the job size distribution , which may not be readily available in practice. Further, the number of feasible configurations grows exponentially large as the number of subsets in the partition increases, which prevents efficient implementation of discrete type scheduling policies (e.g. MaxWeight) in practice.
Next, we focus on low-complexity scheduling algorithms that do not assume the knowledge of a priori, and can provide a fraction of the maximum supportable workload .
IV Best-Fit Based Scheduling
The Best-Fit algorithm was first introduced as a heuristic for the Bin Packing problem: given a list of objects of various sizes, we are asked to pack them into bins of unit capacity so as to minimize the number of bins used. Under Best-Fit, the objects are processed one by one and each object is placed in the “tightest” bin (with the least residual capacity) that can accommodate the object; otherwise a new bin is used. Theoretical guarantees of Best-Fit in terms of approximation ratio have been extensively studied under discrete and continuous object size distributions [23, 24, 25].
There are several fundamental differences between the classical bin packing problem and our problem. In the bin packing problem, there is an infinite number of bins available and once an object is placed in a bin, it remains in the bin forever, while in our setting, the number of bins (the equivalent of servers) is fixed, and bins have to be reused to serve new objects from the queues as objects depart from the bins, and new objects arrive to the queue. Next, we describe how Best-Fit (BF) can be adapted for job scheduling in our setting.
IV-A BF-J/S Scheduling Algorithm
Consider the following two adaptations of Best-Fit (BF) for job scheduling:
BF-J (Best-Fit from Job’s perspective):
List the jobs in the queue in an arbitrary order (e.g. according to their arrival times). Starting from the first job, each job is placed in the server with the “least residual capacity” among the servers that can accommodate it, if possible, otherwise the job remains in the queue.
BF-S (Best-Fit from Server’s perspective):
List servers in an arbitrary order (e.g. according to their index). Starting from the first server, each server is filled iteratively by choosing the “largest-size job” in the queue that can fit in the server, until no more jobs can fit.
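The two adaptations can be sketched as follows. This is a minimal illustration, assuming job sizes are stored in a plain list and each server is represented only by its residual capacity; the function names `bf_j` and `bf_s` and the tolerance `EPS` are choices made for this sketch.

```python
EPS = 1e-12  # tolerance for floating-point comparisons (sketch assumption)


def bf_j(queue, residual):
    """BF-J: scan jobs in order; put each in the tightest server that fits.

    `residual` is mutated in place. Returns (placements, remaining_queue),
    where placements is a list of (job_size, server_index) pairs.
    """
    placements, remaining = [], []
    for size in queue:
        candidates = [s for s, r in enumerate(residual) if size <= r + EPS]
        if candidates:
            s = min(candidates, key=lambda i: residual[i])
            residual[s] -= size
            placements.append((size, s))
        else:
            remaining.append(size)
    return placements, remaining


def bf_s(queue, residual, servers=None):
    """BF-S: scan servers in order; fill each with the largest job that fits."""
    remaining = list(queue)
    placements = []
    for s in (range(len(residual)) if servers is None else servers):
        while True:
            fitting = [x for x in remaining if x <= residual[s] + EPS]
            if not fitting:
                break
            size = max(fitting)
            remaining.remove(size)
            residual[s] -= size
            placements.append((size, s))
    return placements, remaining


# Hypothetical example: three queued jobs, two servers.
print(bf_j([0.5, 0.25, 0.75], [1.0, 0.5]))
# ([(0.5, 1), (0.25, 0), (0.75, 0)], [])
print(bf_s([0.5, 0.25, 0.75], [1.0, 0.5]))
# ([(0.75, 0), (0.25, 0), (0.5, 1)], [])
```

Both rules pack all three jobs in this example, but they reach different packings because one iterates over jobs and the other over servers.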
BF-J and BF-S need to be performed in every time slot. Observe that, after either algorithm runs, no further job from the queue can be added to any of the servers. However, these algorithms are not computationally efficient, as they both make many redundant searches over the jobs in the queue or over the servers when there are no new job arrivals to the queue or no job departures from some servers. Combining both adaptations, we describe below an algorithm that is computationally more efficient.
BF-J/S (Best-Fit from Job’s and Server’s perspectives):
It consists of two steps:
Perform BF-S only over the list of servers that had job departures during the previous time slot. Hence, some jobs that were not scheduled in the previous time slot, or some of the newly arrived jobs, are scheduled in servers.
Perform BF-J only over the list of newly arrived jobs that have not been scheduled in the first step.
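One time slot of BF-J/S can be sketched as below. This is an illustrative reading of the two steps, assuming jobs are `(job_id, size)` pairs and servers are tracked by residual capacity; the function name `bf_js_step` and the data layout are assumptions of this sketch.

```python
EPS = 1e-12  # floating-point tolerance (sketch assumption)


def bf_js_step(old_queue, new_jobs, residual, departed_servers):
    """Run one slot of BF-J/S and return (placements, waiting).

    old_queue / new_jobs: lists of (job_id, size) pairs.
    residual: residual capacity per server, mutated in place.
    departed_servers: indices of servers that had departures last slot.
    """
    waiting = list(old_queue) + list(new_jobs)
    placements = {}

    # Step 1: BF-S restricted to servers that just had departures.
    for s in departed_servers:
        while True:
            fitting = [job for job in waiting if job[1] <= residual[s] + EPS]
            if not fitting:
                break
            job = max(fitting, key=lambda j: j[1])  # largest job that fits
            waiting.remove(job)
            residual[s] -= job[1]
            placements[job[0]] = s

    # Step 2: BF-J restricted to new jobs that step 1 did not place.
    new_ids = {job_id for job_id, _ in new_jobs}
    for job in [j for j in waiting if j[0] in new_ids]:
        candidates = [s for s, r in enumerate(residual) if job[1] <= r + EPS]
        if candidates:
            s = min(candidates, key=lambda i: residual[i])  # tightest server
            waiting.remove(job)
            residual[s] -= job[1]
            placements[job[0]] = s
    return placements, waiting


# Hypothetical slot: job "a" was already waiting, "b" and "c" just arrived,
# and only server 2 had departures.
print(bf_js_step([("a", 0.75)], [("b", 0.5), ("c", 0.25)],
                 [0.5, 0.25, 1.0], departed_servers=[2]))
# ({'a': 2, 'c': 2, 'b': 0}, [])
```

Old jobs are only examined against servers whose residual capacity changed, and untouched servers are only examined against new jobs, which is where the savings over running BF-J or BF-S in full come from.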
IV-B Throughput Guarantee
The following theorem characterizes the maximum supportable workload under BF-J/S.
Theorem 2. Suppose any job has a minimum size . Algorithm BF-J/S can achieve at least half of the maximum supportable workload , for any .
We present a sketch of the proof here and provide the full proof in Appendix -B. The proof uses Lyapunov analysis for a Markov chain whose state includes the jobs in queues and servers and their sizes. The Markov chain can be equivalently represented in a Polish space and we prove its positive recurrence using a multi-step Lyapunov technique and properties of BF-J/S. We use a Lyapunov function which is the sum of sizes of all jobs in the system at time . Given that jobs have a minimum size, keeping the total size bounded implies the number of jobs is also bounded.
The key argument in the proof is that, under BF-J/S as described, all servers operate more than “half full” most of the time when the total size of jobs in the queue becomes large. To prove this, we consider two possible cases:
The total size of jobs in queue with size is large:
In this case, these jobs will be scheduled greedily whenever the server is more than half empty. Hence, the server will always become more than half full until there are no such jobs in the queue.
The total size of jobs in queue with size is large:
If at time slot , a job in server is not completed, it will complete its service within the next time slot with probability , independently of the other jobs in the server. Given the minimum job size, the number of jobs in a server is bounded so it will certainly empty in a finite time. Once this happens, jobs will be scheduled starting from the largest-size one, and the server will remain more than half full, as long as there is a job of size more than to replace it. This step is true because of the way Best-Fit works and does not hold for other bin packing algorithms like First-Fit.
See the full proof in Appendix -B. ∎
V Partition Based Scheduling
BF-J/S is an algorithm that can achieve at least half of the maximum workload , without relying on any partitioning of jobs into types. In this section, we propose partition based scheduling algorithms that can provably achieve a larger fraction of the maximum workload , using a universal partitioning into a small number of types, without the knowledge of job size distribution .
V-A Universal Partition and Associated Virtual Queues
Consider a partition of the interval into the following subintervals:
We refer to this partition as partition , where is a fixed parameter to be determined shortly. The odd and even subintervals in are geometrically shrinking. Figure 2 gives a visualization of this partition.
Jobs in queue are divided among virtual queues (Definition 3) according to partition . Specifically, when the size of a job falls in the subinterval , , we say this job is of type and it is placed in a virtual queue , without rounding its size. Moreover, jobs whose sizes fall in are placed in the last virtual queue , and their sizes are rounded up to .
We use to denote the size (cardinality) of at time and use to denote the vector of all VQ sizes.
V-B VQS (Virtual Queue Scheduling) Algorithm
To describe the VQS algorithm, we define the following reduced set of configurations, which are feasible for the system of upper-rounded VQs (Definition 4).
Definition 5 (Reduced feasible configuration set).
The reduced feasible configuration set, denoted by , consists of the following configurations:
where denotes the basis vector with a single job of type , , and zero jobs of any other types.
Note that each configuration either contains jobs from only one , , or contains jobs from and one other .
The “VQS algorithm” consists of two steps: (1) setting active configuration, and (2) job scheduling using the active configuration:
Setting active configuration:
Under VQS, every server has an active configuration which is renewed only when the server becomes empty. Suppose time slot is the -th time that server is empty (i.e., it has been empty or all its jobs depart during this time slot). At this time, the configuration of server is set to the max weight configuration among the configurations of (Definitions 2 and 5), i.e.,
The active configuration remains fixed until the next time that the server becomes empty again, i.e.,
Job scheduling:
Suppose the active configuration of server at time is . Then the server schedules jobs as follows:
If , the server reserves of its capacity for serving jobs from , so it can serve at most one job of type at any time. If there is no such job in the server already, it schedules one from .
Any configuration has at most one other than . The server will schedule jobs from the corresponding , starting from the head-of-the-line job in , until no more jobs can fit in the server. The actual number of jobs scheduled from in the server could be more than depending on their actual sizes.
The reason for choosing times to renew the configuration of server is to avoid possible preemption of existing jobs in server (similar to [6, 9]). Also note that active configurations in are based on upper-rounded VQs. Since jobs are not actually rounded in VQs, the algorithm can schedule more jobs than what is specified in the configuration.
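The renewal rule and the head-of-the-line filling can be sketched as follows. This is a simplified illustration and not the full VQS algorithm: it takes the reduced configuration set as an input rather than constructing it, handles only a configuration that draws from a single virtual queue, omits the capacity reservation step, and stops filling as soon as the head-of-the-line job does not fit. The class `Server` and both function names are hypothetical.

```python
from collections import deque


class Server:
    """Unit-capacity server with an active configuration (sketch)."""

    def __init__(self, capacity=1.0):
        self.capacity = capacity
        self.jobs = []      # sizes of the jobs currently in service
        self.active = None  # active configuration, a tuple of counts

    def residual(self):
        return self.capacity - sum(self.jobs)


def renew_if_empty(server, configs, vq_lengths):
    """Step 1: choose a max-weight configuration, only if the server is empty."""
    if not server.jobs:
        server.active = max(
            configs, key=lambda k: sum(a * q for a, q in zip(k, vq_lengths))
        )
    return server.active


def fill_from_vq(server, vq):
    """Step 2 (single-VQ case): serve head-of-the-line jobs while they fit."""
    while vq and vq[0] <= server.residual() + 1e-12:
        server.jobs.append(vq.popleft())


# Hypothetical two-type example; the configuration set is illustrative only.
srv = Server()
print(renew_if_empty(srv, [(1, 0), (0, 2)], vq_lengths=(1, 3)))  # (0, 2)
vq = deque([0.4, 0.45, 0.3])
fill_from_vq(srv, vq)
print(srv.jobs, list(vq))  # [0.4, 0.45] [0.3]
# The server is busy, so the configuration is not renewed.
print(renew_if_empty(srv, [(1, 0), (0, 2)], vq_lengths=(10, 0)))  # (0, 2)
```

Because actual job sizes are used when filling, a server may hold more jobs than the count in its active configuration, as noted in the text.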
V-C Throughput Guarantee
The VQS algorithm can provide a stronger throughput guarantee than BF-J/S. A key step in establishing the throughput guarantee is a property of the configurations in the set , which is stated below.
Consider any partition which is a refinement of partition , i.e., any subset of is contained in an interval in (6). Given any set of jobs with sizes in in the queue, let and be the corresponding vector of sizes under partition and partition . Then there is a configuration such that
where is the set of “all” feasible configurations based on upper-rounded VQs for partition .
For simplicity of description, consider to be a partition of into subintervals , . The proof arguments are applicable to any other types of subsets of as long as each subset is contained in an interval in (6).
Given the proposition’s assumption, we can define sets , , such that iff . Any job in , , under partition , belongs to under partition , therefore
Let . Note that in any feasible configuration , can be or . To show (10), we consider these two cases separately:
Case 1. :
We claim at least one of the following inequalities is true
If the claim is not true, we reach a contradiction because
where is due to the assumption that none of inequalities in (12) hold and using the fact that if , is due to the fact if , and is due to the server’s capacity constraint for feasible configuration .
Case 2. :
In this case . We further distinguish three cases for compared to : , , and . In the second case, we further consider two subcases depending on being or . Here we present the analysis of the case , . The rest of the cases are either trivial or follow a similar argument and can be found in Appendix -C.
Let , then one of the following inequalities has to be true
otherwise, we reach a contradiction, similar to Case 1, i.e.,
where is due to the assumption that none of inequalities in (13) hold, and is due to the constraint that the jobs in the configuration , other than the job types in , should fit in a space of at most (the rest is occupied by a job of size at least ). It is then easy to verify that if for some then inequality (10) is true for configuration as
Similarly if for some then inequality (10) is true for configuration as
The following theorem states the result regarding the throughput of VQS.
Theorem 3. VQS achieves at least of the optimal workload , if arriving jobs have a minimum size of at least .
Hence, given a minimum job’s resource requirement , has to be chosen larger than in the VQS algorithm. Theorem 3 is not trivial as it implies that by scheduling under the configurations in (7), on average at most of each server’s capacity will be underutilized because of capacity fragmentations, irrespective of the job size distribution . Moreover, using reduces the search space from configurations to only configurations, while still guaranteeing of the optimal workload .
A natural and less dense partition could be to only consider the cuts at points for . This creates a partition consisting of subintervals . The convex hull of only the first configurations of contains all feasible configurations of this partition. Using arguments similar to proof of Theorem 3, we can show that this partition can only achieve of the optimal workload . One might conjecture that by refining partition (6) or using different partitions, we can achieve a fraction larger than of the optimal workload ; however, if the partition is agnostic to the job size distribution , refining the partition or using other partitions does not help. We state the result in the following Proposition.
Consider any partition consisting of a finite number of disjoint sets , . Any scheduling algorithm that maps the sizes of jobs in to (i.e., schedules based on upper-rounded VQs) cannot achieve more than of the optimal workload for all .
See Appendix -E for the proof. ∎
Theorem 3 assumed that there is a minimum resource requirement of at least . This assumption can be relaxed as stated in the following corollary.
Corollary 1. Consider any general distribution of job sizes . Given any , choose to be the smallest integer such that , then the VQS algorithm achieves at least of the optimal workload .
See Appendix -F for the proof. ∎
VI VQS-BF: Incorporating Best-Fit in VQS
While the VQS algorithm achieves in theory a larger fraction of the optimal workload than BF-J/S, it is quite inflexible compared to BF-J/S, as it can only schedule according to certain job configurations, and the time until the configuration changes may be long, which might cause excessive queueing delay. We introduce a hybrid VQS-BF algorithm that achieves the same fraction of the optimal workload as VQS, but in practice has the flexibility of BF. The algorithm has two steps similar to VQS: setting the active configuration is exactly the same as the first step in VQS, but it differs in the way that jobs are scheduled in the second step. Suppose the active configuration of server at time is , then:
If , the server will try to schedule the largest-size job from that can fit in it. This may not be possible because of jobs already in the server from previous time slots. Unlike VQS, when jobs from are scheduled, they reserve exactly the amount of resource that they require, and no amount of resource is reserved if no job from is scheduled.
Any configuration has at most one other than . Server attempts to schedule jobs from the corresponding , starting from the largest-size job that can fit in it. Depending on prior jobs in server, this procedure will stop when either the number of jobs from in the server is at least , or becomes empty, or no more jobs from can fit in the server.
Server uses BF-S to possibly schedule more jobs in its remaining capacity from the remaining jobs in the queue.
The performance guarantee of VQS-BF is the same as that of VQS, as stated by the following theorem.
If jobs have a minimum size of at least , VQS-BF achieves at least . Further, for a general job-size distribution , if is chosen such that , then VQS-BF achieves at least .
The proof is similar to that of Theorem 3. However, the difference is that the configuration of a server (jobs residing in a server) is not predictable, unless it empties, at which point we can ensure that it will schedule at least the jobs in the max weight configuration assigned to it, for a number of time slots proportional to the total queue length. The fact that the scheduling starts from the largest job in a virtual queue is important for this assertion, similarly to the importance of Best Fit in the proof of Theorem 2.
In case is chosen such that , the arguments in Corollary 1 are applicable here as well.
The full proof is provided in Appendix -G. ∎
VII Evaluation Results
VII-A Synthetic Simulations
VII-A1 Instability of VQS and tightness of bound
We first present an example that shows the tightness of the bound on the achievable throughput of VQS. Consider a single server where jobs have two discrete sizes and . The jobs arrive according to a Poisson process with average rate jobs per time slot and with each job size being equally likely. Each job completes its service after a geometric number of time slots with mean . Observe that by using configuration (i.e., 1 spot per job type) any arrival rate below jobs per time slot is supportable. This is not the case, though, for VQS, which schedules based on configurations , so it can either schedule two jobs of size or one job of size . As a result, VQS is unstable for any arrival rate greater than . Both of the other proposed algorithms, BF-J/S and VQS-BF, circumvent this problem. The evolution of the total queue size is depicted in Figure 2(a).
VII-A2 Instability of BF-J/S
We present an example that shows BF-J/S is not stable while VQS can stabilize the queues. Consider a single server of capacity , with job sizes sampled from two discrete values and . The jobs arrive according to a Poisson process with average rate jobs per time slot, and jobs of size are twice as likely to appear as jobs of size . Each job completes its service after a fixed number of time slots . The evolution of the queue size is depicted in Figure 2(b). This shows an example where VQS is stable, while both BF-J/S and VQS-BF are not.
To justify the behavior of the latter two algorithms, we notice that under both the server is likely to schedule according to the configuration that uses two jobs of size and one of size . Because of fixed service times, jobs that are scheduled at different time slots, will also depart at different time slots. Hence, it is possible that the scheduling algorithm will not allow the configuration to change, unless one of the queues empties. However, there is a positive probability that the queues will never get empty since the expected arrival rate is more than the departure rate for both types. The arrival rate vector is while the departure rate vector .
on the other hand will always schedule either five jobs of size or two of size . The average departure rate in the first configuration is , and in the second configuration . The arrival vector is in convex hull of these two vectors as and therefore is supportable.
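The convex-hull argument above can be checked mechanically. The following is a minimal sketch, assuming exactly two configurations and using purely illustrative rate vectors (they are not the values of this example, which are not reproduced here). The helper name `time_sharing_interval` is hypothetical. It returns the range of time fractions for which a mix of the two departure rate vectors dominates the arrival rate vector componentwise.

```python
from fractions import Fraction

def time_sharing_interval(arrival, dep_a, dep_b):
    """Return (lo, hi) such that for every theta in [lo, hi],
    theta * dep_a + (1 - theta) * dep_b >= arrival componentwise,
    or None if no such theta in [0, 1] exists.
    All three arguments are equal-length sequences of rates per job type."""
    lo, hi = Fraction(0), Fraction(1)
    for lam, da, db in zip(arrival, dep_a, dep_b):
        lam, da, db = Fraction(lam), Fraction(da), Fraction(db)
        diff = da - db
        if diff == 0:
            # Both configurations serve this type at the same rate.
            if lam > db:
                return None
        elif diff > 0:
            # Need theta >= (lam - db) / diff.
            lo = max(lo, (lam - db) / diff)
        else:
            # Dividing by a negative number flips the inequality.
            hi = min(hi, (lam - db) / diff)
    return (lo, hi) if lo <= hi else None

# Illustrative numbers only (assumed, not taken from the example above):
# configuration A serves type 1 at rate 4, configuration B serves type 2 at rate 2.
print(time_sharing_interval((1, 1), (4, 0), (0, 2)))
print(time_sharing_interval((3, 1), (4, 0), (0, 2)))
```

A non-empty interval means the arrival vector can be covered by alternating between the two configurations. Stability in the strict sense additionally requires the arrival vector to lie in the interior of the region, which this sketch does not test.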
Vii-A3 Comparison using Uniform distributions
To better understand how the algorithms operate under a non-discrete distribution of job sizes, we test them using a uniform distribution. We choose servers, each with capacity . We perform two experiments: the job sizes are distributed uniformly over in the first experiment and uniformly over in the second one. Hence is in the first experiment and in the second one.
The service time of each job is geometrically distributed with mean time slots, so the departure rate is . The job arrivals follow a Poisson process with rate jobs per time slot (and thus ), where is a constant which we refer to as “traffic intensity” and is the number of servers in these experiments. A value of is a bound on what is theoretically supportable by any algorithm. In each experiment, we change the value of in the interval . The results are depicted in Figure 5.
Overall, we can see that VQS is worse than the other two algorithms in terms of average queue size. Algorithms BF-J/S and VQS-BF look comparable in the first experiment for traffic intensities up to , otherwise BF-J/S has a clear advantage. One interpretation of the results is that VQS and VQS-BF have particularly worse delays when the average job size is large, since large jobs cannot be scheduled most of the time unless they are part of the active configuration of a server. That makes these algorithms less flexible than BF-J/S for scheduling such jobs.
Vii-B Google Trace Simulations
We test the algorithms using a traffic trace from a Google cluster dataset . We performed the following preprocessing on the dataset:
We filtered the tasks and kept those that were completed without interruptions/errors.
All tasks had two resources, CPU and memory. To convert them to a single resource, we used the maximum of the two requirements which were already normalized in scale.
The servers had two resources, CPU and memory, and changed over time as they were updated or replaced. For simplicity, we consider a fixed number of servers, each with a single resource capacity normalized to .
Trace events have microsecond accuracy. In our algorithms, we make scheduling decisions every msec.
We used a part of the trace corresponding to about a million task arrivals spanning over approximately 1.5 days.
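The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the `Task` record and its field names are hypothetical and do not correspond to the schema of the trace, and the slot length `slot_us` is left as a parameter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    # Hypothetical record layout, assumed for illustration only.
    arrival_us: int      # arrival timestamp in microseconds
    cpu: float           # normalized CPU requirement
    memory: float        # normalized memory requirement
    completed_ok: bool   # True if the task finished without interruption or error

def preprocess(tasks, slot_us):
    """Return a list of (arrival_slot, size) pairs sorted by arrival slot.

    Only cleanly completed tasks are kept, the two resources are collapsed
    into one by taking their maximum, and microsecond timestamps are
    bucketed into scheduling slots of length slot_us."""
    jobs = []
    for t in tasks:
        if not t.completed_ok:
            continue
        size = max(t.cpu, t.memory)
        slot = t.arrival_us // slot_us
        jobs.append((slot, size))
    jobs.sort(key=lambda job: job[0])  # stable sort keeps ties in input order
    return jobs

example = [
    Task(2500, 0.2, 0.5, True),
    Task(100, 0.3, 0.1, True),
    Task(900, 0.9, 0.9, False),
]
print(preprocess(example, slot_us=1000))
```

Taking the maximum of the two requirements is conservative: a job that fits under the single-resource size also fits under each original resource, at the cost of some wasted capacity.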
We compare the algorithms proposed in this work and a baseline based on Hadoop’s default FIFO scheduler . While the original FIFO scheduler is slot-based , the FIFO scheduler considered here schedules jobs in a FIFO manner, by attempting to pack the first job in the queue to the first server that has sufficient capacity to accommodate the job. We refer to this scheme as FIFO-FF which should perform better than the slot-based FIFO, since it packs jobs in servers (using First-Fit) instead of using predetermined slots.
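One scheduling pass of the FIFO-FF baseline described above might look like the sketch below. It assumes that a head-of-line job that fits in no server blocks the jobs behind it, which is one reading of "FIFO manner". The function name and the integer capacity units are illustrative choices.

```python
from collections import deque

def fifo_first_fit(queue, free):
    """Place jobs from the head of `queue` into servers using First-Fit.

    queue: deque of job sizes in arrival order (mutated in place).
    free:  list of residual capacities, one per server (mutated in place).
    Returns a list of (job_size, server_index) placements.
    Assumption: if the head-of-line job fits nowhere, the pass stops,
    even when later jobs in the queue would fit."""
    placements = []
    while queue:
        size = queue[0]
        # First server, in index order, with enough residual capacity.
        target = next((i for i, f in enumerate(free) if f >= size), None)
        if target is None:
            break
        free[target] -= size
        placements.append((queue.popleft(), target))
    return placements

waiting = deque([5, 6, 3, 9, 1])
residual = [7, 10]
print(fifo_first_fit(waiting, residual))
print(list(waiting), residual)
```

In this run the job of size 9 cannot be placed, so the job of size 1 behind it also waits although it would fit. This head-of-line blocking is one reason a strict FIFO policy can leave capacity unused.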
We scale the job arrival rate by multiplying the arrival times of tasks by a factor . We refer to as “traffic scaling” because larger implies that more jobs arrive in a time unit. The number of servers was fixed to , while traffic scaling varied from to . The average queue sizes are depicted in Figure 5. As traffic scaling increases, BF-J/S and VQS-BF have a clear advantage over the other schemes, with VQS-BF also yielding a small improvement in the queue size compared to BF-J/S. It is interesting that VQS-BF has a consistent, albeit small, advantage over BF-J/S at higher traffic, although both algorithms are greedy in the way that they pack jobs in servers.
Viii Discussion and Open Problems
In this work, we designed three scheduling algorithms for jobs whose sizes come from a general unknown distribution. Our algorithms achieved two goals: keeping the complexity low, and providing throughput guarantees for any distribution of job sizes, without actually knowing the prior distribution.
Our results, however, are lower bounds on the performance of the algorithms, and simulation results show that the algorithms BF-J/S and VQS-BF may support workloads that go beyond their theoretical lower bounds. It remains an open problem to tighten the lower bounds or construct upper bounds that approach the lower bounds.
In addition, we made some simplifying assumptions in our model, but the results hold under more general models. One of the assumptions was that the servers are homogeneous. BF-J/S and our analysis can be easily applied without this assumption. For VQS and VQS-BF, the scheduling can also be applied without changes when servers have resources that differ by a power of which is a common case. As a different approach, we can maintain different sets of virtual queues, one set for each type of server.
Another assumption was that service durations follow a geometric distribution. This assumption was made to simplify the proofs, as it justifies that a server will empty in a finite expected time by chance. Since this may not happen under general service time distributions (e.g., one may construct adversarial service durations that prevent a server from becoming empty), in all our algorithms we can incorporate a stalling technique proposed in  that actively forces a server to become empty by preventing it from scheduling new jobs. The decision to stall a server is made whenever the server operates in an “inefficient” configuration. For BF-J/S, that condition is when the server is less than half full, while for VQS and VQS-BF, it is when the weight of the configuration of a server is far from the maximum weight over .
Finally, we based our scheduling decisions on a single resource. Depending on the workload, this may cause different levels of fragmentation, but resource requirements will not be violated if the resources of jobs are mapped to the maximum resource (e.g., as in our preprocessing of the Google trace data). A more efficient approach is to extend BF-J/S to a multi-resource setting, by considering a Best-Fit score as a linear combination of per-resource occupancies. It has been empirically shown in  that the inner product of the vector of the job’s resource requirements and the vector of the server’s occupied resources is a good candidate. We leave the theoretical study of scheduling jobs with a multi-resource distribution as future research.
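As a rough illustration of the half-full stalling condition mentioned for BF-J/S, the sketch below keeps a per-server `stalled` flag. The `Server` class is hypothetical, and the choice to evaluate the condition at job departures is an assumption made here for concreteness. The text above does not fix when the check is performed.

```python
class Server:
    """Toy single-resource server with a stalling flag (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.jobs = []          # sizes of jobs currently in service
        self.stalled = False

    def occupied(self):
        return sum(self.jobs)

    def can_accept(self, size):
        # A stalled server takes no new jobs until it has emptied.
        return not self.stalled and self.occupied() + size <= self.capacity

    def admit(self, size):
        if not self.can_accept(size):
            return False
        self.jobs.append(size)
        return True

    def depart(self, size):
        self.jobs.remove(size)
        if not self.jobs:
            # The server emptied, so it may resume normal operation.
            self.stalled = False
        elif self.occupied() < self.capacity / 2:
            # Assumed rule: less than half full after a departure means stall.
            self.stalled = True

s = Server(capacity=10)
s.admit(4)
s.admit(4)
s.depart(4)               # occupancy drops to 4, below half of 10
print(s.stalled, s.can_accept(1))
s.depart(4)               # server is now empty
print(s.stalled, s.can_accept(1))
```

Once stalled, the server only drains, so it empties after the remaining jobs finish regardless of how their service times are distributed.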
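The inner-product score mentioned above can be sketched as a server-selection rule. This is only an illustration of the scoring idea under assumed data structures (tuples of per-resource values). It is not an extension of BF-J/S with any guarantee, and `pick_server` is a hypothetical name.

```python
def pick_server(demand, occupied, capacity):
    """Return the index of the feasible server maximizing the inner product
    of the job's demand vector and the server's occupied-resource vector,
    or None if the job fits in no server.

    demand:   tuple of per-resource requirements of the job
    occupied: list of tuples, resources currently in use on each server
    capacity: list of tuples, total resources of each server"""
    best, best_score = None, None
    for i, (occ, cap) in enumerate(zip(occupied, capacity)):
        # Skip servers where any resource would be exceeded.
        if any(o + d > c for o, d, c in zip(occ, demand, cap)):
            continue
        score = sum(d * o for d, o in zip(demand, occ))
        if best_score is None or score > best_score:
            best, best_score = i, score
    return best

occupied = [(0, 0), (6, 1), (1, 6), (9, 9)]
capacity = [(10, 10)] * 4
print(pick_server((2, 1), occupied, capacity))   # CPU-heavy job
print(pick_server((1, 2), occupied, capacity))   # memory-heavy job
```

The score favors servers that are already heavily loaded in the resources the job needs most, which mimics Best-Fit in each dimension while still respecting every capacity constraint.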
-  “Apache Hadoop,” https://hadoop.apache.org, 2018.
-  “Apache Spark,” https://spark.apache.org, 2018.
-  “Apache Hive,” https://hive.apache.org, 2018.
-  J. Wilkes, “Google Cluster Data,” https://github.com/google/cluster-data, 2011.
-  S. Martello and P. Toth, Knapsack Problems: Algorithms and Computer Implementations. New York, NY, USA: John Wiley & Sons, Inc., 1990.
-  S. T. Maguluri, R. Srikant, and L. Ying, “Stochastic models of load balancing and scheduling in cloud computing clusters,” in Proceedings of IEEE INFOCOM, 2012, pp. 702–710.
-  A. L. Stolyar, “An infinite server system with general packing constraints,” Operations Research, vol. 61, no. 5, pp. 1200–1217, 2013.
-  S. T. Maguluri and R. Srikant, “Scheduling jobs with unknown duration in clouds,” in Proceedings 2013 IEEE INFOCOM, 2013, pp. 1887–1895.
-  ——, “Scheduling jobs with unknown duration in clouds,” IEEE/ACM Transactions on Networking, vol. 22, no. 6, pp. 1938–1951, 2014.
-  J. Ghaderi, “Randomized algorithms for scheduling VMs in the cloud,” in IEEE INFOCOM, 2016, pp. 1–9.
-  K. Psychas and J. Ghaderi, “On non-preemptive VM scheduling in the cloud,” Proc. ACM Meas. Anal. Comput. Syst. (ACM SIGMETRICS 2018), vol. 1, no. 2, pp. 35:1–35:29, Dec. 2017.
-  M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, “Quincy: fair scheduling for distributed computing clusters,” in Proc. of the ACM SIGOPS symposium on operating systems principles, 2009, pp. 261–276.
-  S. Tang, B.-S. Lee, and B. He, “Dynamic slot allocation technique for mapreduce clusters,” in IEEE International Conference on Cluster Computing (CLUSTER), 2013, pp. 1–8.
-  R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella, “Multi-resource packing for cluster schedulers,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, pp. 455–466, 2015.
-  A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, “Large-scale cluster management at Google with Borg,” European Conference on Computer Systems - EuroSys, pp. 1–17, 2015.
-  A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, “Dominant resource fairness: Fair allocation of multiple resource types,” NSDI, vol. 167, no. 1, pp. 24–24, 2011.
-  M. Chowdhury, Z. Liu, A. Ghodsi, and I. Stoica, “Hug: Multi-resource fairness for correlated and elastic demands.” in NSDI, 2016, pp. 407–424.
-  M. Usama, M. Liu, and M. Chen, “Job schedulers for big data processing in hadoop environment: testing real-life schedulers using benchmark programs,” Digital Communications and Networks, vol. 3, no. 4, pp. 260–273, 2017.
-  “Hadoop: Fair Scheduler,” https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html, 2018.
-  “Hadoop: Capacity Scheduler,” https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html, 2018.
-  L. Tassiulas and A. Ephremides, “Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks,” IEEE Transactions on Automatic Control, vol. 37, no. 12, pp. 1936–1948, 1992.
-  M. R. Garey and D. S. Johnson, “Computers and intractability: A guide to the theory of NP-completeness,” WH Freeman & Co., 1979.
-  D. S. Johnson, A. Demers, J. D. Ullman, M. R. Garey, and R. L. Graham, “Worst-case performance bounds for simple one-dimensional packing algorithms,” SIAM Journal on Computing, vol. 3, no. 4, pp. 299–325, 1974.
-  C. Kenyon and M. Mitzenmacher, “Linear waste of best fit bin packing on skewed distributions,” Random Structures & Algorithms, vol. 20, no. 3, pp. 441–464, 2002.
-  E. G. Coffman Jr, M. R. Garey, and D. S. Johnson, “Approximation algorithms for bin packing: A survey,” in Approximation algorithms for NP-hard problems. PWS Publishing Co., 1996, pp. 46–93.
-  D. Shah and J. N. Tsitsiklis, “Bin packing with queues,” Journal of Applied Probability, vol. 45, no. 4, pp. 922–939, 2008.
-  E. Coffman and A. L. Stolyar, “Bandwidth packing,” Algorithmica, vol. 29, no. 1-2, pp. 70–88, 2001.
-  D. Gamarnik, “Stochastic bandwidth packing process: stability conditions via lyapunov function technique,” Queueing systems, vol. 48, no. 3-4, pp. 339–363, 2004.
-  V. Nitu, A. Kocharyan, H. Yaya, A. Tchana, D. Hagimont, and H. Astsatryan, “Working set size estimation techniques in virtualized environments: One size does not fit all,” Proc. of the ACM on Measurement and Analysis of Computing Systems, vol. 2, no. 1, p. 19, 2018.
-  S. Foss and T. Konstantopoulos, “An overview of some stochastic stability methods,” Journal of the Operations Research Society of Japan, vol. 47, no. 4, pp. 275–303, 2004.
-  P. Billingsley, Convergence of Probability Measures 2e. John Wiley & Sons, Inc, 1999.
-  S. Ethier and T. Kurtz, Markov Processes: Characterization and Convergence, ser. Wiley Series in Probability and Statistics. Wiley, 2009. [Online]. Available: https://books.google.com/books?id=zvE9RFouKoMC
-  R. Tweedie, “Criteria for classifying general markov chains,” Advances in Applied Probability, vol. 8, no. 4, pp. 737–771, 1976.