Online Fair Scheduling of Spark Workloads with Mesos using Different Fair Allocation Algorithms

03/02/2018 ∙ by Yuquan Shan, et al. ∙ Carleton University, Penn State University

In the following, we present illustrative example and experimental results comparing fair schedulers allocating resources from multiple servers to distributed application frameworks. Resources are allocated so that at least one resource is exhausted in every server. Schedulers considered include DRF (DRFH), Best-Fit DRF (BF-DRF), TSF, and PS-DSF. We also consider server selection under Randomized Round Robin (RRR) and based on the servers' residual (unreserved) resources. In the following, we consider cases with frameworks of equal priority and without server-preference constraints. We first give typical results of an illustrative numerical study and then give typical results of a study involving Spark workloads on Mesos, which we have modified and open-sourced to prototype different schedulers.


1 Introduction

In the following, we present illustrative example and experimental results comparing fair schedulers allocating resources from multiple servers (each with given resource capacities) to distributed application frameworks (each with given resource demands per task). Resources are allocated so that at least one resource is exhausted in every server.

Schedulers considered include DRF (DRFH) and Best-Fit DRF (BF-DRF) [1, 11], TSF [10], and PS-DSF [2]. We also consider server selection under Randomized Round Robin (RRR) and based on the servers' residual (unreserved) resources. In the following, we consider cases with frameworks of equal priority and without server-preference constraints. We first give typical results of an illustrative numerical study and then give typical results of a study involving Spark workloads on Mesos, which we have modified and open-sourced to prototype different schedulers.

2 Illustrative numerical study of fair scheduling by progressive filling

In this section, we consider the following typical example from our numerical study, with two heterogeneous distributed application frameworks having resource demands per unit workload:

(1)

and two heterogeneous servers, each having two different resources, with capacities:

(2)

For DRF and TSF, the servers are chosen in round-robin fashion, where the server order is randomly permuted in each round; DRF under such randomized round-robin (RRR) server selection is the default Mesos scheduler, cf. the next section. One can also formulate PS-DSF under RRR, wherein RRR selects the server and the PS-DSF criterion only selects the framework for that server. Frameworks are chosen by progressive filling with integer-valued tasking, i.e., only whole tasks are scheduled.
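As a concrete illustration, the following is a minimal Python sketch (not the code used for this study) of progressive filling with integer tasks under RRR server selection. It assumes dominant shares are computed against total cluster capacity as in DRFH [1, 11]; the data layout and one-task-per-server-visit stepping are our own simplifications.

# Minimal sketch of DRF-style progressive filling under RRR server selection.
import random

def dominant_share(alloc, capacity):
    """Dominant share of one framework: max over resources of
    (amount allocated to it) / (total cluster capacity of that resource)."""
    return max(a / c for a, c in zip(alloc, capacity))

def drf_rrr(demands, servers, total_capacity):
    # demands[k]    : per-task demand vector of framework k
    # servers[n]    : residual capacity vector of server n (mutated below)
    # total_capacity: component-wise sum of all server capacities
    tasks = [[0] * len(servers) for _ in demands]           # tasks[k][n]
    used = [[0.0] * len(total_capacity) for _ in demands]   # per-framework usage
    progress = True
    while progress:
        progress = False
        order = random.sample(range(len(servers)), len(servers))  # one RRR round
        for n in order:
            # frameworks whose next whole task still fits on server n
            fits = [k for k, d in enumerate(demands)
                    if all(x <= r for x, r in zip(d, servers[n]))]
            if not fits:
                continue
            k = min(fits, key=lambda j: dominant_share(used[j], total_capacity))
            for r, d in enumerate(demands[k]):               # schedule one whole task
                servers[n][r] -= d
                used[k][r] += d
            tasks[k][n] += 1
            progress = True
    return tasks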

Numerical results for scheduled workloads for this illustrative example are given in Tables 1 and 2, and unused resources are given in Tables 3 and 4. 200 trials were performed for DRF, TSF and PS-DSF under RRR server selection, so using Table 2 we can obtain confidence intervals for the averaged quantities given in Table 1 for the schedulers under RRR. For example, the 95% confidence interval for the task allocation of the first framework on the second server under TSF is approximately 4.7 ± 1.96 × 0.46/√200 ≈ 4.7 ± 0.06. Note how PS-DSF's performance under RRR is comparable to when frameworks and servers are jointly selected [2], and with low variance in allocations. We also found that RRR-rPS-DSF performed just as rPS-DSF did over 200 trials.

sched. (1,1) (1,2) (2,1) (2,2) total
DRF [1, 11] 6.55 4.69 4.69 6.55 22.48
TSF [10] 6.5 4.7 4.7 6.5 22.4
RRR-PS-DSF 19.44 1.15 1.07 19.42 41.08
BF-DRF [11] 20 2 0 19 41
PS-DSF [2] 19 0 2 20 41
rPS-DSF 19 2 2 19 42
Table 1: Workload allocations for different schedulers under progressive filling for the illustrative example with parameters (1) and (2); columns are indexed by (framework, server) pairs. Averaged values over 200 trials are reported for the first three schedulers, which operate under RRR server selection.
sched. (1,1) (1,2) (2,1) (2,2)
DRF [1, 11] 2.31 0.46 0.46 2.31
TSF [10] 2.29 0.46 0.46 2.29
RRR-PS-DSF 0.59 0.99 1 0.49
Table 2: Sample standard deviations of the allocations in Table 1 for the schedulers operating under RRR server selection, computed over 200 trials; columns are indexed by (framework, server) pairs.
sched. (1,1) (1,2) (2,1) (2,2)
DRF [11] 62.56 0 0 62.56
TSF [10] 62.8 0 0 62.8
RRR-PS-DSF 1.8 4.6 4.86 1.92
BF-DRF [11] 0 10 1 3
PS-DSF [2] 3 1 10 0
rPS-DSF 3 1 1 3
Table 3: Unused capacities for different schedulers under progressive filling for the illustrative example with parameters (1) and (2). Averaged values over 200 trials are reported for the schedulers operating under RRR server selection.
sched. (1,1) (1,2) (2,1) (2,2)
DRF [1, 11] 11.09 0 0 11.09
TSF [10] 10.99 0 0 10.99
RRR-PS-DSF 0.59 0.99 1 0.49
Table 4: Sample standard deviation of unused capacities for different schedulers under RRR server selection over 200 trials.

We found that task efficiencies improve when using residual forms of a fairness criterion. For example, the residual PS-DSF (rPS-DSF) criterion makes its scheduling decisions by progressive filling using the current residual (unreserved) capacities, i.e., the server capacities net of the current allocations. From Table 1, we see the improvement is modest in the case of PS-DSF.

Improvements are also obtained by best-fit server selection. For example, best-fit DRF (BF-DRF) first selects a framework by DRF and then selects the server whose residual capacity most closely matches that framework's resource demands [11].
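For illustration, a minimal sketch of one BF-DRF step is given below, reusing dominant_share from the earlier sketch. The squared-difference "best fit" distance is an assumption made for illustration, not necessarily the exact matching rule of [11].

# Minimal sketch of one BF-DRF progressive-filling step.
def bf_drf_step(demands, used, servers, total_capacity):
    """Pick the framework with the lowest dominant share, then the
    fitting server whose residual capacity best matches its per-task
    demand; returns (framework index, server index), or None if no
    server can host a whole task of the chosen framework."""
    k = min(range(len(demands)),
            key=lambda j: dominant_share(used[j], total_capacity))
    fitting = [n for n, res in enumerate(servers)
               if all(d <= r for d, r in zip(demands[k], res))]
    if not fitting:
        return None
    n = min(fitting,
            key=lambda m: sum((r - d) ** 2
                              for d, r in zip(demands[k], servers[m])))
    return k, n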

3 Online experiments with Mesos

The execution traces presented in the figures are typical of the multiple trials we ran.

3.1 Introduction including background on Mesos

The Mesos master (including its resource allocator, see [4]) works in a dynamic/online environment with churn in the distributed computing/application frameworks it manages. When all or part of a Mesos agent (a.k.a. server, slave or worker, typically a virtual machine) becomes available, a framework is selected by Mesos and a resource allocation for it is performed. The framework accepts the offered allocation in whole or in part. When a framework's tasks are completed, the Mesos master may be notified that the corresponding resources of the agents are released, and then the master will make new allocation decisions for existing or new frameworks. Newly arrived frameworks with no allocations are given priority. We consider two implementations of fair resource scheduling algorithms in Mesos.

In oblivious allocation (called "coarse grain" in Mesos), the allocator is not aware of the resource demands of the frameworks (indeed, the frameworks themselves may not be aware). A framework running an uncharacterized application may be configured to accept all resources offered to it.

In workload-characterized allocation, each active framework simply informs the Mesos allocator of its resource demands per task. The Mesos allocator selects a framework and allocates a single task's worth of resources from a given agent with unassigned (released) resources.
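As a rough illustration (not actual Mesos allocator code), a single workload-characterized allocation event might look as follows; the framework names and resource keys are illustrative.

# One workload-characterized allocation event.
def allocate_one_task(agent_residual, per_task_demands, criterion):
    """per_task_demands: framework name -> demand dict, e.g.
       {"Pi": {"cpus": 2, "mem": 2048}, "WordCount": {"cpus": 1, "mem": 3584}}.
       criterion(name): fairness score; the lowest-scoring framework is served."""
    fits = [f for f, demand in per_task_demands.items()
            if all(demand[r] <= agent_residual.get(r, 0) for r in demand)]
    if not fits:
        return None                        # this agent cannot host any whole task
    chosen = min(fits, key=criterion)
    for r, amount in per_task_demands[chosen].items():
        agent_residual[r] -= amount        # reserve one task's worth of resources
    return chosen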

In the following, we compare different scheduling algorithms implemented as the Mesos allocator. Given a pool of agents with unused resources, PS-DSF [2], rPS-DSF and best-fit (BF) [11] allocations will depend on particular agents. When a Mesos framework (Spark job) completes, its resources from different agents are released. We have observed that at times the Mesos allocator sequentially schedules agents with available resources (i.e., the agents are released according to some order), while at other times the released agents are scheduled as a pool so that the agent-selection mechanism would be relevant. Initially, the agents are always scheduled by the Mesos allocator as a pool.

In our Mesos implementation, the workflow of these two different allocations is shown in Figure 1.

Figure 1: Flowchart of Coarse-grained/Oblivious and Fine-grained/Workload-Characterized Allocation.

3.2 Running Spark on Mesos

In our experiments, the frameworks will operate under the distributed programming platform Spark in client mode. Each Spark job (Mesos framework) is divided into multiple tasks (threads). Multiple Spark executors will be allocated for a Spark job. The executors can simultaneously run a certain maximum number of tasks depending on how many cores the executor has and how many cores are required per task; when a task completes, the executor informs the driver to request another, i.e., executors pull in work. Each executor is a Mesos task in the default "coarse-grained" mode [7], and an executor resides in a container of a Mesos agent [3]. Plural executors can simultaneously reside on a single Mesos agent. An executor usually terminates as the entire Spark job terminates [6]. When starting a Spark job, one may specify the resources required to start an executor and the maximum number of executors that can be created to execute the tasks of the job. The Spark driver will attempt to use as much of its allocated resources as possible.
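For concreteness, a PySpark job of this kind might be configured roughly as follows in client mode on Mesos; the master URL and numeric settings below are illustrative, not the exact values used in our experiments.

# Hedged sketch of configuring a Spark job for Mesos in client mode.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("wordcount-role-example")
        .setMaster("mesos://mesos-master:5050")   # hypothetical master URL
        .set("spark.mesos.coarse", "true")        # default "coarse-grained" mode
        .set("spark.executor.cores", "1")         # cores required per executor
        .set("spark.executor.memory", "3g")       # memory required per executor
        .set("spark.cores.max", "5"))             # caps total cores, hence executors
sc = SparkContext(conf=conf)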

In a typical configuration, Spark employs three classical parallel-computing techniques: jobs are divided into microtasks (typically based on a fine partition of the dataset on which work is performed); when underbooked, executors pull work (tasks) from the driver; and the driver employs time-outs at program barriers (where all tasks executed in parallel need to complete before the program can proceed) to detect straggler tasks and relaunch them on new executors (speculative execution) [8]. In this way, Spark can reduce (synchronization) delays at barriers while needing to know neither the execution speed of the executors nor the resources required to achieve a particular task execution time. On the other hand, microtasking does incur potentially significant overhead compared to an approach with larger tasks whose resource needs have been better characterized, i.e., as resources per task (what may be called "coarse grain" in the context of Spark).
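The speculative-execution behavior just described is controlled by standard Spark properties set on the job's SparkConf; the values shown below are illustrative.

# Speculative execution knobs (illustrative values).
from pyspark import SparkConf

spec_conf = (SparkConf()
             .set("spark.speculation", "true")            # re-launch suspected stragglers
             .set("spark.speculation.quantile", "0.75")   # fraction of tasks done before checking
             .set("spark.speculation.multiplier", "1.5")) # slowdown factor marking a straggler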

3.3 Experiment Configuration

In our experiments, there are two Spark submission groups ("roles" in Mesos' jargon): group Pi submits jobs that accurately calculate π via Monte Carlo simulation; group WordCount submits word-count jobs for a 700MB+ document. The executors of Pi require 2 CPUs and about 2 GB memory (Pi is CPU bottlenecked), while those of WordCount require 1 CPU and about 3.5 GB memory (WordCount is memory bottlenecked). Each group has five job submission queues, which means there could be ten jobs running on the cluster at the same time. Each queue initially has fifty jobs. Again, each job is divided into tasks, and tasks are run in plural Spark executors (Mesos tasks) running on different Mesos agents.
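A Pi job of the kind this group submits is sketched below, modeled on the standard Spark Monte Carlo pi example; the sample and partition counts are illustrative, and sc is assumed to be a SparkContext configured for this role (2 CPUs and about 2 GB per executor).

# Monte Carlo estimate of pi, microtasked over partitions.
import random
from operator import add

def in_unit_circle(_):
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

samples = 10_000_000                    # total random points (illustrative)
count = sc.parallelize(range(samples), 100).map(in_unit_circle).reduce(add)
print("pi is roughly", 4.0 * count / samples)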

The Mesos agents run on six servers (AWS c3.2xlarge virtual-machine instances), two of each of three types in our cluster. A type-1 server provides 4 CPUs and 14 GB memory, so it would be well utilized by 4 WordCount tasks. A type-2 server provides 8 CPUs and 8 GB memory, so it would be well utilized by 4 Pi tasks. A type-3 server provides 6 CPUs and 11 GB memory, so it would be well utilized by 2 Pi and 2 WordCount tasks. The Mesos master operates on a c3.2xlarge instance with 8 cores and 15 GB memory.
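A quick sanity check of this packing arithmetic, using the approximate per-executor demands and agent capacities stated above, confirms that each server type is exactly filled by the stated mix.

# Sanity check of the packing arithmetic (CPUs, GB memory).
pi, wc = (2, 2), (1, 3.5)                               # Pi and WordCount executors
capacity = {"type-1": (4, 14), "type-2": (8, 8), "type-3": (6, 11)}
packing = {"type-1": (0, 4), "type-2": (4, 0), "type-3": (2, 2)}  # (#Pi, #WordCount)
for t, (n_pi, n_wc) in packing.items():
    used = (n_pi * pi[0] + n_wc * wc[0], n_pi * pi[1] + n_wc * wc[1])
    print(t, used, used == capacity[t])                 # True: type exactly filled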

The experiment setup is illustrated in Figure 2.

Figure 2: Experiment setup.

3.4 Prototype implementation

We modified the allocator module of Mesos (version 1.5.0) to use different scheduling criteria; in particular, criteria depending on the agent/server, so that agents are not necessarily selected in RRR fashion when a pool of them is available. We also modified the driver in Spark to pass on a framework's resource needs per task in workload-characterized mode. Our code is available at [5, 9].

3.5 Experimental Results for Different Schedulers

We ran the same total workload for the four different Mesos allocators all under Randomized Round-Robin (RRR) agent selection: oblivious DRF (Mesos default), oblivious PS-DSF, workload-characterized DRF, and workload-characterized PS-DSF. (In this section, we drop the “RRR” qualifier). A summary of our results is that overall execution time is improved under workload characterization and under allocations that are agent/server specific.

3.5.1 DRF vs. PS-DSF in oblivious mode

The resource allocations under the different fairness criteria are shown in Figure 3. It can be seen that PS-DSF achieves higher resource utilization than DRF because it "packs" tasks better into heterogeneous servers. As a result, the entire job batch under PS-DSF finishes earlier. Also note that at the end of the experiment there is a sudden drop in the allocated memory percentage. This is because the memory-intensive Spark WordCount jobs finish earlier and CPU is the bottleneck resource for the remaining Spark Pi jobs.

Figure 3: Comparison between DRF and PS-DSF in oblivious mode.

3.5.2 Schedulers in workload-characterized mode

The experimental results under workload-characterized mode, shown in Fig. 4, are consistent with their oblivious counterparts: PS-DSF has higher resource utilization than DRF. Also note that the resource utilizations in workload-characterized mode have less variance than those in oblivious mode, which is explained in Sec. 3.5.3.

In Figure 5, we compare TSF [10] under RRR (note that [10] also describes experimental results for a Mesos implementation of TSF), rPS-DSF (under RRR), and BF-DRF (again, "best fit" is an agent-selection mechanism when there is a pool of agents to choose from). From the figure, the execution times of BF-DRF and rPS-DSF are comparable to that of PS-DSF (but cf. Section 3.7) and shorter than that of TSF (which is comparable to DRF).

Figure 4: Comparison between DRF and PS-DSF in workload-characterized mode.
Figure 5: Comparison among TSF, Best-fit DRF and rPS-DSF (in the workload characterized mode).

3.5.3 Oblivious versus Workload Characterized modes

We also compared oblivious and workload-characterized allocation for the same scheduling algorithm. Again, when a Spark job finishes, its executors may not simultaneously release resources from the Mesos allocator's point of view. So under oblivious allocation, it is possible that multiple Spark frameworks share the same server, as is typically the case under workload-characterized scheduling. However, oblivious allocation is a coarse-grained enforcement of progressive filling, where resources are less evenly distributed among the frameworks: some frameworks may receive the entire remaining resources of an agent in a single offer, leaving nothing available for others. From Figures 6 and 7, note how under oblivious allocation the amount of allocated resources drops more sharply when a Spark job ends, and the variance of utilized resources under oblivious allocation is larger than under workload-characterized allocation. Consequently, the entire job batch tends to finish sooner under the workload-characterized allocator, as we see in Figures 6 and 7.

Figure 6: Comparison between oblivious and workload-characterized modes under DRF.
Figure 7: Comparison between oblivious and workload-characterized modes under PS-DSF.

3.6 With Homogeneous Servers

We also did experiments in a cluster with six type-3 servers (6 CPUs, 11 GB memory). In Figure 8 we show that DRF and PS-DSF have nearly identical performance with homogeneous servers.

Figure 8: Workload-characterized DRF and PS-DSF with homogeneous servers.

3.7 BF-DRF versus rPS-DSF

Finally, with a different experimental set-up, we compare BF-DRF (which first selects the framework and then selects the "best fit" from among the available agents/servers) and a representative of a family of server-specific schedulers, rPS-DSF under RRR. Consider a case where there are three servers, one of each of the above server types (types 1-3).

Suppose that, under a current allocation, we have one Spark-Pi and two Spark-WordCount executors on the type-1 server, two Spark-Pi and one Spark-WordCount executor on the type-2 server, and two Spark-Pi and two Spark-WordCount executors on the type-3 server. Then, whenever a Pi or WordCount framework releases its executor's resources back to the cluster, its DRF "score" is reduced, so the scheduler will always send a resource offer to that same client framework in this scenario. On the other hand, rPS-DSF makes its decision considering the amount of (remaining) resources on the server, and so makes a more efficient allocation.

We illustrate this with the example of Figure 9. In this experiment, we let each group submit their Spark jobs through five queues with 20 jobs each. To create the above scenario, instead of exposing all the servers to the client frameworks at once, we register the servers one by one from type-1 to type-3. From the figure, note that both rPS-DSF and BF-DRF have an initially inefficient memory allocation, but rPS-DSF is able to adapt and quickly increase its memory efficiency, while BF-DRF does not.

Figure 9: Performance of Best-fit DRF and rPS-DSF given initial suboptimal allocation.

References