Parallel data processing represents a large and important class of workloads running on public cloud computing platforms such as Amazon Elastic Compute Cloud (EC2), Google Compute Engine (GCE), and Microsoft Azure. Load balancing - dividing work among the nodes of a cluster - plays an important role in determining the performance of these workloads. The clusters underlying such workloads often exhibit heterogeneity in the processing capacities of their constituent nodes. Such heterogeneity may have negative implications for workload performance (such as completion times) and for the cost incurred by the workload owner or “tenant” (due to resource wastage in under-utilized nodes).
Heterogeneity may arise from myriad causes, including (i) an intentional procurement of such nodes by the tenant (mandated by some other requirements) or (ii) time-varying resource interference within the cloud infrastructure. Budget-conscious tenants (such as academic researchers operating on limited budgets, but also fledgling startups operating in a cut-throat market) are especially likely to contend with heterogeneity due to (ii). Such tenants often procure relatively cheap virtual machines (VMs), such as spot, burstable, or small regular instances. These “wimpy” instances are created by providers to monetize their dynamic spare capacity and are managed using aggressive statistical multiplexing/overbooking of physical resources. As a result, such instances are known to exhibit significant dynamism and unpredictability in their effective capacities [63, 62].
The Problem: These heterogeneous settings exacerbate the well-known “straggler problem”, wherein a small subset of slow tasks stall an entire parallel computation by causing a synchronization delay at a program barrier (i.e., all parallel tasks need to complete before the program can proceed, so they end up waiting for the slow task(s)). While we offer a broad survey of salient solutions/ideas in Section 8, our specific interest in this paper lies in the following: what are the relative pros and cons of approaches based on fine-grained vs. coarse-grained partitioning in overcoming heterogeneity-induced slowdowns?
As a prominent representative of the former approach, consider , which advocates that parallel jobs be divided into relatively small tasks (“microtasks”) via fine homogeneous partitioning of the input dataset being processed. (Specifically,  suggests microtasks take on the order of milliseconds to execute on contemporary systems.) Microtasking can lead to good load balancing when combined with “pull-based” operation: when underbooked or idle, nodes pull work (tasks) from a pending queue, so faster workers simply pull in more work. Synchronization delays are thereby reduced without needing knowledge of either the speed of the nodes or the resources required to achieve particular task execution times. Because of this property, we refer to such approaches as oblivious load balancing. However, there also exist studies, e.g., , that challenge the microtasking idea, pointing out that the relatively large overhead of microtasking can, in some cases, significantly slow down computation.
Research Approach and Contributions: To explore the microtasking vs. macrotasking trade-off, we leverage initial modifications we have made to Apache Spark , a popular parallel data processing application framework. Our modifications have been designed with the goal of enabling a more intelligent, cost-conscious tenant to use cloud resources more efficiently. Being cost-conscious, such a tenant may have selected instances that just meet its requirements (rather than more expensive instances whose capacity is likely to go idle occasionally). This may involve custom VMs/containers as in Resources as a Service (RaaS) , less expensive spot/revocable instances, burstable/bursting instances (with only intermittently available resources), or “wimpy” regular instances that are allocated, e.g., only a fraction of a core. Furthermore, such a tenant may have characterized its workload’s resource needs to achieve certain performance goals (“demand-side” characterization). Finally, such a tenant may also employ suitable prediction and monitoring techniques - we present some examples in Sec. 5 and 6 - that let it estimate the (relative) effective capacities of its cluster’s nodes in the near future (“supply-side” characterization). Using homogeneous microtasking as a reference, we investigate how/when the straggler problem can be mitigated through heterogeneous partitioning using such demand- and supply-side characterization.
The contribution of this paper is fourfold.
Although microtasking can provide certain qualities-of-service and efficiencies without detailed information about the cluster or workload, its usefulness may be hindered by its overhead. We interpret part of this overhead via a simple analytical model.
We consider variants of heterogeneous macrotasking (HeMT) corresponding to different degrees of accuracy/certainty in supply/demand characterization ranging from an oblivious, incrementally-adjusted HeMT to a more sophisticated version where offline/online knowledge of node capacities is also leveraged.
Using a variety of experiments with our Spark/Mesos prototype on Amazon EC2, with different workloads and nodes (ranging from regular to burstable EC2 instances), we show the efficacy of our ideas. Our final set of experiments employs two important multi-stage workloads, namely PageRank and K-Means.
We identify and suggest several interesting directions for future work, especially related to adaptive scheduling across the application frameworks and cluster manager layers, including improved information exchange between them via enhanced APIs.
Outline: The rest of this paper is organized as follows. We overview the Spark application framework (and the Mesos cluster manager) in Sec. 2. In Sec. 3, we discuss (homogeneous) microtasking in detail. Heterogeneous macrotasking is introduced in Sec. 4. In Sec. 5, we study a simple “oblivious” approach to adapting the sizes of heterogeneous macrotasks online based on synchronization delays (variations in task execution times) at program barriers. Heterogeneous macrotasking for workers based on statically provisioned containers or on burstable instances is studied in Sec. 6. Multi-stage workloads are considered in Sec. 7. Other related work is discussed in Sec. 8. After a discussion of future work in Sec. 9, we conclude in Sec. 10.
2 Overview of the Application Framework Spark and the Cluster Manager Mesos
The architecture of a typical cloud tenant is shown in Fig. 1. The tenant’s users first register their jobs (application frameworks) with the cluster manager. The cluster manager then allocates resources from the managed machines (virtual or physical) to the users’ jobs so that they can run in a distributed fashion. In this paper, the example cluster manager is based on Apache Mesos and the distributed application frameworks are based on Apache Spark.
Apache Mesos  is a widely used cluster manager. To start a Mesos cluster, a Mesos master process must be started, and Mesos agent processes, responsible for reporting available resources to the master, need to run on each resource-providing machine (often referred to as a “computing node” or simply “node” in the following) and register with the master. The Mesos master informs the registered frameworks about resource availability through resource offers. Upon receiving such an offer, a framework decides how many of the offered resources it will use and informs the Mesos master.
Apache Spark  is a distributed computation framework, which we run over Mesos. The Spark driver, running either on a cluster machine or on the user’s own machine, acts as a centralized job orchestrator. It divides a job into multiple computation stages; mutually dependent stages are separated by data shuffling, and a stage can start only when all stages it depends on have completed. Each stage may contain multiple parallel tasks. To run those tasks, the Spark driver first acquires resources from its cluster (via Mesos) to launch executors (task runners) on the distributed nodes. Absent task placement constraints such as locality preferences or blacklists, the Spark driver sends pending tasks to idle executors.
Spark’s parallelism, i.e., how many parallel tasks run in a single stage, by default depends on the number of computing slots (e.g., the number of CPU cores across the spawned executors) or on the distribution of the input data (e.g., the number of blocks in the input file). Users can configure their preferred parallelism if they want more tasks, each processing less data, or fewer tasks, each processing more. Spark tasks are sized equally or according to data locality, without considering the processing speeds of executors.
3 Homogeneous Microtasking (HomT)
Homogeneous microtasking, i.e., dividing a job equally into tiny tasks that greatly outnumber the computing slots, can balance the workload well without accurate knowledge of the computation speeds of the nodes in the cluster [45, 21]. This is supported by the upper bound on resource idling time stated in the following claim.
Claim: For pull-based task assignment (i.e., an executor pulls one more task when it finishes its assigned task and tasks remain pending), suppose all the tasks of a stage are pending at some initial time 0, the workload is evenly partitioned into tasks, and the speed of each node is constant. Then the resource idling time (the latest node finish time minus the earliest node finish time) is upper-bounded by the single-task duration of the slowest node.
Proof: Let τ_i denote the single-task duration on node i and f_i the finish time of node i (the time when node i finishes its last task). Assume node 1 is the first node to become idle. Note that at time f_1 there is no pending task (otherwise node 1 would pull another pending task to run), so all remaining tasks are already running on the other nodes. Hence each such task started at some time s_i ≤ f_1 and finishes at s_i + τ_i ≤ f_1 + max_j τ_j. Therefore, max_i f_i − min_i f_i ≤ max_j τ_j. ∎
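The claim can be checked numerically. The following sketch (our own illustration, not part of the paper's evaluation) simulates pull-based assignment of equal tasks to constant-speed nodes and compares the finish-time spread against the claimed bound:

```python
import heapq
import random

def idle_time(num_tasks, speeds, work=1.0):
    """Run num_tasks equal tasks pull-based on constant-speed nodes.
    Returns (spread, bound): the latest minus earliest node finish
    time, and the slowest node's single-task duration (the claimed
    upper bound on the spread)."""
    size = work / num_tasks
    free = [(0.0, v) for v in speeds]  # (time node becomes idle, speed)
    heapq.heapify(free)
    for _ in range(num_tasks):
        t, v = heapq.heappop(free)      # earliest-idle node pulls next task
        heapq.heappush(free, (t + size / v, v))
    finish = [t for t, _ in free]
    return max(finish) - min(finish), size / min(speeds)

random.seed(0)
for _ in range(200):  # randomized check of the claim
    speeds = [random.uniform(0.1, 1.0) for _ in range(4)]
    spread, bound = idle_time(random.randint(4, 200), speeds)
    assert spread <= bound + 1e-9
```

Finer partitioning shrinks the slowest node's single-task duration, and hence the idle-time bound, which is exactly the appeal of microtasking.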
However, the advantages of homogeneous microtasking are limited under several scenarios.
First, the scheduling overhead grows with the number of tasks. This makes the microtasking approach less practical in some distributed computing frameworks, such as Apache Hadoop, whose tasks can take up to several seconds to launch .
Second, as discussed in previous work [73, 45, 44], one of the major concerns with dividing a job into tiny tasks is disk I/O inefficiency. When a large task is divided into many tiny ones, one originally sequential disk read becomes many small random disk reads, leading to relatively lower I/O throughput. Also, if a task’s input is small, then depending on the read buffer (normally ranging from kBs to MBs), the task may finish after only a couple of I/O requests, so the advantage of pipelined read-and-process may no longer exist. The resulting increases in stage completion times are shown in Figs. 9 and 13-15 in Secs. 6.1 and 6.2, respectively.
Another problem with microtasking that we found in our experiments is related to the distributed file system, e.g., the Hadoop Distributed File System (HDFS): when a job is bottlenecked by network I/O and multiple tasks simultaneously access the same HDFS data block, they are more likely to read from the same datanode, which may lead to inefficient overall CPU and network bandwidth usage in the HDFS cluster. Moreover, Spark schedules tasks sequentially, so consecutive tasks become more likely to access the same block as large tasks are divided into ever smaller ones.
HDFS is designed to store very large files in a distributed fashion. It follows a master/slave architecture. Namenode, the master, manages the file system metadata and coordinates the data placement. Datanodes serve as slaves that perform the actual data reads and writes. To operate on HDFS, a client first contacts the namenode, which will redirect the client to the correct datanodes for write and read. The HDFS architecture is illustrated in Fig. 2.
Each file is usually divided into a sequence of blocks, and each block is replicated for fault tolerance. HDFS does not allow a single datanode to store multiple replicas of the same block . To simplify our analysis, we make two assumptions. First, rack-awareness  in block placement is turned off. (In fact, Hadoop rack-awareness has less randomness and thus intensifies uplink competition, since data blocks are spread less broadly.) Second, a simple placement policy is assumed, whereby the namenode randomly chooses one among the equally distant datanodes to place each data block and its replicas. So when a remote user uploads a data block, HDFS randomly chooses one datanode for each replica of that block, each chosen datanode storing exactly one replica. Upon a read request, HDFS tries to choose the replica closest to the reading client (on the same datanode or on the same rack). If there are multiple candidate replicas, the choice is uniformly random among them. A typical HDFS replica distribution is shown in Fig. 3.
In an HDFS cluster, let n denote the number of datanodes and r the replication factor, with n > r (the usual case). If two tasks access the same HDFS block, then the probability that they read from the same datanode, competing for its uplink bandwidth (they may also compete for the disk bandwidth on that datanode, but since disk bandwidth is usually larger than network bandwidth, it is not the concern here), is

p_same = r · (1/r)² = 1/r,

since each task independently picks one of the block's r replica-holding datanodes uniformly at random. If two tasks access different HDFS blocks, the probability that they read from the same datanode is

p_diff = Σ_k P(K = k) · k/r² = E[K]/r²,

where K is the number of datanodes that store replicas of both blocks. Assuming HDFS chooses the r datanodes holding a block's replicas uniformly at random when the data is uploaded (this appears to be the case, though it remains to be verified against the HDFS implementation), K is hypergeometric with E[K] = r²/n, so that

p_diff = 1/n ≤ 1/r = p_same,

with equality when n = r.
The plots of these two probabilities in Fig. 4 numerically support the above conclusion: two tasks reading the same block are more likely to compete for the uplink bandwidth of the same datanode.
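A quick Monte Carlo sketch (our own illustration; it assumes the uniform replica-placement and uniform replica-choice model above) reproduces these probabilities:

```python
import random

def same_datanode_prob(n, r, same_block, trials=50_000):
    """Monte Carlo estimate of the probability that two readers hit
    the same datanode: each block's r replicas land on r datanodes
    drawn uniformly from n, and each reader picks a replica-holding
    datanode uniformly at random."""
    nodes = range(n)
    hits = 0
    for _ in range(trials):
        block_a = random.sample(nodes, r)
        block_b = block_a if same_block else random.sample(nodes, r)
        hits += random.choice(block_a) == random.choice(block_b)
    return hits / trials
```

With n = 20 datanodes and r = 3 replicas, the same-block estimate comes out near 1/r ≈ 0.33 and the different-block estimate near 1/n = 0.05.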
In another experiment supporting the above analysis, we used a small HDFS cluster with  datanodes, replication factor , and datanode uplink bandwidth limited to  Mbps, so that the Spark tasks are always bottlenecked by network I/O. The datanodes are equally distant from Spark’s computing nodes, so the selection of a datanode for one block is random. The results are shown in Fig. 5: the stage completion time increases with the number of tasks/partitions.
4 Heterogeneous Macrotasking (HeMT) - Background
To avoid HomT overhead, the number of tasks can be set equal to the number of available “computation slots” (executors). However, in case of heterogeneous executors, synchronization delay may ensue if such “macrotasks” are equally sized. This motivates heterogeneous macrotasking (HeMT).
HeMT requires a reasonably accurate estimate of the workload (reflected in task execution times), which can easily be obtained for many modern jobs owing to their repetitive nature, e.g., many production workloads [24, 29, 56] and many machine-learning jobs, such as EM and K-Means, that consist of multiple iterations of the same computational complexity. Much recent work on task scheduling, e.g., , is based on such an assumption.
We implemented this HeMT partitioning algorithm in Spark and compare it in the following with Spark’s default partitioning scheme, as well as with the aforementioned HomT. Spark’s default partitioning does not consider any resource heterogeneity of the cluster - it divides the input data regardless of the speed of the computing nodes - and Spark tends to divide in-memory data evenly into as many partitions as there are computing slots (usually processing cores). For files located on disk, e.g., HDFS files, baseline Spark, like Hadoop, assigns one file block to each task. Spark naturally supports HomT: users can specify a desired number of partitions, and Spark will divide the data evenly according to this number.
The aim of the experimental studies described in the following is to illustrate the benefits and challenges of HeMT. We implemented HeMT in Spark  using information from middleware (here, the Apache Mesos cluster manager ) or directly from monitoring services (e.g., AWS CloudWatch). For scalability, the application frameworks perform most elements of (workload-specific) HeMT learning, while the middleware scheduler may only perform more sophisticated workload scheduling (consolidation). (Analogies can be made with “end-to-end” approaches such as exokernels or TCP congestion control in the Internet.) The information exchange in our Spark-Mesos prototype is summarized in Fig. 6.
5 Oblivious Adapted HeMT (OA-HeMT)
In some environments, e.g., those without resource isolation and hence with significant interprocess interference, determining the true workload processing power of available computational nodes may be challenging. A simple “oblivious” approach is therefore needed that allows application frameworks and cluster managers to dynamically estimate the processing speed of available computational nodes from previous runs of the same job, so that future tasking can be well balanced.
A Spark-Mesos prototype was implemented to enable such oblivious, ad-hoc, adaptive HeMT. The Mesos cluster manager obtains estimated executor processing speeds and passes them to the Spark application framework through additional fields in their RPC messages. Based on this information and the associated task sizes, Spark estimates the execution speeds of the different available executors and thereby determines how to partition future work into well-balanced tasks.
5.1 A general approach
Consider a sequence of datasets, the j-th of size N_j, that need to be processed in the same way, i.e., the same job is applied to each dataset. Each dataset is divided (by the application framework) into a number of tasks, one for each executor assigned to process the dataset (by the cluster manager).
For each executor i, let v_i be the most recent estimate of its “speed” for the job under consideration. Let U be the set of executors that have not previously been assigned to this job. For all i ∈ U, let v_i = v̄, where v̄ is the average of the v_k over executors k ∉ U (example other choices could be the minimum or maximum rather than the average, or the average speed over all executors that have been applied to this job in the past). Let

n_i = N_j v_i / Σ_{k=1}^{E} v_k,

where E is the number of executors assigned to the job. Executor i is assigned a portion of the dataset of size n_i. That is, a faster executor (larger v_i) is assigned a larger portion (larger n_i).

Let x_i be the execution time of executor i on its assigned task of size n_i. After the job completes, the speed of every executor i can be updated according to the simple first-order autoregressive estimator

v_i ← α v_i + (1 − α) n_i/x_i,

where the forgetting factor α satisfies 0 ≤ α < 1.

For the initial (j = 1) job, N_1 is evenly divided among the executors, and subsequently each v_i is initialized to n_i/x_i.
The straightforward tradeoff in the choice of the forgetting factor α is that a smaller α makes the speed estimate more responsive to the latest speed datapoint n_i/x_i. But it is entirely possible that different datasets of the same size will require different execution times for the same job type under consideration. Over time, such variations are “averaged out” in the executor speed estimates; i.e., each executor will experience the same task-difficulty distribution “per unit” input data (unless there is some bias whereby some executors tend to receive more difficult tasks per unit input data for a given job). This motivates a forgetting factor that is not close to zero.
Note that each application framework (different job types) will need to maintain its own estimates of (workload specific) executor speeds.
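The bookkeeping above can be sketched as follows (a minimal illustration; the function and variable names are ours, and `beta` plays the role of the forgetting factor):

```python
def partition(total_size, speeds):
    """Split a dataset of total_size in proportion to per-executor
    speed estimates (dict: executor -> estimated speed)."""
    s = sum(speeds.values())
    return {e: total_size * v / s for e, v in speeds.items()}

def update_speed(speeds, executor, task_size, exec_time, beta=0.5):
    """First-order autoregressive update with forgetting factor beta
    in [0, 1): new estimate = beta * old + (1 - beta) * observed,
    where the observed speed is task_size / exec_time."""
    speeds[executor] = (beta * speeds[executor]
                        + (1 - beta) * task_size / exec_time)
```

With beta = 0 the estimate tracks only the latest observation, matching the "zero forgetting factor" setting used in the experiment of Sec. 5.2.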
5.2 An experimental result
To see the effect of such adaptive workload partitioning, we performed an experiment with a two-node cluster where each node provides one CPU core. No resource isolation was used, so Spark executors could share CPU cycles with other processes. A sequence of fifty Spark WordCount jobs was presented through a submission queue. We introduced interfering processes  on one node at two different points in time during the experiment, thus reducing the processing speed of the Spark executor on that node. How the Spark jobs were adaptively partitioned to re-balance their workloads is shown in Fig. 7.
One can see how overall job execution times (determined by the slowest task) increased dramatically but then rapidly fell as the task sizes were adapted with zero forgetting factor (here, for a given executor, execution time variation per unit document size - measured in MBytes - was low).
We performed another experiment involving two hosts statically provisioned with one and  cores, respectively (cf. Sec. 6.1), i.e., heterogeneous executors by initial provisioning. The results are shown in Fig. 8. Spark learns the optimal way of partitioning the workload after two trials, so the map-stage execution time is reduced to around  seconds, in agreement with the results shown in Fig. 9, where a near-optimal data partitioning is simply derived a priori using resource allocation information provided by Mesos.
6 Heterogeneous MacroTasking (HeMT) with Provisioned Instance Types
In this section, we show how information regarding executors (e.g., known resource allocations to executors, or information from service-level agreements) or some runtime “state” of an executor (e.g., token-bucket state) can be used to determine initial/baseline heterogeneous task sizes. As above, task sizes can be further adapted online based on, e.g., execution-time information received from either the cluster manager (e.g., Mesos) or monitoring services (e.g., AWS CloudWatch).
6.1 HeMT for Statically Provisioned Containers
We now compare HeMT with HomT through experiments with heterogeneous executors, each assigned a different fraction of a core. Our implementation supports flexible CPU usage limits via containers. Baseline Spark does not support partial CPU usage, so we modified the Spark driver to: accept a Mesos offer with a partial CPU core; launch an executor using the resources in the offer; and record the actual resources available to the executor so that the driver can use this information to rebalance the workload. Additionally, if a Spark executor is spawned in a container with a partial CPU core, we let the executor believe it has one full core so that it still communicates with the driver to ask for tasks.
To evaluate the performance of HeMT with containerized, statically allocated resources, we ran a set of Spark experiments on Mesos. The network bandwidth is large enough ( Mbps) that CPU is the only bottleneck. In these experiments, we submitted Spark WordCount jobs with different tasking configurations to a Mesos cluster. WordCount is a simple two-stage Spark job in which most computation is done in the first (map) stage, so we can determine the effect of load balancing by observing the execution times of that stage. Our jobs processed 2 GB of data residing on a remote HDFS cluster. For each Spark job, we assigned two executors to run the associated tasks: one with one full core, the other with a partial core. (Spark by default creates one map task for each HDFS block, so to make our experiments start with two tasks, we increased the HDFS block size from 128 MB to 1 GB; we use the same HDFS configuration in Sec. 6.2. For experiments with the usual HDFS configuration, see Sec. 7.) We used Mesos-supported CFS (completely fair scheduler) bandwidth control  to limit the CPU usage of the containers.
An example experimental result is shown in Fig. 9. As the red bars show, HeMT performs well because Spark is informed of the CPU allocations and our modified Spark balances its workload accordingly. The U-shaped homogeneous tasking curve is similar to the results of : when tasks are too large (too few in number) there are synchronization delays, and when tasks are too small (too many in number) there is microtasking overhead. Though  gives a rule of thumb, an optimal homogeneous task size still needs to be learned.
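This U-shape can be reproduced with a toy pull-based simulation (our own illustration; the speeds 1.0 and 0.4 echo the full-core/partial-core setup, while the per-task overhead of 0.2 is an arbitrary assumption):

```python
import heapq

def stage_time(m, W=100.0, speeds=(1.0, 0.4), c=0.2):
    """Completion time of a stage of m equal tasks (total work W)
    under pull-based assignment: the node that becomes idle first
    pulls the next pending task; each task pays a launch overhead c."""
    free = [(0.0, v) for v in speeds]  # (time node becomes idle, speed)
    heapq.heapify(free)
    for _ in range(m):
        t, v = heapq.heappop(free)
        heapq.heappush(free, (t + c + (W / m) / v, v))
    return max(t for t, _ in free)

# Coarse tasking stalls behind the slow node; very fine tasking pays
# mostly overhead; an intermediate task count minimizes stage time.
coarse, medium, fine = stage_time(2), stage_time(16), stage_time(256)
```

With these parameters, 2-way partitioning suffers a long synchronization delay, 256-way partitioning is dominated by launch overhead, and 16-way partitioning sits near the bottom of the U.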
6.2 HeMT for Burstable Instances
HeMT can also be adapted to support the lower-cost (compared to on-demand instances allocated the maximum resources available to a burstable), general-purpose Amazon Web Services (AWS) burstable instance types (T2). Access to the CPU and network I/O  resources of these instances is governed by a token-bucket mechanism. CPU credits are earned and spent at millisecond resolution, and credits accumulated while CPU(s) are idle can be used for future CPU bursts. One CPU credit allows one CPU to run at 100% utilization for one minute, and it can be divided down to millisecond timescales for short CPU activity bursts. Different types of T2 burstable instances have different CPU-credit earning rates, in line with their baseline performance (the performance of an instance with zero credits), and different CPU caps (peaks) at which CPU credits stop accumulating. The current CPU credit balance, indicated through the AWS CloudWatch API, provides a basis for HeMT workload skewing. (CloudWatch metrics on the free tier update every 5 minutes; if detailed monitoring is enabled at extra cost, the update frequency reaches a maximum of only once per minute . CloudWatch may therefore not be helpful for resource-status reporting in short-lived clusters or those with high dynamism.)
Given the current amount of CPU credits and the baseline CPU performance, and supposing we are able to estimate the computational workload of our job, the amount of work that a node can process within a given time T is easily evaluated. Suppose a t2.small instance initially has some CPU credits. If its vCPU is continually busy, those credits will be used up after some time t0, after which its CPU performance drops to the baseline rate b. The workload it can process in T ≥ t0 minutes is then

W(T) = t0 + b (T − t0),

reflecting the size of the shaded area in Fig. 10.
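As a sketch of this piecewise-linear model (our own illustration; for simplicity it takes the depletion time t0 equal to the initial credit balance, ignoring credits earned while running):

```python
def work(T, credits, baseline):
    """CPU-minutes of work a burstable node completes in T minutes:
    full speed while credits last (depletion time taken as the
    initial credit balance, ignoring credits earned meanwhile),
    then the baseline fraction of a CPU thereafter."""
    burst = min(T, credits)
    return burst + (T - burst) * baseline
```

For example, a node with 4 credits and a baseline of 0.4 completes 4 + 0.4 · 6 = 6.4 CPU-minutes of work in 10 minutes.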
To divide a workload W among multiple servers with different amounts of initial CPU credits so that they finish at the same time, we first transform, for each server, the time-credits plot of Fig. 10 into the time-workload plot shown in Fig. 11.
Superposing these time-workload graphs into a single piecewise-linear function W_total(T) = Σ_i W_i(T), we can find T* such that W_total(T*) = W. We then divide the workload among the nodes in proportion to the weights W_i(T*), where W_i is the time-workload function of node i.
For example, suppose we have three computation nodes with 4, 8, and 12 initial CPU credits, respectively, and the current data-processing job requires a CPU running at full performance for 20 minutes. We first superpose the time-workload graphs of the three nodes into W_total(T), then find T* such that W_total(T*) = W, as shown in Fig. 12. Finally, we divide the entire workload among the three nodes according to their weights W_i(T*).
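The superposition-and-bisection procedure can be sketched as follows (our own self-contained illustration; it assumes the same simplified depletion model as above, with per-node pairs of initial credits and baseline rate, and the baseline of 0.2 used in the example is an assumption, not a figure from the paper):

```python
def split_workload(total_work, nodes):
    """nodes: list of (initial_credits, baseline_rate) pairs.
    Returns (T, shares): the common finish time T found by bisection
    on the superposed piecewise-linear work function, and per-node
    workload shares proportional to each node's work by time T."""
    def w(T, credits, baseline):  # simplified per-node work model
        burst = min(T, credits)
        return burst + (T - burst) * baseline

    def w_total(T):
        return sum(w(T, c, b) for c, b in nodes)

    lo, hi = 0.0, 1.0
    while w_total(hi) < total_work:  # bracket the finish time
        hi *= 2
    for _ in range(60):              # bisect to high precision
        m = (lo + hi) / 2
        lo, hi = (m, hi) if w_total(m) < total_work else (lo, m)
    shares = [w(hi, c, b) for c, b in nodes]
    scale = total_work / sum(shares)
    return hi, [s * scale for s in shares]
```

For three nodes with credits (4, 8, 12), baseline 0.2, and W = 20 CPU-minutes, the common finish time comes out near 7.64 minutes and the shares are roughly (4.7, 7.6, 7.6).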
As the CPU credit-monitoring API provided by AWS is not very responsive, it may not be useful to apply HeMT to short tasks. So we created the following experimental scenario. There are only two types of nodes: one with sufficient CPU credits that will not be depleted throughout the entire lifespan of a job, and the other with zero CPU credits (we can only ensure zero CPU credits on a particular node when starting the job, subject to the AWS monitor’s update latency); sysbench  was used to deplete CPU credits. That is, to create heterogeneity, we depleted one node’s CPU credits so that its CPU works at baseline performance (40% of a CPU for an AWS t2.medium instance). We let our map tasks fetch  GB of input data from HDFS. Again, to make our experiment start with two tasks, we set the HDFS block size to  GB, so each task would process  GB of data (one HDFS data block) in Spark’s default setting.
Figs. 13-15 show the completion time of the map stage under different configurations. This set of experiments was done on a small Spark cluster with two executors, each with one core (on two separate AWS burstable instances). The tasks read input from a remote HDFS cluster consisting of four datanodes, each an AWS t2.small instance. In these experiments, the network bandwidth is large enough ( Mbps) that CPU is the only bottleneck.
We also implemented the HeMT approach, i.e., keeping the number of tasks unchanged but skewing their input data sizes according to the speed of the node to which they are assigned. The yellow bars of Figs. 13-15 show the confidence interval of a naive implementation of workload skewing (HeMT), where we partition the data strictly according to CPU peak and baseline performance (1:0.4 for one core on an AWS t2.medium). That is, the task running on the faster node gets 1/1.4 of the total data, while the other gets the rest. However, we found that the node with zero CPU credits runs even slower than 40% of peak speed. We suspect this is because the task running at baseline performance likely faces a higher degree of CPU cache and TLB contention than the other task (e.g., it is possible that the first task shares a physical CPU with one or more other workloads while the second task has an entire CPU to itself); our workload had a significant fraction of memory instructions that are delayed by such cache/TLB contention. By employing short/trial probing tasks, we found that partitioning the data 1:0.32 further improves load balancing, i.e., this fudge factor is learned from runtime observations (recall Sec. 5). This is shown by the red bar in Fig. 13: the performance of HeMT with this fudge factor improves on the best HomT configuration (8-way) we tried.
We observed a similar result (with larger variance, however) when we reduced the available bandwidth to  Mbps using a network traffic shaper, wondershaper, as shown in Fig. 14. Since CPU remains the bottleneck, decreasing network bandwidth does not significantly affect processing speed, as expected.
We found that the behavior of the microtasking and macrotasking approaches differed when the available uplink bandwidth of an HDFS datanode was reduced to  Mbps; see Fig. 15. In this case, for the node with sufficient CPU credits, network I/O becomes the bottleneck, while CPU remains the bottleneck for the task running on the node with zero CPU credits. Note that 8-way partitioning is no longer one of the best HomT/microtasking configurations because this relatively coarse-grained partitioning (with which a task running on the credit-abundant node runs for about  seconds, while a task running on the other node runs for about  seconds) fails to balance the workload well in this case. It can also be observed that HeMT, including the naive CPU-credit-based partitioning (even though it is not well-motivated in this scenario, since the node with sufficient CPU credits is now bottlenecked by the network), starts to significantly outperform HomT, because the latter is more likely to incur datanode uplink contention, as explained in Sec. 3.
7 HeMT - repartitioning on multiple computation stages
Our heterogeneous macrotasking can certainly be applied to more realistic workloads. A typical MapReduce workload consists of one or more jobs, each with multiple basic computation stages of the kind presented in the previous sections, concatenated through data shuffling. For the first computation stage, we can simply divide the initial input data according to the computation capacities of the executors.
A partitioner defines how a task assigns its intermediate results to different “buckets”, each of which is fetched by a different task of the following stage. For the following stages, task data are fetched from the intermediate outputs of the tasks of the previous stages: tasks of the previous stage first shuffle the processed records into different buckets (each corresponding to one fetching task of a future stage) according to a partitioner function, and those buckets are then written to storage media for the associated future tasks to fetch. The default hash partitioner shuffles records into buckets in a statistically even fashion. So we need to define a new partitioner that can skew the shuffle buckets for HeMT. We show one implementation of skewing using hash codes in Algorithm 1. (Certainly, more sophisticated partitioning algorithms can be designed given more information about the key distribution and the processing complexity of each record.)
Fig. 16 compares the effective data flows under the default hash partitioner and our skewed hash partitioner. Related ideas on balancing workload through the partitioner can be found in [21, 61].
We present the performance of HeMT using two typical workloads - K-Means and PageRank. These two have different and representative computation patterns: K-Means consists of a sequence of simple two-stage Spark jobs, whereas PageRank is a single Spark job containing multiple computation stages concatenated together through shuffling.
Again, we run K-Means on the cluster with two executors hosted on two containers, one allocated one CPU core and the other a larger number of cores. To make results more consistent, instead of setting a convergence criterion to stop the iterations, we fix the number of iterations. The input source is a 256 MB data file on HDFS with block size 128 MB (so there are two blocks). The end-to-end job completion times of HeMT and HomT are shown in Fig. 17 and are consistent with the single-stage results shown in the previous sections.
On the same cluster, we run PageRank with an MB-scale input for a fixed number of iterations. The results are shown in Fig. 18. Note that PageRank, compared with K-Means, is more sensitive to microtasking, because each iteration of PageRank is relatively short (around 10 s under the default 2-way partitioning), and therefore each task is shorter as well. For example, with 64-way partitioning, each task generally lasts for only a few seconds. Consequently, the relative task scheduling overhead is larger for the PageRank workload.
8 Related Work
We discuss related work, mainly in the context of recent developments in cloud computing, in this section. Pointers to older work may be found in the references within the papers we discuss here.
Parallel scheduling advances including straggler mitigation: Computational skew (heterogeneous tasking) has been investigated in several studies aiming to mitigate the straggler problem. One such study focuses on the map stage of a MapReduce job. It breaks the one-to-one mapping between a mapper task and data segments in Hadoop: by allowing mapper tasks to continue fetching additional data segments, more data ends up being fetched by the faster mappers. In this way, the workload is automatically balanced, especially when it is finely partitioned. Unlike the classic MapReduce implementations in Hadoop and Spark, where mappers run independently, this approach requires mappers to synchronize on reads during runtime, which could affect its scalability.
[32, 33] also target computational skew. They consider a homogeneous cluster where computational skew is the cause of the straggler problem: even data segments of the same size may require different processing times from the same executor. With good estimates of the processing complexity of partitions, the straggling partitions are split in either a static or a dynamic fashion.
Another study showed that enlarging task size by merging multiple co-located blocks into one task can save task scheduling and initialization times and thereby accelerate the execution of Hadoop. Other studies and practices, such as [51, 45, 6, 5], take an opportunistic approach by letting the driver employ time-outs at program barriers to detect straggler tasks and relaunch them on new executors (speculative execution).
Other studies rely on detailed monitoring and performance prediction to mitigate the straggler problem. For example, one approach predicts straggler nodes by applying a support vector machine to features involving resource utilization, thread states, and memory statistics, and then conservatively prevents tasks from being assigned to nodes that are predicted to be stragglers.
Much prior work has been conducted on the representation and prediction of resource capacity demand (workload characterization) and supply (executor characterization). Feasible scheduling decisions, by both application frameworks and the tenant’s middleware cluster managers, have also been proposed for more efficient use of the tenant’s available resources, as discussed below. Online learning approaches have likewise been proposed to this end. Orchestration of budget-conscious tenants using low-cost instances has also been an active area of study, particularly through the use of preemptible spot and/or burstable instances. Finally, there are several open-source application frameworks and cluster managers that can be modified to experiment with more advanced learning and scheduling mechanisms.
Representation and prediction of resource capacity demand and supply: Workload prediction for individual applications is an area of extensive research in cloud-like environments, e.g., [25, 26, 28], as is the related problem of translating workload-specific predictions (e.g., request arrivals) into resource allocations (e.g., servers) to meet desired performance goals, e.g., [68, 58, 19, 10, 34, 9].
For example, suppose a sequence of roughly equal-sized data batches needs to be processed. For each data batch, suppose there is a certain number of tasks, each of one of several different types (and of a specific size), that have to be executed in some order (including some in parallel). At any given point in time, suppose the “state” of the application is given by (i) the type of task running in each of its executors and its execution time so far, and (ii) any queued tasks. A reinforcement learning approach could train a neural network to map the system state to scheduling decisions (which task to assign to the next available executor) in order to minimize the overall execution time for the data batch (in particular, to minimize synchronization delays at program barriers/joins). For several initial data batches, random scheduling decisions could be made, and their consequences in terms of execution time observed and used for training. Here, the resource requirements of each task needed to meet certain performance criteria are not considered.
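As a point of reference for what such a learned policy must improve upon, a simple hand-coded heuristic for the same objective greedily places each task where it would finish earliest; the task sizes and executor speeds below are illustrative:

```python
def greedy_assign(task_sizes, exec_speeds):
    """Place each task on the executor where it would complete
    earliest, tracking per-executor finish times; the makespan is the
    synchronization delay at the program barrier."""
    finish = [0.0] * len(exec_speeds)
    placement = []
    for size in sorted(task_sizes, reverse=True):  # longest task first
        # Candidate finish time on each executor if given this task.
        cand = [f + size / s for f, s in zip(finish, exec_speeds)]
        j = min(range(len(cand)), key=cand.__getitem__)
        finish[j] = cand[j]
        placement.append(j)
    return placement, max(finish)
```

An RL scheduler would learn such a state-to-action mapping from observed execution times rather than hard-coding it, and could additionally exploit task types and queue contents that this heuristic ignores.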
Workload admission control and resource allocation: This has also been extensively studied, typically assuming some knowledge of (IT) resource demand (and its correlations) and supply. Such problems are related to multidimensional knapsack, bin-packing, and load-balancing problems, e.g., [11, 46, 59, 13], which are NP-hard. They have been extensively studied, including relaxations to simplified problems that yield approximately optimal solutions, e.g., by integer linear programs solved by iterated/online means, and including stochastic versions, e.g., [42, 14]. A relevant sub-field, starting with dominant-resource fairness (DRF), has been concerned with extending classical single-resource, single-server fair scheduling to multiple resources and multiple servers as appropriate for highly utilized cloud environments, e.g., [12, 22, 65]. Most of this scheduling work attempts to incrementally maximize performance while optimally utilizing resources shared between applications/jobs [37, 15, 55, 69, 48, 16, 18].
Orchestration of budget-conscious tenants in the public cloud: This area has been receiving significant attention, particularly in the form of using Amazon’s spot instances (spanning concerns of cost-efficacy, resource availability and optimal bidding), e.g., [50, 53]; using resource reservations, e.g., [71, 17]; exploiting price/performance/availability trade-offs across geo-distributed sites (including multiple cloud markets), e.g., [67, 43, 38]; design and analysis of cloud aggregators/brokers and “derivative” clouds, e.g., [2, 36]. Attention has also been given to exploiting the relatively new burstable instances, e.g., [63, 62].
Open-source cluster management frameworks: While numerous cluster orchestration frameworks exist as open-source software [27, 31, 60, 47, 40, 54], they tend to implement resource management policies suitable for private settings; public cloud orchestration frameworks are naturally proprietary. In our work, we leverage the existing code-base of Apache Mesos, which employs DRF as its default scheduling mechanism, adapted to a cluster of servers.
9 Future Work
In future work, we will consider application frameworks and middleware embodying more advanced, integrated online learning that leverages information from offline workload profiling and service-level agreements to more precisely characterize, online, workloads’ resource needs (demand) and executors’ capacities (supply). Actions by different application frameworks based on such learning include HeMT at a fast timescale and the determination of preferred types of executors based on cost/performance tradeoffs. For a budget-conscious tenant, we also plan to integrate such actions by application frameworks with scheduling by the cluster manager (middleware), i.e., DRF and its server-specific alternatives [30, 49], the latter improving the efficiency of resource use. That is, the cluster manager’s scheduler would be based on online estimates of the resource needs of its application frameworks’ tasks in order to obtain adequate performance (“fine-grain” mode scheduling).
10 Concluding Remarks
We investigated the pros and cons of two opposite views on load balancing - homogeneous microtasking and heterogeneous macrotasking - in large-scale parallel processing workloads that routinely run on modern public cloud platforms. Using tiny, equal-sized tasks (homogeneous microtasking, HomT) has long been regarded as an effective means of load balancing in parallel computing systems. When combined with nodes pulling in work upon becoming idle, more powerful nodes finish their work sooner and therefore pull in additional work faster. As a result, HomT is deemed especially desirable in settings with heterogeneous (and possibly dynamically changing) processing capacities. However, HomT imposes additional scheduling and I/O overheads that may make it more costly in some scenarios. In this paper, we analyzed these advantages and disadvantages of HomT using a combination of analytical modeling and data-driven experiments on an Apache Spark-based prototype (Spark’s built-in scheduler, when parameterized appropriately, already implements HomT). We then proposed an alternative load balancing scheme - Heterogeneous Macrotasking (HeMT) - wherein the workload (input data) is intentionally and carefully
partitioned into tasks of possibly different sizes. We explored different heuristics for such heterogeneous partitioning based on the amount of information available about the nodes’ processing capacities. Our goal was to study when HeMT can overcome the performance disadvantages of HomT. We implemented a prototype of HeMT within the Apache Spark application framework, with complementary enhancements to the Apache Mesos cluster manager. Our experimental results showed that HeMT outperforms HomT when accurate workload-specific estimates of the nodes’ processing capacities can be learned. In our experiments, Spark with HeMT improved average job completion times by about 10% over the default system.
This research was supported in part by NSF grant CNS-1717571 and a Cisco Systems URP gift.
-  A. Kopytov. Sysbench. https://github.com/akopytov/sysbench, 2016.
-  M. Aazam and E.-N. Huh. Broker as a Service (BaaS) pricing and resource estimation model. In Proc. IEEE CloudCom, 2014.
-  Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, and Dan Tsafrir. The resource-as-a-service (RaaS) cloud. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, HotCloud’12, pages 12–12, Berkeley, CA, USA, 2012. USENIX Association.
-  Amazon Web Services. Amazon cloudwatch pricing. https://aws.amazon.com/cloudwatch/pricing/. [Online; accessed 21-Jun-2018].
-  Ganesh Ananthanarayanan, Michael Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. Grass: Trimming stragglers in approximation analytics. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, NSDI’14, pages 289–302, Berkeley, CA, USA, 2014. USENIX Association.
-  Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce clusters using Mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, pages 265–278, Berkeley, CA, USA, 2010. USENIX Association.
-  Apache Hadoop. Hdfs design. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. [Online; accessed 20-Jul-2018].
-  Apache Hadoop. Hdfs rack awareness. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/RackAwareness.html. [Online; accessed 28-Sep-2018].
-  M.N. Bennani and D.A. Menasce. Resource allocation for autonomic data centers using analytic performance models. In Proc. IEEE ICAC, 2005.
-  A. Chandra, W. Gong, and P. Shenoy. Dynamic resource allocation for shared data centers using online measurements. In Proc. ACM SIGMETRICS, 2003.
-  C. Chekuri and S. Khanna. On multi-dimensional packing problems. SIAM Journal of Computing, 33(4):837–851, 2004.
-  M. Chowdhury, Z. Liu, A. Ghodsi, and I. Stoica. HUG: Multi-resource fairness for correlated and elastic demands. In Proc. USENIX NSDI, March 2016.
-  H.I. Christensen, A. Khan, S. Pokutta, and P. Tetali. Multidimensional Bin Packing and Other Related Problems: A Survey. https://people.math.gatech.edu/tetali/PUBLIS/CKPT.pdf, 2016.
-  M.C. Cohen, V. Mirrokni, P. Keller, and M. Zadimoghaddam. Overcommitment in Cloud Services: Bin Packing with Chance Constraints. In Proc. ACM SIGMETRICS, Urbana-Champaign, IL, June 2017.
-  C. Curino, D. E. Difallah, C. Douglas, S. Krishnan, R. Ramakrishnan, and S. Rao. Reservation-based scheduling: If you’re late don’t blame us! In Proc. SOCC, 2014.
-  C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient and QoS-aware Cluster Management. In Proc. ASPLOS, 2014.
-  C. Delimitrou and C. Kozyrakis. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proc. ASPLOS, Atlanta, 2016.
-  C. Delimitrou, D. Sanchez, and C. Kozyrakis. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proc. ACM SoCC, 2015.
-  R.P. Doyle, J.S. Chase, O.M. Asad, W. Jin, and A.M. Vahdat. Model-based resource provisioning in a web service utility. In Proc. USENIX USITS, 2003.
-  R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd Ed. Wiley, 2001.
-  Y. Fan, W. Wu, D. Qian, Y. Xu, and W. Wei. Load balancing in heterogeneous mapreduce environments. In Proc. IEEE HPCC & EUC, 2013.
-  E. Friedman, A. Ghodsi, and C.-A. Psomas. Strategyproof allocation of discrete jobs on multiple machines. In Proc. ACM Conf. on Economics and Computation, 2014.
-  A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In Proc. USENIX NSDI, 2011.
-  D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper. Workload analysis and demand prediction of enterprise data center applications. In 2007 IEEE 10th International Symposium on Workload Characterization, pages 171–180, Sept 2007.
-  J. L. Hellerstein, Fan Zhang, and P. Shahabuddin. An approach to predictive detection for service management. In Integrated Network Management VI. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management. (Cat. No.99EX302), 1999.
-  Joseph L. Hellerstein, Fan Zhang, and Perwez Shahabuddin. A statistical approach to predictive detection. Comput. Netw., 35(1):77–95, 2001.
-  B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A Platform for Fine-grained Resource Sharing in the Data Center. In Proc. USENIX NSDI, 2011.
-  V. Jalaparti, H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron. Bridging the tenant-provider gap in cloud services. In Proc. SoCC, 2012.
-  S. Kavalanekar, B. Worthington, Qi Zhang, and V. Sharda. Characterization of storage workload traces from production windows servers. In 2008 IEEE International Symposium on Workload Characterization, pages 119–128, Sept 2008.
-  J. Khamse-Ashari, I. Lambadaris, G. Kesidis, B. Urgaonkar, and Y.Q. Zhao. Per-Server Dominant-Share Fairness (PS-DSF): A Multi-Resource Fair Allocation Mechanism for Heterogeneous Servers. In Proc. IEEE ICC, Paris, May 2017.
-  Kubernetes. Production-grade Container Orchestration. http://kubernetes.io/, 2016.
-  YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC ’10, pages 75–86, New York, NY, USA, 2010. ACM.
-  YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. Skewtune: Mitigating skew in mapreduce applications. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 25–36, New York, NY, USA, 2012. ACM.
-  R. Levy, J. Nagarajarao, G. Pacifici, M. Spreitzer, A. Tantawi, and A. Youssef. Performance management for cluster based web services. In Germán Goldszmidt and Jürgen Schönwälder, editors, Integrated Network Management VIII: Managing It All, pages 247–261. Springer US, 2003.
-  Linux Kernel. Cfs bandwidth control. https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt. [Online; accessed 15-Aug-2018].
-  K. Liu, J. Peng, W. Liu, P. Yao, and Z. Huang. Dynamic resource reservation via broker federation in cloud service: A fine-grained heuristic-based approach. In Proc. IEEE GLOBECOM, 2014.
-  Hui Lu, Brendan Saltaformaggio, Ramana Kompella, and Dongyan Xu. vFair: Latency-aware Fair Storage Scheduling via per-IO Cost-based Differentiation. In Proc. ACM SoCC, 2015.
-  Jose Luis Lucas-Simarro, Rafael Moreno-Vozmediano, Ruben S Montero, and Ignacio M Llorente. Scheduling strategies for optimal service deployment across multiple clouds. Future Generation Computer Systems, 29(6):1431–1441, 2013.
-  H. Mao, M. Schwarzkopf, S.B. Venkatakrishnan, and M. Alizadeh. Learning graph-based cluster scheduling algorithms. In Proc. SysML, Stanford, CA, USA, Feb. 2018.
-  Marathon - A Container Orchestration Framework for Mesos and DC/OS. https://mesosphere.github.io/marathon/, last accessed, Sept. 2017.
-  Mesos multi-scheduler. https://github.com/yuquanshan/mesos/tree/multi-scheduler.
-  A. Meyerson, A. Roytman, and B. Tagiku. Online multidimensional load balancing. In Proc. Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, Springer LNCS vol. 8096, 2013.
-  F. Nawab, V. Arora, D. Agrawal, and A. El Abbadi. Minimizing commit latency of transactions in geo-replicated data stores. In Proc. ACM SIGMOD.
-  E.B. Nightingale, J. Elson, J. Fan, O. Hofmann, J. Howell, and Y. Suzue. Flat datacenter storage. In Proc. USENIX OSDI, Hollywood, CA, 2012.
-  K. Ousterhout, A. Panda, J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker, and I. Stoica. The case for tiny tasks in compute clusters. In Proc. USENIX HotOS, 2013.
-  J. Puchinger, G.R. Raidl, and U. Pferschy. The multidimensional knapsack problem: Structure and algorithms. INFORMS Journal on Computing, 22(2):250–265, Spring 2010.
-  O. Sefraoui, M. Aissaoui, and M. Eleuldj. Openstack: Toward an open-source solution for cloud computing. International Journal of Computer Applications, 55(3), 2012.
-  M. Shafiee and J. Ghaderi. Scheduling Coflows in Datacenter Networks: Improved Bound for Total Weighted Completion Time. In Proc. ACM SIGMETRICS, 2017.
-  Y. Shan, A. Jain, G. Kesidis, B. Urgaonkar, J. Khamse-Ashari, and I. Lambadaris. Scheduling distributed resources in heterogeneous private clouds. In Proc. IEEE MASCOTS, Milwaukee, Sept. 2018.
-  P. Sharma, S. Lee, T. Guo, D. Irwin, and P. Shenoy. Spotcheck: Designing a derivative IAAS cloud on the spot market. In Proc. EuroSys, 2015.
-  Apache Spark - Spark Configuration. https://spark.apache.org/docs/latest/configuration.html.
-  Spark with resource demand vectors. https://github.com/yuquanshan/spark/tree/d-vector.
-  S. Subramanya, T. Guo, P. Sharma, D. Irwin, and P. Shenoy. SpotOn: A batch computing service for the spot market. In Proc. ACM Symp. on Cloud Computing, pages 329–341, 2015.
-  Docker Swarm. https://docs.docker.com/engine/swarm/, last accessed, Sept. 2017.
-  P. Tembey, A. Gavrilovska, and K. Schwan. Merlin: Application- and platform-aware resource allocation in consolidated server systems. In Proc. ACM SOCC, 2014.
-  Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur, Namit Jain, Joydeep Sen Sarma, Raghotham Murthy, and Hao Liu. Data warehousing and analytics infrastructure at facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages 1013–1020, New York, NY, USA, 2010. ACM.
-  Ehsan Totoni, Subramanya R. Dulloor, and Amitabha Roy. A case against tiny tasks in iterative analytics. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, HotOS ’17, pages 144–149, New York, NY, USA, 2017. ACM.
-  B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi. An analytical model for multi-tier internet services and its applications. ACM SIGMETRICS Performance Evaluation Review, 33(1):291–302, 2005.
-  M.J. Varnamkhasti. Overview of the algorithms for solving the multidimensional knapsack problems. Advanced Studies in Biology, 4(1):37–47, 2012.
-  V.K. Vavilapalli, A.C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proc. ACM SOCC, SOCC, 2013.
-  Rares Vernica, Andrey Balmin, Kevin S. Beyer, and Vuk Ercegovac. Adaptive mapreduce using situation-aware mappers. In Proceedings of the 15th International Conference on Extending Database Technology, EDBT ’12, pages 420–431, New York, NY, USA, 2012. ACM.
-  C. Wang, B. Urgaonkar, A. Gupta, G. Kesidis, and Q. Liang. Combining spot and on-demand instances for cost effective caching. In Proc. ACM EuroSys, Belgrade, 2017.
-  C. Wang, B. Urgaonkar, N. Nasiriani, and G. Kesidis. Using Burstable Instances in the Public Cloud: What, When and How? In Proc. ACM SIGMETRICS, Urbana-Champaign, IL, June 2017.
-  C. Wang, Q. Wu, Y. Tan, W. Wang, and Q. Wu. Locality based data partitioning in mapreduce. In 2013 IEEE 16th International Conference on Computational Science and Engineering, pages 1310–1317, Dec 2013.
-  W. Wang, B. Li, B. Liang, and J. Li. Multi-resource fair sharing for datacenter jobs with placement constraints. In Proc. Supercomputing, Salt Lake City, Utah, 2016.
-  Wondershaper. https://github.com/magnific0/wondershaper.
-  Z. Wu, M. Butkiewicz, D. Perkins, E. Katz-Bassett, and H.V. Madhyastha. Spanstore: Cost-effective geo-replicated storage spanning multiple cloud services. In Proc. ACM SOSP, 2013.
-  W. Xu, X. Zhu, S. Singhal, and Z. Wang. Predictive control for dynamic resource allocation in enterprise data centers. In Proc. IEEE/IFIP NOMS, 2006.
-  Neeraja J. Yadwadkar, Ganesh Ananthanarayanan, and Randy H. Katz. Wrangler: Predictable and faster jobs using fewer resources. In SoCC, 2014.
-  Y. Yan, Y. Gao, Y. Chen, Z. Guo, B. Chen, and T. Moscibroda. Tr-spark: Transient computing for big data analytics. In Proc ACM SoCC, 2016.
-  M. Yao and C. Lin. An online mechanism for dynamic instance allocation in reserved instance marketplace. In Proc. IEEE ICCCN, 2014.
-  Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association.
-  H. Zhang, B. Cho, E. Seyfe, A. Ching, and M.J. Freedman. Riffle: Optimized Shuffle Service for Large-scale Data Analytics. In Proc. EuroSys, 2018.