Log In Sign Up

PingAn: An Insurance Scheme for Job Acceleration in Geo-distributed Big Data Analytics System

Geo-distributed data analysis in a cloud-edge system is emerging as a daily demand. Out of saving time in wide area data transfer, some tasks are dispersed to the edge clusters satisfied data locality. However, execution in the edge clusters is less well, due to limited resource, overload interference and cluster-level unreachable troubles, which obstructs the guarantee on the speed and completion of jobs. Synthesizing the impact of cluster heterogeneity and costly inter-cluster data fetch, we expect to make effective copies across clusters for tasks to provide both success and efficiency of the arriving jobs. To this end, we design PingAn, an online insurance algorithm making redundance across-cluster copies for tasks, promising (1+ε)-speed o(1/ε^2+ε)-competitive in sum of the job flowtimes. PingAn shares resource among a part of jobs with an adjustable ε fraction to fit the system load condition and insures for tasks following efficiency-first reliability-aware principle to optimize the effect of copies on jobs' performance. Trace-driven simulations demonstrate that PingAn can reduce the average job flowtimes by at least 14% more than the state-of-the-art speculation mechanisms. We also build PingAn in Spark on Yarn System to verify its practicality and generality. Experiments show that PingAn can reduce the average job completion time by up to 40% comparing to the default Spark execution.


page 1

page 2

page 3

page 4


Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs

Distributed Deep Learning (DDL) has rapidly grown its popularity since i...

Do the Hard Stuff First: Scheduling Dependent Computations in Data-Analytics Clusters

We present a scheduler that improves cluster utilization and job complet...

Towards Reliable (and Efficient) Job Executions in a Practical Geo-distributed Data Analytics System

Geo-distributed data analytics are increasingly common to derive useful ...

CloudCoaster: Transient-aware Bursty Datacenter Workload Scheduling

Today's clusters often have to divide resources among a diverse set of j...

DRESS: Dynamic RESource-reservation Scheme for Congested Data-intensive Computing Platforms

In the past few years, we have envisioned an increasing number of busine...

Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-based Platforms

Microsoft's internal big data analytics platform is comprised of hundred...

Peacock: Probe-Based Scheduling of Jobs by Rotating Between Elastic Queues

In this paper, we propose Peacock, a new distributed probe-based schedul...

1. Introduction

Today, several cloud applications are moving some of their functionality to edge devices to improve user-perceived fluency of interactions. The edges are considered as the extension of traditional data centers and they together constitute a large scale cloud-edge system  (Tan et al., 2017; Mao et al., 2017; Tong et al., 2016; Chen et al., 2017; Calder et al., 2013). A large amount of user data, e.g. logs, transaction records and traces, is generated and stored in edges. Analyses on these geo-distributed user data precipitate many realtime commerce-crucial decisions, for instance, user behavior predictions, load balancing and attack detections  (Gupta et al., 2014; Morita et al., 2014; Hsieh et al., 2017).

Traditional centralized data analysis needs to transfer all required data to one site, which is time consuming. Thus, modern data analytic platforms tend to disperse tasks to edges and process data locally. Previous works  (Kloudas et al., 2015; Vulimiri et al., 2015; Hung et al., 2015; Pu et al., 2015; Viswanathan et al., 2016; Wang and Li, 2017) carefully design the tasks scheduling policy to minimize the costly WAN data transmission, so as to speed up job completion.

However, achieving efficient geo-distributed data analyses still encounters obstacles besides the WAN bandwidth limitation. Owing to the limited resources, such as computing slots and import-export bandwidth, edge clusters may be easily overloaded under dynamic user access patterns, or even suffer a cluster-level unreachable trouble. Thus, tasks in edges sometimes perform badly even fail. In addition, edge clusters are heterogeneous. One task running on different edges may have totally different execution quality, while it is very similar among data centers.

One potential method to avoid the unpredictability of edges is utilizing idle resources to clone some critical tasks in multiple edges or data centers to guarantee the job completion time. Actually, several intra-cluster data analytic platforms have already adopted similar idea to handle straggling tasks  (Zaharia et al., 2008; Dean and Ghemawat, 2008; Ananthanarayanan et al., 2010; Ren et al., 2015; Chaiken et al., 2008; Ananthanarayanan et al., 2013, 2014; Xu and Lau, 2015b), but these cluster-scale straggler-handling mechanisms are unsuited for the inter-cluster insurance in a cloud-edge system.

Firstly, the normal task execution in a cluster assumes slot-independent, i.e., for a task, the execution time on each normal slot is similar, while it is quite different among different edges. Secondly, the difference of data fetch time caused by the task location is almost imperceptible inside cluster because that the data always has copies inside cluster and intra-cluster bandwidth is abundant, e.g. HDFS has three copies by default. In contrast, cloning a task in another edge or data center incurs inter-cluster data transmission over scarce WAN bandwidth, thus, the difference of data fetch among inter-cluster copies is non-negligible. In order to effectively speed up the jobs, we need to consider the impact of cluster heterogeneity and remote data fetch when insuring.

The revenue of insuring is embodied in the improvement of the task’s expected execution speed (efficiency) and the probability of completion (reliability). To measure efficiency improvement, we capture the heterogeneous performance of geo-distribution resource from the recent execution logs and quantify the effect of a task’s insurance plan as the change of its execution speed. For reliability, we quantify the effect of insuring as the increase of task’s completion probability and utilize the inter-cluster copies to maximally avoid failure caused by the cluster-level unreachable troubles.

Based on these quantifications, in this paper, we design PingAn, an online fine-grained insurance algorithm aiming to minimize the sum of job flowtimes. First, PingAn permits the first fraction of jobs with the least unprocessed data size to share the computing resource. Considering that for a task, the marginal revenue of an extra copy decreases as the task’s copy number increases, tuning to accommodate the system’s load condition is expected to motivate the best effect of copies under limited resource. Some hints about selection are also given in the paper according to the experiment results.

When concretely insuring for tasks, we care for both efficiency and reliability on copy’s cluster selection. However, towards our aim, trading off the gains of efficiency against the loss of ignoring reliability is hard and vice versa. Thus, irresolution arises in the course of insuring, such as at the moment of selecting cluster for a copy, arranging copies for a task, deciding the insuring order of tasks in a job and disposing the collision of preferential clusters among concurrent jobs.

PingAn insures for tasks adhering to the efficient-first reliability-aware principle which relys on the factor that the cluster-level trouble occasionally occurs but seriously harms a wave of jobs’ performance. Further, PingAn improves the efficiency via confining the worst execution rate for each task and averting a worse usage for each slot.

We prove that our online insurance algorithm, PingAn, is a competitive online algorithm in theory and verify the improvement effect of PingAn via trace-driven simulations. Further, we develop PingAn in Spark on Yarn system to handle real-world workloads due to its practicality and generality. To be practical, PingAn works without any priori knowledge of jobs beyond the current job progress. To be general, PingAn serves for general geo-distributed data analysis jobs with any precedence constraints among tasks.

To summarize, we make three main contributions:

  • We model the dynamic performance of geo-distributed resource to quantify its impact on task completion and formulate our online insurance problem as an optimization problem devoting to make an insurance plan to minimize the sum of job flowtimes.

  • We design PingAn, an online insurance algorithm and prove it is in the sum of job flowtimes where . In simulations, we demonstrate that PingAn can drastically improve the job performance in a cloud-edge environment under any system load condition and reduce the average job flowtimes by at least 14 than the best speculation mechanisms under heavy load and the improvement is up to 62 under lighter load.

  • We develop a prototype of PingAn in Spark on Yarn system and run jobs in comparison with the default Spark executions. Experiments show that PingAn can guarantee the efficient and reliable job executions in real-world implementation.

2. Related Work

Geo-distributed data analyses: Works  (Kloudas et al., 2015; Vulimiri et al., 2015; Hung et al., 2015; Pu et al., 2015; Viswanathan et al., 2016; Hu et al., 2016) as pioneers devote to the performance problem of geo-distributed data analysis. Iridium  (Hung et al., 2015) coordinates data and task placement to improve query response. Clarinet  (Pu et al., 2015) makes query execution plans with a wide-area network awareness. Flutter  (Hu et al., 2016) minimizes the completion time of stage via optimizing task assignment across data centers . These proposed solutions reduce WAN transfer to improve job performance and assumes unlimited computing resources in data centers at all times. However, in a cloud-edge scenario, the resource-limited and unreliable edge clusters harms the job performance. We consider the edges’ limitation and ensure the task execution via inter-cluster task copying.

Passive detection speculation mechanism: This part of works mitigate the abnormal task impact on job completion via monitoring tasks’ execution and restart a new copy for identified straggler. Initially, Google MapReduce system speculatively schedules copies for the remaining tasks at the end of job  (Dean and Ghemawat, 2008). It restrains the long-tail tasks but wastes resource on lots of normal tasks. Thus LATE  (Zaharia et al., 2008) and its extended works identifies slow tasks accurately via delicately comparing tasks’ progress rate. Mantri  (Ananthanarayanan et al., 2010) schedules a copy for a task only when the task’s total resource consumption decreases. Hopper  (Ren et al., 2015) designs the best speculation-aware job scheduler under its task duration model.

However, the above cluster-scale speculation mechanisms lose efficacy in cloud-edge environment. First, monitoring lots of remote tasks is costly. Further, the cluster-level unreachable troubles, for instance, power supply interruption, master server shutting down, the failure of high layer exchanger which leads to a network disconnection and many more complicated cases caused by a series of operation accidents, obstruct the system to timely detect straggling tasks. Then, the normal task standard is indecisive since the cluster heterogeneity, which delays speculation. Besides, the time-consuming WAN transfer in a restart copy further destroy the acceleration effect of speculation. We insuring for tasks at the start of execution to avoid these problems.

Proactive clone mechanism: This part of works devote to reduce the straggler occurrence of a job via task cloning at start. Dolly  (Ananthanarayanan et al., 2013) refines the straggler-occurring likelihood of some jobs beyond a certain threshold.  (Xu and Lau, 2015b) adopts task cloning to speed up job completion and proposes an competitive online scheduling algorithm to optimize the sum of the job flowtimes. Work in  (Xu and Lau, 2015a) proposes Smart Cloning Algorithm to maximize the sum of job utility via task cloning. Given the cluster heterogeneity and WAN transfer demand, the copy execution in different clusters differs. Thus, in a cloud-edge system, the above cluster-scale cloning mechanisms which only decide the copy number for each task fail to achieve the effect of copies on job performance improvement. To this end, we make the fine-grained insurance plan to optimize copy effect.

3. System Overview and Insurance Problem

3.1. Geo-distributed Data Analysis System with PingAn

PingAn utilizes users’ geo-distributed resource to guarantee the efficiency of their routine data analyses. We develop PingAn in Spark  (Zaharia et al., 2012) on Yarn  (Vavilapalli et al., 2013) system across multiple clusters as shown in Figure 1. The resource of each cluster (dash line in Figure 1) is managed via one ResourceManager(RM) in Yarn. RM receives jobs from Spark client and resolves the job’s description to generate a corresponding AppMaster(AM) for each job. Inside AM, DAGScheduler creates TaskSet for the ready tasks. PingAn works as shown in Figure 1(a).

  1. DAGScheduler fetches the data location information of tasks from OutputRecorder in AM and inserts it into TaskSet. The OutputRecorder records the intermediate data location once a completed task reports its output message to it.

  2. TaskSet is then send to PingAn and waits in TaskSetPool. Multiple TaskSets in TaskSetPool are queued in an ascending order of unprocessed data size.

  3. PerformanceModeler(PM) in PingAn regularly collects the execution information in each cluster from RMs and models the dynamic of resource capacity.

  4. Insurancer in PingAn periodically fetches the TaskSets from TaskSetPool

    and draws up an insurance plan for each task with an aware of its completion time in each cluster which is estimated depending on the resource performance model in


  5. TaskSet along with its insurance plan is sent back to TaskScheduler for execution. AM sends the resource(container) requests to RMs in the clusters specified in the insurance plan and launched the tasks on the obtained containers.

(a) The cooperation between Spark on Yarn and PingAn
(b) The collection of execution information
Figure 1. The workflow of PingAn in Spark on Yarn.

The collection of execution information is depicted in Figure 1(b). As we see, AppMaster has two tasks running in two clusters (dash line).

  1. After a task finishing its work, it reports its output location and execution information to AM. The execution information contains the data processing speed and the inter-cluster transfer speed (with a specification of the two end clusters) obtained by the task.

  2. AM sends the execution information to RM of the cluster running the task.

  3. RM periodically sends the collected information to PM in PingAn. PM builds resource performance models and serves for Insurancer. The modeling is described in Section 3.2.

3.2. Quantification of Cluster Selection’s Impact on Execution

In this subsection, we first use the execution information to model the resource performance and quantify the impact of multi-copy execution on the task’s efficiency. Then, we quantify the multi-copy execution’s impact on task’s reliability.

For quantifying the impact on efficiency, first, we tally the data processing speed of recent tasks in a cluster and obtain a distribution of the speed value to reflect the unstable computing capacity of cluster . Specifically, a newly launched task has probability to run with processing speed in cluster .

Notice that we use data processing speed instead of cpu processing speed because that the latter is tiring to monitor and it is impossible to be used to estimate task’s time due to unit difference. In order to eliminate the data processing speed bias caused by task type, we meticulously model such a distribution for each RDD operation which composes the Spark job and ensure a task’s data processing speed distribution according to its operation.

Second, in the same way, we use to denote the distribution of data transfer bandwidth from cluster to to reflect the unstable cluster-pair’s transfer capacity. The bandwidth of a transfer is captured at the download end. A task may need multiple transfer and denoted by its input location set. When the task launches on cluster , its average transfer bandwidth is given as follows

Define as the distribution of .

The execution rate of a task is denoted as where is the number of the task’s copies. The execution rate of a task’s copy hinges on the bottleneck of data transfer and data processing. Thus, we have to indicate the execution rate of a copy in cluster . Let be the corresponding distribution of . Both of the distribution and

can be easily derived via the composition computation of multiple discrete random variables. If a task has just one copy in cluster

, we have . If the task has another extra copy in cluster , then and so on.

For quantifying the impact on reliability, first, let be the probability of encountering cluster-level unreachable trouble in cluster

, which is obtained via the statistic. We assume time to be slotted and the cluster-level failure are independent over time. Hence, we can assume the cluster-level failure to follow a binomial distribution. To recap, a copy in cluster

has probability to exempt from cluster-level trouble at each time slot. denotes the probability of the task encountering no cluster-level trouble during its execution and is given by

where is the execution time of task in cluster . If the task has one extra copy in the same cluster, where the probability of task encountering troubles at each time slot is invariant because that once the cluster-level trouble happens, both two copies fail. If the task has one extra copy in the other cluster , the trouble encountering probability of the task at a time slot decreases to , thus we have . More copies are as the same analogy.

3.3. Formulation

Consider a system consisting of clusters denoted by the set . The clusters’ topology conforms to a heavy-tailed distributions, which means that each large-scale data center are linked by multiple small edges and multiple data centers are interconnected  (Tong et al., 2016). Some neighboring edges are also connective. denotes the number of computing slots in the cluster. The egress and ingress bandwidth restriction for the cluster is referred to as and respectively.

Assume a set of jobs arriving over time. Job arrives at time and consists of tasks . The flowtime of job is where is the job’s completion time. Our insurance problem aims to minimize the sum of job flowtimes’s expectation. The formulation is outlined below.

In formulation, indicates insuring copy for task in cluster . As an effective insurance, constraint in Eq. (3) ensures each task at least one copy to complete its work (Here, task with just one copy means that task executes without speculation). Eq. (4) states that a task in a job can only be scheduled after the job’s arrival time. Eq. (5)-Ep. (8) ensures the execution of tasks satisfying the precedence order in a job. Let represents the partial order in job . The start time of tasks () should obey the partial order , which means that for each pair of ordered tasks, , it satisfies where is the execution time of task . In addition, in Eq. (6) denotes the task’s datasize and in Eq.(7) denotes the completion time of task . Constraints in Eq. (9), Eq. (10) and Eq. (11) forestall the terrible contention for cluster’s computing slots and gate bandwidths at any time slot. Eq. (12) is due to that the completion time of a job depends on its last task’s completion time.

The Difficulty of Insurance Problem: Without considering the speedup via task cloning and limiting the gate bandwidth of clusters, our problem can be simplified to the scheduling problem in  (Zheng et al., 2013) which has proven to be NP-hard. Therefore, our problem is naturally NP-hard and we devote to solve the problem with an approximation bound.


4. PingAn Insurance

4.1. Algorithm Design

Motivated by the work in  (Xu and Lau, 2015b) which speed up the online job via task cloning in a single cluster, we extend its idea that the jobs with the least remaining workload shares the machines of a cluster and design our PingAn insurance algorithm. PingAn works at the beginning of each time slot. Out of practicality, we only use the job knowledge available at the current scheduled stage. The effective workload of a job can be characterized by the unprocessed data size of its current stage. The jobs with higher priority has less unprocessed data size than jobs with lower priority.

Let be the number of alive jobs at time and be a value in . The first jobs with the least unprocessed data size fairly shares the geo-distributed slots, which means that each prior job is admitted to obtain at most slots and the other jobs with lower priority can obtain nothing. After deciding the promissory slots for each job, PingAn insures the cluster-specified slots to each task after multiple rounds. Notice that, in any round of insurance, the total slots number insured for a job is limited to its promissory slot number .

Input: , the set of alive jobs at current time slot ; , the available slots at time ; , the resource condition at time ; , the gate bandwidth limit of clusters;
Output: An insurance plan
1 Sort the jobs in according to the ascend order of unprocessed datasize;
2 for each  do
3       Compute , the slot number promised to job ;
4       Count , the number of slots running ’s tasks;
6, the assigned slot number in a round;
7 for each and  do
8       Extract waiting tasks to ;
9       for each and  do
10             Try to do efficient-first insurance for ;
11             if insuring succeeds then
12                   ++, ++;
16if  == 0 then
17       return;
20 for each and  do
21       Extract tasks assigned a slot to ;
22       Compute for each ;
23       Sort tasks in in the ascend order of ;
24       for each and  do
25             Try to do reliability-aware insurance for ;
26             if insuring succeeds then
27                   ++, ++;
31if  == 0 then
32       return;
34while true do
35       ;
36       for each and  do
37             Extract tasks copied in the last round to ;
38             for each and  do
39                   Try to do resource-saving insurance for ;
40                   if insuring succeeds then
41                         ++, ++;
45      if  == 0 then
46             return;
Algorithm 1 PingAn Insurance Algorithm

In the first round, PingAn only insures at most one slot for each task in order of job priority according to an efficiency-first principle. When its turn arrives, a task can obtain a slot and run with currently the best execution rate , as long as the related gate bandwidth restrictions are satisfied and the execution rate is not worse than fraction of the global optimal rate obtained by the task when only it executes in the system. If the bandwidth are not enough or the current best slot’s rate is too worse, the task waits for the next insurance.

In the second round, PingAn utilizes the current idle slots to improve the reliability of each job in priority order following reliability-aware principle. Inside each job, PingAn prefers to insure an extra copy for the tasks with the worse trouble-exemption probability . After meeting the bandwidth restrictions and the lower limit of execution rate, the slot is selected from the cluster where the copy execution can improve the task’s to the greatest extent.

In the third or the later round, beside following the efficient-first principle, PingAn starts to consider the opportunity cost of a slot being an extra copy since the slot can be saved to complete many more tasks in the next insurances. Given that an insured task in the third round already has a copy with the best efficiency and an extra copy improving the execution reliability, a slot used to run the third copy of the task plays a relatively less role on performance improvement than using the slot to run the first or second copy of the other tasks. Thus, in the third and the later round, PingAn conservatively insures a copy for a task only if it saves both time and resources consumed, which is referred to as resource-saving copy. To be specific, supposing to decide whether to schedule the copy of a task in cluster (), PingAn calculates the execution rate and the corresponding execution time of the task if the extra copy performs. Only if the , the extra copy is permitted to insuring for the task. Algorithm 1 summarizes in detail how PingAn insures for the jobs.

As applied in PingAn, the efficient-first principle means that the efficiency should be satisfied priori to the reliability for a task execution, which is motivated by the factor that resource performance fluctuates frequently but the cluster-level trouble is occasional. Thus, when PingAn insures the first slot to the task, aiming at the efficiency can drastically and directly reduce the execution time.

The efficiency-first principle in the first round also enforces that the efficiency of a job with promissory slots should be satisfied priori to the reliability of a job with higher priority. Recalling that our objective is to minimizing the sum of job flowtimes, a slot used for job’s reliability generally contributes less to the objective than a slot used for job’s efficiency since the former has less chance to save the execution time from occasional cluster-level troubles. Therefore, PingAn insures only essential copy for all qualified jobs’ tasks in the first round and insures extra copies in the later rounds, which is referred to as Efficient-First Allocation (EFA) among jobs. The other alternative is to insure both essential and extra copies for each job in priority order, which is referred to as Job Greedy Allocation (JGA) among jobs. We compare the practical performance of two candidates in Section 6.3 to verify the correctness of efficient-first principle.

The reliability-aware part is to emphasize that in despite of the occasionality, a cluster-level trouble can harm a large scale of job performances. Thus, the reliability is indispensable towards performance improvement. We verify the practical effect of the efficient-first reliability-aware principle in Section 6.3.

4.2. Analysis of PingAn

In this section, we assumes resource augmentation  (Berman and Coulston, 1998) for PingAn and derive the approximation bound via the method of potential function analysis, . In the analysis, time is continuous and we do not consider the cluster-level unreachable troubles in our approximate bound analysis because that the impact of such failures on job flowtime is hard to be measured. Before the details of potential function analysis, we first prove the following Proposition 1 and deduce the final approximation bound with the help of it.

Proposition 1 ().

For any integer , we have under PingAn algorithm.


See Appendix A.1. ∎

Depending on Proposition 1, we prove the following Theorem 2. Under a resource augmentation assumption, the resource speed in PingAn is times faster than the one in the optimal adversary algorithm. Theorem 2 states that the sum of job flowtimes in PingAn is within factor of the optimal algorithm with a resource augmentation.

Theorem 2 ().

PingAn is approximated algorithm for the sum of the expectation of job flowtimes when .


See Appendix A.2. ∎

5. Implementation on Real System

We develop the PingAn in Spark on Yarn, a general geo-distributed data analysis system, and run a series of typical workloads to consolidate the practicality and acceleration ability of PingAn.

Testbed: Our experiments are deployed on 10 VMs running a 64-bit Ubuntu 16.04. Four of them have 8 CPU cores and 20GB memory, the others have 4 CPU cores and 10GB memory. We regard the 10 VMs as ten different edge clusters and the number of containers concurrently running on the VM corresponds to the computing slots number in the edges. We run two ResourceManagers in charge of 5 VMs respectively. We use the Wondershaper to limit the egress and ingress bandwidth of each VM. We intentionally run benchmarks in each VM to consume its spare resources to different extent (Ubench for CPU and memory, Bonnie for disk I/O and Iperf for external bandwidth) in order to cause performance difference via resource contention. In addition, a script file is running for executing shutdown command in VM according to the preset probability to imitate the cluster-level errors. The adjustable parameter in PingAn is set to be 0.6.


The workload includes 88 jobs such as WordCount, Iterative machine learning and PageRank. The variation in input sizes is based on real workloads from Yahoo! and Facebook  

(Vavilapalli et al., 2013) with a reduced scale as shown in Table 1

. We randomly distribute the input across the 10 VMs. The job submission time follows an exponential distribution. The average workload intensity is 3 jobs per 5 min.

Baseline: We compare PingAn with the Spark with delay scheduling for tasks and fair scheduling for jobs and the speculative Spark when Spark’s default speculation mechanism works.


We focus on the average flowtime of jobs and the cumulative distribution function (CDF) of job flowtimes.

JobType WordCount Iterative ML PageRank
Small 100-200MB 130-300MB 150-400MB
Medium 0.7-1.5GB 1.3-1.8GB 1-2GB
Large 3-5GB 2.5-4GB 3.5-6GB
Table 1. Workload Constitution
Figure 2. The average job flowtime comparison under PingAn, Spark and speculative Spark execution.
(a) The flowtime CDF of jobs with 500s flowtime.
(b) The flowtime CDF of jobs with 300s flowtime.
Figure 3. The CDF of job flowtimes under PingAn, Spark and speculative Spark execution.

Figure 2 shows that PingAn reduces the average job flowtime by 39.6 comparing to the default speculation mechanism in Spark. As exhibited concretely in Figure 3, PingAn efficiently reduces the job flowtimes via coordinating resource contention among jobs and insuring proper copies for tasks. Figure 3(a) depicts the CDF of flowtimes for jobs whose flowtime is between 0 and 500 seconds under three algorithm. It indicates that 72.4 jobs in PingAn finishes within 200 seconds while the proportion in speculative Spark and Spark are 65.6 and 45.9 respectively.

Figure 3(b) depicts the CDF of flowtime for jobs whose flowtime is larger than 300 seconds. It shows that the detect-based speculation mechanism in Spark inhibits the overlong tasks. PingAn arranges copies at the execution start to avoid straggler, thus, it is helpless in face of slow tasks, while it saves the system’s cost of remote monitoring.

6. Performance Evaluation

ClusterType Proportion VM Number VM Power WAN Bandwidth Mean(mips) Mean(kb/s) RSD Large-scale 5 500-1500 55-75 174-355 0.25-0.6 64-256 0.2-0.5 0.002-0.011 Medium-scale 20 50-500 65-85 128-241 0.55-0.85 0.02-0.2 Small-scale 75 10-50 75-95 68-179 0.35-0.75 0.05-0.5

Table 2. Simulation Experiments settings

In simulations, we expand experiments’ scale to verify the acceleration effect of PingAn on different conditions.

6.1. Methodology

Simulation Setup: We modify CloudSim to support our simulation experiments.

The clusters launched in simulation have large, median and small three kinds of scale. We use the BRITE Topo generator to create 100 clusters with a heavy-tailed distribution around the world. We sort 100 clusters in the decreasing order of their degrees and let the first clusters be the large-scale cluster, the following be the medium one and the rest be the small one.

Table 2 shows the various kinds of parameters’ range setting in different scale cluster. Some parameters’ range (VM Power and WAN Bandwidth) is based on the real performance analysis experiments on Amazon EC2 or other public clouds  (Dejun et al., 2010; Schad et al., 2010; Zaharia et al., 2008), and some parameters’ range is set to be wide for excavating the ability of our algorithm. Specially, the Gate Bandwidth Limit Ratio in the fourth column of Table 2

indicates the ratio of the egress/ingress bandwidth to the sum of the VMs’ external bandwidth of a cluster. We assumes the VM power and inter-cluster bandwidth to follow a normal distribution as observed in  

(Schad et al., 2010).

Workloads: We construct an synthetic workloads containing 2000 Montage workflows. Montage workflow assembles high-resolution mosaics of region of the sky from raw input data, which consist of the tasks with high demand of both data transfer and computing. The job size distribution refers to the traces in Facebook’s production Hadoop cluster  (Ananthanarayanan et al., 2010, 2013, 2014, 2012) that , and of jobs are with small (1-150), medium (151-500) and large (

500) task numbers respectively. We randomly disperse the raw input data of each workflow to the edges as well as some medium-scale clusters. The workflow inter-arrival times are derived from a Poisson distribution. We adjust system load condition via the Poisson parameters

from 0.02 to 0.15.

Baseline: We compare PingAn with four baseline algorithms.

  1. Flutter. Flutter is a geo-distributed scheduler to optimize stage completion time.

  2. Iridium. Iridium optimizes data and task placement to reduce the WAN transfer during the job execution.

  3. Flutter+Mantri. Mantri is demonstrated to be the best detection-based speculation mechanism inside cluster.

  4. Flutter+Dolly. Dolly is a passive cloning mechanism and performs better than Mantri under the Facebook’s trace.

Metric: We focus on the same metric in Section 5. In addition, for Dolly, Mantri and PingAn, we focus on their reduction in job flowtime of the Flutter as well as the CDF of the reduction ratio. Under each setting, we run our workloads ten times and calculate the average flowtime of the ten executions for each job as its final flowtime.

6.2. Comparison against Baselines under Different Load

Figure 4. The performance comparison under different load condition.

We compare the the average job flowtime of PingAn with four baselines under light, medium and heavy load respectively. We set for PingAn under light load, under medium load and under heavy load according to the selection hint in Section 6.4.

Figure 4 shows the comparison results. Without the awareness of cluster heterogeneity, the job performance in both Flutter and Iridium keep away from the expectation. As a whole, Dolly and Mantri have adept load case apiece and PingAn works the best on all load condition. PingAn reduces the average job flowtime by 52.9, 61.9 and 13.5 than the best baseline under light, medium and heavy load respectively. More details are illustrated in Figure 5.

(a) CDF of flowtime under light load ()
(b) CDF of flowtime reduction under light load ()
(c) CDF of flowtime under medium load ()
(d) CDF of flowtime reduction under medium load ()
(e) CDF of flowtime under heavy load ()
(f) CDF of flowtime reduction under heavy load ()
Figure 5. The details of job performance in PingAn and baselines in different load condition. (a)(c)(e) depicts the CDF of job flowtimes for PingAn and each baselines under three load; (b)(d)(f) depicts the CDF of jobs flowtime reduction ratio to the Flutter for PingAn, Mantri and Dolly under three load.

In lightly loaded case, Mantri and Dolly are well-matched. The sufficient idle slots admits Dolly making enough clones to avoid stragglers, and co-existed task number is little enough for Mantri to detect the straggler and copy it quickly. As shown in Figure 5(a), the fraction of jobs finishing within 100 seconds is 70.5 in Dolly and 73.7 in Mantri. However, PingAn performs better that 76.9 jobs finishes within 100 seconds under light load. PingAn makes copies toward execution efficiency and reliability directly and improves the performance the most. Seen in another light, as shown in the Figure 5(b), more than 70 jobs in PingAn has at least 91.4 reduction in flowtime. In comparison, Mantri and Dolly has only 74.3 and 85.2 at their reduction ratio.

In moderately loaded case, Mantri is insensitive to promote tasks with a relatively moderate latency which are the majority under the medium load and sometimes the restart copy is ineffective due to costly WAN transfer. Dolly improves the job efficiency via aggressively making copies and works better than Mantri as shown in Figure 5(c) and 5(d). In Figure 5(c), Dolly has 67.5 jobs finishing within 200 seconds and Mantri only has 61.34. In Figure 5(d), more than 70 jobs in Dolly has at least 89.7 flowtime reduction while 88.1 in Mantri. PingAn further precedes Dolly since it insures more efficient and reliable copies to tasks suffering higher risk instead of aggressively cloning as Dolly. Thus, 84.03 jobs in PingAn can finish within 200 seconds and the value at reduction ratio is 94.13.

In the heavy load case, Mantri effectively restrains the overlong tasks. Figure 5(e) illustrates that 59 jobs in Mantri finishes within 800 seconds under heavy load. The result is better than the 37.6 in Dolly but worse than the 71.0 in PingAn. PingAn speeds up the job flowtimes the most even under the heavy load via optimizing copy effect as depicted in Figure 5(f). Concretely speaking, PingAn improves the job flowtimes by 49.6 at . In contrast, Mantri is 41.1 and Dolly even makes 63.4 jobs flowtime be longer due to its reckless preemption.

6.3. Impact of Insurance Principle

In this subsection, we verify the effect of efficiency-first reliability-aware principle via comparing the job performance after exchanging the insuring principle in the first two round.

(a) Effect comparison of candidate principles in the first two round of PingAn
(b) Effect comparison of Job-Greedy and Efficient-first in PingAn
Figure 6. The effect of efficient-first reliability-aware principle when in PingAn and load parameter .

The original insuring scheme in PingAn is denoted as Eff-Reli. For the others, the one that uses reliability-aware in the first round and efficient-first in the second round is denoted as Reli-Eff, the one that uses efficient-first in both two rounds is denoted as Eff-Eff and the one that uses the reliability-aware in both two rounds is denoted as Reli-Reli. Figure 6(a) shows that Eff-Reli performs better than the candidates violating the efficiency-first principle and its average job flowtime is less than Reli-Eff and Reli-Reli by 18.5 and 52.8 respectively. Although the efficiency is priority to the reliability, however, the reliability is also worthy to consider since Eff-Eff without the awareness of reliability is worse than Eff-Reli by 4 in the average job flowtime.

The efficiency-first principle also works on resource allocation among multiple jobs in the first insuring round. Figure 6(b) indicates that EFA works better than JGA. Specially, the average flowtime of EFA is less than JGA by 39.4.

6.4. Hint on Selection


The average job flowtime variance of

under different
(b) The average job flowtime variance of under different
Figure 7. The relation between and .

The adjustable performance parameter in PingAn need to tune to some value that best fits the system load condition. The trades off the overall performance improvement between the acceleration of jobs with smaller workloads and the completion of jobs with larger workloads. We adjust the Poisson parameters to control the jobs arriving rate and evaluate the impact of on the average job flowtimes in each load condition. The evaluation results is depicted in Figure 7.

Under five workload arriving rate (let to be 0.02, 0.05, 0.07, 0.11 and 0.15 respectively), the workload’s favourite value is 0.8, 0.6, 0.6, 0.4 and 0.2 respectively. It can be a hint to select for a system. For a lightly loaded case, the selection of is partial to be a moderate or little bigger value to fully utilize idle resources. For a heavily loaded case, the value prefers to be closer to 0.2 to strive more efficiency for the small jobs arriving at the system.

7. Conclusion

In this paper, we focus on an online geo-distributed job flowtimes optimization problem in a cloud-edge system. To address the unstable and unreliable execution in edges, we propose PingAn insuring algorithm to speed up jobs via inter-cluster task copying and provide a bounded competitive ratio. PingAn excavates the insuring revenue better on account of the awareness of cluster heterogeneity and costly inter-cluster data fetch on copy execution. Both of our system implementation and extensive simulation results demonstrate that under any load condition, PingAn can drastically improve the geo-distributed job performance and surpass the best cluster-scale speculation mechanisms by at least 14.


  • (1)
  • Ananthanarayanan et al. (2013) G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. 2013. Effective straggler mitigation: attack of the clones. In NSDI. 185–198.
  • Ananthanarayanan et al. (2012) G. Ananthanarayanan, A. Ghodsi, A. Wang, D. Borthakur, S. Kandula, S. Shenker, and I. Stoica. 2012. PACMan: Coordinated memory caching for parallel jobs. In NSDI. 20–20.
  • Ananthanarayanan et al. (2014) G. Ananthanarayanan, C. Hung, X. Ren, I. Stoica, A. Wierman, and M. Yu. 2014. GRASS: trimming stragglers in approximation analytics. In NSDI. 289–302.
  • Ananthanarayanan et al. (2010) G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. 2010.

    Reining in the Outliers in Map-Reduce Clusters using Mantri. In

    OSDI. 265–278.
  • Berman and Coulston (1998) P. Berman and C. Coulston. 1998. Speed is more powerful than clairvoyance. In Scandinavian Workshop on Algorithm Theory. 255–263.
  • Calder et al. (2013) M. Calder, X. Fan, Z. Hu, E. Katz-Bassett, J. Heidemann, and R. Govindan. 2013. Mapping the expansion of Google’s serving infrastructure. In PIMC. 313–326.
  • Chaiken et al. (2008) R. Chaiken, B. Jenkins, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. 2008. SCOPE: easy and efficient parallel processing of massive data sets. VLDB Endowment 1, 2 (2008).
  • Chen et al. (2017) M. H. Chen, B. Liang, and M. Dong. 2017. Joint offloading and resource allocation for computation and communication in mobile cloud with computing access point. In INFOCOM. 1–9.
  • Dean and Ghemawat (2008) J. Dean and S. Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.
  • Dejun et al. (2010) J. Dejun, G. Pierre, and C. H. Chi. 2010. EC2 performance analysis for resource provisioning of service-oriented applications. In ICSOC. 197–207.
  • Edmonds and Pruhs (2009) J. Edmonds and K. Pruhs. 2009. Scalably scheduling processes with arbitrary speedup curves. In PSODA. 685–692.
  • Fox et al. (2013) K. Fox, S. Im, and B. Moseley. 2013. Energy efficient scheduling of parallelizable jobs. In PSODA. 948–957.
  • Gupta et al. (2014) A. Gupta, F. Yang, J. Govig, A. Kirsch, K. Chan, K. Lai, S. Wu, S. G. Dhoot, A. R. Kumar, and A. Agiwal. 2014. Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing. In VLDB. 1259–1270.
  • Hsieh et al. (2017) K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R Ganger, P. B. Gibbons, and O. Mutlu. 2017. Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds.. In NSDI. 629–647.
  • Hu et al. (2016) Z.M. Hu, B.C. Li, and J. Luo. 2016. Flutter: Scheduling tasks closer to data across geo-distributed datacenters. In INFOCOM. 1–9.
  • Hung et al. (2015) C. Hung, L. Golubchik, and M. Yu. 2015. Scheduling jobs across geo-distributed datacenters. In PSCC. 111–124.
  • Im et al. (2016) S. Im, B. Moseley, K. Pruhs, and E. Torng. 2016. Competitively scheduling tasks with intermediate parallelizability. TOPC 3, 1 (2016), 4.
  • Kloudas et al. (2015) K. Kloudas, M. Mamede, N. Preguiça, and R. Rodrigues. 2015. Pixida: optimizing data parallel jobs in wide-area data analytics. VLDB Endowment 9, 2 (2015).
  • Mao et al. (2017) Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief. 2017. A Survey on Mobile Edge Computing: The Communication Perspective. IEEE Communications Surveys & Tutorials 19, 4 (2017), 2322–2358.
  • Morita et al. (2014) A. Morita, M. Iki, Y. Dohi, Y. Ikeda, S. Kagamimori, Y. Kagawa, and H. Yoneshima. 2014. Aggregation and Degradation in JetStream: Streaming analytics in the wide area. In NSDI. 275–288.
  • Pu et al. (2015) Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica. 2015. Low Latency Geo-distributed Data Analytics. In SIGCOMM. 421–434.
  • Ren et al. (2015) X. Ren, G. Ananthanarayanan, A. Wierman, and M. Yu. 2015. Hopper: Decentralized Speculation-aware Cluster Scheduling at Scale. In SIGCOMM. 379–392.
  • Schad et al. (2010) J. Schad, J. Dittrich, and J. Quiané-Ruiz. 2010. Runtime measurements in the cloud: observing, analyzing, and reducing variance. In VLDB. 460–471.
  • Tan et al. (2017) H. Tan, Z. Han, X. Li, and F. Lau. 2017. Online job dispatching and scheduling in edge-clouds. In INFOCOM. 1–9.
  • Tong et al. (2016) L. Tong, Y. Li, and W. Gao. 2016. A hierarchical edge cloud architecture for mobile computing. In INFOCOM. 1–9.
  • Vavilapalli et al. (2013) V. K. Vavilapalli et al. 2013. Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th annual SCC. 5.
  • Viswanathan et al. (2016) R. Viswanathan, G. Ananthanarayanan, and A. Akella. 2016. CLARINET: WAN-Aware Optimization for Analytics Queries. In OSDI. 435–450.
  • Vulimiri et al. (2015) A. Vulimiri, C. Curino, P. B. Godfrey, T. Jungblut, J. Padhye, and G. Varghese. 2015. Global analytics in the face of bandwidth and regulatory constraints. In NSDI. 323–336.
  • Wang and Li (2017) H. Wang and B. Li. 2017. Lube: Mitigating bottlenecks in wide area data analytics. In 9th USENIX Workshop on Hot Topics in Cloud Computing.
  • Xu and Lau (2015a) Huanle Xu and Wing Cheong Lau. 2015a. Optimization for speculative execution in a MapReduce-like cluster. In INFOCOM. 1071–1079.
  • Xu and Lau (2015b) H. Xu and W. C. Lau. 2015b. Task-Cloning Algorithms in a MapReduce Cluster with Competitive Performance Bounds. In ICDCS. 339–348.
  • Zaharia et al. (2012) M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.
  • Zaharia et al. (2008) M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. 2008. Improving MapReduce performance in heterogeneous environments. In OSDI. 29–42.
  • Zheng et al. (2013) Y. Zheng, N. B. Shroff, and P. Sinha. 2013. A new analytical technique for designing provably efficient mapreduce schedulers. In INFOCOM. 1600–1608.

Appendix A

a.1. Proof of Proposition 1

For convenience of analysis, we assumes that the distribution of data processing speed inside a cluster as well as the distribution of data transfer speed inter a cluster-pair can be fit to a continuous distribution.  (Schad et al., 2010) supports the assumption. They conducted a performance analysis spanning multiple Amazon EC2 clusters and found several of the performance measurements of VMs - particularly network bandwidth - to be normally distributed.


After insuring rounds in PingAn, The task execution rate . The execution rate of the copy follows a distribution, i.e., and let .

Let and define as the cumulative distribution function of . We have


and further deduce its derivation that


In the first place, we prove that


when . We expand the left side of Eq. (15) as shown in Eq. (16). The second equality in Eq. (16) follows the definition of expectation.


and from the right side, we have


The third equality in the Eq. (17) applies the definition of in Eq. (14). Obviously, the first term in the last formula of Eq. (16) is greater than the first term in the last formula of Eq. (17) because that . Consequently, we only need to prove that the remainder of Eq. (16) and Eq. (17) satisfies the following inequality.


To this end, we unfold the left side in Eq. (18) based on the definition of and there are


Recalling that PingAn greedily insure the best copy for a task in each round. Thus, we have


Apparently, it follows that


The fourth inequality in the above follows the factor that . Substituting Eq. (19) into Eq. (21), we conclude the Eq. (18) and further prove the Eq. (15). Based on the Eq. (15), for any integer , we have

The proof completes. ∎

a.2. Proof of Theorem 3

The potential function we defined in our analysis is extended from (Im et al., 2016) and (Xu and Lau, 2015b). We assumes that the rate function of the optimal adversary in our potential function analysis is a concave and strictly increasing function of , which is a model generally adopted in many studies about online parallel scheduling problem (Xu and Lau, 2015b; Im et al., 2016; Edmonds and Pruhs, 2009; Fox et al., 2013). The Proposition 1 is proved in (Xu and Lau, 2015b), thus we used it directly in the following theorem proof.

Proposition 1 ().

Consider any continuous and concave function : with . Then for any , we have .


Let where and represent the remaining unprocessed workload for task in Job at time under the optimal scheduling policy and PingAn insurance algorithm respectively.

The potential function for a single task is defined as follow

where .

The overall Potential Function for all the jobs arriving at the system is defined as

where denotes the set of alive jobs in PingAn at time . Further let and indicates the jobs and tasks that have not completed at time in the optimal scheduling.

The potential function is differentiable, and we have


Obviously, it holds that and the value of the potential function does not increase when a job arrives or completes in PingAn and the optimal adversary. Thus, we analyze the change of at the time that no job arrives or completes. Some notations shown in Table 3 are used in the following change analysis.

Notations Corresponding meaning
The completion time of job under algorithm
The accumulated flow time of job at time under algorithm
The accumulated sum of job flowtimes at time under algorithm
The flowtime of
The sum of job flowtimes under algorthm
Table 3. The Notation in Approximation Analysis
  • Calculating , the changes in due to the optimal scheduling:

    For a task , the change made by the optimal scheduling is denoted as . Based on the definition of potential function, we expand and bound it in the equation below


    Applying the definitions Eq. (6) and Eq. (7) in optimal scheduling with speed function , we have


    According to Eq. (24), the following formula is yielded


    Let be the number of slots assigned to task of job in the optimal scheduling. Substituting Eq. (25) into Eq. (23), we have


    Recalling that the expected rate of a task running on a slot need to be greater than fraction of the global optimal rate of the task. We denote the proportion as for logogram. Based on the rule of lower limit rate, we have


    Substituting the Eq. (27) into the Eq. (26) yields that


    Considering two cases, when , we have as the result of the monotonic property of function; when , based on Proposition 1, we have . In consequence, it holds that


    Based on Eq. (22), it follows that


    where introduced in Eq. (32) is the most copy numbers of tasks made in the optimal scheduling. The second term in Eq. (32) is deduced following the slot number restriction. Based on the flowtime definition, we derive the Eq. (33) and the Eq. (34).

    At this point, we get the change bound caused by the optimal scheduling

  • Calculating , the changes in due to PingAn insurance:

    PingAn runs at speed of faster. Let be the number of slots assigned to task of job in PingAn at time . We expand based on the definitions of potential function and it holds that


    The third equality is based on the Eq. (25) with a replacement of rate in PingAn. According to the insuring policy in PingAn, we have and , further based on Proposition 1, we derive the following Eq. (36) that