An Efficient Fault Tolerant Workflow Scheduling Approach using Replication Heuristics and Checkpointing in the Cloud

10/15/2018 ∙ by S. Jaya Nirmala, et al. ∙ National Institute Of Technology Tiruchirappalli

Scientific workflows are predominantly used for complex, large-scale data analysis and scientific computation and automation, and the need for robust workflow scheduling techniques has grown considerably. However, most existing workflow scheduling algorithms do not provide the required reliability and robustness. In this paper, a new fault tolerant workflow scheduling algorithm that learns replication heuristics in an unsupervised manner is proposed. Furthermore, the use of light-weight synchronized checkpointing enables efficient resubmission of failed tasks and ensures workflow completion even in precarious environments. The proposed technique improves upon metrics like Resource Wastage and Resource Usage in comparison to the Replicate-All algorithm, while maintaining an acceptable increase in Makespan as compared to the vanilla Heterogeneous Earliest Finish Time (HEFT) algorithm.





Scientific workflows are described as a “useful paradigm to describe, manage, and share complex scientific analysis” taverna . A workflow is a formal way to express a calculation. It involves multiple tasks of different sizes and characteristics, with control and data dependencies between them. Workflows also capture the various parameters of each task, such as its inputs and outputs videoworkflow . They have emerged as comprehensive tools for managing complex computations and their storage requirements, and are used in many domains such as neuroscience, high-energy physics and genetics.

Many authors have studied the advantages of using the cloud environment for executing scientific workflows hoffa2008use ; deelman2010grids ; juve2012resource and have claimed that the cloud environment enables workflow execution at low cost and that the virtualization overhead due to the cloud is minimal (in the range of 1% to 10%).

Efficient scheduling of scientific workflows helps in reducing the makespan or execution time, in meeting deadlines and in minimizing the cost. As the problem of scheduling tasks simultaneously on multiple processors with a start and an end time is NP-complete, researchers have relied on heuristic and meta-heuristic optimization techniques to schedule them.

Failures in scientific workflows increase the makespan and waste a lot of workforce and time. The different types of failures that can occur during an execution are task failures, VM failures and workflow-level failures. Fault tolerance for scientific workflows can be provided either at the task-level or workflow-level hwang2003grid . Task-level scheduling techniques involve retry, checkpointing and the use of alternate resources for the same task. On the other hand, workflow level scheduling involves the usage of alternate tasks, redundancy, user-defined exception handling and rescue workflows. In this paper, a new fault tolerant workflow scheduling approach called Checkpointing and Replication based on Clustering Heuristics (CRCH) is proposed. It uses replication, resubmission, checkpointing and provides fault-tolerance in an efficient manner.

In the scheduling step, the workflow tasks are replicated and then scheduled. The multiple copies guard against the failure of any single copy and increase the probability of the task's successful completion. If one copy fails, one of its replicas is scheduled and executed plankensteiner2009new ; zhang2009combined . Task resubmission grandstrand:2004 ; plankensteiner2009new is also widely used for fault tolerance in workflow scheduling. It takes place during the execution phase: the failed task is resubmitted either to the same resource or to a different one. Resource usage and wastage are lower with task resubmission than with replication, but the execution time may be higher. Replication generates identical copies of a task. Since the copies have the same dependencies, a system with sufficient parallel resources can execute them in parallel, saving execution cost. Checkpointing grandstrand:2004 ; zhang2009combined is one of the time-efficient fault tolerant methods. In synchronous checkpointing, the states of the tasks or processes are saved at regular intervals. Whenever a Virtual Machine (VM) fails, the process restarts from the previously saved state. Thus, this method is advantageous over methods that reschedule a duplicate of the task.

One of the key contributions of this paper is an unsupervised way of learning replication counts for tasks. In comparison to other replication heuristics plankensteiner2009new , this approach is much quicker and more robust, as it does not involve exploring every possible solution (HEFT schedules with varying sets of replicas) in a combinatorial optimization problem. Along with this, a checkpointing mechanism that enables dynamic resubmission of tasks on the most suitable resource is proposed. An elaborate analysis of well-established metrics like Resource Usage, Resource Wastage and Total Execution Time shows that the proposed algorithm performs better than existing state-of-the-art workflow scheduling techniques, even in highly faulty environments.

The outline of the paper is as follows: Related Work discusses some of the novel and significant progress made in the field of fault tolerant workflow scheduling. Proposed Methodology and Performance Analysis discuss in depth the algorithms we propose, followed by a performance benchmarking against state-of-the-art methods. The concluding remarks, along with directions for future research, are presented in Conclusion and Future Work.

Related Work

Yu and Buyya yu2005taxonomy give a brief overview of the various fault tolerant workflow scheduling techniques. Fault tolerance for workflow applications is provided either at the task level or at the workflow level. Replication of tasks or data, Resubmission, Checkpointing and Alternate Resource are widely used techniques at the task level, whereas Alternate Task, User-defined Exception Handling and Rescue Workflow are widely used at the workflow level. Poola et al. grandstrand:2004 give a comprehensive survey of the fault tolerant techniques employed in various Workflow Management Systems (WFMS). They also present a detailed taxonomy of the different techniques employed for fault tolerance in distributed environments and discuss a variety of metrics used for quantifying fault tolerance. plankensteiner2007fault discusses the fault-tolerant techniques employed in various grid WFMS. The survey reveals that resubmission techniques are the most widely used for providing fault tolerance in workflows, followed by replication and checkpointing.

Poola et al. poola2014robust use the concept of slack time to generate robust schedules for scientific workflows to enable them to withstand failures in the cloud environment. They use a common set of parameters to model the stochasticity of all VMs. They also assume that there exists no resource contention, which can be a strong assumption in highly faulty environments. In this proposed methodology, resource failure parameters are sampled from various distributions, thus making the system more robust. Plankensteiner et al. plankensteiner2009new estimate the replication count of a task from its Resubmission Impact (RI) heuristic. Their approach creates multiple workflows, each with a particular task duplicated by a constant value. The replication count is estimated from a normalized score assigned to each task, based on how much they impact the execution time (had they been replicated). On the other hand, the approach in this paper infers replication count using an unsupervised machine learning algorithm that gives more accurate estimates and saves the time involved in re-computing HEFT schedules.

Zhang et al. zhang2009combined integrate the vanilla HEFT topcuoglu2002performance / Duplication Scheduling Heuristic (DSH) kruatrachue1988grain schedules with the over-provisioning algorithm proposed by Kandaswamy et al. kandaswamy2008fault . To meet the constraints on the overall workflow DAG (Directed Acyclic Graph) success probability, which diminishes exponentially with the addition of tasks to the workflow, the entire DAG is over-provisioned on a distinct set of resources. This leads to increased Resource Usage. The over-provisioning algorithm proposed by kandaswamy2008fault finds the solution to a combinatorial optimization problem that meets both performance and reliability constraints for a task. Although the assumption of independent binomial distributions for resource failures seems reasonable, the computation of the expected execution time of a task on a resource cannot be agnostic to the state of the current workflow execution.

Chen and Deelman chen2012fault introduce horizontal/vertical task clustering based on the workflow structure. Having defined Gamma/Weibull distributions for task runtime, overhead time and job (collection of tasks) runtime, they use Maximum Likelihood Estimates (MLE) for the distribution parameters, along with corresponding conjugate priors for dynamic estimation. For the parameter estimation to converge, large datasets of task/job runtimes are required. In contrast, the clustering technique proposed in this paper relies on grouping dense task embeddings. These embeddings are based on task/task-neighborhood structural characteristics (like edges, DAG order, etc.).

Zhang et al. zhang2009combined use the technique to find the smallest subset of resources on which to replicate the tasks such that they satisfy their performance and reliability constraints. If the smallest subset of resources cannot be found, the success probabilities for all the resource combinations are calculated, and the tasks are replicated on the resource set with the highest success probability. The method proposed by plankensteiner2009new does not use checkpointing, but resubmits a task when all of its replicas have failed. Resubmission of the whole task significantly increases the execution time of the task, which in turn increases the workflow’s makespan. In contrast, the fault tolerant approach proposed in this paper employs replication heuristics and light-weight checkpoint/restart techniques at the task level. The replication heuristic calculates the number of replications needed for each of the tasks in the workflow, and thereby reduces the resource waste and execution cost. Light-weight checkpointing lets the system operate with minimal stable storage and makes the transfer of intermediate data more manageable, and hence reduces execution time.

Proposed Methodology

Figure 1: CRCH Algorithmic Design

Notation Explanation
EST Estimated Start Time of a task
EFT Estimated Finish Time of a task as decided by Algorithm-2
AST Actual Start Time of a task
AFT Actual Finish Time of a task
TET Total Execution Time of the Workflow
Average execution time of a task
Average time to transfer a given number of units of data between two tasks
Time taken for a task to execute on a resource
Data transfer rate between two resources
List of tasks in the workflow
List of resources available
Set of dependencies (t, t', d), where one task is a parent of the other and sends it d units of data
All the parents of a task
Number of times a task has failed to complete execution
Replica count for a task
Set of features for a task
PCA Principal Component Analysis
COV Coverage of Variance
Number of completed checkpoints for a task
Set of replicas generated for the given task
isBusy Returns true if the VM is busy executing a backlog of tasks
Table 1: Notation Index

HEFT is a suitable base algorithm for scheduling the tasks of a workflow. Wieczorek et al. wieczorek2005scheduling analyze and evaluate the performance of HEFT, Genetic Algorithms (GA) and a simple "myopic" algorithm for scheduling scientific workflows. Their results show that the full-graph scheduling technique with the HEFT algorithm performs best when compared to the other strategies. Hence, HEFT is used to determine the initial schedule of tasks and their corresponding replicas. Furthermore, the performance of CRCH in faulty environments has been benchmarked against that of HEFT. The basic HEFT algorithm does not involve any fault tolerance. If a resource fails, any task executing on that resource also fails and therefore the workflow itself fails. One approach to providing fault tolerance is to generate replicas of the tasks. Even if the task fails, the replicas can generate the intermediate results.

zhang2009combined mentions a ReplicateAll algorithm that replicates each task by a constant factor. This redundancy improves the probability of completion of the workflow, although it increases the Total Execution Time (TET) or Makespan by a huge margin. grandstrand:2004 ; plankensteiner2009new use resubmission wherein a task that could not complete execution due to a failed resource is resubmitted on another. This method suffers from the loss of computation involved in resubmitting a task that has almost completed. grandstrand:2004 ; zhang2009combined use checkpointing wherein the states of the tasks are saved at regular intervals so that upon resubmission they can start execution from the previously saved state. Table 1 elucidates the various terms/abbreviations referenced throughout the paper.

CRCH Algorithm

Figure 1 highlights the data flow path and the processing involved in the proposed method, CRCH.

The CRCH method consists of three modules, namely Clustering, Replication and Checkpointing. The Clustering phase facilitates the computation of replication counts based on the properties of tasks. In the Replication phase, the tasks are replicated by applying the replication heuristics, and the standard HEFT algorithm decides the overall schedule. In the final Checkpointing and resubmission phase, a light-weight strategy is used, wherein only pointers to the saved state are stored.

The average execution time of the task and the average transfer time are represented using the Equations (1) and (2) respectively.


Further, each task is represented as a point in an n-dimensional space. Let each task be denoted by a feature vector; that is, each task has n features associated with it. These features can be nominal (priority of a task) or numeric (average execution time). The axes of the n-dimensional space denote the various features. Some of the possible features are:

  1. Average execution time of the task

  2. Average time to transfer data from the parents of the task

  3. Priority of the task

  4. Number of parents

  5. Number of children
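The feature list above can be sketched in code. This is only an illustration under stated assumptions: the function name `build_features`, the input layout (per-resource runtimes, an edge map carrying data units, a single average transfer rate) and the sample values are all hypothetical and do not come from the paper.

```python
from statistics import mean

def build_features(exec_times, edges, priority, avg_rate):
    """Return {task: [avg_exec, avg_transfer_in, priority, n_parents, n_children]}.

    exec_times: task -> list of runtimes, one per resource (assumed layout)
    edges:      (parent, child) -> units of data sent (assumed layout)
    avg_rate:   one average transfer rate for all resource pairs (simplification)
    """
    features = {}
    for task, runtimes in exec_times.items():
        incoming = [d for (p, c), d in edges.items() if c == task]
        n_children = sum(1 for (p, c) in edges if p == task)
        avg_exec = mean(runtimes)
        # Entry tasks have no parents, so their incoming transfer time is zero.
        avg_transfer = mean(d / avg_rate for d in incoming) if incoming else 0.0
        features[task] = [avg_exec, avg_transfer, priority[task],
                          len(incoming), n_children]
    return features

exec_times = {"A": [4.0, 6.0], "B": [10.0, 14.0], "C": [3.0, 5.0]}
edges = {("A", "B"): 20.0, ("A", "C"): 10.0}
priority = {"A": 2, "B": 1, "C": 1}
features = build_features(exec_times, edges, priority, avg_rate=10.0)
```

Each task ends up as one point in a five-dimensional space; a real deployment would likely add further features and would need to standardize them before any distance-based processing.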

Clustering Module

The replication count for each task can be determined using multiple Machine Learning techniques, ranging from Supervised Classification like Logistic Regression, Max Entropy Models, etc. to Unsupervised Classification like Clustering, LSA and PLSA ISLR2013 ; PCA2002 . In general, this can be treated as a multi-class classification problem, where the inputs are task representations and the target values are one-hot vectors $y$ ($y_k = 1$ if the replication count for the input task is $k$). When substantial labeled training data is present, a Multilayered Perceptron (MLP) works reasonably well. For each target class $k$, there exists a weight vector $w_k$. $P(k \mid t)$, represented by Equation (3), denotes the probability that task $t$ has replication count $k$, where $x_t$ is the feature representation for task $t$.

\[ P(k \mid t) = \frac{\exp(w_k^\top x_t)}{\sum_{j=1}^{M} \exp(w_j^\top x_t)} \tag{3} \]

Equation (3) is a standard softmax formulation, and the MLP can be trained with a cross entropy loss (Equation (4)) and an optimizer like Stochastic Gradient Descent or Batch Gradient Descent, or, for faster convergence, RMSprop or Adam ISLR2013 .

\[ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{M} y_{ik} \log \hat{y}_{ik} \tag{4} \]

Here $y_i$ is a one-hot encoding vector denoting the class to which the $i$-th observation belongs, and $\hat{y}_i$ is a vector denoting the probability distribution over the $M$ classes predicted by the logistic classifier for the $i$-th observation.
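As a minimal sketch of the supervised formulation described here, the snippet below computes softmax class probabilities from one weight vector per class and evaluates a mean cross-entropy loss. The weights, the feature vector and the function names are made up for illustration; no training loop is shown, and nothing here is taken from the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, x):
    """weights[k] is the weight vector of class k; x is a task's feature vector."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    return softmax(scores)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot targets and predicted distributions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total -= sum(y_k * math.log(p_k + eps) for y_k, p_k in zip(y, p))
    return total / len(y_true)

# Hypothetical example: three candidate replication counts, two features.
weights = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]]
x = [2.0, 1.0]
probs = predict_proba(weights, x)          # class scores are 0.0, 1.0 and 1.5
loss = cross_entropy([[0, 1, 0]], [probs])
```

The class with the highest probability would be taken as the predicted replication count; the loss is what an optimizer such as SGD or Adam would minimize over labeled workflows.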


To generalize well, this method requires a large training set with already known replication counts for tasks in multiple workflows ISLR2013 . Since such a set of observations is not readily available and needs to be compiled over a period of time, an unsupervised scheme is adopted instead. One such method is Principal Component Analysis (PCA) PCA2002 , where the feature vectors are projected onto fewer component vectors than the original number of features, such that the coverage of variance in the data is greater than a predefined threshold. PCA identifies correlated features, which helps in representing the points in a lower dimensional space. The principal components are orthogonal unit vectors. With the addition of each such vector, the portion of the original variance in the data that is covered increases. This improved representation not only facilitates faster clustering but also prevents over-fitting to a certain extent PCA2002 . Since the features used are of different scales, from numeric (the average execution time of a task, or its number of parents/children) to ordinal (priority), the data needs to be mean-subtracted and standardized before the application of PCA. PCA predominantly uses the correlation/covariance matrix of the set of features. The analysis has been done on a training set of 100 points, each representing a task in a ten dimensional space, with a coverage-of-variance threshold of 0.8. Intuitively, it can be seen that the number of parents is positively correlated with the average transfer time of data from parents. PCA helps remove such co-dependence.
Since a dataset of task features can be enormous even for the basic characteristics of a task (execution time, priority), and its basis expansion would be even larger, a method like PCA helps identify coherent features, such as criticality against the number of dependents of the task. Steps 2-9 in Algorithm 1 focus on determining the set of principal components, and step 10 deals with the projection of the original points onto the new basis.
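The idea of adding principal components until a variance-coverage threshold is met can be sketched with the standard library alone. The sketch below uses power iteration with deflation in place of a library eigensolver, which is a simplification chosen for self-containment and not necessarily what the authors used; the function names and the sample data are hypothetical.

```python
import math

def standardize(rows):
    """Mean-subtract each column and scale it to unit variance."""
    n, d = len(rows), len(rows[0])
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Covariance matrix of already centered data."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]

def top_eigenpair(mat, iters=500):
    """Dominant eigenvalue/eigenvector by power iteration (fixed start vector)."""
    d = len(mat)
    v = [float(i + 1) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(mat[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigval, v

def pca_project(rows, threshold=0.8):
    """Add components until the covered share of variance reaches the threshold."""
    data = standardize(rows)
    cov = covariance(data)
    d = len(cov)
    total_var = sum(cov[i][i] for i in range(d))
    components, covered = [], 0.0
    while total_var > 0 and covered < threshold and len(components) < d:
        eigval, vec = top_eigenpair(cov)
        components.append(vec)
        covered += eigval / total_var
        # Deflation: remove the found direction so the next pass finds the next one.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(d)] for i in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in data]
    return projected, covered

# Hypothetical data: the second feature is exactly twice the first (fully correlated).
rows = [[1, 2, 5], [2, 4, 3], [3, 6, 8], [4, 8, 1], [5, 10, 4]]
projected, covered = pca_project(rows, threshold=0.8)
```

With the two perfectly correlated columns, two components are enough to pass the 0.8 threshold, so each task is represented by two coordinates instead of three.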

CRCH involves hierarchical clustering on the projected points PCA2002 ; ISLR2013 . The distance between two clusters, characterized using the average Euclidean distance between every pair of points belonging to the two clusters, is given by Equation (5). This can be referred to as the affinity between the two clusters yang2016joint . In each iteration, clusters are merged to form superclusters, based on their affinity to neighbors. The agglomerative clustering strategy used has been derived from the triplet loss method, which has been popular in clustering high dimensional deep neural embeddings yang2016joint .

Triplet loss ensures that the supercluster under consideration, $C_i$, is merged with one of its neighbors, $C_j$, based not only on the closeness of the two superclusters but also on the additional condition that every other neighbor $C_k$ ($k \neq j$, $C_k \in N_i$, where $N_i$ is the set of closest superclusters to $C_i$) is much further away from $C_i$. The pair of clusters that minimizes Equation (6) is merged into a supercluster for the subsequent time step. At any given time step, let $d(C_a, C_b)$ represent the distance between two superclusters $C_a$ and $C_b$. Then,

\[ d(C_a, C_b) = \frac{1}{|C_a|\,|C_b|} \sum_{x \in C_a} \sum_{y \in C_b} \lVert x - y \rVert_2 \tag{5} \]
The time step at which the clustering converges is decided by the dendrogram generated by the superclusters PCA2002 ; ISLR2013 . At each level the number of branches in the dendrogram decreases, until a point is reached at which the minimum inter-cluster distance exceeds a certain threshold. The number of branches in the dendrogram at this time step indicates the final number of superclusters PCA2002 ; ISLR2013 . Once steps 11-17 have built the superclusters, steps 18-19 deal with the assignment of replication counts, which can be decided by the size and other summary statistics of the superclusters, like average execution time, average priority, etc.

Figure 2 shows the state of the agglomerative clustering procedure at a given time step. Each blob denotes the latent representation of a task, while the dotted lines represent the inter-cluster distances under consideration (distances not impacting the loss are not shown). Blobs in the same color belong to the same supercluster, and the figurative cluster centers of the superclusters are also labeled. In a traditional clustering approach, a cluster ends up being merged with its closest neighbor. In the presence of triplet loss, a cluster that is not quite as close to its nearest neighbor, but is considerably far from the remaining neighbors, forms the green supercluster. This prevents the agglomeration from collapsing into superclusters of sizes that lie on either end of the spectrum. For example, with traditional approaches, it is not uncommon for agglomerative clustering to converge into either very large clusters or very tiny ones (around three tasks or fewer). Figure 3 shows the progress of supercluster formation across iterations, along with the assignment of replication counts based on the final sizes of the superclusters.

Figure 2: Action taken at a given time step by triplet loss (left) and agglomerative clustering (right).
(to be viewed in color)
Figure 3: Clustering progress across iterations. (to be viewed in color)

Many scientific workflows exhibit properties wherein most of the tasks can be segmented into a few large clusters. Each segment encompasses tasks having similar features, and they end up having low replication counts. The outliers based on high priority or high average execution time belong to clusters of much smaller sizes and hence are assigned a higher replication count. Such an assignment policy may lead to cases wherein a low priority task taking less time to execute (on average) gets allotted more replicas than needed as it ends up being an outlier. Simple rule ensembles, based on sufficient statistics of the feature values in the supercluster can be learned to avoid this.

1: procedure replicationCount(taskList, threshold, K)
2:     Let COV = 0.0    ▷ Coverage of Variance
3:     Let principalComponents = {}
4:     Let dataSet = features(taskList)
5:     do
6:         Let PC = nextPrincipalComponent(dataSet)    ▷ Orthogonal unit vector
7:         COV = COV + variance(PC)    ▷ Additional variance accounted
8:         principalComponents = principalComponents ∪ {PC}
9:     while COV < threshold
10:     dataSet = dataSet × principalComponents    ▷ Project points in lower dimensional space
11:     Let clusters = dataSet
12:     do
13:         Let newClusters = hierarchicalClustering(clusters)    ▷ triplet loss
14:         Let oldClusters = {C | C ∈ clusters, ∃ C’ ∈ newClusters, C ⊂ C’}
15:         clusters = (clusters − oldClusters) ∪ newClusters
16:     while the minimum inter-cluster distance does not exceed the dendrogram threshold
17:     Sort clusters based on the size of each cluster in descending order of size (or other statistics)
18:     for each cluster C_i ∈ clusters do
19:         replicaCount(t) = i, ∀ t ∈ C_i    ▷ Assign replication counts
Algorithm 1 Replication Count Algorithm
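The clustering and assignment steps can be illustrated with a short sketch. Note the substitution: the code below uses plain average-linkage agglomerative clustering with a distance threshold as the stopping rule, not the triplet-loss merge criterion of the paper, and it assigns replication counts purely by cluster-size rank starting at 1. All names and the sample points are hypothetical.

```python
import math
from itertools import combinations

def avg_distance(a, b):
    """Average Euclidean distance over every pair of points from clusters a and b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points, max_merge_distance):
    """Merge the closest pair of clusters until the closest pair is too far apart."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_distance(clusters[ij[0]], clusters[ij[1]]))
        if avg_distance(clusters[i], clusters[j]) > max_merge_distance:
            break  # dendrogram threshold reached
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

def replication_counts(clusters):
    """Largest cluster gets count 1, the next gets 2, and so on.

    Points are used as dictionary keys, so they must be distinct tuples.
    """
    ranked = sorted(clusters, key=len, reverse=True)
    return {point: rank
            for rank, cluster in enumerate(ranked, start=1)
            for point in cluster}

# Hypothetical projected task points: one large group, one pair, one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (30, 30)]
clusters = agglomerate(points, max_merge_distance=3.0)
counts = replication_counts(clusters)
```

In this toy run the four tightly grouped tasks receive the lowest count and the lone outlier receives the highest, which mirrors the policy that outlier tasks are replicated more.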

Replication Module

The tasks that are part of the initial workflow graph are the original tasks, and their duplicates are referred to as replicas. Once Algorithm 1 has ascertained the replication count for each task, Algorithm 2 defines the HEFT schedule for each original task and its replicas.

The original tasks are sorted in descending order of their estimated start times (calculated based on the inherent dependencies). Once an initial task is scheduled on a resource, the algorithm checks whether all of its original sibling tasks have also been scheduled. If so, the replicas of this initial task are mapped to the resources that can offer them an execution interval with the minimum EST.

The following section discusses the influence of task checkpoints on resubmissions (due to resource failures) at runtime. Here, it is assumed that there exists a subset of VMs which are entirely reliable (non-failing). Other, non-reliable resources may not be available when a task is scheduled to execute or may go down during the execution of a task. Resource failures cause delays in task execution, and this results in the AFT (Actual Finishing Time) being greater than the EFT (Estimated Finishing Time). Such delays, when propagated along the critical path of the workflow, have a direct impact on TET. Some portion of the execution delays for tasks that are not on the critical path tends to be absorbed by the slack time that occurs between two noncritical tasks. It is also possible that a resource is busy executing a backlog of tasks (in HEFT order) right when a task scheduled to run on that resource has met its dependencies, so that the task ends up waiting for that resource.

1: procedure HEFT(taskList, dependenciesList)
2:     for t ∈ taskList do
3:         Calculate rank(t) = B-Level of the task in critical path calculation
4:     sort taskList based on rank(t) in descending order
5:     for t ∈ taskList do
6:         Schedule t on the resource with minimum EFT
7:         if t is an original task then
8:             if all original siblings of t have been scheduled then
9:                 Schedule replicas of t on VMs with minimum ESTs
Algorithm 2 HEFT with Over-Provisioning
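The ranking step of Algorithm 2 can be sketched as the standard HEFT upward rank, which is taken here as the B-level: a task's average execution time plus the largest value of (average communication cost + rank) over its children. The input dictionaries and their values are hypothetical, and resource selection and replica placement are deliberately left out.

```python
from functools import lru_cache

def upward_ranks(avg_exec, avg_comm):
    """avg_exec: task -> average execution time.
    avg_comm: (parent, child) -> average communication time on that edge."""
    children = {}
    for parent, child in avg_comm:
        children.setdefault(parent, []).append(child)

    @lru_cache(maxsize=None)
    def rank(task):
        succ = children.get(task, [])
        # Exit tasks have no children, so their rank is their own execution time.
        tail = max((avg_comm[(task, c)] + rank(c) for c in succ), default=0.0)
        return avg_exec[task] + tail

    return {t: rank(t) for t in avg_exec}

# Hypothetical diamond-shaped workflow: A -> {B, C} -> D.
avg_exec = {"A": 5.0, "B": 12.0, "C": 4.0, "D": 6.0}
avg_comm = {("A", "B"): 2.0, ("A", "C"): 1.0, ("B", "D"): 3.0, ("C", "D"): 1.0}
ranks = upward_ranks(avg_exec, avg_comm)
order = sorted(ranks, key=ranks.get, reverse=True)
```

Sorting by descending rank guarantees that every task appears after all of its parents, which is what lets the scheduler place tasks one by one in that order.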

Checkpointing Module

Algorithm 3 defines the execution of a workflow at runtime with respect to checkpointing and resubmission. The environment is modeled as a set of resources that fail depending on their respective Mean Time To Repair (MTTR) and Mean Time Between Failures (MTBF). The size and duration of these failures depend on the type of environment (Stable, Normal or Unstable). In the case of a perfect environment with no failures, and assuming no checkpoints are involved, the TET is determined by the task schedule, as described in the section Replication Module. At any given point, each task has a number of checkpoints that have been completed for it. For a task that is scheduled to start on a given VM and has a given number of completed checkpoints, the resulting schedule for the task is represented using Equation 7.


A schedule S is a set of tasks, each task t being scheduled on a VM v for the interval from its estimated start time to its estimated finish time. Only when v belongs to the set of failing VMs (FVM) may t need to be rescheduled on a non-failing VM, that is, a VM not in FVM. The two cases to consider are whether v fails during the execution of t (Case 1) or v is already down at the scheduled start time of t (Case 2). For each task, the number of checkpoints completed so far is tracked. These checkpoints are global and synchronized, that is, at each checkpoint, the working memory is dumped to a nonvolatile stable storage associated with each VM. When a task completes its execution on v, the result of its execution is stored in the stable storage associated with v. The pointer to the location on stable storage is stored in a global memory. This pointer can be referenced using a hash value of the task id for quick access. Along with the pointer, if v belongs to FVM, the results are also stored in the global memory. The tasks that use the results generated by t can fetch the data by using pointers obtained from dereferencing the global memory. In case the VM on which the parent was executed is down, the data can be fetched directly from the global memory. Each VM has a set of time intervals during which the resource is unavailable to execute a task that has been scheduled on it (derived from failure parameters sampled from log-normal and Weibull distributions).
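The pointer-based result store described in this paragraph can be sketched as a small registry. Everything in it is illustrative: the class name `CheckpointRegistry`, the `keep_copy` flag (standing in for the condition under which results are also kept in global memory, assumed here to apply to failure-prone VMs) and the dictionary used as stable storage are hypothetical.

```python
class CheckpointRegistry:
    """Global memory: task id -> pointer into a VM's stable storage."""

    def __init__(self):
        # Dictionary keys are hashed, mirroring the hash-of-task-id lookup.
        self._entries = {}

    def record(self, task_id, vm_id, location, result=None, keep_copy=False):
        """Store the pointer; optionally keep the result itself as a fallback."""
        self._entries[task_id] = {
            "vm": vm_id,
            "location": location,
            "copy": result if keep_copy else None,
        }

    def fetch(self, task_id, stable_storage, vm_is_up):
        """Dereference the pointer, falling back to the stored copy if the VM is down."""
        entry = self._entries[task_id]
        if vm_is_up(entry["vm"]):
            return stable_storage[entry["vm"]][entry["location"]]
        if entry["copy"] is not None:
            return entry["copy"]
        raise LookupError(f"result of {task_id} is unreachable while {entry['vm']} is down")

# Hypothetical setup: vm1 is reliable, vm2 is failure-prone and currently down.
stable_storage = {"vm1": {"/ckpt/t1": "out-t1"}, "vm2": {"/ckpt/t2": "out-t2"}}
down = {"vm2"}
is_up = lambda vm: vm not in down

registry = CheckpointRegistry()
registry.record("t1", "vm1", "/ckpt/t1")
registry.record("t2", "vm2", "/ckpt/t2", result="out-t2", keep_copy=True)
t1_result = registry.fetch("t1", stable_storage, is_up)
t2_result = registry.fetch("t2", stable_storage, is_up)
```

Keeping only pointers for reliable VMs is what keeps the global memory small; full results are duplicated only where the owning VM may disappear.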

A task may fail to comply with the HEFT schedule for one of the following reasons.

  • Resource fails during the execution of a task.

  • Task is scheduled to start on a VM that is currently down.

  • Task is scheduled to start on a VM that is executing a backlog of tasks.

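1: while ∃ t such that t is not scheduled do
2:     schedule t on VM v using Algorithm 1, Algorithm 2
3:     if isBusy(v) then
4:         if t is the last unscheduled replica then
5:             Wait for v to be online
6:         else
7:             Terminate t
9:     if v ∈ FVM then
10:         if t has started execution on v then
12:             if t finishes before X then    ▷ The task finishes before X
13:                 Continue
15:             if all replicas of t, including t itself, have failed then
20:                 if (re-executing t on the non-failing VM is faster than resuming on v) then
22:                 else
23:                     ▷ Reduced Execution Time
24:         else    ▷ t is scheduled on a resource that is currently down
26:             if all replicas of t, including t itself, have failed then
30:                 if (the minimum EST on a non-failing VM is earlier than the time at which v is back online) then
32:                 else
Algorithm 3 CheckpointHEFT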

Light-weight checkpointing reduces the load on the global memory and thus reduces the chance of a possible bottleneck that can result from a large number of memory accesses.

Algorithm 3 takes as input the HEFT schedule defined by Algorithm 1 and Algorithm 2. There is a possibility that a task t scheduled on a failing VM v cannot begin execution at its scheduled start time because of a backlog of tasks that are currently executing on v. If t is the last replica of a given task which has not executed successfully, it needs to wait until a VM is available; otherwise it can be terminated and deemed a failure (Steps 3-8). Tasks that have their dependencies satisfied and yet cannot be executed on their scheduled resource, due to the backlog of tasks being executed on it, can be deemed failures. However, in unstable environments where a sizable proportion of tasks are unable to find their scheduled resource free even when their dependencies are fulfilled, it has been found that it is better not to consider them failures, as doing so would lead to unnecessary resubmissions on a small set of non-failing VMs, thus forming a bottleneck. This would simply increase resubmissions and would not exploit the usage of replicas. Step 9 checks if a task is scheduled on an unreliable VM, and if so, whether the task has begun execution or not. Step 11 defines variables X, Y which represent the points in time between which the resource in question is down. If the task completes before the next time interval when the resource is down, execution proceeds normally (Steps 12-13). Steps 16-17 identify the non-failing VM with the minimum EST, whereas steps 18-19 calculate the proportion of the task completed (in terms of execution time), which would constitute the overhead involved in scheduling the task on a different resource. If the task is rescheduled on the same VM it can resume from the latest checkpoint; if not, it pays the overhead of re-executing the already completed checkpoints. Steps 20-23 describe this comparison. Steps 14-15 and 25-26 ensure that a task gets resubmitted if and only if all its preceding replicas, including itself, have failed execution.

The case in which the resource is already down at the scheduled start time of the task is handled by comparing the minimum possible EST on a non-failing VM with the latest point in time at which the currently down resource would become available again (Steps 26-32).
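The comparison between resuming on the original VM and resubmitting elsewhere can be sketched as follows. This is a simplified reading of the description: checkpoint overheads are ignored, the function and parameter names are hypothetical, and a resubmitted task is assumed to restart from scratch on the other VM.

```python
def choose_recovery(exec_time_same, exec_time_other, checkpoints_done,
                    interval, same_vm_up_at, other_vm_est):
    """Return ("resume" | "resubmit", estimated finish time).

    exec_time_same:  full execution time of the task on its original VM
    exec_time_other: full execution time on the best non-failing VM
    checkpoints_done, interval: completed checkpoints and their spacing
    same_vm_up_at:   time at which the failed VM is expected back online
    other_vm_est:    earliest start time on the non-failing VM
    """
    # Work already saved by checkpoints is only reusable on the original VM.
    saved = min(checkpoints_done * interval, exec_time_same)
    finish_same = same_vm_up_at + (exec_time_same - saved)
    finish_other = other_vm_est + exec_time_other
    if finish_other < finish_same:
        return "resubmit", finish_other
    return "resume", finish_same

# Hypothetical numbers: a long outage favors resubmission, a short one favors resuming.
long_outage = choose_recovery(100, 120, 3, 20, same_vm_up_at=500, other_vm_est=60)
short_outage = choose_recovery(100, 120, 4, 20, same_vm_up_at=70, other_vm_est=60)
```

The more checkpoints a task has completed, the more attractive waiting for the original VM becomes, since only the remaining portion has to run there.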

Since the failure count of an original workflow task is a task-level attribute, multiple processes running on different VMs may contend to update the count simultaneously. This justifies the need for a cache coherence strategy like MESI to avoid inconsistencies.

Dynamic Checkpoint Interval

The checkpoint interval is a global parameter which defines the interval between two consecutive checkpoints. Each checkpoint has an overhead associated with it. To improve the TET, the checkpoint interval can be set to an optimal value depending on the environment of execution. In a highly stable environment failures are rare, so a longer checkpoint interval reduces the TET: it means fewer checkpoints in total and hence a lower total checkpoint overhead. In an unstable environment with a large number of failures, the TET may improve with a shorter checkpoint interval, as the work lost from having checkpoints further apart becomes the limiting factor rather than the overhead of the checkpoints.
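The trade-off can be made concrete with a first-order cost model. This is a textbook-style approximation introduced here only for illustration, not the derivation used in the paper: it assumes each failure loses, on average, half a checkpoint interval of work, and that the number of failures is known in advance. All names and numbers are hypothetical.

```python
import math

def expected_time(work, interval, overhead, failures):
    """Work + total checkpoint overhead + expected rework after failures."""
    n_checkpoints = work / interval
    return work + n_checkpoints * overhead + failures * interval / 2.0

def best_interval(work, overhead, failures):
    """Interval minimizing expected_time; with no failures, never checkpoint."""
    if failures == 0:
        return work
    return math.sqrt(2.0 * work * overhead / failures)

work, overhead = 1000.0, 2.0
stable = best_interval(work, overhead, failures=1)      # few failures
unstable = best_interval(work, overhead, failures=16)   # many failures
```

Under this model the preferred interval shrinks as failures become more frequent, which matches the qualitative argument above: stable environments favor sparse checkpoints, unstable ones favor dense checkpoints.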

Checkpoint interval analysis is heavily dependent on the critical path. A critical path of a DAG is a path from an entry node to an exit node whose length is the maximum kwok1999benchmarking . To compute the critical path of the workflow after the addition of replicas and the determination of the schedule using Algorithm 2, a simple backtracking approach is sufficient. Consider a task and the VM on which it is executed. Since the task is scheduled on a VM with the minimum possible start time, there exists at least one preceding task whose completion determines that start time. By induction, the critical path can be backtracked.
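The backtracking idea can be sketched as follows, assuming the schedule is already known. The inputs are hypothetical: `finish` maps each task to its finish time, and `ready` maps a (predecessor, task) pair to the time at which the predecessor stops blocking the task, whether through a data dependency or by occupying the same VM.

```python
def critical_path(finish, ready):
    """Backtrack from the last-finishing task through its binding predecessors."""
    preds = {}
    for (pred, task), when in ready.items():
        preds.setdefault(task, []).append((when, pred))

    task = max(finish, key=finish.get)   # the task that ends the workflow
    path = [task]
    while task in preds:
        # The predecessor that becomes available last is the one that
        # determined this task's start time.
        _, task = max(preds[task])
        path.append(task)
    path.reverse()
    return path

# Hypothetical schedule for a diamond-shaped workflow A -> {B, C} -> D.
finish = {"A": 5.0, "B": 19.0, "C": 10.0, "D": 28.0}
ready = {("A", "B"): 7.0, ("A", "C"): 6.0, ("B", "D"): 22.0, ("C", "D"): 11.0}
path = critical_path(finish, ready)
```

Here D waits on B rather than C, and B waits on A, so the recovered path is A, B, D.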

This section highlights a proof which suggests a varying checkpoint interval for an optimal TET with the same level of fault tolerance. It assumes that there is no shift in the critical path with the involvement of resubmissions.

Lemma 0.1.

The checkpoint interval introduced in the CRCH algorithm is environment dependent and can be modified to best suit the TET of the workflow in a given environment.

Assumption 1.

The waiting time of a task is distributed normally, and the probability density function depends on the ancestors of the task, environmental parameters. The range of values observed can only be explained by a normal distribution which spans . The set of observations for waiting times of tasks taken over a large range of inputs closely mimicked a normal distribution, indicated by a QQ (quantile-quantile) plot natrella2010nist . Upon reasonable & realistic variations in , while maintaining the same environmental conditions it was found that the shift in the distribution parameters is negligible.
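A QQ-based normality check of this kind can be sketched with the Python standard library. The sketch below is an illustration and not the procedure used for the experiments: it compares sorted observations with standard normal quantiles and reports their correlation, with values close to 1 indicating an approximately normal sample. The function name `qq_correlation` and the synthetic waiting times are assumptions.

```python
import random
from statistics import NormalDist, mean


def qq_correlation(samples):
    """Correlation between sorted samples and standard normal quantiles."""
    xs = sorted(samples)
    n = len(xs)
    std_normal = NormalDist()
    # Theoretical quantiles at the plotting positions (i + 0.5) / n.
    qs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mq = mean(xs), mean(qs)
    cov = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_q = sum((q - mq) ** 2 for q in qs)
    return cov / (var_x * var_q) ** 0.5


rng = random.Random(0)
# Synthetic waiting times: one roughly normal set and one skewed set.
normal_waits = [rng.gauss(30.0, 5.0) for _ in range(500)]
skewed_waits = [rng.expovariate(1 / 30.0) for _ in range(500)]
```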

Assumption 2.

The probability of a task to fail on the scheduled resource is independent of .

Assumption 3.

The resubmission of ancestor tasks does not influence the probability of a task to fail on the resource it is meant to execute on.

We go on to prove Lemma 0.1 based on the aforementioned assumptions. Notations used in the proof are as follows :

  • : checkpoint interval (time interval between two successive globally synchronized checkpoints)

  • : overhead at each checkpoint

  • : expected time to execute CRCH algorithm

  • : checkpoint overhead

  • : expected time to execute CRCH excluding

  • : expected time taken for task to execute in CRCH

  • : critical path - a set of tasks lying on the critical path

  • : expected waiting time for task

  • : expected resubmission time for task

  • : time taken for task to execute in HEFT


According to the definitions mentioned above:


If the task is the last of its replicas, it may not follow the schedule given by Algorithm 2 due to:

  1. All the parents in haven’t completed execution that is, there exists one task st. , hasn’t completed execution.

  2. The resource on which the task has been scheduled is executing some backlog of tasks.

  3. The resource is unavailable, as it has failed.

depends on , that is, the ancestor graph for (which includes the VMs on which the tasks have been scheduled & their corresponding MTBFs, MTTRs).

, is a Gaussian distribution which determines the expected amount of time a task has to wait on its immediate parents and the sufficient statistics are functions of . (Assumption 1) Thus:


is the overhead involved in resubmission. A task is resubmitted only if it is the last replica and all the other replicas have failed, that is,


The following calculations for the probability of failure of a task and the corresponding expected resubmission overhead are only done for tasks compliant with equation (13).

The probability that a task fails, defined as , depends on the probability that the task has been scheduled on a failing VM and the probability for an intersection of the execution time with the interval where


Since the set of failing VMs is chosen from uniform distribution:


is the probability of a VM to fail

is the set of failing VMs

Now to evaluate the probability of interval overlap, and are needed which are determined from Weibull distributions and . Thus:


As the executions of all replicas are mutually independent, with being the number of failing replicas of :


is the number of failing replicas of task

is the total number of replicas of

If is the expected overhead in resubmissions then by using (18):


Let be the resource on which is resubmitted. Hence is the probability that the task is resubmitted on the same resource on which it was originally submitted. follows a multinomial distribution conditional on the schedule decided by Algorithm 2 and any other workflow or resource parameters (including ). Overhead due to submitting on the same resource:



is the point in time where the execution of was stopped on due to failure with respect to .

is the expected value for the difference in minimum Estimated Start Time when the task is rescheduled on the same resource and (point of failure).

is the expected value for the difference in minimum Estimated Start Time when the task is rescheduled on a different non-failing resource and (point of failure).

Therefore, the probability of the task being resubmitted on a different resource is the complement of the same-resource probability defined above. Overhead due to submission on a different resource:


Using (18), (19), (20), (21):


That is,


Using (8), (9), (10), (23)


From (10), (11)


It can be seen that is a product of two separate terms involving .
Let and
For , it is clear that as increases the value of decreases.
In the expressions involving are , , , where is independent of (Assumptions 2 and 3). The first expression among these is a non differentiable function in , although it is piece-wise continuous. The average value of the function over a contiguous range of values for (where the function is continuous and is chosen randomly from a uniform distribution) would increase as increases.

In stable environments, Therefore is insignificant and can be increased. This means a greater checkpoint interval is affordable in a stable environment.

On the contrary, in an unstable environment is considerable and cannot be ignored. decreases as increases since the overhead involved in re-scheduling on a different resource would be comparable to the loss in execution time units incurred from distant checkpoints. In general, with low resource availability there are many instances of resubmissions, which leads to an increase in and . Thus a lower value of would be needed to reduce the overhead caused due to checkpointing.

For cases where , would reduce as the resubmission would not account for much loss in execution time units. In environments, the contributing factor to the waiting time of a task would be the unavailability of resources to schedule ancestor tasks. Hence the observations on waiting times for tasks indicated a negligible impact on changes in . For environments the resubmissions are itself low to record any shifts in the parameters of the normal distribution. (Assumption 1)

Performance Analysis

Experimental Setup

To evaluate the performance of the proposed approach, the WorkflowSim simulator is used. The cloud environment is modeled with one data center and 20 Condor VMs. It is assumed that there are at least four reliable VMs at any point in time. Each VM is assumed to be fully connected to every other VM using two-way dedicated lines. This assumption is based on the fact that if they are connected using a single interconnect bus then wait time to access and gain control over the bus also needs to be considered and becomes a substantial factor in TET as compared to factors of significance such as transmission time, checkpoint overhead etc.

Four different workflows, Montage, CyberShake, Inspiral (LIGO) and SIPHT, are used in the simulation, with sizes ranging from 100 to 700, in multiples of 100. The workflows are given as input in the form of a DAX file. The dependencies between the tasks in the workflow, the runtime of the tasks on multiple VMs, the rate of file transmission between VMs and the size of the files to be transmitted between the tasks are read from the DAX file. Each DAX file is executed ten times and the average of the results over those executions is considered.

In some cases, the same DAX file shows high run times on a few executions and low run times on the others. This is attributed to the fact that if a critical task gets scheduled on a failing VM, the runtime increases by a huge margin. A task is a critical task if it lies on the critical path and has zero free and total float ISLR2013 . The increase in the runtime of this task would also increase the final execution time linearly. In those cases, a weighted average of those values has been taken.

The runtimes of the tasks are different in each of the Condor VMs. Each task has a set of defined parent and children tasks and the data to be transferred between tasks is specific. Thus three different matrices are used as input.

  1. : Data to be transferred.

  2. : Run time of task on VM.

  3. : Transmission time between two VMs.

A light-weight checkpointing duan2005dee model is proposed. Checkpoints of a task only contain pointers to the data and not the data itself. Each VM has non-volatile stable storage. Each globally synchronized checkpoint dumps the working memory of the VM to the nonvolatile storage. It involves instruction pointer, register state, current stack etc. There also exists global storage which holds references to the outputs generated by the parents of a task. The parents’ output is stored in the resource, on which the parent was executed. Hence these references are pointers to those locations. The data required for execution is fetched from references on the go. Instead of fetching all the data from the references before the execution of the task, wastage is reduced in the transmission of unwanted data by fetching only those pages from storage which are required at any given point during execution. Since dedicated lines exist between the VMs, the time to fetch data from disk belonging to another VM is comparable to fetching data from its own disk.

The resource failure model for the simulated environment considers the MTBF, the size of the failure in terms of the number of VMs affected, and the duration of the failure (MTTR). The MTBF is modeled using a Weibull distribution with shape parameters ranging between 11.5 and 12.5 plankensteiner2009new . The size of the failure is also modeled using a Weibull distribution, with shape parameters ranging between 1.5 and 2.4 plankensteiner2009new . The set of failing VMs and the MTTR are modeled using a uniform distribution and a log-normal distribution respectively. The shape parameters used for MTTR are 10 and 5. Three different failure models (stable, normal and unstable) are modeled. For each environment, the shape parameters change. MTBF values decrease as one moves from the stable to the unstable environment, as failures become more frequent. The MTTR values used are 6 minutes for the unstable, 3 minutes for the normal and 1 minute for the stable environment.
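A failure event drawn from these distribution families can be sketched with Python's `random` module, where `weibullvariate(alpha, beta)` takes the scale first and the shape second. The sketch is illustrative only: the function name `sample_failure`, all scale values, the log-normal parameters and the convention that the last four VM ids are the reliable ones are assumptions and are not taken from the experimental setup.

```python
import random


def sample_failure(rng, n_vms=20, reliable=4,
                   mtbf_scale=60.0, mtbf_shape=12.0,
                   size_scale=3.0, size_shape=2.0,
                   mttr_mu=1.0, mttr_sigma=0.5):
    """Draw one failure event: (time to failure, failing VM ids, downtime)."""
    # Time until the next failure, Weibull distributed.
    gap = rng.weibullvariate(mtbf_scale, mtbf_shape)
    # Number of VMs affected, Weibull distributed and clipped so that
    # at least `reliable` VMs always stay up.
    size = round(rng.weibullvariate(size_scale, size_shape))
    size = max(1, min(size, n_vms - reliable))
    # Failing VMs chosen uniformly among the VMs that are allowed to fail
    # (assumed here to be ids 0 .. n_vms - reliable - 1).
    failing = sorted(rng.sample(range(n_vms - reliable), size))
    # Downtime, log-normally distributed.
    duration = rng.lognormvariate(mttr_mu, mttr_sigma)
    return gap, failing, duration
```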

Experimental Results

The performance of CRCH is compared with HEFT and with another heuristic approach, ReplicateAll, proposed in plankensteiner2007fault . ReplicateAll uses a simple and straightforward approach to task replication, wherein each task in the workflow is replicated repCount times, as specified by the user. repCount is assumed to be three for ReplicateAll, and hence it is called ReplicateAll(3) in this work. HEFT is run on multiple workflows with the same failure models as those of CRCH. A majority of the HEFT executions could not be completed, as HEFT has no fault tolerance. TET, Resource Usage, Resource Wastage and Schedule Length Ratio (SLR) are the metrics used for evaluating the performance of CRCH.

A resource is considered good if a majority of the tasks have a very low execution time on it as compared to other VMs. If the VM chosen to fail is good, many tasks are likely to be scheduled on it (especially in a sparse workflow), and hence there are more resubmissions. Otherwise, the number of tasks originally scheduled on that resource is itself low, hence there are fewer resubmissions and a comparatively smaller increase in TET. The goodness of a failing VM, and consequently the number of resubmissions, is a decisive factor in all the metrics mentioned and thus explains a majority of the anomalies.

The results for Montage workflow have been presented. Similar patterns were observed for SIPHT, LIGO and CyberShake.

TET of CRCH is greater than that of HEFT in all the environments. This is expected, as CRCH involves the execution of replicated tasks and, at the time of resource failures, CRCH waits for the failed resource to come back online or schedules the task on a less optimal but available resource. This increase is more pronounced in the normal environment, as the rate of resource failures is higher than in the stable environment. The TET of ReplicateAll(3) in all three environments (not shown) is much higher than that of its counterparts, as all the tasks of the workflow have to be executed four times. Although the replicas can be scheduled in parallel, zhang2009combined suggests a task schedule where the replicas of a task are scheduled after its children. Moreover, a large number of tasks on the same level, as seen in many scientific workflows deelman2015pegasus , would generate more replicas than can be handled by the existing VMs. These factors lead to a manifold increase in TET for ReplicateAll(3). Figure 4 presents the comparison of the HEFT and CRCH algorithms under the stable and normal environments. HEFT fails to execute in the unstable environment because of a higher probability of resource failure and its lack of fault tolerance. TET in the stable environment is, in general, lower than in the normal environment, as there are very few cases of resubmission, whereas replication amounts to the same time in both environments. In general, as CRCH considers resubmission on the same resource, TET values are considerably lower than what would be expected if the resubmissions were always on a different resource.

Figure 4: Total Execution Time

Resource Usage is the sum of the actual processor time in minutes used for executing all the tasks of the workflow zhang2009combined . Figure 5 shows the average resource usage of the CRCH, HEFT and ReplicateAll(3) algorithms as a fraction of TET. HEFT schedules the tasks of the workflow in the most optimum manner topcuoglu2002performance and thus has the minimum resource usage. HEFT cannot finish the execution of workflows in the unstable environment, as it has no fault tolerance. CRCH, on the other hand, takes resource failures into account and uses replication and resubmission of tasks, and thus its resource usage increases. In the stable environment, the resource usage of CRCH is higher than that of HEFT by 16%. This small increase is attributed mainly to the replication of tasks, as resubmission is done only occasionally due to the low resource failure rate. In the normal environment, the increase is 17%, which is more prominent because resubmission is also involved along with replication. Resubmission of tasks increases as the failure rate of the resources increases. All the tasks are replicated three times in ReplicateAll(3), which leads to higher resource usage. In the stable, normal and unstable environments, the average resource usage increase of ReplicateAll(3) compared to CRCH is 41%, 30% and 17% respectively. This is attributed to the fact that CRCH uses replication heuristics in the form of hierarchical clustering. A decline in the percentage increase in resource usage for ReplicateAll(3) over CRCH is observed across the environments. This is because, when a large number of resources are failing, most of the replicated tasks do not complete execution. Hence the futile usage of resources to re-execute completed tasks reduces, and the resources are idle instead. The increase in usage is highly dependent on the workflow type.

Figure 5: Average Resource Usage

Two major types of resource wastage are considered in CRCH:

  1. When a task is running on a resource and the resource fails, wastage occurs when the task has executed beyond a certain checkpoint and needs to be executed on a different resource. The processing power wasted in executing a task beyond the last checkpoint is the effective wastage.

  2. When a task is replicated and the first replica executes successfully, any subsequent replica executed on any of the resources are now deemed as waste.
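The two wastage types listed above can be expressed as a short Python sketch. The function names `failure_waste` and `replica_waste`, the record layout and the example values are hypothetical, and the sketch counts only replicas that complete after the first successful one.

```python
def failure_waste(progress, checkpoint_interval):
    """Type 1: work done since the latest checkpoint, lost when the VM fails."""
    return progress % checkpoint_interval


def replica_waste(replica_runs):
    """Type 2: processor time of replicas that succeed after the first success.

    replica_runs -- list of (finish_time, busy_time, succeeded) tuples
    """
    successes = sorted(r for r in replica_runs if r[2])
    if not successes:
        return 0
    first_finish = successes[0][0]
    return sum(busy for finish, busy, ok in replica_runs
               if ok and finish > first_finish)


# Three replicas of one task: two succeed, one fails.
runs = [(10, 8, True), (15, 9, True), (12, 4, False)]
```

In this example a task that failed 47 time units into its execution with a checkpoint interval of 10 wastes 7 units, and the second successful replica contributes 9 units of replica waste.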

In the case of HEFT, where there is no resubmission, there is a possibility of a workflow failure, in which case all tasks executed as part of the workflow are considered as waste. The wastage in HEFT is calculated over ten executions, where all failed workflow executions contribute to the waste and the wastage is averaged over the executions. There is no wastage in an execution that is completed. No replication heuristic is considered in ReplicateAll(3) for deciding the number of replicas of tasks, and so a lot of unnecessary replicas are created. Resource wastage in ReplicateAll(3) is the time that is spent executing the unnecessary replicas of tasks after their first replicas have already executed successfully. The resource wastage that occurs due to resubmission is eliminated, as there is no resubmission of tasks in ReplicateAll(3). The average resource wastage of HEFT, CRCH and ReplicateAll(3) as a fraction of TET is compared in Figure 6. On average, CRCH gives a 22% and a 46% reduction in resource waste over HEFT in the normal and stable environments respectively. The improvement in resource wastage over HEFT is smaller in the normal environment than in the stable environment, because more resubmissions are needed in the normal environment. The resource waste is higher in ReplicateAll(3) than in CRCH by 70%, 47% and 29% for the stable, normal and unstable environments respectively. Overall, CRCH performs better than its counterparts.

Figure 6: Average Resource Wastage
Figure 7: Average SLR

SLR is defined as the ratio between the execution time and the B-level of the first entry task of the workflow zhang2009combined . In the case of CRCH and ReplicateAll(3), SLR is calculated as the ratio between the execution time and the B-level of the first task on the critical path after introducing the replicas. The B-level of a node is the length of the longest path from the node to an exit node, bounded by the length of the critical path yukwong1999 . An optimal algorithm tries to choose the best scheduling plan and thus lowers the SLR value. HEFT has no consideration for task or resource failures and hence has a low SLR value. CRCH assumes resource failures, and therefore the optimal resource for a task may not be available at the ready time of the task. The task might need to be scheduled on a second-best resource or wait for the failed resource to come online, and thus the SLR values are higher than those of HEFT. In the stable environment, as shown in Figure 7, the SLR values of CRCH are comparable with those of HEFT and are only slightly higher, by 5%. The slight improvement in SLR for CRCH can be intuitively attributed to the fact that SLR depends on the critical path identified. Since every task has some number of replicas, a shorter critical path may be found from a replica of the task instead of the original one. In the case of failure of an optimal resource in CRCH, resubmissions on the same resource improve the SLR value. In a normal environment more resources are expected to fail, hence the increase in the SLR value for CRCH. Both the execution time (numerator of SLR) and the critical path found (denominator of SLR) increase, and hence an increase of 10% is observed. In ReplicateAll(3), there is a slight increase in SLR, but it is not substantial, as both the numerator and the denominator of the ratio increase. SLR is also affected by resubmissions, and there is no resubmission in ReplicateAll(3).
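The SLR computation can be illustrated with a small Python sketch. It follows the definition given above (execution time divided by the B-level of the entry task) but ignores communication costs, which is a simplifying assumption; the function name `schedule_length_ratio` and the example graph are hypothetical.

```python
from functools import lru_cache


def schedule_length_ratio(makespan, cost, children, entry):
    """SLR = makespan / B-level of the entry task.

    cost     -- dict: task -> execution time
    children -- dict: task -> list of child tasks
    """
    @lru_cache(maxsize=None)
    def b_level(task):
        # Longest path from `task` down to an exit node, including `task`.
        below = max((b_level(c) for c in children.get(task, ())), default=0)
        return cost[task] + below

    return makespan / b_level(entry)


slr_cost = {"A": 3, "B": 2, "C": 4, "D": 1}
slr_children = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
```

For the example graph the B-level of the entry task A is 8, so a measured execution time of 10 gives an SLR of 1.25.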

Metrics like Resource Usage and Resource Wastage show specific trends across workflow types, consistent with the structural differences among them.

Figure 8 shows the Average Resource Usage observed across various workflow types, environments and algorithms. For all three algorithms under the normal environment, the differences in Resource Usage among the four workflow types follow similar proportions. A 35% increase in Resource Usage is seen when comparing Montage with CyberShake under the normal environment with the CRCH algorithm. This increase is as high as 129% for LIGO. Although the Resource Usage is higher for ReplicateAll(3) than for CRCH under all environments and workflow types, the percentage increase across the different types is lower. There is a 47% increase when Montage is compared to LIGO. The percentage increase also rises as one moves from the unstable to the stable environment. An increase of 111%, 129%, and 144% respectively is observed. Such patterns are also observed while comparing Montage with CyberShake, and CyberShake with SIPHT.

Figure 9 complements Figure 8 and shows Resource Wastage across workflow types. A 109% increase is seen in Resource Wastage while comparing Montage with LIGO under the normal environment using the CRCH algorithm. This figure is close to the 106% increase seen in the case of ReplicateAll(3). The percentage increase is as high as 211% while comparing Montage and LIGO under the stable environment using the CRCH algorithm. Montage shows the least Resource Wastage irrespective of the algorithm or the environment, and LIGO has the highest Resource Wastage in all cases. A percentage increase of 110% is seen in the unstable environment, which is close to the 109% increase under the normal environment.

Figure 8: Resource Usage Across Workflow Types
Figure 9: Resource Wastage Across Workflow Types

Conclusion and Future Work

It has been shown that modeling replication heuristics using unsupervised statistical techniques on readily available unlabelled data improves efficiency in faulty environments. In other words, CRCH can capture the correlations between similar tasks and their corresponding replication counts. On the other hand, it fails to incorporate the probability distributions over resource failure parameters like MTTR and MTBF in deciding the magnitude of task replication. This means that corresponding tasks in identical workflows would end up having a similar number of replicas, irrespective of the environment of execution.

The CRCH analysis goes to show that using checkpoints to store light-weight pointers to saved states not only improves memory access time but also makes the system robust. Distributed non-volatile storage complements the use of checkpoints. The analysis reflects an improvement in resource wastage. Resources on an average spend less time recomputing intermediate results. Since the results are also stored at a non-volatile memory location, improving checkpoint retrieval through increased network capacity is trivial.

It has been proven that varying checkpoint parameters with failure parameters can lead to a better Average Total Execution Time. However, the resource failure distributions for different resources are assumed to be mutually independent over time. This may not be the case when resource failures are modeled at a resource group level. In such situations, the probability of failure of a resource is dependent on the current state of its neighboring resources, and the resource failure probability distributions are merely conditionally independent, given the state of a few other resources poola2014robust .

The improvements in Total Execution Time, Resource Usage and Resource Wastage over other HEFT based algorithms across normal, stable and unstable environments have been shown empirically. Further work can be done by establishing theoretical bounds on the extent of improvements in the aforementioned metrics.

The failure models considered assume the availability of a few reliable resources, which are used for task resubmissions, ensuring that a task is resubmitted at most once. This helps in estimating the expected resubmission overhead. In practical circumstances, however, reliable resources may not always be available.

This paper can be extended to deal with real-world workflow management systems like Pegasus deelman2015pegasus . An elaborate set of training samples for replication counts can further improve the machine learning aspect. The DAX files for tasks with multiple direct and derived features can also improve the heuristic accuracy.

In estimating the increase in Total Execution Time, the model takes into account only the failures of tasks that lie on the critical path, as they have a direct impact on the increase in TET. DAX inputs with multiple or dynamic critical paths, and cloud environments with highly varying MTBF/MTTR distributions for individual VMs, require further analysis.




  • (1) Taverna, Why use workflows?, online; Accessed 2017-06-22 (2014).
  • (2) S. Callagan, Overview of scientific workflows, online; Accessed 2017-06-22 (2017).
  • (3) C. Hoffa, G. Mehta, T. Freeman, E. Deelman, K. Keahey, B. Berriman, J. Good, On the use of cloud computing for scientific workflows, in: eScience, 2008. eScience’08. IEEE Fourth International Conference on, IEEE, 2008, pp. 640–645.
  • (4) E. Deelman, Grids and clouds: Making workflow applications work in heterogeneous distributed environments, International Journal of High Performance Computing Applications 24 (3) (2010) 284–298.
  • (5) G. M. Juve, Resource management for scientific workflows, Ph.D. thesis, Citeseer (2012).
  • (6) S. Hwang, C. Kesselman, Grid workflow: A flexible failure handling framework for the grid, in: High Performance Distributed Computing, 2003. Proceedings. 12th IEEE International Symposium on, IEEE, 2003, pp. 126–137.
  • (7) K. Plankensteiner, R. Prodan, T. Fahringer, A new fault tolerance heuristic for scientific workflows in highly distributed environments based on resubmission impact, in: e-Science, 2009. e-Science’09. Fifth IEEE International Conference on, IEEE, 2009, pp. 313–320.
  • (8) Y. Zhang, A. Mandal, C. Koelbel, K. Cooper, Combined fault tolerance and scheduling techniques for workflow applications on computational grids, in: Cluster Computing and the Grid, 2009. CCGRID’09. 9th IEEE/ACM International Symposium on, IEEE, 2009, pp. 244–251.
  • (9) D. Poola, M. A. Salehi, K. Ramamohanarao, R. Buyya, A taxonomy and survey of fault-tolerant workflow management systems in cloud and distributed computing environments, in: I. Mistrik, R. Bahsoon, N. Ali, M. Heisel, B. Maxim (Eds.), Software Architecture for Big Data and the Cloud, Morgan Kaufmann, 2017, Ch. 15, pp. 285–315.
  • (10) J. Yu, R. Buyya, A taxonomy of workflow management systems for grid computing, Journal of Grid Computing 3 (3-4) (2005) 171–200.
  • (11) K. Plankensteiner, R. Prodan, T. Fahringer, A. Kertesz, P. K. Kacsuk, Fault-tolerant behavior in state-of-the-art grid workflow management systems, 2009 Fifth IEEE International Conference on e-Science.
  • (12) D. Poola, S. K. Garg, R. Buyya, Y. Yang, K. Ramamohanarao, Robust scheduling of scientific workflows with deadline and budget constraints in clouds, in: Advanced Information Networking and Applications (AINA), 2014 IEEE 28th International Conference on, IEEE, 2014, pp. 858–865.
  • (13) H. Topcuoglu, S. Hariri, M.-y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE transactions on parallel and distributed systems 13 (3) (2002) 260–274.
  • (14) B. Kruatrachue, T. Lewis, Grain size determination for parallel processing, IEEE software 5 (1) (1988) 23–32.
  • (15) G. Kandaswamy, A. Mandal, D. A. Reed, Fault tolerance and recovery of scientific workflows on computational grids, in: Cluster Computing and the Grid, 2008. CCGRID’08. 8th IEEE International Symposium on, IEEE, 2008, pp. 777–782.
  • (16) W. Chen, E. Deelman, Fault tolerant clustering in scientific workflows, in: Services (SERVICES), 2012 IEEE Eighth World Congress on, IEEE, 2012, pp. 9–16.
  • (17) M. Wieczorek, R. Prodan, T. Fahringer, Scheduling of scientific workflows in the askalon grid environment, ACM SIGMOD Record 34 (3) (2005) 56–62.
  • (18) G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning, Springer, 2013.
  • (19) I. Jolliffe, Principal Component Analysis Second Edition, springer, 2002.
  • (20) J. Yang, D. Parikh, D. Batra, Joint unsupervised learning of deep representations and image clusters, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5147–5156.
  • (21) Y.-K. Kwok, I. Ahmad, Benchmarking and comparison of the task graph scheduling algorithms, Journal of Parallel and Distributed Computing 59 (3) (1999) 381–422.
  • (22) M. Natrella, Nist/sematech e-handbook of statistical methods, (2010).
  • (23) R. Duan, R. Prodan, T. Fahringer, Dee: A distributed fault tolerant workflow enactment engine for grid computing, in: International Conference on High Performance Computing and Communications, Springer, 2005, pp. 704–716.
  • (24) E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. F. da Silva, M. Livny, et al., Pegasus, a workflow management system for science automation, Future Generation Computer Systems 46 (2015) 17–35.
  • (25) I. A. Yu-Kwong Kwok, Static scheduling algorithms for allocating directed task graphs to multiprocessors, ACM Computing Surveys (CSUR).