The worldwide Infrastructure as a Service (IaaS) cloud market is attracting a wide range of users and grew 37.3% in 2019 to a total of $44.5 billion. IaaS frees users from purchasing and maintaining servers whose capacity would otherwise have to satisfy their peak demand to avoid unacceptable latency. Instead, users can scale their computing capacity up or down by renting servers from IaaS providers to match the variation in demand over time. The dominant IaaS providers include Amazon Elastic Compute Cloud (EC2), Microsoft Azure, and Google Cloud, accounting for 45.0%, 17.9% and 5.3% of the global market share, respectively. The ongoing COVID-19 pandemic provides a further push for the adoption of IaaS as more enterprises move their applications to public clouds. To bridge the gap between IaaS providers and users, the key is to determine how users can use IaaS services cost-effectively, which enhances user engagement and satisfaction and the long-term sustainability of cloud ecosystems.
On-demand and spot instances are two typical purchase options. On-demand instances are always available at a fixed price once requested, and users pay only for the time during which instances are actually consumed. Spot instances is the term used in Amazon EC2; they are called spot virtual machines (VMs) in Microsoft Azure and preemptible VM instances in Google Cloud. Spot instances have uncertain availability: cloud service providers (CSPs) may reclaim their resources at any time for other purposes. In Google Cloud, spot prices are fixed and instance availability depends only on the dynamics of system resources. In Amazon EC2 and Microsoft Azure, spot prices vary over time and a user needs to bid a price for spot instances; instance availability then also depends on the relation between the spot and bid prices. Spot instances can reduce costs by 50-90% compared to on-demand instances. Separately, a user may have its own instances, called self-owned instances, which, although insufficient at times, can be complemented with additional instances purchased on-demand from the cloud. Some users may have no self-owned instances at all (e.g., startups) and need to buy all necessary computing resources.
Previous works [10, 12, 9, 11] have enabled cost-effective processing of a special type of workload, namely map-only tasks [8, 13, 14]: each task is partitioned into a large number of independent sub-tasks that can be executed on multiple instances simultaneously, and a parallelism bound specifies the maximum number of instances that the task can utilize at the same time. However, such tasks are mutually independent and cover only a limited set of important applications. More generally, a workload is described by a directed acyclic graph (DAG) whose nodes are tasks and whose edges represent precedence constraints among tasks [15, 8]; each DAG is referred to as a job. Examples of such jobs include the workloads of MapReduce and Spark's RDDs [19, 20, 21], which are fundamental programming paradigms for big-data processing. A user's jobs arrive over time, each with a specific timing requirement, i.e., a deadline by which all its tasks must complete. Each job is allocated instances of different types (self-owned, on-demand and spot). Our problem is to find an allocation that minimizes cost while meeting the deadline of each job and the precedence constraints among its tasks.
Challenges. The costs of self-owned, spot and on-demand instances are in increasing order. To be cost-optimal, an allocation policy should therefore maximize the utilization of self-owned and then spot instances while minimizing the utilization of costly on-demand instances. One component of our framework is the policy for allocating different types of instances to a single task executed in a predefined time window; it involves determining the proportions of the different instance types. Previous works [10, 12] consider a discrete allocation case where the allocation of spot and on-demand instances is updated on an hourly basis, which arises for the class of instances in Amazon EC2 whose on-demand billing is done per hour. In this paper, we consider the continuous allocation case with a reformulated analysis; here, users pay for exactly the period in which on-demand instances are consumed. The resulting framework applies to the other class of instances in Amazon EC2 as well as the instances of Microsoft Azure and Google Cloud.
The other new aspect is addressing the precedence constraints among the tasks of a job. A task can be executed only when all its preceding tasks have finished. For analytical tractability, a DAG job is normally transformed into a job with a chain precedence constraint (i.e., a sequence of tasks) where a task can be executed only after its preceding task completes. Spot instances are available at irregular intervals, and the minimum time needed to finish a single task is its workload divided by its parallelism bound. Suppose a user has no self-owned instances. An intuitive greedy strategy does not work well: it fully utilizes spot instances to finish tasks one by one until some time point after which all remaining tasks have to fully utilize costly on-demand instances to meet the job deadline. In contrast, tasks differ from one another, and the capacity of a task to utilize spot instances depends on its characteristics and on the length of the time window in which it is executed. Given a job, its tasks can be executed from its arrival time until its deadline, and a proper allocation of time window sizes to its tasks is needed to maximize the total utilization of spot instances.
Our Contributions. Technically, the main contribution of this paper is a framework that enables utilizing a class of IaaS services to process jobs with chain precedence constraints cost-effectively, where on-demand instances are charged for exactly the period in which they are consumed:
In the case that a single task is to be executed in a time window, we derive policies that allocate spot and on-demand instances cost-optimally and self-owned instances cost-effectively. This is the basis for deriving the capacity of a task to utilize spot instances, given the time window length.
A job has multiple tasks. We derive an optimal yet efficient allocation of time window sizes to the tasks, based on a formulation of the problem as an integer linear program to maximize the utilization of spot instances. The allocation algorithm can be used both when the tenant has self-owned resources and when it does not.
Leveraging existing techniques in combinatorial optimization, a DAG job can be transformed into a job with a chain precedence constraint. Consequently, our technical framework can be used to cost-effectively utilize cloud services for general DAG jobs. It applies to a significant class of instances in Amazon EC2 and to the instances of Microsoft Azure and Google Cloud. Experimentally, several intuitive heuristics are used as baselines to validate the cost improvement brought by the proposed solutions. The cost saving is up to 24.87% when spot and on-demand instances are considered and up to 59.05% when self-owned instances are also considered. In our framework, the policies and algorithm are parametric in the availability of spot instances and the sufficiency of self-owned instances, and we leverage the online learning technique of [10, 9] to infer these parameters. The sufficiency is indicated by a parameter θ that controls the allocation of self-owned instances (see Section 4.2.1); the more self-owned instances a user has, the smaller the value of θ and the more self-owned instances each task gets allocated. When a user persistently requests spot instances, their availability is quantified as the average proportion of time in which spot instances are actually available.
The rest of this paper is organized as follows. The related work is introduced in Section 2. We formally describe the problem in Section 3. In Section 4, we propose a technical framework for allocating deadlines and self-owned, on-demand and spot instances. In Section 5, we introduce the existing techniques for job transformation and online learning, which will be integrated into our framework. Experimental results are given in Section 6 to validate the effectiveness of the solutions of this paper. Finally, we conclude this paper in Section 7.
2 Related Work
To date, multiple service and pricing models have been proposed [33, 35], and the spot and on-demand service model is a major service offering [32, 36, 34]. Jain et al. are the first to apply an online learning approach to infer cost-effective parametric policies for utilizing spot and on-demand instances [10, 9]. The key to achieving cost efficiency is the design of a parametric policy; a limitation of [10, 9] is that self-owned instances are not considered. Building on this approach, Wu et al. formalize the instance allocation process and derive the expected-optimal parametric policy for spot and on-demand instances and a near-optimal parametric policy for self-owned instances [11, 12]. The works [10, 9, 11, 12] only consider the allocation to independent map-only tasks. Moreover, in their framework, on-demand instances are charged on an hourly basis, and users have to consider maximizing the usage of instances up to integer hours to avoid extra charges. In our framework, users pay by the second and thus for exactly what they consume; our policies for a single task therefore have different forms than the policies of [11, 12]. We also propose an approach to deal with the precedence constraints among tasks. The online learning approach is attractive in that, compared to other techniques such as stochastic programming, it does not need prior statistical workload characterization.
Other works also consider only independent tasks and associate a specific deadline with each task to make the instance allocation process manageable. Specifically, Zafer et al. use a Markov model to characterize spot prices and derive an optimal bidding strategy to utilize spot instances. Yao et al. formulate the problem of utilizing reserved and on-demand instances as an integer program and propose heuristic algorithms that give approximate solutions.
Now, we briefly review other approaches to the cost-effective use of cloud services. One class of works relies on a priori statistical knowledge of the workload or spot prices. For instance, Hong et al. and Chaisiri et al. apply stochastic programming for reserved and on-demand instances [18, 22], and Zheng et al. derive the optimal bidding strategy for spot instances. However, the computational cost of deriving the related statistical knowledge is high. Wang et al. cast the use of reserved and on-demand instances as the Bahncard problem, and the resulting algorithm is evaluated by competitive analysis. Vintila et al. propose a genetic algorithm for spot and on-demand instances. Shi et al. apply Lyapunov optimization and are among the first to jointly utilize the three common types of cloud instances; however, a large job delay is incurred. Gao et al. consider joint resource provisioning and task scheduling, and propose a two-timescale Markov decision process approach to maximize the profit of a multimedia service provider. Dubois et al. propose a heuristic that helps cloud users choose the right type of spot instances and the bid price, aiming to minimize cost while maintaining an acceptable level of performance [27, 28].
3 Problem Description and Model
In this section, we introduce the cloud pricing models, define the operational space of a user to utilize various instances, and characterize the objective of this paper.
3.1 Resource Availability and Pricing Structure
On-demand and spot instance services available at popular CSPs may be modelled as follows. First, the price of an on-demand instance is fixed per unit of time and such instances are always available once requested by a user. For example, the price p of utilizing an instance for one hour is posted to users, who pay for computing capacity by the second: when a user utilizes an instance for t hours, it is charged p·t, where t can be fractional. This is more convenient for users than pricing where billing is done on an hourly basis; in the latter, users have to consider maximizing the usage of instances up to integer hours to avoid extra charges.
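As a small illustration of the difference between the two billing schemes, the sketch below compares per-second (exact-duration) billing with hourly billing; the function names and the price value are ours, not part of any provider's API.

```python
import math

def cost_exact(hourly_price: float, hours: float) -> float:
    """Per-second style billing: pay for the exact (fractional) duration."""
    return hourly_price * hours

def cost_hourly(hourly_price: float, hours: float) -> float:
    """Hourly billing: partial hours are rounded up to full hours."""
    return hourly_price * math.ceil(hours)

p = 0.10  # hypothetical on-demand price per hour
print(round(cost_exact(p, 2.5), 2))   # 0.25
print(round(cost_hourly(p, 2.5), 2))  # 0.3
```

Under hourly billing the user pays for three full hours; under per-second billing it pays for exactly two and a half.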
Second, a user can also request spot instances at a lower price than on-demand instances. Their availability varies over time and users can only utilize spot instances occasionally. Factors affecting availability include the idleness of the cloud system in general and the bid price in some scenarios. The cloud can reclaim spot instances allocated to a user at any time when it needs those resources for other high-priority jobs; the bid price is the maximum price that a user is willing to pay for spot instances. In Google Cloud, spot instances are offered at a fixed price and are delivered to a user when there are idle instances. In Amazon EC2 and Microsoft Azure, the price of spot instances varies over time; a user successfully gets spot instances only if its bid price exceeds the spot price, and spot instances are reclaimed by the cloud when either resources are inadequate or the bid price falls below the spot price. From a user's perspective, spot service is a type of stochastic service: when a user persistently requests spot instances, the spot service commences at random time points and lasts for random durations. To facilitate analysis, we let β denote the average portion of each unit of time for which a user can utilize spot instances, where β ∈ (0, 1).
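Since the spot service appears to a user as random on/off periods, the availability fraction can be estimated from an observed trace. A minimal sketch with a synthetic trace (identifiers and numbers are ours):

```python
def estimate_beta(on_periods, horizon):
    """Fraction of the horizon during which spot instances were delivered."""
    delivered = sum(end - start for start, end in on_periods)
    return delivered / horizon

# synthetic trace: intervals in which spot instances were delivered
trace = [(0.0, 1.5), (2.0, 2.5), (4.0, 6.0)]
print(estimate_beta(trace, 8.0))  # 0.5
```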
Third, a user might have its own instances, i.e., self-owned instances, whose amount R is limited and possibly zero. If any, the (average) cost of utilizing self-owned instances is assumed to be the lowest compared with cloud instances, which implies that a user always prefers to utilize its own instances before purchasing instances from the cloud. Thus, without loss of generality, this cost is assumed to be zero.
To sum up, availability and price are two key features in cost management. From a user perspective, on-demand instances are always available as if the CSP has infinite on-demand instances to deliver. A user can also request multiple spot instances, which however are available occasionally. The availability of on-demand and spot instances is illustrated in Figure 1. If any, self-owned instances are finite and are always available; however, they may be insufficient at times to satisfy the user computing need. The costs of utilizing self-owned, spot, and on-demand instances are increasing.
3.2 DAG-Structured Jobs
Jobs of a tenant arrive over a time horizon and are observed at the moments of their arrival. The tenant plans to rent instances from IaaS clouds to process its jobs and aims to minimize the cost of completing a set of jobs that arrive over the time horizon by their deadlines. Following [15, 10, 12], each job j is characterized by a DAG. It has an arrival time a_j and a deadline d_j; that is, job j can be executed and has to be finished in the time window [a_j, d_j]. The main notation of this paper is summarized in Table I. The DAG nodes represent tasks and the directed edges represent precedence relations. Each DAG job j has n_j tasks, and different jobs may have different values of n_j. We write i → i′ to indicate that the execution of task i′ can begin only after task i is completed. Thus, a task can be executed only when all its preceding tasks are completed.
Each task of job j consists of a large number of small sub-tasks that are independent and can be executed on multiple instances simultaneously; completing a task means completing all its sub-tasks. Formally, each task i of job j has a workload z_i and a parallelism bound δ_i. While task i is being executed, the number of instances assigned to it may change over time; the parallelism bound δ_i limits the maximum number of instances that can execute task i simultaneously. The workload z_i is the instance time that task i has to consume in order to be finished. For example, suppose z_i = 2; to finish task i, it needs to consume one instance for two units of time or two instances for one unit of time. When the task is always executed on the maximum number of instances, it achieves its minimum execution time, denoted by z_i/δ_i. (1)
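The workload and parallelism-bound semantics above can be captured in a few lines; a sketch (identifiers are ours, not the paper's):

```python
def min_execution_time(z: float, delta: int) -> float:
    """Minimum time to finish a task: workload over parallelism bound."""
    return z / delta

# z = 2: one instance for two units of time, or two instances for one unit
print(min_execution_time(2.0, 1))  # 2.0
print(min_execution_time(2.0, 2))  # 1.0
```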
3.3 Problem Description
Each job j must be finished in its time window [a_j, d_j]. We first need to determine a time window [st_i, d_i] in which each task i of job j is executed, while respecting the precedence constraints among the tasks. Here st_i is the earliest time at which all the preceding tasks of i are finished (st_i = a_j if task i has no predecessors) and at which the execution of i can begin; d_i is the deadline by which task i has to be finished.
3.3.1 Principled Instance Allocation Process
While executed in [st_i, d_i], each task i is assigned r_i self-owned instances: r_i ≥ 0 if the user possesses self-owned instances (i.e., R > 0), and r_i = 0 otherwise. The amount of workload processed by self-owned instances is r_i(d_i − st_i). The remaining workload is to be processed by spot and on-demand instances and its amount is z_i′ = z_i − r_i(d_i − st_i). If R = 0, all the workload of task i will be processed by spot and on-demand instances, i.e., z_i′ = z_i. From time st_i on, task i requests o_i on-demand instances and s_i spot instances from the cloud to process the remaining workload, where o_i + s_i ≤ δ_i − r_i to satisfy the parallelism constraint. While task i is being executed at a time t ∈ [st_i, d_i], the expected workloads that have been processed by on-demand and spot instances are o_i(t − st_i) and βs_i(t − st_i), respectively. At time t, the remaining workload of i to be processed is denoted by z_i(t), whose expected value is as follows: E[z_i(t)] = z_i′ − (o_i + βs_i)(t − st_i).
With the parallelism constraint, z_i(t)/(δ_i − r_i) is the minimum time needed to finish the remaining workload.
For a task i with residual instance time z_i(t), we say that task i has flexibility to utilize unstable spot instances at a moment t when the following condition holds: z_i(t) < (δ_i − r_i)(d_i − t).
Due to the inherent uncertainty of the spot service, a task may reach a state where it has to utilize only stable on-demand instances in order to finish by its deadline. Formally, as task i is executed, if there exists some time t satisfying z_i(t) = (δ_i − r_i)(d_i − t), we call such a time a turning point and denote it by τ_i. From time τ_i on, we have to give up utilizing cheap spot instances in order to finish the remaining workload by the deadline d_i. The instance allocation process may have two phases, defined below:
Definition 3.2. If the turning point τ_i exists and τ_i > st_i, the instance allocation process has two phases:
o_i on-demand and s_i spot instances are requested in the period [st_i, τ_i);
δ_i − r_i on-demand instances are utilized in [τ_i, d_i].
If the turning point exists and τ_i = st_i, on-demand instances are utilized in the entire period [st_i, d_i]. If the turning point does not exist, on-demand and spot instances are requested until some time t′ such that z_i(t′) = 0.
Example. Now, we give a toy example to illustrate the instance allocation process in Definition 3.2. Suppose task i has a parallelism bound δ_i and is executed in [0, 2]; the user has one self-owned instance. The scheduler allocates this self-owned instance to task i in [0, 2], and the remaining workload z_i′ is to be processed by cloud instances. The scheduler begins to request one spot and one on-demand instance at time 0:
If spot instances are delivered in [0, 1], task i gets enough execution time from spot and on-demand instances by time 1, and the turning point does not exist. This is illustrated in Fig. 2(a).
If spot instances are not delivered in [0, 1], the remaining workload at time 1 satisfies z_i(1) = (δ_i − 1)(2 − 1); the turning point exists with τ_i = 1, and task i has to turn to utilizing only on-demand instances to meet the deadline. This is illustrated in Fig. 2(b).
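A discrete-time simulation of the process in Definition 3.2 reproduces the two cases under hypothetical numbers (cloud workload 3, deadline 2, cloud parallelism 2 requested as one spot plus one on-demand instance); all identifiers and values are ours.

```python
def simulate(z, delta, d, spot_available, dt=0.5):
    """Run the allocation process; spot_available(t) -> bool.
    Returns (turning_point_or_None, finish_time)."""
    t, remaining, turning = 0.0, z, None
    while t < d and remaining > 1e-9:
        # turning point: remaining workload equals full-parallelism capacity
        if turning is None and remaining >= delta * (d - t):
            turning = t
        if turning is not None:
            rate = delta                                # on-demand only
        else:
            rate = 1 + (1 if spot_available(t) else 0)  # 1 od + 1 spot
        remaining -= min(rate * dt, remaining)
        t += dt
    return turning, t

print(simulate(3, 2, 2, lambda t: t < 1))  # (None, 2.0) -- spot delivered early
print(simulate(3, 2, 2, lambda t: False))  # (1.0, 2.0) -- turning point at t = 1
```

When spot capacity is delivered in [0, 1], the task keeps enough slack and finishes without a turning point; when it is not, the process switches to on-demand-only at time 1.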
3.3.2 Decision Variables and Objectives
Jobs arrive over time. Each job is represented as a DAG and has multiple tasks.
Decision Variables. Given a job j, we need to determine (i) the deadline d_i by which each task i is finished and (ii) the numbers o_i and s_i of on-demand and spot instances requested while there is flexibility for task i to utilize spot instances (i.e., while the turning point has not appeared; see the first and third cases of Definition 3.2). If a user possesses self-owned instances, we also need to determine the amount r_i allocated to each task i. Thus, our decision variables include d_i, o_i, s_i, and r_i for each task i of a job.
Objective of Instance Allocation. We refer to the ratio of the total cost of utilizing a certain type of instances to the total workload processed by this type of instances as the average unit cost of this type of instances. As described in Section 3.1, we assume, like [11, 12], the following:
Assumption 1. The average unit cost of self-owned instances is lower than that of spot instances, which in turn is lower than that of on-demand instances.
Due to Assumption 1, the overall objective of our instance allocation framework is to maximize the utilization of self-owned and then spot instances and to minimize the utilization of costly on-demand instances. Achieving this objective involves properly determining the decision variables d_i, o_i, s_i, and r_i for each task i of a job. The deadline d_i of a task affects its instance allocation process (Definition 3.2) and thus its completion time; the latter affects the time at which other tasks can start being executed due to the precedence constraints.
While allocating various instances to a single task in a specific time window [st_i, d_i], we should allocate them in the order of self-owned, spot and on-demand instances; the objectives here are the same as in [11, 12], where only the allocation to a single task is considered. In contrast, we consider the case of a DAG job in which a user pays exactly for what it consumes. We describe these objectives in Principles 3.1 and 3.2.
Principle 3.1. If a user possesses self-owned instances, the scheduler should make self-owned instances (i) fully utilized, and (ii) utilized in a way that maximizes the opportunity that all tasks have to utilize spot instances.
Principle 3.2. After self-owned instances are used up, or if a user has no self-owned instances, the scheduler should utilize on-demand instances in a way that maximizes the opportunity that a task has to utilize spot instances.
Realizing the above principles involves properly determining the decision variables r_i, o_i and s_i for each individual task of a job. Last but not least, a job j has n_j tasks and has to be finished in a given time window [a_j, d_j]. We also need to maximize the aggregate utilization of self-owned and spot instances by all tasks within the job. Correspondingly, we need to realize the following objective.
Principle 3.3. Before allocating instances to the tasks of a job, the scheduler needs to properly determine the deadlines d_1, …, d_{n_j} so as to maximize the overall utilization of self-owned instances, if any, and spot instances.
In the following, we will propose solutions for realizing the three principles above. The final result is an integrated framework for a user to cost-effectively process DAG jobs by renting typical cloud instances from major IaaS providers.
4 (Near-)Optimal Instance Allocation
In this section, we consider a special case of jobs, i.e., each job is a chain of tasks, where, for all i ≥ 2, the execution of the i-th task can begin if and only if the first i − 1 tasks have been finished. We propose a framework to design (near-)optimal parametric policies that effectively realize Principles 3.1-3.3. In the next section, we will use an existing transformation technique to extend the framework to the case where the precedence constraints form a general DAG.
4.1 Spot and On-demand Instances
In this subsection, we consider the case in which a user has no self-owned instances. We derive a couple of optimal parametric policies, parameterized by the availability of spot instances, that maximize the utilization of spot instances and realize Principles 3.2 and 3.3 optimally.
Consider a job j with a chain of n_j tasks to be processed in the time window [a_j, d_j]. While processing these tasks, one question is what deadline d_i should be associated with each task i so that the later tasks have a large enough window in which to finish. For all i, it is expected that d_i is also the time point at which task i is finished. To respect the precedence constraints among the tasks, the execution of the i-th task can begin when the (i−1)-th task is finished. Thus, task i is expected to be executed in [d_{i−1}, d_i], where trivially d_0 = a_j and d_{n_j} = d_j, and we have st_i = d_{i−1}. (4)
The other question is, given the time window [d_{i−1}, d_i] of task i, what the optimal composition of instance types (i.e., the values of o_i and s_i) is that maximizes the amount of workload processed by spot instances.
For example, let us consider a job of four tasks with task sizes z_1, …, z_4 and parallelism bounds δ_1, …, δ_4, to be processed in a window [0, d] with spot availability β. Suppose we artificially split the window evenly, setting the deadline of the i-th task to d_i = i·d/4, so that each task i is finished at the time point d_i. In this setting, the amount of workload processed by spot instances is 2, which is illustrated in Fig. 3. However, as seen later, the optimal amount of workload processed by spot instances is larger when the values of d_1, …, d_4 are set properly. In the rest of this subsection, for an arbitrary job j, we derive a computationally efficient yet optimal allocation of the deadlines d_1, …, d_{n_j} to its tasks. Additionally, we derive the expected-optimal composition of instance types to finish each task i, which is in fact one enabler of the optimal deadline allocation. In this subsection, we have R = 0, and the number of self-owned instances assigned to each task is zero, i.e., r_i = 0.
4.1.2 Allocation to a Single Task
Suppose that the deadlines d_1, …, d_{n_j} are given in advance. In this subsection, we give the expected-optimal composition of instance types for a single task i to utilize spot and on-demand instances in the predefined time window [d_{i−1}, d_i], where st_i = d_{i−1}. This realizes Principle 3.2 optimally.
The instance allocation process is described in Definition 3.2. Now, we give a condition under which task can be expected to be finished by utilizing spot instances alone, without utilizing costly on-demand instances. We also derive the expected optimal strategy for task to utilize different types of instances.
Proposition 4.1. A task i can be finished by utilizing spot instances alone when the time window size T_i = d_i − st_i satisfies the following condition: T_i ≥ z_i/(β·δ_i). (6)
The expected optimal strategy of utilizing spot and on-demand instances is as follows:
if condition (6) holds, then it is expected that the turning point does not exist; the task can be finished by requesting spot instances alone, without on-demand instances;
if z_i/δ_i < T_i < z_i/(β·δ_i), then the instance allocation process is expected to have two phases, and in the first phase both spot and on-demand instances are requested with o_i + s_i = δ_i;
if T_i = z_i/δ_i, then it is expected that the turning point is τ_i = st_i, and we have o_i = δ_i and s_i = 0.
Proof. See Appendix A.1. ∎
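The case analysis above can be summarized as a classification of the window size against the thresholds z/δ (the minimum execution time (1)) and z/(βδ) (condition (6)); the function and case names below are ours, and the thresholds follow our reading of the proposition.

```python
def regime(z, delta, beta, T, eps=1e-9):
    """Which case applies to a task with window size T."""
    t_min = z / delta               # minimum execution time, by (1)
    t_spot = z / (beta * delta)     # spot-only threshold, condition (6)
    if T < t_min - eps:
        return "infeasible"         # cannot finish even at full parallelism
    if T >= t_spot - eps:
        return "spot-only"          # turning point not expected to exist
    if abs(T - t_min) <= eps:
        return "on-demand-only"     # turning point at the window start
    return "two-phase"              # spot first, then on-demand

print(regime(z=4, delta=2, beta=0.5, T=5))  # spot-only
print(regime(z=4, delta=2, beta=0.5, T=3))  # two-phase
print(regime(z=4, delta=2, beta=0.5, T=2))  # on-demand-only
```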
4.1.3 Optimal Deadline Allocation
In this subsection, we realize Principle 3.3 optimally. A job j is to be executed in the time window [a_j, d_j]. Our question is to find an optimal allocation of the deadlines d_1, …, d_{n_j} that maximizes the utilization of cheap spot instances and minimizes the consumption of costly on-demand instances.
Formulation as an Integer Linear Program. We formulate the deadline allocation problem as an integer linear program. To ensure that each task i can be finished in its time window [d_{i−1}, d_i], we have T_i = d_i − d_{i−1} ≥ z_i/δ_i, where z_i/δ_i is the minimum execution time of task i by (1). The deadline d_i can be written: d_i = a_j + Σ_{l=1}^{i} T_l. (7)
With the strategies in Proposition 4.1, the total amount of workload processed by spot instances has the following relation with the time window size T_i.
Proposition 4.2. Given the time window size T_i ≥ z_i/δ_i, the expected amount of workload processed by spot instances is u_i(T_i) = min{ z_i, β·δ_i·(T_i − z_i/δ_i)/(1 − β) }.
Proof. See Appendix A.2. ∎
Here, u_i(T_i) = z_i when T_i ≥ z_i/(β·δ_i). For each task i, Propositions 4.1 and 4.2 show (i) the minimum time window size needed to finish the task by utilizing only spot instances, and (ii) how the amount of workload processed by spot instances varies with the time window size and the task characteristics.
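Under this reading, the expected spot-processed workload is piecewise linear in the window size: zero at the minimum window z/δ, growing with a slope proportional to δ, and saturating at z once the window reaches z/(βδ). A sketch of this reconstructed form (names are ours):

```python
def spot_workload(z, delta, beta, T):
    """Expected workload processed by spot instances in a window of size T."""
    if T <= z / delta:
        return 0.0                  # no slack beyond the minimum window
    return min(z, beta * delta * (T - z / delta) / (1 - beta))

print(spot_workload(z=4.0, delta=2, beta=0.5, T=4))  # 4.0 (saturated)
print(spot_workload(z=4.0, delta=2, beta=0.5, T=3))  # 2.0
```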
Our objective is to find an allocation of the deadlines d_1, …, d_{n_j} that maximizes the utilization of spot instances. This is formulated as an integer linear program below: maximize Σ_{i=1}^{n_j} u_i(T_i) subject to Σ_{i=1}^{n_j} T_i ≤ d_j − a_j and T_i ≥ z_i/δ_i for all i. (10)
Solution. Now, we derive a computationally efficient yet optimal solution to the integer linear program (10). By Proposition 4.2, we have the following observation. While the time window size T_i ranges in [z_i/δ_i, z_i/(β·δ_i)], the workload of task i processed by spot instances grows linearly in T_i; the larger the parallelism bound δ_i, the larger the slope. Once T_i exceeds z_i/(β·δ_i), the workload does not increase any more. We can thus propose a greedy strategy that optimally determines the allocation of window sizes to tasks, which is presented in Algorithm 1, referred to as Dealloc. Algorithm 1 gives the optimal values of T_1, …, T_{n_j}, from which we derive the optimal values of d_1, …, d_{n_j} by (4) and (7).
The idea of Dealloc is as follows. Let i_1, …, i_{n_j} be an ordering of the tasks such that δ_{i_1} ≥ δ_{i_2} ≥ ⋯ ≥ δ_{i_{n_j}}. The algorithm considers tasks in non-increasing order of their parallelism bounds (line 3) and allocates as much time as possible to the tasks with the largest parallelism bounds. Specifically,
Each task i is initially allocated a time window of size z_i/δ_i to guarantee that it can be finished in the allocated window (line 1).
The remaining time is then allocated to the first k tasks with the largest parallelism bounds, where k ≤ n_j:
if k > 1, each task i_l with l < k has its time window enlarged to the saturation size z_{i_l}/(β·δ_{i_l}) (lines 4-5), and task i_k receives whatever time remains (lines 6-7);
if k = 1, the first task i_1 simply receives all the remaining time (lines 6-7).
Proposition 4.3. Dealloc gives an optimal solution to the integer linear program (10). Proof. See Appendix A.3. ∎
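A continuous-relaxation sketch of the greedy idea behind Dealloc (interface and numbers are ours): give each task its minimum window, then pour the slack into tasks in non-increasing order of parallelism bound, capping each window at its saturation size z/(βδ).

```python
def dealloc(tasks, window, beta):
    """tasks: list of (z, delta); window: total time d_j - a_j.
    Returns the allocated window sizes T_1, ..., T_n (continuous sketch)."""
    T = [z / delta for z, delta in tasks]        # minimum feasible windows
    slack = window - sum(T)
    assert slack >= 0, "job cannot meet its deadline"
    order = sorted(range(len(tasks)), key=lambda i: -tasks[i][1])
    for i in order:                              # largest delta first
        z, delta = tasks[i]
        extra = min(slack, z / (beta * delta) - T[i])  # cap at saturation
        T[i] += extra
        slack -= extra
    return T

tasks = [(4, 2), (2, 1), (3, 3), (1, 1)]         # hypothetical (z, delta)
print(dealloc(tasks, window=10, beta=0.5))       # [4.0, 3.0, 2.0, 1.0]
```

Here the task with the largest bound (δ = 3) is saturated first, then the remaining slack flows to the tasks with smaller bounds.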
Example. Now, we continue the example in Section 4.1.1 and show, by Algorithm 1 and Proposition 4.1, the expected-optimal deadline and instance allocation for the job, which is illustrated in Fig. 4. Under the optimal deadline allocation, the first task requests two spot instances in the first phase of its allocation and two on-demand instances in the second phase; the second task simply requests one on-demand instance; the third requests three spot instances; the fourth requests one on-demand instance. The resulting amount of workload processed by spot instances exceeds the 2 units obtained by the artificial allocation in Section 4.1.1.
4.2 Incorporating Self-Owned Instances
In this section, we extend the framework of Section 4.1 to the case with additional self-owned instances.
4.2.1 Allocation of Self-owned Instances
In this subsubsection, we consider the allocation of self-owned instances to a single task i to be finished in [st_i, d_i]. We give a policy that realizes Principle 3.1 effectively. Specifically, the policy needs to guarantee that (i) self-owned instances are fully utilized by tasks and (ii) in the meantime, the overall opportunity of all tasks to utilize spot instances is maximized. Like [10, 12], in the subsequent analysis, the issue of rounding the allocations of a job to integers is temporarily ignored for simplicity; in practice, we can round the allocations to integers without much affecting the effectiveness of our conclusions, as shown by our experiments.
We will use a common parameter θ to determine the amount of self-owned instances allocated to each task i. The allocation is defined via a function f_i(θ), which relates to the characteristics of task i and its window size T_i = d_i − st_i, as follows: when T_i ≥ z_i/(θ·δ_i), f_i(θ) = 0; when z_i/δ_i ≤ T_i < z_i/(θ·δ_i), f_i(θ) = (z_i − θ·δ_i·T_i)/((1 − θ)·T_i), where z_i/δ_i is given in (1). The value of θ ranges in (0, 1). We refer to the parameter θ as the sufficiency index of self-owned instances. As we will see, given a set of jobs arriving over time, the value of θ is small if self-owned instances are sufficient and large otherwise.
Proposition 4.4. The function f_i(θ) has the following properties:
f_i(β) is the minimum number such that, after task i is allocated f_i(β) self-owned instances, it is expected that task i can be finished in [st_i, d_i] by requesting spot instances alone, without utilizing costly on-demand instances;
f_i(θ) is non-increasing in θ.
Proof. See Appendix A.4. ∎
Now, we introduce the policy. Let R(t) denote the number of self-owned instances available at time t, and let R_i be the maximum number of self-owned instances that are available throughout the entire time interval [st_i, d_i], i.e., R_i = min_{t ∈ [st_i, d_i]} R(t).
The number of self-owned instances allocated to task i is defined as follows: r_i = min{ f_i(θ), R_i, δ_i }. (12)
Task i can use these instances in the period [st_i, d_i].
We show by Proposition 4.4 that the policy (12) can effectively realize Principle 3.1. Recall that β represents the availability of spot instances and f_i(θ) is non-increasing in θ. In the case that sufficient self-owned instances are available, we can set θ to a value smaller than β, so that each task is assigned more than f_i(β) self-owned instances; as a result, all tasks can be expected to finish by utilizing spot instances alone, without consuming costly on-demand instances. In the meantime, by setting θ to a properly small value, we can guarantee that self-owned instances are fully utilized, since a large number of them is allocated to each task.
In the case that self-owned instances are insufficient, we can set to a value larger than , and each task is assigned less than self-owned instances; here, all tasks are expected to consume some costly on-demand instances. No tasks are assigned more than self-owned instances. Allocating more than self-owned instances to a task can lead to a waste of self-owned instances since they can be allocated to other tasks for processing the workload that will otherwise be processed by costly on-demand instances. Finally, as shown in the two cases above, if the policy (12) is used, no tasks are overly allocated and a balanced-allocation is achieved to well realize Principle 3.1. This will further be validated in our third experiment of Section 6.
4.2.2 Deadline Allocation
In the last subsubsection, we gave an explicit form of the policy for self-owned instances. Built on this policy, we derive in this subsubsection the expected optimal allocation of deadlines under some mild assumptions.
A task is allocated to utilize self-owned instances in . Task is divisible; afterwards, it can be viewed as a new task with a parallelism bound and a (remaining) workload/size , which will be processed by spot and on-demand instances alone. The number is defined in (12); within the parallelism bound, it is the minimum of and the maximum number of self-owned instances available in . When a CSP has sufficient self-owned instances, is set to a value smaller than , and a task is expected to be assigned more than self-owned instances. When a CSP has insufficient self-owned instances, is set to a value larger than , and a task is expected to be assigned fewer than self-owned instances.
In either case, by choosing a suitably large or small value for , can equal or be close to . Thus, for analytical tractability, we assume that each task is assigned self-owned instances to be utilized in , although the policy actually used in our framework is defined by (12). This helps obtain an informed policy for allocating deadlines to the tasks of a job. The effectiveness of the resulting policy will be further validated by our experiments (see Experiments 2 and 3 in Section 6.2). Each job is assigned a specific . Depending on the relation between the availability of spot instances and the sufficiency index of self-owned instances, we have the following conclusion on the amount of workload processed by spot instances after each task is allocated self-owned instances.
Depending on the time window size , in the case that , we have
In the case that , we have
See Appendix A.5. ∎
Proposition 4.5 has the following implications. Regardless of the relation between and , the workload processed by spot instances is linearly proportional to the parallelism bound and the additional time available for executing task , up to some threshold, after which the workload remains constant and stops increasing with . This is the same as in the case of Section 4.1 where only spot and on-demand instances are utilized. Thus, in the case with self-owned instances (i.e., ), we can still apply Algorithm 1 to determine the optimal allocation of deadlines; the specific way is presented in lines 1-5 of Algorithm 2.
4.3 Summarizing Deadline and Instance Allocation
In this subsection, we summarize the process of allocating instances to a chain of tasks.
As the time horizon expands, at every moment we check whether specific events are triggered and take the corresponding allocation actions, as presented in Algorithm 2. Generally, when a job arrives, we first determine its deadline allocation. For all , its -th task can be executed once its preceding tasks, if any, have been finished. The tasks are executed one by one. In particular, when , job arrives and we first determine the allocation of deadlines to its tasks (lines 1-5): when only on-demand and spot instances are utilized, execute lines 1-3 since ; otherwise, execute lines 1-5 since .
Recall that . For all , when , either job has just arrived if or the ()-th task has been finished if ; then, the execution of the -th task begins and we determine the instance allocation to task (lines 6-15). In the case that there are self-owned instances (i.e., ), when , task is first allocated self-owned instances in , where is given in (12) (lines 6-8); otherwise, (lines 9-10). If , task can be viewed as a new task with a reduced parallelism bound and task size that will be processed by spot and on-demand instances alone. Excluding the workload, if any, processed by self-owned instances, the remaining workload of task to be processed at time is denoted by . While task is being executed at time , if , no actions are taken to request spot or on-demand instances since the current allocation of instances is enough to finish task ; otherwise, we have
if there is flexibility for task to utilize spot instances at time by Definition 3.1, request to utilize spot instances (lines 12-13).
otherwise, there is no such flexibility and is the turning point of task ; by Definition 3.2, stop requesting spot instances and turn to utilize on-demand instances in (lines 14-15).
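The spot-versus-on-demand decision in lines 11-15 can be illustrated with a small sketch. The slack test below is only an assumed stand-in for the flexibility condition of Definition 3.1, not the paper's exact rule, and the function name is hypothetical:

```python
def choose_instance_type(remaining_work, parallelism_bound, time_left):
    # Hypothetical stand-in for lines 11-15 of Algorithm 2: keep requesting
    # spot instances while slack remains; at the turning point (when even
    # full parallelism barely meets the deadline), switch to on-demand.
    if remaining_work <= 0:
        return "none"          # current allocation already suffices
    if remaining_work < parallelism_bound * time_left:
        return "spot"          # flexibility remains (cf. Definition 3.1)
    return "on-demand"         # turning point reached (cf. Definition 3.2)
```

For instance, a task with 10 units of remaining work, bound 8 and 5 time units left still has slack and bids for spot; with 40 units left it must go on-demand.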
5 Online Learning for Generalized Case
In the last section, we proposed a series of parametric policies for allocating instances to a chain of tasks, which are the core technical contribution of this paper. Supported by two existing techniques adopted directly from [15, 10], we can further obtain an integrated framework to process general DAG jobs, which is of great interest in practice. In this section, we briefly introduce the two techniques, although they are not the main contribution of this paper. Their formal description is given in Appendix B.
Job Transformation. The technique of Nagarajan et al.  is used to transform a general DAG job to a virtual job with a chain precedence constraint, also called a pseudo-job. Any feasible schedule of the pseudo-job is also a feasible schedule of the DAG job , with their parallelism, precedence and deadline constraints respected. While transforming to , the high-level idea is as follows. Consider a virtual schedule of , also called a pseudo-schedule: each task of is allocated instances and executed as early as possible. Each pseudo-task of consists of parts of the tasks of that are executed in the same time interval. There are multiple time intervals between the starting and completion times of the pseudo-schedule. These intervals correspond to multiple pseudo-tasks that form a pseudo-job with a chain precedence constraint.
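The interval construction can be sketched as follows, assuming each task of runs at full parallelism for its minimum execution time; `durations` (task to execution time) and `preds` (task to list of predecessors) are hypothetical names:

```python
def chain_decompose(durations, preds):
    # Pseudo-schedule: run every task as early as possible (after all of
    # its predecessors), then slice the timeline at every start/finish
    # point; each resulting interval becomes one pseudo-task of the chain.
    finish = {}

    def finish_time(v):
        if v not in finish:
            start = max((finish_time(u) for u in preds.get(v, [])), default=0)
            finish[v] = start + durations[v]
        return finish[v]

    for v in durations:
        finish_time(v)
    points = sorted({finish_time(v) - durations[v] for v in durations}
                    | {finish_time(v) for v in durations})
    return list(zip(points, points[1:]))
```

For example, tasks a and b with no predecessors and a task c depending on both yield pseudo-tasks over the intervals delimited by all start and finish times of the as-early-as-possible schedule.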
Learning the Optimal Parameters. The online learning algorithm (TOLA) of Menache et al.  is adapted to learn the most cost-effective parametric policy.
Each job is associated with a particular parametric policy that is defined by a tuple of parameters . The parameter represents the availability of spot service while indicates the sufficiency of self-owned instances. When a job arrives, and are used to determine the deadline allocation via the lines 1-5 of Algorithm 2. The value of may only depend on the system dynamics, independent of the behavior of an individual user; this is the case of Google Cloud. Besides the system dynamics, it may also relate to the bid price of a user; this is the case of Amazon EC2 and Microsoft Azure. Then, a user needs to bid a price to request spot instances; its jobs fail to get instances when either is lower than the spot price at a moment or the system reclaims the allocated instances. In this case, we need to learn the best bid price against the spot price dynamics. In the case of Google Cloud, no bid is required and we simply set to a null value.
There is a set of tuples , each representing one policy. The high-level idea of TOLA is as follows. There is an initial probability distribution over the policies. Whenever a job arrives at time , a policy is randomly chosen from according to the distribution; it determines the actual allocation of instances to the job and the actual cost of completing . On the other hand, given an arbitrary policy, the cost of completing an arbitrary job depends on the fixed on-demand price and the variable spot prices in . At time , for the past jobs whose deadlines are no larger than , we can derive their costs under each policy of since we know the spot prices in . We choose one such job that is so far unexamined and examine its cost under each policy; then the distribution is updated at time such that the lower-cost (resp. higher-cost) policies of this job are re-assigned enlarged (resp. reduced) probabilities.
As the time horizon expands, the probability distribution is updated repeatedly, and the most cost-effective policies of , i.e., the ones with the highest probabilities, are gradually identified. In the meantime, as more and more jobs are processed, the actual cost of completing all jobs approaches the cost of completing all jobs under the best policy of .
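The distribution update can be sketched as a multiplicative-weights step, which is one standard way to realize the enlarge-low-cost, shrink-high-cost rule above; the learning rate `eta` and the exact exponential form are assumptions, not necessarily TOLA's precise update:

```python
import math

def update_distribution(probs, costs, eta=0.5):
    # Multiplicative-weights-style update: each policy's probability is
    # scaled down exponentially in its observed cost, then renormalized,
    # so lower-cost policies end up with larger probabilities.
    weights = [p * math.exp(-eta * c) for p, c in zip(probs, costs)]
    total = sum(weights)
    return [w / total for w in weights]
```

Starting from a uniform distribution and repeatedly feeding in observed per-policy costs concentrates probability on the cheapest policies.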
6 Evaluation
The main aim of our evaluations is to show the effectiveness of the policies proposed in this paper.
6.1 Simulation Setups
In alignment with best practices in prior art [10, 15, 12], jobs are generated as follows. The on-demand price is normalized to 1. Job arrivals follow a Poisson process with a mean of 4. The number of tasks in a job is randomly set to or . The order of generating tasks is also the topological order of the tasks in the graph. For any two tasks and , a precedence constraint is added with probability 0.5. To ensure connectivity, for all , a task without successors is randomly connected to one of the later tasks as its successor; for all , a task without predecessors is randomly connected to one of the earlier tasks as its predecessor. The parallelism bound of a task is randomly set to 8 or 64. The minimum execution time of every task follows a bounded Pareto distribution with a shape parameter , a scale parameter and a location parameter ; the minimum and maximum values of are set to 2 and 10. The task size is .
For each DAG job , we compute its critical path and denote its length by , which is the minimum execution time needed to finish . The job’s relative deadline is set to , where
is uniformly distributed over. represents jobs’ flexibility and determines their capability to utilize spot instances; it is a main factor that determines the performance. In this paper, we consider four types of jobs with different levels of time flexibility, and the 1st, 2nd, 3rd and 4th types of jobs respectively have
. Each DAG job is transformed into a simpler job with chain-like precedence constraints, after which the various policies are applied to the simplified job. Spot prices are modeled by a bounded exponential distribution: each unit of time is divided into 12 equal time slots, and the spot price is updated per slot; the mean of the distribution is set to 0.13, and its lower and upper bounds are set to 0.12 and 1.
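The two bounded distributions in this setup can be sampled as follows (a sketch; `alpha` stands for the shape parameter, whose value is elided above):

```python
import random

def bounded_pareto(alpha, lo, hi, rng=random):
    # Inverse-CDF sampling of a Pareto distribution truncated to [lo, hi].
    u = rng.random()
    return (lo ** -alpha - u * (lo ** -alpha - hi ** -alpha)) ** (-1.0 / alpha)

def bounded_exponential(mean, lo, hi, rng=random):
    # Rejection sampling: redraw until the exponential sample lies in range.
    while True:
        x = rng.expovariate(1.0 / mean)
        if lo <= x <= hi:
            return x

# Per the setup: task minimum execution times lie in [2, 10], and spot
# prices lie in [0.12, 1] with mean 0.13.
```

Both samplers return values strictly inside their bounds, matching the truncation described in the text.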
Proposed Policies. The parametric policy is described in Section 5. , and are chosen respectively from , , and . When only spot and on-demand instances are considered, the set of policies is set to
When there are also self-owned instances, the set of policies is set to
Benchmark Policies. The benchmark policies are used as baselines to measure the performance of the proposed policies. Our analysis in Proposition 4.1 shows that an intuitive policy can achieve the expected optimal utilization of spot instances. For comparison, the benchmark policies include (i) a naive policy for allocating the time windows in which tasks are executed and (ii) a naive policy for allocating self-owned instances. We evaluate two possible naive policies for time-window allocation, where the first can be applied only to spot and on-demand instances and the second will also be applied to self-owned instances:
As the time horizon expands, a job simply bids for spot instances for each of its tasks until the length of the critical path for processing the remaining workload of tasks is no less than the remaining time window size; afterwards, we simply use on-demand instances for processing the remaining workload of each task .
Upon arrival of a job, we specify a series of consecutive time windows in which its tasks are executed and finished. Each task has a time window size . The remaining time is evenly allocated among the tasks, and we set to .
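The even split of the second benchmark can be sketched directly (the function name is hypothetical):

```python
def even_deadlines(arrival, deadline, n_tasks):
    # Benchmark 'even' policy: divide the job's whole time window into
    # n_tasks consecutive windows of equal length, one per chain task.
    slot = (deadline - arrival) / n_tasks
    return [(arrival + i * slot, arrival + (i + 1) * slot)
            for i in range(n_tasks)]
```

For a job arriving at time 0 with deadline 12 and three tasks, this yields the windows (0, 4), (4, 8) and (8, 12).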
The naive policy for self-owned instances allocates as many self-owned instances as possible to each task in a first-come-first-served manner, subject to the number of self-owned instances available. Specifically, upon arrival of a job, if the time windows of its tasks are specified, we allocate as many self-owned instances as possible to each task within its parallelism bound, i.e.,
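A sketch of this first-come-first-served rule, ignoring the release of instances over time for brevity:

```python
def naive_self_owned(parallelism_bounds, capacity):
    # Give each task, in arrival order, as many self-owned instances as its
    # parallelism bound permits, until the shared pool is exhausted.
    allocations = []
    for bound in parallelism_bounds:
        n = min(bound, capacity)
        allocations.append(n)
        capacity -= n
    return allocations
```

With bounds [8, 64, 8] and 70 instances available, the first two tasks drain the pool and the third receives nothing, illustrating how this greedy rule can starve later tasks.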
The set of benchmark policies is parameterized and defined as
Performance Metric. The objective of this paper is minimizing the cost of finishing a set of jobs that arrive over time. Each job is processed under a proposed or benchmark policy, indexed by . There are three types of jobs to be evaluated. Let denote the total workload of job that consists of tasks, i.e., . Let denote the cost of completing under the policy . When there are self-owned instances and the -th type of jobs are processed, the average unit cost of processing jobs under a policy , denoted by , is defined as the ratio of the total cost of utilizing various instances to the processed workload of jobs:
When a fixed policy is applied to all jobs, we use (resp. ) to denote the minimum of the average unit costs of our proposed policies (resp. the benchmark policies):
To measure the effectiveness of our proposed policies over the benchmark policies, we define a metric, called cost improvement, as follows:
represents how much cost is saved by using our proposed policies, compared with the benchmark policies. For example, when , the cost of our proposed policies is only half the cost of the benchmark policies.
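Since the formulas themselves are elided above, the two metrics can be sketched in one plausible form consistent with the surrounding text (under which a value of 0.5 indeed means the proposed cost is half the benchmark cost):

```python
def average_unit_cost(costs, workloads):
    # Ratio of the total cost of all instances consumed to the total
    # workload processed across jobs.
    return sum(costs) / sum(workloads)

def cost_improvement(proposed_cost, benchmark_cost):
    # Fraction of cost saved by the proposed policies relative to the
    # benchmark policies; 0.5 means the proposed cost is half as large.
    return (benchmark_cost - proposed_cost) / benchmark_cost
```

For example, total costs of 30 over a workload of 15 give an average unit cost of 2, and halving the benchmark cost gives a cost improvement of 0.5.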
Furthermore, in this paper, the policies of a set are associated with a probability distribution on which we base the selection of a policy for each arriving job. The online learning algorithm TOLA (i.e., Algorithm LABEL:Regret in Appendix B.2) is run to update the distribution, finally identifying the policy that generates the lowest cost. When TOLA is applied, we use (resp. ) to denote the average unit cost of processing all jobs if the set of policies is (resp. ), and the cost improvement is defined as follows:
represents the cost saving when online learning is applied.
Our simulations are run over about 10000 jobs. We will show the cost improvement of our proposed policies over the benchmark policies.
Experiment 1. We evaluate the effectiveness of the proposed deadline allocation algorithm (i.e., Algorithm 1) in the case that a user does not have any self-owned instances (i.e., ) and utilizes only spot and on-demand instances. The algorithm is compared with the greedy and even policies of Section 6.1. The corresponding results are listed in Table II. The cost improvement of our algorithm is significant, ranging from 15.23% to 27.10%. The improvement is especially strong when the population of jobs has little time flexibility, where the cost improvement reaches 27.10%. Since our proposed policy is expected to be optimal, we see in all cases that its cost is a lower bound on the cost of the other policies.
Experiment 2. We consider the case that a user also has some self-owned instances. Our proposed framework (i.e., Algorithm 2) contains policies for allocating both deadlines and self-owned instances, and we evaluate its overall effectiveness. The benchmark policies for comparison include the even policy for allocating deadlines and the naive policy for allocating self-owned instances. The corresponding results are listed in Table III. The cost improvement is significant, ranging from 37.22% to 62.73%. As a user has more self-owned instances, fewer spot and on-demand instances are consumed to complete all the jobs; thus, the more self-owned instances a user has, the larger their effect on the cost. With our proposed policies, the cost improvement increases as the number of self-owned instances grows from 300 to 1200.