Modeling Constrained Preemption Dynamics Of Transient Cloud Servers

11/12/2019 ∙ by Prateek Sharma, et al. ∙ Indiana University

In this paper, we conduct a first-of-its-kind empirical study and statistical analysis of the preemption behavior of Google's Preemptible VMs, whose distinguishing characteristic is a maximum lifetime of 24 hours. This temporal constraint introduces many challenges in preemption modeling, since existing memoryless models are not applicable. We introduce and develop a new probability model of constrained preemptions that is based on a large scale empirical study of over 1,500 VM preemptions. We place our preemption probability model in the framework of reliability theory and use insights from statistical mechanics to understand the general nature of constrained preemptions. To highlight the effectiveness of our model, we develop optimized policies for job scheduling and checkpointing for constrained preemptions. Compared to existing preemption modeling techniques, our model-based policies can reduce the running time of jobs on preemptible VMs by up to 5×, and reduce the probability of job failure by more than 2×. We also implement our policies as part of a batch computing service, which can reduce costs by 5× compared to conventional cloud deployments.


I Introduction

Transient cloud computing is an emerging and popular resource allocation model used by all major cloud providers, which allows unused capacity to be offered at low cost as preemptible virtual machines. Transient VMs can be unilaterally revoked and preempted by the cloud provider, and applications running inside them face fail-stop failures. Due to their volatile nature, transient VMs are offered at steeply discounted rates. Amazon EC2 spot instances spo (2013), Google Cloud Preemptible VMs pre , and Azure Low-priority Batch VMs azu are all examples of transient VMs, and are offered at discounts ranging from 50 to 90% compared to conventional, non-preemptible “on-demand” VMs.

To expand the usability and appeal of transient VMs, many systems and techniques have been proposed that seek to ameliorate the effects of preemptions and reduce the computing costs of applications. Fault-tolerance mechanisms Sharma et al. (2015); Marathe et al. (2014), resource management policies Sharma et al. (2017); Wieder et al. (2012), and cost optimization techniques Dubois and Casale (2016); Shastri and Irwin (2017) have been proposed for a wide range of applications, including interactive web services, distributed data processing, and parallel computing. These techniques have been shown to minimize the performance degradation and downtimes due to preemptions, and to reduce computing costs by up to 90%.

However, the success of these techniques depends on probabilistic estimates of when and how frequently preemptions occur. For instance, many fault-tolerance and resource optimization policies are parametrized by the mean time to failure (MTTF) of the transient VMs. A commonly used technique in transient computing is to periodically checkpoint application state, and the “optimal” checkpointing frequency that minimizes the total expected running time of a job depends on the MTTF of the VMs Daly (2006).

Past work on transient computing has focused on Amazon EC2’s spot instances, whose preemption characteristics are determined by dynamic prices (which are in turn set using a continuous second-price auction Ben-Yehuda et al. (2013)). Transiency-mitigation techniques such as VM migration Sharma et al. (2015), checkpointing Sharma et al. (2016a); Marathe et al. (2014), and diversification Sharma et al. (2017) all use price signals to model the availability and preemption rates of spot instances. However, these pricing-based models do not generalize to other transient VMs having a flat price (such as Google’s or Azure’s offerings). Moreover, no information about preemption characteristics is publicly available for these offerings, not even coarse-grained metrics such as MTTFs. This lack of information and understanding about preemption behavior precludes most failure modeling and transient computing optimizations.

To address this gap, we seek to understand the preemption characteristics of Google’s Preemptible VMs, whose distinguishing characteristic is that they have a maximum lifetime of 24 hours. We conduct a large empirical study of over 1,500 preemptions of Google Preemptible VMs, and develop an analytical probability model of preemptions. We find that the temporal constraint is a radical departure from pricing-based preemptions, and presents fundamental challenges in preemption modeling and their effective use.

Due to the temporal constraint on preemptions, classical models that form the basis of preemption modeling and policies, such as memoryless exponential failure rates, are not applicable. We find that preemption rates are not uniform, but bathtub shaped with multiple distinct temporal phases, and cannot be modeled by existing bathtub distributions such as Weibull. We capture these characteristics by developing a new probability model. Our model uses reliability theory principles to capture the 24-hour lifetime of VMs, and generalizes to VMs of different resource capacities, geographical regions, and across different temporal domains. To the best of our knowledge, this is the first work on constrained preemption modeling. Our investigation also points to an interesting connection to statistical mechanics (the Tonks gas model Tonks (1936)), which can be leveraged to obtain fresh insights for modeling temporally constrained preemptions.

We show the applicability and effectiveness of our model by developing optimized policies for job scheduling and checkpointing. These policies are fundamentally dependent on empirical and analytical insights from our model such as different time-dependent failure rates of different types of VMs. These optimized policies are a building block for transient computing systems and reducing the performance degradation and costs of preemptible VMs. We implement and evaluate these policies as part of a batch computing service, which we also use for empirically evaluating the effectiveness of our model and policies under real-world conditions.

Towards our goal of developing a better understanding of constrained preemptions, we make the following contributions:

  1. We conduct a large-scale, first-of-its-kind empirical study of preemptions of Google’s Preemptible VMs. We then show a statistical analysis of preemptions based on the VM type, temporal effects, geographical regions, etc. Our analysis indicates that the 24-hour constraint is a defining characteristic, and that the preemption rates are not uniform, but have distinct phases.

  2. We develop a probability model of constrained preemptions based on empirical and statistical insights that point to distinct failure processes underpinning the preemption rates. Our model captures the key effects resulting from the 24 hour lifetime constraint associated with these VMs, and we analyze it through the lens of reliability theory and statistical mechanics.

  3. Based on our preemption model, we develop optimized policies for job scheduling and checkpointing that minimize the total time and cost of running applications. These policies reduce job running times by up to 5× compared to existing preemption models used for transient VMs.

  4. We implement and evaluate our policies as part of a batch computing service for Google Preemptible VMs. Our service is especially suitable for scientific simulation applications, and can reduce computing costs by 5× compared to conventional cloud deployments, and reduce job failure probability by more than 2×.

II Background

We now give an overview of transient cloud computing, and the use of preemption models in transient computing systems.

II.1 Transient Cloud Computing

Infrastructure as a service (IaaS) clouds such as Amazon EC2, Google Public Cloud, Microsoft Azure, etc., typically provide computational resources in the form of virtual machines (VMs), on which users can deploy their applications. Conventionally, these VMs are leased on an “on-demand” basis: cloud customers can start up a VM when needed, and the cloud platform provisions and runs these VMs until they are shut down by the customer. Cloud workloads, and hence the utilization of cloud platforms, show large temporal variation. To satisfy user demand, cloud capacity is typically provisioned for the peak load, and thus the average utilization tends to be low, of the order of 25% Verma et al. (2015); Cortez et al. (2017).

To increase their overall utilization, large cloud operators have begun to offer their surplus resources as low-cost servers with transient availability, which can be preempted by the cloud operator at any time (after a small advance warning). These preemptible servers, such as Amazon Spot instances ec2 , Google Preemptible VMs pre , and Azure batch VMs azu , have become popular in recent years due to their discounted prices, which can be 7-10× lower than those of conventional non-preemptible servers. Due to their popularity among users, smaller cloud providers such as Packet pac and Alibaba ali have also started offering transient cloud servers.

However, effective use of transient servers is challenging for applications because of their uncertain availability Singh et al. (2014). Preemptions are akin to fail-stop failures, and result in loss of the application memory and disk state, leading to downtimes for interactive applications such as web services, and poor throughput for batch-computing applications. Consequently, researchers have explored fault-tolerance techniques such as checkpointing Sharma et al. (2016a); Marathe et al. (2014); Subramanya et al. (2015) and resource management techniques Sharma et al. (2017) to ameliorate the effects of preemptions. The effect of preemptions depends on the application’s delay insensitivity and fault model, and mitigating preemptions for different applications remains an active research area Joaquim et al. (2019).

II.2 Modeling Preemptions of Transient VMs

Underlying all techniques and systems in transient computing is the notion of using some probabilistic, or even deterministic, model of their preemptions. Such a preemption model is then used to quantify and analyze the impact of preemptions on application performance and availability, and to design model-informed policies that minimize the effect of preemptions. For example, the preemption rate or MTTF (Mean Time To Failure) of transient servers has found extensive use in selecting the appropriate type of transient server for applications Sharma et al. (2017); Subramanya et al. (2015), determining the optimal checkpointing frequency Sharma et al. (2016a); Marathe et al. (2014); Harlap et al. (2017); Ghit and Epema (2017), etc.

However, all prior work on transient computing has exclusively focused on Amazon’s EC2 spot instances. Launched in 2009, spot instances are the first example of transient cloud servers, and their low price (often 90% cheaper than equivalent on-demand instances) provided the motivation to develop optimized policies for reducing the impact of preemptions and the overall cost.

The preemptions of EC2 spot instances are based on their price, which is dynamically adjusted based on the supply and demand of cloud resources. Spot prices are based on a continuous second-price auction, and if the spot price increases above a pre-specified maximum price, then the server is preempted Ben-Yehuda et al. (2013). Thus, the time-series of spot prices can be used for understanding preemption characteristics such as the frequency of preemptions and the “Mean Time To Failure” (MTTF) of the spot instances. Publicly available historical spot prices (Amazon posts spot prices for the trailing 3 months, and researchers have been collecting these prices since 2010 Javadi et al. (2011)) have been used to characterize and model spot instance preemptions Sharma et al. (2015); Zheng et al. (2015); Shastri et al. (2016); Wolski and Brevik (2016). For example, past work has analyzed spot prices and shown that the MTTFs of spot instances of different hardware configurations and geographical zones range from a few hours to a few days Wolski et al. (2017a); Ouyang et al. (2016); Wolski and Brevik (2016); Baughman et al. (2018); Wolski et al. (2017b).

However, using pricing information for preemption modeling is not a generalizable approach and is not applicable to other types of transient cloud VMs such as Google Preemptible VMs and Azure Low-priority batch VMs. These VMs have flat pricing, and thus pricing cannot be used to infer preemptions, unlike in the case of EC2. Moreover, these cloud providers (Google and Azure) do not expose any public information about their preemption characteristics.

The total lack of information about preemption characteristics precludes the use of the vast array of optimizations and systems that have been developed to make transient computing more appealing to different kinds of applications. Therefore, in this paper, we seek to develop the first empirical model of preemptions of Google Preemptible VMs pre . Our empirical data and preemption model allow the development of preemption mitigation policies.

Google Preemptible VMs have a maximum lifetime of 24 hours, and this constrained preemption introduces new challenges in preemption modeling. Past work on failure modeling of EC2 spot instances has assumed preemptions to be memoryless and follow the exponential distribution Zheng et al. (2015); Sharma et al. (2016b, a); Ghit and Epema (2017). However, the 24 hour constraint precludes such memoryless assumptions and, as we see in the next section, requires new modeling techniques.

III Constrained Preemptions of Google Preemptible VMs

In this section, we first present an empirical analysis of preemptions of Google Preemptible VMs, and then develop a new probability model based on our observations. Finally, we discuss the unique aspects and general characteristics of constrained preemptions using reliability theory and statistical mechanics.

III.1 Empirical Study Of Preemptions

Figure 1: CDF of lifetimes of Google Preemptible VMs. Our proposed distribution for modeling the constrained preemption dynamics provides a better fit to the empirical data compared to other failure distributions. Inset shows the probability density functions.

To understand the nature of temporally constrained preemptions, we conducted the first empirical study of Google’s Preemptible VMs, which have a fixed price and a maximum 24 hour lifetime. Our empirical study is necessitated by the fact that the cloud operator (Google) does not disclose any other information about the preemption rates; thus relatively little is known about the preemptions of these VMs and, as a result, about their performance.

We launched 1,516 Google Preemptible VMs of different types over a two month period (Feb–April 2019), and measured their time to preemption (i.e., their useful lifetime). To ensure the generality of our empirical observations, VMs of different resource capacities were launched in four geographical regions; during days and nights and on all days of the week; and running different workloads. A sample of over 100 such preemption events is shown in Figure 1, which shows the cumulative distribution function (CDF) of the VM lifetimes of the n1-highcpu-16 VM in the us-east1-b zone. Note that the cloud operator (Google) caps the maximum lifetime of the VM to 24 hours, and all the VMs are preempted before that limit.

(a) Preemption characteristics of different VM types. Larger VMs are more likely to be preempted.
(b) Variations due to time of day and workload.
(c) n1-highcpu-16 in different regions.
Figure 5: Analysis of preemption characteristics by VM-type, region, time-of-day, and workload type.

Observation 1: The lifetimes of VMs are not uniformly distributed, but have three distinct phases.

In the first (initial) phase, characterized by VM lifetimes below about 3 hours, we observe that many VMs are quickly preempted after they are launched, and thus have a steep rate of failure. The rate of failure (preemption rate) is the derivative of the CDF. In the second phase, VMs that survive past 3 hours enjoy a relatively low preemption rate over a broad range of lifetimes (characterized by the slowly rising CDF in Figure 1). The third and final phase exhibits a steep increase in the number of preemptions as the preemption deadline of 24 hours approaches. The overall rate of preemptions is “bathtub” shaped, as shown by the solid black line in the inset of Figure 1 (discussed in detail below).

Observation 2: The preemption behavior, imposed by the constraint of the 24 hour lifetime, is substantially different from conventional failure characteristics of hardware components and EC2 spot instances.

In “classical” reliability analysis, the time to failure usually follows an exponential distribution with CDF $F(t) = 1 - e^{-\lambda t}$, where $\lambda = 1/\text{MTTF}$. Figure 1 shows the CDF of the exponential distribution when fitted to the observed preemption data, by finding the distribution parameter $\lambda$ that minimizes the least squares error. The classic exponential distribution is unable to model the observed preemption behavior because it assumes that the rate of preemptions is independent of the lifetime of the VMs, i.e., the preemptions are memoryless. This assumption breaks down when there is a fixed upper bound on the lifetime.

Observation 3: The three preemption phases and associated bathtub shaped preemption probability are general, universal characteristics of Preemptible VMs.

In general, the preemption dynamics of a VM are determined by the supply and demand of VMs of that particular type. Thus, our empirical study looked at preemptions of VMs of different sizes, in different geographical zones, at different times of the day, and running different workloads (Figure 5). In all cases, we find that there are three distinct phases associated with the preemption dynamics giving rise to the bathtub shaped preemption probability. We argue that this is not a coincidence, but may be a result of practical and fundamental outcomes of cluster management policies.

While the specific preemption policy is up to the cloud operator, we will show that the bathtub behavior has benefits for applications. For applications that do not incorporate explicit fault-tolerance (such as checkpointing), early preemptions result in less wasted work than if the preemptions were uniformly distributed over the 24 hour interval. Furthermore, the low rate of preemptions in the middle period allows jobs that are shorter than 24 hours to finish execution with only a low probability of failure, once they survive the initial preemption phase. We evaluate the performance of applications with bathtub shaped preemptions in Section VI. In addition to being beneficial to applications, we also conjecture that the bathtub behavior may be a fundamental and general characteristic of constrained preemptions, which we show later in Section III.3.

Observation 4: Larger VMs have a higher rate of preemptions.

Figure 5(a) shows the preemption data from five different types of VMs in the Google Cloud, n1-highcpu-{2,4,8,16,32}, where the number indicates the number of CPUs. All VMs are running in the us-central1-c zone. We see that the larger VMs (16 and 32 CPUs) have a higher probability of preemption compared to the smaller VMs. While this could simply be due to higher demand for larger VMs, it can also be explained from a cluster management perspective. Larger VMs require more computational resources (such as CPU and memory), and when the supply of resources is low, the cloud operator can quickly reclaim a large amount of resources by preempting larger VMs. This observed behavior aligns with the guidelines for using preemptible VMs, which suggest the use of smaller VMs when possible pre .

Observation 5: Preemptions exhibit diurnal variations, and are also affected by the workload inside the VM.

From Figure 5(b), we can see that VMs have a slightly longer lifetime during the night (8 PM to 8 AM) than during the day. This is expected because, fundamentally, the preemption rates are higher during periods of higher demand. We also notice that completely idle VMs have longer lifetimes than VMs running some workload. Presumably, this is because the lower resource utilization of idle VMs is more amenable to resource overcommitment, resulting in fewer preemptions.

Figure 6: QQ plot of different preemption models. Weibull and Gompertz-Makeham can model preemptions up to CDF=0.5, but not over the entire range.

III.2 Failure Probability Model

We now develop an analytical probability model for finding a preemption at time $t$ (the preemption dynamics) that is faithful to the empirically observed data and provides a basis for developing running-time and cost-minimizing optimizations. Modeling preemptions constrained by a finite deadline raises many challenges for existing preemption models that have been used for other transient servers such as EC2 spot instances. We first discuss why existing approaches to preemption modeling are not adequate, and then present our closed-form probability model and associated reliability theory connections.

III.2.1 Inadequacy of existing failure distributions

Spot instance preemptions have been modeled using the exponential distribution Zheng et al. (2015); Sharma et al. (2016b, a), which is the default in most reliability theory applications. However, the strict 24 hour constraint and the distinct preemption phases are not compatible with the memoryless property of the exponential distribution. To describe failures (preemptions) that are not memoryless (i.e., that have an increasing or decreasing failure rate over time), the classic Weibull distribution with CDF $F(t) = 1 - e^{-(t/\lambda)^k}$ is often employed. However, the Weibull distribution is also unable to fit the empirical data (Figure 1), and is especially unable to model the sharp increase in preemptions near the 24 hour deadline.

For constrained preemptions, the increase in failure rate as modeled by the Weibull distribution is not high enough. Other distributions, such as Gompertz-Makeham, have also been used for modeling bathtub behavior, especially for actuarial use-cases Missov and Lenart (2013). The key idea is to incorporate an exponential aging process, which is used to model human mortality. The CDF of the Gompertz-Makeham distribution is given by $F(t) = 1 - \exp\left(-\lambda t - \frac{\eta}{b}\left(e^{bt} - 1\right)\right)$; when fitted to the data in Figure 1, it is also unable to provide a good model for the observed preemption data.

The non-trivial bathtub-shaped failure rate of Google Preemptible VMs (Figure 1) requires models that capture the sudden onset of the rise in preemptions near the deadline, which is challenging for existing failure distributions because of the sharp inflection point. From an application and transiency policy perspective, the preemption model must provide insights about the phase transitions, so that the application can adapt to the sharp differences in preemption rates. For example, the preemption model should be able to warn applications about the impending deadline, which existing failure distributions cannot account for. Thus, it is important not only to minimize the total distribution fitting error, but also to capture the changes in phase. However, as we can see from the QQ plots in Figure 6, existing distributions are unable to capture the effects of the deadline and all the phases of the preemptions, and a new modeling approach is needed, which we develop next.

III.2.2 Our model

Our failure probability model seeks to address the drawbacks of existing reliability theory models for modeling constrained preemptions. The presence of three distinct phases exhibiting non-differentiable transition points (sudden changes in CDF near the deadline, for example) suggests that for accurate results, models that treat the probability as a step function (CDF as a piecewise-continuous function) could be employed. However, this limits the range of model applicability and general interpretability of the underlying preemption behavior. Our goal is to provide a broadly applicable, continuously differentiable, and informative model built on reasonable assumptions.

We begin by making a key assumption: the preemption behavior arises from the presence of two distinct failure processes. The first process dominates over the initial temporal phase and yields the classic exponential distribution that captures the high rate of early preemptions. The second process dominates over the final phase near the 24 hour maximum VM lifetime and is assumed to be characterized by an exponential term that captures the sharp rise in preemptions that results from this constrained lifetime.

Based on these observations, we propose the following general form for the CDF:

$F(t) = C\left[\left(1 - e^{-\lambda_1 t}\right) + e^{\lambda_2 (t - \tau)}\right]$   (1)

where $t$ is the time to preemption, $\lambda_1$ is the rate of preemptions in the initial phase, $\lambda_2$ is the rate of preemptions in the final phase, $\tau$ denotes the time that characterizes the “activation” of the final phase where preemptions occur at a very high rate, and $C$ is a scaling constant. The model is fit to data for $0 \le t \le T$, where $T = 24$ hours represents the temporal interval (deadline). Combinations of the 4 fit parameters ($\lambda_1$, $\lambda_2$, $\tau$, and $C$) are chosen to ensure that the boundary condition $F(T) = 1$ is satisfied. In practice, typical fits place the activation time $\tau$ within a few hours of the 24-hour deadline, with $\lambda_2 \gg \lambda_1$.

For most of its life, a VM sees failures according to the classic exponential distribution with a rate of failure equal to $\lambda_1$; this behavior is captured by the $C\left(1 - e^{-\lambda_1 t}\right)$ term in Equation 1. As VMs get closer to their maximum lifetime imposed by the cloud operator, they are reclaimed (i.e., preempted) at a high rate $\lambda_2$, which is captured by the second exponential term, $C e^{\lambda_2 (t - \tau)}$, of Equation 1. Shifting the argument ($t$) of this term by $\tau$ ensures that the exponential reclamation is only applicable near the end of the VM’s maximum lifetime and does not dominate over the entire temporal range.

The analytical model and the associated distribution function introduced above provide a much better fit to the empirical data (Figure 1) and capture the different phases of the preemption dynamics through the parameters $\lambda_1$, $\lambda_2$, $\tau$, and $C$. These parameters can be obtained for a given empirical CDF using least squares function fitting methods (we use scipy’s optimize.curve_fit with the dogbox technique sci ). The failure or preemption rate can be derived from this CDF as:

$f(t) = \frac{dF}{dt} = C\left[\lambda_1 e^{-\lambda_1 t} + \lambda_2 e^{\lambda_2 (t - \tau)}\right]$   (2)

Plotting $f(t)$ vs. $t$ yields a bathtub-type failure rate function for the associated fit parameters (inset of Figure 1).
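As a concrete illustration, the sketch below fits the CDF of Equation 1 to a set of VM lifetimes using scipy’s optimize.curve_fit with the dogbox technique, as described above. The synthetic data, the “true” parameter values, and the initial guesses are purely illustrative stand-ins for our measured fits.

```python
import numpy as np
from scipy.optimize import curve_fit

T = 24.0  # maximum VM lifetime (hours)

def cdf_model(t, lam1, lam2, tau, C):
    """Eq. 1: early exponential failures plus a deadline-activated rise."""
    return C * ((1.0 - np.exp(-lam1 * t)) + np.exp(lam2 * (t - tau)))

# Synthetic lifetimes drawn from the model itself (a stand-in for empirical
# measurements) via inverse-transform sampling on a dense grid. The "true"
# parameters below are illustrative; C is chosen so that F(T) is close to 1.
rng = np.random.default_rng(1)
grid = np.linspace(0.0, T, 10_000)
true = (0.3, 3.0, 23.7, 0.29)
lifetimes = np.sort(np.interp(rng.uniform(0, cdf_model(T, *true), 1500),
                              cdf_model(grid, *true), grid))
ecdf = np.arange(1, len(lifetimes) + 1) / len(lifetimes)

# Least-squares fit with the 'dogbox' technique mentioned in the text;
# bounds keep the rates positive and the activation time inside [0, T].
params, _ = curve_fit(cdf_model, lifetimes, ecdf,
                      p0=[0.5, 2.0, 22.0, 0.5],
                      bounds=([0.0, 0.0, 0.0, 0.0],
                              [np.inf, np.inf, T, 1.0]),
                      method="dogbox")
lam1, lam2, tau, C = params
```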

In the absence of any prior work on constrained preemption dynamics, our aim is to provide an interpretable model with a minimal number of parameters, that provides a sufficiently accurate characterization of observed preemptions data. Further generalization of this model to include more failure processes would introduce more parameters and reduce the generalization power.

III.2.3 Reliability Analysis

We now analyze and place our model in a reliability theory framework.

Expected Lifetime: Our analytical model also helps crystallize the differences in VM preemption dynamics, by allowing us to easily calculate their expected lifetime. More formally, we define the expected lifetime of a VM, $E[L]$, as:

$E[L] = \int_0^T t \, f(t) \, dt$   (3)

where $f(t)$ is the rate of preemptions of the VM (Equation 2).

This expected lifetime can be used in lieu of MTTF, for policies and applications that require a “coarse-grained” comparison of the preemption rates of servers of different types, which has been used for cost-minimizing server selection Sharma et al. (2016a).
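Continuing the earlier sketch, the expected lifetime of Equation 3 can be computed by numerically integrating $t f(t)$ over $[0, T]$; pdf_model below is an illustrative implementation of Equation 2, and the function names are our own.

```python
from scipy.integrate import quad
import numpy as np

def pdf_model(t, lam1, lam2, tau, C):
    """Eq. 2: f(t) = dF/dt for the CDF of Eq. 1."""
    return C * (lam1 * np.exp(-lam1 * t) + lam2 * np.exp(lam2 * (t - tau)))

def expected_lifetime(lam1, lam2, tau, C, T=24.0):
    """Eq. 3: numerically integrate t * f(t) over [0, T]."""
    val, _ = quad(lambda t: t * pdf_model(t, lam1, lam2, tau, C), 0.0, T)
    return val  # hours; usable in lieu of MTTF for coarse VM comparisons
```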

Hazard Rate: The hazard rate governs the dynamics of the failure (or survival) processes. It is generally defined as $h(t) = f(t)/S(t)$, often expressed via the following differential equation (rate law):

$\frac{dS(t)}{dt} = -h(t) \, S(t)$   (4)

where $S(t) = 1 - F(t)$ is the survival function associated with a CDF $F(t)$, and $f(t)$ is the failure probability function (rate) at time $t$. The survival function indicates the fraction of VMs that have survived until time $t$. The hazard rate can also be directly expressed in terms of the CDF as follows: $h(t) = F'(t)/(1 - F(t))$. The exponential distribution has a constant hazard rate $h(t) = \lambda$. The Gompertz-Makeham distribution has an increasing failure rate to account for the increase in mortality, and its hazard rate is accordingly non-uniform and given by $h(t) = \lambda + \eta e^{bt}$.

Since we model multiple failure rates and deadline-driven preemptions, our hazard rate is expected to increase with time. Defining the survival function for our model, $S(t) = 1 - F(t) = 1 - C\left[\left(1 - e^{-\lambda_1 t}\right) + e^{\lambda_2 (t - \tau)}\right]$, and using Eq. 4 yields the hazard rate associated with our model:

$h(t) = \frac{C\left[r_1(t) + r_2(t)\right]}{S(t)}$   (5)

where we have introduced $r_1(t) = \lambda_1 e^{-\lambda_1 t}$ and $r_2(t) = \lambda_2 e^{\lambda_2 (t - \tau)}$ to denote the rates of preemptions associated with the initial and final phases respectively.

Recall that the sharp increase in preemption rate only happens close to the deadline, which means that $\tau \gg 1/\lambda_2$. Thus, when $t \ll \tau$, we get $h(t) \approx C \, r_1(t)/S(t)$, mimicking the hazard rate for the classic exponential distribution. As $t$ approaches and exceeds $\tau$ (i.e., $t \gtrsim \tau$), the increase in the hazard rate due to the second failure process kicks in, accounting for the deadline-driven rise in preemptions. Note that our hazard rate satisfies $h(t) \ge 0$ for $0 \le t \le T$.
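For completeness, a one-function sketch of the hazard rate of Equation 5, built from the cdf_model and pdf_model helpers of the earlier sketches:

```python
def hazard_rate(t, lam1, lam2, tau, C):
    """Eq. 5: h(t) = f(t) / S(t), with S(t) = 1 - F(t)."""
    S = 1.0 - cdf_model(t, lam1, lam2, tau, C)   # survival function
    return pdf_model(t, lam1, lam2, tau, C) / S  # bathtub shaped over [0, T]
```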

III.3 Insights on the bathtub shaped distribution

For constrained preemptions, one might expect to see uniformly distributed preemptions with a probability $1/T$ over $[0, T]$. However, as our empirical analysis shows, the preemption distribution is bathtub shaped. Interestingly, we can show using exact analytical arguments that non-uniform, bathtub distributions are in fact a general characteristic of systems with constrained preemptions, modulo some assumptions.

Lemma: Consider $N$ randomly distributed preemptions over an interval $[0, T]$. Assume that each preemption takes $\delta$ time-units to perform, and preemptions cannot overlap, i.e., they occur in a mutually exclusive manner. Then, there exists $t^*$ such that $p(t^*) > 1/T$, where $p(t)$ is the probability of finding a preemption at time $t$.

We first make some preliminary remarks and introduce concepts necessary to complete the proof.

Firstly, mutual exclusion of preemptions implies that there is a finite non-zero waiting time between preemptions. For $N$ preemptions to occur within the interval, evidently, we must have $N\delta \le T$. Also, while $\delta > 0$, the time to perform a preemption is generally expected to be much smaller than the total time interval, $\delta \ll T$. The $N$ preemptions occupy a “temporal volume” of $N\delta$ (volume here represents the one-dimensional volume). We assume that while a preemption may start as late as $T - \delta$, the last preemption must finish by $T$. Thus, the amount of free or excluded “temporal volume” available within the constrained system is $V = T - N\delta$. The idea of excluded volume is central in physics and materials engineering, where it underpins the origin of entropic or steric forces in material systems Krauth (2006); Jing et al. (2015).

Secondly, we note that the system of $N$ preemptions within a constrained deadline of interval $T$ maps exactly to a well known and analytically solvable system in classical statistical mechanics, the Tonks gas model Tonks (1936), where one considers a system of hard spheres of diameter $\delta$ moving along a line segment of length $T$. The structural quantities associated with this system, including the probability of finding a sphere at position $x$ within the interval, are computed by evaluating the partition function of the system, which essentially measures the number of valid system configurations Krauth (2006). Employing this mapping and the associated statistical mechanics tools, the original model of non-overlapping (interacting) preemptions can be mapped to a system of overlapping (non-interacting) preemptions, each allowed to access an excluded volume of $V = T - N\delta$, and the number of valid configurations is given by the partition function $Z(N) = V^N/N!$. For the case of $N$ preemptions, we have $Z(N) = (T - N\delta)^N/N!$.

We are interested in calculating the probability that a preemption starts at time $t = T - \delta$, i.e., $p(T - \delta)$. Given that the time to perform the preemption is generally expected to be much smaller than the total time interval, $p(T - \delta)$ is the probability of finding a preemption near the deadline. The assumption of mutually exclusive preemptions implies that no other preemption can be found for $t > T - \delta$, that is, $p(t > T - \delta) = 0$. Hence, the remaining $N - 1$ preemptions must occur such that the last of those finishes by $T - \delta$ (the preemption at time $T - \delta$ essentially sets an effective deadline for the other preemptions). The number of ways this can happen is given by the partition function $Z(N - 1) = V'^{\,N-1}/(N-1)!$, where $V' = (T - \delta) - (N - 1)\delta = T - N\delta$ is the corresponding excluded temporal volume accessible to each of the $N - 1$ preemptions. It is interesting to note that this excluded volume is the same as that of the original $N$-preemption system: this fortuitous result arises because the decrease in available volume to place the preemptions is commensurate with the need to place $N - 1$ preemptions instead of $N$.

The probability is obtained as the ratio of the valid configurations given by the two partition functions computed above. That is, $p(T - \delta) = Z(N-1)/Z(N) = N/(T - N\delta) > 1/T$, since $N \ge 1$ and $\delta > 0$. Choosing $t^* = T - \delta$ completes the proof.

By symmetry arguments, the above lemma is in fact valid for both end points of the interval, i.e., $p(0) = p(T - \delta) > 1/T$. Thus, the probability of preemption is higher near the end points (deadline) than the average preemption probability of $1/T$, and we get a bathtub shaped distribution. Thus, the bathtub distribution can be considered to be a general artifact of constrained preemptions. Of course, the empirical preemption distribution is determined by the cloud platform’s policies and supply and demand, and we elaborate more about the generality of our model and observation in Section VIII.

For the above proof, we assumed that each preemption event occurs over a timespan of $\delta$, which is determined by the preemption warning that the cloud platform provides (which is 30 seconds for Google Preemptible VMs and 120 seconds for Amazon EC2 spot instances). Preempting a VM and reclaiming its resources involves manipulating the cluster-management state, and mutually exclusive preemptions may be convenient for cluster management, since serializing VM preemptions makes accounting and other cluster operations easier. From an application standpoint, non-overlapping preemptions are also beneficial, since handling multiple concurrent preemptions is significantly more challenging Sharma et al. (2017).
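The lemma can also be checked numerically. The following sketch places $N$ non-overlapping preemptions of width $\delta$ uniformly at random in $[0, T]$ (using the standard free-volume transformation for hard rods) and histograms the start times; the endpoint bins come out elevated relative to the middle, consistent with the bathtub shape. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, delta, trials = 24.0, 4, 0.5, 100_000

starts = np.empty((trials, N))
for i in range(trials):
    # Sample N points uniformly in the free volume T - N*delta, sort them,
    # then add back the excluded widths: a measure-preserving map onto
    # valid non-overlapping configurations (Tonks gas).
    y = np.sort(rng.uniform(0.0, T - N * delta, N))
    starts[i] = y + delta * np.arange(N)

# Start times live in [0, T - delta]; bins of width delta.
hist, _ = np.histogram(starts, bins=47, range=(0.0, T - delta), density=True)
print(f"first bin: {hist[0]:.4f}  middle bin: {hist[23]:.4f}  "
      f"last bin: {hist[-1]:.4f}")  # endpoint bins exceed the middle
```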

IV Application Policies For Constrained Preemptions

Having analyzed the statistical behavior of constrained preemptions and presented our probability model, we now examine how the bathtub shape of the failure rate impacts applications. Based on insights drawn from our statistical analysis and the model, we develop various policies for ameliorating the effects of preemptions. Prior work in transient computing has established the benefits of such policies for a broad range of applications. However, the constrained nature of preemptions introduces new challenges that do not arise in other transient computing environments such as Amazon EC2 spot instances, and thus new approaches are required.

IV.1 Impact On Running Time

When a preemption occurs during the job’s execution, it results in wasted work, assuming there is no checkpointing. This increases the job’s total expected running time, since it must restart after a preemption. The expected wasted work depends on two factors:

  1. The probability of the job being preempted during its execution.

  2. When during the execution the preemption occurs.

We can analyze the wasted work due to a preemption using the failure probability model. We first compute the expected amount of wasted work assuming the job faces a single preemption, which we denote by $W(T_R)$, where $T_R$ is the original job running time (without preemptions):

$W(T_R) = \int_0^{T_R} t \; p(t \mid t < T_R) \, dt$   (6)

where $p(t \mid t < T_R) = f(t)/F(T_R)$. Here, $F(T_R)$ is the probability that there is a preemption within time $T_R$, where $F$ is the CDF of Equation 1. $f(t)$ is the probability of a preemption at time $t$, and is given by the probability distribution function of Equation 2. We can therefore write the above equation as:

$W(T_R) = \frac{1}{F(T_R)} \int_0^{T_R} t \, f(t) \, dt$   (7)

We note that the integral is the same as the “expected lifetime” given by Equation 3 (with the upper limit $T$ replaced by $T_R$). The above expression for the expected waste given a single preemption can be used by users and application frameworks to estimate the increase in running time due to preemptions. The total running time (also known as makespan) of a job with preemptions is given by:

$E[T] = T_R + F(T_R) \, W(T_R)$   (8)

where $W(T_R)$ is given by Equation 7 and $F(T_R)$ by Equation 1. The above equation for $E[T]$ thus becomes:

$E[T] = T_R + \int_0^{T_R} t \, f(t) \, dt$   (9)

This expression for the expected running time assumes that the job will be preempted at most once. An expression that considers the higher-order terms and multiple job failures follows easily from the base case, but has relatively low practical value. The probability of multiple preemptions is low, and most transient computing systems seek to avoid repeated preemptions, discarding the job or moving it to on-demand VMs if multiple preemptions occur.

Consequences for applications: Based on our analysis, both the expected wasted time $W(T_R)$ and the expected running time depend on the length of the job for non-memoryless constrained preemptions. For memoryless exponential distributions, the expected waste reduces to a simple function of the failure rate alone, but this assumption is not valid for constrained preemptions, and thus job lengths must be considered when evaluating the suitability of Preemptible VMs.

Users and transient computing systems can use the expected running time analysis for scheduling and monitoring purposes. Since the preemption characteristics are dependent on the type of the VM and temporal effects, this analysis also allows principled selection of VM types for jobs of a given length. For instance, VMs having a higher initial rate of preemptions are particularly detrimental for short jobs, because the jobs will see high rate of failure and are not long enough to run during the VM’s stable period with low preemption rates. We evaluate the expected wasted time and running time for Google Preemptible VMs later in Section VI.
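As an illustration, Equations 7 and 9 can be evaluated numerically with the model helpers sketched in Section III.2 (cdf_model and pdf_model); the function names here are our own.

```python
from scipy.integrate import quad

def wasted_work(Tr, *params):
    """Eq. 7: expected wasted work given that a single preemption
    occurs within the job's running time Tr (hours)."""
    num, _ = quad(lambda t: t * pdf_model(t, *params), 0.0, Tr)
    return num / cdf_model(Tr, *params)

def expected_makespan(Tr, *params):
    """Eq. 9: E[T] = Tr + integral_0^Tr t f(t) dt (single-preemption model)."""
    num, _ = quad(lambda t: t * pdf_model(t, *params), 0.0, Tr)
    return Tr + num
```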

IV.2 Job Scheduling and VM Reuse Policy

Many cloud-based applications and services are long-running, and typically run a continuous sequence of tasks and jobs on cloud VMs. In the case of deadline-constrained bathtub preemptions, applications face a choice: they can either run a new task on an already running VM, or relinquish the VM and run the task on a new VM. This choice is important in the case of non-uniform failure rates, since the job’s failure probability depends on the “age” of the server. Because of the bathtub failure distribution, VMs enjoy a long period of low failure rates during the middle of their total lifespan. Thus, it is beneficial to reuse VMs for multiple jobs, and relinquishing VMs after every job completion may not be an optimal choice.

However, jobs launched towards the end of a VM’s life face a tradeoff. While they may start during a period of low failure rate, the sharp deadline-imposed increase in preemptions near 24 hours poses a high risk of failure, especially for longer jobs. The alternative is to discard the VM and run the job on a new VM. However, since newly launched VMs also have high preemption rates (and thus high job failure probability), the choice of running the job on an existing server vs. a new server is not obvious.

Our job scheduling policy uses the preemption model to determine the preemption probability of a job of a given length $t$. Assume that the running VM’s age (time since launch) is $a$. Then, the probability of failure on the existing VM is $\frac{F(a+t) - F(a)}{1 - F(a)}$. The intuition is to reuse the VM only if the expected running time is lower than that of running on a new VM. To compute the expected running time of a job of length $t$ starting at VM age $a$, we modify our earlier expression for running time (Equation 9) to:

$E[T \mid a] = t + \frac{1}{1 - F(a)} \int_a^{a+t} (s - a) \, f(s) \, ds$   (10)

The alternative is to discard the VM and launch a new VM, in which case Equation 9 applies. Depending on the VM’s age $a$ and the job’s running time $t$, we can compare Equations 9 and 10, and run the job on whichever option yields the lower expected running time.
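A sketch of this decision rule, reusing the expected_makespan helper above; makespan_on_aged_vm follows the conditional-makespan form of Equation 10, and the function names are illustrative.

```python
from scipy.integrate import quad

def makespan_on_aged_vm(job_len, age, *params):
    """Eq. 10: expected makespan of a job of length job_len (hours)
    started on a VM that has already survived to `age` hours."""
    surv = 1.0 - cdf_model(age, *params)
    lost, _ = quad(lambda s: (s - age) * pdf_model(s, *params),
                   age, age + job_len)
    return job_len + lost / surv

def should_reuse_vm(job_len, age, *params):
    """Reuse the running VM only if its conditional expected makespan
    (Eq. 10) does not exceed that on a freshly launched VM (Eq. 9)."""
    return (makespan_on_aged_vm(job_len, age, *params)
            <= expected_makespan(job_len, *params))
```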

IV.3 Checkpointing Policy

A common technique for reducing the total expected running time of jobs on transient servers is to use fault-tolerance techniques such as periodic checkpointing Sharma et al. (2016a). Checkpointing application state to stable storage (such as network file systems or centralized cloud storage) reduces the amount of wasted work due to preemptions. However, each checkpoint entails capturing, serializing, and writing application state to a disk, and increases the total running time of the application. Thus, the frequency of checkpointing can have a significant effect on the total expected running time.

Existing checkpointing systems for handling hardware failures in high performance computing, and for cloud transient servers such as EC2 spot instances, incorporate the classic Young-Daly Dongarra et al. ; Daly (2006); Sharma et al. (2016a); Marathe et al. (2014) periodic checkpointing interval, which assumes that failures are exponentially distributed. That is, the application is checkpointed every $\sqrt{2 \, c \, \mathrm{MTTF}}$ time units, where $c$ is the time overhead of writing a single checkpoint to disk.
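For reference, a minimal sketch of this classic interval (function and argument names are illustrative):

```python
import math

def young_daly_interval(ckpt_cost, mttf):
    """Classic Young-Daly checkpoint period sqrt(2 * c * MTTF);
    both arguments must use the same time unit (e.g., hours)."""
    return math.sqrt(2.0 * ckpt_cost * mttf)

# e.g., a 1-minute checkpoint cost and a 1-hour MTTF (values from Section VI):
print(young_daly_interval(1.0 / 60.0, 1.0))  # ~0.18 hours, i.e. ~11 minutes
```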

However, checkpointing with a uniform period is sub-optimal in case of time dependent failure rates, and especially for bathtub failure rates. A sub-optimal checkpointing rate can lead to increased recomputation and wasted work, or result in excessive checkpointing overhead. Intuitively, the checkpointing rate should depend on the failure rate, and our analytical preemption model can be used for designing an optimized checkpointing schedule.

We now present our checkpointing policy that uses the preemption model and provides non-uniform, failure-rate dependent checkpointing. In a nutshell, our policy allows us to compute the optimal checkpointing schedule for jobs of different lengths and different starting times, employing a new dynamic programming approach that minimizes the total expected makespan.

Algorithm description: Let the uninterrupted running time of the job be $T_R$. For ease of exposition, we assume that each job-step takes one unit of time, yielding $T_R$ job-steps. Let the checkpoint cost be $c$; i.e., each checkpoint increases the running time by $c$. We seek to minimize the total expected running time, or makespan, which is the sum of $T_R$, the expected periodic checkpointing cost, and the expected recomputation.

The makespan can be recursively defined and computed. Let $M(w, a)$ denote the makespan, where $w$ is the remaining length of the job to be executed, and $a$ is the time elapsed since the VM’s starting time (i.e., the VM’s current age). We now need to determine when to take the next checkpoint, which we take after $k$ job-steps. Let $M^*(w, a)$ denote the minimum expected makespan:

$M^*(w, a) = \min_{1 \le k \le w} M_k(w, a)$   (11)

The makespan is affected by whether or not there is a failure before we take the checkpoint:

$M_k(w, a) = P_s \, M_s + (1 - P_s) \, M_f$   (12)

Here $P_s$ denotes the probability of the job successfully executing without failures until the checkpoint is taken, i.e., from $a$ to $a + k$. $P_s$ is computed using the CDF, and $P_s = \frac{1 - F(a + k)}{1 - F(a)}$.

$M_s$ is the expected makespan if there are no job failures while the job is executing from step $a$ to $a + k$, and is given by a recursive definition:

$M_s = k + c + M^*(w - k,\; a + k + c)$   (13)

Note that the makespan includes the amount of work already done ($k$), the checkpointing overhead ($c$), and the expected minimum makespan of the rest of the job. Similarly, when the job fails before step $k$, that portion is “lost work”, denoted by $L(a, a+k)$: the expected lost work when there is a failure during the time interval $a$ to $a + k$. A failure before the checkpoint results in no progress, and $w$ steps of the job still remain. The expected makespan in the failure case is then given by:

$M_f = L(a, a+k) + M^*(w, 0)$   (14)

In the case of memoryless failures, $L$ is approximated as $k/2$, i.e., half the checkpointing interval. In our case, the lost work is the wasted work that we defined earlier in Equation 7, adjusted for the different start and end times:

$L(a, a+k) = \frac{1}{F(a+k) - F(a)} \int_a^{a+k} (s - a) \, f(s) \, ds$   (15)

where $f$ is the probability density function from Equation 2.

Computing the optimal checkpoint schedule: We can find the minimum makespan by using Equations 11–15. Given a job of length $T_R$, minimizing the total expected makespan involves computing $M^*(T_R, a)$, where $a$ is the current age of the server. Since the makespan is recursively defined, we can do this minimization using dynamic programming, and extract the job-steps at which checkpointing results in a minimum expected makespan. The job’s checkpointing schedule is determined as follows (assume the job starts at $a = 0$ for ease of exposition). We first locate the checkpointing interval $k_1$ that minimizes $M(T_R, 0)$. Then, we recursively find the next checkpointing interval $k_2$ by minimizing $M(T_R - k_1, k_1 + c)$, and so on, until the entire job is covered.

If a job encounters a failure, it is resumed from the most recent checkpoint, on a new VM. After every such resume event, we recompute the optimal checkpointing schedule for the remaining job, since the job’s failure rate depends on the VM age when it starts, and the job may be resumed at a later time or on a VM of a different type. Our algorithm yields non-uniform intervals that track the failure rate: for a 5 hour job launched on a new VM (time=0), the checkpointing intervals are short during the initial high-failure phase and grow longer once the VM enters its stable phase. Further analysis of our algorithm is presented in Section VI.2.2.
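A compact sketch of this dynamic program is shown below, reusing cdf_model and pdf_model from the Section III.2 sketches along with their fitted parameters. The unit-step discretization, the fresh-VM restart (age 0) after a failure, and the algebraic fixed-point step that keeps $M^*(w, 0)$ from recursing into itself are implementation choices for illustration, not necessarily those of our production service.

```python
from functools import lru_cache
from scipy.integrate import quad

STEP = 0.25   # hours per job-step (illustrative discretization)
CKPT = 1      # checkpoint cost c, in job-steps (~15 minutes here)
PARAMS = (lam1, lam2, tau, C)   # fitted parameters from the earlier sketch

def F(step):
    """CDF at a job-step index, clamped away from 1 for numerical safety."""
    return min(cdf_model(min(step * STEP, 24.0), *PARAMS), 1.0 - 1e-9)

def lost_work(a, b):
    """Eq. 15: expected job-steps lost to a failure in step interval (a, b)."""
    num, _ = quad(lambda s: (s / STEP - a) * pdf_model(s, *PARAMS),
                  a * STEP, b * STEP)
    return num / max(F(b) - F(a), 1e-12)

@lru_cache(maxsize=None)
def M_star(w, a):
    """Eq. 11: minimum expected makespan for w remaining steps, VM age a."""
    if w <= 0:
        return 0.0
    best = float("inf")
    for k in range(1, w + 1):                     # next checkpoint after k steps
        p_s = (1.0 - F(a + k)) / (1.0 - F(a))     # survive until the checkpoint
        m_s = k + CKPT + M_star(w - k, a + k + CKPT)   # Eq. 13
        lost = lost_work(a, a + k)
        if a == 0:
            # A failure restarts this same subproblem on a fresh VM; solving
            # the fixed point M = p_s*m_s + (1-p_s)*(lost + M) analytically
            # avoids M_star(w, 0) recursing into itself.
            cand = (p_s * m_s + (1.0 - p_s) * lost) / p_s
        else:
            cand = p_s * m_s + (1.0 - p_s) * (lost + M_star(w, 0))  # Eqs. 12, 14
        best = min(best, cand)
    return best
```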

V Implementing a Batch Computing Service For Preemptible VMs

We have implemented a prototype batch computing service that incorporates our policies for constrained preemptions. We use this service to examine the effectiveness and practicality of our model and policies in real-world settings. Our service is implemented as a light-weight, extensible framework that makes it convenient and cheap to run batch jobs in the cloud. We have implemented our prototype in Python in about 2,000 lines of code, and currently support running VMs on the Google Cloud Platform gcp .

We use a centralized controller (Figure 7), which implements the VM selection and job scheduling policies described in Section IV. The controller can run on any machine (including the user’s local machine, or inside a cloud VM), and exposes an HTTP API to end-users. Users submit jobs to the controller via the HTTP API, which then launches and maintains a cluster of cloud VMs, and maintains the job queue and metadata in a local database.

Our service integrates and interfaces with two primary external services. First, it uses the Google Cloud API gcl for launching, terminating, and monitoring VMs. Once a cluster is launched, it then configures a cluster manager such as Slurm slu or Torque tor , to which it submits jobs. Our service uses the Slurm cluster manager, with each VM acting as a Slurm “cloud” node, which allows Slurm to gracefully handle VM preemptions. The Slurm master node runs on a small, 2 CPU non-preemptible VM, which is shared by all applications and users. We monitor job completions and failures (due to VM preemptions) through the use of Slurm callbacks, which issue HTTP requests back to the central service controller.

Policy Implementation: Our service creates and manages clusters of transient cloud servers, manages all aspects of the VM lifecycle and costs, and implements the model-based policies. It parametrizes the bathtub model based on the VM type, region, time-of-day, and day-of-week. When a new batch job is to be launched, we find a “free” VM in the cluster that is idle, and use the job scheduling policy to determine whether that VM is suitable or a new VM must be launched. Due to the bathtub nature of the failure rate, VMs that have survived the initial failures are “stable” and have a very low rate of failure, and thus are “valuable”. We keep these stable VMs as “hot spares” instead of terminating them, for a period of one hour. For the checkpointing policy, the running time of our dynamic programming algorithm grows polynomially with the job length $T_R$. To minimize this overhead, we precompute the checkpointing schedules of jobs of different lengths, and do not need to compute the checkpoint schedule for every new job.

Figure 7: Architecture and system components of our batch computing service.

Bag of Jobs Abstraction For Scientific Simulations: While our service is intended for general batch jobs, we incorporate a special optimization for scientific simulation workloads that improves the ease-of-use of our service, and also helps in our policy implementation. Our insight is that most scientific simulations involve launching a series of jobs that explore a large parameter space that results from different combinations of physical and computational parameters. These workloads can be abstracted as a “bag of jobs”, with each job running the same application with different parameters. A bag of jobs is characterized by the job and all the different parameters with which it must be executed. Within a bag, jobs show little variation in their running time and execution characteristics.

We allow users to submit entire bags of jobs, which permits us to determine the running time of jobs based on previous jobs in the bag. For constrained preemptions, the running time and checkpointing are determined by job lengths, and the job run time estimates are extremely useful. Having a large sequence of jobs is also particularly useful with bathtub preemptions, since we can re-use “stable” VMs with low preemption probability for running new jobs from a bag. If jobs were submitted one at a time, a batch computing service may have to terminate the VM after job completion, which would increase the job failure probability resulting from running on new VMs that have a high initial failure rate.
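As a hypothetical usage example, a bag of jobs might be submitted to the controller’s HTTP API as follows; the endpoint path, host name, and JSON fields are illustrative assumptions, not the service’s documented interface.

```python
import requests

# A "bag of jobs": one command template swept over a small parameter grid.
bag = {
    "command": "./nanoconfinement --salt-conc {c} --confinement-nm {d}",
    "params": [{"c": c, "d": d} for c in (0.3, 0.5, 0.9) for d in (3, 4, 5)],
    "vm_type": "n1-highcpu-16",
    "zone": "us-east1-b",
}

# Submit the whole bag in one request; the controller queues the jobs and
# schedules them onto (possibly reused) preemptible VMs.
resp = requests.post("http://controller:8080/bags", json=bag, timeout=30)
print(resp.json())  # e.g., a bag id and the number of queued jobs
```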

VI Model and Policy Evaluation

In this section, we present analytical and empirical evaluation of constrained preemptions. We have already presented the statistical analysis of our model in Section III, and we now focus on answering the following questions:

  1. How do constrained preemptions impact the total running time of applications?

  2. What is the effect of our model-based policies when compared to existing transient computing approaches?

  3. What is the cost and performance of our batch computing service for real-world workloads?

Environment and Workloads: All our empirical evaluation is conducted on the Google Public cloud using our batch computing service described in Section V. We use three scientific computing workloads that are representative of typical applications in the broad domains of physics and material sciences:

Nanoconfinement. The nanoconfinement application launches molecular dynamics (MD) simulations of ions in nanoscale confinement created by material surfaces Jing et al. (2015); Kadupitiya et al. (2017).

Shapes. The Shapes application runs an MD-based optimization dynamics to predict the optimal shape of deformable, charged nanoparticles Jadhao et al. (2014); Brunk and Jadhao (2019).

LULESH. Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) is a popular benchmark for hydrodynamics simulations of continuum material models Karlin et al. (2013a, b).

VI.1 Impact of Constrained Preemptions on Job Running Times

(a) Computation wasted due to one preemption.
(b) Expected increase in running time.
Figure 10: Wasted computation and expected increase in running time for uniform vs. bathtub failures. For jobs longer than about 5 hours, the bathtub distribution results in significantly lower wasted computation.

We begin by examining how constrained preemptions impact total job running times. When a preemption occurs during the job’s execution, it results in wasted work, assuming there is no checkpointing. This increases the job’s total expected running time, since it must restart after a preemption. In the case of constrained preemptions, the expected waste depends both on the probability of job preemption and on when the job was preempted.

For a job of length $T_R$, the wasted work, assuming that the job faces a single preemption, is $W(T_R)$, and is given by Equation 7. We first analyze this wasted work for jobs of different lengths in Figure 10(a). We analyze two failure probability distributions for constrained preemptions: a uniform distribution such that $f(t) = 1/T$, and the bathtub shaped distribution with parameters corresponding to the n1-highcpu-16 VM type shown in Figure 1.

For the uniform distribution, the wasted work is linear in the job length, and is given by $W(T_R) = T_R/2$. For the bathtub distribution, the wasted work is given by Equation 7, and is significantly lower, especially for longer jobs (longer than 5 hours). With the bathtub distribution, jobs see a high rate of failure initially, but that also reduces the wasted work. Once jobs survive the initial high failure rate, the rate of failure is low, and thus the wasted work is more or less constant for all but the shortest and longest jobs.

We now examine the expected increase in running time, which also accounts for the probability of failure and is given by $F(T_R) \, W(T_R)$. Figure 10(b) shows this expected increase in running times for jobs of different lengths. We see that for uniformly distributed preemptions, the increase in running time is quadratic in the job length (and is given by $T_R^2/(2T)$). Interestingly, the high rate of early failures for the bathtub distribution results in a slightly worse (i.e., higher) running time for short jobs. However, for jobs longer than 5 hours, a cross-over point is reached, and the bathtub distribution provides a significantly lower overhead of preemptions. For instance, for a 10 hour job, the increase in running time is about 30 minutes, or 5%. In comparison, if failures were uniformly distributed, the increase would be about 2 hours ($T_R^2/(2T) = 100/48 \approx 2.1$ hours).

Thus, the bathtub preemptions are beneficial for applications and users, as the low failure rate during the middle periods results in significantly lower wasted work, compared to the uniformly distributed failures. Since the failure rate distribution is ultimately controlled by the cloud provider, our analysis can be used to determine the appropriate preemption distribution based on the job length distributions. For instance, if short jobs are very common, then uniformly distributed preemptions are preferable, otherwise, bathtub distributions can offer significant benefits.

Result: For constrained preemptions, bathtub distributions significantly reduce the expected increase in running times for medium to long running jobs (longer than about 5 hours), but are slightly inferior for short jobs (under about 5 hours).

VI.2 Model-based Policies

We now evaluate the effectiveness of model-driven policies that we proposed earlier in Section IV. Specifically, we seek to compare the effectiveness of our job scheduling and checkpointing policies with existing transient computing approaches.

VI.2.1 Job Scheduling

In the previous subsection, we have quantified the increase in running time due to preemptions, but we had assumed that jobs start on a newly launched server. In many scenarios however, a server may be used for running a long-running sequence of jobs, such as in a batch-computing service. Our job scheduling policy is model-driven and decides whether to request a new VM for a job or run it on an existing VM. A new VM may be preferable if the job starts running near the VM’s 24 hour preemption deadline.

Figure 13(a) shows the effect of our job scheduling policy for a six hour job, for different job starting times (relative to the VM’s starting time). We compare against a baseline of memoryless job scheduling that is not informed by constrained preemption dynamics. Such memoryless policies are the default in existing transient computing systems such as SpotOn Subramanya et al. (2015). In the absence of insights about bathtub preemptions, the memoryless policy continues to run jobs on the existing VM. As the figure shows, the empirical job failure probability is bathtub shaped. However, since the job is 6 hours long, with the memoryless policy it will always fail when launched after 18 hours. In contrast, our model-based policy determines that after 18 hours we are better off running the job on a newer VM, and results in a much lower job failure probability ($\approx 0.4$). Thus, our model-based job scheduling policy can reduce job failure probability by taking into account the time-varying failure rates of VMs, which is not considered by existing systems that use memoryless scheduling policies.

(a) Effect of job start time on the failure probability.
(b) Job failure probability for jobs of different lengths.
Figure 13: Job failure probability is lower with our deadline aware policy across all job sizes.

The job failure probability is determined by the job length and the job starting time. We examine the failure probability for jobs of different lengths in Figure 13(b), in which we average the failure probability across different start times. We again see that our policy results in significantly lower failure probability compared to memoryless scheduling. For all but the shortest and longest jobs, the failure probability with our policy is half of that of existing memoryless policies. This reduction is primarily due to how the two policies perform for jobs launched near the end of the VM preemption deadline, which we examined previously in Figure 13(a).

Result: Our model-based job scheduling and VM-reuse policy can decrease job failure probability by more than 2×.

VI.2.2 Checkpointing

We now evaluate our model-based checkpointing policy, which uses a dynamic programming approach. With our policy, the checkpointing rate is determined by the VM's current failure rate. In contrast, all prior work in transient computing and most prior work in fault-tolerance assumes that failures are exponentially distributed (i.e., memoryless), and uses the Young-Daly checkpointing interval. In the Young-Daly approach, checkpoints are taken after a constant period given by $\tau = \sqrt{2\,\delta \cdot \mathrm{MTTF}}$, where $\delta$ is the time to take a single checkpoint. However, in the case of constrained preemptions with bathtub distributions, the failure rate is time-dependent and not memoryless.
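For reference, the sketch below computes the Young-Daly interval for the memoryless baseline, along with a simplified quasi-static variant of rate-adaptive checkpointing. This is an illustrative approximation, not our dynamic programming policy: hazard(t) stands in for the model's time-varying failure rate, and the quasi-static rule simply plugs the instantaneous MTTF into the Young-Daly formula.

```python
import math

def young_daly_interval(checkpoint_cost_hours, mttf_hours):
    """Constant Young-Daly checkpointing interval: sqrt(2 * delta * MTTF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mttf_hours)

# Memoryless baseline used in our comparison: delta = 1 minute, MTTF = 1 hour.
tau = young_daly_interval(1.0 / 60.0, 1.0)
print(f"Young-Daly interval: {tau:.2f} h (~{tau * 60:.0f} min)")
# ~0.18 h, i.e., a checkpoint roughly every 11 minutes: a high, constant
# checkpointing rate regardless of the VM's current failure behavior.

def quasi_static_interval(vm_age_hours, checkpoint_cost_hours, hazard):
    """Rate-adaptive interval: reuse Young-Daly with the instantaneous MTTF
    1/hazard(t). A crude stand-in for the full dynamic program, which also
    accounts for how the failure rate evolves over the job's lifetime."""
    mttf_now = 1.0 / max(hazard(vm_age_hours), 1e-9)
    return young_daly_interval(checkpoint_cost_hours, mttf_now)
```

During the low-failure-rate middle phase, hazard(t) is small, so the quasi-static interval grows and checkpointing overhead shrinks, which is the intuition behind the bathtub-shaped overhead curve discussed next.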

Figure 16(a) shows the expected increase in running time for a 4 hour job, accounting for both the checkpointing overhead and the expected recomputation due to preemptions. Throughout, we assume that each checkpoint takes 1 minute. The increase in running time depends on the failure rate and thus on the job's starting time. With our model-based checkpointing policy, the increase in running time is bathtub shaped and stays below 5%, and is around 1% when the job is launched while the VM is between 5 and 15 hours old.

We also compare with the Young-Daly Daly (2006) periodic checkpointing policy, using the VM's initial failure rate to set the MTTF, which yields an MTTF of 1 hour. This results in a high, constant rate of checkpointing and increases the running time of the job by more than 25%, primarily due to the overhead of checkpointing itself. Note that checkpointing at a lower frequency decreases the checkpointing overhead but increases the required recomputation.

Next, we examine the expected running time of jobs of different lengths, when all jobs start at time=0, i.e., are launched on a freshly launched VM. Figure 16(b) shows the expected increase in running time with our model-based checkpointing policy and with the Young-Daly policy at MTTF=1 hour. With our policy, running times increase by 10% for short jobs (less than 2 hours long) and by less than 5% for longer jobs. In contrast, the Young-Daly policy yields a constant 25% increase in running times. Thus, our model-based policy reduces the checkpointing overhead and brings the performance overhead of running on preemptible VMs to below 5%.

Result: Our checkpointing policy can reduce the performance overhead of preemptions to under 5%, compared to over 25% with conventional periodic checkpointing.

(a) Checkpointing overhead for different job starting times.
(b) Increase in running time with checkpointing when jobs start at time=0.
Figure 16: Checkpointing effectiveness.

VI.3 Effectiveness on Scientific Computing Workloads

We now show the effectiveness of our batch computing service on Google Preemptible VMs. We run scientific simulation workloads described earlier in this section, and are interested in understanding the real-world effectiveness of our model-based service.

(a) Cost
(b) Preemptions
Figure 19: Cost and preemptions with our service.

Cost: The primary motivation for using preemptible VMs is their significantly lower cost compared to conventional, non-preemptible “on-demand” cloud VMs. To evaluate the cost of using our batch computing service, we run a bag of 100 jobs on a cluster of 32 VMs of type n1-highcpu-32. Within a bag, different jobs explore different physical parameters, and job running times show little variance. Figure 19(a) shows the cost of using Preemptible VMs compared to conventional on-demand VMs. For all three applications, using our service reduces costs by about 5×.

We note that for this experiment, our service was using model-driven job scheduling, but was not using checkpointing, since the applications lacked checkpointing mechanisms. Using checkpointing would reduce the costs even further, since it would reduce the increase in running time (and server costs) due to recomputation.

Preemptions: Finally, we examine the effect of preemptions on the increase in running time under real-world settings. We ran a cluster of 32 n1-highcpu-32 VMs running the Nanoconfinement application, and repeated the experiment multiple times to observe the effect of preemptions. Figure 19(b) shows the increase in running time of the entire bag of jobs as a function of the number of VM preemptions observed over the course of execution. The net impact of preemptions is a roughly linear increase in running time: each preemption adds roughly 3%, which validates our earlier analytical evaluation. The result also highlights the effectiveness of the job scheduling and VM-reuse policy, since most jobs run on stable VMs, and those that run on new VMs “fail fast” and result in only a small amount of wasted work and increase in running time.

Result: Our batch computing service can reduce costs by up to 5× compared to conventional on-demand cloud VMs. With the VM-reuse policy, the performance impact of preemptions is as low as 3% per preemption.

VII Related Work

Transient Cloud Computing. The low cost of transient cloud servers has made them very appealing in spite of their preemptible nature, and their efficient and effective use has been the subject of a significant amount of research Sharma (2018). The significantly lower cost of spot instances makes them attractive for running preemption- and delay-tolerant batch jobs Subramanya et al. (2015); Jain et al.; Yi et al. (2010); Wieder et al. (2012); Liu (2011); Chohan et al. (2010); Dubois and Casale (2016); Varshney and Simmhan (2019). The challenges posed by Amazon EC2 spot instances, the first transient cloud servers, have received significant attention from both academia and industry spo. The distinguishing characteristic of EC2 spot instances is their dynamic auction-based pricing, and choosing the “right” bid price to minimize cost and performance degradation is the focus of much of the past work on transient computing Javadi et al. (2011); Mihailescu and Teo (2012); Tang et al. (2012); Wee (2011); Xu and Li (2013); Zhang et al. (2011); Zafer et al. (2012); Zheng et al. (2015); Song et al. (2012); Wolski et al. (2017a); Guo et al. (2015).

On the other hand, the effective use of transient resources provided by other cloud providers such as Google, Microsoft, Packet, and Alibaba largely remains unexplored. Ours is the first work that studies the preemption characteristics and addresses the challenges involved in running large-scale applications on the Google Preemptible VMs, and provides insights on the unique constrained preemption dynamics.

Preemption Mitigation. Effective use of transient servers usually entails the use of fault-tolerance techniques such as checkpointing Sharma et al. (2016a), migration Sharma et al. (2015), and replication Subramanya et al. (2015). In the context of HPC workloads, Marathe et al. (2014); Gong et al. (2015); Taifi et al. (2011) develop checkpointing and bidding strategies for MPI applications running on EC2 spot instances. However, periodic checkpointing Dongarra et al. ; Bougeret et al. (2011) is not optimal in our case because preemptions are not memoryless.

Preemption Modeling. Conventionally, exponential distributions have been used to model preemptions, even for EC2 spot instances Zheng et al. (2015); Sharma et al. (2016a, b). Our preemption model for Google Preemptible VMs, developed in Section III, provides a novel characterization of bathtub-shaped failure rates not captured even by Weibull distributions, and is distinct from prior efforts Mudholkar and Srivastava (1993); Crevecoeur (1993).

VIII Discussion and Future Directions

Constrained preemptions are a relatively unexplored phenomenon and are challenging to model. Our model and the associated data expand transient cloud computing beyond EC2 spot instances. We have evaluated the model under different practical conditions, including different VM types and temporal domains, and have shown it to be general and robust. However, many questions and avenues of future investigation remain open:

What if preemption characteristics change? Ultimately, preemption characteristics depend on cloud provider policies, the supply and demand of transient, on-demand, and reserved VMs, etc., and may change over time. Our model allows detecting policy and phase changes by comparing observed data with model predictions to identify change-points, and a long-running cloud service can continuously update the model based on recent preemption behavior (see the sketch below). However, changes are rare: Google's preemption policy has not changed since its inception in 2015. Regardless, we believe that VMs with constrained preemptions are an interesting new type of transient resource, and our analysis, observations, and policies should continue to be relevant. Furthermore, we demonstrate that the multi-phase bathtub failure distribution may be a fundamental characteristic of constrained preemptions that benefits both the cloud platform and applications, so models that capture the distinct preemption phases would remain relevant even if finer-grained preemption characteristics change.
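As a hedged illustration of such change detection, one could periodically compare recently observed preemption times against the fitted model with a goodness-of-fit test. The sketch below uses a Kolmogorov-Smirnov test; model_cdf stands in for our fitted preemption-time CDF, and the window size and significance threshold are arbitrary choices, not values from our system.

```python
from scipy import stats

def preemption_policy_changed(recent_preemption_times, model_cdf, alpha=0.01):
    """Flag a potential provider policy change when a window of recently
    observed preemption times no longer matches the fitted model.
    `model_cdf` must be a callable CDF over VM lifetimes in [0, 24] hours."""
    statistic, p_value = stats.kstest(recent_preemption_times, model_cdf)
    return p_value < alpha  # test rejects the fit => refit the model

# Usage sketch: maintain a sliding window of, say, the last 100 observed
# preemption times, and trigger a model refit whenever the test rejects.
```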

Phase-wise model. Our statistical analysis indicates that preemption rates have three distinct phases. Our model is continuously differentiable and captures the three phases reasonably well. However, it may be possible to use a “phase-wise” model, such as a piece-wise continuously differentiable model, in which the three phases are modeled either as three segmented linear regions (found using segmented linear regression, as sketched below) or as an initial exponential phase followed by two linear phases. Such a piece-wise model could capture the phase transitions even more accurately, and is part of our future work.
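A minimal version of the segmented-linear variant could be fit by grid-searching the two breakpoints and solving least squares within each segment, as sketched below. The breakpoint grid and the use of plain least squares are our assumptions for illustration, not a prescription for the final phase-wise model.

```python
import numpy as np

def fit_three_segments(t, rate, breaks):
    """Fit three independent linear segments to an empirical failure-rate
    curve `rate` sampled at times `t`, split at the two breakpoints.
    Returns total squared error and per-segment (slope, intercept) pairs."""
    b1, b2 = breaks
    total_err, coeffs = 0.0, []
    for lo, hi in [(t.min(), b1), (b1, b2), (b2, t.max() + 1e-9)]:
        mask = (t >= lo) & (t < hi)
        if mask.sum() < 3:
            return np.inf, None  # degenerate segment: reject this split
        A = np.vstack([t[mask], np.ones(mask.sum())]).T
        sol, residual, _, _ = np.linalg.lstsq(A, rate[mask], rcond=None)
        total_err += residual[0] if residual.size else 0.0
        coeffs.append(sol)
    return total_err, coeffs

def best_breakpoints(t, rate, grid):
    """Grid search over candidate breakpoint pairs (b1 < b2)."""
    candidates = [(b1, b2) for b1 in grid for b2 in grid if b1 < b2]
    return min(candidates, key=lambda b: fit_three_segments(t, rate, b)[0])
```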

Connection to constrained systems and statistical mechanics. Our proof of Lemma 1 used a mapping to constrained physical systems and employed statistical mechanics tools such as partition functions Krauth (2006). We have only presented the initial connection between the behavior of constrained preemptions and the statistical mechanics of constraint-driven phenomena in many-particle systems Krauth (2006); Solis et al. (2013), and we conjecture that a deeper analogy may exist. Central to our proof is the assumption of mutually exclusive preemptions, that is, that the provider preempts VMs in a mutually exclusive manner. This assumption makes sense from a cluster management and application perspective. However, analyzing constrained preemptions under weaker versions of the mutual-exclusion assumption is also possible with statistical mechanics approaches. For example, to study situations where weakly overlapping preemptions are preferred, one can leverage the statistical mechanics framework of constrained “soft” particles often investigated using molecular dynamics simulations Jing et al. (2015).

IX Conclusion

The effective use of transient computing relies on understanding preemption characteristics. While past work on transient computing has developed techniques and systems for Amazon's EC2 spot instances, ours is the first work to understand the behavior of Google's Preemptible VMs, which have the unique characteristic of a maximum 24-hour lifetime. Our large-scale empirical study shows that this constraint imposes a bathtub failure distribution, and we develop a new preemption probability model that captures its three distinct temporal phases. Our insights and model-based policies can reduce preemption overheads by up to 5× compared to existing preemption models, and our batch computing service can reduce computing costs by up to 5×.

References

  • spo (2013) “Scientific Computing Using Spot Instances,” http://aws.amazon.com/ec2/spot-and-science/ (2013).
  • (2) “Google cloud preemptible vm instances documentation,” https://cloud.google.com/compute/docs/instances/preemptible.
  • (3) “Azure low-priority batch vms,” https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms.
  • Sharma et al. (2015) P. Sharma, S. Lee, T. Guo, D. Irwin,  and P. Shenoy, in EuroSys (2015).
  • Marathe et al. (2014) A. Marathe, R. Harris, D. Lowenthal, B. R. De Supinski, B. Rountree,  and M. Schulz, in HPDC (ACM, 2014).
  • Sharma et al. (2017) P. Sharma, D. Irwin,  and P. Shenoy, in Proceedings of ACM Measurement and Analysis of Computer Systems, Vol. 1 (2017) p. 23.
  • Wieder et al. (2012) A. Wieder, P. Bhatotia, A. Post,  and R. Rodrigues, in NSDI 12 (2012).
  • Dubois and Casale (2016) D. J. Dubois and G. Casale, Cluster Computing , 1 (2016).
  • Shastri and Irwin (2017) S. Shastri and D. Irwin, in Proceedings of the 2017 Symposium on Cloud Computing (ACM, 2017) pp. 493–505.
  • Daly (2006) J. T. Daly, Future Generation Computer Systems 22 (2006).
  • Ben-Yehuda et al. (2013) O. Ben-Yehuda, M. Ben-Yehuda, A. Schuster,  and D. Tsafrir, ACM TEC 1 (2013).
  • Sharma et al. (2016a) P. Sharma, T. Guo, X. He, D. Irwin,  and P. Shenoy, in EuroSys (2016).
  • Tonks (1936) L. Tonks, Phys. Rev. 50, 955 (1936).
  • Verma et al. (2015) A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune,  and J. Wilkes, in EuroSys (ACM, 2015).
  • Cortez et al. (2017) E. Cortez, A. Bonde, A. Muzio, M. Russinovich, M. Fontoura,  and R. Bianchini, in Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17 (ACM, New York, NY, USA, 2017) pp. 153–167.
  • (16) “Amazon ec2 spot instances,” https://aws.amazon.com/ec2/spot/.
  • (17) “Packet Spot Market,” https://support.packet.com/kb/articles/spot-market.
  • (18) “Alibaba Cloud Preemptible Instances,” https://www.alibabacloud.com/help/doc-detail/52088.htm.
  • Singh et al. (2014) R. Singh, P. Sharma, D. Irwin, P. Shenoy,  and K. Ramakrishnan, IEEE Internet Computing 18 (2014).
  • Subramanya et al. (2015) S. Subramanya, T. Guo, P. Sharma, D. Irwin,  and P. Shenoy, in SOCC (2015).
  • Joaquim et al. (2019) P. Joaquim, M. Bravo, L. Rodrigues,  and M. Matos, in Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys ’19 (ACM, New York, NY, USA, 2019) pp. 35:1–35:16.
  • Harlap et al. (2017) A. Harlap, A. Tumanov, A. Chung, G. R. Ganger,  and P. B. Gibbons, in Proceedings of the Twelfth European Conference on Computer Systems, EuroSys ’17 (ACM, New York, NY, USA, 2017) pp. 589–604.
  • Ghit and Epema (2017) B. Ghit and D. Epema, in Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’17 (ACM, New York, NY, USA, 2017) pp. 105–116.
  • Zheng et al. (2015) L. Zheng, C. Joe-Wong, C. W. Tan, M. Chiang,  and X. Wang, in SIGCOMM (2015).
  • Shastri et al. (2016) S. Shastri, A. Rizk,  and D. Irwin, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16 (IEEE Press, Piscataway, NJ, USA, 2016) pp. 85:1–85:11.
  • Wolski and Brevik (2016) R. Wolski and J. Brevik, in Proceedings of the 24th High Performance Computing Symposium (Society for Computer Simulation International, 2016) p. 13.
  • Wolski et al. (2017a) R. Wolski, J. Brevik, R. Chard,  and K. Chard, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC ’17 (ACM Press, Denver, Colorado, 2017) pp. 1–11.
  • Ouyang et al. (2016) X. Ouyang, D. Irwin,  and P. Shenoy, in IEEE International Conference on Distributed Computing Systems (ICDCS) (2016).
  • Baughman et al. (2018) M. Baughman, C. Haas, R. Wolski, I. Foster,  and K. Chard, in Proceedings of the 9th Workshop on Scientific Cloud Computing (ACM, 2018) p. 1.
  • Wolski et al. (2017b) R. Wolski, J. Brevik, R. Chard,  and K. Chard, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (ACM, 2017) p. 18.
  • Sharma et al. (2016b) P. Sharma, D. Irwin,  and P. Shenoy, in Proceedings of the 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud) (USENIX, 2016).
  • Missov and Lenart (2013) T. I. Missov and A. Lenart, Theoretical Population Biology 90, 29 (2013).
  • (33) “Scipy curve fit documentation,” https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html.
  • Krauth (2006) W. Krauth, Statistical mechanics: algorithms and computations, Vol. 13 (OUP Oxford, 2006).
  • Jing et al. (2015) Y. Jing, V. Jadhao, J. W. Zwanikken,  and M. Olvera de la Cruz, The Journal of chemical physics 143, 194508 (2015).
  • (36) J. Dongarra, T. Herault, and Y. Robert, 66.
  • (37) “Google Cloud Platform,” https://cloud.google.com/.
  • (38) “Google Cloud API Documentation,” https://cloud.google.com/apis/docs/overview.
  • (39) “Slurm Workload Manager,” https://slurm.schedmd.com/documentation.html.
  • (40) “Torque Resource Manager,” http://www.adaptivecomputing.com/products/torque/.
  • Kadupitiya et al. (2017) J. Kadupitiya, S. Marru, G. C. Fox,  and V. Jadhao, “Ions in nanoconfinement,”  (2017), online on nanoHUB; source code on GitHub at github.com/softmaterialslab/nanoconfinement-md.
  • Jadhao et al. (2014) V. Jadhao, C. K. Thomas,  and M. Olvera de la Cruz, Proceedings of the National Academy of Sciences 111, 12673 (2014).
  • Brunk and Jadhao (2019) N. E. Brunk and V. Jadhao, Journal of Materials Chemistry B  (2019).
  • Karlin et al. (2013a) I. Karlin, A. Bhatele, J. Keasler, B. L. Chamberlain, J. Cohen, Z. DeVito, R. Haque, D. Laney, E. Luke, F. Wang, D. Richards, M. Schulz,  and C. Still, in 27th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2013) (Boston, USA, 2013).
  • Karlin et al. (2013b) I. Karlin, J. Keasler,  and R. Neely, LULESH 2.0 Updates and Changes, Tech. Rep. LLNL-TR-641973 (2013).
  • Sharma (2018) P. Sharma, “Transiency-driven Resource Management for Cloud Computing Platforms,” https://scholarworks.umass.edu/dissertations_2/1388/ (2018).
  • (47) N. Jain, I. Menache,  and O. Shamir, in 11th International Conference on Autonomic Computing (ICAC 14) (USENIX Association).
  • Yi et al. (2010) S. Yi, D. Kondo,  and A. Andrzejak, in Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on (IEEE, 2010) pp. 236–243.
  • Liu (2011) H. Liu, in HotCloud (2011).
  • Chohan et al. (2010) N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi,  and C. Krintz, in HotCloud (2010).
  • Varshney and Simmhan (2019) P. Varshney and Y. Simmhan, IEEE Transactions on Parallel and Distributed Systems , 1 (2019).
  • (52) “Spotinst,” https://spotinst.com/.
  • Javadi et al. (2011) B. Javadi, R. Thulasiram,  and R. Buyya, in UCC (2011).
  • Mihailescu and Teo (2012) M. Mihailescu and Y. M. Teo, in CCGrid (2012).
  • Tang et al. (2012) S. Tang, J. Yuan,  and X. Li, in CLOUD (2012).
  • Wee (2011) S. Wee, in CCGrid (2011).
  • Xu and Li (2013) H. Xu and B. Li, Performance Evaluation Review 40 (2013).
  • Zhang et al. (2011) Q. Zhang, E. Gürses, R. Boutaba,  and J. Xiao, in Hot-ICE (2011).
  • Zafer et al. (2012) M. Zafer, Y. Song,  and K. Lee, in CLOUD (2012).
  • Song et al. (2012) Y. Song, M. Zafer,  and K. Lee, in Infocom (2012).
  • Guo et al. (2015) W. Guo, K. Chen, Y. Wu,  and W. Zheng, in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing - HPDC ’15 (ACM Press, Portland, Oregon, USA, 2015) pp. 191–202.
  • Gong et al. (2015) Y. Gong, B. He,  and A. C. Zhou, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC ’15 (ACM Press, Austin, Texas, 2015) pp. 1–12.
  • Taifi et al. (2011) M. Taifi, J. Y. Shi,  and A. Khreishah, in Algorithms and Architectures for Parallel Processing, Vol. 7017, edited by Y. Xiang, A. Cuzzocrea, M. Hobbs,  and W. Zhou (Springer Berlin Heidelberg, Berlin, Heidelberg, 2011) pp. 109–120.
  • Bougeret et al. (2011) M. Bougeret, H. Casanova, M. Rabie, Y. Robert,  and F. Vivien, in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC ’11 (ACM Press, Seattle, Washington, 2011) p. 1.
  • Mudholkar and Srivastava (1993) G. S. Mudholkar and D. K. Srivastava, IEEE transactions on reliability 42, 299 (1993).
  • Crevecoeur (1993) G. Crevecoeur, IEEE Transactions on reliability 42, 148 (1993).
  • Solis et al. (2013) F. J. Solis, V. Jadhao,  and M. Olvera de la Cruz, Phys. Rev. E 88, 053306 (2013).