Probabilistic Models for the Execution Time in Stochastic Scheduling

06/14/2020
by   Matheus Henrique Junqueira Saldanha, et al.

The execution time of programs is a key element in many areas of computer science, mainly those where achieving good performance (e.g., scheduling in cloud computing) or a predictable one (e.g., meeting deadlines in embedded systems) is the objective. Despite being random variables, execution times are most often treated as deterministic in the literature, with few works taking advantage of their randomness; even in those, the underlying distributions are assumed to be normal or uniform for no particular reason. In this work we investigate these distributions on various machines and algorithms. A mathematical problem arises when dealing with samples whose populational minimum is unknown, so a significant portion of this monograph is dedicated to that problem. We propose several effective or computationally cheap ways to overcome it, which also apply to execution times. These methods are tested experimentally, and the results point to the superiority of our proposed inference methods. We demonstrate the existence of execution time distributions with long tails, and also conclude that two particular probability distributions were the most suitable for modelling all execution times. While we do not discuss direct applications to stochastic scheduling, we hope to promote the usage of probabilistic execution times to yield better results in, for example, task scheduling.


1 Cloud Computing

Cloud computing is the use of outsourced computational resources on demand, on a pay-per-use basis, and has been made feasible thanks to advances in operating systems and virtualization techniques (tanenbaum2015modern). In this business model, a single entity holds a large number of computer machines (the cloud) and offers these machines for use by interested users. It has become financially interesting because, in many cases, extreme computational power is needed only occasionally. With the cloud, instead of buying powerful machines that might sit idle most of the time, businesses can lease the necessary computational power from the cloud solely for the time needed.

In general, users are able to send jobs to execute in the cloud, which are often represented as a directed acyclic graph (DAG) (in this context, the graph is commonly called a workflow) where each node represents a computational task that produces certain data, and edges represent data dependencies between these tasks. A task cannot be started before all of its data dependencies have been met. Finally, users can submit multiple workflows, and tasks in workflows of different users can be mapped to the same machine.
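As a small illustration of this structure (not taken from the monograph), the sketch below represents a hypothetical workflow as a Python dictionary mapping each task to the tasks it depends on, and uses a topological order as one valid execution order on a single machine; the task names are invented.

```python
# Illustrative sketch: a workflow represented as a DAG, where each task lists
# the tasks whose output it depends on (names are hypothetical).
from graphlib import TopologicalSorter  # Python >= 3.9

workflow = {
    "preprocess": [],                 # no dependencies
    "train":      ["preprocess"],     # needs preprocessed data
    "evaluate":   ["train"],
    "report":     ["evaluate", "preprocess"],
}

# A task can only start once all of its dependencies have finished;
# a topological order gives one valid execution order for a single machine.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # e.g. ['preprocess', 'train', 'evaluate', 'report']
```

A cloud scheduler must additionally decide which machine runs each task, which is the optimization problem discussed below.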

There are three main service models offered in the context of cloud computing, known by the acronyms IaaS (infrastructure as a service), PaaS (platform as a service) and SaaS (software as a service) (coulouris2005distributed). In IaaS the client has full control of the computational environment, and has to specify the desired amount of storage, CPU power, memory, network bandwidth and other hardware-specific aspects; the client also has to perform the due configuration of the operating system to correctly use the requested hardware. PaaS clouds are often oriented to specific applications (e.g., a website server), so they offer cloud plans with fixed hardware specifications tailored to suit the application reasonably well; these platforms also come with a pre-configured operating system that is ready to run the application. Finally, SaaS offers software that can be used through a web interface, and the software runs on an underlying machine that is not available for the user to modify or configure; Google’s GSuite (see https://gsuite.google.com/, accessed on September 26th, 2019) is an example of this model. In the literature there does not seem to be an agreement concerning precise definitions of IaaS, PaaS and SaaS, but the general ideas are as discussed above.

One difficult problem in cloud computing, which has received a lot of attention from the scientific community, is the problem of scheduling workflows on the available machines (bittencourt2012scheduling). As mentioned previously, users can submit multiple workflows, which in turn are composed of a graph of interconnected tasks; the main question that arises is: what is the optimal order in which to execute these tasks on the available resources? This is the problem of scheduling, which is known to be NP-complete in many of its forms (ullman1975np), most of which are formulated as multi-objective optimization problems (as done, for example, by rodriguez2014deadline and zuo2015multi). There are many objectives that can be pursued here, such as minimizing the makespan (overall execution time of a workflow), minimizing the cost for the client, maximizing the throughput of completed workflows, and so on. Due to the difficulty of solving the problem analytically, approximate or heuristic optimization methods are frequently used (zhan2015cloud; bittencourt2012scheduling).

In the literature, most proposed scheduling algorithms consider only the mean execution time of the tasks in order to decide the mapping, rather than other information such as the variance or the probability distribution itself. For example,

panda2015efficient propose two algorithms that require a matrix of expected execution times of each task on each machine . rodriguez2014deadline applies PSOparticle swarm optimization for scheduling, and it also uses the mean execution time of tasks, though automatically calculated based on: 1) the capacity of each machine (in FLOPS) and 2) the size of the task (in number of operations). Similarly, zuo2015multi apply ant colony optimization and, besides the mean time, no other information concerning time is used. The same happens in the works of tsai2014hyper, tang2015self, tripathy2015dynamic and xavier2019chaotic

(although the latter also uses the standard deviation of the processing load on each machine, which is quite interesting).

This phenomenon is to be expected, considering the difficulty of the problem: authors place their focus on complications such as the heterogeneity of machines in the cloud, the necessity of complying with deadline and QoS (quality of service) requirements, the use of task preemption to improve results, etc. Simplifying the model – by assuming that a task’s execution time can be characterized by a single number (the mean) – allows simpler algorithms to be devised while still achieving outstanding results. It is arguable, however, that most of these proposed algorithms could be modified to make use of the randomness of execution times, and consequently improve results, as is discussed in this monograph. We show that execution times exhibit a relatively complex behavior across different machines and program types, so the mean alone does not seem sufficient to decide the mapping of tasks onto machines.

The area of stochastic scheduling is concerned with using more information about the execution times of particular tasks (recall that a workflow is a graph of tasks), often statistical information, to perform the scheduling, and there has been a decent amount of work on the subject, although much less than in the deterministic context. li1997estimating provide a general statistical framework to calculate the probability distribution of the makespan of a generic graph of tasks, whether in the context of cloud computing or not; it served as a basis for subsequent stochastic scheduling algorithms. This is not a trivial problem because it can involve the maximum of non-independent random variables, which is mathematically cumbersome; the authors smartly use the Kleinrock independence approximation from queuing theory (kleinrock2007communication) to mitigate this problem and allow calculations. Their approach requires the time distribution of tasks to be known, which for experimentation purposes they assume to be normal; one interesting implication of our work is to provide more informed suggestions as to which distributions could be used. An approach for predicting the execution time of a task given “features” of its previous behavior is given in ganapathi2010statistics. The authors perform experiments specifically with map-reduce jobs (chu2007map), but the idea of correlating previous behavior, such as average time for transferring data or the amount of data processed, is promising and can be extended to other contexts within cloud computing.

Heuristic optimization methods whose objective functions involve execution times modelled as random variables are presented in dogan2001stochastic and dogan2004genetic. shestak2008stochastic propose a stochastic metric for evaluating the robustness of a scheduling algorithm, given by the probability that the time will fall within some defined interval, which requires the probability distribution of tasks to be given. Here the authors claim that normal distributions happen frequently in practice; on that note, chen2016uncertainty propose a stochastic scheduling algorithm that assumes all execution times to be normal. kamthe2011stochastic use, besides the mean, the standard deviation of execution times in order to improve scheduling results. A method for estimating an empirical probability density function for execution times is given in dong2010resource, which involves estimating histograms and, therefore, requires a decent number of available samples. cai2017elasticsim presents a framework for simulating scheduling algorithms that supports stochastic execution times, with the requirement that the probability distribution be defined by the user. Finally, zheng2013stochastic use Monte Carlo simulation to overcome the mathematical difficulty of stochastic scheduling, but also expect the underlying probability distributions of tasks to be given, and for experimentation purposes they assumed normal or uniform distributions.

2 Extension to Distributed and Parallel Programs

The problem of scheduling is also present in the areas of distributed and parallel computing, with small modifications from the problem presented in the previous section. Let us first state the difference between these two areas: we will call distributed any program that explores parallelism by leveraging multiple computers (each computer having a single operating system) that do not share memory (e.g., a cluster of computers or a supercomputer (tanenbaum2016structured)). Parallelism in a single multicore machine, with shared memory or not, and possibly with coprocessors such as GPUs, will be considered in this text as parallel computing.

Considering these definitions, we see that a cloud is a distributed system; consequently, their scheduling algorithms have a large intersection. However, cloud computing involves a specific flow of job submission and job execution, so scheduling algorithms often take this into consideration when mapping tasks to machines. In contrast, there are other distributed systems with different characteristics; for example, a shared web hosting machine that runs the web servers of many users, each of which may interact with databases located in other (distant) machines, would require other scheduling strategies.

In the case of parallel algorithms, the time for moving data between machines tends to be shorter and becomes comparable to the time for processing such data. One way to schedule tasks in a parallel program is to model these times with computational models (norman1993models; rauber2013parallel), in which the time for moving data or executing a task is considered to be a deterministic function of parameters such as the size of the data to be transferred or the dimensions of the input data.

As an example, take the problem of training a neural network. This involves computing gradients of the form

$$\nabla_{W} \sum_{(x_i,\, y_i) \in D} L\big(f_k(\cdots f_1(x_i)\cdots),\; y_i\big),$$

where $D$ is our dataset containing pairs $(x_i, y_i)$ (the data item and its correct label, respectively), $W_j$ are weights associated with each function $f_j$, $L$ is the loss function that tells how “wrong” the neural network currently is, and $f_1, \dots, f_k$ are layers (e.g., fully-connected, convolutional, max-pooling) that apply some non-linear transformation to their inputs (goodfellow2016deep; haykin1994neural). The sum can clearly be split over disjoint subsets of $D$, and each of these splits can be calculated in a different processor, core or coprocessor. Considering a cluster with 12 computers, each containing a multicore processor with 2 cores and one GPU, we would like to have 36 splits, one to compute in each core and GPU. Further, we would like each split to be computed in the same time, so the splits must have different sizes (as GPUs will likely be much faster (rauber2013parallel)); how can we calculate the size of each split? The results presented in this monograph could be used to help such a decision (this whole problem is part of an ongoing project; see https://mjsaldanha.com/sci-projects/2-accel-nns-1/).
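A minimal sketch of one way such a decision could be made, assuming the only information available is an estimated throughput per device (the numbers below are invented); the monograph’s point is precisely that richer, probabilistic information about execution times could replace these single-number estimates.

```python
# Hypothetical sketch: divide N data items among devices in proportion to their
# estimated throughput (items/second), so all splits finish at about the same time.
def split_sizes(n_items, throughputs):
    total = sum(throughputs)
    sizes = [int(n_items * t / total) for t in throughputs]
    sizes[0] += n_items - sum(sizes)  # give rounding leftovers to the first device
    return sizes

# 12 machines x (2 CPU cores + 1 GPU) = 36 workers; throughput numbers are made up.
throughputs = [50, 50, 800] * 12      # items/s: core, core, GPU per machine
print(split_sizes(120_000, throughputs))
```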

3 Worst-Case Execution Time in Embedded Systems

Time is a paramount factor in many embedded systems, namely real-time systems and, more specifically, hard real-time systems (tanenbaum2015modern). Here, deadlines must be met by the underlying program, otherwise catastrophes could happen; take, for example, the ABS (anti-lock braking system) that controls the brakes of a car. Since the execution time is a random variable, and considering that deadlines must always be met, it is necessary to calculate the WCET (worst-case execution time), that is, the largest time that the program can take when in the production environment.

One way to go about solving this problem is to estimate the underlying probability distribution, as the worst-case scenario will lie on its right tail (wilhelm2008worst). With the distribution in hand, the quantity of interest is the time $t$ for which $P(T > t)$ is negligible (or zero), that is, the time above which events are extremely rare or impossible. However, estimating the underlying distribution is far from trivial, and the first difficulty comes from the fact that input data (e.g., data from sensors interacting with the environment) is not deterministic, and each tuple of inputs may cause the program to follow very different execution paths in its CFG (control flow graph), which in turn may greatly change the generation process of execution times.

As will be elaborated in Section 10.2, it is reasonable to say that a single execution path in the CFG yields execution times following a certain well-behaved distribution (consider as well-behaved a smooth distribution with low variance and few modes), but the same is not true across different paths. In other words, the distribution given an input can be expected to be well-behaved, but the distribution averaged over all possible inputs may be very uninformative. The program would then have to be described by multiple probability distributions, one for each set of similar execution paths, and each distribution could then be analyzed separately (e.g., david2004static). Alternatively, one could define a probability distribution for the inputs given to the program, which allows merging the multiple probability distributions of paths into a single distribution, giving more weight to the paths that are more often exercised (e.g., braams2016deriving).

There is a lot of existing work in the area of WCET: a survey is given in wilhelm2008worst. One common approach is to find the WCET statically, by taking the source or compiled binary code, building its control flow graph and analyzing its behavior on a certain model of processor. In david2004static, a framework for calculating WCET statically is presented, which provides the user with the WCET for a given execution path in the CFG, as well as the probability for the program to follow such a path; however, it requires the user to provide probability distributions for many aspects of the program and its environment, which is a shortcoming of the method. Similar works can be found in ferdinand2004ait, wilhelm2010static, engblom2002processor, li1995performance and li2007chronos. Caches are known to be a hard part of static analysis, as they have one of the biggest impacts on execution times (hennessy2011computer), so they have received special attention in lv2016survey, patil2004compositional, touzeau2017ascertaining, blass2017write and kosmidis2013multi. In particular, kosmidis2013multi discuss one way to deal with the problem of statically predicting the number of cache hits or misses: the authors propose and discuss a cache design that is randomized, in the sense that the eviction and replacement of cache blocks is not done in the usual ways (e.g., the least recently used and write-back approaches), and then they provide proofs that such a cache greatly favors WCET analysis.

Another way to approach WCET is by using execution time measurements taken by running the software on the given hardware with a specified set of inputs (wilhelm2010static). This problem is often formulated probabilistically: let $T_1, \dots, T_n$ be an iid (independent and identically distributed) random sample of execution times; the problem is then the estimation of the maximum $M_n = \max(T_1, \dots, T_n)$. The probability distribution of $M_n$ can be calculated analytically:

$$P(M_n \le t) = F(t)^n,$$

where $F$ is the cumulative distribution of the variables $T_i$. This requires, however, knowledge of $F$ or, equivalently, of the underlying probability distribution of the variables $T_i$, which is often not available. Thankfully, the area of EVT (extreme value theory) (coles2001introduction) has important results regarding the distribution of the sample maximum even if $F$ is unknown. The Fisher–Tippett–Gnedenko theorem (fisher1928limiting; gnedenko1943distribution) says that, in the limit when the sample size goes to infinity, the distribution of $(M_n - b_n)/a_n$, with suitable normalizing sequences $a_n > 0$ and $b_n$, converges (when it converges to a non-degenerate distribution) to one of three distribution families: Gumbel, Fréchet or Weibull.
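The sketch below illustrates these two facts on synthetic data: the identity $P(M_n \le t) = F(t)^n$ checked empirically, and a GEV (generalized extreme value) fit to block maxima using scipy, which covers the three limiting families of the theorem. The Gamma model and all numbers are assumptions made only for illustration.

```python
# Sketch: P(max of n iid times <= x) = F(x)^n, and a GEV fit to block maxima.
# Synthetic data only; in practice the samples would be measured execution times.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50                                                    # runs per block
times = rng.gamma(shape=3.0, scale=0.1, size=(2000, n))   # synthetic execution times
block_maxima = times.max(axis=1)

# Empirical check of P(M_n <= x) = F(x)^n at one point x:
x = 0.8
print((block_maxima <= x).mean(), stats.gamma.cdf(x, a=3.0, scale=0.1) ** n)

# Fisher-Tippett-Gnedenko: block maxima are approximately GEV distributed.
shape, loc, scale = stats.genextreme.fit(block_maxima)
print("WCET-style high quantile:", stats.genextreme.ppf(0.999, shape, loc, scale))
```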

Many works explore this result in measurement-based WCET, such as lima2016extreme, abella2017measurement, silva2017using and lu2011new. However, it often happens in WCET that the assumptions of the theorem do not hold, as empirically observed in lima2016extreme, though in the same work the authors comment on some cases where EVT does indeed help to determine the WCET. Besides EVT there are other ways to approach the measurement-based WCET problem. lindgren2000using, for example, use code instrumentation to trace the program’s execution path indirectly, which leads to a system of equations whose solution helps determine the WCET. Measurement-based approaches are also used to improve estimates of the static approaches, as done in wenzel2005measurement and kirner2005using.

4 Estimation and the Maximum Likelihood Estimator

Inferring characteristics of a population based on a sample is a widely studied problem in the field of statistics. In the formal setting, we have a random variable $X$ with unknown probability density function $f$ (we limit the explanation to the continuous case, as only this case is used throughout the text), but we only have access to a finite sample $X_1, \dots, X_n$ taken from this underlying distribution. The sample is a sequence of random variables because every time we take $n$ elements from the population it will yield a different result (we use the convention that uppercase letters denote random variables, whereas lowercase letters denote realizations (instantiations) of such random variables).

In order to infer characteristics about the population $X$, we can observe the characteristics of the sample. Any function of the sample is a potential estimate of a characteristic of the population, and is thus called an estimator (degroot2012probability). For example, the sample mean, defined as

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i,$$

is an estimator for the populational mean. It is a function of random variables, so it is itself a random variable; as such, important quantities such as its expected value and variance can be computed. Given a sample $x_1, \dots, x_n$, the sample mean can be computed and it will deviate from the populational mean; the variance of $\bar{X}_n$ tells how much deviation should be expected.

In general, if we have two estimators for the same populational quantity $\theta$, the one with lower variance is called more efficient. An estimator $\hat{\theta}_n$ is called consistent if it converges to the populational quantity as the sample size goes to infinity; rigorously, it is consistent if, and only if,

$$\lim_{n \to \infty} P\big(|\hat{\theta}_n - \theta| > \epsilon\big) = 0 \quad \text{for every } \epsilon > 0.$$

One last important property of estimators is the bias, defined as

$$\operatorname{bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta,$$

and the estimator is called unbiased if this difference is zero, that is, if the estimator’s expected value is the same as the estimated populational quantity.

Consider a random variable $X$ whose distribution is not known. However, we know that its distribution comes from a certain family (e.g., a normal distribution with parameters $\mu$ and $\sigma^2$). Therefore, what is left is only to find the parameters of such a family of distributions; the maximum likelihood estimator (MLE) is an estimator for these parameters. Suppose $X$ comes from a distribution with density function $f(x; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is the vector of parameters of the distribution family. For an independent and identically distributed (iid) sample $x_1, \dots, x_n$, the probability density of having sampled each value is $f(x_i; \boldsymbol{\theta})$, and the density of having sampled all of them is the product of all these terms, that is,

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} f(x_i; \boldsymbol{\theta}),$$

where $L$ is called the likelihood function. The maximum likelihood estimator for the vector $\boldsymbol{\theta}$ is defined as

$$\hat{\boldsymbol{\theta}} = \operatorname*{arg\,max}_{\boldsymbol{\theta}} L(\boldsymbol{\theta}).$$

Since the logarithm is a monotonically increasing function, the same maximizer will be found if we apply the logarithm to the internal product, that is,

$$\hat{\boldsymbol{\theta}} = \operatorname*{arg\,max}_{\boldsymbol{\theta}} \sum_{i=1}^{n} \log f(x_i; \boldsymbol{\theta}),$$

which is the most common way of performing MLE, as a sum is more numerically stable than a product, avoiding numerical underflow and overflow. The MLE is broadly used because it is consistent (in most problems), and under certain conditions its distribution is known to be asymptotically a normal $\mathcal{N}\big(\theta, \tfrac{1}{n I(\theta)}\big)$, where $I(\theta)$ is the Fisher information (whose definition is out of the scope of this monograph) for $\theta$. This applies if $\theta$ is a single parameter; if it is a vector, small modifications are needed (degroot2012probability).
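As a minimal sketch of this procedure, the code below performs MLE for a Gamma family by minimizing the negative log-likelihood with a general-purpose optimizer; the data is synthetic and the family choice is arbitrary.

```python
# Minimal MLE sketch: maximize the log-likelihood (here, minimize its negative)
# for a Gamma family, following the sum-of-logs formulation described above.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
sample = rng.gamma(shape=2.5, scale=1.2, size=500)   # synthetic data

def neg_log_likelihood(params):
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf                                # keep the optimizer in-bounds
    return -np.sum(stats.gamma.logpdf(sample, a=shape, scale=scale))

result = optimize.minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
print(result.x)   # estimated (shape, scale), close to (2.5, 1.2)
```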

5 Hypothesis Tests

Research in computer science, especially in areas related to the performance of programs, often involves comparing two populations of execution times. For example, if a novel, faster algorithm $A$ is being proposed to solve a problem, it is natural to compare its execution time with that of the usual approach $B$. This is often done by executing $A$ and $B$ multiple times, yielding two sets of samples of execution times, whose means are then directly used to make statements such as “algorithm $A$ is faster than $B$”.

Nevertheless, since execution times are random variables, comparing samples of them is far from trivial. To start with, let $T$ be the execution time of a program, and $T_1, \dots, T_n$ a sample of such a variable (i.e., multiple measurements for the same program); then the sample mean, defined as

$$\bar{T}_n = \frac{1}{n} \sum_{i=1}^{n} T_i,$$

is also a random variable, since it is a function of the samples $T_i$. This means that the mean could have been any value within the range of its probability distribution; in Figure 1 we show two possible probability distributions for the sample means of two populations of execution times. Since the distributions overlap, it could happen that we obtain sample means satisfying $\bar{t}_A < \bar{t}_B$ even though the populational means satisfy $\mu_A \ge \mu_B$. This is a major problem, as it allows a researcher to incorrectly conclude that their approach is better than the existing one. In order to mitigate this problem, one can make use of hypothesis testing, which is further explained in the following.

Figure 1: Possible probability distributions for the sample mean of two samples of execution times.

Testing hypotheses allows the researcher to better understand the probability that their conclusion was incorrectly inferred from the sample means obtained. This relies on the central limit theorem (CLT, see billingsley2008probability; the theorem exists in more than one form, and here we refer specifically to the “classical” version, in which the $T_i$ are assumed to be independent and identically distributed; later we might refer to Lyapunov’s CLT, as it drops the requirement of identical distributions), which tells us that for any random variable $T$, following any probability distribution with finite mean and variance, if we take $n$ samples $T_1, \dots, T_n$ and the sample mean $\bar{T}_n$, we have that

$$\lim_{n \to \infty} P\!\left( \frac{\bar{T}_n - \mu}{\sigma/\sqrt{n}} \le z \right) = \Phi(z), \tag{1}$$

where $\mu$ and $\sigma^2$ are the populational mean and variance, respectively, and $\Phi$ is the cumulative distribution of a standard normal distribution. The cumulative distribution uniquely characterizes a probability model, so we can conclude that as the sample size increases, the distribution of $\bar{T}_n$ tends to $\mathcal{N}(\mu, \sigma^2/n)$ (recall that if $Z \sim \mathcal{N}(0,1)$ then $\mu + (\sigma/\sqrt{n}) Z \sim \mathcal{N}(\mu, \sigma^2/n)$), which is a gaussian whose variance also decreases as the sample size increases; thus, in Figure 1, the two histograms would get thinner if we increased the sample size, eventually eliminating the overlap and, consequently, making the researcher more confident that the sample mean they obtained is near the true mean, with high probability. Testing hypotheses allows us to measure this degree of certainty.

The problem with (1) is twofold. First, it only holds in the limit $n \to \infty$, whereas researchers can never take a sample of infinite size. Statisticians, however, have had to deal with this problem for many decades, and experience has made it well accepted that, for a large class of random variables, a moderate sample size (commonly $n \ge 30$) already makes (1) approximately valid (walpole1993probability). These random variables are expected to behave similarly to a normal, that is, they must be unimodal and reasonably symmetric. In this context, here we make the first approximation: we assume $\bar{T}_n$ has a normal distribution.
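A small simulation (synthetic, exponential data chosen only because it is skewed) shows how the standardized sample mean approaches a standard normal as $n$ grows, which is the practical content of (1):

```python
# Sketch: the sample mean of n draws from a skewed distribution becomes
# approximately normal as n grows (illustration of the CLT discussed above).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
for n in (5, 30, 200):
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # Standardize and compare with a standard normal via a KS statistic.
    z = (means - 1.0) / (1.0 / np.sqrt(n))       # Exp(1) has mean = std = 1
    print(n, stats.kstest(z, "norm").statistic)  # shrinks as n increases
```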

Under this assumption, we can calculate probabilities of interest from Figure 1. Consider that $\bar{T}_A$ follows the blue histogram (to the left) and $\bar{T}_B$ follows the other one; we then want to calculate $P(\bar{T}_A < \bar{T}_B)$, which measures the probability of correctly asserting that the populational means follow $\mu_A < \mu_B$; in contrast, $P(\bar{T}_A \ge \bar{T}_B)$ would measure the probability of saying that $\mu_A \ge \mu_B$, when this is not true. We can rewrite

$$P(\bar{T}_A < \bar{T}_B) = P(\bar{T}_A - \bar{T}_B < 0),$$

and using known relations between linear combinations of normal distributions, we have

$$\bar{T}_A - \bar{T}_B \;\sim\; \mathcal{N}\!\left( \mu_A - \mu_B, \; \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B} \right), \tag{2}$$

from which we could easily calculate $P(\bar{T}_A - \bar{T}_B < 0)$, if it were not for the second problem that (1) has.

The second problem with (1) is that the difference of populational means is unknown (as information on $\mu_A$ and $\mu_B$ is precisely what we are trying to infer from the sample), and in most cases the variances are also unknown. Suppose for now that the variances are known. We have in hand a value $\bar{d} = \bar{t}_A - \bar{t}_B$ (commonly called a statistic), which was sampled from $\bar{D} = \bar{T}_A - \bar{T}_B$, a variable that follows a normal distribution with parameters as in (2). We do not know the parameter $\mu_D = \mu_A - \mu_B$, but we can study the conditional probability density of having sampled $\bar{d}$ from this distribution given $\mu_D$; that is, study $f(\bar{d} \mid \mu_D)$. Suppose that we observed a certain negative value of $\bar{d}$ (the values used here come from a synthetic problem, in which two samples of size 100 were generated from two normal distributions whose variances we know);

then the aforementioned conditional density is shown in Figure 2. The figure shows densities, so it should be taken with caution as it does not represent a probability; it approximates a probability only over very small intervals around $\bar{d}$. Despite this, we can still draw conclusions. For example, we see that if the mean was $\mu_D \ge 0$, meaning that algorithm 1 (here associated with $\bar{T}_A$) is not faster than algorithm 2, it would be very unlikely to have sampled the observed $\bar{d}$; in particular, the probability of having sampled any value below $\bar{d}$ is at most the probability obtained when the mean of the normal is taken to be $\mu_D = 0$, as higher values of $\mu_D$ would yield even lower probabilities. In these circumstances, we can at least have high certainty that $\mu_D \ge 0$ (as initially assumed) is not true, meaning that $\mu_A < \mu_B$, as desired.

Figure 2: Conditional probability density of having sampled $\bar{d}$ from a normal distribution with the given mean and the variance defined in the text. We emphasize that this graph does not represent a probability density because the variable on the x-axis (the hypothesized mean $\mu_D$) is not a random variable.

What was done above is precisely what is usually called a hypothesis test, although performed through a less recipe-based, algorithmic process. In the usual process, we would have formulated the problem as two hypotheses

$$H_0\!: \mu_A \ge \mu_B \qquad \text{versus} \qquad H_1\!: \mu_A < \mu_B,$$

where $H_0$ is the null hypothesis, which should be chosen as the one we have to assume true until we prove otherwise. In the case of algorithm comparison, it is better to assume the proposed algorithm is worse than the existing one, and then prove this wrong. Then we define a significance level $\alpha$, a small probability that will be our criterion for rejecting the validity of $H_0$. This is done in the following way: 1) there exists a value $k$ such that $\sup_{\mu_D \in H_0} P(\bar{D} \le k \mid \mu_D) = \alpha$, where the supremum is not necessary as we know the worst case happens at $\mu_D = 0$; 2) there is then a probability of at most $\alpha$ of having sampled any value below $k$ under $H_0$; 3) finally, if our sampled value $\bar{d}$ falls below $k$, we reject $H_0$ and thus accept that our algorithm is faster. As a reference for other computer scientists, we highly recommend walpole1993probability for the more algorithmic process of testing hypotheses; degroot2012probability offers a more theoretical perspective that promotes a better interpretation of the process, and allows one to even modify tests to better suit their problem.
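A sketch of the test just described, written as a one-sided comparison of two samples of execution times; the data is synthetic, and the populational variances are replaced by sample variances, which is an additional approximation not discussed above.

```python
# Sketch of the one-sided test above: H0 says the new algorithm A is not faster
# (mu_A >= mu_B); we reject H0 if the observed mean difference is improbably low
# under H0. Sample variances are plugged in as an approximation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
times_a = rng.normal(9.5, 1.0, size=100)   # synthetic times of the new algorithm A
times_b = rng.normal(10.0, 1.0, size=100)  # synthetic times of the baseline B

diff = times_a.mean() - times_b.mean()
se = np.sqrt(times_a.var(ddof=1) / len(times_a) + times_b.var(ddof=1) / len(times_b))
z = diff / se                        # standardized statistic under mu_A - mu_B = 0
p_value = stats.norm.cdf(z)          # one-sided: small p favors "A is faster"

alpha = 0.05
print(z, p_value, "reject H0" if p_value < alpha else "do not reject H0")
```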

To perform the test, it was assumed that the sample mean follows approximately a normal distribution for the number of samples used. If the underlying distribution is “well-behaved”, then this is a reasonable assumption for any moderately large $n$ (e.g., $n \ge 30$) (walpole1993probability). In this context, do execution times follow a well-behaved distribution? What would be a reasonable $n$ to choose? We do see (although very rarely) hypothesis tests being used in the literature – as done by kadioglu2011algorithm and saldanha2019high for execution times, and by jindal2017reducing, hassan2005comparison and wang2015knowledge for other measures concerning algorithms – but no comments on the underlying distributions are made. The present monograph is thus an attempt to support hypothesis tests for comparing execution times of algorithms, provide means to better interpret the outcomes of these tests, and also shed light on the decision of the sample size $n$.

6 Methods for Generating New Probability Distributions

In this project we ended up using some distributions that are actually “compositions” of existing distributions. This section aims to explain the composition process, which should satisfy eventual curiosity, help interpret how these distributions differ from simpler ones, and help explain the naming logic of some distributions used here.

One way of composing distributions is as follows. Suppose we have a cumulative distribution function (cdf) $G$, which we will call the parent distribution and whose support (the support is the domain of the cumulative/probability density functions, e.g., $\mathbb{R}$ for a normal, $(0, \infty)$ for a Gamma, $(0, 1)$ for a Beta) is arbitrary; also consider a cdf $F$ whose support is $(0, 1)$. Every cdf follows the properties (degroot2012probability):

  1. $\lim_{x \to -\infty} F(x) = 0$;

  2. $\lim_{x \to +\infty} F(x) = 1$;

  3. $F$ is a non-decreasing function;

  4. $F$ is right-continuous.

The aforementioned $G$ and $F$ thus have the following signatures (the intervals on the domains and codomains can be open or not):

$$G: \mathbb{R} \to [0, 1], \qquad F: [0, 1] \to [0, 1], \tag{3}$$

which can be composed as $H(x) = F(G(x))$, and what is interesting is that $H$ is most often also a cdf. Note that Property 3 holds because the composition of two non-decreasing functions is non-decreasing; that is, since $G$ and $F$ are both non-decreasing:

$$x_1 \le x_2 \;\implies\; G(x_1) \le G(x_2) \;\implies\; F(G(x_1)) \le F(G(x_2)).$$

Property 4 also holds, as the composition of two non-decreasing right-continuous functions is right-continuous. Let $x_0 \in \mathbb{R}$ and let $x \to x_0^+$; since $G$ is non-decreasing and right-continuous, $G(x) \to G(x_0)$ from above, and since $F$ is right-continuous,

$$\lim_{x \to x_0^+} F(G(x)) = F\big(G(x_0)\big),$$

so $F \circ G$ is right-continuous. Properties 1 and 2 can also be proved, but it is somewhat more difficult to do so. The biggest problem is that the following equality (note that $G(x) \to 1$ from below as $x \to +\infty$):

$$\lim_{x \to +\infty} F(G(x)) = F\Big(\lim_{x \to +\infty} G(x)\Big) = F(1)$$

only holds if $F$ is left-continuous at $1$, which is not the case for many discrete distributions, but always holds for continuous ones. We thus conclude that for all continuous $F$ and $G$, $F \circ G$ will also be a cdf. Let us see how this works in practice.

Consider that $G$ is the cdf of a Gamma distribution with parameters $\alpha$ and $\beta$, which has support $(0, \infty)$, and that $F$ is the cdf of a Beta distribution with parameters $a$ and $b$, with support $(0, 1)$. Then, applying the technique explained above, we can create a new cdf $H(x) = F(G(x))$, resulting in a brand new distribution family with 4 parameters $(\alpha, \beta, a, b)$, whose pdf (probability density function) can be found by differentiating the produced cdf. Figure 3 illustrates the composition process for one particular choice of parameters.
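A sketch of this composition using scipy: the new cdf is the Beta cdf evaluated at the Gamma cdf, and its pdf follows from the chain rule. The parameter values are arbitrary and are not the ones used in Figure 3.

```python
# Sketch of the composition above: H(x) = F_beta(G_gamma(x)), with pdf
# h(x) = f_beta(G_gamma(x)) * g_gamma(x) by the chain rule.
import numpy as np
from scipy import stats, integrate

alpha_, beta_ = 2.0, 1.5        # Gamma (parent) parameters: shape, rate
p, q = 2.0, 3.0                 # Beta parameters (arbitrary)

def composed_cdf(x):
    return stats.beta.cdf(stats.gamma.cdf(x, a=alpha_, scale=1/beta_), p, q)

def composed_pdf(x):
    g = stats.gamma.cdf(x, a=alpha_, scale=1/beta_)
    return stats.beta.pdf(g, p, q) * stats.gamma.pdf(x, a=alpha_, scale=1/beta_)

area, _ = integrate.quad(composed_pdf, 0, 20)
print(area, composed_cdf(20))   # both close to 1: the composition is a valid cdf
```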

The process of building new distribution families using similar techniques is a well-explored topic in statistics (cordeiro2011new; bourguignon2014weibull; torabi2014logistic; cordeiro2017generalized). The process described above was applied in a similar way with the Kumaraswamy distribution, which has support in $(0, 1)$, resulting in the Kw-CWG distribution we have used in this project (afify2017new). The same applies to the OLL-GG distribution (prataviera2017odd), which we also used.

Figure 3: Example of composing a Gamma parent distribution with a Beta distribution. From the top-left in a clockwise direction: 1) cdf of the Gamma used, 2) cdf of the Beta used, 3) cdf of the composition and 4) the pdf of the composition.

7 Problem Formalization

First, let $X$ be a random variable that follows a certain probability model with support $(0, \infty)$ (the discussion presented here also applies to other semi-infinite supports), and let $x_1, \dots, x_n$ be a sample taken from $X$. Consider the case where the experimental minimum $\min_i x_i$ is relatively high, as illustrated in Figure 5. By support, we mean the set in which the probability density is not zero, apart maybe from a subset of measure zero; hereafter, we consider all probability functions to be defined on the whole real line. In an attempt to reduce the space of initial conditions to explore, we model such a variable as if it were supported on $(c, \infty)$, with $c > 0$ and $c$ below the sample minimum. Note that if this model were true, then the support of $X$ would be $(c, \infty)$, which is not true considering the initial assumptions. However, it seems reasonable to believe that if $P(X < c)$ is very low, then the loss incurred by such an “approximation” would be negligible.

Figure 5: Example of the first scenario analyzed. The underlying phenomenon is represented by a variable whose distribution is shown as the solid line. From such a distribution we take a sample (light grey histogram). Then we choose $c$ using methods to be discussed later, and fit a truncated distribution over $(c, \infty)$ (dashed line).

The approximation here consists of considering that the range of possible outcomes of $X$ begins at a certain $c$ that is not the true one. We would then like to model the data under such a consideration; that is, find a model for $X$ restricted to $(c, \infty)$. If we have knowledge about the distribution family of $X$ and know that its support begins at zero, then a good fit (asymptotically) would be achieved by selecting the model distribution as a truncated version of the distribution of $X$ (see Figure 5), given by

$$f_c(x; \boldsymbol{\theta}) = \frac{f(x; \boldsymbol{\theta})}{1 - F(c; \boldsymbol{\theta})} \quad \text{for } x > c,$$

where $f$ and $f_c$ are densities, $F$ is a cumulative distribution function (cdf), $\boldsymbol{\theta}$ is a parameter vector and $c$ is given. Notice that, for fixed $c$ and $\boldsymbol{\theta}$, $f_c(x; \boldsymbol{\theta}) = K f(x; \boldsymbol{\theta})$ for a constant $K = 1/(1 - F(c; \boldsymbol{\theta}))$. Because of that, the likelihood over a sample $x_1, \dots, x_n$ is

$$L_c(\boldsymbol{\theta}) = \prod_{i=1}^{n} f_c(x_i; \boldsymbol{\theta}),$$

and if $x_i > c$ for all $i$, then

$$L_c(\boldsymbol{\theta}) = \frac{1}{\big(1 - F(c; \boldsymbol{\theta})\big)^{n}} \prod_{i=1}^{n} f(x_i; \boldsymbol{\theta}),$$

so that any $\boldsymbol{\theta}$ maximizing the likelihood for the non-truncated model will also maximize $L_c$, proving our initial statement that the best asymptotic fit would be a truncated version of $X$’s distribution. This happens regardless of $c$, so if we allowed $c$ to also be optimized, then it would be chosen to maximize the factor $1/(1 - F(c; \boldsymbol{\theta}))^{n}$; from the monotonicity of $F$, this maximization happens when $c$ approaches the sample minimum $\min_i x_i$. We cannot have $c$ above the sample minimum, because in such a case at least one of the $x_i - c$ would be negative, placing that observation outside the truncated support and thus making its density term, and hence the likelihood, equal to zero.

The fact that $c$ has no influence on the best parameter vector found by MLE (maximum likelihood estimation) is actually a problem here. Although truncation allows us to shift the support origin, it does not help us with our original objective of making the space of initial parameters easier to design. This is related to the truncated and original densities being proportional to one another over $(c, \infty)$.

Given the impossibility of using truncated models, we turn back to the original problem of finding a model for $X$ by modelling just the region above $c$. Recall that this means the support of $X$ is approximated as $(c, \infty)$, with $c$ being either fixed or given as a parameter of the distribution family. Thus, consider that $X$ follows a certain distribution $g$, whereas the candidate distribution family that we use is $f(\cdot\,; \boldsymbol{\psi}, c)$, parametrized by $\boldsymbol{\psi}$ ($c$ can be fixed or not). We must then find the “best” $\boldsymbol{\psi}$ in the parameter space. Let $x_1, \dots, x_n$ be a sample from the real distribution; then the average log-likelihood can be expressed as

$$\frac{1}{n} \sum_{i=1}^{n} \log f(x_i; \boldsymbol{\psi}, c),$$

and as $n \to \infty$ we have, by the law of large numbers (degroot2012probability), its expected value:

$$E_{X \sim g}\big[ \log f(X; \boldsymbol{\psi}, c) \big], \tag{4}$$

which gives us a metric that we would like to maximize. With this we are seeking the model that obtains the highest expected likelihood over data generated by the real underlying distribution $g$.

Figure 6: The ideal distribution would be the truncated version of the real underlying distribution (dashed line). However, if we exclude truncated distributions as argued in the text, we end up with a suboptimal solution (solid line), obtained here by maximizing the average likelihood shown in Eq. 4.

We now consider the case where the distribution of $X$ is inferred using the same distribution family as the underlying one. Here, the optimal solution (the truncated version of the underlying distribution) is almost never included in the inference search space (for this to happen, the distribution family would need to be memoryless, and we know only the exponential and geometric distributions are memoryless (ross2010introduction)). Instead, the resulting distribution will be an approximation of this optimal solution, as shown in Figure 6. The area between the curves is illustrative of the difference between their cumulative probabilities, so it can be used to get an idea of how much they differ. We had constrained $c$ to be lower than the sample minimum $\min_i x_i$; for small samples, this leaves a large range over which $c$ could lie. Figure 7 shows what happens when we perform inference for different choices of $c$. High values, nearer the sample minimum, will result in distributions more disparate from the truncated one. On the other hand, low values nearer to the origin of the original variable (lower values would also work) tend to yield a distribution more similar to the ideal. We consequently face a tradeoff, as high values of $c$ are what allows recycling a grid of initial values for various phenomena.

Figure 7: Considering a certain Gamma distribution with a long left tail, this figure shows the best Gamma approximations to its ideal truncated version when performing inference on the support $(c, \infty)$ for different values of $c$. Each shaded area shows the region between two curves: i) the ideal truncated distribution, and ii) the curve obtained by maximizing the average log-likelihood. The curves have been displaced on the y-axis for better visualization.

The above discussion has so far assumed that the statistician knows that the underlying random variable is supported on $(0, \infty)$. However, it is often the case that this is not known with sufficient certainty. In fact, for distributions with long left tails, which are the main object of study here, we will probably not observe any values very near zero, even if the underlying distribution is indeed supported on $(0, \infty)$. Nevertheless, the case of a support starting at some other point is also covered if the coordinate system is translated.

8 Proposed Methods for Performing the Inference Procedure

Due to the aforementioned hindrance in determining whether the underlying phenomenon is supported on $(0, \infty)$, we argue that all semi-finite supported random variables should be treated as supported on $(c, \infty)$, for some unknown $c$, until proven otherwise. As $c$ acts as an estimator of the populational minimum, any value of $c$ above the sample minimum does not make sense. In this scenario, given a sample $x_1, \dots, x_n$ we would like to find the underlying distribution within a parametrized family supported on $(c, \infty)$.

In order to ease the determination of initial parameters for the subsequent inference process, $c$ is to be chosen regardless of the real populational minimum, merely aiming to have the resulting approximation loss be low enough and $c$ be as near the sample minimum as possible, in view of the tradeoff discussed in Section 7. Our choices are then to either estimate $c$ and then perform inference over the shifted sample using a family supported on $(0, \infty)$, or to add $c$ as a location parameter of such a family, which then becomes supported on $(c, \infty)$. Recall that we eliminated the possibility of truncation previously. We analyze both possibilities, and in the end propose a third alternative that deviates slightly from the usual procedure of classical inference. We remind the reader that the objective is to maximize likelihood and minimize computational cost.

I) Inferring the Location Parameter. Let $X$ represent the underlying phenomenon, with support $(0, \infty)$. We want to model it using a family of distributions $f(\cdot\,; \boldsymbol{\theta})$ supported on $(0, \infty)$, though shifted to $(c, \infty)$. That is, we actually model the variable through the density

$$f(x - c; \boldsymbol{\theta}),$$

in which case, by abuse of notation, we say $X \sim f(\cdot\,; \boldsymbol{\theta}, c)$, with $c$ constrained to lie in the interval $(0, \min_i x_i)$ ($\min_i x_i$ being the sample minimum). With this, MLE can then be performed to find $\boldsymbol{\theta}$ and $c$.

II) Estimating the Location Parameter. Since $c$ is strongly related to the populational minimum, in that the value of $c$ that maximizes the likelihood, when left unconstrained, approaches the sample minimum, it makes sense to estimate it from a sample. With an estimate $\hat{c}$ we can then perform inference using a positively supported distribution as usual, after subtracting $\hat{c}$ from the sample (recall that $\hat{c} \le \min_i x_i$). Taking the sample minimum itself as the estimate, besides being very biased, also frequently results in the likelihood term for the smallest sample item being zero, rendering MLE impossible. After subtracting the estimate from the sample, the smallest item ends up being exactly zero; the problem here resides in the fact that many distributions assign zero density at zero for a large range of their parameters, such as the Gamma distribution whenever its shape parameter is greater than one.

Shifting the estimate slightly to the left is thus needed, maybe by multiplying it by some factor. But what should this factor be? In our experience, deciding this automatically for various data sets with different shapes and scales happened to be quite difficult. For example, one fixed choice of factor worked for data sets with smaller values, but not for larger ones, where the estimate ended up shifted too far from the sample minimum. In order to find better alternatives, we rely on order statistics.

The sample minimum $X_{(1)} = \min(X_1, \dots, X_n)$ has a known cumulative distribution (mood1950introduction):

$$F_{X_{(1)}}(x) = 1 - \big(1 - F(x)\big)^{n}, \tag{5}$$

for a sample of size $n$ of a variable with cdf $F$. Therefore, we can expect the sample minimum to have, for small samples, a variance similar to that of the underlying variable, and we can expect this variance to decrease as $n$ grows. This reasoning brings us to the first estimator:

(6)

where the formula involves the variation coefficient of the sample (sample variance divided by sample mean) and a logarithm of the sample size, the logarithm base being arbitrary. Here and in the subsequent estimators, we use the modulus because the underlying variable can have a negative mean, in which case the variation coefficient can also be negative. The interpretation is that we are moving the sample minimum to the left, with an intensity that is directly proportional to the data variability and inversely proportional to the sample size. Our experience showed this estimator to be quite useful and computationally cheap. It is worth noting that taking the coefficient of variation eliminates, to a certain extent, problems caused by the scale of the data, since it involves a division by the sample mean. This estimator has the advantage of simplicity, and even though it gives only a very rough estimate of the desired low quantile, it appears to work very well in practice.

We now turn to more complex alternatives that have stronger theoretical grounds. Although there are many parametric approaches for estimating quantiles (gardes2005estimating; de2018high), estimating very low quantiles is a problem that has not yet been solved in a sufficiently “general” way. That is, most parametric solutions rely on assumptions about the underlying distribution or quantile functions (constraints on their derivatives, for example). To maintain generality (and because this later proved to work well), we opt for a general nonparametric approach, using the empirical cdf $\hat{F}_n$ over the sample as the main tool. Given a sample $x_1, \dots, x_n$, $\hat{F}_n(x)$ is defined as the proportion of sample items lower than $x$, thus forming a non-decreasing step function. Uniform convergence of $\hat{F}_n$ to the true cdf $F$ is given by the Glivenko-Cantelli theorem (this, as well as all other results used hereafter, requires independent and identically distributed (iid) sampling), so for sufficiently large $n$ we have information about the probability of a new sample item being lower than the current sample minimum. As this probability decreases, the less we can expect the populational minimum to be much lower than the actual sample minimum, meaning that we can then define a second estimator:

(7)

where we also embody the hope that the deviation is proportional to the sample standard deviation. A tighter estimate follows by noticing that the law of the iterated logarithm (vapnik1998statistical) gives the rate of convergence of the empirical cdf:

$$\sup_x \big|\hat{F}_n(x) - F(x)\big| \;=\; O\!\left(\sqrt{\frac{\ln \ln n}{2n}}\right) \quad \text{almost surely}.$$

Now, considering that $\hat{F}_n(x)$ is zero for any $x$ below the sample minimum, we must have, for sufficiently large $n$,

$$F(x) \;\le\; \sqrt{\frac{\ln \ln n}{2n}} \qquad \text{for } x < \min_i x_i,$$

which can then substitute the corresponding term in Eq. (7):

(8)

The Dvoretzky–Kiefer–Wolfowitz inequality (massart1990tight) can also be invoked, which provides a different way to view the estimator. The inequality is:

$$P\Big( \sup_x \big|\hat{F}_n(x) - F(x)\big| > \epsilon \Big) \;\le\; 2 e^{-2 n \epsilon^2},$$

and by doing the necessary manipulations, we derive that the following will hold with probability $1 - \alpha$:

$$\sup_x \big|\hat{F}_n(x) - F(x)\big| \;\le\; \sqrt{\frac{\ln(2/\alpha)}{2n}},$$

so if we choose $\alpha$ to be very low, we can expect $F$, at points below the sample minimum, to be lower than or equal to the right-hand side of the above equation. Following the same logic as previously, we define another estimator:

(9)

which offers a probabilistic view, instead of the previous asymptotic view given by the law of the iterated logarithm. Figure 8 illustrates all of these estimators.
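Since the exact formulas of Eqs. (6)–(9) are not reproduced here, the sketch below only illustrates the common structure they share according to the description above: shift the sample minimum to the left by an amount that grows with the data variability and shrinks with the sample size, here using the DKW bound $\sqrt{\ln(2/\alpha)/(2n)}$ as the shrinking factor. The particular combination of terms is an assumption made for illustration, not the estimators actually proposed.

```python
# Illustrative only: the estimators of Eqs. (6)-(9) share the structure
# "sample minimum minus a deviation that grows with variability and shrinks
# with n". Here the DKW bound sqrt(ln(2/alpha)/(2n)) plays the shrinking role;
# the exact formulas in the text may differ.
import numpy as np

def low_quantile_estimate(sample, alpha=0.05):
    sample = np.asarray(sample)
    n = len(sample)
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))   # DKW deviation bound
    return sample.min() - sample.std(ddof=1) * eps

rng = np.random.default_rng(5)
data = rng.gamma(shape=5.0, scale=0.3, size=200) + 2.0   # populational minimum 2.0
print(data.min(), low_quantile_estimate(data))
```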

Figure 8: The low quantile estimators based on the data represented by the histogram in light gray. The data was generated from the density shown as a black line.

III) Iterative Determination of the Location Parameter. Inference by MLE begins with the assumption that the underlying distribution comes from a certain family. Under this assumption, we do have a lot of information about the underlying cdf and pdf. We intend to use this information to our advantage here.

The cdf of the sample minimum is given by Eq. (5). By inverting this equation we obtain the quantile function of the minimum:

$$Q_{X_{(1)}}(p; \boldsymbol{\theta}) \;=\; F^{-1}\big(1 - (1 - p)^{1/n}; \boldsymbol{\theta}\big). \tag{10}$$

With this, the median of the sample minimum is given by $Q_{X_{(1)}}(1/2; \boldsymbol{\theta})$, under the assumptions that the underlying distribution resides in the specified family and has parameter vector $\boldsymbol{\theta}$. The median can be seen as a good “guess” for what the sample minimum should be, and so the sample should be shifted so that the sample minimum coincides with such a guess. That is, we want to find $c$ such that $\min_i x_i - c = Q_{X_{(1)}}(1/2; \boldsymbol{\theta})$, and the function that undergoes MLE is the density of the family evaluated on the sample shifted by this $c$, which itself depends on $\boldsymbol{\theta}$.

However, it is very unusual to place a sample measurement (in this case the sample minimum) inside the density function. This method has the advantage of not increasing the number of parameters, which favors better values of information criteria and reduces the chance of overfitting. One worry here is that the aforementioned modification to the density function might make the likelihood surface very irregular and thus harder to optimize over. Despite that, experiments have shown that it works quite well, as we show in the following; a theoretical analysis of the optimization surface will be done in the future.
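One way to read the construction above is sketched below: for each candidate parameter vector, the sample is shifted so that its minimum lands on the median of the sample-minimum distribution, $Q_{X_{(1)}}(1/2; \boldsymbol{\theta}) = F^{-1}(1 - 0.5^{1/n}; \boldsymbol{\theta})$, and the likelihood of the shifted sample is evaluated. This is an illustrative reading with a Gamma family and synthetic data, not the exact implementation used in the experiments.

```python
# Sketch of Method III as described above: for each candidate theta, place the
# observed sample minimum at the median of the sample-minimum distribution,
# then evaluate the likelihood of the shifted sample. One possible reading only.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(6)
sample = rng.gamma(shape=3.0, scale=0.4, size=150) + 1.0   # unknown offset of 1.0
n, x_min = len(sample), sample.min()

def neg_log_lik(params):
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf
    q_min_median = stats.gamma.ppf(1.0 - 0.5 ** (1.0 / n), a=shape, scale=scale)
    c = x_min - q_min_median                    # shift implied by this theta
    shifted = sample - c                        # shifted minimum equals q_min_median
    return -np.sum(stats.gamma.logpdf(shifted, a=shape, scale=scale))

res = optimize.minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x)
```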

9 Experimental Results

So far we have tested the proposed methods only with the execution time data obtained during this project. We had multiple sets of execution time samples, and performed MLE using multiple distribution families. For each sample set and inference method, one of these distributions yields the best likelihood, which we take as the measure of quality of that inference method on that sample set. Finally, for each sample set one of the inference methods has the best quality, so we subtract that best value from the quality of each inference method. This results in one quality measure per sample set for each inference method, which are plotted as boxplots in Figure 9. Here, a value of zero means the method was the best inference method on that sample set, and negative values show its deviation from the best method.

Figure 9: Comparison of inference methods based on the best likelihood (left) and best cross-validated likelihood (right) they yielded for each of the 37 sample sets. A value of zero represents a case where that inference method achieved the best result among all methods; negative values show the disparity from the best method.

First consider the boxplot concerning the likelihood only. In this case the figure indicates that the inferC method (which adds $c$ as a parameter of the distribution) has the best median as well as a low dispersion, so overall it performed best. However, it increases the parameter count, which in general is not desirable. Among the approaches that do not increase the parameter count, two of the proposed estimators displayed very similar performances, comparable to that of inferC and much better than the standard inference method. The main problem with the standard method is what we mentioned in the beginning of the chapter: the optimization algorithm fails to find the point of maximum likelihood, consequently giving some sort of preference to the simpler distribution families. The proposed iteratedC had a median very close to the best one, though with a longer tail towards worse likelihood values. We consider its performance quite good, especially considering that it has a solid theoretical reasoning (together with inferC), so we believe it should not be discarded. Moreover, this method has a parameter that we fixed in Section 8, so we hypothesize that it could be tuned to yield even better results.

There are a few notable changes when turning to the cross-validated metric. First of all, there is a remarkable improvement in the performance of all estimators relative to inferC, with one of the proposed estimators achieving the best median. inferC still displayed a very good performance, proving that it is quite robust, more than one could expect from a method that increases the parameter count and, as a consequence, the chance of overfitting. In general, all of our proposed methods (except iteratedC) proved to be robust, in the sense that their performance does not change significantly under cross-validation. iteratedC displayed a poor performance when cross-validating, but we attribute this more to how the method was formulated than to its actual usefulness. We intend to investigate this further in the future, in order to improve the method’s formulation.

Minimizing computational time of the inference process is also one of our objectives. For each sample set and each distribution model, each inference method took some amount of time to be performed. Recall that MLE optimization is done with many initial parameters (that we defined), which can lead the optimizer to converge or not. Furthermore, optimization is faster if the initial parameters are nearer to the optimal value, so using only one value as reference would probably be too biased towards the choice of initial parameters. We thus consider: 1) the average time elapsed in each of these optimizations, i.e. the time taken per initial parameter by the inference method in question; and 2) the average time for all initial parameters that led the optimizer to converge.

Figure 10 shows these average times for each distribution family (experiments performed on the Euler supercomputer, www.cemeai.icmc.usp.br/Euler/index.html, using an Intel Xeon E5-2680v2 2.8 GHz and 128 GB of RAM). As expected, inferC was the slowest in most cases, as it has one extra degree of freedom to optimize; it is followed by iteratedC (probably because the optimization surface became more irregular, as discussed previously) and the standard method. Interestingly, the estimators were almost always faster than the standard method. This shows that the estimators we proposed not only help achieve better likelihood values, but also hasten the optimization process. In most cases the time difference was not too big; in the worst case (lognormal), the estimators were noticeably faster than both inferC and the standard method. Nevertheless, even small differences can matter when there is a large number of distributions, initial parameters and sample sets to undergo distribution fitting: if the whole inference process takes 5 minutes, a reduction of only 20% in time would bring it down to 4 minutes. The estimators suffer some degradation when considering all initial parameters, i.e. including those that did not converge; this shows that non-convergent cases can take more time than convergent ones. Despite that, the whole discussion above holds, except for the Weibull distribution, which we will consider as an anomalous case here and not discuss further.

Figure 10: Comparison of computational cost of the inference methods for each distribution family, considering only the initial parameters that led to convergence (top), and all initial parameters (bottom). The error bars show a confidence interval, obtained through bootstrapping.

To sum up, the proposed estimators and inferC yield very good likelihood values, and all of them were robust. The estimators were slightly more robust, and managed to decrease the time taken to perform MLE, most of the time being even faster than the standard method. On the other hand, inferC was the slowest of all the tested methods.

10 On the Suitability of Using Probability Models

The randomness present in the execution times of programs is obvious. Despite this, the correct way of modelling it probabilistically is not as clear; in fact, before modelling any problem probabilistically, some time must be spent interpreting the underlying phenomenon being observed and foreseeing what could go wrong in the model. We illustrate one possible problem in Figure 11. Suppose that the program under study (henceforth referred to as the experimental program) is run a number of times and the execution times are recorded; this might yield a histogram such as the one on the left of the figure. One could naively accept such a histogram as representative of the whole population of execution times that the program could have yielded; however, knowledge about how the computer works internally forces us to be more skeptical about it.

Figure 11: Possible histogram for execution times of a program (to the left), and the histograms of the two halves of the whole sample (to the right), showing that the underlying population suffered a permanent change at some point in time.

If we plot the histograms of the first and second halves of the sample, we might get what is shown on the right of Figure 11, which clearly shows that something is wrong, at least for sufficiently large sample sizes, as there is a theoretically justified expectation that the histogram of each half be very similar to the histogram of the whole sample. More specifically, something has changed at some point in time, so the underlying populational distribution is different for the first half than for the second one. It could have happened, for example, that during the experiment some daemon process (e.g., an automatic backup) started running and competing with our experimental program for the computer resources (tanenbaum2015modern), which led to a degradation of the execution times. This can be dealt with in a few different ways:


  • we can run the experimental program at large intervals of time, over a total period of a few days, so that the histogram reflects the probability distribution averaged over all possible states of the machine (e.g., busy or idle);

  • we can try to make sure the machine will be sufficiently stable during experiments;

  • the problem can be modelled as a time series, which would allow for a better prediction of the next execution time, based on previous ones (box2015time);

  • the samples of the experiment could be investigated for anomalies such as the one described above, using techniques from the area of novelty detection (markou2003novelty).

Deciding what to do and how to do it is the problem of modelling, and it requires a deeper understanding of the problem domain (in our case, computer programs), which we try to better formulate in the following.

10.1 The Problem of Concurrent Processes

A program is a binary file containing machine instructions, each of which is a number of bits describing what the computer CPU must do. A process is a program in execution, that is, it has been loaded into primary memory and its instructions can be read at any time (controlled by the operating system) by the CPU for executing them (tanenbaum2015modern). During execution, the process can use all the computer resources, namely the CPU cores and cache memories, primary memory, secondary memory (e.g., hard disk), coprocessors such as GPUs, and so on (patterson2013computer; tanenbaum2016structured; stallings2003computer). The time to perform any operation with such resources is random, so they play a major role in the randomness of the program’s total execution time.

If there are multiple processes running in the same system, they compete for the aforementioned resources and will degrade the execution time of our experimental program. Modern CPUs are able to execute multiple processes at the same time; each can be in its own CPU core, in which case the different processes will share cache memories (the level 1 cache is usually local to the core, so no sharing happens there), primary memory and access to the system bus (and consequently, access to disk, GPUs, etc). It is also possible for two processes to run in the same core (via hyperthreading), in which case they will also share the core pipeline and the lowest level caches (advanced modern CPU technologies are well covered in the work of hennessy2011computer). Up to this point, each process runs continuously. If there are too many processes actively running, then in most operating systems processes will alternate between “executing” and “suspended” states, where the latter means the process is not running at all (tanenbaum2015modern). Finally, a last layer of randomness is introduced in cloud computing because programs are often run inside virtual machines, which add yet another level of abstraction between the program and the computer resources.

In the context of cloud computing, which is the focus of this monograph, machines tend to always have some process in an active state. At each time point, the active processes degrade the execution time of the experimental program by some random amount of time, which depends on the number of active processes and on their types, that is, whether they are limited by CPU, memory or input-output operations (tanenbaum2015modern). In practice it would be difficult for the cloud provider to keep track of the type of each application, so we disregard it here. Ideally, the information of how many processes are active would be available, in which case it seems reasonable, based on what was discussed above, to model the execution time $T$ of the experimental program as

$T = T_0 + \Delta_k$,

where $T_0$ follows $f_0$, the probability distribution of the time of the program in a machine with no other active processes, $k$ is the number of active processes, and $\Delta_k$ follows $g_k$, the distribution of the time degradation caused by these active processes, which depends on $k$. By combining regression analysis and statistical inference (degroot2012probability), this model already allows us to estimate the degradation distribution $g_k$.
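As an illustration, the following is a minimal R sketch of this additive model; the gamma base time and the exponential per-process delays are assumptions made purely for the example, not distributions taken from our experiments, and the final regression is only a first step towards estimating how the degradation grows with $k$.

    # Simulate the additive degradation model: total time = idle time + delays
    # caused by k active processes (illustrative distributions only)
    set.seed(42)
    simulate_time <- function(k) {
      base  <- rgamma(1, shape = 50, rate = 10)          # T0: time on an idle machine
      delay <- if (k > 0) sum(rexp(k, rate = 2)) else 0  # degradation from k processes
      base + delay
    }

    k_values <- rep(0:4, each = 200)          # number of active processes in each run
    times    <- sapply(k_values, simulate_time)

    # Regression of execution time on the number of active processes
    fit <- lm(times ~ k_values)
    summary(fit)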

10.2 The Problem of Modelling the Program’s Inputs

A second problem arises when considering that a program can be given different input data. How does this affect the execution time? In order to understand this, we must introduce some definitions. Every program can be described by a control flow graph (CFG) (allen1970control), in which nodes represent a linear sequence of instructions, and edges represent jumps that make the CPU stop reading instructions from the current node and begin reading the ones in the target node. These jumps are most often conditional branches that decide the target node based on whether a certain predicate (such as i < 10) is true or false, which is indicated as labels on the edges of the graph. A graph with $n$ nodes $v_1, \dots, v_n$ always has an entry node in which computation begins, and one or more exit nodes that terminate the program; for simplicity, and without loss of generality (if there are multiple exit nodes, we can turn them into normal nodes and make them point to a new, unified exit node), we will assume a single exit node, with $v_1$ always being the entry point and $v_n$ always the exit one. Each execution of the program corresponds to a walk through this graph beginning on $v_1$ and ending on $v_n$; an execution path is thus the sequence of nodes visited during an execution, beginning with $v_1$ and ending with $v_n$.

Figure 12: Source code for calculating the $n$-th Fibonacci number (left) and its associated control flow graph (right).

An example is illustrated in Figure 12, where a program that computes the $n$-th Fibonacci number is converted to its associated CFG, with nodes numbered from top to bottom following our previous assumptions. The input of this program is $n$, the index of the Fibonacci number that should be computed. As each instruction of the program operates deterministically upon the results of previous operations, we can conclude that the program itself is deterministic, that is, for a specific input the execution path of the program is uniquely defined.
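To make this determinism concrete, the sketch below shows a minimal iterative Fibonacci in R (not necessarily the implementation of Figure 12): for a fixed $n$, the loop predicate is decided in exactly the same way on every run, so the walk through the CFG is uniquely determined by the input.

    # Iterative Fibonacci: for a fixed n the same branches are taken in the
    # same order, so the execution path through the CFG is fixed
    fib <- function(n) {
      a <- 0; b <- 1
      i <- 0
      while (i < n) {      # the predicate labelling the branch in the CFG
        tmp <- a + b
        a <- b; b <- tmp
        i <- i + 1
      }
      a
    }
    fib(10)  # 55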

The problem introduced by the variety of possible inputs is that the mix of instructions executed by the program can change a lot between two different inputs. The execution time probability distribution for the program depends a lot on this mix of instructions, so the variety of inputs must somehow be accounted for. One way to do this is to treat the input as a random variable, and assign a probability for the program to receive any given input, as done by braams2016deriving. Other ways specific to the WCET problem were discussed in Section 3.

A third way to deal with inputs, which will be the approach adopted here, is to assume that the program will always follow the same path, or a set of paths similar enough that their execution times can be modelled under the same probability distribution. This assumption is reasonable in at least the following two scenarios:


  • the program will be run multiple times without changing the input, which is common for scientists that are performing experiments with their novel algorithms;

  • the program will be run multiple times with different inputs, but the input variable that will be changed is not expected to change the execution path significantly.

As an example of the latter scenario, take the problem of protein structure prediction, which is a hard problem that has received a lot of attention regarding how it can be distributed over multiple machines in order to improve speed (alva2016mpi) or throughput (kim2004protein); consequently, it is a strong candidate for being executed on the cloud (kajan2013cloud; mrozek2015scaling; yachdav2014predictprotein). There exist a number of stochastic optimization algorithms designed to solve this problem, such as those in the works of brasil2013multiobjective, gao2018incorporation and cutello2005multi, and one of their characteristics is the number-of-iterations parameter defined by the user, which controls how long the algorithm should run for. The execution path of these algorithms in general depends more on such parameter than on the input protein itself, so running the program on different proteins using the same number of iterations tends to yield sufficiently similar execution paths. Note that cloud applications like this one, which are mostly run by researchers and scientists of varied knowledge areas, go under the special name of scientific workflows (shishido2018genetic) and, as a consequence of what was just discussed, are a major focus of the present work.

It is worth noticing that some algorithms, such as the stochastic optimization algorithms mentioned above, make use of random number generators. In this case, each instruction is no longer deterministic, as we required earlier. This is fine as long as the generated numbers do not change the predicates along the execution path, which could, for instance, change the number of iterations in a for-loop and thus greatly impact the execution time. This often does not happen in most evolutionary algorithms, such as ant colony optimization, particle swarm optimization and genetic algorithms (yu2010introduction). Therefore, we can still regard execution paths as similar in this context.

11 Selection of Distributions and Experimental Programs

There exist a huge number of probability distribution families that could be used here to fit the execution times of a program in an idle system (with no other active processes), obtained through experiments. Due to the nature of the problem at hand, it is more adequate to choose distributions whose support is the positive real line (also called lifetime or survival models (klein2006survival)), as execution times are naturally continuous and positive. This restricted class of distributions is still enormous, so we decided to begin by using two large models with many parameters, which consequently are very flexible; after fitting the data, they could provide information on which other distributions to choose. The chosen large models are the Kumaraswamy complementary Weibull geometric (Kw-CWG, afify2017new), which has 5 parameters, and the odd log-logistic generalized gamma (OLL-GG, prataviera2017odd), which has 4 parameters.

The density functions of Kw-CWG and OLL-GG are given in afify2017new and prataviera2017odd, respectively; both involve the gamma function, defined as $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,dt$, which has the property $\Gamma(n) = (n-1)!$ when $n$ is a positive integer, so it should be viewed as an extension of the factorial to the real (or complex) domain (boyce2012elementary), and the incomplete gamma ratio, defined as $\gamma_1(x, a) = \frac{1}{\Gamma(a)} \int_0^x t^{a-1} e^{-t}\,dt$, whose integral is very similar to the one in $\Gamma(a)$, with just a different interval of integration. The Kw-CWG model is a generalization of the complementary Weibull geometric distribution (louzada2011complementary), and OLL-GG results from composing a generalized gamma with an odd log-logistic distribution (cordeiro2017generalized); both comprise (for fixed values of their parameters) a number of other distributions as sub-models.
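Both special functions are readily available in base R, the language used for the statistical analysis in this work; the short sketch below only illustrates the correspondence (gamma() for $\Gamma$ and pgamma() for the incomplete gamma ratio) and is not part of our fitting code.

    a <- 3.5
    x <- 2.0
    gamma(a)                        # the gamma function Gamma(a)
    pgamma(x, shape = a, rate = 1)  # incomplete gamma ratio gamma_1(x, a)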

Initial experiments indicated that the underlying distributions of execution times are mostly unimodal and two-tailed; there were some exceptions where two modes could be observed. In this light, we immediately exclude distributions such as the exponential, which is one-tailed. Due to their wide usage, the gamma, Weibull and lognormal models were used; most of their possible shapes are unimodal and two-tailed, as desired. Their generalized versions, the generalized gamma (stacy1962generalization) and the exponentiated Weibull (mudholkar1993exponentiated), were also considered in the hope that they would fit the data better. Since the normal distribution is broadly used in the literature, it was used as well; however, the fact that it is a distribution over the whole real line could place it at an unfair disadvantage relative to the others. To avoid this problem, we also used a truncated form of the normal distribution: if $\phi$ is the density function of the normal, we take $f(x) = \phi(x) / P(X > 0)$ for $x > 0$, where $X$ is the corresponding untruncated normal variable and the denominator is a normalization factor that forces $f$ to integrate to 1, as all probability densities should. The density functions of these distributions are shown in Table 1, where $x > 0$ unless specified otherwise and $\Phi$ is the cumulative distribution function of the standard normal distribution.

Weibull: $f(x) = \frac{\alpha}{\sigma}\left(\frac{x}{\sigma}\right)^{\alpha-1} e^{-(x/\sigma)^{\alpha}}$
Gamma: $f(x) = \frac{1}{\Gamma(\alpha)\,\sigma^{\alpha}}\, x^{\alpha-1} e^{-x/\sigma}$
Gen. Gamma: $f(x) = \frac{\beta}{\Gamma(\alpha)\,\sigma^{\alpha\beta}}\, x^{\alpha\beta-1} e^{-(x/\sigma)^{\beta}}$
Exp. Weibull: $f(x) = \lambda\,\frac{\alpha}{\sigma}\left(\frac{x}{\sigma}\right)^{\alpha-1} e^{-(x/\sigma)^{\alpha}} \left(1 - e^{-(x/\sigma)^{\alpha}}\right)^{\lambda-1}$
Normal: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R}$
Trunc. Normal: $f(x) = \frac{1}{\sigma\sqrt{2\pi}\,\Phi(\mu/\sigma)}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Lognormal: $f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}$
Table 1: Density functions of the probability distributions used.
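As a quick illustration of the truncated normal entry in Table 1, the R sketch below evaluates the density by renormalizing the usual normal density over the positive axis and checks that it integrates to 1 (a minimal example, not the code used in our experiments).

    # Density of the normal truncated to x > 0: the usual normal density
    # restricted to the positive axis and renormalized
    dtnorm0 <- function(x, mean, sd) {
      ifelse(x > 0, dnorm(x, mean, sd) / pnorm(mean / sd), 0)
    }

    # Sanity check: the density integrates to 1 over (0, Inf)
    integrate(dtnorm0, 0, Inf, mean = 1, sd = 2)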

Seven models were considered for the random variable representing execution times of a program in an idle system. Initially, the location parameter was considered as having its own probability distribution, which was estimated from the data. As this proved problematic, we used the other six methods, presented earlier in this monograph, for dealing with location parameters.

In order to test the hypothesis that execution times of programs can be modelled using the same parametrized distribution, we need a representative set of programs and computer platforms on which these programs will be run. In an attempt to be representative, we have chosen three programs that exercise different components of the computer: one CPU-bound program, one memory-bound and one IO-bound (more specifically, bound by disk usage) (tanenbaum2015modern). Respectively, the programs (these programs and other code resources can be found in github.com/matheushjs/ElfProbTET or mjsaldanha.com/sci-projects/3-prob-exec-times-1/) are: 1) calculation of the Mandelbrot set, which involves repeated evaluation of the map $z \mapsto z^2 + c$ and is a fractal, as it exhibits complex structures in arbitrarily small open sets (see alligood1996chaos); 2) calculation of the shortest path in a random graph using Dijkstra’s algorithm; and 3) performing random insertions in a table of a database.
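For illustration only, the sketch below shows, in R, the kind of escape-time iteration that underlies the Mandelbrot computation of program 1; it is a didactic version of the recurrence $z \mapsto z^2 + c$, not the implementation used in our experiments (which is available in the repository above).

    # Escape-time iteration of the Mandelbrot set for a single point c0
    escape_time <- function(c0, max_iter = 1000) {
      z <- 0 + 0i
      for (i in seq_len(max_iter)) {
        z <- z * z + c0
        if (Mod(z) > 2) return(i)   # the point escapes: it lies outside the set
      }
      max_iter                      # no escape observed: treat the point as inside
    }

    escape_time(complex(real = -0.75, imaginary = 0.1))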

Concerning machines, we attempted to build a selection as diverse as possible in terms of hardware, within the limits of the infrastructural resources available to us; Table 2 shows this selection. We highlight that the selection includes CPUs of two different vendors, as well as HD and SSD disks, which we expect to be among the most influential factors on the probability distribution of execution times. Unfortunately, all CPUs follow the same instruction set architecture (ISA) specification (x86-64, also known as AMD64), which we believe is a limitation of this work; future work should include other ISA specifications (e.g., PowerPC, ARM).

Machine Name | CPU | Memory | Disk
Andromeda | AMD FX(tm) 8350 Eight Core Processor | 4x Corsair 8GB DIMM DDR3 Synchronous 800 MHz | Seagate HD 2TB ST2000DM0011ER1
HalleyHD | Intel Core i7 4790 CPU 3.60GHz | 4x AMI 8GB DIMM DDR3 Synchronous 1600 MHz | Seagate HD 2TB ST2000DM0011CH1
HalleySSD | Intel Core i7 4790 CPU 3.60GHz | 4x AMI 8GB DIMM DDR3 Synchronous 1600 MHz | Kingston SSD 240GB SA400S3
Helix | Intel Core i5 4440 CPU 3.10GHz | 4x Kingston 4GB DIMM DDR3 Synchronous 1333 MHz | Kingston SSD 240GB SA400S3
Table 2: Machines used for performing experiments. Most of them are from the Laboratory of Distributed Systems and Concurrent Programming (http://lasdpc.icmc.usp.br/).

12 Experiments and Results

Each of the three selected programs was executed 1000 times on each of the four selected machines, for each different input chosen (there were 3 different inputs for the Mandelbrot program, and 4 for the Dijkstra and disk database programs), yielding 1000 samples of execution times for each program-input-machine triple. Each of these sets of 1000 samples then underwent maximum likelihood estimation to find best-fit parameters for each of the selected distribution families (an example is given in Figure 13), considering each of the seven proposed probabilistic modellings for execution times (see Section 11 for details on such decisions). The experimental programs were implemented in C and Python, and the statistical analysis was made in the R language; the reader can find all code and experimental data in our public repository or on the homepage of this project (see the links given in Section 11).

Figure 13: Example of result obtained by performing maximum likelihood estimation on a sample of execution times. Each line reflects the best parameters found for the specified distribution family.

A few technical details are worth mentioning. First, each program was run on an almost-idle machine, that is, no user application was being run, but system software and other daemons that occasionally become active were. Second, each set of 1000 samples was visually checked for whether the histograms of the first and second halves differed too much (the experimental data is also available on this project’s home page; see Section 11), which would characterize a lack of identical distribution, as discussed in Section 10. Third, care was taken so that random number generators that could change the execution path of the programs (see Section 10.2) received a fixed seed. Finally, the generalized gamma, Kw-CWG and OLL-GG probability distributions were not readily available in R, so we implemented them ourselves. Since the Kw-CWG implementation in R was slow (due to R being an interpreted language), we made an efficient C++ implementation, which uses parallel computing if supported by the user’s machine. All these distributions were published in the official R repository (see cran.r-project.org/package=ggamma, cran.r-project.org/package=elfDistr and cran.r-project.org/package=ollggamma).
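The half-versus-half check described above can also be automated; a minimal R sketch using a two-sample Kolmogorov-Smirnov test instead of visual inspection is given below (the file name is a hypothetical placeholder for one of the sample sets, with one execution time per line).

    # Compare the first and second halves of a sample of execution times
    times <- scan("sample_times.txt")
    n <- length(times)

    first  <- times[1:(n %/% 2)]
    second <- times[(n %/% 2 + 1):n]

    # A very small p-value suggests the two halves do not come from the same
    # distribution, i.e. the machine's state changed during the experiment
    ks.test(first, second)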

In order to evaluate how well the distributions fit the data, we used well-known metrics specific to this problem. Let $\hat\ell$ be the logarithm of the maximum likelihood obtained, $k$ the number of parameters of the distribution and $n$ the sample size; the metrics used are

$\mathrm{AIC} = 2k - 2\hat\ell$, $\quad \mathrm{CAIC} = k(\ln n + 1) - 2\hat\ell$, $\quad \mathrm{HQIC} = 2k\ln(\ln n) - 2\hat\ell$, $\quad \mathrm{BIC} = k\ln n - 2\hat\ell$,

known as the Akaike (AIC), consistent Akaike (CAIC), Hannan-Quinn (HQIC) and Bayesian (BIC) information criteria (anderson2004model), each of which penalizes models with too many parameters in order to prevent overfitting; the maximized likelihood was also used, in the form $-2\hat\ell$ merely so that it is directly comparable to the information criteria values. Finally, we also 5-fold cross validated this metric, by fitting the models on subsamples of size $4n/5$ and then calculating the metric on the remaining $n/5$ observations (cross validation is also a way to control overfitting, but the underlying theory, discussed by vapnik1998statistical and devroye1996probabilistic, for example, is completely different from that of information criteria). For all metrics, the lower the value, the better the model fits the data. Besides these, we also recorded the time elapsed in the optimization procedure, as estimation should not take too long if it is to be used in a practical situation of stochastic scheduling (see Section 1).
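For concreteness, the R sketch below computes these criteria from a maximum likelihood fit; the Weibull family and the tiny sample are placeholders chosen only to make the example self-contained, not data from our experiments.

    # Information criteria from the maximized log-likelihood ll, number of
    # parameters k and sample size n
    info_criteria <- function(ll, k, n) {
      c(AIC  = 2 * k - 2 * ll,
        CAIC = k * (log(n) + 1) - 2 * ll,
        HQIC = 2 * k * log(log(n)) - 2 * ll,
        BIC  = k * log(n) - 2 * ll,
        m2LL = -2 * ll)   # -2 * log-likelihood, comparable to the criteria
    }

    # Placeholder sample of execution times (seconds) and a Weibull MLE fit
    times <- c(10.2, 10.5, 10.3, 10.8, 10.4, 10.6, 10.9, 10.1, 10.7, 10.5)
    nll <- function(p) {
      if (any(p <= 0)) return(Inf)   # keep shape and scale positive
      -sum(dweibull(times, shape = p[1], scale = p[2], log = TRUE))
    }
    fit <- optim(c(1, 10), nll)      # Nelder-Mead by default
    info_criteria(ll = -fit$value, k = 2, n = length(times))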

Figure 14: Number of times each distribution model achieved the best goodness-of-fit for each modelling for the execution time random variable. Stronger shades reflect higher values in the respective cell.
Figure 15: Number of times each distribution model achieved the second best goodness-of-fit.


To begin with, an overview of the experimental results is shown in Figures 14 and 15. Seven inference methods for a random variable with an unknown location parameter were presented earlier in this monograph; here we apply them to the execution time variable. For each of these methods and each distribution family, we fitted the distribution on all experimental data and then counted how often each distribution family achieved the best fit. Thus, the figures show the performance of each method-distribution pair, in terms of the metrics mentioned previously (recall that lower is better).

Points worth noticing in these results are:

  1. the OLL-GG family is the best model under most metrics and inference methods;

  2. the Kw-CWG performs well under iteratedC and one other inference method (cross validation under iteratedC is very peculiar, so it can be disregarded here);

  3. the exponentiated Weibull achieved very good results under the standard method, meaning it probably has a more regular likelihood surface, even for data whose location parameter has a large value;

  4. the truncated normal and lognormal distributions achieve the best results with reasonable frequency;

  5. the exponentiated Weibull is the second best model under most inference methods and metrics;

  6. there is a clear transition in the second places when the inference methods are viewed in order, with the “win count” spreading out over all models, which might be caused by a worsening of the exponentiated Weibull under some of the inference methods.

Figure 16: Distance, in achieved log-likelihood (top) and cross validated log-likelihood (bottom), from the OLL-GG model in each of the 37 sample sets.

An alternative view of the results is shown in Figure 16. Here we focus on two metrics, the log-likelihood and its cross validated version. For each sample set we measure the log-likelihood for a distribution family and take its distance from the log-likelihood obtained by the OLL-GG, which displays the best results according to the already mentioned Figures 14 and 15. A value of 0 here indicates equal performance; a positive value, a better one. This process produces one value per sample set for each inference method and distribution family, which are presented as boxplots in Figure 16. Since some inference methods resulted in very similar boxplots, we removed these “duplicates”. The figure shows that the Kw-CWG, the generalized gamma and the exponentiated Weibull share very similar log-likelihoods with the OLL-GG, except under the standard method. It is also remarkable that the performance of Kw-CWG did not degrade much under the cross validated metric, which was not expected since it has more parameters and is thus more prone to overfitting; it could be that, although it has more parameters, many combinations of these parameters lead to the same density function, meaning the space of representable functions is not truly 5-dimensional. Although the generalized gamma had fairly good results, it is more dispersed than the exponentiated Weibull and its medians are slightly more negative. Overall, the exponentiated Weibull seems to be a good choice with a low number of parameters; for the complex models, there is still some doubt on whether to choose Kw-CWG or OLL-GG, though the latter is favored for having fewer parameters.

So far we have only answered the question of which distribution families can model execution times more accurately. However, in the context of cloud computing, computational cost also matters a lot. Figure 17 shows a comparison between the distribution families in this respect (experiments performed on the Euler supercomputer, www.cemeai.icmc.usp.br/Euler/index.html, using an Intel Xeon E5-2680v2 2.8 GHz and 128 GB of RAM). Recall that MLE optimization is done as many times as there are entries in the grid of initial parameters. Similarly to what was discussed in Section 9, we again consider the average optimization time over all initial parameters and over the initial parameters that led the optimization to converge.

Figure 17: Comparison of average execution time over all initial parameters (bottom) and over the ones that led to convergence (top), distinguished by distribution family and considering four inference methods. Error bars show a confidence interval obtained by bootstrapping.

The figure shows that there is a significant difference in computational cost between OLL-GG and Kw-CWG on one side and the exponentiated Weibull (E. Weibull) and generalized gamma (G. gamma) on the other. This was expected due to the difference in their numbers of parameters, although we expected Kw-CWG (5 parameters) to be slower than OLL-GG, which has 4 parameters; it could be that the extra parameter in the Kw-CWG actually makes the optimization surface easier to navigate. One detail here is that more parameters imply that the experimenter will naturally need a larger grid of initial parameters, which usually grows exponentially in size, since normally one assigns a few initial values to each parameter and the grid is composed of all combinations of these initial values. Therefore, the disparity shown in Figure 17 between E. Weibull and G. gamma on one side and Kw-CWG and OLL-GG on the other actually means that, in practice, inference will be much faster when using the former two.
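The sketch below illustrates, in R, how such a grid of initial parameters is built and how the best fit is taken over the whole grid; the Weibull family, the grid values and the sample are placeholders used only for the example.

    # Placeholder sample and negative log-likelihood of a two-parameter Weibull
    times <- c(10.2, 10.5, 10.3, 10.8, 10.4, 10.6, 10.9, 10.1, 10.7, 10.5)
    nll <- function(p) {
      if (any(p <= 0)) return(Inf)
      -sum(dweibull(times, shape = p[1], scale = p[2], log = TRUE))
    }

    # All combinations of initial values (4 x 3 = 12 optimizations); a
    # 5-parameter family with similar per-parameter choices needs far more
    grid <- expand.grid(shape = c(0.5, 1, 2, 5), scale = c(1, 10, 100))

    fits <- lapply(seq_len(nrow(grid)), function(i) optim(unlist(grid[i, ]), nll))
    best <- fits[[which.min(sapply(fits, function(f) f$value))]]
    best$par   # parameters achieving the highest likelihood over the whole grid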

Although OLL-GG was slightly slower than Kw-CWG, it should still be faster in practice for the same reasons given above, i.e. the grid of initial conditions will be smaller for the OLL-GG since it has fewer parameters. On top of that, OLL-GG shows a better capability to model execution times, as demonstrated in Figures 14 and 15. Because of these advantages, we choose OLL-GG for modelling execution times whenever a higher number of parameters is deemed adequate. If an option with a low number of parameters is desired, we recommend the exponentiated Weibull, as it led to slightly more stable likelihood values in Figure 16 than the generalized gamma.

13 Limitations and Future Work

The main limitation of this work is the lack of machines of more varied types on which experiments could also be carried out. Our initial hypothesis that a single distribution family could model all execution times on any machine cannot be considered proved, precisely because of this lack of machine diversity. It would also be ideal to test more distribution families, especially some that can fit bimodal distributions, as bimodality was observed in some of the experimental data. This project excluded programs that make use of coprocessors such as GPUs, which could also be investigated. Finally, the three chosen experimental programs were assumed to be representative of the set of all possible programs, under the argument that they exercise the three main components of a computer: CPU, memory and disk; future work will validate this assumption empirically.

We also expect to investigate the relation of the distributions with the execution path of the programs; for now, it is hypothesized that Lyapunov’s central limit theorem (billingsley2008probability) can be used to justify that long execution paths tend to have distributions more similar to a normal.

Finally, the inference methods proposed earlier in this monograph lack deeper theoretical analysis. For example, we would like to prove that all estimators are consistent, which does not seem difficult if we use concepts of convergence in probability. We also need to perform experiments with varying sample sizes, which we plan to do in order to publish a paper on the subject.

14 Acknowledgements

The author thanks prof. Ricardo Marcacini for the valuable suggestion of including the cross validation metric in our experiments. The author also thanks the LaSDPC and BioCom laboratories for the computational and other resources, as well as CeMEAI (FAPESP grant 2013/07375-0) for providing access to their supercomputer.


References