A Markov Chain Monte Carlo Approach to Cost Matrix Generation for Scheduling Performance Evaluation

03/21/2018
by   Louis-Claude Canon, et al.
0

In high performance computing, scheduling of tasks and allocation to machines is very critical especially when we are dealing with heterogeneous execution costs. Simulations can be performed with a large variety of environments and application models. However, this technique is sensitive to bias when it relies on random instances with an uncontrolled distribution. We use methods from the literature to provide formal guarantee on the distribution of the instance. In particular, it is desirable to ensure a uniform distribution among the instances with a given task and machine heterogeneity. In this article, we propose a method that generates instances (cost matrices) with a known distribution where tasks are scheduled on machines with heterogeneous execution costs.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 17

page 18

page 22

12/05/2020

A simple Markov chain for independent Bernoulli variables conditioned on their sum

We consider a vector of N independent binary variables, each with a diff...
04/14/2022

On Scheduling Mechanisms Beyond the Worst Case

The problem of scheduling unrelated machines has been studied since the ...
04/11/2020

Construction and Random Generation of Hypergraphs with Prescribed Degree and Dimension Sequences

We propose algorithms for construction and random generation of hypergra...
09/02/2018

A fast Metropolis-Hastings method for generating random correlation matrices

We propose a novel Metropolis-Hastings algorithm to sample uniformly fro...
04/08/2019

A Fast Scheme for the Uniform Sampling of Binary Matrices with Fixed Margins

Uniform sampling of binary matrix with fixed margins is an important and...
02/15/2019

A Comparison of Random Task Graph Generation Methods for Scheduling Problems

How to generate instances with relevant properties and without bias rema...
11/02/2021

Towards the 5/6-Density Conjecture of Pinwheel Scheduling

Pinwheel Scheduling aims to find a perpetual schedule for unit-length ta...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Empirical assessment is critical to determine the best scheduling heuristics on any parallel platform. However, the performance of any heuristic may be specific to a given parallel computer. In addition to experimentation on real platforms, simulation is an effective tool to quantify the quality of scheduling heuristics. Even though simulations provide weaker evidence, they can be performed with a large variety of environments and application models, resulting in broader conclusions. However, this technique is sensitive to bias when it relies on random instances with an uncontrolled or irrelevant distribution. For instance, in uniformly distributed random graphs, the probability that the diameter is 2 tends exponentially to 1 as the size of the graph tends to infinity

[1]. Even though such instances may be sometimes of interest, they prove useless in most practical contexts. We propose a method that generates instances with a known distribution for a set of classical problems where tasks must be scheduled on machines (or processors) with heterogeneous execution costs. This is critical to the empirical validation of many new heuristics like BalSuff[2] for the problem and PEFT[3] for in Graham’s notation[4].

In this context, an instance consists of a cost matrices, , where the element of row and column , , represents the execution cost of task on machine

. Like the diameter for graphs, multiple criteria characterize cost matrices. First, the heterogeneity can be determined globally with the variance of all costs, but also relatively to the rows or columns. For instance, the dispersion of the means on each row, which corresponds to the varying costs for each task, impacts the performance of some scheduling heuristics

[5]. The correlations between the rows and columns also play an important role as it corresponds to the machines being either related or specialized, with some affinity between the tasks and the machines[6].

Among existing methods, the shuffling one[5] starts by an initial matrix in which rows are proportional to each other (leading to large row and column correlations). Then, it proceeds to mix the values in the matrix such as to keep the same sum on each row and column. This ensures that the row and column heterogeneity remains stable, while the correlation decreases. However, this approach is heuristic and provides no formal guarantee on the distribution of the instances. In addition, when the number of shuffles increases, the cost CV increases, which leads to non-interpretable results (see Figure 1).

1_limit_shuffling

Figure 1:

Cost Coefficient-of-Variation (ratio of standard deviation to mean) and mean row and column correlations at each iteration of the shuffling method

[5] when generating a cost matrix. The shuffling method arbitrarily stops after 3 000 iterations (represented by the black vertical line).

While other methods exist, some of them with stronger formal guarantees, it remains an open problem to ensure a uniform distribution among the instances that have a given task and machine heterogeneity. Our contribution is to control the row and column heterogeneity, while limiting the overall variance and ensuring a uniform distribution among the set of possible instances. The approach is based on a Markov Chain Monte Carlo process and relies on contingency tables

111A contingency table is a matrix with the sum of each row (resp. column) displayed in an additional total row (resp. column). They are usually used to show the distribution of two variables.. More precisely, the proposed random generation process is based on two steps. For a given (number of tasks), (number of machines) and (sum of the cost of the tasks):

  1. Randomly generate the average cost of each task and the average speed of each machine. This random generation is performed uniformly using classical recursive algorithms[7]

    . In order to control the heterogeneity, we show how to restrict this uniform random generation to interesting classes of vectors. This step is described in Section 

    3.

  2. Next, the cost matrices can be generated using a classical MCMC approach: from an initial matrix, a random walk in the graph of contingency tables is performed. It is known (see for instance[8]) that if the Markov Chain associated with this walk is ergodic and symmetric, then the unique stationary distribution exists and is uniform. Walking enough steps in the graph leads to any state with the same probability. Section 4 provides several symmetric and ergodic Markov Chains for this problem. The main contribution of this section is to extend known results for contingency tables to contingency tables with min/max constraints.

In order to evaluate the mixing time of the proposed Markov Chains (the mixing time is the number of steps to walk in order to be close to the uniform distribution), we propose practical and statistical estimations in Section 

5. Note that obtaining theoretical bound on mixing time is a very hard theoretical problem, still open in the general case of unconstrained contingency tables. In Section 6, we used our random generation process to evaluate scheduling algorithms. The algorithms are implemented in R and Python and the related code, data and analysis are available in[9].

2 Related Work

Two main methods have been used in the literature: RB (range-based) and CVB (Coefficient-of-Variation-Based)[10, 11]. Both methods follow the same principle: vectors of

values are first generated using a uniform distribution for RB and a gamma distribution for CVB; then, each row is multiplied by a random value using the same distribution for each method. A third optional step consists in sorting each row in a submatrix, which increases the correlation of the cost matrix. However, these methods are difficult to use when generating a matrix with given heterogeneity and low correlation

[6, 5].

More recently, two additional methods have been proposed for a better control of the heterogeneity: SB (shuffling-based) and NB (noise-based)[5]. In the first step of SB, one column of size and one row of size are generated using a gamma distribution. These two vectors are then multiplied to obtain a cost matrix with a strong correlation. To reduce it, values are shuffled without changing the sum on any row or column as it is done is Section 4: selecting four elements on two distinct rows and columns (a submatrix of size

); and, removing/adding the maximum quantity to two elements on the same diagonal while adding/removing the same quantity to the last two elements on the other diagonal. While NB shares the same first step, it introduces randomness in the matrix by multiplying each element by a random variable with expected value one instead of shuffling the elements. When the size of the matrix is large, SB and NB provide some control on the heterogeneity but the distribution of the generated instances is unknown.

Finally, CNB (correlation noise-based) and CB (combination-based) have been proposed to control the correlation[6]. CNB is a direct variation of CB to specify the correlation more easily. CB combines correlated matrices with an uncorrelated one to obtain the desired correlation. As for SB and NB, both methods have asymptotic guarantees when the size of the matrix tends to infinity, but no guarantee on how instances are distributed.

The present work relies on contingency tables/matrices, which are important data structures used in statistics for displaying the multivariate frequency distribution of variables, introduced in 1904 by K. Pearson[12]. The MCMC approach is the most common way used in the literature for the uniform random generation of contingency tables (see for instance[13, 14]). Mixing time results have been provided for the particular case of sized tables in[15] and the latter using a coupling argument in[16]. In this restricted context a divide-and-conquer algorithm has recently been pointed out[17]. In practice, there are MCMC dedicated packages for most common programming languages: mcmc222https://cran.r-project.org/web/packages/mcmc/index.html for R, pymc333https://pypi.python.org/pypi/pymc/ for Python, …

More generally, random generation is a natural way for performance evaluation used, for instance in SAT-solver competitions444http://www.satcompetition.org/. In a distributed computing context, it has been used for instance for the random generation of DAG modelling graph task for parallel environments[18, 19].

3 Contingency vectors initialization

Considering tasks and machines, the first step in order to generate instances is to fix the average cost of each task and the average speed of each machine. Since and are fixed, instead of generating cost averages, we generate the sum of the cost on each row and column, which is related. The problem becomes, given and (total cost) to generate randomly (and uniformly) two vectors and satisfying:

(1)

with the following convention on notations: for any vector , is denoted .

Moreover, the objective is also to limit the maximum value. This is useful to avoid large variance: for this purpose we restrict the generation to vectors whose parameters are in a controlled interval . This question is addressed in this section using a classical recursive approach[7]. More precisely, let be positive integers and be the subset of elements of such that and for all , (i.e. the set of all possible vectors with values between and ). Let be the cardinal of . By decomposition one has

(2)

Moreover,

(3)

Algorithm 1 uniformly generates a random vector over .

Input: Integers , , ,
Output: such that and if it is possible
otherwise.
1 begin
2        if  or or  then
3              return
4       for  and  do
5              compute using (2) and (3).
6       for  do
7               if  then
8                      pick at random with
9              else
10                     
11              
12       return
13
Algorithm 1 Generate Sequences

Note that integers involved in these computations may become rapidly very large. Working with floating point approximations to represent integers may be more efficient. Moreover, with the rounded errors the random generation stays very close to the uniform distribution[20].

Figure 2 depicts the distribution of the values when varying the interval for and . Without constraint ( and ), the distribution is similar to an exponential one: small values are more likely to appear in a vector than large ones. When only the largest value is bounded ( and ), then the shape of the distribution is inverted with smaller values being less frequent. Finally, bounding from both sides ( and ) leads to a more uniform distribution.

2_hist_unif_vector

Figure 2: Frequency of each value in a vector of size with generated by Algorithm 1 for three combinations of constraints for the minimum and maximum . For each case, the frequencies were determined by generating vectors and are normalized to the maximum frequency. The frequency for large values when and are not shown.

Figure 3 shows the CV obtained for all possible intervals . The more constrained the values are, the lower the CV. The CV goes from 0 when either or (the vector contains only the value 10) to 1 when and .

3_heatmap_unif_vector

Figure 3: Mean CV in vectors of size with generated by Algorithm 1 for different constraints for the minimum and maximum . Each tile corresponds to vectors. The contour lines correspond to the levels in the legend (2.5, 5, 7.5 and 10).

It is also possible to generate uniform vectors using Boltzmann samplers[21]: this approach consists in generating each independently according to an exponential law of parameter . Theoretical results of[21] show that by choosing the right , the sum of the generated ’s is close to with a high probability. In order to get precisely

, it suffices to use a rejection approach. This is consistent with the seemingly exponential distribution in Figure 

2 in the unconstrained case. Moreover, in this case, Figure 3 shows that the CV is close to one, which is also the CV of an exponential distribution.

4 Symmetric Ergodic Markov Chains for the Random Generation

We can now generate two random vectors and containing the sum of each row and column with Algorithm 1. To obtain actual cost, we use Markov Chains to generate the corresponding contingency table. Random generation using finite discrete Markov Chains can easily be explained using random walk on finite graphs. Let be the finite set of all possible cost matrices (also called states) with given row and column sums: we want to sample uniformly one of its elements. However, is too large to be built explicitly. The approach consists in building a directed graph whose set of vertices is and whose set of edges represent all the possible transitions between any pair of states. Each edge of the graph is weighted by a probability with a classical normalization: for each vertex, the sum of the probabilities on outgoing edges is equal to . One can now consider random walks on this graph. A classical Markov Chain result claims that for some families of probabilistic graphs/Markov Chains, walking long enough in the graph, we have the same probability to be in each state, whatever the starting vertex of the walk[8, Theorem 4.9].

This is the case for symmetric ergodic Markov Chains[8, page 37]. Symmetric means that if there is an edge with probability , then the graph has an edge with the same probability. A Markov Chain is ergodic if it is aperiodic (the gdc of the lengths of loops of the graph is ) and if the graph is strongly connected. When there is a loop of length , the ergodicity issue reduces to the strongly connected problem. In general, the graph is not explicitly built and neighborhood relation is defined by a function, called a random mapping, on each state. For a general reference on finite Markov Chains with many pointers, see [8].

An illustration example is depicted on Fig 4. For instance, starting arbitrarily from the central vertex, after one step, we are in any other vertex with probability  (and with probability in the central vertex since there is no self-loop on it). After two steps, we are in the central vertex with probability  and in any other with probability . In this simple example, one can show that after step, the probability to be in the central node is and is for all the other nodes. All probabilities tends to when grows.

Figure 4: Example of the underlying graph of a Markov Chain when the sum of each row is three and the sum of column is two. Unless otherwise stated, each transition probability is .

This section is dedicated to building symmetric and ergodic Markov Chains for our problem. In Section 4.1 we define the sets that are interesting for cost matrices. In Section 4.2, Markov Chains are proposed using a dedicated random mapping and are proved to be symmetric and ergodic. Finally, in Section 4.3 we use classical techniques to transform the Markov Chains into other symmetric ergodic MC mixing faster (i.e. the number of steps required to be close to the uniform distribution is smaller).

Recall that are positive integers and that and satisfy Equation (1).

4.1 Contingency Tables

In this section, we define the state space of the Markov Chains. We consider contingency tables with fixed sums on rows and columns. We also introduce min/max constraints in order to control the variance of the value. We denote by the set of positive matrices over such that for every and every ,

(4)

For example, the matrix

is in , where and .

The first restriction consists in having a global minimal value and a maximal global value on the considered matrices. Let be positive integers. We denote by the subset of of matrices such that for all , . For example, . If , then . Moreover, according to Equation (4), one has

(5)

Now we consider min/max constraints on each row and each line. Let and . We denote by the subset of of matrices satisfying: for all , and . For instance,

Using Equation (4), one has for every ,

(6)

To finish, the more general constrained case, where min/max are defined for each element of the matrices. Let and be two matrices of positive integers. We denote by the subset of of matrices such that for all , . For instance, one has , with

For every , one has

(7)

where and .

4.2 Markov Chains

As explained before, the random generation process is based on symmetric ergodic Markov Chains. This section is dedicated to define such chains on state spaces of the form , , and . According to Equations (5), (6) and (7), it suffices to work on . To simplify the notation, let us denote by the set .

For any , any , such that and , we denote by the matrix defined by , , and otherwise. For instance, for and one has

Tuple is used as follow to shuffle a cost matrix and to transit from one state to another in the markov chain: is added to the current matrix, which preserves the row and column sums. Formally, let be the set of all possible tuples. Let be the mapping function from to defined by if and otherwise. The mapping is called at each iteration, changing the instance until it is sufficiently shuffled.

We consider the Markov chain defined on by the random mapping , where is a uniform random variable on .

The following result gives the properties of the markov chain and is an extension of a similar result[13] on . The difficulty is to prove that the underlying graph is strongly connected since the constraints are hindering the moves.

Theorem 1.

The Markov Chain is symmetric and ergodic.

The proof of Theorem 1 is based on Lemma 3 and 4.

Definition 2.

Let et be two elements of . A finite sequence of pairs of indices in is called a stair sequence for and if it satisfies the following properties:

  1. ,

  2. If , then ,

  3. If is even, then and

  4. If

    is odd, then

    and ,

  5. is even and ,

Consider, for instance, the matrices

The sequence is a stair sequence for and .

Lemma 3.

Let et be two distinct elements of . There exists a stair sequence for and .

Proof.

The proof is by construction. Since and are distinct, using the constraints on the sums of rows and columns, there exists a pair of indices such that . Now using the sum constraint on row , there exists such that . Set . Similarly, using the sum constraint on column , there exists such that . Set . Similarly, by the constraint on row , there exists such that . At this step, are pairwise distinct.

If , then is a stair sequence for and . Otherwise, by the -column constraint, there exists such that . Now, one can continue the construction until the first step we get either or with (this step exists since the set of possible indexes is finite). Note that we consider the smallest for which this is case.

  • If , , the sequence satisfies the conditions 2., 3. and 4. of Definition 2. Moreover both and are odd. The sequence satisfies the Conditions 2. to 5. of Definition 2. Since and by construction, . If follows that the sequence is a stair sequence for and .

  • If , then both and are even. The sequence satifies the Conditions 1. to 5. of Definition 2 and is therefore a stair sequence for and .

Given two matrices and , the distance from to , denoted , is defined by:

Lemma 4.

Let et be two distinct elements of . There exists such that and tuples such that and for every , .

Proof.

By Lemma 3, there exists a stair sequence for and . Without loss of generality (using a permutation of rows and columns) we may assume that and , for and .

To illustrate the proof, we introduce some matrix over , called difference matrices, such that: if , then ; if , then ; if , then ; and if , then .

Considering for instance the matrices and defined before, with a global minimum equal to and global maximum equal to , a difference matrix is

Note that it may exist several difference matrices since, for instance, some might be replaced by a

The proof investigates several cases:

  • If , then and works. Indeed, since , one has . Similarly, , and . It follows that and . In this case, the following matrix is a difference matrix:

  • If and if there exists such that , then, as for Case 0, works with : . Moreover if ; otherwise. In this case, the following matrix is a difference matrix:

  • If and Case 1 does not hold and if there exists such that , then, similarly, and works. One has if , and otherwise. In this case, the following matrix is a difference matrix:

  • If Cases 0 to 2 do not hold and if , then, similarly, and works. One has if , and otherwise. In this case, the following matrix is a difference matrix:

  • If Cases 0 to 3 do not hold. Since and , exists. In this case works. With . One has if , and otherwise. Moreover, for every , . In this case, the following matrix is a difference matrix:

One can now prove Theorem 1.

Proof.

If , then , proving that the Markov Chain is symmetric.

Let . We define the sequence by . The sequence is decreasing and positive. Therefore, one can define the smallest index such that . By construction, one also has . It follows that the Markov Chain is aperiodic.

Since is a distance, irreducibility is a direct consequence of Lemma 4. ∎

Consider the two matrices and defined previously with containing only the value 7. Case 4 of the proof can be applied. One has and