Influence of Incremental Constraints on Energy Consumption and Static Scheduling Time for Moldable Tasks with Deadline

06/19/2020
by   Jörg Keller, et al.
FernUniversität in Hagen

Static scheduling of independent, moldable tasks on parallel machines with frequency scaling comprises decisions on core allocation, assignment, frequency scaling and ordering, to meet a deadline and minimize energy consumption. Constraining some of these decisions reduces the solution space, which may increase energy consumption, but may also reduce scheduling time or give the chance to tackle larger task sets. We investigate the influence of different constraints that lead from an unrestricted scheduler via two intermediate steps to the crown scheduler, by presenting integer linear programs for all four schedulers. We compare scheduling time and energy consumption for a benchmark suite of synthetic task sets of different sizes. Our results indicate that the final step towards the crown scheduler – the execution order constraint – is responsible for faster scheduling when task sets are small, and lower energy consumption when we deal with large task sets.



1 Introduction

Parallel programs can often be formulated as a set of tasks working on a stream of inputs [MelotKEK19]. While the tasks may exhibit dependencies, task instances working on different inputs are independent. Throughput requirements impose a deadline by which all task instances must be executed, while the nature of the platform (often embedded or mobile) necessitates restricting energy consumption as much as possible. To meet the deadline, it is often necessary to parallelize some or all tasks.

Static scheduling of independent, parallelizable tasks of known workloads with a common deadline onto a parallel machine with frequency-scalable cores comprises a number of steps. For moldable tasks, i.e. tasks where the degree of parallelism must be fixed prior to execution and cannot be changed during execution, these steps are allocation, mapping, scaling and ordering. This means that for each task, the degree of parallelism must be determined, an appropriate subset of cores must be assigned, the operating frequency must be chosen from the available frequency levels, and the tasks must be ordered in time so that they do not overlap. The settings must be such that all tasks terminate before the common deadline. Among the schedules that achieve this, it is often desirable to choose one that minimizes a further property such as the tasks' energy consumption, which is mainly determined by the tasks' execution frequencies.

The decisions in these phases are not independent of each other. In particular, any sub-optimal decision in the allocation, mapping or ordering phases that would result in missing the deadline can only be compensated by increasing the operating frequencies (assuming that the maximum frequency always suffices to meet the deadline), which in turn increases the energy consumption.

An unrestricted schedule with the above properties looks like the solution of a kind of puzzle game (cf. Fig. 1(a)). It can be computed by an integer linear program (ILP) [MelotKEK19], which however needs a large number of variables and thus can only be used for very moderate task counts.

Crown Scheduling [MelotKKE14] is an approach where allocation, mapping and ordering are restricted (cf. Fig. 1(b)) in order to reduce the number of decision variables in the ILP to allow scheduling of larger task sets or reduce scheduling time for the same task set compared to an unrestricted scheduler. The price to pay is that the solution space for crown scheduling is only a subset of the solution space of the unrestricted scheduler, so that the crown-optimal solution found may have a higher energy consumption.

Figure 1: Example schedules for moldable tasks: (a) unrestricted schedule (left), (b) crown schedule (right).

It turns out [MelotKEK19] that crown scheduling only leads to a moderate increase in energy consumption in most cases but often reduces the scheduling time and/or the number of ILP solver timeouts compared to an unrestricted scheduler.

In the present research, we investigate the influence that each of the constraints leading up to the crown scheduler has on energy consumption and scheduler execution time. Thus, we compare four schedulers:

  • unrestricted scheduler,

  • unrestricted scheduler with allocation constrained to powers of two,

  • unrestricted scheduler with constrained allocation and assignment restricted to consecutive cores starting with a core whose index is a multiple of the allocation (e.g., for an allocation of 4, possible assignments are cores $c_1,\ldots,c_4$, cores $c_5,\ldots,c_8$, and so on),

  • and crown scheduling, where, additionally, the tasks are ordered in time such that tasks with larger allocation are always executed before tasks with smaller allocation.

We compare the schedulers with a benchmark of synthetic task sets of different sizes and evaluate both energy consumption and scheduling time of the different schedulers. We find that the last of the above-mentioned steps is decisive either for reducing scheduling time or for obtaining a higher-quality solution when the solver's timeout is reached, depending on task set size.

The remainder of the article is structured as follows. In Section 2, we present background information on static scheduling and energy efficiency, and discuss related work. Section 3 presents integer linear programs for static schedulers implementing progressively more extensive restrictions, starting from an unrestricted scheduler and arriving at a crown scheduler. In Section 4, we report on experiments with synthetic task sets to compare the different schedulers, while Section 5 concludes and gives an outlook on future work.

2 Background

2.1 Task Scheduling

We assume that we have $n$ independent tasks $\tau_1,\ldots,\tau_n$, each with its workload $w_j$, given in number of cycles. Thus, if task $\tau_j$ is run sequentially at frequency $f$, its runtime is $w_j/f$. Each task can be parallelized on up to $W_j$ cores. Since we deal with moldable tasks, the number of cores a task runs on cannot change during execution, and non-preemption prohibits suspension and subsequent continuation of a task. Moreover, a task-individual function $\varepsilon_j(q)$ specifies its parallel efficiency when run on $q$ cores. A task is executed at one of $K$ discrete operating frequencies $f_1,\ldots,f_K$; its per-core runtime on $q$ cores can then be determined as $t_j(w_j,f_k) = \frac{w_j}{q \cdot \varepsilon_j(q) \cdot f_k}$. All tasks shall be completed by a given deadline $M$.
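To make this model concrete, the following is a minimal Python sketch of the per-core runtime computation under the moldable-task model just stated; the names Task, eff, and per_core_runtime are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    workload: float               # w_j, in cycles
    max_width: int                # W_j, maximum number of cores
    eff: Callable[[int], float]   # parallel efficiency eps_j(q)

def per_core_runtime(task: Task, q: int, f: float) -> float:
    """t_j(w_j, f_k) = w_j / (q * eps_j(q) * f_k), for 1 <= q <= W_j."""
    assert 1 <= q <= task.max_width
    return task.workload / (q * task.eff(q) * f)
```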

We furthermore assume that the task set is to be scheduled to a homogeneous machine with $p$ cores $c_1,\ldots,c_p$, where each core can be scaled to one of the $K$ available discrete operating frequencies independently of the other cores. In the following, we take the frequency scaling overhead to be negligible. For all frequency levels, the corresponding per-core power consumption $\mathit{Pow}(f_k)$ is known and assumed to be constant. (Of course, core power consumption does not only vary with the operating frequency. Since the instruction mix executed by the processor affects power consumption as well [seo:2015], it can also be task-type-specific, which we do not model here. Other factors influencing a core's power consumption are the voltage, which we assume is always set to the least possible value for the given frequency, and the chip temperature, which can be kept in check via cooling.)

Scheduling a task set to a given machine then consists of a number of steps, which may be performed one after another, partially conjoined, or even all at the same time. By allocation we understand that for each task, the number of cores it should run on is determined. By mapping we understand that the task is assigned to a subset of cores for execution, where the size of this subset must correspond to the task's allocation. By scaling we understand that the task is assigned an operating frequency, which determines its runtime. By ordering we understand that each task is assigned a start time (and thus an end time, as the runtime is known) such that no two tasks overlap in execution: if two tasks' core assignments are not disjoint, the tasks must not overlap in time, i.e. the start time of one task must be at least the end time of the other, or vice versa. A schedule satisfying this requirement is called feasible; for examples of feasible schedules see e.g. Figure 1.
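As a sketch of this feasibility condition, the following hypothetical helper checks that tasks with intersecting core sets do not overlap in time and that every task meets the deadline; the schedule representation is an assumption for illustration.

```python
def is_feasible(schedule, deadline):
    """schedule: list of (cores: set, start: float, end: float) triples."""
    for cores, start, end in schedule:
        if end > deadline:
            return False
    for a in range(len(schedule)):
        for b in range(a + 1, len(schedule)):
            cores_a, s_a, e_a = schedule[a]
            cores_b, s_b, e_b = schedule[b]
            # tasks sharing at least one core must be ordered in time
            if cores_a & cores_b and not (e_a <= s_b or e_b <= s_a):
                return False
    return True
```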

2.2 Energy Consumption

The energy required for executing a task $\tau_j$ on $q$ cores can be computed as the product of its per-core runtime, the core's power consumption at the designated operating frequency $f_k$, and the number of cores the task runs on:

$$E_j = t_j(w_j,f_k) \cdot \mathit{Pow}(f_k) \cdot q.$$

The total energy consumption the execution of a schedule causes is the sum of all the tasks' energy consumption values:

$$E_{\mathit{total}} = \sum_j E_j.$$

Here, we do not model the energy consumption when cores are idle, as we choose the deadlines sufficiently tight for long idle periods not to occur.
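Building on the runtime sketch above, the energy computation might look as follows; power is an assumed callable mapping a frequency to per-core power consumption, and the assignment format is illustrative.

```python
def task_energy(task: Task, q: int, f: float, power) -> float:
    # E_j = t_j(w_j, f_k) * Pow(f_k) * q
    return per_core_runtime(task, q, f) * power(f) * q

def total_energy(assignments, power) -> float:
    # assignments: one (task, q, f) triple per task in the schedule
    return sum(task_energy(task, q, f, power) for task, q, f in assignments)
```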

When scheduling under a deadline constraint, one can choose among all feasible schedules, i.e. schedules not violating the deadline (as long as there is more than one). This creates the potential to optimize for some other property, which then guides the choice. In the current paper, we opt for minimizing the energy consumption during the schedule's execution, see Section 3.

2.3 Related Work

Most research in the area of scheduling considers either finding optimal solutions or a particular approach to constraining the large solution space. Turek et al. [Turek1992] consider scheduling of moldable tasks on multiprocessors with the goal of makespan minimization and give approximation algorithms. Pruhs et al. [pruhs2008speed] present an optimal scheme, but they only consider sequential tasks, assume continuous frequencies, and optimize makespan for a given energy budget. Sanders and Speck [SandersSpeck2012] investigate energy-efficient scheduling for malleable tasks with preemption, while we consider moldable tasks and non-preemption. Zahaf et al. [Zahaf2017] present a solution to schedule moldable tasks, but their solution uses non-linear integer programming, and their focus is on heterogeneity of the platform and on modelling of the power consumption. Xu et al. [xu12] propose optimal and heuristic solutions to schedule moldable tasks. They use a bookshelf approach to order tasks, which however seems inferior to crown scheduling [MelotKKE14]. Crown scheduling [MelotKKE14] applies a particular set of constraints, but has only been compared to other constrained schedulers and an unrestricted scheduler [MelotKEK19]. Ye et al. [Ye2018] investigate online scheduling of moldable task sets to minimize makespan, while we consider static scheduling to minimize energy under a deadline constraint.

3 Schedulers with different Constraints

The most basic scheduler, which marks our starting point, is the unrestricted scheduler. The constraints applying here solely ensure the resulting schedule's feasibility but do not impose any further limitations. To compute a schedule, the scheduler solves an ILP with binary decision variables $x_{i,j,k}$ and $z_{i,j,k}$, binary precedence variables $y_{j,j'}$, and continuous decision variables $s_j$ and $e_j$. The underlying semantics is as follows:

  • $x_{i,j,k}=1$ iff $\tau_j$ runs on $i$ cores at frequency level $k$,

  • $z_{i,j,k}=1$ iff $\tau_j$ runs on core $c_i$ at frequency $f_k$,

  • $y_{j,j'}=1$ iff $\tau_j$ precedes $\tau_{j'}$ on one or more cores $\tau_{j'}$ runs on,

  • $s_j$ is the time when execution of $\tau_j$ commences,

  • $e_j$ is the time when execution of $\tau_j$ terminates.

As discussed in Section 2, the corresponding ILP minimizes the energy required for executing the resulting schedule:

\begin{align}
\min\; E_{\mathit{total}} &= \sum_{i,j,k} x_{i,j,k} \cdot i \cdot t_j(w_j,f_k) \cdot \mathit{Pow}(f_k) \tag{2}\\
\forall j:\quad & \sum_{i,k} x_{i,j,k} = 1 \tag{3}\\
\forall j:\quad & e_j \le M \tag{4}\\
\forall j:\quad & s_j \ge 0 \tag{5}\\
\forall j:\quad & e_j = s_j + \sum_{i,k} x_{i,j,k} \cdot t_j(w_j,f_k) \tag{6}\\
\forall j:\quad & y_{j,j} = 0 \tag{7}\\
\forall j,j':\quad & y_{j,j'} + y_{j',j} \le 1 \tag{8}\\
\forall j,\ j' \ne j:\quad & s_j \ge e_{j'} - (1 - y_{j',j}) \cdot M \tag{9}\\
\forall j,\ j' < j,\ i:\quad & y_{j,j'} + y_{j',j} \ge \sum_k \left(z_{i,j,k} + z_{i,j',k}\right) - 1 \tag{10}\\
\forall j,k:\quad & \sum_i z_{i,j,k} = \sum_i i \cdot x_{i,j,k} \tag{11}
\end{align}

As with all the schedulers presented in this section, the objective function (2) to be minimized is the total energy consumption $E_{\mathit{total}}$. Constraint (3) ensures that each task is scheduled exactly once. Constraints (4) and (5) guarantee that each task starts and completes execution in $[0,M]$, while (6) ties $e_j$ to $s_j$ by setting $e_j$ to the sum of $s_j$ and $\tau_j$'s per-core runtime. Constraint (7) prohibits self-precedence, and (8) mutual precedence of any two tasks. Constraint (9) ensures that a task's execution can only begin if all preceding tasks have completed. Constraint (10) forces specifying a precedence relation for tasks sharing one or more cores. Finally, (11) ascertains consistency of allocation and mapping for each task.
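A condensed gurobipy sketch of this formulation is given below. It is not the authors' implementation; the problem data are assumed given as a per-core runtime function t(j, i, k) for task j on i cores at frequency level k (the $t_j(w_j,f_k)$ of the text, with the width made explicit) and a power function Pow(k), and the constraint comments refer to (2)-(11) above.

```python
import gurobipy as gp
from gurobipy import GRB

def unrestricted_ilp(n, p, n_freqs, t, Pow, M):
    m = gp.Model("unrestricted")
    J, I, F = range(n), range(1, p + 1), range(n_freqs)

    x = m.addVars(I, J, F, vtype=GRB.BINARY, name="x")  # j on i cores, level k
    z = m.addVars(I, J, F, vtype=GRB.BINARY, name="z")  # j on core c_i, level k
    y = m.addVars(J, J, vtype=GRB.BINARY, name="y")     # precedence
    s = m.addVars(J, lb=0.0, name="s")                  # start times, (5)
    e = m.addVars(J, ub=M, name="e")                    # end times, (4)

    # (2): minimize total energy
    m.setObjective(gp.quicksum(x[i, j, k] * i * t(j, i, k) * Pow(k)
                               for i in I for j in J for k in F), GRB.MINIMIZE)

    for j in J:
        m.addConstr(x.sum("*", j, "*") == 1)            # (3)
        m.addConstr(e[j] == s[j] + gp.quicksum(x[i, j, k] * t(j, i, k)
                                               for i in I for k in F))  # (6)
        m.addConstr(y[j, j] == 0)                       # (7)
        for jp in J:
            if jp == j:
                continue
            m.addConstr(y[j, jp] + y[jp, j] <= 1)       # (8)
            m.addConstr(s[j] >= e[jp] - (1 - y[jp, j]) * M)  # (9)
        for jp in range(j):                             # (10): shared cores
            for i in I:
                m.addConstr(y[j, jp] + y[jp, j] >=
                            gp.quicksum(z[i, j, k] + z[i, jp, k] for k in F) - 1)
        for k in F:                                     # (11): consistency
            m.addConstr(z.sum("*", j, k) ==
                        gp.quicksum(i * x[i, j, k] for i in I))
    return m, x, y
```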

Moving from the unrestricted scheduler to the allocation-constrained scheduler requires the introduction of an additional constraint:

$$\forall j,\ k,\ i \notin \{2^b \mid b \ge 0\}:\quad x_{i,j,k} = 0.$$

Thus, allocations which are not powers of 2 are banned.
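Extending the sketch above, this restriction could be imposed as follows (a few extra constraints on the same model m):

```python
# allocpow2: forbid core counts that are not powers of two
pow2 = {1 << b for b in range(p.bit_length())}
for i in I:
    if i not in pow2:
        m.addConstrs(x[i, j, k] == 0 for j in J for k in F)
```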

Proceeding to the group scheduler, we establish the concept of core groups as in [MelotKKE14], cf. Figure 2. We now have $2p-1$ core groups of different sizes. The root group $G_1$ comprises all $p$ cores. It is decomposed into the disjoint and equally sized groups $G_2$ (ranging over $c_1$ to $c_{p/2}$) and $G_3$ (spanning $c_{p/2+1}$ to $c_p$), which are in turn divided in the same fashion, and so on. Ultimately, the leaf groups ($G_8$ to $G_{15}$ in Figure 2) contain one core only.

Figure 2: Core group structure of a processor with 8 cores
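A small sketch of this binary group structure, assuming $p$ is a power of two; group 1 is the root and groups $p$ to $2p-1$ are the leaves, matching Figure 2 for $p=8$.

```python
def core_groups(p):
    """Return {group index: set of core indices} for a p-core machine."""
    groups = {1: set(range(1, p + 1))}   # root group G_1 holds all cores
    for g in range(1, p):                # groups 1..p-1 are internal
        cores = sorted(groups[g])
        half = len(cores) // 2
        groups[2 * g] = set(cores[:half])        # left child
        groups[2 * g + 1] = set(cores[half:])    # right child
    return groups
```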

The decision variables $z_{i,j,k}$ are not needed for the group scheduler and therefore are dropped from the ILP. For the decision variables $x_{i,j,k}$, the semantics must be modified as follows:

  • $x_{i,j,k}=1$ iff $\tau_j$ runs in core group $G_i$ at frequency $f_k$.

We furthermore acknowledge that $\tau_j$'s allocation equals the group size $|G_i|$ when $\tau_j$ is mapped to $G_i$. While constraints (3) to (9) remain as they are, (11) is removed, and (10) is replaced by the following restriction:

$$\forall j,\ j' \ne j,\ i,\ i' \in C(i):\quad y_{j,j'} + y_{j',j} \ge \sum_k \left(x_{i,j,k} + x_{i',j',k}\right) - 1.$$

That way, we make sure that a precedence relation holds whenever two tasks are in the same group or one is in a subgroup of the other's group. Here, $C(i)$ denotes the index set of groups embraced by $G_i$, including $i$ itself.
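Continuing the sketch, $C(i)$ and the replacement constraint might be expressed as follows; defining $C(i)$ via the descendants in the binary group tree is an assumption consistent with the structure above, and x is now the group-indexed variable of the group scheduler.

```python
def embraced(i, p):
    """C(i): indices of all groups embraced by G_i, including i itself."""
    idx, frontier = set(), [i]
    while frontier:
        g = frontier.pop()
        idx.add(g)
        if g < p:                 # groups 1..p-1 are internal nodes
            frontier += [2 * g, 2 * g + 1]
    return idx

# Replacement for (10): x[i, j, k] = 1 iff task j runs in group G_i
for j in J:
    for jp in J:
        if jp == j:
            continue
        for i in range(1, 2 * p):
            for ip in embraced(i, p):
                m.addConstr(y[j, jp] + y[jp, j] >=
                            gp.quicksum(x[i, j, k] + x[ip, jp, k]
                                        for k in F) - 1)
```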

Since the crown scheduler features a predetermined execution order (cf. Section 1), the constraints previously controlling precedence relations and task start and completion times are now disposed of. The only remaining decision variables are the $x_{i,j,k}$, whose semantics is the same as for the group scheduler, and a task's allocation still equals $|G_i|$ when it is mapped to $G_i$. Regarding the ILP's constraints, solely (3) is carried over from the group scheduler. Beyond that, two new constraints are adopted:

\begin{align}
\forall j,\ k,\ i \text{ with } |G_i| > W_j:\quad & x_{i,j,k} = 0 \tag{12}\\
\forall m:\quad & \sum_{i:\, c_m \in G_i}\ \sum_{j,k} x_{i,j,k} \cdot t_j(w_j,f_k) \le M \tag{13}
\end{align}

Here, (12) precludes a task from being mapped to a group whose size is larger than the task's maximum width. Constraint (13) ensures that no core receives more work than it can handle until the deadline by requiring, for each core $c_m$, that the accumulated runtime of tasks executed in any of the groups $c_m$ is a member of does not exceed $M$.
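A sketch of these two crown constraints, reusing core_groups from above; W[j] stands for the maximum width $W_j$, and t(j, q, k) is again the per-core runtime with the width q made explicit.

```python
groups = core_groups(p)
for j in J:
    for i, cores in groups.items():
        if len(cores) > W[j]:                           # (12): width limit
            m.addConstrs(x[i, j, k] == 0 for k in F)

for c in range(1, p + 1):                               # (13): per-core load
    m.addConstr(gp.quicksum(x[i, j, k] * t(j, len(groups[i]), k)
                            for i in groups if c in groups[i]
                            for j in J for k in F) <= M)
```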

4 Evaluation

We have conducted experiments with synthetic task sets of cardinalities $n \in \{4, 8, 16, 32\}$. For each cardinality, we have created 10 task sets, for a total of 40 task sets. The tasks' workloads are randomly determined integers and the maximum widths were chosen randomly, both based on a uniform distribution, but under the restriction that tasks with large workloads do not receive low maximum widths when choosing $W_j$. Thus, no large tasks with low maximum width occur, which might call for loose deadlines to produce a feasible schedule in the first place. We have computed schedules for machines with 4 and 8 cores to cover the aspect of machine size. For any combination of task set size and machine size, four schedules per task set were determined via the four scheduling techniques presented in Section 3: unrestricted scheduling, scheduling under allocation constraints, scheduling under allocation and group constraints, and crown scheduling.
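A hedged sketch of such a generator is shown below; the workload range and the concrete coupling between workload and maximum width are placeholders, as the paper's exact values are not reproduced here.

```python
import random

WORKLOAD_RANGE = (100, 10_000)   # placeholder, not the paper's range

def make_task_set(n, p, rng=random):
    tasks = []
    for _ in range(n):
        w = rng.randint(*WORKLOAD_RANGE)
        # illustrative coupling: tasks in the upper half of the workload
        # range must not have a low maximum width
        lo = p // 2 if w > sum(WORKLOAD_RANGE) // 2 else 1
        tasks.append((w, rng.randint(lo, p)))  # (workload w_j, width W_j)
    return tasks
```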

All schedulers assume a generic core with power consumption modelled similarly to ARM's big.LITTLE architecture [kessler:2019]. The parallel efficiency $\varepsilon_j(q)$ of task $\tau_j$ executed on $q$ cores is computed as in [MelotKKE14], and the deadline $M$ is determined as in [kessler:2019], using one scaling parameter value for the 4-core machine and another for the 8-core machine. These values were the lowest still yielding feasible solutions in all cases for the respective machine sizes. The machine's minimum and maximum operating frequencies $f_{\min}$ and $f_{\max}$ are likewise taken from [kessler:2019].
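One plausible shape for such a deadline computation is sketched below, with the machine-size-specific factor d left as a parameter; this is an assumption about the form of the formula in [kessler:2019], not a reproduction of it.

```python
def deadline(tasks, p, f_max, d):
    # lower bound on the makespan (total work spread over all cores at
    # maximum frequency), stretched by a factor d >= 1
    return d * sum(w for w, _ in tasks) / (p * f_max)
```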

For solving the ILPs, we have deployed the Gurobi 8.1.0 solver via the gurobipy module for Python. All schedules were computed on an AMD Ryzen 7 2700X with 8 cores and SMT. The ILP solver itself chooses how many of the up to 16 hardware threads it uses. The timeout was set to 5 minutes of real (wall clock) time.
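The corresponding solver setup in gurobipy might look like this; TimeLimit is in wall-clock seconds, and the thread count is left to Gurobi's own heuristics, as in the experiments.

```python
m.Params.TimeLimit = 300          # 5 minutes of real time
m.optimize()
if m.Status == GRB.TIME_LIMIT and m.SolCount > 0:
    # timeout reached: report the best incumbent and the proven bound
    print("timeout; best solution:", m.ObjVal, "bound:", m.ObjBound)
```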

Aside from the schedules’ total energy consumption as a measure of the schedules’ quality the schedulers’ execution time is of major interest in the present context. Generally speaking, solving an ILP is an expensive procedure, oftentimes requiring extensive computations. Table 1 gives a first impression regarding the schedulers’ resource consumption by presenting the number of timeouts reached for all combinations of scheduler, machine size, and task set size. As one can see, for small task sets of size 4, no timeouts have occurred. Large task sets of size 32 always lead to reaching the timeout. Differences between the four schedulers can only be observed for task set sizes of 8 and 16 and both machine sizes. As one would expect, the more constraints a scheduler is subject to, the fewer timeouts it encounters. The largest gap can be found between group and crown scheduler. When looking at task sets of size 16, the crown scheduler reaches the timeout in 1 of 20 cases, while all other schedulers never discover an optimal solution before the timeout occurs. On these grounds, one may surmise that the crown scheduler’s predefined execution order – its distinctive feature in our investigation – substantially lowers the effort in the scheduling process.

# cores # tasks unrestricted allocpow2 group crown
4 4 0 0 0 0
4 8 2 1 1 0
4 16 10 10 10 0
4 32 10 10 10 10
4 total 22 21 21 10
8 4 0 0 0 0
8 8 3 1 1 0
8 16 10 10 10 1
8 32 10 10 10 10
8 total 23 21 21 11
total total 45 42 42 21
Table 1: Number of timeouts for the schedulers under consideration and various combinations of task set size and machine size

To get a clearer picture, Table 2 provides the average scheduling times (CPU times, i.e. the sum of user and system times) and standard deviations for each combination of scheduler, machine size, and task set size. Figure 3 shows average scheduling time values for the constrained schedulers relative to the unrestricted scheduler. We can see that the situation is similar for both machine sizes examined here. For very small task sets of size 4, all of the schedulers have produced solutions rapidly (well under a second of scheduler execution time, cf. Table 2). For the crown scheduler, this also applies to task sets of size 8 (the corresponding bar in fact is hardly noticeable), whereas the other schedulers' execution times are significantly longer. Here, restricting the allocation to powers of 2 halves scheduling time in relation to unrestricted scheduling, while adding the group constraints does not yield further gains. When looking at task sets of size 16, all schedulers but the crown scheduler constantly ran into the 5-minute wall clock timeout. Apparently, the unrestricted as well as the allocation-constrained scheduler were executed in 16 threads, while the group scheduler ran in 8 threads; this decision was made by the ILP solver. The crown scheduler not only makes do with roughly 35% of the unrestricted scheduler's execution time, it also affords optimal solutions in all cases but one (one should note, though, that these solutions are optimal with regard to the crown scheduler's solution space, which is severely restricted in comparison to the unrestricted scheduler's; we will further consider solution quality below), cf. Table 3. The largest gap in terms of resource consumption thus again opens up between the group and the crown scheduler. For large task sets of size 32, all schedulers have reached the timeout in every case. Interestingly, the crown scheduler was executed in 16 threads here, while the other three schedulers ran in 8 threads (and therefore their CPU time is half the crown scheduler's). In most cases, the standard deviation is fairly low, indicating a roughly uniform scheduling time over all 10 task sets considered for a particular combination of machine size and task set size. For each scheduler, there is one task set size where the standard deviation is high, suggesting that some task sets could be scheduled quickly while others took substantially longer, possibly even until timeout. Interestingly, the task set size in question is 16 for the crown scheduler and 8 for all other schedulers, leading to the conjecture that scheduling difficulty rises more slowly with increasing task set size for the crown scheduler.

# tasks unrestricted allocpow2 group crown
time (s) st. dev. time (s) st. dev. time (s) st. dev. time (s) st. dev.
4 cores
4 0.450 0.311 0.358 0.245 0.305 0.232 0.048 0.027
8 1658.959 2105.047 927.608 1513.510 859.809 1542.652 0.444 0.269
16 4757.157 30.308 4759.938 18.197 2397.118 0.524 1598.856 1280.000
32 2393.429 1.212 2393.876 1.728 2381.849 2.181 4739.042 52.842
8 cores
4 0.478 0.350 0.291 0.256 0.158 0.125 0.047 0.019
8 1716.667 2130.501 713.359 1486.013 567.248 1479.633 0.990 0.617
16 4696.628 44.324 4699.877 30.329 2394.067 1.042 1797.960 2019.197
32 2392.079 1.518 2392.394 0.701 2372.305 4.703 4771.125 2.076
Table 2: Average scheduling times (CPU) and standard deviation values for the schedulers under consideration, for various combinations of task set size and machine size
Figure 3: Scheduling times (CPU) for the schedulers under consideration grouped by task set size (values averaged over 10 task sets each), relative to the unrestricted scheduler. Left: 4-core machine, right: 8-core machine.

When it comes to the schedulers’ performance in terms of solution quality, a first approach may be the number of optimal solutions each scheduler produces. From Table 3 one can gather that introducing the group constraints does not lead to an increase in optimal solutions discovered over the allocation-constrained scheduler. Both perform slightly better than the unrestricted scheduler though. The crown scheduler once again is far ahead of the other schedulers, mostly due to its strong performance for medium-sized task sets. One must keep in mind here that these figures reflect each scheduler’s performance with regard to its own search space. Obviously, a smaller search space is beneficial when an optimal solution is to be found within a fixed period of time.

# cores # tasks unrestricted allocpow2 group crown
4 4 10 10 10 10
8 8 9 9 10
16 0 0 0 10
32 0 0 0 0
total 18 19 19 30
8 4 10 10 10 10
8 7 9 9 10
16 0 0 0 9
32 0 0 0 0
total 17 19 19 29
total total 35 38 38 59
Table 3: Number of optimal solutions for the schedulers under consideration and various combinations of task set size and machine size

It is therefore of great interest to compare the energy consumption values for the schedules produced by the four schedulers. Table 4 shows the respective values relative to the unrestricted scheduler's. For small task sets of 4 tasks, the constrained allocation leads to slightly higher energy consumption (3% on average). Further restrictions do not bring about yet another loss of solution quality. All schedules for the small task sets are optimal. Here, the unrestricted scheduler capitalizes on its more extensive search space. When task sets are larger, this benefit turns into a burden. Although the unrestricted scheduler's solution space comprises all the other schedulers' solution spaces, it does not manage to discover equally good solutions in due time. As one can see from Table 4, restricting the allocation does not change much in terms of energy consumption. Introducing additional group constraints in many cases does not have a massive impact, either. On the machine with 4 cores, one can notice though that the deviation in both directions may be more pronounced: for the task sets with 16 tasks, the schedules' energy consumption is at 96% of the unrestricted scheduler's on average; for the largest task sets with 32 tasks, it climbs to 114%. Again, the most significant shift must be ascribed to the crown scheduler. For both machine sizes, the figures show a clear trend: the larger the task sets, the more energy is saved compared to the unrestricted scheduler. Since this observation does not apply to the group scheduler, one is led to conjecture that the crown scheduler's predetermined execution order is the relevant factor enabling it to find higher-quality solutions within a given time frame. Presumably, the execution order constraint considerably downsizes the search space without eliminating all the high-quality solutions at the same time.

# cores # tasks allocpow2 group crown
best avg. worst best avg. worst best avg. worst
4 4 1.00 1.03 1.14 1.00 1.03 1.14 1.00 1.03 1.14
8 1.00 1.00 1.01 1.00 1.00 1.01 1.00 1.00 1.01
16 0.88 1.00 1.10 0.90 0.96 1.04 0.87 0.95 0.99
32 0.88 0.99 1.11 0.88 1.14 1.84 0.83 0.89 0.97
total 0.88 1.00 1.14 0.88 1.04 1.84 0.83 0.97 1.14
8 4 1.00 1.03 1.15 1.00 1.03 1.15 1.00 1.03 1.15
8 1.00 1.01 1.01 1.00 1.01 1.01 1.00 1.01 1.01
16 0.99 1.00 1.03 0.98 0.99 1.00 0.97 0.98 0.99
32 0.95 1.00 1.08 0.94 1.00 1.09 0.93 0.96 0.99
total 0.95 1.01 1.15 0.94 1.01 1.15 0.93 0.99 1.15
total total 0.88 1.01 1.15 0.88 1.02 1.84 0.83 0.98 1.15
Table 4: Computed energy consumption values for the execution of the produced schedules, for the schedulers under consideration and for various combinations of task set size and machine size, relative to the unrestricted scheduler

All in all, in this section we have established that introducing allocation and group constraints yields similar solution quality compared to an unrestricted scheduler, while scheduling time is significantly lower for small task sets. A further massive runtime decrease can be observed for the crown scheduler, as long as the timeout is not hit, which is invariably the case when task sets are large. Moreover, the crown scheduler's execution order constraints are likely to be credited with an improvement in solution quality, i.e. schedule energy consumption, over the other schedulers for large task sets. As we have seen, the gap broadens with increasing task set size. Only for very small task sets does the unrestricted scheduler deliver an uncontested performance. All these findings are largely independent of the machine size. Ultimately, our investigation has revealed that solely constraining the allocation and potentially forming groups does not deliver the benefits of the crown scheduling technique: a very low scheduling time when task sets are small, and a superior solution quality for larger task sets when scheduling time is limited. In nearly all scenarios, taking the additional step from group to crown scheduler thus pays off.

5 Conclusions

We have presented a study on the evolution of scheduling time and energy efficiency of the resulting schedules when progressively constraining an unrestricted scheduler's search space, for sets of independent, non-preemptive, moldable tasks and parallel machines with discrete frequency levels. Our studies indicate that constraining the tasks' execution order has the most influence on both scheduler execution time and energy efficiency, given that scheduling time is constrained as well. Thus, in most of the considered scenarios, users are well-advised to deploy the crown scheduler, except for very small task sets, for which the unrestricted scheduler can produce superior solutions without struggling with time constraints.

Future work will comprise the study of more fine-grained constraints. For example, one could first constrain assignments to consecutive processors, without being so strict as to only allow assignments within core groups. Also, the order in which constraints are applied can be varied, for example assignment could be constrained before allocation. Furthermore, evaluation shall be extended to task sets derived from real applications.

Acknowledgments

We thank Christoph Kessler for many discussions and years – past and future – of fruitful and inspiring collaboration.

References