Parallel programs can often be formulated as a set of tasks working on a stream of inputs [MelotKEK19]. While the tasks may exhibit dependencies, task instances working on different inputs are independent. Throughput requirements impose a deadline by which all task instances must have been executed, while the nature of the platform, often embedded or mobile, makes it necessary to restrict energy consumption as much as possible. To meet the deadline, it is often necessary to parallelize some or all tasks.
Static scheduling of independent, parallelizable tasks of known workloads with a common deadline onto a parallel machine with frequency-scalable cores comprises a number of steps. For moldable tasks, i. e. tasks where the degree of parallelization must be fixed prior to execution and cannot be changed during execution, these steps are allocation, mapping, scaling and ordering. This means that for each task, the degree of parallelism must be determined and an appropriate subset of cores must be assigned, the operating frequency chosen from the available frequency levels and the tasks ordered in time so that they do not overlap. The settings must be such that all tasks terminate before the common deadline. Among the schedules that achieve this, it is often desirable to choose one that minimizes another property such as the energy consumption by the tasks, which is mainly determined by the tasks’ execution frequencies.
The decisions in these phases are not independent of each other. In particular, any sub-optimal decision in the allocation, mapping or ordering phases that would result in missing the deadline can only be compensated for by increasing the operating frequencies (assuming that the maximum frequency always suffices to meet the deadline), which in turn increases the energy consumption.
An unrestricted schedule with the above properties looks like the solution of a kind of puzzle game (cf. Fig. 1(a)). It can be computed by solving an integer linear program (ILP) [MelotKEK19], which, however, needs a large number of variables and thus can only be used for small task counts.
Crown Scheduling [MelotKKE14] is an approach where allocation, mapping and ordering are restricted (cf. Fig. 1(b)) in order to reduce the number of decision variables in the ILP to allow scheduling of larger task sets or reduce scheduling time for the same task set compared to an unrestricted scheduler. The price to pay is that the solution space for crown scheduling is only a subset of the solution space of the unrestricted scheduler, so that the crown-optimal solution found may have a higher energy consumption.
It turns out [MelotKEK19] that crown scheduling only leads to a moderate increase in energy consumption in most cases but often reduces the scheduling time and/or the number of ILP solver timeouts compared to an unrestricted scheduler.
In the present research, we investigate the influence that each of the above steps has on energy consumption and scheduler execution time when crown scheduling constraints are applied. To this end, we compare four schedulers:
unrestricted scheduler,
unrestricted scheduler with allocation constrained to powers of two,
unrestricted scheduler with constrained allocation and assignment restricted to consecutive cores starting with a core whose index is a multiple of the allocation (e. g., for an allocation of 4, possible assignments are cores 0 to 3, cores 4 to 7, and so on),
and crown scheduling, where, additionally, the tasks are ordered in time such that tasks with larger allocation are always executed before tasks with smaller allocation.
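The allocation and assignment restrictions of the second and third scheduler can be illustrated with a short sketch. It assumes zero-based core indices and a core count that is a power of two; the function names are hypothetical and not taken from the schedulers' implementation.

```python
def power_of_two_allocations(num_cores):
    """Core counts permitted when allocations are restricted to powers of two."""
    allocations = []
    size = 1
    while size <= num_cores:
        allocations.append(size)
        size *= 2
    return allocations


def aligned_assignments(num_cores, allocation):
    """Blocks of consecutive cores whose first index is a multiple of the allocation.

    Assumption: cores are numbered 0 .. num_cores - 1.
    """
    return [list(range(start, start + allocation))
            for start in range(0, num_cores - allocation + 1, allocation)]


print(power_of_two_allocations(8))  # [1, 2, 4, 8]
print(aligned_assignments(8, 4))    # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

On 8 cores, a task with allocation 4 thus has only two admissible core sets, whereas an unrestricted mapping could pick any 4 of the 8 cores.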
We compare the schedulers on a benchmark of synthetic task sets of different sizes and evaluate both the energy consumption of the resulting schedules and the scheduling time. We find that the last of the above-mentioned steps, the fixed execution order, is decisive in either reducing scheduling time or obtaining a higher quality solution if the solver’s timeout is reached, depending on task set size.
The remainder of the article is structured as follows. In Section 2, we present background information on static scheduling and energy efficiency, and discuss related work. Section 3 presents integer linear programs for static schedulers implementing progressively more expansive restrictions, starting from an unrestricted scheduler and arriving at a crown scheduler. In Section 4, we report on experiments with synthetic task sets to compare the different schedulers, while Section 5 concludes and gives an outlook on future work.
2.1 Task Scheduling
We assume a set of independent tasks, each with a known workload given in number of cycles. Thus, if a task is run at a given frequency, its runtime is its workload divided by that frequency. Each task $j$ can be parallelized on up to a task-specific maximum number of cores, its maximum width. Since we deal with moldable tasks, the number of cores $w_j$ a task runs on cannot change during execution, and non-preemption prohibits suspension and subsequent continuation of a task. Moreover, a task-individual function specifies its parallel efficiency when run on a given number of cores. A task is executed at one of the available discrete operating frequencies $f_k$; its per-core runtime $t_j(w_j, f_k)$ is then determined by its workload, its parallel efficiency on $w_j$ cores, and $f_k$. All tasks shall be completed by a given deadline $M$.
We furthermore assume that the task set is to be scheduled onto a homogeneous multicore machine, where each core can be scaled to one of the available discrete operating frequencies independently of the other cores. In the following, we take the frequency scaling overhead to be negligible. For all frequency levels $f_k$, the corresponding per-core power consumption $Pow(f_k)$ is known and assumed to be constant (see Footnote 1).
Footnote 1: Of course, core power consumption does not only vary with the operating frequency. Since the instruction mix executed by the processor affects power consumption as well [seo:2015], it can also be task (type)-specific, which we do not model here. Other factors influencing a core’s power consumption are the voltage, which we assume is always set to the least possible value for the given frequency, and the chip temperature, which can be kept in check via cooling.
Scheduling a task set to a given machine then consists of a number of steps, which may be performed consecutively one at a time, partially conjoined, or even all at the same time. By allocation we understand that for each task, the number of cores it should run on is determined. By mapping we understand that the task is assigned to a subset of cores for execution, where the size of this subset must correspond to the task’s allocation. By scaling we understand that the task is assigned an operating frequency, which determines its runtime. By ordering we understand that each task is assigned a start time (and thus an end time, as the runtime is known) such that no two tasks sharing a core overlap in execution: if two tasks’ core assignments are not disjoint, the start time of one task must be at least the end time of the other. A schedule satisfying this condition is called feasible (for examples, see Figure 1).
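The feasibility condition can be made concrete with a small checker. This is a minimal sketch under assumed data structures (each task is a dictionary with a core set, a start time, and an end time); it is not part of the schedulers described here.

```python
from itertools import combinations


def is_feasible(schedule, deadline):
    """Check that all tasks fit into [0, deadline] and never overlap on a shared core.

    schedule: list of dicts with keys 'cores' (set of core indices),
    'start' and 'end' (times). The layout is an assumption for illustration.
    """
    for task in schedule:
        if task["start"] < 0 or task["end"] > deadline:
            return False
    for a, b in combinations(schedule, 2):
        shares_core = bool(a["cores"] & b["cores"])
        # Tasks sharing a core must be ordered: one ends before the other starts.
        ordered = a["end"] <= b["start"] or b["end"] <= a["start"]
        if shares_core and not ordered:
            return False
    return True


good = [
    {"cores": {0, 1}, "start": 0.0, "end": 2.0},
    {"cores": {1}, "start": 2.0, "end": 3.0},
    {"cores": {2, 3}, "start": 0.0, "end": 3.0},
]
bad = [
    {"cores": {0, 1}, "start": 0.0, "end": 2.0},
    {"cores": {1}, "start": 1.5, "end": 2.5},
]
print(is_feasible(good, deadline=3.0))  # True
print(is_feasible(bad, deadline=3.0))   # False
```

Tasks on disjoint core sets may run concurrently, which is why the third task in `good` does not conflict with the first two.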
2.2 Energy Consumption
The energy required for executing a task $j$ can be computed as the product of its per-core runtime, the core’s power consumption at the designated operating frequency $f_k$, and the number of cores $w_j$ the task runs on:
$$E_j = t_j(w_j, f_k) \cdot Pow(f_k) \cdot w_j$$
The total energy consumption the execution of a schedule causes is the sum of all the tasks’ energy consumption values:
$$E_{total} = \sum_j E_j$$
Here, we do not model the energy consumption when cores are idle, as we choose the deadlines sufficiently tight for long idle periods not to occur.
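As a worked illustration of this energy model, the following sketch computes the energy of a single task and of a small schedule. The runtime formula (workload divided by the product of core count, parallel efficiency, and frequency) is the usual moldable-task model and is an assumption here, as are the efficiency function, the power table, and all names.

```python
def per_core_runtime(workload, cores, frequency, efficiency):
    """Runtime on each allocated core (assumed model: cycles / (cores * eff * Hz))."""
    return workload / (cores * efficiency(cores) * frequency)


def task_energy(workload, cores, frequency, efficiency, power):
    """Per-core runtime times per-core power at this frequency times number of cores."""
    runtime = per_core_runtime(workload, cores, frequency, efficiency)
    return runtime * power[frequency] * cores


def total_energy(tasks, efficiency, power):
    """Sum of the task energies; idle energy is not modelled."""
    return sum(task_energy(w, q, f, efficiency, power) for (w, q, f) in tasks)


# Hypothetical values: efficiency drops to 0.75 on two cores, 1.5 W per core at 2 GHz.
efficiency = lambda q: 1.0 if q == 1 else 0.75
power = {1e9: 0.5, 2e9: 1.5}

runtime = per_core_runtime(6e9, cores=2, frequency=2e9, efficiency=efficiency)
energy = task_energy(6e9, cores=2, frequency=2e9, efficiency=efficiency, power=power)
print(runtime, energy)  # 2.0 seconds, 6.0 joules
```

Running the same task on one core at 1 GHz would take 6 seconds and consume 3 joules under these assumed values, which shows the trade-off between meeting a deadline and saving energy.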
When scheduling under a deadline constraint one can choose among all feasible schedules, i. e. schedules not violating the deadline (as long as there is more than one). This creates the potential to optimize for some other feature, which guides said choice accordingly. In the current paper, we opt for minimizing the energy consumption during the schedule’s execution, see Section 3.
2.3 Related Work
Most research in the area of scheduling either aims to find optimal solutions or considers a particular approach to constrain the large solution space. Turek et al. [Turek1992] consider scheduling of moldable tasks on multiprocessors with the goal of makespan minimization and give approximations. Pruhs et al. [pruhs2008speed] present an optimal scheme, but they only consider sequential tasks, assume continuous frequencies, and optimize makespan for a given energy budget. Sanders and Speck [SandersSpeck2012] investigate energy-efficient scheduling for malleable tasks with preemption, while we consider moldable tasks and non-preemption. Zahaf et al. [Zahaf2017] present a solution to schedule moldable tasks, but their solution uses non-linear integer programming, and their focus is on heterogeneity of the platform and on modelling of the power consumption. Xu et al. [xu12] propose optimal and heuristic solutions to schedule moldable tasks. They use a bookshelf approach to order tasks, which however seems inferior to crown scheduling [MelotKKE14]. Crown scheduling [MelotKKE14] applies a particular set of constraints, but only compares to other constrained and unrestricted [MelotKEK19] schedulers. Ye et al. [Ye2018] investigate online scheduling of moldable task sets to minimize makespan, while we consider static scheduling to minimize energy under a deadline constraint.
3 Schedulers with different Constraints
The most basic scheduler, which marks our starting point, is the unrestricted scheduler. The constraints applying here solely ensure the resulting schedule’s feasibility but do not impose any further limitations. To compute a schedule, the scheduler solves an ILP over the decision variables $x_{i,j,k}$, $z_{i,j,k}$, $y_{j,j'}$, $s_j$, and $e_j$. The underlying semantics is as follows:
$x_{i,j,k} = 1$ iff task $j$ runs on $i$ cores at frequency level $k$,
$z_{i,j,k} = 1$ iff task $j$ runs on core $i$ at frequency $f_k$,
$y_{j,j'} = 1$ iff task $j$ precedes task $j'$ on one or more cores that task $j'$ runs on,
$s_j$ is the time when execution of task $j$ commences,
$e_j$ is the time when execution of task $j$ terminates.
As discussed in Section 2, the corresponding ILP minimizes the energy required for executing the resulting schedule:
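\begin{align*}
\min \quad & E_{total} = \sum_{i,j,k} x_{i,j,k} \cdot t_j(w_j, f_k) \cdot Pow(f_k) \cdot w_j \\
\forall j: \quad & \sum_{i,k} x_{i,j,k} = 1 \\
\forall j: \quad & e_j \le M \\
\forall j: \quad & s_j \ge 0 \\
\forall j: \quad & e_j = s_j + \sum_{i,k} x_{i,j,k} \cdot t_j(w_j, f_k) \\
\forall j: \quad & y_{j,j} = 0 \\
\forall j, j': \quad & y_{j,j'} + y_{j',j} \le 1 \\
\forall j, j' \ne j: \quad & s_j \ge e_{j'} - (1 - y_{j',j}) \cdot M \\
\forall j, j' < j, i: \quad & y_{j,j'} + y_{j',j} \ge \sum_k \left( z_{i,j,k} + z_{i,j',k} \right) - 1 \\
\forall j, k: \quad & \sum_i z_{i,j,k} = \sum_i i \cdot x_{i,j,k}
\end{align*}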
As with all the schedulers presented in this section, the objective function to be minimized is the total energy consumption $E_{total}$. The first constraint ensures that each task is scheduled exactly once. The second and third constraints guarantee that each task starts and completes execution in $[0, M]$, while the fourth ties $e_j$ to $s_j$ by setting $e_j$ to the sum of $s_j$ and task $j$’s per-core runtime. The fifth constraint prohibits self-precedence, and the sixth mutual precedence of any two tasks. The seventh constraint ensures that a task’s execution can only begin if all preceding tasks have completed. The eighth constraint forces specifying a precedence relation for tasks sharing one or more cores. Finally, the ninth ascertains consistency of allocation and mapping for each task.
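To see why the ordering constraint only takes effect when a precedence is actually selected, it may help to spell out both values of the binary variable. The following is a reading of the constraint $s_j \ge e_{j'} - (1 - y_{j',j}) \cdot M$ as stated in the ILP, with the deadline $M$ doubling as the big-M constant; it is not an additional constraint of the model.

```latex
\begin{align*}
y_{j',j} = 1 &: \quad s_j \ge e_{j'} - (1 - 1) \cdot M = e_{j'}
  && \text{(task $j$ starts only after task $j'$ has ended)} \\
y_{j',j} = 0 &: \quad s_j \ge e_{j'} - M
  && \text{(always satisfied, since $e_{j'} \le M$ and $s_j \ge 0$)}
\end{align*}
```

Using the deadline as the constant is sufficient here because no end time can exceed $M$, so the relaxed case never cuts off a feasible start time.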
Moving from the unrestricted scheduler to the allocation-constrained scheduler requires the introduction of an additional constraint, which forces $x_{i,j,k} = 0$ whenever $i$ is not a power of 2. Thus, allocations which are not powers of 2 are banned.
Proceeding to the group scheduler, we establish the concept of core groups as in [MelotKKE14], cf. Figure 2. We now have core groups of different sizes. The root group comprises all cores. It is decomposed into two disjoint and equally sized groups, one ranging over the first half of the cores and the other spanning the second half, which are in turn divided in the same fashion, and so on. Ultimately, the leaf groups (cf. Figure 2) contain one core only.
The decision variables $x_{i,j,k}$ are not needed for the group scheduler and therefore are dropped from the ILP. For the decision variables $z_{i,j,k}$, the semantics must be modified as follows:
$z_{i,j,k} = 1$ iff task $j$ runs in core group $i$ at frequency $f_k$.
The precedence constraint is adapted accordingly. That way, we make sure that a precedence relation holds whenever two tasks are in the same group or one is in a subgroup of the other’s group. For this purpose, the constraint refers, for each group, to the index set of the groups embraced by that group, including the group itself.
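The group hierarchy can be sketched as a binary tree over the cores. The heap-style numbering used below (root group 1, children of group g are 2g and 2g + 1), the zero-based core indices, and the function names are assumptions for illustration; the sketch further assumes that the number of cores is a power of two.

```python
def core_groups(num_cores):
    """Map each group index to the list of cores it contains.

    Assumed numbering: group 1 is the root; group g splits into 2g and 2g + 1.
    """
    groups = {1: list(range(num_cores))}
    for g in range(1, num_cores):  # groups 1 .. num_cores - 1 are inner groups
        cores = groups[g]
        half = len(cores) // 2
        groups[2 * g] = cores[:half]
        groups[2 * g + 1] = cores[half:]
    return groups


def embraced_groups(g, num_cores):
    """Indices of all groups contained in group g, including g itself."""
    result, frontier = [], [g]
    while frontier:
        h = frontier.pop()
        if h <= 2 * num_cores - 1:  # valid group indices are 1 .. 2p - 1
            result.append(h)
            frontier.extend([2 * h, 2 * h + 1])
    return sorted(result)


groups = core_groups(4)
print(groups[1], groups[2], groups[7])  # [0, 1, 2, 3] [0, 1] [3]
print(embraced_groups(2, 4))            # [2, 4, 5]
```

Two tasks need a precedence relation exactly when one task's group appears in the embraced set of the other's group, since only then do they share at least one core.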
Since the crown scheduler features a predetermined execution order (cf. Section 1), the constraints previously controlling precedence relations and task start and completion times are now disposed of. The only remaining decision variables are the , whose semantics is the same as for the group scheduler, and we still have given that is mapped to . Regarding the ILP’s constraints, solely (3) is carried over from the group scheduler. Beyond that, two new constraints are adopted:
Here, the first constraint precludes a task from being mapped to a group whose size is larger than the task’s maximum width. The second constraint ensures that no core receives more work than it can handle until the deadline by requiring, for each core, that the accumulated runtime of the tasks executed in any of the groups the core is a member of does not exceed $M$.
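The per-core load condition can be checked for a given mapping as follows. This is a minimal sketch, assuming heap-style group numbering (root group 1, leaf group of core c is p + c for p cores), zero-based core indices, and per-core runtimes that already reflect the chosen frequencies; all names are hypothetical.

```python
def groups_of_core(core, num_cores):
    """Indices of all groups containing the core, from its leaf group up to the root."""
    g = num_cores + core  # assumed index of the core's leaf group
    path = []
    while g >= 1:
        path.append(g)
        g //= 2
    return path


def respects_deadline(mapping, runtimes, num_cores, deadline):
    """True if, on every core, the runtimes of all tasks in its groups fit the deadline.

    mapping: task -> group index, runtimes: task -> per-core runtime.
    """
    for core in range(num_cores):
        containing = set(groups_of_core(core, num_cores))
        load = sum(runtimes[t] for t, g in mapping.items() if g in containing)
        if load > deadline:
            return False
    return True


mapping = {"a": 1, "b": 2, "c": 3, "d": 7}
runtimes = {"a": 2.0, "b": 1.0, "c": 1.5, "d": 0.5}
print(groups_of_core(3, 4))                                          # [7, 3, 1]
print(respects_deadline(mapping, runtimes, num_cores=4, deadline=4.0))  # True
print(respects_deadline(mapping, runtimes, num_cores=4, deadline=3.5))  # False
```

Because the execution order is fixed from larger to smaller groups, summing the runtimes per core is enough; no start times or precedence variables are required.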
We have conducted experiments with synthetic task sets comprising 4, 8, 16, and 32 tasks. For each cardinality, we have created 10 task sets for a total of 40 task sets. The tasks’ workloads are randomly determined integers and the maximum widths were chosen randomly from a set of permitted values, both based on a uniform distribution but under a restriction on the maximum width that depends on the task’s workload. Thus, no large tasks with low maximum width occur, which might call for loose deadlines to produce a feasible schedule in the first place. We have computed schedules for machines with 4 and 8 cores to cover the aspect of machine size. For any combination of task set size and machine size, four schedules per task set were determined via the four scheduling techniques presented in Section 3: unrestricted scheduling, scheduling under allocation constraints, scheduling under allocation and group constraints, and crown scheduling.
All schedulers assume a generic core with power consumption modelled similar to ARM’s big.LITTLE architecture [kessler:2019]. The parallel efficiency is computed as in [MelotKKE14]:
where is executed on cores, and the deadline is determined as in [kessler:2019]:
where for and for . These values were the lowest still yielding feasible solutions in all cases for the respective machine sizes. Here, and denote the machine’s minimum and maximum operating frequencies, which in our case are and , cf. [kessler:2019].
For solving the ILPs, we have deployed the Gurobi 8.1.0 solver and the gurobipy module for Python. All schedules were computed on an AMD Ryzen 7 2700X with 8 cores and SMT. The ILP solver itself chooses how many of the up to 16 threads it uses. The timeout was set to 5 minutes real (wall clock) time.
Aside from the schedules’ total energy consumption as a measure of the schedules’ quality, the schedulers’ execution time is of major interest in the present context. Generally speaking, solving an ILP is an expensive procedure, oftentimes requiring extensive computations. Table 1 gives a first impression regarding the schedulers’ resource consumption by presenting the number of timeouts reached for all combinations of scheduler, machine size, and task set size. As one can see, for small task sets of size 4, no timeouts have occurred. Large task sets of size 32 always lead to reaching the timeout. Differences between the four schedulers can only be observed for task set sizes of 8 and 16, for both machine sizes. As one would expect, the more constraints a scheduler is subject to, the fewer timeouts it encounters. The largest gap can be found between the group and the crown scheduler. When looking at task sets of size 16, the crown scheduler reaches the timeout in 1 of 20 cases, while all other schedulers never discover an optimal solution before the timeout occurs. On these grounds, one may surmise that the crown scheduler’s predefined execution order, its distinctive feature in our investigation, substantially lowers the effort in the scheduling process.
Table 1: Number of timeouts reached per scheduler, machine size, and task set size.
| # cores | # tasks | unrestricted | allocpow2 | group | crown |
To get a clearer picture, Table 2 provides the average scheduling times (CPU times, i. e. sum of user and system times) and standard deviation for each combination of scheduler, machine size, and task set size. Figure 3 shows average scheduling time values for the constrained schedulers relative to the unrestricted scheduler. We can see that the situation is similar for both machine sizes examined here. For very small task sets of size 4, all of the schedulers have produced solutions rapidly. For the crown scheduler, this also applies to task sets of size 8 (the corresponding bar in fact is hardly noticeable), whereas the other schedulers’ execution times are significantly longer. Here, restricting the allocation to powers of 2 halves scheduling time in relation to unrestricted scheduling, while adding the group constraints does not yield further gains. When looking at task sets of size 16, all schedulers but the crown scheduler constantly ran into the 5 minute wall clock timeout. Apparently, the unrestricted as well as the allocation-constrained scheduler were executed in 16 threads, while the group scheduler ran in 8 threads. This decision was made by the ILP solver. The crown scheduler not only makes do with roughly 35% of the unrestricted scheduler’s execution time, it also affords optimal solutions in all cases but one (see Footnote 2), cf. Table 3. The largest gap in terms of resource consumption thus again opens up between the group and the crown scheduler. For large task sets of size 32, all schedulers have reached the timeout in every case. Interestingly, the crown scheduler was executed in 16 threads, while the other three schedulers ran in 8 threads (and therefore their CPU time is half the crown scheduler’s).
Footnote 2: One should note though that these solutions are optimal with regard to the crown scheduler’s solution space, which is severely restricted in comparison to the unrestricted scheduler’s. We will further consider solution quality below.
In most cases, standard deviation is fairly low indicating a roughly uniform scheduling time over all 10 task sets considered for a particular combination of machine size and task set size. For each scheduler, there is one task set size where standard deviation is high, suggesting that some task sets could be scheduled quickly and others took substantially longer, possibly even until timeout. Interestingly, the task set size in question is 16 for the crown scheduler and 8 for all other schedulers, leading to the conjecture that scheduling difficulty rises more slowly for the crown scheduler with increasing task set size.
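Table 2: Average scheduling time (CPU time) and standard deviation per scheduler (unrestricted, allocpow2, group, crown), machine size, and task set size.
| time | st. dev. | time | st. dev. | time | st. dev. | time | st. dev. |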
When it comes to the schedulers’ performance in terms of solution quality, a first approach may be the number of optimal solutions each scheduler produces. From Table 3 one can gather that introducing the group constraints does not lead to an increase in optimal solutions discovered over the allocation-constrained scheduler. Both perform slightly better than the unrestricted scheduler though. The crown scheduler once again is far ahead of the other schedulers, mostly due to its strong performance for medium-sized task sets. One must keep in mind here that these figures reflect each scheduler’s performance with regard to its own search space. Obviously, a smaller search space is beneficial when an optimal solution is to be found within a fixed period of time.
Table 3: Number of optimal solutions found per scheduler, machine size, and task set size.
| # cores | # tasks | unrestricted | allocpow2 | group | crown |
It is therefore of great interest to compare the energy consumption values for the schedules produced by the four schedulers. Table 4 shows the respective values relative to the unrestricted scheduler’s. For small task sets of 4 tasks, the constrained allocation leads to slightly higher energy consumption on average. Further restrictions do not bring about yet another loss of solution quality. All schedules for the small task sets are optimal. Here, the unrestricted scheduler capitalizes on the more extensive search space. When task sets are larger, this benefit turns into a burden. Although the unrestricted scheduler’s solution space comprises all the other schedulers’ solution spaces, it does not manage to discover equally good solutions in due time. As one can see from Table 4, restricting the allocation does not change much in terms of energy consumption. Introducing additional group constraints in many cases does not have a massive impact, either. On the machine with 4 cores one can notice though that the deviation in both directions may be more pronounced: for the task sets with 16 tasks, the schedules’ energy consumption is at 96% of the unrestricted scheduler’s on average, while for the largest task sets with 32 tasks, it climbs to 114%. Again, the most significant shift must be ascribed to the crown scheduler. For both machine sizes, the figures show a clear trend: the larger the task sets, the more energy is saved compared to the unrestricted scheduler. Since this observation does not apply to the group scheduler, one is led to conjecture that the crown scheduler’s predetermined execution order is the relevant factor enabling it to find higher quality solutions within a given time frame in relation to the other schedulers. Presumably, the execution order constraint considerably downsizes the search space without eliminating all the high quality solutions at the same time.
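Table 4: Energy consumption of the constrained schedulers’ schedules relative to the unrestricted scheduler’s.
| # cores | # tasks | allocpow2 | group | crown |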
All in all, in this section we have established that introducing allocation and group constraints yields similar solution quality when compared to an unrestricted scheduler, while scheduling time is significantly lower for small task sets. A further massive runtime decrease can be observed for the crown scheduler, as long as the timeout is not hit, which is constantly the case when task sets are large. Moreover, the crown scheduler’s execution order constraints are likely to be credited with an improvement in solution quality, i. e. schedule energy consumption, over the other schedulers for large task sets. As we have seen, the gap broadens with increasing task set size. Only for very small task sets does the unrestricted scheduler deliver an uncontested performance. All these findings are largely independent of the machine size. Finally, our investigation has revealed that solely constraining the allocation and potentially forming groups does not yield the benefits of the crown scheduling technique: a very low scheduling time when task sets are small, and a superior solution quality for larger task sets when scheduling time is limited. In nearly all scenarios, taking the additional step from group to crown scheduler thus pays off.
We have presented a study on the evolution of scheduling time and energy efficiency of the resulting schedules when progressively constraining an unrestricted scheduler’s search space, for sets of independent, non-preemptive, moldable tasks and parallel machines with discrete frequency levels. Our studies indicate that constraining the tasks’ execution order has the most influence on both scheduler execution time and energy efficiency, given that scheduling time is constrained as well. Thus, in most of the considered scenarios users are well-advised to deploy the crown scheduler, except for very small task sets, where the unrestricted scheduler can produce superior solutions without struggling with time constraints.
Future work will comprise the study of more fine-grained constraints. For example, one could first constrain assignments to consecutive processors, without being so strict as to only allow assignments within core groups. Also, the order in which constraints are applied can be varied, for example assignment could be constrained before allocation. Furthermore, evaluation shall be extended to task sets derived from real applications.
We thank Christoph Kessler for many discussions and years – past and future – of fruitful and inspiring collaboration.