Efficient energy management has become an important issue for modern computing systems due to their increasing computational power demands, e.g. sensor networks, satellites, multi-robot systems, as well as personal electronic devices. There are two common schemes used in modern computing energy management systems. One is dynamic power management (DPM), where certain parts of the system are turned off during the processor idle state. The other is dynamic voltage and frequency scaling (DVFS), which reduces the energy consumption by exploiting the relation between the supply voltage and power consumption. In this work, we consider the problem of scheduling real-time tasks on heterogeneous multiprocessors under a DVFS scheme with the goal of minimizing energy consumption, while ensuring that both the execution cycle requirement and timeliness constraints of real-time tasks are satisfied.
1.1 Terminologies and Definitions
This section provides basic terminologies and definitions used throughout the paper.
Task : An aperiodic task tau_i is defined as a triple tau_i = (c_i, d_i, a_i); c_i is the required number of CPU cycles needed to complete the task, d_i is the task's relative deadline and a_i is the arrival time of the task. A periodic task is defined as a triple tau_i = (c_i, d_i, T_i), where T_i is the task's period. If the task's deadline is equal to its period, the task is said to have an 'implicit deadline'. The task is considered to have a 'constrained deadline' if its deadline is not larger than its period, i.e. d_i <= T_i. In the case that the task's deadline can be less than, equal to, or greater than its period, it is said to have an 'arbitrary deadline'. Throughout the paper, we will refer to a task as an aperiodic task model unless stated otherwise, because a periodic task can be transformed into a collection of aperiodic tasks with appropriately defined arrival times and deadlines, i.e. the j-th instance of a periodic task tau_i arrives at time (j-1)T_i, has the required execution cycles c_i and an absolute deadline at time (j-1)T_i + d_i. Moreover, for a periodic taskset, we only need to find a valid schedule within its hyperperiod L, defined as the least common multiple (LCM) of all task periods, i.e. the total number of job instances of a periodic task tau_i during the hyperperiod is equal to L/T_i. The taskset is defined as the set of all tasks. The taskset is feasible if there exists a schedule such that no task in the taskset misses its deadline.
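The periodic-to-aperiodic transformation above can be sketched as follows; the tuple layout and helper name are illustrative, not from the paper:

```python
from math import lcm

def expand_periodic_tasks(tasks):
    """Expand periodic tasks (c, d, T) into aperiodic jobs (c, absolute_deadline, arrival)
    over one hyperperiod, assuming all tasks first arrive at time 0."""
    H = lcm(*(T for (_, _, T) in tasks))   # hyperperiod = LCM of all periods
    jobs = []
    for (c, d, T) in tasks:
        for j in range(H // T):            # H/T job instances of each task
            arrival = j * T
            jobs.append((c, arrival + d, arrival))
    return H, jobs

# two implicit-deadline tasks: (1 cycle-unit, deadline 4, period 4) and (2, 6, 6)
H, jobs = expand_periodic_tasks([(1, 4, 4), (2, 6, 6)])
```

Here `H` is 12 (the LCM of 4 and 6) and the first task contributes three jobs, the second two, so only these five jobs need to be scheduled within the hyperperiod.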
Speed : The operating speed s_j is defined as the ratio between the operating frequency f_j of a processor of type-j and the maximum system frequency f_max, i.e. s_j = f_j / f_max, s_j in [0, 1], where f_max is the largest frequency supported by any processor type.
Minimum Execution Time : The minimum execution time x_i is the execution time of task tau_i when executed at the maximum system frequency f_max, i.e. x_i = c_i / f_max. (In the literature, this is often called 'worst-case execution time'. However, in the case where the speed is allowed to vary, using the term 'minimum execution time' makes more sense, since the execution time increases as the speed is scaled down. For simplicity of exposition, we also assume no uncertainty, hence 'worst-case' is not applicable here. Extensions to uncertainty should be relatively straightforward, in which case x_i then becomes the 'minimum worst-case execution time'.)
Task Density : For a periodic task, the task density delta_i is defined as the ratio between the task execution time and the minimum of its deadline and its period, i.e. delta_i = c_i / (s_i f_max min(d_i, T_i)), where s_i is the task execution speed. (When all tasks are assumed to have implicit deadlines, this is often called 'task utilization'.)
Taskset Density : The taskset density delta of a periodic taskset is defined as the summation of all task densities in the taskset, i.e. delta = sum_i delta_i. The minimum taskset density delta_min is obtained by executing every task at the maximum speed, i.e. delta_min = sum_i c_i / (f_max min(d_i, T_i)).
System Capacity : The system capacity is defined as C = sum_j m_j s_j^max, where s_j^max is the maximum speed of a processor of type-j, i.e. s_j^max = f_j^max / f_max, and m_j is the total number of processors of type-j.
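A minimal numerical illustration of the density and capacity definitions (function names and the example platform are ours, loosely modelled on a big.LITTLE-style system):

```python
def task_density(c, d, T, s, f_max=1.0):
    # density = execution time at speed s, divided by min(deadline, period)
    return (c / (s * f_max)) / min(d, T)

def system_capacity(clusters):
    # clusters: list of (m_j processors, maximum speed s_j_max); C = sum m_j * s_j_max
    return sum(m * s_max for (m, s_max) in clusters)

# two tasks executed at full speed on a platform with 4 slow and 4 fast cores
delta = task_density(1, 4, 4, 1.0) + task_density(2, 6, 6, 1.0)
cap = system_capacity([(4, 0.5), (4, 1.0)])
```

With these numbers the taskset density is 1/4 + 1/3, comfortably below the system capacity of 6, so the taskset is far from saturating the platform.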
Migration Scheme: A global scheduling scheme allows task migration between processors and a partitioned scheduling scheme does not allow task migration.
Feasibility Optimal: An algorithm is feasibility optimal if the algorithm is guaranteed to be able to construct a valid schedule such that no deadlines are missed, provided a schedule exists.
Energy Optimal: An algorithm is energy optimal when it is guaranteed to find a schedule that minimizes the energy, while meeting the deadlines, provided such a schedule exists.
Step Function: A function f : [0, T] -> R is a step (also called a piecewise constant) function if there exists a finite partition {I_1, ..., I_K} of [0, T] and a set of real numbers {v_1, ..., v_K} such that f(t) = v_k for all t in I_k, k in {1, ..., K}.
1.2 Related Work
Due to the heterogeneity of the processors, one should not only consider the different operating frequency sets among processors, but also the hardware architecture of the processors, since task execution time will be different for each processor type. In other words, the system has to be captured by two aspects: the difference in operating speed sets and the execution cycles required by different tasks on different processor types.
With these aspects, fully-migration/global scheduling algorithms, where tasks are allowed to migrate between different processor types, are not applicable in practice, since it is difficult to quantify how much computational work is executed on one processor type compared to another due to differences in instruction sets, register formats, etc. Thus, most of the work related to heterogeneous multiprocessor scheduling is partition-based/non-preemptive task scheduling [1, 2, 3, 4, 5, 6, 7], i.e. tasks are partitioned onto one of the processor types and a well-known uniprocessor scheduling algorithm, such as Earliest Deadline First (EDF), is used to find a valid schedule. With this scheme, the heterogeneous multiprocessor scheduling problem is reduced to a task partitioning problem, which can be formulated as an integer linear program (ILP). Examples of such work are  and .
However, with the advent of ARM two-type heterogeneous multicore architectures, such as the big.LITTLE architecture , which support task migration among different core types, a global scheduling algorithm is possible. In [10, 11], the first energy-aware global scheduling framework for this special architecture is presented, where an algorithm called Hetero-Split is proposed to solve the workload assignment problem and a Hetero-Wrap algorithm to solve the schedule generation problem. Their framework is similar to ours, except that we adopt a fluid model to represent the scheduling dynamics, our assigned operating frequency is time-varying and the CPU idle energy consumption is also considered.
A fluid model is the ideal schedule path of a real-time task: the remaining execution time is represented by a straight line whose (negative) slope has magnitude equal to the task execution speed. However, a practical task execution path is not a single straight line, since a task may be preempted by other tasks: an execution interval of a task is represented by a line segment with negative slope and a non-execution interval by a segment with zero slope.
There are at least two well-known homogeneous multiprocessor scheduling algorithms that are based on a fluid scheduling model: Proportionate-fair (Pfair)  and Largest Local Remaining Execution Time First (LLREF) . Both Pfair and LLREF are global scheduling algorithms. By introducing the notion of fairness, Pfair ensures that at any instant no task is one or more quanta (time intervals) away from the task's fluid path. However, the Pfair algorithm suffers from a significant run-time overhead, because tasks are split into several segments, incurring frequent algorithm invocations and task migrations. To overcome the disadvantages of quantum-based scheduling algorithms, the LLREF algorithm splits/preempts a task at only two scheduling events within each time interval. One occurs when the local remaining execution time of the running task reaches zero, so it is better to select another task to run. The other happens when a task has no laxity, i.e. the difference between the task deadline and the remaining execution time left is zero, hence the task needs to be selected immediately in order to finish the remaining workload in time.
The unified theory of the deadline partitioning technique and its feasibility optimal versions, called DP-FAIR, for periodic and sporadic tasks is given in . Deadline Partitioning (DP)  is the technique that partitions time into intervals bounded by two successive task deadlines, after which each task is allocated a workload and is scheduled within each time interval. A simple optimal scheduling algorithm based on DP-FAIR, called DP-WRAP, was presented in . The DP-WRAP algorithm partitions time according to the DP technique and, in each time interval, schedules the tasks using McNaughton's wrap around algorithm . McNaughton's wrap around algorithm aligns all task workloads along a real number line, starting at zero, then splits this line into chunks of unit length and assigns each chunk to a separate processor. Note that a task whose workload has been split migrates between its two assigned processors. The work of  was extended in [16, 17] by incorporating a DVFS scheme to reduce power consumption.
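McNaughton's wrap around idea can be sketched in a few lines; the function name and tolerance handling are ours, and workloads are expressed as fractions of one unit-length interval:

```python
def mcnaughton_wrap(workloads):
    """Assign per-interval workloads (each <= 1, as fractions of the interval) to
    unit-capacity processors by laying them along a number line and wrapping;
    a task split at a wrap point migrates between the two adjacent processors."""
    schedule = []          # per processor: list of (task, start, end) in [0, 1]
    current, t = [], 0.0
    for task, w in workloads:
        while w > 1e-12:
            fit = min(w, 1.0 - t)          # how much fits on the current processor
            current.append((task, t, t + fit))
            w -= fit
            t += fit
            if t >= 1.0 - 1e-12:           # processor full: wrap to the next one
                schedule.append(current)
                current, t = [], 0.0
    if current:
        schedule.append(current)
    return schedule

sched = mcnaughton_wrap([("t1", 0.6), ("t2", 0.8), ("t3", 0.6)])
```

In this example the total workload is 2.0, so two processors suffice; task `t2` is split at the wrap point and runs at the end of the interval on the first processor and at the beginning on the second, so its two chunks never overlap in time.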
However, the algorithms that are based on the fairness notion [13, 18, 19, 14, 16, 17] are feasibility optimal, but have hardly been applied in a real system, since they suffer from high scheduling overheads, i.e. task preemptions and migrations. Recently, two feasibility optimal algorithms that are not based on the notion of fairness have been proposed. One is the RUN algorithm , which uses a dualization technique to reduce the multiprocessor scheduling problem to a series of uniprocessor scheduling problems. The other is U-EDF , which generalises the earliest deadline first (EDF) algorithm to multiprocessors by reducing the problem to EDF on a uniprocessor.
Alternatively to the above methods, the multiprocessor scheduling problem can also be formulated as an optimization problem. However, since the problem is NP-hard in general, an approximate polynomial-time heuristic method is often used. Examples of these approaches can be found in [23, 24], which consider energy-aware multiprocessor scheduling with probabilistic task execution times. The tasks are partitioned among the set of processors, after which the running frequency is computed based on the task execution time probabilities. Among all of the feasible assignments, an optimal energy consumption assignment is chosen by solving a mathematical optimization problem, where the objective is to minimize some energy function. The constraints ensure that all tasks will meet their deadlines and that only one processor is assigned to each task. In partitioned scheduling algorithms, such as [23, 24], once a task is assigned to a specific processor, the multiprocessor scheduling problem is reduced to a set of uniprocessor scheduling problems, which is well studied . However, a partitioned scheduling method cannot provide an optimal schedule.
The main contributions of this work are:
The formulation of a real-time multiprocessor scheduling problem as an infinite-dimensional continuous-time optimal control problem.
Three mathematical programming formulations for solving a hard real-time task scheduling problem on heterogeneous multiprocessor systems with DVFS capabilities.
A generalised optimal speed profile solution to a uniprocessor scheduling problem with a real-time taskset.
A multiprocessor scheduling algorithm that is both feasibility optimal and energy optimal.
Formulations capable of solving a multiprocessor scheduling problem with arbitrary periodic as well as aperiodic tasksets, unlike existing work, due to the incorporation of the scheduling dynamics and a time-varying speed profile.
Applicability of the proposed algorithms to both an online scheduling scheme, where the characteristics of the taskset are not known until execution time, and an offline scheduling scheme, where the taskset information is known a priori.
Moreover, the proposed formulations can also be extended to a multicore architecture, which only allows frequency to be changed at a cluster-level, rather than at a core-level, as explained in Section 2.3.
This paper is organized as follows: Section 2 defines our feasibility scheduling problem in detail. Details on solving the scheduling problem with finite-dimensional mathematical optimization are given in Section 3. The optimality problem formulations are presented in Section 4. The simulation setup and results are presented in Section 5. Finally, conclusions and future work are discussed in Section 6.
2 Feasibility Problem Formulation
Though our objective is to minimize the total energy consumption, we will first consider a feasibility problem before presenting an optimality problem.
2.1 System model
We consider a set of n real-time tasks that are to be partitioned on a two-type heterogeneous multiprocessor system composed of m_j processors of type-j, j in {1, 2}. We will assume that the system supports task migration among processor types, e.g. sharing the same instruction set and having a special interconnection for data transfer between processor types. Note that c_i is the same for all processor types, since the instruction set is the same.
2.2 Task/Processor Assumptions
All tasks do not share resources, do not have any precedence constraints and are ready to start at the beginning of the execution. A task can be preempted/migrated between different processor types at any time. The cost of preemption and migration is assumed to be negligible or included in the minimum task execution times. Processors of the same type are homogeneous, i.e. having the same set of operating frequencies and power consumptions. Each processor’s voltage/speed can be adjusted individually. Additionally, for an ideal system, a processor is assumed to have a continuous speed range. For a practical system, a processor is assumed to have a finite set of operating speed levels.
2.3 Scheduling as an Optimal Control Problem
Below, we will refer to the set of task indices, the set of processor-type indices and the time horizon [0, t_f], where t_f is the largest deadline of all tasks. The scheduling problem can therefore be formulated as the following infinite-dimensional continuous-time optimal control problem:
where the state x_i(t) is the remaining minimum execution time of task i at time t, the control input s_{j,m}(t) is the execution speed of processor m of type-j at time t and the control input a_{i,j,m}(t) is used to indicate the processor assignment of task i at time t, i.e. a_{i,j,m}(t) = 1 if and only if task i is active on processor m of type-j. Notice that here we formulated the problem with speed selection at core-level; adapting it to the stricter assumption of a multicore architecture, i.e. a cluster-level speed assignment, is straightforward, namely by replacing the core-level speed assignment with a cluster-level speed assignment in the above formulation.
The initial conditions on the minimum execution time of all tasks and the task deadline constraints are specified in (1a) and (1b), respectively. The fluid model of the scheduling dynamics is given by the differential constraint (1c). Constraint (1d) ensures that each task will be assigned to at most one non-idle processor at a time. Constraint (1e) guarantees that each non-idle processor will be assigned at most one task at a time. The speeds are constrained by (1f) to take on values from the set of allowable speeds. Constraint (1g) enforces that the task assignment variables are binary. Lastly, (1h) requires that the control inputs be step functions.
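Since the display of problem (1) was lost in extraction, the constraint descriptions above can be collected into the following sketch, in our own notation (i indexes tasks, m a processor of type-j, S_j the allowable speed set of type-j; the feasibility problem has no objective):

```latex
\begin{align*}
\text{find}\quad & x,\ s,\ a\\
\text{s.t.}\quad
& x_i(a_i) = c_i / f_{\max} && \forall i && \text{(1a)}\\
& x_i(a_i + d_i) = 0 && \forall i && \text{(1b)}\\
& \dot{x}_i(t) = -\textstyle\sum_{j,m} a_{i,j,m}(t)\, s_{j,m}(t) && \forall i,\ t && \text{(1c)}\\
& \textstyle\sum_{j,m} a_{i,j,m}(t) \le 1 && \forall i,\ t && \text{(1d)}\\
& \textstyle\sum_{i} a_{i,j,m}(t) \le 1 && \forall j,\ m,\ t && \text{(1e)}\\
& s_{j,m}(t) \in \mathcal{S}_j && \forall j,\ m,\ t && \text{(1f)}\\
& a_{i,j,m}(t) \in \{0,1\} && \forall i,\ j,\ m,\ t && \text{(1g)}\\
& s_{j,m}(\cdot),\ a_{i,j,m}(\cdot)\ \text{step functions} &&&& \text{(1h)}
\end{align*}
```

This is a reconstruction from the prose, not the paper's verbatim formulation; symbol choices are ours.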
3 Solving the Scheduling Problem with Finite-dimensional Mathematical Optimization
The original problem (1) will be discretized by introducing piecewise constant constraints on the control inputs s and a. Let {t_0, t_1, ..., t_K}, which we will refer to as the major grid, denote the set of discretization time steps corresponding to the distinct arrival times and deadlines of all tasks within [0, t_f], where t_0 = 0 < t_1 < ... < t_K = t_f.
3.1 Mixed-Integer Nonlinear Program (MINLP-DVFS)
The above scheduling problem, subject to piecewise constant constraints on the control inputs, can be naturally formulated as an MINLP, defined below. Since the context switches due to task preemption and migration can jeopardize the performance, a variable discretization time step  method is applied on a minor grid, so that the solution to our scheduling problem does not depend on the size of the discretization time step. Each major-grid interval is subdivided into a set of minor-grid time steps whose lengths are to be determined from solving an appropriately-defined optimization problem.
Let and be short notations for and . Define the notation . Denote the discretized state and input sequences as
Let the discretized control inputs s and a be step functions in between time instances on the minor grid, i.e.
Let denote the set of all tasks within , i.e. . Define a task arrival time mapping by such that for all and a task deadline mapping by such that for all . Define and let be short notation for
Solving the first-order ODE (1c) with piecewise constant inputs shows that the discretized states have to satisfy the difference constraint
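In our notation (matching the continuous-time sketch of (1), with [k] denoting the value on the k-th minor-grid interval), this difference constraint can be written as:

```latex
x_i[k+1] \;=\; x_i[k] \;-\; \sum_{j,m} a_{i,j,m}[k]\, s_{j,m}[k]\, \big(t_{k+1} - t_k\big),
\qquad \forall i,\ k,
```

i.e. the remaining minimum execution time decreases linearly at the assigned speed over each interval. This is a reconstruction of the lost display, not the paper's verbatim equation.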
subject to (4a) and
3.2 Computationally Tractable Multiprocessor Scheduling Algorithms
The time to compute a solution to problem (4) is impractical even for a small problem size. However, if we relax the binary constraints in (4g) so that the relaxed variable can be interpreted as the percentage of a time interval during which the task is executed (denoted as w in later formulations), rather than the processor assignment, the problem can be reformulated as an NLP for a system with a continuous operating speed range and an LP for a system with discrete speed levels. The NLP and LP can be solved in a fraction of the time taken to solve the MINLP above. In particular, the heterogeneous multiprocessor scheduling problem can be simplified into two steps:
- STEP 1:
Determine the percentage of task execution times and execution speed within a time interval such that the feasibility constraints are satisfied.
- STEP 2:
From the solution of the workload partitioning step, find the execution order of all tasks within the time interval such that no task is executed on more than one processor at a time.
3.2.1 Solving the Workload Partitioning Problem as a Continuous Nonlinear Program (NLP-DVFS)
Since knowing the processor on which a task will be executed does not help in finding the task execution order, the processor assignment subscript of the control variables is dropped to reduce the number of decision variables. Moreover, partitioning time using only the major grid is enough to guarantee a valid solution, i.e. the percentage of the task execution time within a major-grid interval is equal to the sum of all percentages of task execution times on the minor grid. We also assume that the set of allowable speed levels is a closed interval given by a lower bound and an upper bound.
Consider now the following finite-dimensional NLP:
where the decision variable is defined as the percentage of the time interval for which task i executes on a processor of type-j at speed s. Constraint (6d) guarantees that a task will not run on more than one processor at a time. The constraint that the total workload in each time interval should be less than or equal to the system capacity is specified in (6e). Upper and lower bounds on the task execution speed and on the percentage of task execution time are given in (6f) and (6g), respectively.
3.2.2 Solving the Workload Partitioning Problem as a Linear Program (LP-DVFS)
The problem (6) can be further simplified to an LP if the set of speed levels is finite, as is often the case for practical systems. We denote the execution speed at level l of a type-j processor by s_{j,l}, where l in {1, ..., L_j} and L_j is the total number of speed levels of a type-j processor.
Consider now the following finite-dimensional LP:
where the decision variable is the percentage of the time interval for which task i executes on a processor of type-j at speed level l. Note that all constraints are analogous to those of (6), but with the speed levels fixed.
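The structure of such a workload-partitioning solution can be illustrated with a small feasibility check of the LP constraints (a sketch, not a solver; the dictionary layout and names are ours):

```python
def check_partition(w, h, speeds, m):
    """Check the LP-DVFS workload-partitioning constraints for one interval of length h.
    w[i][(j, l)]: fraction of the interval task i runs on type-j at speed level l.
    speeds[j][l]: speed at level l of type-j; m[j]: number of processors of type-j."""
    done, ok = {}, True
    for i, alloc in w.items():
        if sum(alloc.values()) > 1 + 1e-9:   # a task may not exceed 100% of the interval
            ok = False
        # workload completed = sum of fraction * interval length * speed
        done[i] = sum(frac * h * speeds[j][l] for (j, l), frac in alloc.items())
    for j in m:                               # per-cluster capacity constraint
        load = sum(frac for alloc in w.values()
                   for (jj, l), frac in alloc.items() if jj == j)
        if load > m[j] + 1e-9:
            ok = False
    return done, ok

# task t2 migrates: 25% of the interval on the fast cluster, 50% on the slow one
w = {"t1": {(0, 0): 0.5}, "t2": {(1, 1): 0.25, (0, 0): 0.5}}
done, ok = check_partition(w, h=2.0, speeds={0: [0.5], 1: [0.5, 1.0]}, m={0: 1, 1: 1})
```

Here `done` gives the execution-time-equivalent workload completed per task, which the LP would require to match the workload allocated to the interval.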
3.2.3 Task Ordering Algorithm
This section discusses how to find a valid schedule in the task ordering step for each time interval . Since the solutions obtained in the workload partitioning step are partitioning workloads of each task on each processor type within each time interval, one might think of using McNaughton’s wrap around algorithm  to find a valid schedule for each processor within the processor type. However, McNaughton’s wrap around algorithm only guarantees that a task will not be executed at the same time within the cluster. There exists a possibility that a task will be assigned to more than one processor type (cluster) at the same time.
To avoid parallel execution on any two clusters, we can adopt the Hetero-Wrap algorithm proposed in  to solve the task ordering problem on a two-type heterogeneous multiprocessor platform. The algorithm takes the workload partitioning solution of STEP 1 as its input and returns a task-to-processor interval assignment on each cluster. Note that, for a solution to problem (7), we define the total execution workload of a task as the sum over all its speed levels and assume that the percentages of execution time of each task at all frequency levels are grouped together in order to minimize the number of migrations and preemptions. To keep the paper self-contained, the Hetero-Wrap algorithm is given in Algorithm 1.
Specifically, the algorithm classifies the tasks into four subsets: (i) a set of migrating tasks with , (ii) a set of migrating tasks with , (iii) a set of partitioned tasks on the cluster of type-1, and (iv) a set of partitioned tasks on the cluster of type-2. The algorithm then employs the following simple rules:
For a type-1 cluster, tasks are scheduled in the order of and using McNaughton’s wrap around algorithm. That is, a slot along the number line is allocated, starting at zero, with the length equal to and the task is aligned with its assigned workload on empty slots of the cluster in the specified order starting from left to right.
For a type-2 cluster, in the same manner, tasks are scheduled using McNaughton’s wrap around algorithm, but in the order of and starting from right to left. Note that the order of tasks in has to be consistent with the order in a type-1 cluster.
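The opposite-direction wrapping idea behind these rules can be sketched for the simplest case of one processor per cluster and a single inter-cluster migrating task (names and the tuple layout are ours, not the paper's Algorithm 1):

```python
def hetero_wrap(cluster1, cluster2, mig=None):
    """Schedule one unit interval: cluster-1 workloads left-to-right, cluster-2
    workloads right-to-left; the migrating task mig = (name, w1, w2) goes first on
    both clusters, so its two chunks occupy disjoint time windows when w1 + w2 <= 1."""
    s1, s2, t = [], [], 0.0
    if mig:
        name, w1, w2 = mig
        s1.append((name, 0.0, w1))         # left-aligned chunk on cluster 1
        s2.append((name, 1.0 - w2, 1.0))   # right-aligned chunk on cluster 2
        t = w1
    for task, w in cluster1:               # remaining type-1 tasks, left to right
        s1.append((task, t, t + w)); t += w
    t = 1.0 - (mig[2] if mig else 0.0)
    for task, w in cluster2:               # remaining type-2 tasks, right to left
        s2.append((task, t - w, t)); t -= w
    return s1, s2

# hypothetical workloads: "m" migrates, using 30% of cluster 1 and 40% of cluster 2
s1, s2 = hetero_wrap([("a", 0.5)], [("b", 0.4)], ("m", 0.3, 0.4))
```

Because the migrating task's cluster-1 chunk ends at 0.3 and its cluster-2 chunk starts at 0.6, the task is never executed on both clusters at once; the full algorithm generalizes this to several processors per cluster via McNaughton-style wrapping.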
However, the algorithm requires a feasible solution to (6) or (7) in which there is at most one task assigned to both clusters, which we will call an inter-cluster migrating task. From Theorem 3, we can always transform a solution to (6) into a solution to (7). Therefore, we only need to show, using the following facts and lemma, that there exists a solution to (7), lying at a vertex of the feasible region, with at most one inter-cluster migrating task.
Among all the solutions to an LP, at least one solution lies at a vertex of the feasible region. In other words, at least one solution is a basic solution.
The Fundamental Theorem of Linear Programming, which states that if a feasible solution exists, then a basic feasible solution exists [27, p.38].
A feasible solution to an LP that is not a basic solution can always be converted into a basic solution.
This follows from the Fundamental Theorem of Linear Programming [27, p.38].
[28, Fact 2] Consider a linear program for some , , . Suppose that constraints are nonnegative constraints on each variable, i.e. and the rest are linearly independent constraints. If , then a basic solution will have at most non-zero values.
A unique basic solution can be identified by any linearly independent active constraints. Since there are nonnegative constraints and , a basic solution will have at most non-zero values.
For a solution to (7) that lies at a vertex of the feasible region, there will be at most one inter-cluster migrating task.
The number of variables subject to the nonnegativity constraint (7f) in each time interval of (7) is . The number of variables subject to the set of necessary and sufficient feasibility constraints (7d)-(7e) is . Note that we do not count the number of variables in (7c) because (7c) and (7d) are linearly dependent constraints for a given value of . If we assume that and each processor type has at least one speed level, then it follows from Fact 6 that the number of non-zero values of the variable , a solution to (7) at a vertex of the feasible region, is at most . Let be the number of tasks assigned to two processor types. Then there are entries of the variable that are non-zero. This implies that , i.e. the number of inter-cluster migrating tasks is at most one.
The existence of a valid schedule is proven in [11, Thm 3]. It follows from Facts 4–6 and Lemma 7 that one can compute a solution with at most one inter-cluster partitioning task. Given a solution to (6)/(7) and the output from Algorithm 1 for all intervals, choose to be a step function such that when and otherwise, . Specifically, one can verify that the following condition holds
Then it is straightforward to show that (1) is satisfied.
Note that, although we need to solve the multiprocessor scheduling problem in two steps in this section, the time to compute a solution to (6) or (7) is extremely short compared to solving problem (1), i.e. even for a small problem, the time to compute a solution of (4) can be up to an hour, while (6) or (7) can be solved in milliseconds on a general-purpose desktop PC with off-the-shelf optimization solvers. Furthermore, the complexity of Algorithm 1 is .
4 Energy Optimality
4.1 Energy Consumption model
A power consumption model can be expressed as the sum of dynamic power consumption and static power consumption. Dynamic power consumption is due to the charging and discharging of CMOS gates, while static power consumption is due to subthreshold leakage current and reverse bias junction current . The dynamic power consumption of CMOS processors at a clock frequency is given by
where the constraint
has to be satisfied . Here denotes the effective switch capacitance, is the supply voltage, is the threshold voltage ( V) and is a hardware-specific constant.
From (9b), it follows that if the clock frequency increases, then the supply voltage may have to increase (and if the frequency decreases, the supply voltage can decrease as well). In the literature, the total power consumption is often simply expressed as an increasing function of the form
where and are hardware-dependent constants, while the static power consumption is assumed to be either constant or zero .
The energy consumption of executing and completing a task at a constant speed is given by
In the literature, it is often assumed that the power is an increasing function of the operating speed. However, because the execution time is a decreasing function of the speed, the energy consumed might not be an increasing function of speed if the static power is non-zero; Figure 6 gives an example of when the energy is non-monotonic, even if the power is an increasing function of the clock frequency.
This result implies the existence of a non-zero energy-efficient speed, i.e. the minimizer of (11) [31, 32, 33]. Moreover, in the work of , a non-convex relationship between the energy consumption and the processor speed can be observed as a result of scaling the supply voltage.
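The existence of a non-zero energy-efficient speed can be seen numerically; the constants below (cubic dynamic power, a small constant static term) are illustrative, not measured values from the paper:

```python
def power(s, a=1.0, b=3.0, p_static=0.1):
    # total power at speed s: dynamic term a*s^b plus a constant static term
    return a * s**b + p_static

def energy(s, c=1.0):
    # energy to finish c cycle-units at constant speed s: E(s) = P(s) * c / s
    return power(s) * c / s

# sweep speeds in (0, 1]: with nonzero static power the minimizer is interior,
# not at the lowest speed, because the static term is integrated over a longer time
grid = [i / 1000 for i in range(100, 1001)]
s_crit = min(grid, key=energy)
```

For these constants the analytic minimizer of a*s^(b-1) + p_static/s is s = (p_static / ((b-1)a))^(1/b), about 0.37, which the grid search recovers: running slower than this wastes static energy, running faster wastes dynamic energy.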
The total energy consumption of executing a real-time task can be expressed as a summation of active energy consumption and idle energy consumption, i.e. , where is the energy consumption when the processor is busy executing the task and is the energy consumption when the processor is idle. The energy consumption of executing and completing a task at a constant speed is
where is the total power consumption in the active interval, is the total power consumption during the idle period. and are dynamic and static power consumption during the active period, respectively. Similarly, and are the dynamic and static power consumption during the idle period. will be assumed to be a constant, since the processor is executing a nop (no operation) instruction at the lowest frequency during the idle interval. and are also assumed to be constants where . Note that is strictly greater than zero.
4.2 Optimality Problem Formulation
The scheduling problem with the objective to minimize the total energy consumption of executing the taskset on a two-type heterogeneous multiprocessor can be formulated as the following optimal control problem:
subject to (1).