In a multi-tasking system, mutual exclusion for accesses to shared resources, e.g., data structures and files, has to be guaranteed to ensure the correctness of these operations. Such accesses are typically performed within so-called critical sections, which can be protected by binary semaphores or mutex locks. Hence, at any point in time, no two task instances are simultaneously inside critical sections that access the same shared resource. Moreover, advanced embedded computing systems heavily interact with the physical world, and timeliness of computation is an essential part of correctness. To ensure safe operation of such embedded systems, the satisfaction of the real-time requirements, i.e., worst-case timeliness, needs to be verified.
If aborting or restarting a critical section is not allowed, mutual exclusion implies that a higher-priority job may have to be blocked until a lower-priority job unlocks the requested shared resource that was locked earlier, a so-called priority inversion. The study of mutual exclusion in uniprocessor real-time systems can be traced back to the priority inheritance protocol (PIP) and the priority ceiling protocol (PCP) by Sha et al. in 1990 and the stack resource policy (SRP) by Baker in 1991. The Immediate PCP, a variant of the PCP, has been implemented in Ada (called Ceiling Locking) and POSIX (called the Priority Protect Protocol).
To schedule real-time tasks on multiprocessor platforms, three paradigms have been widely adopted: partitioned, global, and semi-partitioned scheduling. Partitioned scheduling statically partitions the tasks among the available processors, i.e., a task is always executed on its assigned processor. Global scheduling allows a task to migrate from one processor to another at any time. Semi-partitioned scheduling statically decides whether a task is divided into subtasks and how each task/subtask is then assigned to a processor. A comprehensive survey of multiprocessor scheduling in real-time systems can be found in the literature.
The design of synchronization protocols for real-time tasks on multiprocessor platforms started with the distributed priority ceiling protocol (DPCP), followed by the multiprocessor priority ceiling protocol (MPCP). (Neither of these two protocols had a concrete name in the original papers; in the literature, most authors refer to them as the DPCP and the MPCP, respectively.) The MPCP is based on partitioned fixed-priority scheduling and adopts the PCP for local resources. When global resources shared by tasks on different processors are requested, the MPCP executes the corresponding critical sections with priority boosting. By contrast, under the DPCP, the sporadic/periodic real-time tasks are scheduled by partitioned fixed-priority scheduling, except when accessing resources that are bound to a different processor. That is, the DPCP is a semi-partitioned scheduling approach that allows migration at the boundaries of critical and non-critical sections.
Over the years, many locking protocols have been designed and analyzed, including the multiprocessor stack resource policy (MSRP), the flexible multiprocessor locking protocol (FMLP), the multiprocessor PIP, the O(m) locking protocol (OMLP), the Multiprocessor Bandwidth Inheritance protocol (M-BWI), gEDF-vpr, LP-EE-vpr, and the Multiprocessor resource sharing Protocol (MrsP). Also, several protocols for hybrid scheduling approaches, such as clustered scheduling, reservation-based scheduling, and open real-time systems, have been proposed in recent years. To support nested critical sections, Ward and Anderson [46, 47] introduced the Real-time Nested Locking Protocol (RNLP), which adds support for fine-grained nested locking on top of non-nested protocols.
However, the performance of these protocols highly depends on 1) how the tasks are partitioned and prioritized, 2) how the resources are shared locally and globally, and 3) whether a job/task being blocked should spin or suspend itself.
Regarding task partitioning, Lakshmanan et al. presented a synchronization-aware partitioning heuristic for the MPCP, which organizes the tasks that share common resources into groups and attempts to assign each group to the same processor. Following the same principle, Nemati et al. presented a blocking-aware partitioning method that uses an advanced cost heuristic to split a task group when the entire group cannot be assigned to one processor. In subsequent work, Hsiu et al. proposed a dedicated-core framework that separates the execution of critical sections and normal sections and employs a priority-based mechanism for resource sharing, such that each request can be blocked by at most one lower-priority request. Wieder and Brandenburg proposed a greedy slacker partitioning heuristic in the presence of spin locks. The resource-oriented partitioned (ROP) scheduling was proposed by Huang et al. in 2016 and later refined by von der Brüggen et al. with release enforcement for a special case.
For priority assignment, most of the results in the literature use rate-monotonic (RM) or earliest-deadline-first (EDF) scheduling. To the best of our knowledge, the priority assignment for systems with shared resources has only been seriously explored in a small number of papers, e.g., relative deadline assignment under release enforcement, priority assignment for spinning, reasonable priority assignments under global scheduling, and the optimal priority assignment used in the greedy slacker algorithm. However, no theoretical evidence has been provided to quantify the non-optimality of the above heuristics.
Although many multiprocessor locking protocols have been proposed in the literature, there are a few unsolved fundamental questions when real-time tasks share resources (via locking mechanisms) in multiprocessor systems:
What is the fundamental difficulty?
What is the performance gap of partitioned, semi-partitioned, and global scheduling?
Is it always beneficial to prioritize critical sections?
To answer the above questions, we focus on the simplest and most basic setting: all tasks have the same period and always release their jobs at the same time, i.e., so-called frame-based real-time task systems, scheduled on identical (homogeneous) processors. Specifically, we assume that each critical section is non-nested and is guarded by exactly one binary semaphore or mutex lock.
Contribution: Our contributions are as follows:
We show that finding a schedule of the tasks that meets the given common deadline is NP-hard in the strong sense, regardless of the number of processors in the system. Therefore, there is no polynomial-time approximation algorithm that can bound the number of allocated processors needed to meet the given deadline. Moreover, the NP-hardness holds under any scheduling paradigm; hence, allowing preemption or migration does not reduce the computational complexity.
We propose a dependency graph approach for multiprocessor synchronization, which consists of two steps: 1) the construction of a directed acyclic graph (DAG), and 2) the scheduling of this DAG. We prove lower bounds on the approximation ratio of any such approach for minimizing the makespan, both under any scheduling paradigm and under partitioned or semi-partitioned scheduling.
We demonstrate how existing results in the literature on uniprocessor non-preemptive scheduling can be adopted to construct the DAG in the first step of the dependency graph approach when each task has only one critical section. This results in several polynomial-time scheduling algorithms with different constant approximation bounds for minimizing the makespan. Specifically, the best approximation developed is a polynomial-time approximation scheme with an approximation ratio of 1 + ε for any ε > 0 under semi-partitioned scheduling strategies. We further discuss methodologies and tradeoffs of preemptive against non-preemptive scheduling algorithms and of partitioned against semi-partitioned scheduling algorithms.
We also implemented the dependency graph approach as a prototype in [13, 8]. The experimental results show that its overhead is almost the same as that of state-of-the-art multiprocessor locking protocols. Moreover, we provide extensive numerical evaluations, which demonstrate the performance of the proposed approach under different scheduling constraints. Compared to the state-of-the-art resource-oriented partitioned (ROP) scheduling, our approach shows significant improvement.
2 System Model
2.1 Task Model
In this paper, we implicitly consider frame-based real-time task systems to be scheduled on identical (homogeneous) processors. The given tasks release their jobs at the same time and have the same period and relative deadline. Our studied problem is the task synchronization problem in which every task has exactly one non-nested critical section, denoted as TS-OCS. Specifically, each task releases a job (at time 0 for notational brevity) with the following properties:
the execution time of the first non-critical section of the job,
the execution time of the (first) critical section of the job, in which a binary semaphore or a mutex lock is used to control access to the critical section, and
the execution time of the second non-critical section of the job.
A subjob is a critical section or a non-critical section. Therefore, each job of a task consists of three subjobs. We assume the task set T is given and that the deadline is either implicit, i.e., identical to the period, or constrained, i.e., smaller than the period. The cardinality of a set X is denoted by |X|. We also make the following assumptions:
For each task in T, the execution times of the three subjobs are non-negative.
The execution of the critical sections guarded by one binary semaphore must be sequentially executed under a total order. That is, if two tasks share the same semaphore, their critical sections must be executed one after another without any interleaving.
The execution of a job cannot be parallelized, i.e., a job must be executed sequentially in the order of its first non-critical section, its critical section, and its second non-critical section.
There is a given total number of binary semaphores.
The paper will implicitly focus on the above task model. In Section 8, we will explain how the algorithms in this paper can be extended to periodic task systems under certain conditions.
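The task model above can be summarized in a small sketch; the field names e1, e2, e3 and the class layout are our own illustration, not notation from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    """One frame-based task with a single non-nested critical section."""
    e1: float       # execution time of the first non-critical section
    e2: float       # execution time of the critical section
    e3: float       # execution time of the second non-critical section
    semaphore: int  # index of the binary semaphore guarding e2

    @property
    def total(self) -> float:
        # total execution time of one job (executed strictly in order e1, e2, e3)
        return self.e1 + self.e2 + self.e3
```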
2.2 Scheduling Strategies
Here, we define scheduling strategies and the properties of a schedule for a frame-based real-time task system. Note that the terminology used here is limited to the scenario in which each task in T releases only one job at time 0. Therefore, we use the terms job and task interchangeably.
A schedule is an assignment of the given jobs (tasks) to the identical processors, such that each job is executed (not necessarily consecutively) until completion. A schedule for T can be defined as a function that maps every time instant on every processor either to the job executed there or to idling. We assume that a job has to be executed sequentially, i.e., intra-task parallelism is not possible; therefore, a job is never executed on two processors at the same time instant.
Some other constraints may also be imposed. A schedule is non-preemptive if a job cannot be preempted by any other job, i.e., each job is executed in one contiguous interval on one processor. A schedule is preemptive if a job may be preempted, i.e., the execution of a job may be split over more than one interval.
For a partitioned schedule, a job has to be executed entirely on one processor; such a schedule can be preemptive or non-preemptive. For a global schedule, a job can be executed on any of the processors at any point in time, i.e., it may run on different processors at different time instants. By definition, a global schedule is preemptive (for frame-based real-time task systems) in our model. For a semi-partitioned schedule, each subjob (either a critical section or a non-critical section) has to be executed entirely on one processor; such a schedule can be preemptive or non-preemptive.
Based on the above definitions, a partitioned schedule is also a semi-partitioned schedule, and a semi-partitioned schedule is also a global schedule.
2.3 Scheduling Theory
In the rich literature of scheduling theory, one specific objective is to minimize the completion time of the jobs, called the makespan. For frame-based real-time task systems, if the makespan of the jobs released at time 0 is no more than the relative deadline, then the task set can be feasibly scheduled to meet the deadline. (Note that the deadline is never larger than the period in our setting.) We state the makespan problem for TS-OCS studied here as follows:
The TS-OCS Makespan Problem: We are given identical (homogeneous) processors. The tasks arrive at time 0. Each task is given by the execution times of its three subjobs and has at most one critical section, guarded by one binary semaphore. The objective is to find a schedule that minimizes the makespan.
Alternatively, we can also investigate the bin packing version of the problem, i.e., minimizing the number of allocated processors needed to meet a given common deadline.
The TS-OCS Bin Packing Problem: The tasks arrive at time 0 with a given common deadline. Each task is given by the execution times of its three subjobs and has at most one critical section, guarded by one binary semaphore. The objective is to find a schedule that meets the deadline with the minimum number of allocated identical (homogeneous) processors.
Essentially, the decision versions of the makespan and the bin packing problems are identical:
The TS-OCS Schedulability Problem: We are given identical (homogeneous) processors. The tasks arrive at time 0 with a given common deadline. Each task is given by the execution times of its three subjobs and has at most one critical section, guarded by one binary semaphore. The objective is to find a schedule that meets the deadline by using the given processors.
In the domain of scheduling theory, a scheduling problem is described by a triplet α|β|γ.
α: describes the machine environment.
β: specifies the processing characteristics and constraints.
γ: presents the objective to be optimized.
For example, the scheduling problem 1|r_j|L_max deals with a uniprocessor system, in which the input is a set of jobs with different release times and different absolute deadlines, and the objective is to derive a non-preemptive schedule that minimizes the maximum lateness. The scheduling problem P||C_max deals with a homogeneous multiprocessor system, in which the input is a set of jobs with the same release time, and the objective is to derive a partitioned schedule that minimizes the makespan. The scheduling problem P|prec|C_max is an extension of P||C_max that further considers precedence constraints among the jobs, and P|pmtn, prec|C_max additionally allows preemption. Note that in classical scheduling theory, preemption on parallel machines implies the possibility of job migration from one machine to another. (In real-time systems, this is not necessarily the case; for instance, under preemptive partitioned scheduling, a job can be preempted and resumed later on the same processor without migration.) Therefore, the scheduling problem P|pmtn, prec|C_max allows job preemption and migration, i.e., preemptive global scheduling.
2.4 Approximation Metrics
Since many scheduling problems are NP-hard in the strong sense, polynomial-time approximation algorithms are often used. In the realm of real-time systems, there are two widely adopted metrics:
The Approximation Ratio compares the resulting objectives of (i) a scheduling algorithm and (ii) an optimal algorithm when scheduling any given task set. Formally, an algorithm for the makespan problem (i.e., Definition 1) has an approximation ratio ρ if, given any task set T, the resulting makespan is at most ρ times the minimum (optimal) makespan needed to schedule T on the given processors. An algorithm for the bin packing problem (i.e., Definition 2) has an approximation ratio ρ if, given any task set T, it can find a schedule of T that meets the common deadline using at most ρ times the minimum (optimal) number of processors required to feasibly schedule T.
The Speedup Factor [26, 36] of a scheduling algorithm indicates the factor by which the overall speed of a system would need to be increased so that the scheduling algorithm always derives a feasible schedule that meets the deadline, provided that one exists at the original speed. This metric is used for the problem in Definition 3.
3 Dependency Graph Approach for Multiprocessor Synchronization
To handle the studied makespan problem in Definition 1, we propose a Dependency Graph Approach, which involves two steps:
In the first step, a directed graph G is constructed. Each subjob (i.e., a critical section or a non-critical section) is a vertex in G. For each task, its first non-critical section is a predecessor of its critical section, and its critical section is a predecessor of its second non-critical section. If the jobs of two tasks share the same binary semaphore, then the critical section of one of them is a (transitive) predecessor of the critical section of the other, or vice versa. All the critical sections guarded by one binary semaphore thus form a chain in G, i.e., they follow a total order. Therefore, the edge set of G has the following properties:
For each task, there is a directed edge from its first non-critical section to its critical section, and a directed edge from its critical section to its second non-critical section.
Suppose that a set of tasks requires the same binary semaphore. Then, these tasks follow a certain total order, and there is a directed edge from the critical section of each task in this order to the critical section of its immediate successor.
Fig. 1 provides an example of a task dependency graph with one binary semaphore. Since the critical sections in the task set may be guarded by several binary semaphores, the task dependency graph in general consists of one connected subgraph per semaphore. In each connected subgraph, the critical sections of the tasks guarded by the same semaphore form a chain and have to be executed sequentially. For example, in Fig. 1, the dependency graph forces the scheduler to execute the first critical section in the chain prior to any of the other three critical sections.
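As a sketch of the first step (the naming is ours, not the paper's), the dependency graph for a chosen per-semaphore total order of the critical sections can be built as follows; each vertex is a pair of a task id and a subjob index (1 = first non-critical section, 2 = critical section, 3 = second non-critical section).

```python
from collections import defaultdict

def build_dependency_graph(tasks, semaphore_order):
    """Build the subjob DAG for a frame-based task set.

    tasks: dict task_id -> (e1, e2, e3), the execution times of the first
           non-critical section, the critical section, and the second
           non-critical section.
    semaphore_order: dict semaphore_id -> list of task_ids, the chosen
           total order of the critical sections guarded by that semaphore.
    Returns: dict vertex -> set of successor vertices.
    """
    edges = defaultdict(set)
    for i in tasks:
        # intra-task precedence: e1 -> e2 -> e3
        edges[(i, 1)].add((i, 2))
        edges[(i, 2)].add((i, 3))
    for order in semaphore_order.values():
        # critical sections sharing a semaphore form a chain
        for a, b in zip(order, order[1:]):
            edges[(a, 2)].add((b, 2))
    return edges
```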
In the second step, a corresponding schedule of G on the given processors is generated. The schedule can be based on the system's restrictions or the user's preferences, e.g., preemptive or non-preemptive, and global, semi-partitioned, or partitioned.
In the dependency graph approach, the second step has been widely studied in scheduling theory. A solution of the problem P|prec|C_max results in a semi-partitioned schedule, since the dependency graph is constructed by considering each critical section or non-critical section as a subjob, and a solution of the problem P|pmtn, prec|C_max results in a global schedule. For deriving a partitioned schedule, we can additionally force the subjobs generated by a job to be tied to one processor, targeting a partitioned non-preemptive or a partitioned preemptive schedule, respectively.
Therefore, the key issue is the construction of the dependency graph. An alternative view of the dependency graph approach is to build the dependency graph assuming a sufficient number of processors (i.e., using as many processors as needed) in the first step, and to consider the constraint on the number of processors only in the second step. Towards the first step, we need the following definition:
A critical path of a task dependency graph G is one of the longest paths of G, where the length of a path is the total execution time of the subjobs on it. The length of a critical path of G is called the critical path length of G.
For the rest of this paper, we denote by G* a dependency task graph of the input task set T that has the minimum critical path length. Note that G* is independent of the number of processors.
The critical path length of G* is a lower bound on the optimal makespan of the TS-OCS makespan problem for task set T on any number of processors.
This comes from the setting of the problem, i.e., each task has only one critical section guarded by one binary semaphore, and the definition of the graph G*, i.e., using as many processors as needed.
A feasible schedule of a task dependency graph G respects the precedence constraints defined in G and the specified scheduling requirement, e.g., being global/semi-partitioned/partitioned and preemptive/non-preemptive. The makespan of such a schedule is the latest completion time among all subjobs.
With the above definitions, we can recap the objectives of the two steps in the dependency graph approach: in the first step, we would like to construct a dependency graph with minimum critical path length, and in the second step, we would like to construct a schedule of this graph with minimum makespan.
We conclude this section by stating the following theorem:
The optimal makespan of the TS-OCS makespan problem for T on the given processors is at least the maximum of (i) the critical path length of G* and (ii) the total execution time of all tasks divided by the number of processors,
where G* is a dependency task graph of T that has the minimum critical path length.
The first lower bound comes from Lemma 1, and the second lower bound is due to the pigeonhole principle.
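The two lower bounds of this theorem can be computed directly from a subjob DAG; the following is an illustrative sketch with our own naming, where each vertex carries the execution time of its subjob.

```python
from functools import lru_cache

def makespan_lower_bound(weights, edges, m):
    """Lower bound on the makespan: the maximum of the critical path
    length and the total work divided by the number of processors m.

    weights: dict vertex -> execution time of that subjob.
    edges: dict vertex -> iterable of successor vertices.
    """
    @lru_cache(maxsize=None)
    def longest_from(v):
        # length of the longest path starting at v (including v itself)
        return weights[v] + max((longest_from(s) for s in edges.get(v, ())),
                                default=0.0)

    critical_path = max(longest_from(v) for v in weights)
    total_work = sum(weights.values())
    return max(critical_path, total_work / m)
```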
4 Computational Complexity and Lower Bounds
This section presents the computational complexity and lower bounds of approximation ratios of the dependency graph approach.
4.1 Computational Complexity
The following theorem shows that constructing G* is unfortunately NP-hard in the strong sense.
Constructing a dependency task graph that has the minimum critical path length is NP-hard in the strong sense.
This theorem is proved by a reduction from the decision version of the scheduling problem 1|r_j|L_max, i.e., uniprocessor non-preemptive scheduling, in which each job j in the given job set J has a known processing time p_j, arrival time r_j, and absolute deadline d_j, and the objective is to minimize the maximum lateness. This problem is NP-hard in the strong sense by a reduction from the 3-Partition problem. Suppose that the decision version of the scheduling problem is to validate whether there exists a schedule in which each job finishes no later than its absolute deadline.
Let D be any positive integer greater than the largest absolute deadline in J. For each job j in J, we construct a task with one critical section, where the execution time of the first non-critical section is set to r_j, the execution time of the critical section is set to p_j, and the execution time of the second non-critical section is set to D - d_j. By this setting, all three execution times are non-negative for every constructed task. The critical sections of all the constructed tasks are guarded by a single binary semaphore. Let the task set constructed above be T. This task set T is by definition a feasible input for the one-critical-section task synchronization problem.
We now prove that there is a non-preemptive uniprocessor schedule for J in which all the jobs meet their deadlines if and only if there is a dependency task graph with a critical path length less than or equal to D for the constructed task set T.
If part, i.e., a dependency task graph with critical path length at most D exists: Without loss of generality, we index the tasks in T so that the critical section of each task is the immediate predecessor of the critical section of the next task in the chain, e.g., as in Fig. 1. Consider the longest path that ends at the vertex representing the critical section of a task; its length is the accumulated execution demand up to and including that critical section. Since this path extended by the corresponding second non-critical section has length at most D, and the second non-critical section of the task constructed from job j has execution time D - d_j, the accumulated demand up to and including the critical section of that task is at most d_j.
We can now construct the uniprocessor non-preemptive schedule for J by following the same execution order, indexing the jobs in J correspondingly to T. The finishing time of each job j is then no later than the length of the longest path ending at the corresponding critical section, which is at most d_j.
This proves the if part.
Only-if part, i.e., there is a uniprocessor non-preemptive schedule in which all the deadlines of the jobs in J are met: The proof of the if part can be reversed and the same arguments apply. Due to space limitations, we omit the details.
The makespan problem with task synchronization for T is NP-hard in the strong sense, even if the number of processors is sufficiently large, under any scheduling paradigm.
This comes directly from Theorem 2. Consider that there are sufficiently many processors. The if-and-only-if proof of Theorem 2 can be extended by introducing a concrete schedule that executes the two non-critical sections of each task on its own processor and all critical sections on one additional shared processor. (The same statement also holds for a smaller number of processors, but the proof is more involved.)
Theorem 3 expresses the fundamental difficulty of the multiprocessor synchronization problem and shows that even a very simplified version of this problem is NP-hard in the strong sense, regardless of the number of processors and the underlying scheduling paradigm. Therefore, allowing preemption or migration does not reduce the computational complexity. The fundamental problem is the sequencing of the critical sections, which is independent of the underlying scheduling paradigm. Therefore, no matter what flexibility the scheduling algorithm has (unless aborting and restarting a critical section is allowed), the problem remains NP-hard in the strong sense.
4.2 Remarks: Bin Packing
Although the focus of this paper is the makespan problem in Definition 1 and the schedulability problem in Definition 3, we also state the following theorems to explain the difficulty of the bin packing problem in Definition 2.
Minimizing the number of processors for a given common deadline of T with task synchronization (i.e., Definition 2) is NP-hard in the strong sense under any scheduling paradigm.
There is no polynomial-time (approximation) algorithm that minimizes the number of processors for a given common deadline of T with task synchronization under any scheduling paradigm, unless P = NP.
4.3 Lower Bounds
The dependency graph approach requires two steps. The following theorem shows that even if both steps are optimized, the resulting schedule for the makespan problem with task synchronization is not optimal and has an asymptotic lower bound of the approximation ratio.
The optimal schedule on identical processors for the dependency graph that has the minimum critical path length is not optimal for the TS-OCS makespan problem and can have an approximation bound of at least
under any scheduling paradigm, and
under partitioned or semi-partitioned scheduling.
We prove this theorem by providing a concrete input instance as follows:
Suppose that is a given integer with and we have tasks.
We assume a small positive number which is close to and a number which is much greater than , i.e., .
All tasks have a critical section guarded by the same binary semaphore.
Task has , and
Task has , and for .
We need to show that the optimal dependency graph of this input instance in fact leads to the specified bound. The proof is in the appendix.
5 Algorithms to Construct the Dependency Graph
The key to success is to find G*, a dependency graph with minimum critical path length. Unfortunately, as shown in Theorem 2, finding G* is NP-hard in the strong sense. However, finding good approximations is possible. The problem of constructing G* is called the dependency-graph construction problem. Here, instead of presenting new algorithms, we explain how to use existing algorithms for the scheduling problem 1|r_j|L_max to derive good approximations of G*.
It should first be noted that the problem 1|r_j|L_max cannot be approximated with a bounded approximation ratio, because the optimal schedule may have no lateness at all, in which case any multiplicative approximation guarantee is meaningless. However, a variant of this problem can be approximated well: the delivery-time model of 1|r_j|L_max. In this model, each job j has a release time r_j, a processing time p_j, and a delivery time q_j. After a job finishes its execution on the machine, its result (final product) needs q_j additional time to be delivered to the customer. The objective is to minimize the makespan, i.e., the maximum over all jobs of the completion time plus the delivery time. Minimizing the maximum lateness with respect to absolute deadlines d_j is effectively equivalent to minimizing the makespan in the delivery-time model when q_j is set to K - d_j for a sufficiently large constant K.
The delivery-time model of the problem 1|r_j|L_max can then be effectively approximated. Moreover, our problem of constructing a good dependency graph for T is indeed equivalent to the delivery-time model of 1|r_j|L_max. To show this equivalence, Algorithm 1 presents the detailed transformation. For each semaphore, consider the set of tasks that use it (Line 1 in Algorithm 1). For each such task set, we transform the problem of constructing the dependency graph into an equivalent delivery-time instance of 1|r_j|L_max (Lines 3 to 8). Then, we construct the graph based on the schedule derived by an approximation algorithm for this delivery-time instance.
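The per-semaphore part of this transformation can be sketched as follows (the function and field names are our own): each task with subjob execution times (e1, e2, e3) using semaphore s becomes a job with release time e1, processing time e2, and delivery time e3 in the delivery-time instance of semaphore s.

```python
def to_delivery_time_instances(tasks, semaphore_of):
    """Group the tasks by semaphore and emit one delivery-time instance
    of 1|r_j|Lmax per semaphore.

    tasks: dict task_id -> (e1, e2, e3) subjob execution times.
    semaphore_of: dict task_id -> semaphore used by the task.
    Returns: dict semaphore -> list of (task_id, r, p, q) with
             r = e1, p = e2, q = e3.
    """
    instances = {}
    for tid, (e1, e2, e3) in tasks.items():
        instances.setdefault(semaphore_of[tid], []).append((tid, e1, e2, e3))
    return instances
```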
An α-approximation algorithm for the delivery-time model of the problem 1|r_j|L_max, applied in Algorithm 1, guarantees to derive a dependency graph whose critical path length is at most α times that of G*, i.e., of a dependency graph with minimum critical path length.
This theorem can be proved as a counterpart of the proof of Theorem 2. We show that Algorithm 1 is in fact an L-reduction (i.e., a reduction that preserves the approximation ratio) from the input task set to the delivery-time model of the problem 1|r_j|L_max. In this L-reduction, there is no loss of the approximation ratio.
First, by definition, two tasks are independent if they do not share any semaphore. Moreover, since the TS-OCS problem assumes that a task accesses at most one binary semaphore, each task appears in at most one of the per-semaphore task sets. Therefore, the per-semaphore instances can be handled independently.
To show that the reduction preserves the approximation ratio, we only need to prove a one-to-one mapping. One possibility would be to prove that a schedule for the delivery-time instance delivers its last result at a certain time if and only if the graph constructed by Lines 9 and 10 in Algorithm 1 has exactly that critical path length. This is unfortunately not possible, because a (technically bad but possible) schedule for the delivery-time instance can be arbitrarily altered by inserting useless delays.
Fortunately, for a given permutation of the tasks sharing a semaphore, we can always construct a schedule for the delivery-time instance that respects the given order and the release times. Such a schedule delivers its last result at a certain time if and only if the graph constructed by Lines 9 and 10 in Algorithm 1 has exactly that critical path length. Moreover, the schedule for one such permutation is optimal for the delivery-time instance.
Therefore, the approximation ratio is preserved while constructing the dependency graph, and the theorem follows from the above discussions.
According to Theorem 7 and Algorithm 1, we can simply apply existing algorithms for the scheduling problem 1|r_j|L_max in the delivery-time model to derive G*, e.g., by using well-studied branch-and-bound methods [14, 32, 35], or to derive good approximations of G* [22, 37]. Here, we summarize several polynomial-time approximation algorithms; the details can be found in the cited literature.
For the delivery-time model of the scheduling problem 1|r_j|L_max, the extended Jackson's rule (JKS) is as follows: "Whenever the machine is free and one or more jobs is available for processing, schedule an available job with largest delivery time."
The extended Jackson’s rule (JKS) is a polynomial-time -approximation algorithm for the dependency-graph construction problem.
Potts observed some nice properties when the extended Jackson's rule is applied. Suppose that the last delivery is due to a job b. Let a be the earliest scheduled job such that the machine is not idle between the processing of a and b. The sequence of jobs executed consecutively from a to b is called a critical sequence. By the definition of a, all jobs in the critical sequence are released no earlier than the release time of job a. If no job in the critical sequence has a delivery time shorter than that of b, it can be proved that the extended Jackson's rule is optimal for the delivery-time instance. However, if some job in the critical sequence has a delivery time shorter than that of b, the extended Jackson's rule may have started a non-preemptive job too early. The last such job in the critical sequence is called the interference job of the critical sequence.
Potts suggested to attempt to improve the schedule by forcing the interference job to be executed after the critical job b, i.e., by delaying the release time of the interference job accordingly. This procedure is repeated for a bounded number of iterations, and the best schedule among the iterations is returned as the solution.
Potts’ iterative process (Potts) is a polynomial-time -approximation algorithm for the dependency-graph construction problem.
Hall and Shmoys further improved the approximation ratio to 4/3 by handling a special case in which two jobs with large processing times interfere with each other and by running Potts' algorithm for additional iterations. (They further use the concept of forward and inverse problems of the input instance of 1|r_j|L_max; as these are not closely related to our discussion, we omit the details.)
Algorithm HS is a polynomial-time -approximation algorithm for the dependency-graph construction problem.
The algorithm that has the best approximation ratio for the delivery-time model of the problem is a polynomial-time approximation scheme (PTAS) developed by Hall and Shmoys .
The dependency-graph construction problem admits a polynomial-time approximation scheme (PTAS), i.e., the approximation bound is under the assumption that is a constant for any .
6 Algorithms to Schedule Dependency Graphs
This section presents our heuristic algorithms to schedule the dependency graph derived from Algorithm 1. We first consider the special case when there is a sufficient number of processors, i.e., .
For a task set T to be scheduled on identical processors, the makespan of the schedule that executes each task on only one processor as early as possible while respecting the precedence constraints defined in a given task dependency graph is if . By definition, the schedule is a partitioned schedule for the given jobs and non-preemptive with respect to the subjobs.
Since , all the tasks can start their first non-critical sections at time . Therefore, the critical section of task arrives exactly at time . Then, the finishing time of the critical section of task is exactly the longest path in that finishes at the vertex representing . Therefore, the makespan of such a schedule is exactly .
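The longest-path computation this argument relies on can be sketched as follows (a minimal sketch; vertex numbering and names are ours):

```python
from collections import deque

def critical_path_length(durations, preds):
    """Longest path of a task dependency graph (DAG): vertex j has execution
    time durations[j] and preds[j] lists its predecessors.  When every task
    can run on its own processor, the makespan of the as-early-as-possible
    schedule equals exactly this value."""
    n = len(durations)
    succs = [[] for _ in range(n)]
    indegree = [len(p) for p in preds]
    for j, ps in enumerate(preds):
        for p in ps:
            succs[p].append(j)
    # earliest finish time of each vertex, in topological order (Kahn's algorithm)
    finish = [0.0] * n
    queue = deque(j for j in range(n) if indegree[j] == 0)
    while queue:
        j = queue.popleft()
        finish[j] = max((finish[p] for p in preds[j]), default=0.0) + durations[j]
        for s in succs[j]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    return max(finish)
```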
For the remaining part of this section, we will focus on the other case when . We will heavily utilize the concept of list schedules developed by Graham  and its extensions to schedule the dependency graph derived from Section 5. A list schedule works as follows: whenever a processor idles and there are subjobs eligible to be executed (i.e., all of their predecessors in have finished), one of the eligible subjobs is executed on the processor. When the number of eligible subjobs is larger than the number of idle processors, many heuristic strategies exist to decide which subjobs should be executed with higher priorities. Graham  showed that list schedules can be generated in polynomial time and have an approximation ratio of for the scheduling problem .
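A list schedule for a given dependency graph can be sketched as follows (a simplified simulator with our own naming; ties among eligible subjobs are resolved by a caller-supplied priority key):

```python
import heapq

def list_schedule(durations, preds, M, priority=None):
    """Graham list scheduling on M identical processors: whenever a processor
    is free and subjobs are eligible (all predecessors finished), start an
    eligible subjob.  `priority` ranks eligible subjobs (smaller first);
    returns the makespan of the generated schedule."""
    n = len(durations)
    priority = priority or (lambda j: j)
    succs = [[] for _ in range(n)]
    indegree = [len(p) for p in preds]
    for j, ps in enumerate(preds):
        for p in ps:
            succs[p].append(j)
    ready = [j for j in range(n) if indegree[j] == 0]
    running, free, t, done = [], M, 0, 0
    while done < n:
        # dispatch eligible subjobs to free processors in priority order
        ready.sort(key=priority)
        while ready and free > 0:
            j = ready.pop(0)
            heapq.heappush(running, (t + durations[j], j))
            free -= 1
        # advance to the next completion(s) and release successors
        t = running[0][0]
        while running and running[0][0] == t:
            _, j = heapq.heappop(running)
            free += 1
            done += 1
            for s in succs[j]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    ready.append(s)
    return t
```

For example, two independent subjobs of lengths 2 and 3 followed by a common successor of length 1 on two processors finish at time 4.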
For the rest of this section, we will explain how to use or extend list schedules to generate partitioned or semi-partitioned and preemptive or non-preemptive schedules based on .
6.1 Semi-Partitioned Scheduling
In a list schedule, since the subjobs of a task are scheduled individually, a task in the generated list schedule may migrate among different processors, thus representing a semi-partitioned schedule. However, a subjob by default is non-preemptive in list schedules.
The following lemma is widely used in the literature for the list schedules developed by Graham . The existing results of federated scheduling, e.g., [31, 6, 15], for scheduling sporadic dependent tasks (whose dependencies are not due to synchronization) implicitly or explicitly use the property in this lemma.
The makespan of a list schedule of a given task dependency graph for task set T on processors is at most .
The original proof can be traced back to Theorem 1 by Graham  in 1969. We omit the proof here as this is a standard procedure in the proof of list schedules for the scheduling problem .
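Assuming the lemma takes Graham's classic form, it can be written out as follows, with $\mathrm{vol}(G)$ denoting the total execution time of all subjobs in $G$ and $\mathrm{len}(G)$ the length of its longest path (a reconstruction, not a quotation of the lemma):

```latex
\text{makespan} \;\le\; \frac{\mathrm{vol}(G)}{M} \;+\; \Bigl(1-\frac{1}{M}\Bigr)\,\mathrm{len}(G)
```

Intuitively, whenever at least one processor idles, some subjob on the current critical path is executing, so the idle time accumulated over all $M$ processors is at most $(M-1)\,\mathrm{len}(G)$; dividing the sum of busy and idle time by $M$ yields the bound.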
If for a certain , the makespan of a list schedule of the task dependency graph for task set T on processors has an approximation bound of if .
Since , the makespan of a list schedule of , denoted as , is
We now conclude the approximation ratio.
The default list schedules are non-preemptive at the subjob level. However, it may be more efficient if the second non-critical section of a task can be preempted by a critical section. Otherwise, the processors may be busy executing second non-critical sections while a critical section has to wait. As a result, not only this critical section itself but also its successors in may be unnecessarily postponed, thereby increasing the makespan. This problem can be handled by preempting second non-critical sections. Allowing such preemption in the scheduler design can be achieved easily as follows:
In the algorithm, a scheduling decision is made whenever a subjob becomes eligible or finishes.
Whenever a subjob representing a critical section is eligible, it can be assigned to a processor that is executing a second non-critical section of a job by preempting that subjob.
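The rule can be stated as a tiny placement function (illustrative names and processor states, not from the paper):

```python
def place_critical_section(proc_state):
    """When a subjob representing a critical section becomes eligible, prefer
    an idle processor; otherwise preempt a processor that is currently
    executing a second non-critical section.  proc_state maps processor id to
    one of "idle", "first", "critical", "second".  Returns a pair
    (processor, preempts), or (None, False) if the critical section must wait."""
    for p, s in proc_state.items():
        if s == "idle":
            return p, False        # no preemption needed
    for p, s in proc_state.items():
        if s == "second":
            return p, True         # preempt the second non-critical section
    return None, False             # all processors run first/critical sections
```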
6.2 Partitioned Scheduling
In a partitioned schedule of the frame-based task set , all subjobs of a task must be executed on the same processor. Therefore, the list scheduling algorithm variant must ensure that once the first subjob of task is executed on a processor, all subsequent subjobs of task are tied to the same processor in any generated list schedule. Specifically, the problem is termed as in Section 2.3.
A special case of has been recently studied to analyze OpenMP systems by Sun et al.  in 2017. They assumed that the synchronization subjob of a task always takes place at the end of the task. Our dependency graph unfortunately does not satisfy this assumption because the synchronization subjob is in fact in the middle of a task. However, fixing this issue is not difficult. We illustrate the key strategy by using Fig. 2. The subgraph of that consists of only the vertices of the first non-critical sections and the critical sections in fact satisfies the assumption made by Sun et al. . Therefore, we can generate a multiprocessor schedule for the dependency graph on processors by using the BFS algorithm (an extension of the breadth-first-scheduling algorithm) by Sun et al. . The subjobs that represent the second non-critical sections can be treated as background workload, executed only at the end of the schedule or when the available idle time is sufficient to complete .
Alternatively, in order to improve the parallelism, another heuristic algorithm can be applied in which all the first non-critical sections are scheduled before any of the critical sections using list scheduling. Once the first non-critical section of task is assigned to a processor, the remaining execution of task is forced to be executed on that processor.
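A minimal sketch of this heuristic (the dispatch order is our assumption, longest first section onto the least-loaded processor, since the text does not fix a particular list order):

```python
import heapq

def partition_by_first_sections(first_lengths, M):
    """Partitioning heuristic sketched above: list-schedule only the first
    non-critical sections onto M processors, then pin each task to the
    processor its first section received for the rest of its execution."""
    load = [(0, m) for m in range(M)]     # (current load, processor id)
    heapq.heapify(load)
    assignment = {}
    for task in sorted(range(len(first_lengths)),
                       key=lambda i: -first_lengths[i]):
        l, m = heapq.heappop(load)        # least-loaded processor
        assignment[task] = m
        heapq.heappush(load, (l + first_lengths[task], m))
    return assignment
```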
If the second non-critical sections can be preempted, the subjobs that represent them can be treated as background workload: they are executed whenever their processor idles and are preempted by the first non-critical sections or the critical sections on that processor. For completeness, we illustrate the algorithm in Algorithm 2 in the Appendix.
7 Timing Anomaly
So far, we have assumed that , , and are exact for a task . However, the execution of a subjob of task can finish earlier than the worst case. It should be noted that list schedules are in this case not sustainable, i.e., the reduction of the execution time of a subjob can lead to a worse makespan due to the well-known multiprocessor timing anomaly observed by Graham . There are three ways to handle such timing anomalies: 1) ignore the early completion and stick to the offline schedule, 2) reclaim the unused time (slack) carefully without creating timing anomalies, e.g., , or 3) use a safe upper bound, e.g., Lemma 7, to account for all possible list schedules. Each of them has advantages and disadvantages. It is up to the designers to choose whether they want to be less effective (Option 1), pay more runtime overhead (Option 2), or be more pessimistic by always taking a safe upper bound (Option 3).
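Graham's classic anomaly can be reproduced in a few lines: under a fixed priority list on three processors, shortening every subjob by one time unit increases the makespan from 12 to 13 (a self-contained sketch; the scheduler below is our own simplified simulator, not the paper's implementation):

```python
import heapq

def makespan(p, preds, M):
    """Minimal list scheduler (fixed priority = index order), used only to
    demonstrate the anomaly on a concrete instance."""
    n = len(p)
    succ = [[] for _ in range(n)]
    indeg = [len(q) for q in preds]
    for j, qs in enumerate(preds):
        for q in qs:
            succ[q].append(j)
    ready = [j for j in range(n) if indeg[j] == 0]
    run, free, t, done = [], M, 0, 0
    while done < n:
        ready.sort()
        while ready and free > 0:          # dispatch in list (index) order
            j = ready.pop(0)
            heapq.heappush(run, (t + p[j], j))
            free -= 1
        t = run[0][0]                      # advance to the next completion(s)
        while run and run[0][0] == t:
            _, j = heapq.heappop(run)
            free += 1
            done += 1
            for s in succ[j]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
    return t

# Graham's classic instance on M = 3 processors (0-indexed):
# subjob 0 precedes subjob 8, and subjob 3 precedes subjobs 4..7.
preds = [[], [], [], [], [3], [3], [3], [3], [0]]
slow = [3, 2, 2, 2, 4, 4, 4, 4, 9]   # makespan(slow, preds, 3) == 12
fast = [2, 1, 1, 1, 3, 3, 3, 3, 8]   # every subjob one unit shorter
# makespan(fast, preds, 3) == 13: shorter execution times, longer makespan
```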
Due to multiprocessor timing anomaly, a dependency graph with a longer critical path may have a better makespan in the resulting list schedule. Our approach can be easily improved by returning and scheduling the intermediate dependency graphs in Algorithms Potts and HS.
8 Periodic Tasks with Different Periods
Our approach can be extended to periodic tasks with different periods under the assumption that a binary semaphore is only shared among tasks that have the same period. For each of the semaphores, a DAG is constructed using Algorithm 1. Afterwards, the resulting DAGs can be scheduled using any approach for multiprocessor DAG scheduling, e.g., global scheduling , Federated Scheduling , as well as enhanced versions like Semi-Federated Scheduling  and Reservation-Based Federated Scheduling .
9 Evaluations
This section presents the evaluations of the proposed approach. We will first explain how our approach can be implemented by using existing routines in and provide the measured overheads in . Then, we will demonstrate the performance of the proposed approach by numerical evaluations for different configurations.
9.1 Implementations and Overheads
The hardware platform used in our experiments is a cache-coherent SMP, consisting of two 64-bit Intel Xeon E5-2650L v4 processors running at 1.7 GHz, with 35 MB cache and 64 GB of main memory. We have implemented our dependency graph approach in in order to investigate the overheads. Both the partitioned and semi-partitioned scheduling algorithms presented in Section 6 have been implemented in under the plug-in Partitioned Fixed Priority (P-FP), as detailed in the Appendix. The patches of our implementation have been released in .
In Table I, we report the following overheads of different protocols, including the existing protocols DPCP and MPCP in , and our implementations of the partitioned dependency graph approach (PDGA) and the semi-partitioned dependency graph approach (SDGA):
CXS: context-switch overhead.
RELEASE: time spent to enqueue a newly released job in a ready queue.
SCHED2: time spent to perform post context switch and management activities.
SCHED: time spent to make a scheduling decision (scheduler to find the next job).
SEND-RESCHED: inter-processor interrupt latency, including migrations.
Table I shows that the overheads of our approach and of other protocols implemented in are comparable.
| CXS | 30.93 (1.51) | 31.1 (0.67) | 31.21 (0.71) | 30.95 (1.54) |
| RELEASE | 32.63 (3.96) | 19.48 (3.91) | 19.77 (4.03) | 21.64 (4.3) |
| SCHED2 | 28.7 (0.18) | 29.78 (0.15) | 29.91 (0.16) | 29.74 (0.2) |
| SCHED | 31.43 (1.2) | 31.38 (0.78) | 31.4 (0.83) | 31.26 (1.11) |
| SEND-RESCHED | 47.01 (14.42) | 31.83 (3.45) | 45.23 (4.33) | 41.53 (7.24) |
9.2 Numerical Performance Evaluations
We conducted evaluations with = 4, 8, and 16 processors. Depending on , we generate task sets, each with tasks. For each task set T, we generated synthetic tasks with by applying the RandomFixedSum method  and enforced that for each task . The number of shared resources (binary semaphores) was set to . The length of the critical section is a fraction of the total execution time of task , depending on . The remaining part was split into and by drawing uniformly at random from and setting to .
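The per-task segment split described above can be sketched as follows (function and parameter names are ours; the total execution times themselves would come from the RandomFixedSum method):

```python
import random

def split_execution_time(total, beta, rng=random):
    """Split a task's total execution time into (first non-critical, critical,
    second non-critical) segments: the critical section is a fraction `beta`
    of the total, and the remainder is divided by a uniform draw."""
    crit = beta * total
    rest = total - crit
    first = rng.uniform(0, rest)           # first non-critical section
    return first, crit, rest - first
```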
For a generated task set T, we calculated a lower bound on the optimal makespan based on Eq. (1). Since deriving is computationally expensive, we used as a safe approximation for , where is the sum of the lengths of the critical sections that share semaphore . If the relative deadline of the task set is less than , the task set is not schedulable by any algorithm. We compare the performance of the different algorithms according to the acceptance ratio by setting the relative deadline in the range of . We name the developed algorithms using the following rules: 1) JKS/POTTS in the first part: using the extended Jackson’s rule or Potts to construct the dependency graph (we did not implement Lemma 5 due to its complexity; Algorithm HS in general has performance similar to POTTS); 2) SP/P in the second part: the semi-partitioned or partitioned scheduling algorithm is applied (in Section 6.2, we presented two strategies for task partitioning: one based on  (detailed in the Appendix) and a simple heuristic that performs list scheduling based on the first non-critical sections; in all experiments regarding partitioned scheduling, we observed that the latter performed better, so all presented results for partitioned scheduling are based on the simple heuristic); 3) P/NP in the third part: preemptive or non-preemptive scheduling of the second non-critical sections.
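A safe lower bound of this kind can be sketched as follows (a simplification: we do not reproduce Eq. (1) itself, only bounds that are certainly valid, namely the total workload divided by the number of processors, the longest single task, and the serialized critical sections per semaphore):

```python
def makespan_lower_bound(tasks, M):
    """Lower bound on the optimal makespan.  Each task is a tuple
    (first, crit, second, semaphore): critical sections guarded by the same
    binary semaphore must be serialized, each task executes sequentially,
    and the total workload must fit on M processors."""
    total = sum(f + c + s for f, c, s, _ in tasks)
    longest_task = max(f + c + s for f, c, s, _ in tasks)
    per_sem = {}
    for f, c, s, sem in tasks:
        per_sem[sem] = per_sem.get(sem, 0) + c
    return max(total / M, longest_task, max(per_sem.values()))
```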
We evaluated all 8 combinations under different settings as shown in Fig. 3. Due to space limitations, only a subset of the results is presented. In general, the semi-partitioned scheduling algorithms clearly outperform the partitioned strategies, independently of the algorithm used to construct the dependency graph. In addition, the preemptive scheduling policy with respect to the second computation segment is superior to the non-preemptive strategy, and POTTS (usually) performs slightly better than JKS. We analyze the effect of the three parameters individually by changing:
We also compare our approach with the Resource Oriented Partitioned (ROP) scheduling with release enforcement by von der Brüggen et al. , which is designed to schedule periodic tasks with one critical section on a multiprocessor platform. The concept of ROP is to take a resource-centric view instead of a processor-centric view. The algorithm 1) binds the critical sections of the same resource to the same processor, thus enabling well-known uniprocessor protocols like the PCP to handle the synchronization, and 2) schedules the non-critical sections on the remaining processors using a state-of-the-art scheduler for segmented self-suspension tasks, namely SEIFDA . Among the methods in , we evaluated FP-EIM-PCP (under fixed-priority scheduling) and EDF-EIM-PCP (under dynamic-priority scheduling). It has been shown in  that EDF-EIM-PCP dominates all existing methods. We performed another set of evaluations by adopting the aforementioned settings and testing the utilization level in steps of , where the utilization of a task set T is . Fig. 4 presents the evaluation results. Due to space limitations, only a subset of the results is presented, but the others have very similar curve tendencies. For readability, we only select the two combinations of our proposed approach that outperform the others. The results in Fig. 4 show that for frame-based tasks, our approach outperforms ROP significantly. We note that Fig. 4 is only for frame-based tasks; the results for the periodic task systems discussed in Section 8 are presented in the Appendix.
10 Conclusion
This paper tries to answer a few fundamental questions that arise when real-time tasks share resources in multiprocessor systems. Here is a short summary of our findings:
The fundamental difficulty is mainly due to the sequencing of the mutually exclusive accesses to the shared resources (binary semaphores). Adding more processors, removing periodicity and job recurrence, introducing task migration, or allowing preemption does not make the problem easier from the computational-complexity perspective.
The performance gap between partitioned and semi-partitioned scheduling in our study is mainly due to the capability to schedule the subjobs constrained by the dependency graph. Although partitioned scheduling may seem much worse than semi-partitioned scheduling in our evaluations, this is mainly due to the lack of understanding of the problem in the literature. Further explorations are needed to understand these scheduling paradigms for a given dependency graph.
The dependency graph approach is not work-conserving for the critical sections, since a critical section may be ready but not executed due to the artificially introduced precedence constraints. Existing multiprocessor synchronization protocols mainly grant access to the critical sections in a work-conserving manner via priority boosting. Our study reveals a potential to consider cautious, non-work-conserving synchronization protocols in the future.
Acknowledgement: This paper is supported by DFG, as part of the Collaborative Research Center SFB876, project A3 and B2 (http://sfb876.tu-dortmund.de/). The authors thank Zewei Chen and Maolin Yang for their tool SET-MRTS (Schedulability Experimental Tools for Multiprocessors Real Time Systems, https://github.com/RTLAB-UESTC/SET-MRTS-public) to evaluate the LP-GFP-FMLP, LP-PFP-DPCP, LP-PFP-MPCP, GS-MSRP, and LP-GFP-PIP in Fig. 5.
Proof of Theorem 6. Due to the design of the task set, there are only different dependency graphs, depending on the position of in the execution order. Suppose that the critical section of task is the -th critical section in the dependency graph. It can be proved that the critical path of this dependency graph is . We sketch the proof:
The non-critical section must be part of the critical path since , which is greater than any for any .
The longest path that ends at the vertex representing has 1) one non-critical section, 2) critical sections from for , and 3) one critical section from task . Therefore, this length is .
Combining the two scenarios, we reach the conclusion.
Therefore, the dependency graph that has the minimum critical path length is the one where ’s critical section is the first one among the critical sections. The optimal schedule of the dependency graph on processors has the following properties:
Task finishes its critical section at time .
Before time , none of the second non-critical sections is executed. Therefore, the makespan of any feasible schedule of on processors is
Moreover, when the scheduling policy is either semi-partitioned or partitioned scheduling, by the pigeonhole principle, at least one processor must execute of the second non-critical sections no earlier than . Therefore, the makespan of a feasible semi-partitioned or partitioned schedule of on processors is
We can have another feasible partitioned schedule :
The first non-critical section is executed on processor , and the first non-critical sections of the other tasks are executed on the first processors based on list scheduling. All the first non-critical sections finish no later than . Each of the first processors executes exactly tasks since there are tasks with identical properties on these processors.
The critical sections of tasks are executed sequentially by following the above reversed-index order on the same processor of the corresponding first non-critical sections, starting from time .
At time , all the second non-critical sections of are eligible to be executed. We execute them in parallel on the first processors by respecting the partitioned scheduling strategy. That is, each of the first processors executes exactly tasks with . The makespan of these tasks is .
At time , the critical section of starts its execution on processor . Furthermore, at time , the second non-critical section of is executed on processor and it is finished at time .
As a result, the makespan of the above partitioned schedule is exactly .
Therefore, the approximation bound of the optimal task dependency graph approach is at least under any scheduling paradigm and at least under the partitioned or semi-partitioned scheduling paradigm. We reach the conclusion by taking .
Pseudo-code of the Partitioned Preemptive Scheduling in Section 6.2
For notational brevity, we define two vertices and to represent the first and second non-critical sections of task and to represent the critical section of task . Let be the set of tasks in T assigned to processor for . The pseudo-code is listed in Algorithm 2. It consists of three blocks: initialization from Line 1 to Line 4, scheduling of the first non-critical sections and the critical sections of the tasks according to from Line 5 to Line 23, and scheduling of the second non-critical sections of the tasks from Line 24 to Line 28.
The first block of Algorithm 2 is self-explanatory. We will focus on the second and third blocks. Our scheduling algorithm executes the first non-critical sections and the critical sections non-preemptively. Whenever a subjob finishes at time , we examine the following scenarios on each processor for :
If there is a pending critical section on processor that is eligible at time according to the dependency graph , we would like to execute the critical section as soon as possible. Therefore, this critical section is executed as soon as it is eligible and the processor idles (i.e., Lines 12-13).
Else if there is a task in in which its first non-critical section has not finished yet at time , we would like to execute it (Lines 14-15).
Otherwise, there is no eligible subjob to be executed at time . If there is still an unassigned task, we select one and assign it to processor by starting its first non-critical section at time (Lines 16-19).
In all the above steps, a task can be arbitrarily selected if there are multiple tasks satisfying the specified conditions. We note that the schedule is in fact constructed offline. Therefore, after we finish the schedule of the first non-critical sections and the critical sections, in the third block of Algorithm 2 we can pad the idle time of the schedule on a processor with the second non-critical sections assigned to that processor, starting from time . The only caveat is that a second non-critical section must not start earlier than the finishing time of its critical section. Of course, to minimize the makespan, we should always pad the idle time as early as possible.
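The padding step of the third block can be sketched as a greedy fill of each processor's idle intervals (our own simplified formulation; intervals and job tuples are illustrative):

```python
def pad_second_sections(idle, jobs):
    """Offline padding of the (preemptable) second non-critical sections into a
    processor's idle time, as early as possible but never before the finishing
    time of the task's critical section.  `idle` is a sorted list of
    (start, end) idle intervals (the last may end at float('inf')); each job
    is (earliest_start, length).  Returns the finishing time of each job."""
    finish = []
    for est, length in jobs:
        remaining, t, new_idle = length, est, []
        for s, e in idle:
            lo = max(s, est)               # never start before the earliest start
            if remaining > 0 and e > lo:
                take = min(remaining, e - lo)
                if s < lo:
                    new_idle.append((s, lo))
                if lo + take < e:
                    new_idle.append((lo + take, e))
                remaining -= take
                t = lo + take
            else:
                new_idle.append((s, e))    # interval unusable or job already done
        idle = new_idle
        finish.append(t)
    return finish
```

For instance, a section of length 3 whose critical section finished at time 0 fills the idle interval [2, 4) and then completes at time 7 in the open-ended idle time after 6.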