We systematically survey the literature on analytically sound multiprocessor real-time locking protocols from 1988 until 2018, covering the following topics:
progress mechanisms that prevent the lock-holder preemption problem (Section 3),
spin-lock protocols (Section 4),
suspension-based semaphore protocols (Section 5),
protocols that rely on designated synchronization processors (Section 6),
independence-preserving (or fully preemptive) locking protocols (Section 7),
reader-writer and k-exclusion synchronization (Section 8),
support for nested critical sections (Section 9), and
implementation and system-integration aspects (Section 10).
A special focus is placed on the suspension-oblivious and suspension-aware analysis approaches for semaphore protocols, their respective notions of priority inversion, optimality criteria, lower bounds on maximum priority-inversion blocking, and matching asymptotically optimal locking protocols.
In contrast to the thoroughly explored and well-understood uniprocessor real-time synchronization problem, the multiprocessor case considered herein is still the subject of much ongoing work. In particular, uniprocessor real-time locking protocols that are both optimal and practical have been readily available since the late 1980s and early 1990s [B:91, SRL:90, R:91], can flexibly support mutual exclusion, reader-writer synchronization, and multi-unit resources, and have been widely adopted and deployed in industry (in POSIX, OSEK/AUTOSAR, etc.). Not so in the multiprocessor case: there is no widely agreed-upon standard protocol or approach, most proposals have focused exclusively on mutual exclusion to date (with works on reader-writer, k-exclusion, and multi-unit synchronization starting to appear only in the past decade), and questions of optimality, practicality, and industrial adoption are still the subject of ongoing exploration and debate. Another major difference from the uniprocessor case is the extent to which nested critical sections are supported: a large fraction of the work on multiprocessor real-time locking protocols to date has simply disregarded (or defined away) fine-grained nesting (i.e., cases where a task holding a resource may dynamically acquire a second resource), though notable exceptions exist (discussed in Section 9). No such limitations exist in the state of the art w.r.t. uniprocessor real-time synchronization.
We can thus expect the field to continue to evolve rapidly: it is simply not yet possible to provide a “final” and comprehensive survey on multiprocessor real-time locking, as many open problems remain to be explored. Nonetheless, a considerable number of results has accumulated since the multiprocessor real-time locking problem was first studied more than three decades ago. The purpose of this survey is to provide a systematic review of the current snapshot of this body of work, covering most papers in this area published until the end of 2018.
We restrict the scope of this survey to real-time locking protocols for shared-memory multiprocessors (although some of the techniques discussed in Section 6 could also find applications in distributed systems), and do not consider alternative synchronization strategies such as lock- and wait-free algorithms, transactional memory techniques, or middleware (or database) approaches that provide a higher-level transactional interface or multi-versioned datastore abstraction (as any of these techniques warrants a full survey on its own). We further restrict the focus to runtime mechanisms as commonly provided by real-time operating systems (RTOSs) or programming language runtimes for use in dynamically scheduled systems and exclude fully static planning approaches (e.g., as studied in [X:93]) that resolve (or avoid) all potential resource conflicts statically during the construction of the system’s scheduling table so that no runtime resource arbitration mechanisms are needed.
Our goal is to structure the existing body of knowledge on multiprocessor real-time locking protocols to aid the interested reader in understanding key problems, established solutions, and recurring themes and techniques. We therefore focus primarily on ideas, algorithms, and provable guarantees, and place less emphasis on empirical performance comparisons or the chronological order of developments.
2 The Multiprocessor Real-Time Locking Problem
We begin by defining the core problems and objectives, summarizing common assumptions, and surveying key design parameters. Consider a shared-memory multiprocessor platform consisting of m identical processors (or cores) hosting n sequential tasks (or threads) denoted as τ1, …, τn. Each activation of a task is called a job. In this document, we use the terms ‘processor’ and ‘core’ interchangeably, and do not precisely distinguish between jobs and tasks when the meaning is clear from context.
The tasks share a number of software-managed shared resources that are explicitly acquired and released with lock() and unlock() calls. (Multicore platforms also often feature hardware-managed, implicitly shared resources such as last-level caches (LLCs), memory controllers, DRAM banks, etc.; techniques for managing such hardware resources are not the subject of this survey.) Common examples of such resources include shared data structures, OS-internal structures such as the scheduler’s ready queue(s), I/O ports, memory-mapped device registers, ring buffers, etc. that must be accessed in a mutually exclusive fashion. (We consider weaker exclusion requirements later in Section 8.)
The primary objective of the algorithms considered herein is to serialize all resource requests—that is, all critical sections, which are code segments surrounded by matching lock() and unlock() calls—such that the timing constraints of all tasks are met, despite the blocking delays that tasks incur when waiting to lock a contested resource. In particular, in order to give nontrivial response-time guarantees, one must bound the maximum (i.e., worst-case) blocking delay incurred by any task due to contention for shared resources. To this end, a multiprocessor real-time locking protocol determines which types of locks are used, the rules that tasks must follow to request a lock on a shared resource, and how locks interact with the scheduler. We discuss these design questions in more detail below.
2.1 Common Assumptions
Although there exists much diversity in system models and general context, most surveyed works share the following basic assumptions (significant deviations will be noted where relevant). Tasks are typically considered to follow a periodic or sporadic activation pattern, where the two models can be used largely interchangeably since hardly any work on multiprocessor real-time locking protocols to date has exploited the knowledge of future periodic arrivals. (Notable exceptions include proposals in [CTB:94] and [SUBC:19].) The timing requirements of the tasks are usually expressed as implicit or constrained deadlines (rather than arbitrary deadlines), since arbitrary deadlines that exceed a task’s period (or minimum inter-arrival time) allow for increased contention and cause additional analytical complications.
For the purpose of schedulability analysis (but not for the operation of a locking protocol at runtime), a worst-case execution time (WCET) bound must be known for each task, usually including the cost of all critical sections (but not including any blocking delays), and also individually for each critical section (i.e., a maximum critical-section length must be known for each critical section). Furthermore, to enable a meaningful blocking analysis, the maximum number of critical sections in each job (i.e., the maximum number of lock() calls per activation of each task) must be known on a per-resource basis.
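For concreteness, the per-task analysis inputs just listed can be collected in a simple record, as in the following C sketch (all names and the fixed resource count are illustrative assumptions, not prescribed by any particular analysis):

```c
enum { NUM_RESOURCES = 2 }; /* illustrative platform constant */

/* Per-task inputs assumed by a typical blocking-aware schedulability
 * analysis: a WCET bound (including critical sections, excluding any
 * blocking delays), and, per shared resource, the maximum number of
 * lock() calls per job and a bound on each critical section's length. */
struct task_params {
    unsigned wcet;                        /* worst-case execution time     */
    unsigned period;                      /* or minimum inter-arrival time */
    unsigned deadline;                    /* relative deadline             */
    unsigned max_requests[NUM_RESOURCES]; /* max lock() calls per job      */
    unsigned max_cs_len[NUM_RESOURCES];   /* max critical-section length   */
};
```

None of this information is needed by the locking protocol at runtime; it is consumed only by the offline blocking and schedulability analyses.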
While it is generally impossible to make strong a priori response-time guarantees without this information being available at analysis time (for at least some of the tasks), to be practical, it is generally desirable for a locking protocol to work as intended even if this information is unknown at runtime. For example, RTOSs are typically used for many purposes, and not all workloads will be subject to static analysis—the implemented locking protocol should function correctly and predictably nonetheless. Similarly, during early prototyping phases, sound bounds are usually not yet available, but the RTOS is expected to behave just as it will in the final version of the system.
With regard to scheduling, most of the covered papers assume either partitioned or global multiprocessor scheduling. Under partitioned scheduling, each task is statically assigned to exactly one of the processors (i.e., its partition), and each processor is scheduled using a uniprocessor policy such as fixed-priority (FP) scheduling or earliest-deadline first (EDF) scheduling. The two most prominent partitioned policies are partitioned FP (P-FP) and partitioned EDF (P-EDF) scheduling.
Under global scheduling, all tasks are dynamically dispatched at runtime, may execute on any of the processors, and migrate freely among all processors as needed. Widely studied examples include global FP (G-FP) and global EDF (G-EDF) scheduling, as well as optimal policies such as Pfair scheduling [BCPV:96, SA:06].
A third, more general notion is clustered scheduling, which generalizes both global and partitioned scheduling. Under clustered scheduling, the set of cores is split into a number of disjoint clusters (i.e., disjoint subsets of cores), tasks are statically assigned to clusters, and each cluster is scheduled locally and independently using a “global” scheduling policy (w.r.t. the cores that form the cluster). We let c denote the number of cores in a cluster. Global scheduling is a special case of clustered scheduling with a single “cluster” and c = m, and partitioned scheduling is the other extreme with m clusters and c = 1.
While clustered scheduling is a more general assumption (i.e., any locking protocol designed for clustered scheduling also works for global and partitioned scheduling), it also comes with the combined challenges of both global and partitioned scheduling, and is hence generally much more difficult to deal with. Historically, most authors have thus focused on either global or partitioned scheduling.
Under partitioned and clustered scheduling, it is useful to distinguish between global and local resources. Under partitioned (respectively, clustered) scheduling, a shared resource is considered local if it is accessed only by tasks that are all assigned to the same core (respectively, cluster). In contrast, a resource is global if it is accessed by at least two tasks assigned to different partitions (respectively, clusters). The advantage of this distinction is that local resources can be managed with existing, simpler, and often more efficient protocols:
under partitioned scheduling, local resources can be managed using one of the known, optimal uniprocessor protocols (e.g., the PCP [SRL:90] or the SRP [B:91]); and
under clustered scheduling, local resources can be managed using a (usually simpler) protocol for global scheduling (instantiated within each cluster).
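The local/global distinction can be stated operationally. The following C sketch classifies a resource given a task-to-cluster assignment (the array encoding and all names are our own illustrative assumptions):

```c
#include <stdbool.h>

/* A resource is local iff all tasks that access it are assigned to the
 * same cluster (under partitioned scheduling, a "cluster" is one core).
 * task_cluster[i] is the cluster of task i; accessors lists the n tasks
 * that access the resource in question. */
static bool is_local_resource(const int task_cluster[],
                              const int accessors[], int n)
{
    for (int i = 1; i < n; i++)
        if (task_cluster[accessors[i]] != task_cluster[accessors[0]])
            return false; /* accessors span clusters: global resource */
    return true;
}
```

A system integrator would run such a classification once, offline, and then instantiate a simpler local protocol for each local resource and a multiprocessor protocol for the global ones.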
In this survey, we therefore consider only global resources, which are the more challenging kind of shared resources to manage.
2.2 Key Design Choices
Given the common setting outlined so far, there are a number of key design questions that any multiprocessor real-time locking protocol must address. We next provide a high-level overview of these issues.
2.2.1 Request Order
The first key design parameter is the serialization order for conflicting requests. Whenever two or more requests for a resource are simultaneously blocked, the protocol must specify a policy for sequencing the waiting requests.
The two most common choices are FIFO queuing, which ensures basic fairness (i.e., non-starvation), is easy to implement, and is analysis-friendly, and priority queuing, which allows control over the amount of blocking incurred by different tasks. As a third choice, there also exist some situations (discussed in Sections 5.1 and 7) in which the use of hybrid queues consisting of both FIFO- and priority-ordered segments can be advantageous.
When using priority queues, each request must be associated with a priority to determine its position in the wait queue. The common choice is to use a job’s scheduling priority, but it is also possible to use a separate request priority, which can be selected on a per-resource or even on a per-critical-section basis. The latter obviously provides a considerable degree of flexibility, but is not commonly considered since it introduces a nontrivial configuration problem. FIFO and priority queuing can be generalized in a straightforward way by requiring that equal-priority requests are satisfied in FIFO order.
Finally, implementation concerns may sometimes necessitate the use of unordered locks (e.g., primitive test-and-set spin locks), which do not provide any guarantees regarding the order in which conflicting critical sections will execute. Unordered locks are decidedly not analysis-friendly, but sometimes unavoidable.
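To make the FIFO-ordered and unordered options concrete, the following C11 sketch contrasts a classic ticket lock (which serves waiters strictly in arrival order) with a primitive test-and-set lock (which grants the lock in no particular order). Both are minimal illustrations of the respective queuing disciplines, not implementations taken from the surveyed literature:

```c
#include <stdatomic.h>

/* Unordered lock: primitive test-and-set; no ordering guarantee at all. */
typedef struct { atomic_flag held; } tas_lock_t;

static inline void tas_init(tas_lock_t *l)   { atomic_flag_clear(&l->held); }
static inline void tas_lock(tas_lock_t *l)   { while (atomic_flag_test_and_set(&l->held)) { /* spin */ } }
static inline void tas_unlock(tas_lock_t *l) { atomic_flag_clear(&l->held); }

/* FIFO-ordered ticket lock: waiters are served strictly in arrival order. */
typedef struct {
    atomic_uint next_ticket; /* ticket handed to the next arriving task */
    atomic_uint now_serving; /* ticket currently admitted to the CS     */
} ticket_lock_t;

static inline void ticket_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}
static inline void ticket_lock(ticket_lock_t *l) {
    unsigned mine = atomic_fetch_add(&l->next_ticket, 1); /* take a ticket */
    while (atomic_load(&l->now_serving) != mine) { /* spin */ }
}
static inline void ticket_unlock(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1); /* admit the next waiter */
}
```

The ticket lock's FIFO order is what makes per-request blocking bounds straightforward (each waiter is delayed by at most the critical sections of the requests ahead of it), whereas no such per-waiter bound exists for the test-and-set lock.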
2.2.2 Spinning vs. Suspending
The second main question is how tasks should wait in case of contention. The two principal choices are busy-waiting (i.e., spinning) and suspending. In the case of busy-waiting, a blocked task continues to occupy its processor and simply executes a tight delay (or spin) loop, continuously checking whether it has been granted the lock, until it gains access to the shared resource. Alternatively, if tasks wait by suspending, a blocked task yields the processor and is taken out of the scheduler’s ready queue until it is granted the requested resource.
Determining how blocked tasks should wait is not an easy choice, as there are many advantages and disadvantages associated with either approach. On the one hand, suspension-based waiting is conceptually more efficient: busy-waiting obviously wastes processor cycles, whereas suspension-based waiting allows the wait times of one task to be overlaid with useful computation of another task. On the other hand, busy-waiting is easier to implement, easier to analyze, requires less OS support, and the cost of suspending and resuming a task can easily dwarf typical critical section lengths. Spin locks also provide predictability advantages that can aid in the static analysis of the system (e.g., a busy-waiting task “protects” its processor and cache state, whereas it is virtually impossible to predict the cache contents encountered by a resuming task).
Whether spinning or suspending is more efficient ultimately depends on workload and system characteristics, such as the cost of suspending and resuming tasks relative to critical section lengths, and it is impossible to categorically declare one or the other to be the “best” choice. Generally speaking, “short” critical sections favor busy-waiting, whereas “long” critical sections necessitate suspensions, but the threshold between “short” and “long” is highly system- and application-specific. We discuss spin-based locking protocols in Sections 4, 7, and 8 and suspension-based locking protocols in Sections 5–8.
2.2.3 Progress Mechanism
A third major choice is the question of how to deal with the lock-holder preemption problem, which is tightly coupled to the choice of scheduler, the employed analysis approach, and the constraints of the target workload. If a lock-holding task is preempted by a higher-priority task, then any other task waiting for the resource held by the preempted task is transitively delayed as well. This can give rise to (potentially) unbounded priority inversions (i.e., excessive delays that are difficult to bound), which must be avoided in a real-time system. To this end, it is at times necessary to (selectively) force the execution of lock-holding tasks by means of a progress mechanism. As this is a crucial aspect of multiprocessor real-time locking protocols, we dedicate Section 3 to the lock-holder preemption problem and common solutions.
2.2.4 Support for Fine-Grained Nesting
Fine-grained locking, where a task concurrently acquires multiple locks in a nested, incremental fashion, is a major source of complications in both the analysis and the implementation of multiprocessor real-time locking protocols. In particular, it comes with the risk of deadlock, and even if application programmers are careful to avoid deadlocks, nesting still introduces intricate transitive blocking effects that are extremely challenging to analyze accurately and efficiently. Furthermore, in many cases, support for fine-grained nesting leads to substantially more involved protocol rules and more heavy-weight OS support.
Consequently, as already mentioned in Section 1, many works on multiprocessor real-time locking protocols simply disallow the nesting of critical sections altogether or, under partitioned scheduling, restrict nesting to local resources only, where it can be resolved easily with classic uniprocessor solutions [B:91, SRL:90, R:91].
Another common approach is to sidestep the issue by relying either on two-phase locking schemes with all-or-nothing semantics, where a task atomically acquires all requested locks or holds none, or on simple group-lock approaches that automatically aggregate fine-grained, nested critical sections into coarse-grained, non-nested lock requests. From an analysis point of view, the two-phase and group-locking schemes are conveniently similar to protocols that disallow nesting altogether, but from a developer’s point of view, they impose limitations that may be difficult to accommodate in practice.
Only in recent years have there been renewed efforts towards full, unrestricted support for fine-grained nesting (e.g., [WA:12, BBW:16]), and there remains ample opportunity for future work. We discuss the issues surrounding fine-grained nesting and the state of the art in Section 9, and until then focus exclusively on non-nested critical sections.
2.2.5 In-Place vs. Centralized Critical Sections
The final choice is where to execute a critical section once the lock has been acquired. In shared-memory systems, the typical choice is to execute critical sections in place, meaning that a task executes its critical sections as part of its regular execution, on the processor that it is (currently) assigned to by the scheduler.
However, that is not the only choice. It is also possible to a priori designate a synchronization processor for a particular resource, to the effect that all critical sections (pertaining to that resource) must be executed on the designated synchronization processor. This can yield analytical benefits (i.e., less blocking in the worst case [B:13a]), a reduction in worst-case overheads [CVB:14], and throughput benefits due to improved cache affinity [LDTL:12]. Furthermore, for specialized hardware resources, such as certain I/O devices, it might simply be unavoidable on some platforms (e.g., in a heterogeneous multiprocessor platform, one processor might be a designated I/O processor). We discuss protocols that rely on designated synchronization processors in Section 6 and for now focus exclusively on in-place execution.
2.3 Analysis and Optimization Problems
In addition to the just-discussed design choices, which determine the runtime behavior of the protocol, there are also a number of challenging design-time problems related to a priori timing and schedulability analysis of the system, fundamental optimality questions, and system optimization and design-space exploration problems.
Most prominently, as a prerequisite to schedulability analysis, the blocking analysis problem asks to bound the worst-case delay due to resource conflicts. That is, given a workload, a multiprocessor platform, and a specific multiprocessor real-time locking protocol, the objective is to compute a safe (and as accurate as possible) upper bound on the maximum additional delay due to resource contention encountered by a given task in any possible execution of the system. Once such a bound is known for each task, hard timing guarantees can be made with a blocking-aware schedulability or response-time analysis. In some instances, it can be more accurate to carry out both the blocking analysis and the blocking-aware schedulability analysis jointly in a single step (e.g., [YWB:15]).
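As a minimal illustration of how a blocking bound feeds into schedulability analysis, consider the classic response-time recurrence for a single core under P-FP scheduling, R = C + B + Σ_j ⌈R/T_j⌉·C_j, where B is the blocking bound produced by the blocking analysis and the sum ranges over local higher-priority tasks. This is a standard textbook formulation, sketched here in C with illustrative names:

```c
/* Blocking-aware response-time analysis for one task on one core under
 * P-FP scheduling: iterate R = C + B + sum_j ceil(R/T_j)*C_j to a fixed
 * point, where B is a bound on blocking (pi-blocking and/or spinning)
 * supplied by a separate blocking analysis. Illustrative sketch only. */
static unsigned response_time(unsigned C, unsigned B,
                              const unsigned hpC[], const unsigned hpT[],
                              int n, unsigned deadline)
{
    unsigned R = C + B, prev = 0;
    while (R != prev && R <= deadline) {
        prev = R;
        unsigned interference = 0;
        for (int i = 0; i < n; i++) /* ceil(prev / T_i) * C_i */
            interference += ((prev + hpT[i] - 1) / hpT[i]) * hpC[i];
        R = C + B + interference;
    }
    return R; /* a result exceeding the deadline indicates "unschedulable" */
}
```

The accuracy of the final schedulability verdict thus hinges directly on how pessimistic the bound B is, which is why reducing analysis pessimism has received so much attention.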
A large number of ad-hoc, protocol-specific blocking analyses have been developed over the years. Additionally, a more general holistic blocking analysis framework [B:11] and a blocking analysis approach based on linear programming (LP) and mixed-integer linear programming (MILP) have been introduced to systematically reduce analysis pessimism [B:13a, WB:13a, YWB:15, BB:16, BBW:16]. These more recent analysis frameworks represent a general approach that can be (and has been) applied to many different locking protocols; we will note their use and discuss advantages in the context of specific locking protocols in Sections 4.1, 5.2.2, 5.2.3, and 9.
Clearly, it is generally desirable for blocking bounds to be as low as possible. However, if locks are used to resolve contention at runtime, it is also obvious that, in the worst case, some delays are inevitable. This observation naturally leads to the question of asymptotic blocking optimality: generally speaking, what is the least bound on maximum blocking that any protocol can achieve? This question has been studied primarily in the context of suspension-based locking protocols (since the spin-based case is relatively straightforward), and a number of protocols with provably optimal asymptotic blocking bounds have been found. We will discuss this notion of optimality and the corresponding protocols in Section 5.
Another notion of optimality that has been used to characterize multiprocessor real-time locking protocols is resource augmentation and processor speedup bounds, which relate a protocol’s timing guarantees to that of a hypothetical optimal one in terms of the additional resources (or processor speed increases) needed to overcome the protocol’s non-optimality. While this is a stronger notion of optimality than asymptotic blocking optimality—speedup and resource-augmentation results consider both blocking and scheduling, whereas asymptotic blocking optimality is concerned solely with the magnitude of blocking bounds—existing speedup and resource-augmentation results have been obtained under very restrictive assumptions (e.g., only one critical section per task) and yield quite large augmentation and speedup factors. We briefly mention some relevant works in Section 11.1.
Last but not least, a large number of challenging system optimization and design-space exploration problems can be formalized under consideration of synchronization constraints. Prominent examples include:
task mapping problems—given a processor platform, a scheduling policy, and a locking protocol, find an assignment of tasks to processors (or clusters) that renders the system schedulable, potentially while optimizing some other criteria (e.g., average response times, memory needs, energy or thermal budgets, etc.);
resource mapping problems—select for each shared resource a designated synchronization processor such that the system becomes schedulable;
platform minimization problems—given a workload, scheduling policy, and locking protocol, minimize the number of required cores;
policy selection problems—given a workload and a platform, identify (potentially on a per-resource basis) a locking protocol (or an alternative synchronization approach) that renders the system schedulable, again potentially while simultaneously optimizing for other criteria; and
many variations and combinations of these and similar problems.
Not surprisingly, virtually all interesting problems of these kinds are NP-hard since they typically involve solving one or more bin-packing-like problems. While a detailed consideration of optimization techniques is beyond the scope of this survey, we briefly mention some representative results that exemplify these types of system integration and optimization problems in Section 11.1.
2.4 Historical Perspective
Historically, the field traces its roots to the 1980s. While a discussion of the challenges surrounding multiprocessor synchronization in real-time systems, including a discussion of the respective merits of spin- and suspension-based primitives, can already be found in an early critique of Ada [REMC:81] published in 1981, the first multiprocessor real-time locking protocol backed by a sound schedulability analysis taking worst-case blocking delays into account is Rajkumar et al.’s Distributed Priority Ceiling Protocol (DPCP) [RSL:88], which appeared in 1988. This result was followed in 1990 by the Multiprocessor Priority Ceiling Protocol (MPCP) [R:90], the second foundational protocol that had (and continues to have) a large impact on the field, and which in many ways still represents the prototypical suspension-based protocol for partitioned scheduling (which, however, should not obscure the fact that many other well-performing alternatives have been proposed since).
Throughout the 1990s, a number of protocols and lock implementations appeared; however, as multiprocessor real-time systems were still somewhat of a rarity at the time, multiprocessor synchronization was not yet a major concern in the real-time community. This fundamentally changed with the advent of multicore processors and multiprocessor system-on-a-chip (MPSoC) platforms in the early to mid 2000s. Motivated by these trends, and the desire to minimize the use of stack memory in such systems, Gai et al. published a highly influential paper in 2001 proposing the Multiprocessor Stack Resource Policy (MSRP) [GLD:01]. While the original schedulability and blocking analysis has since been superseded by later, more accurate analyses [B:11, WB:13a, BB:16, BBW:16], the MSRP remains the prototypical spin-based multiprocessor real-time locking protocol for partitioned scheduling. Another influential paper motivated by the widespread emergence of multicore processors as the standard computing platform was published in 2007 by Block et al., who introduced the Flexible Multiprocessor Locking Protocol (FMLP) [BLBA:07], which combined many of the advantages of the MPCP and the MSRP, and which was the first to provide full support for both global and partitioned scheduling.
The FMLP paper marked the beginning of the recent surge in interest in multiprocessor real-time locking: since 2007, every year has seen more than a dozen publications in this area—about 140 in total in the past decade, which is almost three times as many as published in the 20 years prior. We dedicate the rest of this survey to a systematic review (rather than a chronological one) of this vibrant field.
3 Progress Mechanisms
At the heart of every effective multiprocessor real-time locking protocol is a progress mechanism to expedite the completion of critical sections that otherwise might cause excessive blocking to higher-priority or remote tasks. More specifically, a progress mechanism forces the scheduling of lock-holding tasks (either selectively or unconditionally), thereby temporarily overriding the normal scheduling policy. In this section, we review the major progress mechanisms developed to date and provide example schedules that illustrate key ideas.
As a convention, unless noted otherwise, we use fixed-priority (FP) scheduling in our examples and assume that tasks are indexed in order of strictly decreasing priority (i.e., τ1 is always the highest-priority task). All examples in this section further assume the use of basic suspension- or spin-based locks (i.e., raw locks without additional protocol rules); the specifics are irrelevant by design.
3.1 Priority Inversion on Uniprocessors
To understand the importance of progress mechanisms, and why multiprocessor-specific mechanisms are needed, it is helpful to first briefly review the threat of “unbounded priority inversions” on uniprocessors and how it is mitigated in classic uniprocessor real-time locking protocols.
Figure 1 shows the classic “unbounded priority inversion” example of three tasks under FP scheduling. At time 1, the lowest-priority task τ3 locks a resource that it shares with the highest-priority task τ1. When τ1 is activated at time 2, it also tries to lock the resource (at time 3), and thus becomes blocked by τ3’s critical section, which intuitively constitutes a priority inversion since τ1 is pending (i.e., it has unfinished work to complete before its deadline) and has higher priority than τ3, but τ3 is scheduled instead of τ1. When the “middle-priority” task τ2 is activated at time 4, it preempts the lock-holding, lower-priority task τ3, which delays the completion of τ3’s critical section, which in turn continues to block τ1, until τ2 completes and yields the processor at time 19 (at which point τ1 has already missed its deadline).
Since τ2 has lower priority than the pending (but not scheduled) τ1, this delay also constitutes a priority inversion. And since the length of this priority inversion is determined by τ2’s WCET, which in general could be arbitrarily large, this is traditionally considered to be an “unbounded” priority inversion (even though technically it is bounded by the maximum scheduling interference incurred by τ3). That is, a priority inversion is traditionally considered “bounded” only if its maximum-possible duration can be bounded in terms of only the maximum critical section length and the number of concurrent critical sections, independently of all tasks’ WCETs, since WCETs are expected to usually be (much) larger than typical critical section lengths.
To summarize, on uniprocessors, “unbounded” priority inversion arises as a consequence of the lock-holder preemption problem, and it is problematic because it renders the response times of high-priority tasks (e.g., τ1’s in Figure 1) dependent on the WCETs of lower-priority tasks (e.g., τ2’s in Figure 1). This contradicts the purpose of priority-driven scheduling, where higher-priority tasks (or jobs) should remain largely independent of the processor demands of lower-priority tasks (or jobs). On uniprocessors, classic progress mechanisms such as priority inheritance [SRL:90] or priority-ceiling protocols [SRL:90, B:91] avoid unbounded priority inversion, either by raising the priority of lock-holding tasks or by delaying the release of higher-priority tasks.
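For illustration, the core effect of priority inheritance can be sketched as a simple effective-priority computation (C, with hypothetical names; note that, unlike the task-index convention used in the examples, larger numbers denote higher priorities in this sketch):

```c
#include <stddef.h>

/* Minimal sketch of (non-transitive) priority inheritance: a lock holder
 * executes at the maximum of its own base priority and the priorities of
 * all tasks currently blocked on it, so a middle-priority task can no
 * longer preempt the holder while a high-priority task is waiting. */
struct task {
    int base_prio;                /* assigned scheduling priority          */
    const struct task *waits_for; /* lock holder this task is blocked on,
                                     or NULL if it is not blocked          */
};

static int effective_prio(const struct task *holder,
                          const struct task *tasks[], int n)
{
    int p = holder->base_prio;
    for (int i = 0; i < n; i++)
        if (tasks[i]->waits_for == holder && tasks[i]->base_prio > p)
            p = tasks[i]->base_prio; /* inherit the blocked waiter's priority */
    return p;
}
```

Applied to Figure 1, the low-priority lock holder would inherit the blocked highest-priority task's priority, preventing the middle-priority task from preempting it mid-critical-section.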
3.2 Priority Inversion on Multiprocessors
The lock-holder preemption problem of course also exists on multiprocessors. For example, Figure 2 shows a situation comparable to Figure 1 involving four tasks on m = 2 processors under G-FP scheduling. Analogously to the uniprocessor example, the lock-holding task τ4 is preempted due to the arrival of two higher-priority tasks at times 2 and 3, respectively, which in turn induces an “unbounded” priority inversion (i.e., an unwanted dependency on the WCETs of lower-priority tasks) in the highest-priority (and blocked) task τ1.
However, in addition to creating an unwanted dependency of high- on low-priority tasks (or jobs), an untimely preemption of a lock holder can also result in undesirable delays of remote tasks or jobs. For instance, consider the example in Figure 3, which shows a partitioned fixed-priority (P-FP) schedule illustrating an “unbounded” priority inversion due to a remote task. Compared to the uniprocessor example in Figure 1, the roles of τ1 and τ2 have been switched, and τ2 has been assigned (by itself) to processor 2. Again, the lock-holding task τ3 is preempted at time 3 by a higher-priority task (τ1 in this case). As a result, task τ2 transitively incurs a delay proportional to τ1’s WCET, even though τ1 is, from the point of view of τ2, an unrelated remote task that τ2’s response time intuitively should not depend on.
Specifically, even though τ1 may have a numerically higher priority, when analyzing each processor as a uniprocessor system (the standard approach under partitioned scheduling), the delay transitively incurred by τ2 due to τ1 must be considered an extraordinary source of interference akin to a priority inversion since there is no local higher-priority task that executes on processor 2 while τ2 is pending.
As a result, multiprocessor systems require a more general notion of “priority inversion.” To capture delays due to remote critical sections, the definition of priority inversion under partitioned scheduling must include not only the classic case where a (local) lower-priority task is scheduled instead of a pending, but blocked (local) higher-priority task, but also the case where a processor idles despite the presence of a pending (but remotely blocked) higher-priority task. Analogously, under global scheduling on an m-processor platform, any situation in which fewer than m higher- or equal-priority tasks are scheduled while some task is waiting constitutes a priority inversion. Both cases (partitioned and global scheduling) can be captured precisely with the following definition. Recall that clustered scheduling generalizes both global and partitioned scheduling.
Definition 1. A job J of a task, assigned to a cluster consisting of c cores, suffers priority-inversion blocking (pi-blocking) at time t if and only if
(1) J is pending (i.e., released and incomplete) at time t,
(2) J is not scheduled at time t, and
(3) fewer than c equal- or higher-priority jobs of tasks assigned to the same cluster are scheduled on processors belonging to J’s assigned cluster.
Under partitioned scheduling c = 1, and under global scheduling c = m. We prefer the specific term “pi-blocking” rather than the more common, but also somewhat vague, term “blocking” since the latter is often used in an OS context to denote suspensions of any kind, whereas we are explicitly interested only in delays that constitute a priority inversion.
Note that Definition 1 is defined in terms of jobs (and not tasks) to cover the full range of job-level fixed-priority (JLFP) policies, and in particular EDF scheduling (which belongs to the class of JLFP policies). We will further refine Definition 1 in Section 5 to take into account certain subtleties related to the analysis of self-suspensions.
Applying Definition 1 to the example in Figure 1, we observe that the blocked high-priority task indeed incurs pi-blocking from time 3 until time 20, since it is pending, but not scheduled, and fewer than c = 1 equal- or higher-priority jobs are scheduled, matching the intuitive notion of “priority inversion.” In Figure 2, the blocked task suffers pi-blocking from time 2 until time 19 since fewer than m higher-priority jobs are scheduled while it waits to acquire the contested resource (under global scheduling, any locking-induced suspension constitutes a priority inversion for the m highest-priority jobs). Similarly, in Figure 3, the blocked task suffers pi-blocking from time 2 until time 19 since in its cluster (i.e., its assigned core) it is pending, not scheduled, and fewer than c = 1 higher-priority jobs are scheduled (in fact, none are scheduled at all).
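To make Definition 1 concrete, the following minimal Python sketch (with hypothetical parameter names; not part of any cited protocol) expresses the pi-blocking test for a single job at a single instant:

```python
def suffers_pi_blocking(pending, scheduled, c, num_hep_scheduled):
    """Definition 1 as a predicate: a job suffers pi-blocking at time t iff
    it is pending, not scheduled, and fewer than c equal- or higher-priority
    jobs of tasks assigned to its cluster (of c cores) are scheduled there.
    num_hep_scheduled counts those equal- or higher-priority scheduled jobs."""
    return pending and not scheduled and num_hep_scheduled < c

# Partitioned scheduling (c = 1): a blocked, pending job on an idle or
# lower-priority-occupied core suffers pi-blocking.
assert suffers_pi_blocking(True, False, 1, 0)
# Global scheduling (c = m): no pi-blocking if m higher-priority jobs run.
assert not suffers_pi_blocking(True, False, 2, 2)
```

Note that the predicate covers both the classic uniprocessor case (c = 1, a lower-priority job scheduled instead) and the remote-blocking case (c = 1, the core idling while the job is remotely blocked).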
3.3 Non-Preemptive Sections
Several mechanisms (i.e., scheduling rules) have been proposed to ensure a bounded maximum (cumulative) duration of pi-blocking. The simplest solution is to let tasks spin and to make every lock request a non-preemptive section: if tasks execute critical sections non-preemptively, then it is simply impossible for a lock holder to be preempted within a critical section.
Figure 4 illustrates how turning critical sections into non-preemptive sections prevents “unbounded” pi-blocking in the example scenario previously shown in Figure 3. In Figure 4, because the lock holder cannot be preempted from time 1 until time 4, the preemption due to the arrival of the local higher-priority task is deferred, which ensures that the lock is released in a timely manner, and so the blocked remote task can meet its deadline at time 16.
However, the delay now incurred by the deferred higher-priority task also constitutes pi-blocking. This highlights an important point: progress mechanisms do not come “for free.” Rather, they must strike a balance between the delay incurred by tasks waiting to acquire a resource and the delay incurred by higher-priority tasks when the completion of critical sections is forced (i.e., when the normal scheduling order is overridden), as both arise in Figure 4.
Executing lock requests as non-preemptive sections is also effective under clustered scheduling (and hence also under global scheduling). However, there exists a subtlety w.r.t. how delayed preemptions are realized that does not arise on uniprocessors or under partitioned scheduling. Consider the example schedules in Figures 5 and 6, which show two possible variants of the scenario previously depicted in Figure 2. In particular, since the lock holder executes its critical section non-preemptively from time 1 to time 4, the newly released higher-priority job cannot preempt it—the lowest-priority scheduled task—at time 3. However, there does exist another scheduled lower-priority task that can be preempted at that time. Should the newly released job immediately preempt that task, or should it wait until time 4, when the lock holder, which intuitively should have been the preemption victim, finishes its non-preemptive section? Both interpretations of global scheduling are possible [BLBA:07, B:11]. The former approach is called eager preemptions; the latter, conversely, lazy preemptions [B:11] or link-based global scheduling [BLBA:07].
Whereas eager preemptions are easier to implement from an OS point of view, this approach suffers from the disadvantage that a job can suffer pi-blocking repeatedly due to non-preemptive sections in any unrelated lower-priority task and at any point during its execution: in the worst case, a job can be preempted and suffer pi-blocking whenever a higher-priority job is released (such as at time 3 in Figure 5), which is difficult to predict and bound accurately.
In contrast, on a uniprocessor (and under partitioned scheduling), in the absence of self-suspensions, a job suffers pi-blocking due to a lower-priority job’s non-preemptive section at most once (i.e., immediately upon its release, or not at all), a property that greatly aids worst-case analysis.
The lazy preemption approach, aiming to restore this convenient property, reduces the number of situations in which a job is repeatedly preempted due to a non-preemptive section in a lower-priority job [BLBA:07, B:11]. While the lazy preemption approach cannot completely eliminate the occurrence of repeated preemptions in all situations [BA:14], under a common analysis approach—namely, if task execution times are inflated to account for delays due to spinning and priority inversions (discussed in Sections 4 and 5.1)—it does ensure that pi-blocking due to a non-preemptive section in a lower-priority job has to be accounted for only once per job (in the absence of self-suspensions) [BLBA:07, B:11, BA:14], analogously to the reasoning in the case of uniprocessors or partitioned scheduling. In other words, lazy preemption semantics ensure analysis conditions that are favorable for inflation-based analysis.
Link-based global scheduling [BLBA:07, B:11], which realizes lazy preemptions, derives its name from the fact that it establishes a “link” between a newly released job and the non-preemptively executing job that it should have preempted (if any); the deferred preemption is then enacted as soon as the linked job exits its non-preemptive section, which can be implemented efficiently [B:11]. Link-based global scheduling has been implemented and evaluated in LITMUS^RT [B:11, BCBL:08], a real-time extension of the Linux kernel (see http://www.litmus-rt.org).
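The linking rule can be sketched as follows in Python (a simplified model with hypothetical names, assuming all cores are occupied and lower numbers denote higher priorities; it is not the LITMUS^RT implementation):

```python
def on_release(new_job, scheduled_jobs):
    """Lazy-preemption (link-based) dispatching decision on a job release.
    Jobs are dicts with 'prio' (lower number = higher priority) and
    'non_preemptive' fields. The intuitive preemption victim is the
    lowest-priority scheduled job; if it is inside a non-preemptive
    section, the new job is linked to it and the preemption is deferred
    until the linked job exits its non-preemptive section."""
    victim = max(scheduled_jobs, key=lambda j: j['prio'])  # lowest priority
    if new_job['prio'] >= victim['prio']:
        return ('none', None)       # no preemption warranted
    if victim['non_preemptive']:
        return ('link', victim)     # defer: link to the intended victim
    return ('preempt', victim)      # victim preemptable: preempt now

np_holder = {'prio': 4, 'non_preemptive': True}
other_low = {'prio': 3, 'non_preemptive': False}
high      = {'prio': 1, 'non_preemptive': False}
# Lazy semantics: link to the non-preemptive lowest-priority job...
assert on_release(high, [np_holder, other_low]) == ('link', np_holder)
```

Under eager preemptions, by contrast, the release would immediately displace the other (preemptable) lower-priority job instead of waiting for the linked job.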
Non-preemptive execution can be achieved in several ways, depending on the OS and the environment. In a micro-controller setting and within OS kernels, preemptions are typically avoided by disabling interrupts. In UNIX-class RTOSs with a user-mode / kernel-mode divide, where code running in user-mode cannot disable interrupts, non-preemptive execution can be easily emulated by reserving a priority greater than that of any “regular” job priority for tasks within critical sections.
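The user-mode emulation just described can be sketched as follows (a minimal Python model with hypothetical priority values and class names; a real system would instead invoke an RTOS priority-change system call):

```python
# Assumed convention: regular jobs use priorities 1..MAX_REG_PRIO;
# one level above MAX_REG_PRIO is reserved for emulated
# non-preemptive sections (hypothetical values for illustration).
MAX_REG_PRIO = 99
NP_PRIO = MAX_REG_PRIO + 1

class Task:
    def __init__(self, prio):
        self.base_prio = prio
        self.eff_prio = prio

    def enter_np_section(self):
        # Raising the task above every regular priority means that no
        # newly released regular job can preempt it: non-preemptive in effect.
        self.eff_prio = NP_PRIO

    def exit_np_section(self):
        self.eff_prio = self.base_prio

def can_preempt(preemptor, target):
    # Higher effective priority wins in this sketch.
    return preemptor.eff_prio > target.eff_prio
```

For example, even a task at the maximum regular priority cannot preempt a task that has entered an emulated non-preemptive section, and normal preemptability is restored on exit.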
Regardless of how non-preemptive sections are realized, the major drawback of this progress mechanism is that it can result in unacceptable latency spikes, either if critical sections are unsuitably long or if (some of the) higher-priority tasks are particularly latency-sensitive. For example, consider the scenario illustrated in Figure 7, which is similar to the one depicted in Figure 4, with the exception that another high-priority task with a tight relative deadline of only two time units has been introduced on processor 1. Since this task has very little tolerance for any kind of delay, it is clearly infeasible to just turn the lock holder’s request into a non-preemptive section, as the critical section is “too long” relative to the new task’s latency tolerance, as shown in Figure 7. However, not doing anything is also not a viable option, since then the preemption of the lock holder would transitively cause the blocked remote task to miss its deadline at time 16, similarly to the scenario shown in Figure 3.
3.4 Priority Inheritance
Since it is hardly a new observation that “long” non-preemptive sections are problematic in the presence of tight latency constraints, better solutions have long been known in the uniprocessor case, namely the classic priority inheritance and priority-ceiling protocols [SRL:90]. Unfortunately, these approaches do not transfer well to the multiprocessor case, in the sense that they are not always effective w.r.t. bounding the maximum duration of pi-blocking.
Priority inheritance is a good match for global scheduling, and is indeed used in multiprocessor real-time locking protocols for global scheduling (as discussed in Sections 5.1.2 and 5.2.2). Recall that, with priority inheritance, a lock-holding task’s effective priority is the maximum of its own base priority and the effective priorities of all tasks that are waiting to acquire a lock that it currently holds [SRL:90]. Figure 8, which shows the same scenario as Figure 2, illustrates how this rule is effective under global scheduling: with the priority inheritance rule in place, the lock holder remains scheduled at time 4 when the higher-priority task is released, since it inherits the priority of the blocked task, the maximum priority in the system, for the duration of the blocking. As a result, an unrelated lower-priority task is preempted instead, similar to the eager preemption policy in the case of non-preemptive sections as illustrated in Figure 5. (To date, no preemption rule analogous to the lazy preemption policy discussed in Section 3.3 has been explored in the context of priority inheritance.) Again, this highlights that the progress mechanisms used to mitigate unbounded priority inversions are themselves a source of bounded priority inversions, which must be carefully taken into account during blocking analysis.
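The inheritance rule itself is a simple recursive maximum; the following Python sketch (with hypothetical names, using higher numbers for higher priorities) states it directly:

```python
def effective_priority(base_prio, waiters_eff_prios):
    """Priority inheritance [SRL:90]: a lock holder's effective priority
    is the maximum of its own base priority and the effective priorities
    of all tasks currently waiting on locks it holds. In this sketch a
    higher number denotes a higher priority, and waiters_eff_prios are
    the (already computed) effective priorities of the waiters, which
    makes the rule transitive across chains of nested requests."""
    return max([base_prio] + list(waiters_eff_prios))

# A low-priority lock holder (base 2) blocked on by tasks with effective
# priorities 5 and 3 runs with effective priority 5.
assert effective_priority(2, [5, 3]) == 5
```

With no waiters, the effective priority is simply the base priority, so the mechanism is purely reactive, as noted in Section 3.7.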
Unfortunately, priority inheritance [SRL:90] works only under global scheduling: it is ineffective (from the point of view of worst-case blocking analysis) when applied across cores (respectively, clusters) under partitioned (respectively, clustered) scheduling. The reason can be easily seen in Figure 3: even if the priority inheritance rule is applied across cores, the lock holder’s effective priority is merely raised to that of the blocked remote task, which does not prevent the preemption by the local higher-priority task at time 3 (recall that tasks are indexed in order of strictly decreasing priority). For the same reason, classic ceiling-based protocols like the PCP [SRL:90] and the SRP [B:91] are also ineffective: given that the preempting task does not access the shared resource, the ceiling priority of the shared resource is lower than the preempting task’s priority. Fundamentally, the root cause is that numeric priority values are, analytically speaking, incomparable across processor (respectively, cluster) boundaries since partitions (respectively, clusters) are scheduled independently.
3.5 Allocation Inheritance
The solution to this problem is an idea that has appeared several times in different contexts and under various names: spinning processor executes for preempted processors (SPEPP) [TS:97], local helping [HP:01, HH:01], allocation inheritance [HA:02a, H:04, HA:06], multiprocessor bandwidth inheritance [FLC:10, FLC:12], and migratory priority inheritance [BB:12, B:13]. The essential common insight is that a preempted task’s critical section should be completed using the processing capacity of cores on which the blocked tasks would be allowed to run (if they were not blocked).
That is, a blocking task should not only inherit a blocked task’s priority, but also the “right to execute” on a particular core, which serves to restore analytical meaning to the inherited priority: to obtain a progress guarantee, a preempted lock holder must migrate to the core where the pi-blocking is incurred. Intuitively, Definition 1 implies that, if a task incurs pi-blocking, then its priority is sufficiently high to ensure that a lock holder inheriting that task’s priority can be scheduled on the blocked task’s processor (or in its cluster), since the fact that the task incurs pi-blocking indicates the absence of runnable higher-priority jobs of tasks assigned to its processor or cluster (recall Clause 3 in Definition 1).
An example of this approach is shown in Figure 9, which shows the same scenario involving a latency-sensitive task previously shown in Figure 7. In contrast to the example in Figure 7, at time 1.5, when the latency-sensitive task is activated, the lock-holding task is preempted (just as it would be in the case of priority inheritance). However, when the remote high-priority task blocks on the contested resource at time 3, the lock holder inherits the right to use the blocked task’s priority on the blocked task’s assigned processor (which is processor 2, whereas the lock holder is assigned to processor 1). Consequently, the lock holder migrates from processor 1 to processor 2 to continue its critical section. When it finishes its critical section at time 4.5, it ceases to inherit the priority and right to execute on processor 2 and thus cannot continue to execute there. It hence migrates back to processor 1, where it continues its execution at time 18, once the local higher-priority work completes. Overall, the latency-sensitive task suffers no latency penalty when it is released at time 1.5, but the blocked remote task also suffers no undue delays while waiting for the shared resource to be released.
As already mentioned, several different names have been used in the past to describe progress mechanisms based on this principle. We adopt the term “allocation inheritance” [HA:02a, H:04, HA:06] since it clearly describes the idea that processor time originally allocated to a blocked task is used towards the completion of a blocking task’s critical section. The name also highlights the fact that this approach is a generalization of the classic priority inheritance idea. In fact, under event-driven global scheduling and on uniprocessors (the two cases where priority inheritance is effective), allocation inheritance reduces to priority inheritance since all tasks are eligible to execute on all cores anyway.
From a purely analytical point of view, allocation inheritance is elegant and highly attractive: as evident in Figure 9, it has no negative latency impact on unrelated higher-priority tasks, while ensuring guaranteed progress, thus restoring the strong analytical foundation offered by priority inheritance on uniprocessors. Put differently, allocation inheritance is the natural multiprocessor extension of classic priority inheritance that allows the idea to work under any multiprocessor scheduling approach.
However, from a practical point of view, it can be difficult to support allocation inheritance efficiently: either it introduces task migrations (and the associated kernel complexities and cache overheads) into partitioned systems that otherwise would need none, or there must be some other, resource-specific way for remote processors to continue (or safely duplicate) the operation that the preempted task was trying to accomplish [TS:97, BW:13a], which can be difficult (or even impossible) to achieve for certain kinds of resources (e.g., hardware resources such as I/O ports). In particular, if the latter approach is feasible (i.e., helping operations to complete without requiring a complete task migration), it is usually also possible to implement completely lock- or even wait-free solutions (e.g., [AJR:97, R:97]), which can be an overall preferable solution in such cases [BCBL:08]. We discuss allocation inheritance, protocols built on top of it, and practical implementations in Sections 7 and 10.3.
3.6 Priority Boosting
The most commonly used progress mechanism is priority boosting, which is conceptually quite similar to non-preemptive sections and also a much older idea than allocation inheritance. Simply put, priority boosting requires that each critical section (pertaining to a global resource) is executed at a boosted priority that exceeds the maximum regular (i.e., non-boosted) scheduling priority of any task. As a result, newly released jobs, which do not yet hold any resources, cannot preempt critical sections, just as with non-preemptive sections. In fact, applying priority boosting to the examples shown in Figures 2 and 3 would yield exactly the same schedules as shown in Figures 5 and 4, respectively. However, in contrast to non-preemptive sections, tasks remain preemptive in principle, and since other critical sections may be executed with even higher boosted priorities, it is possible that a task executing a critical section may be preempted by another task also executing a critical section (pertaining to a different resource). In essence, priority boosting establishes a second priority band on top of regular task priorities that is reserved for lock-holding tasks.
Priority boosting is easy to support in an RTOS, and easy to emulate in user-mode frameworks and applications if not explicitly supported by the RTOS. It can also be considered the “original” progress mechanism, as its use (in uniprocessor contexts) was already suggested by multiple authors in 1980 [L:80, LR:80], and because the first two multiprocessor real-time locking protocols backed by analysis, the DPCP [RSL:88] and the MPCP [R:90], rely on it. However, while it is conveniently simple, priority boosting also comes with major latency penalties similar to non-preemptive sections, which limits its applicability in systems with tight latency constraints.
3.7 Restricted Priority Boosting
Priority boosting as described so far, and as used in the DPCP [RSL:88], MPCP [R:90], and many other protocols, is unrestricted, in the sense that it applies to all tasks and critical sections alike, regardless of whether or not a lock-holding task is actually causing some other task to incur pi-blocking. In contrast, both priority and allocation inheritance kick in only reactively [SVBD:14], when contention leading to pi-blocking is actually encountered at runtime.
This unrestricted nature of priority boosting can be problematic from an analytical point of view since it can result in a substantial amount of unnecessary pi-blocking. To overcome this limitation, several restricted (i.e., selectively applied) versions of priority boosting have been derived in work on locking protocols that ensure asymptotically optimal pi-blocking bounds, such as priority donation [BA:11, B:11, BA:13], restricted segment boosting [B:14b], and replica-request priority donation [WEA:12]. We will discuss these more sophisticated progress mechanisms in the context of the protocols in which they were first used in Sections 5.1, 5.2, and 8.2.
3.8 Priority Raising
As a final consideration, one can pragmatically devise a rule to the effect that critical sections are executed unconditionally at an elevated priority that, in contrast to priority boosting, is not necessarily higher than the maximum regular scheduling priority, but still higher than most regular scheduling priorities. The intended effect is that tasks with “large” WCETs cannot preempt critical sections, whereas lightweight latency-sensitive tasks (e.g., critical interrupt handlers) with minuscule WCETs are still permitted to preempt critical sections without suffering any latency impact.
For example, consider the schedule shown in Figure 10, which shows the same scenario previously depicted in Figures 7 and 9. Suppose the critical-section priority for the shared resource is chosen to be higher than the priorities of the regular tasks, but below the priority of the latency-sensitive task. As a result, when the latency-sensitive task is activated at time 1.5, it simply preempts the in-progress critical section, which transitively causes some pi-blocking for the blocked remote task. However, this extra delay is small relative to the critical-section length and the execution requirement of the preempting task. Importantly, when a regular higher-priority task is released at time 5, it is not able to preempt the in-progress critical section, which ensures that the blocked remote task does not suffer an excessive amount of pi-blocking. Again, this causes some pi-blocking to the newly released task, which however is minor relative to its own WCET.
To reiterate, this pragmatic scheme avoids both the need to migrate tasks (i.e., allocation inheritance) and the latency degradation due to non-preemptive sections and priority boosting by raising the priority of short critical sections so that it exceeds the priorities of all tasks with “substantial” WCETs, while also keeping it below that of latency-sensitive tasks. Since latency-sensitive tasks must have relatively high priorities anyway, and since latency-sensitive tasks usually do not have large WCETs in practice, this scheme has the potential to work for a wide range of workloads. However, it violates a strict interpretation of the common notion of “bounded priority inversions” since now the pi-blocking bounds of certain remote tasks depend on the WCETs of high-priority latency-sensitive tasks (e.g., in Figure 10, the pi-blocking bound of the blocked task on processor 2 depends on the WCET of the latency-sensitive task on processor 1). The “priority raising” approach is thus not frequently considered in the academic literature.
This concludes our discussion of progress mechanisms. We next start our review of multiprocessor locking protocols and begin with spin-lock protocols, as they are conceptually simpler and easier to analyze than semaphore protocols.
4 Spin-Lock Protocols
The distinguishing characteristic of a spin lock is that a waiting task does not voluntarily yield its processor to other tasks. That is, in contrast to suspension-based locks, spin locks do not cause additional context switches. (Based on this definition, we do not consider approaches such as “virtual spinning” [LNR:09], discussed in Section 5.2, to constitute proper spin locks, precisely because under “virtual spinning” tasks still self-suspend upon encountering contention.) However, the use of spin locks does not necessarily imply the absence of preemptions altogether; preemptable spin locks permit regular preemptions (and thus context switches) as required by the scheduler during certain phases of the protocol. Such preemptions, however, do not constitute voluntary context switches, and a task that is preempted while spinning does not suspend; rather, it remains ready in the scheduler’s run-queue.
All protocols considered in this section force the uninterrupted execution of critical sections, by means of either non-preemptive execution or priority boosting, which simplifies both the analysis and the implementation. There also exist spin-based protocols under which tasks remain preemptable at all times (i.e., that allow tasks to be preempted even within critical sections); such protocols usually require allocation inheritance and are discussed in Section 7.
In the analysis of spin locks, it is important to clearly distinguish between two different cases of “blocking,” as illustrated in Figure 11. First, since jobs execute the protocol in part or in whole non-preemptively, higher-priority jobs that are released while a lower-priority job is executing a spin-lock protocol can be delayed. As already discussed in Section 3.3, this is a classic priority inversion due to non-preemptive execution as covered by Definition 1; recall that this kind of delay is called pi-blocking.
Definition 1, however, does not cover the delay that jobs incur while spinning because it applies only to jobs that are not scheduled, whereas spinning jobs are being delayed despite being scheduled. To clearly distinguish this kind of delay from pi-blocking, we refer to it as spin blocking (s-blocking) [B:11]. Note that, if jobs spin non-preemptably, then the s-blocking incurred by a lower-priority job can transitively manifest as pi-blocking experienced by a local higher-priority job. A sound blocking analysis must account for both types of blocking.
4.1 Spin-Lock Protocols for Partitioned Scheduling
The first spin-lock protocol backed by a sound blocking and schedulability analysis is due to GLD:01 [GLD:01], who presented the now-classic Multiprocessor Stack Resource Policy (MSRP) in 2001. While a number of authors had previously explored real-time and analysis-friendly spin-lock implementations (which we discuss in Section 10.1), GLD:01 were the first to leverage a particular class of spin-lock implementations, namely non-preemptive FIFO spin locks, to arrive at an analytically sound protocol (i.e., a set of rules that determine how such locks should be used in a real-time system) and corresponding blocking and schedulability analyses.
The MSRP is a protocol for partitioned scheduling and can be used with both FP and EDF scheduling (on each core). Like earlier suspension-based protocols for partitioned scheduling [RSL:88, R:90, R:91], the MSRP distinguishes between local and global resources (recall Section 2.1). Local resources are dealt with by means of the classic SRP [B:91] and are of no concern here.
Global resources are protected with non-preemptive FIFO spin locks. When a task seeks to acquire a shared resource, it first becomes non-preemptable (which resolves any local contention) and then executes a spin-lock algorithm that ensures that conflicting requests by tasks on different processors are served in FIFO order. The exact choice of FIFO spin-lock algorithm is irrelevant from a predictability point of view; many suitable choices with varying implementation tradeoffs exist (see Section 10.1).
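One classic FIFO spin-lock implementation is the ticket lock; the following Python sketch (hypothetical names; a real RTOS would use an atomic fetch-and-increment and a genuine preemption-disable primitive) models the MSRP rule of becoming non-preemptable before joining the global FIFO queue:

```python
import itertools

class TicketLock:
    """A FIFO (ticket) spin lock: requests are served strictly in the
    order in which tickets are drawn, one possible realization of the
    FIFO-ordered spin locks assumed by the MSRP."""
    def __init__(self):
        self.next_ticket = itertools.count()  # stands in for fetch-and-inc
        self.now_serving = 0

    def lock(self):
        my_ticket = next(self.next_ticket)
        while self.now_serving != my_ticket:  # busy-wait (spin)
            pass

    def unlock(self):
        self.now_serving += 1                 # hand over to next in FIFO order

def msrp_acquire(task, lock):
    # MSRP rule: first become non-preemptable (resolving local contention),
    # then contend for the global resource in FIFO order.
    task['preemptable'] = False               # stand-in for preempt_disable()
    lock.lock()

def msrp_release(task, lock):
    lock.unlock()
    task['preemptable'] = True                # stand-in for preempt_enable()
```

In a real implementation the preemption-disable and the ticket draw are the only scheduler-visible steps, which is why the MSRP decouples so cleanly from the scheduler.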
Figure 12 depicts an example MSRP schedule in which one task’s request is satisfied after another’s because the latter issued its request slightly earlier. The highest-priority task incurs considerable pi-blocking upon its release at time 2 since a local lower-priority task is spinning non-preemptively: the highest-priority task is transitively blocked by the remote critical sections while the local spinning task incurs s-blocking.
The fact that tasks remain continuously non-preemptable both while busy-waiting and when executing their critical sections has several major implications. In conjunction with the FIFO wait-queue order, it ensures that any task gains access to a shared resource after being blocked by at most m - 1 critical sections, where m in this context denotes the number of processors on which tasks sharing the given resource reside. This property immediately ensures starvation freedom and provides a strong progress guarantee that allows for a simple blocking analysis [GLD:01]. However, as apparent in Figure 12, it also causes latency spikes, which can be highly undesirable in hard real-time systems [C:93, TS:94, TS:96, T:96, TS:97, B:13] and which implies that the MSRP is suitable only if it can be guaranteed that critical sections are short relative to latency expectations, especially on platforms with large core counts.
From an implementation point of view, simply making the entire locking protocol non-preemptive is attractive because it completely decouples the scheduler implementation from locking protocol details, which reduces implementation complexity, helps with maintainability, and lessens overhead concerns. In fact, of all protocols discussed in this survey, the MSRP imposes the least implementation burden and is the easiest to integrate into an OS.
GLD:01 presented a simple blocking analysis of the MSRP [GLD:01], which relies on execution-time inflation and bounds pi-blocking and s-blocking in a bottom-up fashion. First, GLD:01’s analysis considers each critical section in isolation and determines the maximum s-blocking that may be incurred due to the specific, single critical section, which is determined by the sum of the maximum critical section lengths (pertaining to the resource under consideration) on all other processors. A task’s cumulative s-blocking is then simply bounded by the sum of the per-request s-blocking bounds. Finally, a task’s inflated WCET is the sum of its cumulative s-blocking bound and its original (i.e., non-inflated) WCET. Existing schedulability analysis can then be applied to the inflated WCET bounds.
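The bottom-up inflation just described amounts to simple arithmetic; the following Python sketch (with hypothetical names and made-up numbers, not taken from [GLD:01]) illustrates it:

```python
def inflated_wcet(wcet, requests, max_remote_cs):
    """Inflation-based s-blocking bound in the style of GLD:01's analysis:
    each request for a resource q is charged the sum, over all other
    processors, of the maximum critical-section length for q on that
    processor; the cumulative s-blocking bound is the sum over all of the
    task's requests and is added to the task's WCET.
    requests: list of resource ids issued by one job of the task.
    max_remote_cs: resource id -> per-remote-processor maximum
    critical-section lengths for that resource."""
    s_blocking = sum(sum(max_remote_cs[q]) for q in requests)
    return wcet + s_blocking

# Example: two requests for q1, two remote processors with maximum
# critical-section lengths 3 and 2 for q1, and a WCET of 7:
# each request is charged 3 + 2 = 5, so the inflated WCET is 7 + 2*5 = 17.
assert inflated_wcet(7, ['q1', 'q1'], {'q1': [3, 2]}) == 17
```

Standard (uniprocessor-style) schedulability tests are then applied to the inflated bounds, which is exactly the structural pessimism criticized below: every blocking request is charged the full per-processor maxima.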
While GLD:01’s original analysis [GLD:01] is simple and convenient, it comes with substantial structural pessimism. In particular, a major source of pessimism is that the maximum critical section lengths (on remote cores) are over-represented in the final bound. Effectively, every blocking remote critical section is assumed to exhibit the length of the longest critical section, which is clearly pessimistic for shared resources that support multiple operations of varying cost. For example, consider a producer-consumer scenario where multiple tasks share a sorted list of work items ordered by urgency that may include up to 100 elements, and suppose the list is accessed by two types of critical sections: (i) removal of the first element and (ii) an order-preserving insertion. When performing a blocking analysis, it is clearly undesirable to account for each (cheap) removal as if it were a (costly) insertion.
To lessen the impact of this source of pessimism, a holistic blocking analysis of the MSRP was developed [B:11]. A holistic analysis considers all critical sections of a task together and directly derives an overall per-task s-blocking bound, rather than attributing s-blocking to individual critical sections and then summing up the individual bounds. Holistic blocking analysis is effective in avoiding the over-counting of long critical sections in the analysis of a single task. However, because it still proceeds on a task-by-task basis and relies on inflated execution times to account for transitive s-blocking, the holistic approach [B:11] still over-estimates the impact of long critical sections if multiple local tasks access the same global resource, because in such a case the longest critical section is reflected in multiple inflated WCETs. In fact, WB:13a showed that any blocking analysis that relies on execution-time inflation is asymptotically sub-optimal due to this fundamental structural pessimism [WB:13a].
To avoid such structural pessimism altogether, WB:13a developed a novel MILP-based blocking analysis for spin-lock protocols under P-FP scheduling that by design prevents any critical sections from being accounted for more than once in the final aggregate pi- and s-blocking bound [WB:13a]. Crucially, WB:13a’s analysis does not rely on execution time inflation to implicitly account for transitive s-blocking delays (i.e., lower-priority tasks being implicitly delayed when local higher-priority tasks spin). Rather, it explicitly models both transitive and direct s-blocking, as well as pi-blocking, and directly derives a joint bound on all three kinds of delay by solving a MILP optimization problem [WB:13a]. BB:16 [BB:16] recently provided an equivalent analysis for P-EDF scheduling.
Overall, if critical sections are relatively short and contention does not represent a significant schedulability bottleneck, then GLD:01’s original analysis [GLD:01] or the holistic approach [B:11] are sufficient. However, in systems with many critical sections, high-frequency tasks, or highly heterogeneous critical section lengths, WB:13a’s and BB:16’s LP-based approaches are substantially more accurate [WB:13a, BB:16]. In his thesis, W:18 discusses further optimizations, including a significant reduction in the number of variables and a way to convert the MILP-based analysis into an LP-based approach without loss of accuracy [W:18].
4.1.1 Non-FIFO Spin Locks
While FIFO-ordered wait queues offer many advantages—chief among them starvation freedom, ease of analysis, and ease of implementation—there exist situations in which FIFO ordering is not appropriate or not available.
For instance, if space overheads are a pressing concern (e.g., if there are thousands of locks), or in systems with restrictive legacy code constraints, it may be necessary to resort to unordered spin locks such as basic test-and-set (TAS) locks, which can be realized with a single bit per lock. In other contexts, for example given workloads with highly heterogeneous timing requirements, it can be desirable to use priority-ordered spin locks, to let urgent tasks acquire contended locks more quickly.
Implementation-wise, the MSRP design—SRP for local resources, non-preemptive spin locks for global resources—is trivially compatible with either unordered or priority-ordered locks (instead of FIFO-ordered spin locks). The resulting analysis problem, however, is far from trivial and does not admit a simple per-resource approach as followed in GLD:01’s original analysis of the MSRP [GLD:01]. This is because individual critical sections are, in the worst case, subject to starvation effects, which can result in prolonged s-blocking. Specifically, in priority-ordered locks, a continuous stream of higher-priority critical sections can delay a lower-priority critical section indefinitely, and in unordered locks, any request may starve indefinitely as long as there is contention.
The lack of a strong per-request progress guarantee fundamentally necessitates a “big picture” view as taken in the holistic approach [B:11] to bound cumulative s-blocking across all of a job’s critical sections. Suitable analyses for priority-ordered spin locks were proposed by NE:12 [NE:12] and WB:13a [WB:13a].
Additionally, WB:13a also proposed the first analysis applicable to unordered spin locks [WB:13a]. While unordered locks are traditionally considered to be “unanalyzable” and thus unsuitable for real-time systems, WB:13a made the observation that unordered spin locks are analytically equivalent to priority-ordered spin locks if each task is analyzed assuming that all local tasks issue requests with the lowest-possible priority whereas all remote tasks issue high-priority requests, which maximizes the starvation potential.
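WB:13a’s worst-case reduction can be stated as a tiny helper that assigns per-task analysis priorities; this is a sketch, and the task encoding (dicts with 'id' and 'cpu' keys) is hypothetical.

```python
def analysis_request_priorities(task_under_analysis, tasks):
    """Sketch of WB:13a's reduction: to analyze an unordered spin lock
    for a given task, treat it as a priority-ordered lock in which all
    local tasks (same CPU) issue requests at the lowest priority and all
    remote tasks at the highest, which maximizes the starvation potential."""
    return {t['id']: ('low' if t['cpu'] == task_under_analysis['cpu'] else 'high')
            for t in tasks}
```

Any blocking bound derived under this adversarial priority assignment is then also valid for the unordered lock, since no actual interleaving can starve the analyzed task more severely.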
4.1.2 Preemptable Spinning
The most severe drawback of non-preemptive FIFO spin locks (and hence the MSRP) is their latency impact [C:93, TS:94, TS:96, T:96, TS:97, B:13]. One approach to lessen this latency penalty without giving up too much desirable simplicity is to allow tasks to remain preemptable while busy-waiting, as illustrated in Figure 13, while still requiring critical sections to be executed non-preemptively (to avoid the lock-holder preemption problem). This has the benefit that preemptions are delayed by at most one critical section—the latency impact is reduced to a single critical section length—which greatly improves worst-case scalability [C:93, TS:94, TS:96, T:96, TS:97].
Preemptable spinning poses two major challenges. The first challenge is the implementation: while preemptable spinning can be easily integrated into unordered TAS locks, it requires substantially more sophisticated algorithms to realize FIFO- or priority-ordered spin locks that support preemptable spinning [C:93, TS:94, TS:97, AJJ:98, HA:02a]. The reason is that a job that is preempted while busy-waiting must be removed from the spin queue, or marked as preempted and skipped, to avoid granting the lock to a currently preempted job (and hence re-introducing the lock-holder preemption problem).
As a result, preemptable spinning poses analytical problems, which is the second major challenge. When a job continues execution after having been preempted, it may find itself dequeued from the spin queue—that is, after a preemption, a job may find that its lock request was canceled while it was preempted—to the effect that it must re-issue its canceled lock request, which carries the risk of encountering additional contention. Furthermore, in the worst case, the job may be preempted again while waiting for its re-issued lock request to be satisfied, which will necessitate the request to be re-issued once more, and so on.
The additional delays that arise from the cancellation of preempted requests hence cannot be analyzed on a per-request basis and inherently require a holistic analysis approach that avoids execution-time inflation. Appropriate blocking analysis for preemptable FIFO- and priority-ordered spin locks, as well as preemptable unordered spin locks, was first proposed by WB:13a for P-FP scheduling [WB:13a], and recently extended to P-EDF scheduling by BB:16 [BB:16]. ADJM:14 [ADJM:14] also proposed a protocol based on FIFO spin locks and preemptable spinning.
In systems with quantum-driven schedulers, where the scheduler preempts jobs only at well-known times, and not in reaction to arbitrarily timed events, the cancellation penalty is reduced because no request must be re-issued more than once if critical sections are reasonably short [AJJ:98]. As a result, it is possible [AJJ:98] to apply execution-time inflation approaches similar to the original MSRP analysis [GLD:01]. Nonetheless, since every job release causes at most one cancellation (in the absence of self-suspensions) [WB:13a], and since critical sections typically outnumber higher-priority job releases, it is conceptually still preferable to apply more modern, inflation-free analyses [WB:13a, BB:16] even in quantum-driven systems.
4.1.3 Spin-Lock Protocols based on Priority Boosting
Exploring a different direction, A:16 [A:16] introduced a variation of the preemptable spinning approach in which jobs initially remain preemptable even when executing critical sections, but become priority-boosted as soon as contention is encountered. A:16’s protocol is based on the following observation: if non-preemptive execution is applied unconditionally (as in the MSRP [GLD:01]), then lock-holding jobs cannot be preempted and thus cause a latency impact even if there is actually no contention for the lock (i.e., even if there is no lock-holder preemption problem to mitigate). Since in most systems lock contention is rare, the latency impact of non-preemptive execution, though unavoidable from a worst-case perspective, may be undesirable from an average-case perspective.
As an alternative, A:16 thus proposed the Forced Execution Protocol (FEP), which uses on-demand priority boosting (instead of unconditional non-preemptive execution) to reactively expedite the completion of critical sections only when remote jobs actually request a locked resource and start to spin. This design choice has the benefit of reducing the average priority inversion duration (i.e., the average latency impact), but it comes at a high price, namely increased worst-case pi-blocking, because high-priority jobs may be delayed by multiple preempted critical sections. This effect is illustrated in Figure 14, which depicts a scenario in which a single job of a higher-priority task may suffer pi-blocking due to multiple lower-priority critical sections since the FEP [A:16] allows incomplete critical sections to be preempted. Thus, if a remote task forces the completion of multiple preempted critical sections (as in Figure 14), then a higher-priority task may be repeatedly delayed due to priority boosting. In contrast, the use of non-preemptive execution in the MSRP [GLD:01] ensures that, on each processor and at any point in time, at most one critical section is in progress. In summary, allowing critical sections to be preempted, only to be forced later, makes the worst case worse. We discuss protocols that avoid this effect by using allocation inheritance rather than priority boosting in Section 7.
So far we have discussed two cases at opposite ends of the preemption spectrum: either jobs spin non-preemptively as in the MSRP, or they remain preemptable w.r.t. all higher-priority jobs while spinning. However, it is also possible to let waiting jobs spin at some other predetermined intermediate priority. ABBN:14 [ABBN:14] observed that this flexibility allows for a generalized view on both spin locks and suspension-based protocols. In particular, ABBN:14 noted that jobs that spin at maximum priority are effectively non-preemptive, whereas jobs that spin at minimum priority are—from an analytical point of view—essentially suspended, in the sense that they neither prevent other jobs from executing nor issue further lock requests.
ABBN:14 combined this observation with (unconditional) priority boosting and FIFO-ordered spin locks into a flexible spin-lock model (FSLM) [ABBN:14, A:17] that can be tuned [ABBN:17] to resemble either the MSRP [GLD:01], FIFO-ordered suspension-based protocols like the partitioned FMLP for long resources [BLBA:07] (discussed in Sections 5.1.3 and 5.2.3), or some hybrid of the two. Notably, in ABBN:14’s protocol, requests of jobs that are preempted while spinning are not canceled, which allows for multiple critical sections to be simultaneously outstanding on the same core. As just discussed in the context of A:16’s FEP [A:16], increasing the number of incomplete requests that remain to be boosted at a later time increases the amount of cumulative pi-blocking that higher-priority jobs may be exposed to.
For example, Figure 15 depicts a schedule illustrating ABBN:14’s protocol in which three tasks spin at their regular priority. Jobs of all three tasks block on a resource held by a job of a fourth, lower-priority task, and thus when that job releases the lock, the highest-priority task suffers pi-blocking due to the back-to-back execution of three priority-boosted critical sections. This highlights that indiscriminately lowering the priority at which tasks spin is not always beneficial; rather, a good tradeoff must be found [ABBN:17]. For instance, in the scenario shown in Figure 15, the back-to-back pi-blocking of the highest-priority task could be avoided by letting all three spinning tasks spin at its priority.
ABBN:18 [ABBN:18] study the spin priority assignment problem under the FSLM approach in depth and introduce a method for choosing elevated spin priorities that dominates unrestricted priority boosting (i.e., it is not always best to spin at the maximum priority) [ABBN:18]. Furthermore, ABBN:18 show how to configure the FSLM such that memory requirements are not much worse than under the MSRP (the MSRP requires only one process stack per core, the FSLM can be configured to require at most two stacks per core) [ABBN:18].
4.1.4 Non-Preemptive Critical Sections with Allocation Inheritance
It is in fact possible to achieve universally low blocking bounds without requiring workload-specific parameters. Exploring an unconventional alternative to classic spin locks, TS:97 [TS:97] proposed an elegant protocol that achieves preemptable spinning with O(1) maximum pi-blocking, FIFO ordering of critical sections, and s-blocking bounds just as good as those provided by the MSRP. TS:97’s solution, called the Spinning Processor Executes for Preempted Processors (SPEPP) protocol [TS:97], is based on the idea that the processors of blocked jobs should work towards the completion of blocking critical sections, rather than just “wasting” cycles in a spin loop.
More specifically, in a regular FIFO-ordered spin lock, a job enqueues itself in a spin queue, busy-waits, and then executes its own critical section when it finally holds the lock. TS:97’s SPEPP protocol changes this as follows. A job first enqueues the operation that it intends to carry out on the shared object (as well as any associated data, i.e., in programming language terms, a closure) in a wait-free FIFO queue, and then proceeds to acquire the actual lock. As it acquires the lock, the job becomes non-preemptable and proceeds to dequeue and execute operations from the FIFO-ordered operations queue until its own operation has been completed. Crucially, whenever the job finishes an operation, it checks for deferred preemptions and interrupts, and releases the lock and becomes preemptable if any are pending. As a result, maximum pi-blocking is limited to the length of one operation (i.e., one critical section length) [TS:97], and the time that jobs spend being s-blocked is used to complete the operations of preempted jobs, which is an instance of the allocation inheritance principle (as discussed in Section 3.5).
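The helping loop at the core of this scheme might be sketched as follows. This is a deliberately simplified model: it ignores non-preemptive execution, the per-operation preemption check, and the slot-stealing rule discussed below, and all names are illustrative (a real implementation would use a wait-free queue and atomic operations).

```python
from collections import deque

class SPEPPObject:
    """Sketch of the SPEPP idea: waiters enqueue closures describing
    their intended operation, and the current lock holder executes
    queued operations (its own and those of preempted jobs) in FIFO
    order until its own operation has completed."""

    def __init__(self, state):
        self.state = state   # the shared object protected by the lock
        self.ops = deque()   # FIFO queue of (job_id, closure) pairs
        self.done = {}       # job_id -> result of the completed operation

    def submit(self, job_id, closure):
        # Step 1: enqueue the operation (in programming language terms,
        # a closure over the shared object) before contending for the lock.
        self.ops.append((job_id, closure))

    def acquire_and_help(self, my_id):
        # Step 2: executed by the job that holds the actual lock. It
        # drains operations in FIFO order, thereby completing the work
        # of preempted jobs instead of merely spinning.
        while my_id not in self.done:
            job_id, closure = self.ops.popleft()
            self.done[job_id] = closure(self.state)
```

Note that results are published per job, so a preempted job whose operation was executed on its behalf simply picks up its result when it resumes.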
An interesting corner case occurs when both a preempted and the preempting job seek to access the same resource. In this case, simply following the above protocol (i.e., if the preempting job just appends its operation) could lead to the buildup of long queues, which would result in excessively pessimistic s-blocking bounds (i.e., in the worst case, s-blocking linear in the number of tasks n). TS:97 devised a better solution: by letting the preempting job steal the preempted job’s slot in the queue, at most m operations are enqueued at any time [TS:97]. As this effectively cancels the preempted job’s request, it must re-issue its request when it resumes. However, in contrast to the preemptable spin locks discussed in Section 4.1.2, this does not cause additional s-blocking—as TS:97 argue, any additional s-blocking incurred by the preempted job upon re-issuing its request is entirely offset by a reduction in the s-blocking incurred by the preempting job (which benefits from stealing a slot that has already progressed through the FIFO queue). As a result, there is no additional net delay: TS:97’s SPEPP protocol ensures O(1) maximum pi-blocking with O(m) maximum s-blocking [TS:97].
4.2 Spin-Lock Protocols for Global Scheduling
HA:02 [HA:02, H:04, HA:06] were the first to consider spin-based real-time locking protocols under global scheduling. In particular, HA:02 studied synchronization in systems scheduled by an optimal Pfair scheduler [BCPV:96, SA:06], and introduced the important distinction between short and long shared resources.
A shared resource is considered “short” if all critical sections that access it are (relatively speaking) short, and “long” if some related critical sections are (relatively) long, where the exact threshold separating “short” and “long” is necessarily application- and system-specific. To synchronize access to short resources, HA:02 proposed two protocols based on FIFO spin locks, which we discuss next. (For long resources, HA:02 proposed allocation inheritance and semaphore protocols, as discussed in Section 7.)
Without getting into too much Pfair-specific detail [BCPV:96, SA:06], it is important to appreciate some specific challenges posed by this optimal scheduling approach. Pfair scheduling is quantum-based, which means that it reschedules tasks regularly at all multiples of a system quantum Q. The magnitude of this system quantum typically ranges from a few hundred microseconds to a few milliseconds. As a consequence, all but the shortest jobs span multiple quanta, and thus are likely to be preempted and rescheduled multiple times during their execution.
To avoid problematic lock-holder preemptions, which are prone to occur if critical sections cross quantum boundaries, HA:02 introduced the notion of a frozen zone, or blocking zone, at the end of each scheduling quantum: if a job attempts to commence the execution of a critical section in this zone (i.e., if the next quantum boundary is less than a given threshold away in time), then its lock request is automatically blocked until the beginning of its next quantum of execution, regardless of the availability of the shared resource. If critical sections are shorter than the system quantum length Q—which is arguably the case for any reasonable threshold for “short” critical sections—then such a zone-based protocol ensures that no job is ever preempted while holding a spin lock [HA:02, H:04, HA:06].
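The zone check itself is simple. A sketch, with illustrative parameter names and time in arbitrary integer units:

```python
def may_issue_request(now, Q, zone_len):
    """Zone-based admission check (sketch): a lock request issued within
    zone_len time units of the next quantum boundary is automatically
    blocked until the job's next quantum, regardless of whether the
    requested resource is currently available."""
    time_to_boundary = Q - (now % Q)
    return time_to_boundary > zone_len
```

If zone_len is chosen no smaller than the longest critical section, a request that is admitted is guaranteed to complete before the quantum boundary, which is precisely what rules out lock-holder preemptions.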
The remaining question is then how to deal with jobs that attempted to lock an unavailable resource before the start of the automatic blocking zone, and which are still spinning at the end of the current quantum, or which are granted the requested resource inside the automatic blocking zone. HA:02 considered two solutions to this problem.
First, under HA:02’s skip protocol, such a job remains in the spin queue and retains its position, but is marked as inactive and can be skipped over by later-enqueued active jobs. When a preempted, inactive job receives the next quantum of processor service, it is reactivated and becomes again eligible to acquire the resource. Furthermore, if a job is at the head of the FIFO spin queue, then it immediately acquires the lock since all spin locks are available at the beginning of each quantum in zone-based protocols [HA:02, H:04, HA:06]. The primary advantage of HA:02’s skip protocol is that it is starvation-free since preempted jobs retain their position in the FIFO queue. However, this also has the consequence that a job is blocked by potentially n − 1 other jobs in the spin queue (i.e., every other task)—that is, unlike in the case of the MSRP [GLD:01], the number of blocking critical sections cannot be bounded by the number of processors m. Since typically m ≪ n, this leads to more pessimistic s-blocking bounds.
HA:02’s second solution, called the rollback protocol [HA:02, H:04, HA:06], restores m − 1 as a bound on the maximum number of blocking critical sections, but is applicable only under certain restrictions. Whereas the skip protocol requires only that the maximum critical section length, denoted L^max, does not exceed the quantum length Q (i.e., L^max ≤ Q), the rollback protocol further requires m · L^max ≤ Q. This constraint yields the property that, if tasks on all m processors attempt to execute a critical section at the beginning of a quantum, then each task will have finished its critical section by the end of the quantum. As a result, there is no need to maintain a preempted job’s queue position to guarantee progress, and instead a spinning job’s request is simply canceled when it is preempted at the end of a quantum (i.e., the job is removed from the spin queue). Consequently, no spin queue contains more than m jobs at any time. Furthermore, any lock attempt is guaranteed to succeed at the latest in a job’s subsequent quantum, because a preempted job immediately re-issues its request when it continues execution, and every request issued at the beginning of a quantum is guaranteed to complete before the end of the quantum since m · L^max ≤ Q [HA:02, H:04, HA:06].
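The two protocols’ preconditions and per-request blocking bounds can be contrasted in a few lines; a sketch with illustrative names and integer time units.

```python
def rollback_applicable(m, L_max, Q):
    """Rollback protocol precondition (sketch): even if tasks on all m
    processors issue a request at the start of a quantum of length Q,
    every critical section (of length at most L_max) completes before
    the quantum ends."""
    return m * L_max <= Q

def max_blocking_requests(protocol, n, m):
    """Worst-case number of critical sections a single request may wait
    behind (illustrative): under the skip protocol a job retains its
    queue slot across preemptions, so up to n - 1 other tasks may be
    queued ahead; under the rollback protocol no spin queue ever holds
    more than m jobs, so at most m - 1 requests are ahead."""
    return n - 1 if protocol == 'skip' else m - 1
```

The gap between the two bounds is exactly the m ≪ n gap discussed in the text: the rollback protocol trades a stricter length constraint for a much smaller queue.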
HA:02 [HA:02, H:04, HA:06] presented blocking analyses for both the rollback and the skip protocol. While Pfair is not widely used in practice today, HA:02’s concept of an automatic blocking zone at the end of a job’s guaranteed processor allocation has been reused in many locking protocols for reservation-based (i.e., hierarchically scheduled) systems, in both uniprocessor and multiprocessor contexts (as mentioned in Section 11.1).
In later work on non-Pfair systems, DLA:06 [DLA:06] and CDW:10 [CDW:10] studied FIFO-ordered spin locks under event-driven G-EDF and G-FP scheduling, respectively. Notably, in contrast to HA:02 [HA:02], DLA:06 [DLA:06] and CDW:10 [CDW:10] assume non-preemptable spinning as in the MSRP [GLD:01]. (Non-preemptable spinning makes little sense in a Pfair context because quantum boundaries cannot be postponed without prohibitive schedulability penalties.) Most recently, BMV:14 [BMV:14] proposed a spin-lock protocol for systems scheduled with the global RUN policy [RLML:11], another optimal multiprocessor real-time scheduler.
As already mentioned in Section 2.4, in 2007, BLBA:07 introduced the Flexible Multiprocessor Locking Protocol (FMLP) [BLBA:07], which is actually a consolidated family of related protocols for different schedulers and critical section lengths. In particular, the FMLP supports both global and partitioned scheduling (with both fixed and EDF priorities), and also adopts HA:02’s distinction between short and long resources [HA:02]. For each of the resulting four combinations, the FMLP includes a protocol variant. For short resources, the FMLP relies on non-preemptive FIFO spin locks. Specifically, for short resources under partitioned scheduling, the FMLP essentially integrates the MSRP [GLD:01], and for short resources under global scheduling, the FMLP integrates DLA:06’s proposal and analysis [DLA:06], albeit with link-based scheduling (i.e., lazy preemptions, see Section 3.3), whereas DLA:06 [DLA:06] did not specify a preemption policy. For long resources, the FMLP relies on semaphores, which we will discuss next.
5 Semaphore Protocols for Mutual Exclusion
The distinguishing characteristic of suspension-based locks, also commonly referred to as semaphores or mutexes, is that tasks that encounter contention self-suspend to yield the processor to other, lower-priority tasks, which allows wait times incurred by one task to be overlaid with useful computation by other tasks. Strictly speaking, the suspension-based locks considered in this section correspond to binary semaphores, whereas the suspension-based k-exclusion protocols discussed in Section 8.2 correspond to counting semaphores. We simply say “semaphore” when the type of protocol is clear from context.
As mentioned in Section 2.2.5, in the case of semaphores, there exist two principal ways in which critical sections can be executed: either in place (i.e., on a processor on which a task is also executing its non-critical sections), or on a dedicated synchronization processor. We focus for now on the more common in-place execution, and consider protocols for dedicated synchronization processors later in Section 6.
In principle, semaphores are more efficient than spin locks: since wait times of higher-priority jobs can be “masked” with useful computation by lower-priority jobs, no processor cycles are wasted and the processor’s (nearly) full capacity is available to the application workload. However, there are major challenges that limit the efficiency of semaphores in practice.
First, in practical systems, suspending and resuming tasks usually comes with non-negligible costs due to both OS overheads (e.g., ready queue management, invocations of the OS scheduler, etc.) and micro-architectural overheads (e.g., loss of cache affinity, disturbance of branch predictor state, etc.). Thus, if the expected wait time is shorter than the cumulative overheads of suspending, then spinning can be more efficient in practice. Whether or not runtime overheads make spinning more attractive depends on a number of factors, including the length of critical sections (relative to overhead magnitudes), the degree of contention (likelihood of spinning), and the magnitude of OS and architectural overheads. As our focus is on analytical concerns, we do not consider this aspect any further.
The second major challenge is that semaphores are subject to more intense worst-case contention because they allow other tasks to execute and issue additional lock requests while a task is waiting. That is, compared to non-preemptive spin locks, wait queues can become much longer as the number of concurrent requests for any resource is no longer implicitly upper-bounded by the number of processors (as for instance in the case of the MSRP [GLD:01], recall Section 4.1). Hence accurate blocking analysis is even more important for semaphores than for spin locks, as otherwise any practical efficiency gains are at risk of being overshadowed by analysis pessimism.
For instance, consider the following (exaggerated) illustrative example: suppose there are n tasks sharing a single resource on m processors, where n ≫ m, and that each critical section is of unit length. With non-preemptive FIFO spin locks (e.g., the MSRP [GLD:01]), the maximum spin time in any possible schedule is trivially upper-bounded by m − 1 time units, and the maximum pi-blocking time is upper-bounded by m time units [GLD:01, GNLF:03]. If we instead change the system to use FIFO semaphores (e.g., the FMLP [BLBA:07]), then it is easy to construct pathological schedules in which n − 1 tasks are simultaneously suspended, waiting to acquire the single shared resource (i.e., the maximum pi-blocking duration is lower-bounded by n − 1 time units). This places semaphore protocols at an analytical disadvantage. And while we have chosen FIFO queueing in this example for simplicity, this effect is not specific to any particular queue order; in particular, similar examples can be constructed for protocols that employ priority queues, too [BA:10].
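The arithmetic of this example can be spelled out as follows; a sketch with unit-length critical sections and illustrative function names.

```python
def fifo_spin_wait_bound(m, L=1):
    """Non-preemptive FIFO spin lock (e.g., MSRP-style): a request waits
    behind at most one request per remote processor, i.e., m - 1
    critical sections of length L."""
    return (m - 1) * L

def fifo_semaphore_wait_lower_bound(n, L=1):
    """FIFO semaphore: in a pathological schedule, all n - 1 other tasks
    can be queued ahead of a request, since waiting tasks do not occupy
    processors and may keep issuing requests while a task waits."""
    return (n - 1) * L
```

With, say, n = 64 tasks on m = 4 processors, the spin-lock wait bound is 3 time units while the semaphore queue can force a wait of 63 time units, which illustrates why accurate blocking analysis matters so much more for semaphores.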
Another complication that suspension-based locking protocols must address is that tasks are inherently not guaranteed to be scheduled when they become the lock owner. That is, if a task encounters contention and self-suspends, then it will certainly not be scheduled when it receives ownership of the lock, and worse, it may remain unscheduled for a prolonged time if higher-priority task(s) started executing in the meantime. The resulting delay poses a risk of substantial transitive pi-blocking if other blocked tasks are still waiting for the same lock. Real-time semaphore protocols hence generally require a progress mechanism that ensures that lock-holding tasks can (selectively) preempt higher-priority tasks when waking up from a self-suspension. In contrast, simple non-preemptive spin locks do not have to take resuming lock holders into account.
Finally, and most importantly, while semaphores allow wait times to be potentially overlaid with useful computation, showing that this actually happens in the worst case (i.e., showing that the processor does not just idle while tasks suspend) is not always possible. And even when it is theoretically possible, it is an analytically difficult problem that requires identifying (or safely approximating) the worst-case self-suspension pattern, which has proven to be a formidable challenge [CNHY:19].
More precisely, on multiprocessors, an accurate analysis of semaphores generally requires the use of a suspension-aware (s-aware) schedulability test, that is, an analysis that applies to a task model that incorporates an explicit bound on a task’s maximum self-suspension time. In contrast, most schedulability analyses published to date are suspension-oblivious (s-oblivious), in the sense that they make the modeling assumption that tasks never self-suspend (i.e., jobs are either ready to execute or complete).
S-oblivious schedulability analyses can still be employed if tasks (briefly) self-suspend [CNHY:19], but any self-suspensions must be pessimistically modeled as computation time during analysis (i.e., execution-time inflation must be applied). For example, if a task with WCET C_i self-suspends for at most S_i time units, then it would be modeled and analyzed as having a WCET of C_i + S_i when applying s-oblivious schedulability analysis—the task’s processor demand is safely, but pessimistically, over-approximated.
Given that the primary feature of suspension-based locking protocols is that tasks do not occupy the processor while waiting to acquire a lock, one may deem it intuitively undesirable to model and analyze suspension times as processor demand. However, s-aware schedulability analyses are unfortunately difficult to obtain and can be very pessimistic. Case in point, prior work on s-aware schedulability analysis for uniprocessor fixed-priority scheduling was found to be flawed in several instances [CNHY:19], and the best correct analyses available today are known to be only sufficient, but not exact (in contrast to response-time analysis for non-self-suspending tasks, which is exact on uniprocessors).
Another example that highlights the challenge of finding efficient s-aware schedulability analysis is G-EDF scheduling: an s-oblivious analysis of G-EDF was available [BCL:05] for several years before the first s-aware test for G-EDF [LA:13] was proposed. Furthermore, the s-aware test was actually found to be more pessimistic than a simple s-oblivious approach when applied in the context of a suspension-based locking protocol [B:14b].
Generally speaking, as a result of the challenges surrounding s-aware analysis and the pessimism present in today’s analyses, s-oblivious approaches can be competitive with s-aware analyses, and at times even yield superior performance in empirical comparisons [BA:13, B:11]. It is hence worthwhile to study both approaches, and to compare and contrast their properties.
Most importantly, the s-oblivious and s-aware approaches fundamentally differ w.r.t. the best-possible bound on cumulative pi-blocking [BA:10, B:11, BA:13]. More precisely, Definition 1 can be refined for the s-oblivious case, and with the refined definition in place (Definition 2 below), the inherently pessimistic treatment of suspensions in s-oblivious schedulability analyses allows some of this pessimism to be “recycled,” in the sense that less pessimistic assumptions have to be made when analyzing priority inversions in the s-oblivious case, which in turn allows for lower bounds on pi-blocking.
More formally, consider maximum pi-blocking, which for a given task set τ is the amount of pi-blocking incurred by the task that suffers the most from priority inversion: max_{1 ≤ i ≤ n} b_i, where b_i is the pi-blocking bound for task τ_i [BA:10]. Interestingly, the s-aware and s-oblivious analysis assumptions yield asymptotically different bounds on maximum pi-blocking [BA:10, B:11, BA:13]. Specifically, there exist semaphore protocols that ensure that any task will incur at most O(m) pi-blocking in the s-oblivious sense [BA:10, BA:13], whereas no semaphore protocol can generally guarantee pi-blocking bounds better than Ω(n) in the s-aware case [BA:10, B:14b] (recall that m denotes the number of processors and n the number of tasks, and that typically m ≪ n). In the following, we review these bounds and the protocols that achieve them. We first discuss locking protocols intended for s-oblivious analysis because they are simpler and easier to analyze, and then consider locking protocols intended for s-aware analysis thereafter.
5.1 Suspension-Oblivious Analysis of Semaphore Protocols
Under s-oblivious schedulability analysis, self-suspensions are modeled as execution time during analysis. However, at runtime, tasks of course self-suspend; the distinction between the s-oblivious and s-aware approaches is purely analytical. In fact, it is possible to analyze any protocol using either approach, and strictly speaking the protocols themselves are neither “suspension-oblivious” nor “suspension-aware.” However, certain protocols are easier to analyze, or can be more accurately analyzed, under one of the two analysis approaches, which gives rise to the commonly used terminology of s-oblivious and s-aware semaphore protocols.
In this subsection, we review s-oblivious locking protocols, i.e., semaphore protocols that are primarily analyzed using s-oblivious analysis, and that in many cases were designed specifically with s-oblivious analysis in mind.
5.1.1 Suspension-Oblivious Analysis and Blocking Optimality
The key insight underlying the analysis of s-oblivious locking protocols is that, since s-oblivious schedulability analysis pessimistically over-approximates any self-suspension times as processor demand, it is possible to reclaim some of this pessimism by refining the definition of pi-blocking (Definition 1) to account for this modeling assumption. More precisely, “suspension-oblivious pi-blocking” is defined such that any times during which both
a job is self-suspended while waiting to acquire a semaphore and
this delay can be attributed to higher-priority tasks (under the “suspended tasks create processor demand” analysis assumption)
are not counted as pi-blocking, which allows tighter pi-blocking bounds to be established (without endangering soundness of the analysis).
Intuitively, this works as follows. Consider clustered scheduling and a task τ_i assigned to a cluster consisting of c processor cores. First, suppose a job J_i of τ_i is waiting to acquire some semaphore and there are (at least) c higher-priority ready jobs, that is, J_i is self-suspended and there are higher-priority jobs occupying all c processors in τ_i’s cluster. In this situation, J_i does not incur pi-blocking according to Definition 1 (although it is waiting to acquire a semaphore) since it would not be scheduled even if it were ready, due to the presence of higher-priority ready jobs. In other words, the self-suspension does not have to be accounted for as additional delay in this case because J_i would be delayed anyway.
Now consider the following alternative scenario: J_i is self-suspended and there are c higher-priority jobs in τ_i’s cluster, but all c higher-priority jobs are also self-suspended, each waiting to acquire some semaphore (possibly, but not necessarily, the same that J_i is waiting for). That is, the higher-priority jobs in τ_i’s cluster are pending, but not ready. In this case, if J_i were ready, it would be scheduled immediately since in the real system the suspended higher-priority jobs do not occupy any processors, and hence intuitively this situation represents a priority inversion for J_i (and also according to Definition 1). However, under s-oblivious analysis, the self-suspension times of higher-priority jobs are modeled as execution time. Hence, in the analyzed model of the system, the pending higher-priority jobs are analyzed as if they were occupying all c processors, and hence J_i incurs no additional delay in the modeled situation under the s-oblivious analysis assumption. The following definition exploits this observation.
Definition 2. A job J_i of a task τ_i, assigned to a cluster C consisting of c cores, suffers s-oblivious pi-blocking at time t if and only if
(1) J_i is pending at time t,
(2) J_i is not scheduled at time t (i.e., it is self-suspended or preempted), and
(3) fewer than c equal- or higher-priority jobs of tasks assigned to J_i's cluster C are pending.
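To make the definition concrete, it can be read as a predicate over a scheduling snapshot. The following sketch is purely illustrative: the `Job` record, its field names, and the smaller-value-means-higher-priority convention are assumptions for this example, not notation from the survey.

```python
# Illustrative sketch of Definition 2 (s-oblivious pi-blocking) as a
# predicate over a snapshot of one cluster. Field names are assumptions.
from dataclasses import dataclass

@dataclass
class Job:
    priority: int        # smaller value = higher priority (assumption)
    pending: bool        # released but not yet completed
    scheduled: bool      # currently running on a core of its cluster

def s_oblivious_pi_blocked(job, cluster, c):
    """Definition 2: `job` is pending, not scheduled, and fewer than c
    equal- or higher-priority jobs of its cluster are pending."""
    if not job.pending or job.scheduled:
        return False
    competing = sum(1 for j in cluster
                    if j is not job and j.pending and j.priority <= job.priority)
    return competing < c
```

Note that clause (3) counts *pending* higher-priority jobs, regardless of whether they actually occupy a processor — this is exactly what distinguishes the s-oblivious notion from the s-aware one discussed later.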
Based on Definition 2, the suspension-oblivious schedulability analysis approach can be summarized as follows. (This description is somewhat simplified: it considers only self-suspensions and ignores priority inversions due to progress mechanisms. Any additional priority inversions, e.g., due to non-preemptive execution, are handled analogously by inflation.)
Suppose we are given a self-suspending task set τ, where each task τ_i in τ is characterized by its WCET e_i, a relative deadline d_i, a period p_i, and a bound s_i on the maximum cumulative self-suspension duration of any of its jobs.
Further suppose that b_i denotes a bound on the maximum cumulative duration of suspension-oblivious pi-blocking incurred by any of τ_i's jobs (where b_i ≤ s_i, since in this simplified setting pi-blocking arises only while a job is self-suspended).
Let τ' denote the corresponding inflated, suspension-free task set, where each e_i' = e_i + b_i.
Then the actual task set τ (with self-suspensions) does not miss any deadlines under a given preemptive JLFP policy if the inflated task set τ' is schedulable under the same JLFP policy in the absence of any self-suspensions.
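The inflation step itself is simple per-task arithmetic. The following sketch illustrates it under the assumptions above; the dictionary field names ('e', 'd', 'p', 's', 'b') are hypothetical, not the survey's notation.

```python
# Sketch of the s-oblivious inflation step: each task's WCET is inflated
# by its pi-blocking bound b_i, yielding a suspension-free task set to
# which an ordinary JLFP schedulability test can be applied.
def inflate(tasks):
    """tasks: list of dicts with WCET 'e', deadline 'd', period 'p',
    suspension bound 's', and s-oblivious pi-blocking bound 'b'
    (illustrative field names)."""
    inflated = []
    for t in tasks:
        # e_i' = e_i + b_i; deadlines and periods are unchanged
        inflated.append({'e': t['e'] + t['b'], 'd': t['d'], 'p': t['p']})
    return inflated
```

Any suspension-free schedulability test for the chosen JLFP policy is then applied to the inflated set; if it passes, the original self-suspending task set meets all deadlines as well.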
The correctness of this approach can be shown with a simple reduction (or schedule transformation) argument. Suppose a job misses a deadline in the real system, and consider the trace resulting in the deadline miss (i.e., a concrete schedule of the given task set τ). This trace consists of a set of self-suspending jobs with concrete release, execution, and self-suspension times (i.e., we discard any knowledge of locks; for this transformation, only self-suspension times are required).
Now repeatedly transform this trace as follows until no self-suspensions remain. First, for each job J_i in the trace, in order of decreasing priority, and for each point in time t at which J_i is suspended: if there are fewer than c higher-priority jobs in J_i's cluster occupying a processor at time t, then transform J_i's self-suspension at time t into execution time. Otherwise, simply discard J_i's self-suspension at time t (i.e., reduce J_i's self-suspension length by one time unit), since J_i is not scheduled at time t anyway. This step does not decrease J_i's response time, nor does it decrease the response time of any other job. After this transformation, we obtain a trace in which both (i) no job self-suspends and (ii) a deadline is still being missed. Furthermore, let τ_i denote the task of J_i: since b_i is a bound on the maximum amount of s-oblivious pi-blocking as defined by Definition 2, (iii) the number of times that an instant of self-suspension of job J_i is converted into an instant of execution is bounded by b_i.
From (i) and (iii), it follows that the transformed trace is a valid schedule of the inflated task set τ', and hence from (ii) we have that τ misses a deadline only if there exists a schedule in which τ' misses a deadline. Conversely, if it can be shown that τ' does not miss any deadlines, then τ also does not miss any deadlines.
Given Definition 2, a natural question to ask is: what is the least upper bound on maximum s-oblivious pi-blocking (i.e., the least possible b_i) that any locking protocol can guarantee in the general case? In other words, what amount of s-oblivious pi-blocking is unavoidable, in the sense that there exist pathological task sets that exhibit at least this much s-oblivious pi-blocking no matter which locking protocol is employed?
Clearly, this bound cannot be zero, as some blocking is unavoidable under any mutual exclusion scheme. It is in fact trivial to construct task sets in which a job exhibits Ω(m) s-oblivious pi-blocking under any locking protocol: if a lock is requested simultaneously by tasks on all m processors, then the m (unit-length) critical sections must be serialized in some order, and hence whichever task acquires the lock last is blocked by (at least) m − 1 critical sections (i.e., any general pi-blocking bound is necessarily linear in the number of processors to cover this scenario). If there are exactly c tasks assigned to each cluster (i.e., if there are only m tasks in total), then according to Definition 2 any self-suspension results in s-oblivious pi-blocking, and the Ω(m) lower bound trivially follows [BA:10, B:11, BA:13].
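The serialization argument can be made concrete with a toy computation. The sketch below (function name and FIFO order are assumptions for illustration) serializes simultaneous unit-length requests and reports each requester's wait time.

```python
# Toy illustration of the Omega(m) argument: m simultaneous unit-length
# requests must be served one after another, so whichever request is
# served last waits for m - 1 earlier critical sections.
def serialize_fifo(requesters, cs_len=1.0):
    """Serialize simultaneous lock requests in some (here: FIFO) order;
    returns each requester's waiting time before acquiring the lock."""
    waits = {}
    t = 0.0
    for r in requesters:
        waits[r] = t      # the lock becomes free for r at time t
        t += cs_len       # unit-length critical section
    return waits
```

Whatever order the protocol picks, some request comes last and waits for all m − 1 predecessors, which is the essence of the lower bound.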
While this Ω(m) lower bound on maximum s-oblivious pi-blocking is straightforward to establish, finding a matching upper bound is less obvious. Given that up to n jobs can simultaneously contend for the same lock, one might wonder whether an O(m) bound is even possible. However, as we review next, it is in fact possible to construct locking protocols that ensure O(m) s-oblivious pi-blocking for any sporadic task set, which establishes that, under s-oblivious schedulability analysis, Θ(m) maximum pi-blocking is fundamental [BA:10, B:11, BA:13].
5.1.2 Global Scheduling
The earliest semaphore protocols for global scheduling are due to Holman and Anderson [HA:02a, HA:02, HA:06], who introduced support for lock-based synchronization in Pfair-scheduled systems. However, due to the quantum-based nature of Pfair scheduling, their analysis is not s-oblivious in the sense of Definition 2; we hence defer a discussion of their work until Section 7.
The first multiprocessor real-time semaphore protocol explicitly studied using the s-oblivious analysis approach is Block et al.'s FMLP [BLBA:07]. As already discussed in Section 4, the FMLP is actually a family of related protocols for different scheduling approaches and incorporates both spin- and suspension-based variants. Aiming for simplicity in both implementation and analysis, the FMLP for long resources (i.e., the semaphore variant) for global scheduling combines priority inheritance (recall Section 3.4) with simple FIFO wait queues.
Priority inheritance ensures that a lock-holding job is scheduled whenever another job that it blocks is incurring s-oblivious pi-blocking. This is easy to see: under global scheduling (i.e., if there is only one cluster of size m), a job incurs s-oblivious pi-blocking only if it is among the m highest-priority pending jobs (Definition 2), and thus the lock-holding job is guaranteed to inherit a priority that allows it to be scheduled immediately [BLBA:07].
Combined with the strong progress guarantee of FIFO wait queues, the long FMLP for global scheduling ensures that a job incurs s-oblivious pi-blocking for the duration of at most n − 1 critical sections while waiting to acquire a lock. This O(n) bound shows that, while more accurate analyses taking actual request patterns into account are possible [B:11], in the general case the FMLP does not ensure asymptotically optimal maximum s-oblivious pi-blocking. In fact, no protocol relying exclusively on FIFO or priority queues can be optimal in this regard [BA:10].
The first asymptotically optimal protocol is the O(m) Locking Protocol (OMLP) [BA:10, B:11, BA:13, BA:11], which, as the name suggests, ensures O(m) maximum s-oblivious pi-blocking for any task set. Like the FMLP, the OMLP is also a family of protocols for global, partitioned, and clustered scheduling, which we review in turn.
The OMLP variant for global scheduling (i.e., the global OMLP) [BA:10] relies on priority inheritance, like the earlier global FMLP [BLBA:07]. To achieve optimality, it replaces the FMLP's simple FIFO queue with a hybrid wait queue, which consists of a bounded FIFO segment of length m and a priority-ordered tail queue that feeds into the FIFO segment.
Jobs that request the lock enqueue directly in the FIFO segment if there is space (i.e., if fewer than m jobs are contending for the resource), and otherwise in the tail queue, which is ordered by job priority. The job at the head of the FIFO segment holds the lock; when it releases the lock, it is dequeued from the FIFO segment, ownership is passed to the new head of the FIFO segment (if any), and the highest-priority job presently waiting in the tail queue (if any) is transferred to the FIFO segment.
This combination of queues ensures that a job incurs s-oblivious pi-blocking for the duration of at most 2m − 1 critical sections per lock request. Clearly, once a job enters the bounded-length FIFO segment, at most m − 1 critical sections of jobs ahead of it in the FIFO segment cause pi-blocking. Additionally, a job incurs s-oblivious pi-blocking for the cumulative duration of at most m critical sections while it waits in the priority-ordered tail queue, which follows from the following observation [BA:10]. Suppose a job J_i that is waiting in the tail queue is skipped over m times (i.e., at least m times another job is moved to the end of the FIFO segment while J_i is waiting). Since the tail queue is priority-ordered, each job that skipped ahead has a higher priority than J_i. Furthermore, since the FIFO segment has a capacity of exactly m jobs, it follows that there are m higher-priority pending jobs, which implies that J_i incurs no s-oblivious pi-blocking after it has been skipped over at least m times (recall clause (3) of Definition 2). Thus, in total, a job incurs s-oblivious pi-blocking for a cumulative duration of at most 2m − 1 critical sections while moving through both queues, which is within roughly a factor of two of the Ω(m) lower bound and thus asymptotically optimal [BA:10, B:11, BA:13]. Fine-grained (i.e., non-asymptotic) analyses of the global OMLP taking actual request patterns into account are available as well [B:11, BA:13].
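The hybrid queue logic just described can be sketched as a small data structure. This is an illustrative reconstruction of the queue discipline only (class and method names are assumptions; priorities use the smaller-value-is-higher convention), not the authors' implementation.

```python
# Sketch of the global OMLP hybrid wait queue: a FIFO segment of
# capacity m fed by a priority-ordered tail queue. The job at the head
# of the FIFO segment holds the lock. Illustrative reconstruction.
import heapq
from collections import deque

class GlobalOMLPQueue:
    def __init__(self, m):
        self.m = m
        self.fifo = deque()   # bounded FIFO segment; head = lock holder
        self.tail = []        # min-heap of (priority, job) tuples

    def request(self, prio, job):
        if len(self.fifo) < self.m:
            self.fifo.append(job)              # room in the FIFO segment
        else:
            heapq.heappush(self.tail, (prio, job))

    def release(self):
        self.fifo.popleft()                    # lock holder departs
        if self.tail:                          # promote the highest-priority
            _, job = heapq.heappop(self.tail)  # waiter from the tail queue
            self.fifo.append(job)
        return self.fifo[0] if self.fifo else None   # new lock holder
```

A job entering a full FIFO segment waits in the tail; every promotion moves the currently highest-priority tail waiter behind at most m − 1 FIFO predecessors, which is what the 2m − 1 per-request bound rests on.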
Ward [W:15] studied two variants of mutual exclusion, called preemptive mutual exclusion and half-protected resources, which are intended for synchronizing (in software) access to shared hardware resources such as memory buses and caches. Such resources have the special property that they can be revoked without risking an inconsistent state (e.g., revocation of a granted cache partition carries a performance penalty, but no consistency hazard). To analyze the resulting synchronization problem, Ward introduced idleness analysis [W:15], a new blocking analysis technique that bounds the maximum amount of idleness induced on other cores by a critical section, rather than bounding the number of waiting jobs. Idleness analysis can also be applied to the analysis of regular mutual exclusion protocols such as the OMLP or the FMLP.
5.1.3 Partitioned Scheduling
In the case of partitioned scheduling, priority inheritance is ineffective (Section 3.4), with priority boosting being the traditional alternative (Section 3.6). This choice was also adopted in the design of the long FMLP for partitioned scheduling [BLBA:07], which was the first semaphore protocol for partitioned scheduling to be analyzed under the s-oblivious approach.
Like all other FMLP variants, the long FMLP for partitioned scheduling relies on FIFO queues. One additional twist that arises in conjunction with priority boosting is that a tie-breaking policy is required to determine which job to schedule if there are multiple lock-holding jobs on the same processor. In the interest of simplicity, the FMLP favors whichever job first acquired its respective lock (i.e., “earliest-resumed job first”) [BLBA:07]. This later turned out to be a non-ideal choice in the context of suspension-aware analysis (discussed in Section 5.2.3), and was changed to an earliest-issued request first policy in the later FMLP+ [B:11].
However, under s-oblivious analysis, either tie-breaking rule is problematic, as unrestricted priority boosting generally prevents optimal s-oblivious pi-blocking (regardless of the order in which blocked jobs wait), which can be inferred from the following simple example. Consider a job J_i that is the highest-priority job on its processor, and suppose for the sake of illustration that n − 1 of the n tasks, including J_i's task, reside on J_i's processor, with the remaining task located on a second processor. Now suppose that, just before J_i is released, the remote task first acquires a shared lock, and then all other tasks on J_i's processor suspend while waiting for the same lock. If lock-holding jobs are unconditionally priority-boosted, then J_i will be preempted for the duration of one critical section of each of the other tasks on its processor as they acquire the lock in turn, which results in Ω(n) s-oblivious pi-blocking even if J_i itself does not acquire any locks. As a result of this effect, and because FIFO queues allow each request to be blocked by critical sections of all other tasks, jobs can incur s-oblivious pi-blocking of up to Ω(n) critical section lengths under the long FMLP for partitioned scheduling, which is not asymptotically optimal.
The partitioned OMLP [BA:10] solves the “too many boosted jobs” problem with a token mechanism that limits the number of tasks that may simultaneously request global resources to at most one per processor, which implicitly restricts the maximum delay due to priority boosting while also ensuring that global wait queues remain short.
Under the partitioned OMLP, there is a single contention token associated with each processor. A processor's contention token is a local (virtual) resource that is managed using an optimal uniprocessor protocol (such as the PCP [SRL:90] or the SRP [B:91]). Furthermore, each global resource is associated with a FIFO wait queue, and jobs holding (global) resources are priority-boosted, just as in the earlier FMLP. However, the key OMLP rule is that a task must hold its local contention token before it may issue a request for a global resource. As a result, at most m tasks compete for global resources at any time, which in conjunction with FIFO queues and priority boosting immediately yields a pi-blocking bound of m − 1 critical section lengths once a job holds its local contention token. Additionally, a job may incur pi-blocking while it is waiting to acquire a contention token, which, however, is also limited to m critical section lengths (including any priority-boosting effects) [BA:10].
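The effect of the token rule on global queue lengths can be illustrated with a tiny drain simulation. This is a deliberately abstract sketch (one critical section completes per step; the uniprocessor arbitration of the token itself is elided), and all names are hypothetical.

```python
# Illustrates the partitioned OMLP token rule: each processor contributes
# at most one in-flight request (its current token holder), so the global
# FIFO queue never holds more than m jobs, no matter how many tasks (n)
# want the resource. Highly simplified sketch.
def max_global_queue_length(requests_per_cpu):
    """requests_per_cpu: one entry per processor, giving how many local
    tasks want the global resource. Returns the maximum number of jobs
    ever queued for the global resource while draining all requests."""
    remaining = list(requests_per_cpu)
    longest = 0
    while any(remaining):
        # token rule: one request per processor with outstanding work
        queue = [cpu for cpu, k in enumerate(remaining) if k > 0]
        longest = max(longest, len(queue))
        remaining[queue[0]] -= 1   # FIFO head finishes; token recycles locally
    return longest
```

Even with n = 11 contending tasks spread over three processors, the global queue never exceeds m = 3 entries, which is the structural reason the per-request bound depends on m rather than n.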
As a result, the partitioned OMLP guarantees a bound on s-oblivious pi-blocking for the duration of at most m − 1 critical sections per request, plus s-oblivious pi-blocking for the duration of up to m critical sections due to competition for the local contention token and priority boosting, for a total of 2m − 1 critical section lengths (assuming a task issues only one request per job), which is within roughly a factor of two of the Ω(m) lower bound and thus asymptotically optimal [B:11, BA:13]. A fine-grained (i.e., non-asymptotic) analysis of the partitioned OMLP is available [BA:11, BA:13].
5.1.4 Clustered Scheduling
The case of “true” clustered scheduling, where there are multiple clusters (unlike the special case of global scheduling) and each cluster contains more than one processor (unlike the special case of partitioned scheduling), is particularly challenging because it combines the difficulties of both global and partitioned scheduling. In particular, priority inheritance across clusters is ineffective (Section 3.4), and priority boosting, even if restricted by a token mechanism as in the partitioned OMLP, makes it difficult to obtain asymptotically optimal s-oblivious pi-blocking bounds.
For this reason, a new progress mechanism called priority donation was developed for the clustered OMLP [BA:11, B:11, BA:13]. As already mentioned in Section 3.7, priority donation can be understood as a form of restricted priority boosting. However, the key difference to the token mechanism used in the partitioned OMLP is that there exists an explicit relationship between the lock-holding job that is forced to be scheduled (i.e., the priority recipient) and the job that is not scheduled as a result (i.e., the priority donor). In contrast, priority boosting just prescribes that a job is scheduled, but leaves unspecified which other job is not scheduled as a result, which causes analytical complications if there is more than one processor in a cluster (i.e., if there is a choice w.r.t. which job to preempt).
Priority donation maintains the invariant that, in each cluster, each lock-holding job J_i is either among the c highest-priority jobs in its cluster, or there exists a job among the c highest-priority jobs that serves as the unique and exclusive priority donor for J_i. As a result of this invariant, contention for global locks is limited to at most m concurrent lock requests, and lock-holding jobs are guaranteed to make progress towards releasing the lock (i.e., lock holders are never preempted).
A job can become a priority donor only once, immediately upon its release (i.e., before it starts executing). While it serves as a priority donor, it is suspended to make a processor available for the priority recipient, and thus incurs s-oblivious pi-blocking. Priority donation ceases when the critical section of the priority recipient ends. The maximum request duration—from the time that a lock is requested until the time that the lock is released—thus also determines the amount of s-oblivious pi-blocking transitively incurred by the priority donor.
Under the clustered OMLP, each resource is associated with a simple FIFO wait queue [BA:11, B:11, BA:13]. Since priority donation guarantees that lock-holding jobs are scheduled, and since there are at most m concurrent requests for global resources in progress at any time, a job is delayed by at most m − 1 earlier critical sections per lock request. This in turn implies that priority donors incur s-oblivious pi-blocking for the cumulative duration of at most m critical sections [BA:11]. The blocking bound of the clustered OMLP is thus equivalent to that of the partitioned OMLP, and it is hence also asymptotically optimal, within roughly a factor of two of the Ω(m) lower bound [BA:11, BA:13].
Additional protocols designed specifically for s-oblivious analysis are discussed in Sections 7 and 8.2. However, none of these protocols, nor any OMLP variant, ensures an upper bound on maximum s-oblivious pi-blocking better than roughly twice the known lower bound. It is presently unknown whether this gap can be closed in the general case.
5.2 Suspension-Aware Analysis of Semaphore Protocols
Under s-aware analysis, any self-suspensions due to lock contention and priority inversions due to progress mechanisms are explicitly modeled and must be accounted for by the schedulability test. Hence there is no opportunity to “recycle” any pessimism and no “analysis trick” as in the s-oblivious case—when targeting s-aware analysis, the goal of the locking protocol designer is simply to bound maximum delays as tightly as possible.
The upshot is that an s-aware schedulability analysis can potentially be much more accurate when characterizing the effects of contention, and substantially less pessimistic in terms of system utilization, since execution times are not inflated. This is an important consideration especially if self-suspensions are relatively long. For instance, s-aware analysis becomes essential when synchronizing access to graphics processing units (GPUs), where critical section lengths can easily reach dozens or even hundreds of milliseconds [EA:12]. In comparison, when considering shared data structures, where critical sections are typically just a few microseconds long [BCBL:08, AJJ:98], the utilization impact of s-oblivious inflation is minor.
However, while historically the first multiprocessor real-time locking protocols [RSL:88, R:90, R:91, RM:95a] were all intended for s-aware analysis, the understanding of self-suspensions from a schedulability point of view has only recently begun to mature, and a number of misunderstandings and misconceptions have been identified in earlier analyses of task sets with self-suspensions [CNHY:19]. Multiprocessor locking protocols for s-aware analysis, and the required s-aware analyses themselves, are thus presently still active areas of research.
In the following, we provide an overview of the current state of the art, starting with a brief review of the definition of s-aware pi-blocking, known asymptotic bounds, and (non-)optimality results, and then summarize major binary semaphore protocols for global, partitioned, and clustered scheduling.
5.2.1 Suspension-Aware Schedulability Analysis and Blocking Optimality
In the case of s-aware schedulability analysis, any delay that does not result from the execution of higher-priority jobs constitutes a priority inversion that must be explicitly accounted for. This is captured by the following definition.
Definition 3. A job J_i of a task τ_i, assigned to a cluster C consisting of c cores, suffers s-aware pi-blocking at time t if and only if
(1) J_i is pending at time t,
(2) J_i is not scheduled at time t (i.e., it is self-suspended or preempted), and
(3) fewer than c equal- or higher-priority jobs of tasks assigned to cluster C are scheduled on processors belonging to J_i's assigned cluster C.
Notably, Definition 3 is equivalent to the classic uniprocessor notion of pi-blocking (Definition 1). The key difference to the s-oblivious case (Definition 2) is that, under Definition 3, only the presence of scheduled higher-priority jobs prevents a delay from being considered a priority inversion, whereas under Definition 2 a priority inversion is already ruled out if there are c pending higher-priority jobs. Since any scheduled job is also pending, Definition 3 is weaker than Definition 2, and consequently any bound on s-aware pi-blocking is also a bound on s-oblivious pi-blocking (but the converse does not hold) [BA:10, B:11].
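The containment relation between the two notions can be checked mechanically on small snapshots. The following self-contained sketch implements both predicates side by side (the `Job` record and field names are assumptions for illustration, with smaller priority values meaning higher priority).

```python
# Side-by-side sketch of Definitions 2 and 3: identical except that the
# s-oblivious predicate counts *pending* higher-priority jobs, while the
# s-aware predicate counts only *scheduled* ones. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Job:
    priority: int      # smaller value = higher priority (assumption)
    pending: bool
    scheduled: bool

def _competing(job, cluster, pred):
    return sum(1 for j in cluster
               if j is not job and j.priority <= job.priority and pred(j))

def s_oblivious_blocked(job, cluster, c):   # Definition 2
    return (job.pending and not job.scheduled
            and _competing(job, cluster, lambda j: j.pending) < c)

def s_aware_blocked(job, cluster, c):       # Definition 3
    return (job.pending and not job.scheduled
            and _competing(job, cluster, lambda j: j.scheduled) < c)
```

In the scenario discussed earlier (c higher-priority jobs pending but all self-suspended), the s-aware predicate reports a priority inversion while the s-oblivious one does not, and in every consistent state s-oblivious pi-blocking implies s-aware pi-blocking.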
The fundamental lower bound on maximum s-aware pi-blocking is Ω(n) [BA:10, B:11], which can be easily shown with a task set in which all n tasks simultaneously compete for a single shared resource [BA:10], so that whichever task acquires the resource last is delayed by n − 1 earlier critical sections.
While the Ω(n) lower bound is rather intuitive, the true challenge is again to construct locking protocols that asymptotically match it, that is, to find protocols that ensure O(n) maximum s-aware pi-blocking for any task set. In fact, from a blocking-optimality point of view, the s-aware case is much more challenging than the well-understood s-oblivious case, and it required several attempts until it was solved [BA:10, B:11, B:14b]. To date, while there exists a protocol for clustered scheduling that achieves O(n) maximum s-aware pi-blocking—namely the generalized FMLP+ [B:14b]—which suffices to establish asymptotic tightness of the Ω(n) lower bound under global and partitioned scheduling, no protocol with asymptotically optimal s-aware pi-blocking bounds is known specifically for global scheduling [B:11, B:14b, YWB:15].
The search for practical protocols that are also asymptotically optimal with regard to maximum s-aware pi-blocking is complicated by several non-optimality results. For one, any protocol relying exclusively on priority queues is generally subject to starvation effects that preclude an O(n) bound on maximum s-aware pi-blocking [BA:10]. Furthermore, under global scheduling, any protocol relying on priority inheritance or (unrestricted) priority boosting is subject to an Ω(Φ) lower bound on maximum s-aware pi-blocking [B:11, B:14b], where Φ corresponds to the ratio of the longest to the shortest period in the task set (and which in general cannot be bounded in terms of n or m). The generalized FMLP+ [B:14b], which we discuss in Section 5.2.5 below, thus requires more sophisticated machinery to achieve its O(n) bound under clustered scheduling.
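The starvation effect behind the priority-queue non-optimality result can be seen in a toy arbitration model. The sketch below is purely illustrative (function name, the one-arrival-per-critical-section pattern, and the priority encoding are all assumptions); it contrasts how many critical sections complete before a lowest-priority request is served under FIFO versus priority ordering.

```python
# Toy model: a lowest-priority request joins a wait queue behind
# n_earlier higher-priority requests; while it waits, one new
# higher-priority request arrives per completed critical section (up to
# n_late_arrivals in total). Under FIFO ordering late arrivals queue
# behind the waiter; under priority ordering they skip ahead.
import heapq
from collections import deque

def blocking(discipline, n_earlier, n_late_arrivals):
    """Number of critical sections served before the waiter's request."""
    fifo = deque(range(n_earlier))
    prio = list(range(n_earlier))          # min-heap; all outrank the waiter
    arrivals = n_late_arrivals
    served = 0
    while True:
        queue_empty = (not fifo) if discipline == 'fifo' else (not prio)
        if queue_empty:
            return served                  # the waiter finally acquires the lock
        if discipline == 'fifo':
            fifo.popleft()
        else:
            heapq.heappop(prio)
        served += 1
        if arrivals > 0:                   # a new higher-priority request arrives
            arrivals -= 1
            if discipline != 'fifo':       # priority queue: it skips the waiter;
                heapq.heappush(prio, -served)  # FIFO: it queues behind instead
```

Under FIFO, the waiter's delay is bounded by the requests issued before its own; under priority ordering, the delay grows with every later higher-priority arrival, which is why the resulting pi-blocking cannot be bounded in terms of n alone.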
5.2.2 Global Scheduling
Almost all semaphore protocols designed specifically for global scheduling rely on priority inheritance. As already mentioned in the discussion of the s-oblivious case (Section 5.1.2), Block et al.'s global FMLP for long resources [BLBA:07], based on FIFO wait queues, was the first protocol in this category. Even though the initial analysis was s-oblivious [BLBA:07] (no s-aware analysis for global scheduling was known yet at the time of publication), the protocol itself works well under s-aware analysis, too, and an effective s-aware analysis for G-FP scheduling has since been presented by Yang et al. [YWB:15].
The classic Priority Inheritance Protocol (PIP) [SRL:90], which combines priority inheritance (the progress mechanism, Section 3.4) with priority-ordered wait queues, was initially analyzed under G-FP scheduling by Easwaran and Andersson [EA:09]. In more recent work, this original analysis has been subsumed by Yang et al.'s more accurate analysis of the PIP (and several other protocols) [YWB:15], a state-of-the-art blocking analysis approach based on linear programming [B:13a] applied to G-FP scheduling.
The authors of [NN:11] transferred Easwaran and Andersson's original analysis of the PIP [EA:09] to a protocol variant that they called the Immediate PIP (I-PIP) [NN:11], which retains the use of priority-ordered wait queues, but replaces priority inheritance with priority boosting. They further derived bounds on the maximum resource-hold times under both the original PIP and the I-PIP, as well as heuristics for reducing resource-hold times without violating schedulability [NN:11].
Motivated by their analysis of the PIP, Easwaran and Andersson also proposed a new semaphore protocol [EA:09, EA:09a] that they called the Parallel Priority-Ceiling Protocol (P-PCP) [EA:09], which is likewise based on priority inheritance and priority-ordered wait queues. Additionally, inspired by the classic uniprocessor PCP [SRL:90], the P-PCP introduces rules that prevent jobs from acquiring available resources in order to limit, at each priority level, the maximum number of lower-priority jobs that may simultaneously hold locks. Intuitively, such a rule can help to limit the amount of pi-blocking caused by the progress mechanism, but it introduces considerable complexity and has to be carefully balanced against the extra delay introduced by withholding available resources. Easwaran and Andersson [EA:09] did not provide an empirical comparison of the PIP and the P-PCP; a later evaluation by Yang et al. [YWB:15], based on their more accurate re-analysis of both protocols, found that the P-PCP offers no substantial benefits over the (much simpler) PIP and FMLP protocols.
Yang et al. [YWB:15] also compared the long FMLP and the PIP, and found the two protocols to be incomparable: fundamentally, some real-time workloads require the non-starvation guarantees of the FMLP's FIFO queues, whereas other workloads require that urgent jobs be prioritized over less-urgent jobs. In practice, it is thus preferable for a system to offer both FIFO and priority-ordered wait queues, and it would not be difficult to combine the existing analyses of the FMLP and the PIP [YWB:15] to analyze such a hybrid protocol.
5.2.3 Partitioned Scheduling
Across all combinations of multiprocessor scheduling and synchronization approaches, the category considered next—multiprocessor real-time semaphore protocols for partitioned scheduling with in-place execution of critical sections—has received the most attention in prior work. The classic, prototypical protocol in this domain is Rajkumar's Multiprocessor Priority Ceiling Protocol (MPCP) for P-FP scheduling [R:90, R:91]. The MPCP is a natural extension of uniprocessor synchronization principles, is appealingly simple, and has served as a template for many subsequent protocols. To ensure lock-holder progress, the MPCP relies on priority boosting (Section 3.6). Specifically, for each resource, the protocol determines a ceiling priority that exceeds the priority of any regular task, and the effective priority of a resource-holding task is unconditionally boosted to the ceiling priority of the resource that it holds.
To resolve contention, each shared resource is associated with a priority-ordered wait queue, in which blocked tasks wait in order of their regular scheduling priority. From a blocking optimality point of view, this choice prevents asymptotic optimality [BA:10, B:11]. However, empirically, the MPCP is known to perform well for many (but not all) workload types [B:13a].
Priority-boosted tasks remain preemptable under the MPCP; resource-holding jobs can thus be preempted by other resource-holding jobs, and the choice of ceiling priorities therefore has a significant impact on blocking bounds. Rajkumar's original proposal [R:90, R:91] did not provide a specific rule for determining ceiling priorities; rather, it specified certain conditions for acceptable ceiling priorities, which left some degree of choice to the implementor. Later works [LNR:09, MDPL:14] have simply assumed that the priority ceiling of a global resource is the maximum priority of any task accessing the resource, offset by a system-wide constant to ensure priority-boosting semantics (as discussed in Section 3.6).
Since the MPCP is the original shared-memory multiprocessor real-time locking protocol, it has unsurprisingly received considerable attention in subsequent work. Lortz and Shin [LS:95] studied the choice of queue order in the MPCP and observed that assigning tasks explicit synchronization priorities (unrelated to their scheduling priorities) that reflect each task's blocking tolerance can significantly improve schedulability [LS:95]. They further observed that simply using FIFO queues instead of priority-ordered wait queues can yield substantial schedulability improvements [LS:95], which is consistent with observations made later in the context of the FIFO-based FMLP and FMLP+ [B:13a, BA:08, B:11, B:14b]. A variant of the MPCP with FIFO queues was later also studied in [CO:12a], as was a variant in which priority boosting is replaced with non-preemptive execution of critical sections [CO:12a]. The authors of [YLLR:14] proposed to exploit knowledge of each task's best-case execution time (BCET) to refine the blocking bounds for the MPCP under P-FP scheduling by obtaining a more realistic bound on the worst-case contention possible at runtime.
In work primarily aimed at the task-mapping problem (discussed in Section 11.1), Lakshmanan et al. [LNR:09] also proposed a variant of the MPCP based on virtual spinning, in which blocked jobs do not actually spin, but which can be analyzed using a WCET-inflation approach as commonly used in the analysis of spin locks (recall Section 4.1). While the term was not yet in widespread use at the time, the “virtual spinning” approach is in fact an s-oblivious analysis of the MPCP, together with a protocol tweak to simplify said analysis: at most one job per core may issue a request for a global resource at any time (as is the case with non-preemptive spin locks) [LNR:09]. Contention for global resources is thus first resolved locally on each core, similar to the token mechanism in the partitioned OMLP [BA:10] already discussed in Section 5.1.3, which limits global contention. However, both in the original evaluation [LNR:09] and in later comparisons [B:11, BA:11, BA:13], it was observed that the “virtual spinning” variant of the MPCP performs poorly compared to both the regular MPCP (under s-aware analysis) and the partitioned and clustered OMLP variants, which are optimized for s-oblivious analysis [B:11, BA:13, BA:11].
A number of blocking analyses and integrated schedulability analyses of the MPCP have been proposed over the years [LNR:09, R:90, R:91, B:13a], including analyses for arbitrary activation models based on arrival curves [NSE:09, SNE:09]. It should also be noted that, over the years, a number of misconceptions had to be corrected, related to the critical instant [YCH:17], the use of the period enforcer technique [R:91a] to shape locking-induced self-suspensions [CB:17], and the proper accounting of self-suspensions in response-time analyses [CNHY:19]. These corrections should be consulted before reusing or extending existing analyses.
The most accurate and most extensible blocking analysis of the MPCP available today [B:13a] models the blocking analysis problem as a linear program, which allows for the systematic avoidance of structural pessimism such as the repeated over-counting of long critical sections in blocking bounds [B:13a] (as already discussed in Section 4.1). In particular, the LP-based approach allows for a much more accurate analysis of critical sections that contain self-suspensions (e.g., due to accesses to devices such as GPUs) by cleanly separating the time that a job holds a resource from the time that a job executes while being priority-boosted [B:13a]. More recently, the authors of [PBKR:18] expressed very similar ideas using more conventional notation, but unfortunately did not compare their proposal with the earlier LP-based analysis of the MPCP [B:13a].
A substantially different, early proposal for a predictable multiprocessor real-time locking protocol—almost as old as the MPCP, but not nearly as well known—is due to Z:92 [Z:92], who developed a real-time threading package [SZ:92, Z:92] on top of the Mach microkernel’s thread interface. In Z:92’s protocol, critical sections are executed non-preemptively and blocked threads wait in FIFO order. A unique aspect is that Z:92 introduced the notion of dynamic blocking analysis, where threads specify a maximum acceptable wait time when requesting a resource, which is checked by the real-time resource management subsystem by means of an online blocking analysis that takes current contention conditions into account. This allows the system to dynamically determine whether the specified maximum acceptable wait time would be exceeded, and to reject the request before any delay is actually incurred (in contrast to a regular timeout, which kicks in only after the maximum wait time has already been exceeded). Such a mechanism of course comes with non-negligible runtime overheads and added system complexity, and it has been absent from later proposals.
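Z:92’s online admission check can be sketched as follows (a minimal illustration with hypothetical names, not the original implementation): on each request, the resource manager bounds the worst-case wait from the requests currently queued and rejects the request up front if that bound exceeds the caller’s stated maximum acceptable wait.

```python
from collections import deque

class AdmissionFifoLock:
    """Sketch of a Z:92-style FIFO lock with online admission: a request is
    rejected immediately if the worst-case wait (the sum of the critical-
    section bounds of the current holder and all earlier waiters) exceeds
    the caller's maximum acceptable wait time. Names are illustrative."""

    def __init__(self):
        self.queue = deque()  # (thread_id, cs_bound) of holder and waiters

    def request(self, thread_id, cs_bound, max_wait):
        # Pessimistic online analysis: assume every earlier queued request
        # runs for its full critical-section bound.
        worst_case_wait = sum(bound for _, bound in self.queue)
        if worst_case_wait > max_wait:
            return False  # rejected before any delay is actually incurred
        self.queue.append((thread_id, cs_bound))
        return True  # admitted; the thread waits in FIFO order

    def release(self):
        self.queue.popleft()

lock = AdmissionFifoLock()
assert lock.request("A", cs_bound=30, max_wait=0)       # uncontended: admitted
assert lock.request("B", cs_bound=20, max_wait=50)      # waits at most 30 <= 50
assert not lock.request("C", cs_bound=10, max_wait=40)  # 30 + 20 > 40: rejected
```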
In work targeting P-EDF, CTB:94 developed the Multiprocessor Dynamic Priority Ceiling Protocol (MDPCP) [CTB:94], which despite its name is quite different from the earlier MPCP. The MDPCP ensures progress by letting jobs holding global resources execute non-preemptively, and orders jobs in per-resource wait queues by decreasing priority (i.e., increasing deadlines). Additionally, the MDPCP defines for each resource a current priority ceiling, which is defined “as the maximum priority of all jobs that are currently locking or will lock” the semaphore [CTB:94]. As a result, the MDPCP is fundamentally tied to the periodic task model, where the arrival times of future jobs are known a priori. In contrast, in a sporadic setting under P-EDF scheduling, it is impossible to precisely determine the set of jobs and their priorities that will lock a resource in the future.
The MDPCP includes a non-work-conserving rule akin to the uniprocessor PCP [SRL:90] that prevents jobs with insufficiently high priorities (i.e., insufficiently urgent deadlines) from acquiring resources that might still be needed by higher-priority jobs (i.e., jobs with earlier deadlines). More precisely, a job on processor Pk is allowed to lock a global resource only if its own priority exceeds (i.e., its deadline is earlier than) the maximum priority ceiling of any resource currently in use on any of the processors that Pk might conflict with, where two processors might conflict with each other if there exists some resource that is accessed by tasks on both processors [CTB:94]. This rule is required to avoid deadlock in the presence of nested requests, as will be discussed in Section 9. In an accompanying tech report [CT:94], CT:94 further defined a second variant of the MDPCP based on priority boosting (rather than non-preemptive execution); this MDPCP version is also tied to the periodic task model.
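The acquisition rule can be sketched as a simple predicate (an illustrative data layout, not the protocol’s reference formulation; under EDF, an earlier deadline means a higher priority, so ceilings are compared as deadlines):

```python
def may_lock(job_deadline, in_use, conflicting_cpus):
    """Sketch of the MDPCP acquisition rule (illustrative data layout).
    in_use: list of (cpu, ceiling_deadline) pairs for resources currently
    held anywhere in the system; ceilings are expressed as deadlines, so an
    earlier ceiling deadline means a higher ceiling. conflicting_cpus: the
    set of processors that might conflict with the requesting job's CPU,
    i.e., that share some resource with it."""
    relevant = [ceil for cpu, ceil in in_use if cpu in conflicting_cpus]
    # The job may lock only if its deadline is earlier (more urgent) than
    # every relevant ceiling; otherwise the request is delayed.
    return all(job_deadline < ceil for ceil in relevant)

assert may_lock(5, [(1, 10)], {1})       # deadline 5 beats ceiling 10
assert not may_lock(15, [(1, 10)], {1})  # ceiling 10 blocks deadline 15
assert may_lock(15, [(2, 10)], {1})      # CPU 2 does not conflict
```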
BLBA:07’s partitioned FMLP [BLBA:07] (previously discussed in Section 5.1.3) was also applied to P-FP scheduling and analyzed in an s-aware manner [BA:08a]. Recall that the FMLP variant for long resources under partitioned scheduling relies on per-resource FIFO queues to resolve contention and on priority boosting to ensure lock-holder progress. While priority boosting is conceptually simple, a key detail is how to order simultaneously priority-boosted jobs (i.e., what to do if multiple tasks assigned to the same processor hold a resource at the same time). The original FMLP [BLBA:07] pragmatically gives priority to whichever job acquired its resource first (which greedily minimizes the number of preemptions). This choice, however, turned out to be problematic from a blocking-optimality point of view [BA:10, B:11], and a refined version of the partitioned FMLP for long resources, called the partitioned FMLP+, was introduced [B:11].
Like its predecessor, the partitioned FMLP+ uses per-resource FIFO queues to resolve contention and priority boosting to ensure lock-holder progress. However, it uses a subtly different tie-breaking rule: among priority-boosted jobs, the job with the first-issued (rather than the first-granted) lock request is given priority (i.e., priority-boosted jobs are scheduled in order of increasing lock-request times) [B:11]. To avoid preemptions in the middle of a critical section, the FMLP+ optionally also supports non-preemptive critical sections [B:11].
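The tie-breaking rule among priority-boosted jobs can be illustrated with a minimal dispatch function (field names and structure are assumptions for illustration, not the FMLP+ reference implementation):

```python
def pick_next(ready_jobs):
    """Sketch of a per-processor dispatch rule in the spirit of the
    partitioned FMLP+. Each job is a dict with 'prio' (lower value =
    higher base priority) and, if it currently holds a resource,
    'req_time' (the time its lock request was issued). Boosted
    (resource-holding) jobs precede all other jobs; among boosted jobs,
    the earliest-issued request wins (not the earliest-granted one)."""
    boosted = [j for j in ready_jobs if j.get('req_time') is not None]
    if boosted:
        return min(boosted, key=lambda j: j['req_time'])
    return min(ready_jobs, key=lambda j: j['prio'])

jobs = [
    {'name': 'A', 'prio': 1, 'req_time': None},  # high base priority, no lock
    {'name': 'B', 'prio': 3, 'req_time': 7},     # lock holder, request at t=7
    {'name': 'C', 'prio': 2, 'req_time': 4},     # lock holder, request at t=4
]
assert pick_next(jobs)['name'] == 'C'  # earliest-issued request runs first
```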
In contrast to all prior semaphore protocols for partitioned multiprocessor scheduling, the partitioned FMLP+ (with either preemptive or non-preemptive critical sections) ensures asymptotically optimal, i.e., O(n), maximum s-aware pi-blocking [B:11]. Specifically, due to the use of FIFO queues and the FIFO-based priority-boosting order, the FMLP+ ensures that a job is delayed by at most n − 1 earlier-issued requests for any resource each time it executes a critical section (assuming preemptive critical sections). Additionally, prior to a job’s arrival or while it is suspended, lower-priority jobs may issue resource requests, which may be priority-boosted at a later time and thereby cause additional pi-blocking. Hence, whenever a job arrives or resumes from a self-suspension, there may be up to n − 1 earlier-issued, incomplete requests that can cause additional pi-blocking. Assuming non-preemptive critical sections adds only a constant amount of additional blocking [B:11]. The FMLP+ hence ensures maximum s-aware pi-blocking within a factor of roughly two of the known Ω(n) lower bound [B:11].
In addition to the optimality result, several fine-grained (i.e., non-asymptotic) s-aware blocking analyses of the FMLP+ have been presented [B:11, B:13a, B:14b], with an LP-based analysis [B:13a] again yielding the most accurate results and offering the greatest flexibility, including support for self-suspensions within critical sections [B:13a, Appendix F]. Recently, MKZT:16 [MKZT:16] further refined the LP-based analysis and proposed additional constraints to increase analysis accuracy.
Overall, the FMLP+ is simple, requires no configuration or a priori knowledge (such as priority ceilings), has been implemented in LITMUSRT [B:11], and shown to be practical [B:13a]. In an empirical comparison with the MPCP, the two protocols were shown to be incomparable: the FMLP+ outperforms the MPCP for many (but not all) workloads, and vice versa [B:13a]. PBKR:18 [PBKR:18] observed similar trends in their comparison of the two protocols.
As in the global case (Section 5.2.2), it would be desirable to develop a hybrid protocol that integrates the advantages of the FIFO-based FMLP+ [B:11] with optional prioritization as in the MPCP [R:90, R:91] for the most blocking-sensitive tasks [LS:95]. Such a protocol could be easily analyzed by combining the existing LP-based analyses [B:13a] of the MPCP and the FMLP+.
Targeting P-FP scheduling in the context of a somewhat different system model, NBN:11a [NBN:11a] considered the consolidation of legacy uniprocessor systems onto shared multicore platforms, where each core is used to host a mostly independent (uniprocessor) application consisting of multiple tasks. Whereas intra-application synchronization needs can be resolved with existing uniprocessor protocols (as previously employed in the individual legacy systems), the move to a shared multicore platform can create new inter-application synchronization needs (e.g., due to shared platform resources in the underlying RTOS). To support such inter-application resource sharing, NBN:11a [NBN:11a] developed the Multiprocessors Synchronization Protocol for Real-Time Open Systems (MSOS) [NBN:11a], with the primary goal of ensuring that the temporal correctness of each application can be assessed without requiring insight into any other application (i.e., applications are assumed to be opaque and may be developed by independent teams or organizations). To this end, the MSOS protocol uses a two-level, multi-tailed hybrid queue for each resource. Similar to the partitioned OMLP [BA:10] (discussed in Section 5.1.3), contention for global (i.e., inter-application) resources is first resolved on each core, such that at most one job per core and resource can contend for global resources. Since the MSOS protocol resolves inter-application contention with FIFO queues, this design allows for the derivation of blocking bounds without knowledge of any application internals, provided the maximum per-application resource-hold time is known (for which NBN:11a provide a bound [NBN:11a]). The intra-application queues can be either FIFO or priority-ordered queues [NBN:11a], and lock-holder progress is ensured via priority boosting as in the MPCP [R:90].
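The two-level queue structure can be sketched as follows (an illustrative simplification with FIFO local queues; the MSOS protocol also permits priority-ordered local queues):

```python
from collections import deque

class MsosResource:
    """Sketch of an MSOS-style two-level queue for one global resource
    (illustrative structure): each core/application keeps a local queue,
    and at most one request per core is forwarded to the global FIFO
    queue at any time, so global blocking bounds need no application
    internals beyond per-application resource-hold times."""

    def __init__(self, num_cores):
        self.local = [deque() for _ in range(num_cores)]
        self.global_fifo = deque()  # core ids, in FIFO order

    def request(self, core, job):
        self.local[core].append(job)
        if len(self.local[core]) == 1:
            # This core had no pending request: forward it globally.
            self.global_fifo.append(core)

    def release(self):
        core = self.global_fifo.popleft()
        self.local[core].popleft()
        if self.local[core]:
            self.global_fifo.append(core)  # forward this core's next request

res = MsosResource(num_cores=2)
res.request(0, 'J1'); res.request(0, 'J2'); res.request(1, 'J3')
assert list(res.global_fifo) == [0, 1]  # at most one entry per core
res.release()                           # J1 done; J2 is forwarded next
assert list(res.global_fifo) == [1, 0]
```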
Subsequently, CNHY:19 [CNHY:19] corrected an oversight in the analysis of the MSOS protocol [NBN:11a] related to the worst-case impact of self-suspensions.
5.2.4 Semi-Partitioned Scheduling
Semi-partitioned multiprocessor scheduling [ABD:05] is a hybrid variant of partitioned scheduling, where most tasks are assigned to a single processor each (as under partitioned scheduling) and a few migratory tasks receive allocations on two or more processors (i.e., their processor allocations are effectively split across processors). Semi-partitioned scheduling has been shown to be an effective and highly practical technique to circumvent bin-packing limitations without incurring the complexities and overheads of global or clustered scheduling [BBA:11, BG:16]. From a synchronization point of view, however, semi-partitioned scheduling has not yet received much attention.
A notable exception is ANN:12a’s work [ANN:12a] on semaphore protocols for semi-partitioned fixed-priority (SP-FP) scheduling. Since known techniques for partitioned scheduling are readily applicable to non-migratory tasks, the novel challenge that must be addressed when targeting semi-partitioned systems is migratory tasks. To this end, ANN:12a [ANN:12a] proposed two protocol variants, both using priority-ordered wait queues and priority boosting.
In the first protocol variant, the Migration-based Locking Protocol under Semi-Partitioned Scheduling (MLPS) [ANN:12a], each task is assigned a marked processor on which it must execute all its critical sections. This approach simplifies the problem, as it ensures that all of a task’s critical sections are executed on a statically known processor, which reduces the analysis problem to the partitioned case. However, it also introduces additional task migrations, as a migratory task that currently resides on the “wrong” (i.e., non-marked) processor must first migrate to its marked processor before it can enter a critical section, and then back again to its non-marked processor when it releases the lock.
As an alternative, ANN:12a’s Non-Migration-Based Locking Protocol under Semi-Partitioned Scheduling (NMLPS) [ANN:12a] lets migratory tasks execute their critical sections on their current processor (i.e., on whichever processor they happen to be executing at the time of lock acquisition). This avoids any superfluous migrations, but causes greater analysis uncertainty as it is now unclear on which processor a critical section will be executed.
Additional complications arise when a resource-holding migratory task should, according to the semi-partitioning policy, be migrated in the middle of a critical section. At this point, there are two choices: either let the task finish its critical section before enacting the migration, which may cause it to overrun its local budget, or instead preempt the execution of the critical section, which causes extra delays and makes the analysis considerably more pessimistic. ANN:12a [ANN:12a] chose the former approach in the NMLPS, which means that budget overruns up to the length of one critical section must be accounted for in all but the last segments of migratory tasks.
5.2.5 Clustered Scheduling
Like the semi-partitioned case, the topic of s-aware semaphore protocols for clustered scheduling has not received much attention to date. The primary works are an extension of NBN:11a’s MSOS protocol [NBN:11a], called the clustered MSOS (C-MSOS) protocol [NN:13], and the generalized FMLP+ [B:14b], which establishes asymptotic tightness of the known lower bound on s-aware pi-blocking (recall Section 5.2.1).
Under the C-MSOS, legacy applications are allowed to span multiple cores (i.e., there is one application per cluster). Local (i.e., intra-application) resources are managed using the PIP [SRL:90, EA:09] (as discussed in Section 5.2.2). Global (i.e., inter-application) resources are managed using a two-stage queue as in the MSOS protocol [NBN:11a]. However, in the C-MSOS protocol, each resource’s global queue can be either a FIFO queue (as in the MSOS protocol [NBN:11a]) or a round-robin queue. As before, the per-application queues can be either FIFO or priority queues, and lock-holder progress is ensured via priority boosting (which prevents asymptotic optimality w.r.t. maximum s-aware pi-blocking under global and clustered scheduling [B:11, B:14b]).
The generalized FMLP+ [B:14b] was designed specifically to close the “s-aware optimality gap” [B:11], i.e., to provide a matching O(n) upper bound on maximum s-aware pi-blocking under clustered (and hence also global) scheduling, thereby establishing the known Ω(n) lower bound [BA:10a, B:11] to be asymptotically tight [B:14b]. The name derives from the fact that the generalized FMLP+ produces the same schedule as the partitioned FMLP+ when applied to partitioned scheduling [B:14b]. However, despite this lineage, in terms of protocol rules, the generalized FMLP+ [B:14b] differs substantially from the (much simpler) partitioned FMLP+ [B:11].
The generalized FMLP+ [B:14b] resolves contention with simple per-resource FIFO queues, as in the prior FMLP [BLBA:07] and the partitioned FMLP+ [B:11]. The key challenge is to ensure lock-holder progress, since neither priority inheritance nor (unrestricted) priority boosting can yield asymptotically optimal s-aware pi-blocking bounds under global and clustered scheduling [B:11, B:14b]. Intuitively, the main problem is that raising the priority of a lock holder (via either inheritance or boosting) can cause other, unrelated higher-priority jobs to be preempted. Furthermore, in pathological cases, it can cause the same job to be repeatedly preempted, which gives rise to asymptotically non-optimal s-aware pi-blocking [B:11, B:14b]. The generalized FMLP+ overcomes this effect by employing a progress mechanism tailored to the problem, called restricted segment boosting (RSB) [B:14b]. Under the RSB rules, in each cluster, only the (single) job with the earliest-issued request benefits from priority boosting (with any ties in request-issue time broken arbitrarily). In addition to this single boosted lock holder, certain non-lock-holding jobs are co-boosted, specifically to prevent repeated preemptions in the pathological scenarios that cause the non-optimality of priority inheritance and priority boosting [B:11, B:14b].
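The core of the RSB selection rule can be sketched as follows (a deliberately simplified illustration that omits the co-boosting of non-lock-holding jobs; names and structure are hypothetical):

```python
def boosted_job(cluster_jobs):
    """Sketch of RSB's per-cluster selection (simplified): among the
    cluster's jobs with incomplete resource requests, only the single job
    with the earliest-issued request is priority-boosted. Ties in
    request-issue time may be broken arbitrarily (here: by name).
    Co-boosting of certain non-lock-holders is intentionally omitted."""
    pending = [j for j in cluster_jobs if j.get('req_time') is not None]
    if not pending:
        return None  # no incomplete requests: nobody is boosted
    return min(pending, key=lambda j: (j['req_time'], j['name']))

cluster = [
    {'name': 'J1', 'req_time': None},  # no request: never boosted
    {'name': 'J2', 'req_time': 8},
    {'name': 'J3', 'req_time': 2},     # earliest-issued request
]
assert boosted_job(cluster)['name'] == 'J3'
```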
Based on RSB, the generalized FMLP+ ensures asymptotically optimal s-aware pi-blocking under clustered scheduling [B:14b], and hence also under global scheduling, which closes the s-aware optimality gap [B:11]. However, in an empirical comparison under global scheduling [YWB:15], the generalized FMLP+ performed generally worse than protocols specifically designed for global scheduling, which indicates that the generalized FMLP+ [B:14b] is primarily of interest from a blocking optimality point of view. In contrast, the simpler partitioned FMLP+ [B:11], which is designed specifically for partitioned scheduling and hence avoids the complexities resulting from clustered and global scheduling, is known to perform empirically very well and to be practical [B:13a].
6 Centralized Execution of Critical Sections
In the preceding two sections, we have considered protocols for in-place execution of critical sections, where jobs directly access shared resources. Under in-place protocols, the critical sections pertaining to each (global) resource are spread across multiple processors (i.e., wherever the tasks that share the resource happen to be executing). For spin locks (Section 4), this is the natural choice. In the case of semaphores, however, this is not the only possibility, nor is it necessarily the best. Instead, as discussed in Section 2.2.5, it is also possible to centralize the execution of all critical sections onto a designated processor, the synchronization processor of the resource. In fact, the very first multiprocessor real-time semaphore protocol, namely the DPCP [RSL:88], followed exactly this approach.
Protocols that call for the centralized execution of critical sections are also called distributed multiprocessor real-time locking protocols, because the centralized approach does not necessarily require shared memory (as opposed to the in-place execution of critical sections, which typically relies on cache-consistent shared memory). From a systems point of view, there are three ways to interpret such protocols. In the following discussion, let a job’s application processor be the processor on which it carries out its regular execution (i.e., where it executes its non-critical sections).
In the first interpretation, which is consistent with a distributed-systems perspective, each critical section of a task is seen as a synchronous remote procedure call (RPC) to a resource server executing on the synchronization processor. The resource server, which may be multi-threaded, is in charge of serializing concurrent RPC calls. A job that issues an RPC to the synchronization processor self-suspends after sending the RPC request and resumes again when the resource server’s response is received by the application processor. The job’s self-suspension duration thus includes both the time required to service its own request and any delays due to contention for the resource (i.e., blocking due to earlier-serviced requests). Additionally, in a real system, any communication overheads contribute to a job’s self-suspension time (e.g., transmission delays if the RPC request is communicated over a shared interconnect, argument marshalling and unmarshalling costs, etc.).
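The RPC interpretation can be illustrated with a minimal resource-server sketch (a hypothetical structure using threads and message queues; a real RTOS implementation would differ substantially):

```python
import queue
import threading

def resource_server(requests):
    """Sketch of a single-threaded resource server: critical sections
    arrive as closures and are executed one at a time on the
    synchronization processor, which serializes all accesses."""
    while True:
        critical_section, reply = requests.get()
        if critical_section is None:
            break  # shutdown sentinel
        reply.put(critical_section())  # run the CS, then send the response

requests = queue.Queue()
server = threading.Thread(target=resource_server, args=(requests,))
server.start()

shared = []  # resource state, touched only by the server thread

def client_call(cs):
    """Issue a synchronous RPC: the client self-suspends (blocks on the
    reply queue) until the server's response arrives."""
    reply = queue.Queue()
    requests.put((cs, reply))
    return reply.get()

client_call(lambda: shared.append(1))
result = client_call(lambda: len(shared))
requests.put((None, None))  # shut the server down
server.join()
assert result == 1
```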
In the second interpretation, which is typically adopted in a shared-memory context, jobs are considered to migrate from the application processor to the synchronization processor when they attempt to lock a shared resource, and to migrate back to their application processor when unlocking the resource. All resource contention is hence reduced to a uniprocessor locking problem. However, from a schedulability analysis point of view, the time that the job resides on the synchronization processor still constitutes a self-suspension w.r.t. the analysis of the application processor.
Finally, the third interpretation, which is appropriate for both shared-memory and distributed systems, is to see each job as a sequence (i.e., as a linear DAG) of subjobs with precedence constraints and an end-to-end deadline [SBL:94], where different subjobs are spread out across multiple processors. In this view, the synchronization problem is again reduced to a uniprocessor problem. The end-to-end analysis, however, must deal with the fact that each DAG visits the application processor multiple times, which can give rise to pessimism in the analysis.
From an analytical point of view—i.e., for the purpose of schedulability and blocking analysis—the first two interpretations are equivalent, that is, identical analysis problems must be solved and, ignoring overheads, identical bounds are obtained, regardless of how the protocol is actually implemented. The third approach provides some additional flexibility [TL:94, SBL:94] and has recently been exploited to enable modern analyses and heuristics [HCR:16, HYC:16, BCHY:17, DLBC:18, CLYL:19].
6.1 Advantages and Disadvantages
Multiprocessor real-time locking protocols that centralize the execution of critical sections offer a number of unique advantages. For one, they can be easily applied to heterogeneous multiprocessor platforms, where certain critical sections may be inherently restricted to specific cores (e.g., compute kernels can run only on GPUs, high-performance signal processing may need to take place on a DSP, low-power cores may lack floating-point support, etc.). Similarly, if certain shared devices are accessible only from specific processors (e.g., if there is a dedicated I/O processor), then those processors naturally become synchronization processors for critical sections pertaining to such devices. Furthermore, centralizing all critical sections is also attractive in non-cache-coherent systems, since it avoids the need to keep a shared resource’s state consistent across multiple local memories. In fact, even in cache-coherent shared-memory systems it can be beneficial to centralize the execution of critical sections to avoid cache-line bouncing [LDTL:12]. And last but not least, from a real-time perspective, the centralized approach allows for the reuse of well-established uniprocessor protocols, which for some workloads can translate into significant schedulability improvements over in-place approaches [B:13a].
Centralized protocols, however, also come with a major downside. Whereas schedulability and blocking analysis is typically concerned with worst-case scenarios, many systems also require excellent average-case performance, and this is where in-place execution of critical sections has a major advantage. In well-designed systems, resource contention is usually rare, which means that uncontested lock acquisitions are the common case that determines average-case performance. In a semaphore protocol based on in-place execution, uncontested lock acquisitions do not cause self-suspensions and can be optimized to incur very low acquisition and release overheads (see “futexes,” discussed in Section 10.2). In contrast, in protocols based on the centralized approach, every remote critical section necessarily involves a self-suspension regardless of whether the shared resource is actually under contention, which is likely to have a significant negative impact on average-case performance.
6.2 Centralized Protocols
The original protocol for the centralized execution of critical sections, and in fact the first multiprocessor real-time locking protocol altogether, is the Distributed Priority Ceiling Protocol (DPCP) [RSL:88, R:91]. Unfortunately, there is some confusion regarding the proper name of the DPCP. The protocol was originally introduced as the “Multiprocessor Priority Ceiling Protocol” and abbreviated as “MPCP” [RSL:88], but then renamed to “Distributed Priority Ceiling Protocol,” properly abbreviated as “DPCP,” shortly thereafter [R:91]. To make matters worse, the shared-memory protocol now known as the MPCP (discussed in Section 5.2.3) was introduced in the meantime [R:90]. However, for some time, the authors of several subsequent works remained unaware of the name change, and hence a number of later publications, including a popular textbook on real-time systems [L:00], refer to the DPCP [RSL:88] by the name “MPCP.” We follow the modern terminology [R:91] and denote by “DPCP” the original protocol [RSL:88], which is based on the centralized execution of critical sections, and reserve the abbreviation “MPCP” to refer to the later shared-memory protocol [R:90], which is based on the in-place execution of critical sections (as discussed in Section 5.2.3).
The DPCP has been designed for P-FP scheduling. As the name suggests, the DPCP relies on the classic PCP [SRL:90] to arbitrate conflicting requests on each synchronization processor. To ensure resource-holder progress, that is, to prevent lock-holding jobs from being preempted by non-lock-holding jobs if the sets of synchronization and application processors are not disjoint, the DPCP relies on priority boosting. As a result of reusing the uniprocessor PCP, the DPCP effectively uses a priority-ordered wait queue (i.e., conflicting requests from two remote jobs are served in order of their regular scheduling priorities). This simple design has proven to be highly effective and practical even in modern systems [B:13a].
A number of s-aware blocking analyses of the DPCP have been presented in the literature [RSL:88, R:91, B:13a, HYC:16], with an LP-based approach [B:13a] yielding the most accurate bounds. Recent works [YCH:17, CB:17] documented some misconceptions in the original analyses [RSL:88, R:91].
RC:08 [RC:08] investigated a variant of the DPCP that uses the SRP [B:91] instead of the PCP [SRL:90] on each core. The resulting protocol, which they called the Distributed Stack Resource Policy (DSRP) [RC:08], has the advantage of integrating better with P-EDF scheduling (since the underlying SRP is well-suited for EDF).
Just as the FIFO-ordered FMLP+ complements the priority-ordered MPCP in case of in-place critical sections, the Distributed FIFO Locking Protocol (DFLP) [B:13a, B:14a] is a FIFO-ordered protocol for centralized critical-section execution that complements the priority-ordered DPCP. The DFLP works in large parts just like the DPCP, with the exception that it does not use the PCP to manage access to global resources. Instead, it adopts the design first introduced with the partitioned FMLP+ [B:11] (discussed in Section 5.2.3): conflicting lock requests are served in FIFO order, lock-holding jobs are (unconditionally) priority-boosted, and jobs that are priority-boosted simultaneously on the same synchronization processor are scheduled in order of their lock requests (i.e., the tie-breaking rule favors earlier-issued requests).
Blocking under the DFLP has been analyzed using an s-aware, LP-based approach [B:13a]. In an empirical comparison under P-FP scheduling based on an implementation in LITMUSRT [B:13a], the DFLP and DPCP were observed to be incomparable: the DFLP performs better than the DPCP for many (but not all) workloads, and vice versa. Similarly, both protocols were also observed to each be incomparable with their in-place counterparts (i.e., the partitioned FMLP+ [B:11] and the MPCP [R:90], respectively).
In contrast to the DPCP, which is defined only for P-FP scheduling, the DFLP can also be combined with P-EDF or clustered scheduling [B:14a].
6.3 Blocking Optimality
Centralized locking protocols have also been studied from the point of view of blocking optimality [BA:10, B:11], and asymptotically tight bounds on maximum pi-blocking have been obtained for both the s-aware and s-oblivious cases [B:14a]. Interestingly, the way in which resources and tasks are assigned to synchronization and application processors, respectively, plays a major role. If some tasks and resources are co-hosted, that is, if the sets of synchronization and application processors are not disjoint, then maximum pi-blocking is asymptotically worse than in the shared-memory case: a lower bound of Ω(Φ · n) maximum pi-blocking has been established [B:14a], where Φ denotes the ratio of the maximum response time to the minimum period of any task. Notably, this bound holds under both s-aware and s-oblivious analysis due to the existence of pathological corner cases in which jobs are repeatedly preempted [B:14a]. Both the DPCP [RSL:88, R:91] and the DFLP [B:13a, B:14a] ensure O(Φ · n) maximum s-aware pi-blocking, and are hence asymptotically optimal in the case with co-hosted tasks and resources [B:14a].
In contrast, if the sets of synchronization and application processors are disjoint (i.e., if no processor serves both regular tasks and critical sections), then the same lower bounds as in the case of in-place critical sections apply [B:14a]: Ω(n) under s-aware analysis, and Ω(m) under s-oblivious analysis.
The DFLP [B:13a, B:14a] ensures O(n) maximum s-aware pi-blocking in the disjoint case under clustered (and hence also partitioned) scheduling, and is thus asymptotically optimal under s-aware analysis [B:14a]. Asymptotic tightness of the Ω(m) bound on maximum s-oblivious pi-blocking was established with the Distributed OMLP (D-OMLP) [B:14a], which was obtained by transferring techniques introduced with the OMLP family for in-place critical sections to the centralized setting.
7 Independence Preservation: Avoiding the Blocking of Higher-Priority Tasks
All locking protocols discussed so far use either priority inheritance, non-preemptive sections, unconditional priority boosting, or a restricted variant of the latter (such as priority donation or RSB). Of these, priority inheritance has a unique and particularly attractive property: independent higher-priority jobs are not affected by the synchronization behavior of lower-priority jobs.
For example, consider three tasks τ1, τ2, and τ3 under uniprocessor fixed-priority scheduling, and suppose that τ2 and τ3 share a resource that the higher-priority task τ1 does not require. If the tasks follow a protocol based on priority inheritance, then the response time of τ1 is completely independent of the lengths of the critical sections of τ2 and τ3, which is obviously desirable. In contrast, if τ2 and τ3 synchronize by means of non-preemptive critical sections, then τ1’s response time, and ultimately its temporal correctness, depends on the critical section lengths of lower-priority tasks. In other words, the use of non-preemptive sections induces a temporal dependency among logically independent tasks.
Unfortunately, priority boosting, priority donation, and RSB similarly induce temporal dependencies when they force the execution of lock-holding lower-priority jobs. Since on multiprocessors priority inheritance is effective only under global scheduling (recall Section 3.4), this poses a significant problem for multiprocessor real-time systems that do not use global scheduling (of which there are many in practice). In response, a number of multiprocessor real-time locking protocols have been proposed that avoid introducing temporal dependencies in logically unrelated jobs. We use the term independence preservation [B:12, B:13, B:14] to generally refer to the desired isolation property and this class of protocols.
7.1 Use Cases
Independence preservation is an important property in practice, but to date it has received relatively little attention compared to the classic spin and semaphore protocols discussed in Sections 4–6. To highlight the concept’s significance, we briefly sketch four contexts in which independence preservation is essential.
First, consider multi-rate systems with a wide range of activation frequencies [B:13]. For instance, in automotive systems, it is not uncommon to find task periods ranging from as low as 1 ms to as high as 1,000 ms or more [KZH:15]. Now, if a 1,000 ms task has a utilization of only 10%, and if each job spends only 1% of its execution time in a critical section, then a single such critical section is already long enough (1 ms) to render any 1 ms task on the same core infeasible. This shows the importance of independence preservation in the face of highly heterogeneous timing requirements.
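The arithmetic behind this example:

```python
# Worked numbers for the automotive example: a task with a 1,000 ms period,
# 10% utilization, and 1% of its execution time inside a critical section.
period_ms = 1000
wcet_ms = period_ms // 10   # 10% utilization -> 100 ms of execution per job
cs_ms = wcet_ms // 100      # 1% of execution -> 1 ms critical section
# A single such critical section spans the entire period of a 1 ms task,
# so any 1 ms task on the same core becomes infeasible.
assert cs_ms == 1
```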
As a second example, consider an infrequently triggered sporadic event handler that must react within, say, 100 µs (e.g., a critical interrupt handler with a tight latency constraint). Now assume the system is deployed on an eight-core platform and consider a shared-memory object accessed by all cores (e.g., an OS data structure) that is protected with a non-preemptive FIFO spin lock (e.g., as used in the MSRP [GLD:01], recall Section 4.1). Even if each critical section is only at most 20 µs long, when accounting for the transitive impact of spin delay, the worst-case latency on every core is at least 160 µs, which renders the latency-sensitive interrupt handler infeasible. Generally speaking, if job release latency is a concern, then non-independence-preserving synchronization methods must be avoided. Case in point: the PREEMPT_RT real-time patch for the Linux kernel converts most non-preemptive spin locks in the kernel to suspension-based mutexes for precisely this reason. In other words, none of the protocols discussed in Sections 4–6 based on priority boosting or non-preemptive execution is appropriate for general use in the Linux kernel. TS:97 highlighted the negative impact of delays due to real-time synchronization on interrupt latency as a major problem in multiprocessor RTOS kernels already more than 20 years ago [TS:94, TS:95, TS:96, TS:97, T:96].
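One way to arrive at this worst-case latency figure, assuming a newly released job finds a local non-preemptive critical section in progress that itself must wait for the m − 1 remote requests queued ahead of it:

```python
# Worked numbers for the spin-lock example: m = 8 cores, each critical
# section at most 20 us, non-preemptive FIFO spin lock. In the worst case,
# a newly released job on any core is delayed by one local non-preemptive
# critical section that first spins behind m - 1 queued remote requests:
# m * 20 us of release latency in total.
m, cs_us = 8, 20
worst_case_latency_us = m * cs_us
assert worst_case_latency_us == 160  # exceeds the 100 us latency budget
```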
Third, consider open systems, where at design time it is not (fully) known which applications will be deployed and composed at runtime. Non-preemptive sections and priority boosting are inappropriate for such systems, because the pi-blocking that they induce is a global property, in the sense that it affects all applications, and because the maximum critical section length in newly added applications is not always known. Independence preservation ensures temporal isolation among independent applications, which greatly simplifies the online admission and composition problem. FLC:10 argue this point in detail [FLC:10, FLC:12].
As a fourth and final example, independence preservation is also important in the context of mixed-criticality systems [BD:18], where it is highly desirable to avoid any dependencies from critical on non-critical components. Specifically, if the temporal correctness of a highly critical task depends on the maximum critical section length in a lower-criticality task, then there exists an implicit trust relationship: the temporal correctness of higher-criticality tasks is guaranteed only as long as the lower-criticality task does not violate the worst-case timing behavior assumed in the analysis of higher-criticality tasks. Independence preservation can help to avoid such dependencies, which violate the freedom-from-interference principle at the heart of mixed-criticality systems [BD:18]. A detailed argument along these lines has been presented in prior work [B:14].
7.2 Fully Preemptive Locking Protocols for Partitioned and Clustered Scheduling
Since priority inheritance ensures independence preservation under global scheduling, we focus on partitioned and clustered scheduling (and in-place critical sections).
Recall from Section 3 that the fundamental challenge under partitioned scheduling can be described as follows: a lock-holding task τℓ on processor P1 is preempted by a higher-priority task τh while τℓ blocks a remote task τr located on processor P2. There are fundamentally only three choices:
priority-boost τℓ to expedite the completion of its critical section, in which case τh is delayed;
do nothing and accept that τr’s blocking bound depends on τh’s execution cost; or
use processor time originally allocated to τr on processor P2 to finish τℓ’s critical section: allocation inheritance, as discussed in Section 3.5.
Option (1) violates independence preservation, option (2) results in potentially “unbounded” pi-blocking, and hence all protocols considered in this section rely on option (3).
Allocation inheritance can be combined with both spin- and suspension-based waiting. In both cases, the key property is that lock-holding tasks remain fully preemptable at all times and continue to execute with their regular (i.e., non-boosted) priorities, which ensures the desired independence-preservation property.
HP:01 [HP:01] were the first to describe an independence-preserving multiprocessor real-time synchronization protocol, which they realized in the Fiasco L4 microkernel under the name local helping. Given that microkernels in the L4 family rely exclusively on IPC, the shared resource under contention is in fact a single-threaded resource server that synchronously responds to invocations from client tasks, thereby implicitly sequencing concurrent requests (i.e., the execution of the server’s response handler forms the “critical section”).
HP:01’s solution [HP:01] is based on earlier work by HH:01 [HH:01], who described an elegant way to realize temporally predictable resource servers on uniprocessors that is analytically equivalent to the better-known (uniprocessor) Bandwidth Inheritance (BWI) protocol [LLA:01] (which was independently proposed in the same year). Specifically, HH:01 proposed a mechanism that they termed helping: whenever a blocked client (i.e., a client thread that seeks to rendezvous with a server thread that is not waiting to accept a synchronous IPC message) is selected by the scheduler, the server process is dispatched instead [HH:01] (see also time-slice donation [SBK:10]). HH:01’s helping mechanism [HH:01] is named in analogy to the “helping” employed in wait-free algorithms [H:91], but is fundamentally a different mechanism: “helping” in wait-free algorithms does not rely on any support by the OS [H:91]; rather, it is realized exclusively with a processor’s atomic operations (such as an atomic compare-and-swap instruction).
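The helping rule can be sketched as a small scheduler hook. The following is an illustrative model only (the Job class and pick_next function are our assumptions, not Fiasco's implementation):

```python
# Sketch of uniprocessor helping: whenever the scheduler selects a blocked
# client, it dispatches the server process the client is waiting on instead,
# so the server consumes the blocked client's processor-time allocation.
class Job:
    def __init__(self, name, priority, waiting_on=None):
        self.name = name
        self.priority = priority
        self.waiting_on = waiting_on  # server this job is blocked on, if any

def pick_next(ready_jobs):
    job = max(ready_jobs, key=lambda j: j.priority)
    while job.waiting_on is not None:  # follow the blocking chain...
        job = job.waiting_on           # ...and dispatch the server instead
    return job

server = Job("resource-server", priority=0)
client = Job("client", priority=10, waiting_on=server)
background = Job("background", priority=1)
# The blocked high-priority client is selected, but the server runs instead.
assert pick_next([client, background]) is server
```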
HP:01 extended HH:01’s helping approach to multiprocessors (under P-FP scheduling) and systematically considered key design choices and challenges. Specifically, with local helping, a preempted resource server is migrated (i.e., pulled) to the core of the blocked client, at which point the uniprocessor helping mechanism [HH:01] can be applied—an instance of the allocation inheritance principle (Section 3.5). (HP:01 also describe a variant called remote helping, in which a blocked client instead migrates to the core assigned to the server [HP:01]; this variant is not attractive from an analytical point of view and is thus not further considered here.) However, two interesting challenges arise:
What should blocked clients do when the resource server is already executing on a remote core?
How does a blocked client learn that the resource server was preempted on a remote core?
HP:01 considered two fundamental approaches. In the first approach, which they termed polling [HH:01], the blocked client simply executes a loop checking whether the resource server has become available for dispatching (i.e., whether it has been preempted), which addresses both questions. This polling approach is equivalent to preemptable spinning (i.e., it is conceptually a busy-wait loop that happens to spin on the process state of the server process), with the typical advantage of avoiding self-suspensions and the typical disadvantage of potentially wasting many processor cycles.
As an alternative, HP:01 considered a sleep and callback [HP:01] approach, where the blocked client registers its willingness to help in a data structure and then self-suspends. When the server process is preempted, the register of potential helpers is consulted and one or more blocked clients are woken up by triggering their callback functions, which requires sending inter-processor interrupts (IPIs) to the cores on which they are hosted. The sleep and callback approach is equivalent to self-suspending clients, which comes with the typical advantage that blocked clients yield the processor to lower-priority tasks, but which also introduces the typical analytical challenges and overhead issues. Since HP:01 expected critical sections (i.e., server request handlers) in their system to be quite short, and due to implementation challenges associated with the sleep and callback approach, HP:01 chose the polling approach in their implementation [HP:01].
Given that synchronous IPC (with single-threaded processes) and mutual exclusion are duals of each other, HP:01’s work [HP:01] directly applies to the multiprocessor real-time locking problem, and in fact their combination of local helping and synchronous IPC can be considered a multiprocessor real-time locking protocol that combines priority-ordered wait queues with allocation inheritance under P-FP scheduling.
Not long after HP:01 [HP:01], HA:02a [HA:02a, H:04, HA:06] proposed the use of allocation inheritance to realize a predictable suspension-based locking protocol for Pfair scheduling [BCPV:96, SA:06]. While Pfair is a global scheduler, it is not compatible with priority inheritance due to its sophisticated scheduling rules and more nuanced notion of “priority.” HA:02a hence proposed allocation inheritance as a generalization of the priority-inheritance principle that neither assumes priority-driven scheduling nor requires a priority concept. As already mentioned in Section 3.5, HA:02a coined the term “allocation inheritance,” which we have adopted in this survey to refer to the general idea of dynamically re-purposing processor-time allocations to ensure lock-holder progress. HA:02a further considered two alternatives to allocation inheritance named rate inheritance and weight inheritance [HA:02a, HA:06, H:04], which are both specific to Pfair scheduling and not further considered herein.
Much later, FLC:10 [FLC:10, FLC:12] extended the uniprocessor bandwidth inheritance protocol [LLA:01] to multiprocessors, targeting in particular multiprocessors under reservation-based scheduling. The resulting protocol, the Multiprocessor Bandwidth Inheritance (MBWI) protocol [FLC:10, FLC:12], combines allocation inheritance with FIFO-ordered wait queues and busy-waiting.
FLC:10 observed that, since the allocation inheritance principle is not specific to any particular scheduling algorithm, the MBWI protocol can be employed without any modifications or adaptations under partitioned, global, or clustered scheduling [FLC:10, FLC:12]. In fact, it can even be used in unchanged form under semi-partitioned scheduling or in the presence of tasks with arbitrary processor affinities (APAs) [GCB:13, GCB:15].
Like HP:01 [HP:01], FLC:12 chose to follow the polling approach in their implementation of the MBWI protocol in LITMUSRT [FLC:12]. As an interesting practical tweak, in FLC:12’s implementation, polling jobs detect when the lock-holding job self-suspends (e.g., due to I/O) and then self-suspend as well, to prevent wasting large amounts of processor time when synchronizing access to resources that induce self-suspensions within critical sections (e.g., such as GPUs) [FLC:10, FLC:12]. Nonetheless, the MBWI protocol is fundamentally a spin-based protocol [FLC:10, FLC:12].
In work targeting Linux with the PREEMPT_RT patch, BB:12 [BB:12] proposed to replace Linux’s implementation of priority inheritance with allocation inheritance (which they referred to as migratory priority inheritance [BB:12]) because priority inheritance is ineffective in the presence of tasks with disjoint APAs, which Linux supports [GCB:13, GCB:15].
In contrast to FLC:10’s MBWI protocol [FLC:10, FLC:12] and HP:01’s local helping implementation in Fiasco [HP:01], BB:12 [BB:12] proposed to retain Linux’s usual semaphore semantics, wherein blocked tasks self-suspend. Similarly to HP:01’s work [HP:01], and unlike the MBWI protocol [FLC:10, FLC:12], BB:12’s proposal [BB:12] uses priority-ordered wait queues.
The O(m) Independence-preserving Protocol (OMIP) [B:13, B:12] for clustered scheduling is the only protocol based on allocation inheritance that achieves asymptotic blocking optimality under s-oblivious analysis. Recall that the only other protocol for clustered scheduling that is asymptotically optimal w.r.t. maximum s-oblivious pi-blocking is the clustered OMLP [BA:11, B:11, BA:13], which relies on priority donation, a restricted variant of priority boosting, and which hence is not independence preserving. The OMIP improves upon the clustered OMLP by replacing priority donation with allocation inheritance, which ensures that lock-holding tasks remain preemptable at all times. As a result of this change in progress mechanism, the OMIP requires a multi-stage hybrid queue [B:13, B:12] similar to the one used in the global OMLP [BA:10], in contrast to the simple FIFO queues used in the clustered OMLP [BA:11]. In fact, in the special case of global scheduling, the OMIP reduces to the global OMLP, and hence can be understood as a generalization of the global OMLP [B:13, B:12]. This also underscores that allocation inheritance is a generalization of priority inheritance (Section 3.5).
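To illustrate the hybrid-queue idea, the following sketch models the global OMLP's two-stage queue (a simplification of the OMIP's multi-stage variant; class and method names are ours, and overheads and atomicity are ignored):

```python
import heapq
from collections import deque

# Two-stage hybrid queue: at most m jobs wait in a FIFO segment (whose head
# holds the lock); all further jobs wait in a priority-ordered tail and are
# promoted to the FIFO segment, highest priority first, as slots free up.
class HybridQueue:
    def __init__(self, m):
        self.m = m
        self.fifo = deque()  # at most m entries; fifo[0] is the lock holder
        self.tail = []       # min-heap; smaller value = higher priority

    def request(self, priority, job):
        if len(self.fifo) < self.m:
            self.fifo.append(job)
        else:
            heapq.heappush(self.tail, (priority, job))
        return self.fifo[0]  # current lock holder

    def release(self):
        self.fifo.popleft()
        if self.tail:        # promote the highest-priority waiting job
            _, job = heapq.heappop(self.tail)
            self.fifo.append(job)
        return self.fifo[0] if self.fifo else None

q = HybridQueue(m=2)
q.request(5, "A"); q.request(3, "B")   # FIFO segment now full
q.request(4, "C"); q.request(1, "D")   # C and D wait in the priority tail
assert q.release() == "B"              # A done; D (priority 1) is promoted
assert list(q.fifo) == ["B", "D"]
```

Since a job can be overtaken only while it waits in the priority tail, and the FIFO segment holds at most m jobs, each request waits for a number of critical sections linear in m, which underlies the O(m) s-oblivious pi-blocking bound.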
The OMIP, which has been prototyped in LITMUSRT [B:13], is suspension-based and hence requires the implementation to follow a sleep and callback approach [HP:01]. However, because the available blocking analysis is s-oblivious [B:13], which already accounts for suspension times as processor demand (Section 5.1), it can be trivially (i.e., without any changes to the analysis) changed into a spin-based protocol. Similarly, the OMIP’s multi-stage hybrid queue could be combined with the MBWI protocol [FLC:10, FLC:12] to lower the MBWI protocol’s bounds on worst-case s-blocking (i.e., O(m) bounds as in the OMIP rather than the MBWI protocol’s O(n) bounds).
One unique feature of the OMIP worth noting is that, since its blocking bounds are completely free of any terms depending on the number of tasks n, it does not require the number of tasks sharing a given resource to be known or trusted. This makes it particularly interesting for open systems and mixed-criticality systems, where the final workload composition and resource needs are either not known or not trusted.
Exploiting this property, as well as a close correspondence between s-oblivious analysis and certain processor reservation techniques, the OMIP has been used to derive a locking protocol for Virtually Exclusive Resources (VXR) [B:12] and a synchronous Mixed-Criticality IPC (MC-IPC) protocol [B:14]. The VXR and MC-IPC protocols exhibit three key features that aid system integration in a mixed-criticality context:
the number of tasks sharing a given resource need not be known for analysis purposes and no trust is implied,
different maximum critical section lengths may be assumed in the analysis of high- and low-criticality tasks, and
even non-real-time, best-effort background tasks may access shared resources in a mutually exclusive way without endangering the temporal correctness of high-criticality tasks [B:14].
Concurrently with the OMIP [B:13], BW:13a [BW:13a] proposed the Multiprocessor Resource Sharing Protocol (MrsP) for P-FP scheduling. The MrsP combines allocation inheritance with FIFO-ordered spin locks and local per-processor (i.e., uniprocessor) priority ceilings. Specifically, each global resource is protected with a FIFO-ordered spin lock as in the MSRP [GLD:01] (recall Section 4.1), but jobs remain fully preemptable while spinning or holding a resource’s spin lock, which ensures independence preservation.
To ensure progress locally, each resource is further managed, independently and concurrently on each processor, with a local priority ceiling protocol (either the PCP [SRL:90] or SRP [B:91]). From the point of view of the local ceiling protocol, the entire request for a global resource, including the spin-lock acquisition and any spinning, is considered to constitute a single “critical section,” which is similar to the use of contention tokens in the partitioned OMLP [BA:10]. Naturally, when determining a resource’s local, per-processor priority ceiling, only local tasks that access the resource are considered.
MrsP employs allocation inheritance to ensure progress across processors. Instead of literally spinning, waiting jobs may thus be replaced transparently with the lock holder, or otherwise contribute towards completing the operation of the lock-holding job (as in the SPEPP protocol [TS:97]), which means that an implementation of the MrsP can follow the simpler polling approach [HP:01].
BW:13a [BW:13a] motivate the design of the MrsP with the observation that blocking bounds for the MrsP can be stated in a way that is syntactically virtually identical to the classic uniprocessor response-time analysis equation for the PCP and SRP. For this reason, BW:13a consider the MrsP to be particularly “schedulability compatible,” and note that the MrsP is the first protocol to achieve this notion of compatibility.
While this is true in a narrow, syntactical sense, it should also be noted that every other locking protocol discussed in this survey is also “schedulability compatible” in the sense that the maximum blocking delay can be bounded a priori and incorporated into a response-time analysis. Furthermore, the “schedulability compatible” blocking analysis of the MrsP presented by BW:13a [BW:13a] is structurally similar to GLD:01’s original analysis of the MSRP [GLD:01] and relies on execution-time inflation (which is inherently pessimistic [WB:13a], recall Section 4.1). More modern blocking analysis approaches avoid execution-time inflation altogether [WB:13a] and have a more detailed model of contention (e.g., holistic blocking analyses [B:11, PBKR:18] or LP-based analyses [B:13a, WB:13a]). A less-pessimistic analysis of the MrsP using state-of-the-art methods would similarly not resemble the classic uniprocessor response-time equation in a one-to-one fashion; “schedulability compatibility” is thus less a property of the protocol, and more one of the particular analysis (which admittedly is possible in this particular form only for the MrsP).
Recently, ZGBW:17 introduced a new blocking analysis of the MrsP [ZGBW:17] that avoids execution-time inflation using an analysis setup adopted from WB:13a’s LP-based analysis framework [WB:13a]. However, in contrast to WB:13a’s analysis, ZGBW:17’s analysis is not LP-based. Rather, ZGBW:17 follow a notationally more conventional approach based on the explicit enumeration of blocking critical sections, which however has been refined to match the accuracy of WB:13a’s LP-based analysis of the MSRP [GLD:01]. While ZGBW:17’s new analysis [ZGBW:17] is not “schedulability compatible” according to BW:13a’s syntactic criterion [BW:13a], ZGBW:17’s analysis has been shown [ZGBW:17] to be much less pessimistic than BW:13a’s original inflation-based but “schedulability compatible” analysis [BW:13a].
8 Protocols for Relaxed Exclusion Constraints
In the preceding sections, we have focused exclusively on protocols that ensure mutual exclusion. However, while mutual exclusion is without a doubt the most important and most widely used constraint in practice, many systems also exhibit resource-sharing problems that call for relaxed exclusion to allow for some degree of concurrency in resource use. The two relaxed exclusion constraints that have received the most attention in prior work are reader-writer (RW) exclusion and k-exclusion (KX).
RW synchronization is a classic synchronization problem [CHP:71] wherein a shared resource may be used either exclusively by a single writer (which may update the resource’s state) or in a shared manner by any number of readers (that do not affect the resource’s state). RW synchronization is appropriate for shared resources that are rarely updated and frequently queried. For instance, an in-memory data store holding sensor values, route information, mission objectives, etc. that is used by many subsystems and updated by few is a prime candidate for RW synchronization. Similarly, at a lower level, the list of topic subscribers in a publish/subscribe middleware is another example of rarely changing, frequently queried data that must be properly synchronized.
KX synchronization is a generalization of mutual exclusion to replicated shared resources, where there are k identical, interchangeable copies (or replicas) of a shared resource. Replicated resources can be managed with counting semaphores, but require special handling in multiprocessor real-time systems to ensure analytically sound pi-blocking bounds. Examples where a need for KX synchronization arises in real-time systems include multi-GPU systems (where any task may use any GPU, but each GPU must be used by at most one task at a time) [EA:12], systems with multiple DMA engines (where again any task may program any DMA engine, but each DMA engine can carry out only one transfer at a time), and also virtual resources such as cache partitions [WHKA:13].
Since both RW and KX synchronization generalize mutual exclusion, any of the locking protocols discussed in the previous sections may be used to solve RW or KX synchronization problems. This, however, would be needlessly inefficient. The goal of locking protocols designed specifically for RW and KX synchronization is thus both (i) to increase parallelism (i.e., avoid unnecessary blocking) and (ii) to reflect this increase in parallelism as improved worst-case blocking bounds. Goal (ii) sets real-time RW and KX synchronization apart from classic (non-real-time) RW and KX solutions, since in a best-effort context it is sufficient to achieve a decrease in blocking on average.
We acknowledge that there is a large body of prior work on relaxed exclusion protocols for non-real-time and uniprocessor systems, a discussion of which is beyond the scope of this survey, and in the following focus exclusively on work targeting multiprocessor real-time systems.
8.1 Phase-Fair Reader-Writer Locks
The first multiprocessor real-time RW protocol achieving both goals (i) and (ii) was proposed by BA:09 [BA:09]. Prior work on RW synchronization for uniprocessors or general-purpose multiprocessor systems had yielded three general classes of RW locks:
reader-preference locks, where pending writers gain access to a shared resource only if there are no pending read requests;
conversely, writer-preference locks; and
task-fair locks (or FIFO RW locks), where tasks gain access to the shared resource in strict FIFO order, but consecutive readers may enter their critical sections concurrently.
From a worst-case perspective, reader-preference locks are undesirable because reads are expected to be frequent, which gives rise to prolonged writer starvation, which in turn manifests as extremely pessimistic blocking bounds [BA:10a]. Writer-preference locks are better suited to real-time systems, but come with the downside that, if there are potentially multiple concurrent writers, the worst-case blocking bound for each reader will pessimistically account for rare corner-case scenarios in which a reader is blocked by multiple consecutive writers. Finally, task-fair locks degenerate to regular mutex locks in the pathological case when readers and writers are interleaved in the queue; consequently, task-fair locks improve average-case parallelism, but their worst-case bounds do not reflect the desired gain in parallelism.
BA:09 introduced phase-fair locks [BA:09, BA:10a], a new category of RW locks better suited to worst-case analysis. In a phase-fair lock, reader and writer phases alternate, where each reader phase consists of any number of concurrent readers, and a writer phase consists of a single writer. Writers gain access to the shared resource in FIFO order w.r.t. other writers. Importantly, readers may join an ongoing reader phase only if there is no waiting writer; otherwise newly arriving readers must wait until the next reader phase, which starts after the next writer phase.
These rules ensure that writers do not starve (as in a writer-preference or task-fair lock), but also ensure O(1) blocking for readers, as any reader must await the completion of at most one reader phase and one writer phase before gaining access to the shared resource [BA:09, BA:10a]. As a result, phase-fair locks yield much improved blocking bounds for both readers and writers if reads are significantly more frequent than updates [BA:09, BA:10a].
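The admission rules can be captured in a few lines; the following is a conceptual model of phase-fairness only (not one of the published lock algorithms; names are ours, and FIFO ordering among writers is assumed but not shown):

```python
# Phase-fair admission rules: readers may join an ongoing reader phase only
# while no writer is waiting; a (single) writer enters once the current
# reader phase drains.
class PhaseFairState:
    def __init__(self):
        self.active_readers = 0
        self.writer_active = False
        self.waiting_writers = 0

    def reader_may_enter(self):
        # Newly arriving readers are cut off as soon as a writer waits,
        # so a writer waits for at most the remainder of one reader phase.
        return not self.writer_active and self.waiting_writers == 0

    def writer_may_enter(self):
        return not self.writer_active and self.active_readers == 0

s = PhaseFairState()
s.active_readers, s.waiting_writers = 2, 1  # reader phase, one waiting writer
assert not s.reader_may_enter()  # late readers wait for the next reader phase
assert not s.writer_may_enter()  # writer waits until the reader phase drains
s.active_readers = 0
assert s.writer_may_enter()      # writer phase begins
```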
Several phase-fair RW spin-lock algorithms have been presented, including compact (i.e., memory-friendly) spin locks [BA:10a, B:11], ticket locks [BA:09, BA:10a, B:11], and cache-friendly scalable queue locks [BA:10a, B:11]. Concerning RW semaphores, the clustered OMLP [BA:10, B:11, BA:13] based on priority donation includes a phase-fair RW variant, which also achieves asymptotically optimal maximum s-oblivious pi-blocking [BA:10, B:11, BA:13].
8.2 Multiprocessor Real-Time k-Exclusion Protocols
As already mentioned, in best-effort systems, KX synchronization can be readily achieved with counting semaphores. Furthermore, in the case of non-preemptive spin locks, classic ticket locks can be trivially generalized to KX locks. We hence focus in the following on semaphore-based protocols for multiprocessor real-time systems.
Given the strong progress guarantees offered by priority donation (discussed in Section 5.1.4), it is not difficult to generalize the clustered OMLP to KX synchronization [BA:11, B:11, BA:13], which yields a protocol that is often abbreviated as CK-OMLP in the literature. Since it derives from the clustered OMLP, the CK-OMLP applies to clustered scheduling, and hence also supports global and partitioned scheduling. Furthermore, under s-oblivious analysis, it ensures asymptotically optimal maximum pi-blocking [BA:11, B:11, BA:13]. As such, it covers a broad range of configurations. However, as it relies on priority donation to ensure progress, it is not independence-preserving (recall Section 7), which can be a significant limitation especially when dealing with resources such as GPUs, where critical sections are often naturally quite long. Subsequent protocols were specifically designed to overcome this limitation of the CK-OMLP.
EA:11 considered globally scheduled multiprocessors and proposed the Optimal -Exclusion Global Locking Protocol (O-KGLP) [EA:11, EA:13]. In contrast to the CK-OMLP, their protocol is based on priority inheritance, which is possible due to the restriction to global scheduling, and which enables the O-KGLP to be independence-preserving.
In the context of KX synchronization, applying priority inheritance is not as straightforward as in the mutual exclusion case because priorities must not be “duplicated”. That is, while there may be multiple resource holders (if k > 1), only at most one of them may inherit a blocked job’s priority at any time, as otherwise analytical complications similar to those caused by priority boosting arise (including the loss of independence preservation). The challenge is thus to determine, dynamically at runtime and with low overheads, which resource-holding job should inherit which blocked job’s priority.
To this end, EA:11 [EA:11, EA:13] proposed a multi-ended hybrid queue consisting of a shared priority queue that forms the tail (as in the global OMLP [BA:10]) and a set of k per-replica FIFO queues (each of length ⌈m/k⌉) that serve to serialize access to specific replicas. A job Jh holding a replica of resource ℓq inherits the priorities of the jobs waiting in the FIFO queue of Jh’s replica, and additionally the priority of one of the highest-priority jobs in the priority tail queue. Importantly, if Jh inherits the priority of a job Jb in the priority tail queue, then Jb is called the claimed job of Jh and is moved to the FIFO queue of Jh’s replica when Jh releases it. This mechanism ensures that priorities are not “duplicated” while also ensuring progress. In fact, EA:11 established that the O-KGLP is asymptotically optimal w.r.t. s-oblivious maximum pi-blocking [EA:11, EA:13].
In work on predictable interrupt management in multi-GPU systems [EA:12a], EA:12a further proposed a KX variant of the FMLP (for long resources) [BLBA:07]. This variant, called the k-FMLP, simply consists of one instantiation of the FMLP for each resource replica (i.e., each resource replica is associated with a replica-private FIFO queue that does not interact with other queues). When jobs request access to a replica of a k-replicated resource, they simply enqueue in the FIFO queue of the replica that ensures the minimal worst-case wait time (based on the currently enqueued requests). While the k-FMLP is not asymptotically optimal under s-oblivious analysis (unlike the O-KGLP and the CK-OMLP), it offers the advantage of being relatively simple to realize [EA:12a] while also ensuring independence preservation under global scheduling (unlike the CK-OMLP).
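The queue-selection step admits a simple sketch (a simplification of ours, not EA:12a's implementation; the wait implied by each replica queue is estimated as the sum of the enqueued maximum critical section lengths):

```python
# k-FMLP-style replica selection: each replica has a private FIFO queue, and
# an arriving job enqueues at the replica whose queue currently implies the
# least worst-case wait (here: the total length of already-enqueued requests).
def pick_replica(queues):
    """queues: one list per replica of enqueued max critical section lengths."""
    waits = [sum(q) for q in queues]
    return waits.index(min(waits))

# Three replicas with pending requests totaling 40, 30, and 60 time units:
assert pick_replica([[20, 20], [30], [20, 20, 20]]) == 1
```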
WEA:12 [WEA:12] realized that blocking under the O-KGLP [EA:11, EA:13] could be further improved with a more nuanced progress mechanism, which they called Replica-Request Priority Donation (RRPD) [WEA:12], and proposed the Replica-Request Donation Global Locking Protocol (R2DGLP) based on RRPD [WEA:12]. As the name suggests, RRPD transfers the ideas underlying priority donation [BA:11, B:11, BA:13] to the case of priority inheritance under global scheduling. Importantly, whereas priority donation applies to all jobs (regardless of whether they request any shared resource), RRPD applies only to jobs that synchronize (i.e., that actually request resource replicas). This ensures that RRPD is independence-preserving (in contrast to priority donation); however, because RRPD incorporates priority inheritance, it is effective only under global scheduling.
Like the O-KGLP, the R2DGLP is asymptotically optimal w.r.t. maximum s-oblivious pi-blocking. Furthermore, when fine-grained (i.e., non-asymptotic) pi-blocking bounds are considered, the R2DGLP ensures higher schedulability due to lower s-oblivious pi-blocking bounds (i.e., the R2DGLP achieves better constant factors than the O-KGLP) [WEA:12].
Another CK-OMLP variant is the PK-OMLP due to YLLR:13 [YLLR:13]. Priority donation as used by the CK-OMLP ensures that there is at most one resource-holding job per processor at any time. For resources such as GPUs, where each critical section is likely to include significant self-suspension times, this is overly restrictive. The PK-OMLP, which is intended for partitioned scheduling, hence improves upon the CK-OMLP by allowing multiple jobs on the same processor to hold replicas at the same time [YLLR:13]. Furthermore, YLLR:13 presented an s-aware blocking analysis of the PK-OMLP under P-FP scheduling, which enables a more accurate treatment of self-suspensions within critical sections (this analysis was later amended by YCH:17 [YCH:17]). As a result, the PK-OMLP usually outperforms the CK-OMLP when applied in the context of multi-GPU systems [YLLR:13]. More recently, YLLC:16 presented another KX locking protocol specifically for P-FP scheduling and s-aware analysis that forgoes asymptotic optimality in favor of priority-ordered wait queues and non-preemptive critical sections [YLLC:16].
Finally, all discussed KX protocols only ensure that no more than k tasks enter critical sections (pertaining to a given resource) at the same time. This, however, is often not enough: to be practical, a KX protocol must also be paired with a replica assignment protocol to match lock holders to replicas. That is, strictly speaking, a KX algorithm blocks a task only until it may use some replica, but it is usually also necessary to quickly resolve exactly which replica a task is supposed to use. To this end, NYYE:16 [NYYE:16] introduced several algorithms for the k-exclusion replica assignment problem, with the proposed algorithms representing different tradeoffs w.r.t. optimality considerations and overheads in practice [NYYE:16].
9 Nested Critical Sections
Allowing fine-grained, incremental nesting of critical sections—that is, allowing tasks already holding one or more locks to issue further lock requests—adds another dimension of difficulty to the multiprocessor real-time locking problem.
First of all, if tasks may request locks in any order, then allowing tasks to nest critical sections can easily result in deadlock. However, even if programmers take care to manually avoid deadlocks by carefully ordering all requests, the blocking analysis problem becomes much more challenging. In fact, in the presence of nested critical sections, the blocking analysis problem is NP-hard even in extremely simplified settings [WB:14], while it can be solved in polynomial time on both uniprocessors (even in the presence of nesting) and multiprocessors in the absence of nesting (at least in simplified settings) [WB:14, W:18].
As a result, today’s nesting-aware blocking analyses either are computationally highly expensive or yield only coarse, structurally pessimistic bounds. Furthermore, authors frequently exclude nested critical sections from consideration altogether (or allow only coarse-grained nesting via group locks, see Section 9.1 below). In the words of R:90 in his original analysis of the MPCP [R:90]: “[s]ince nested global critical sections can potentially lead to large increases in blocking durations, […] global critical sections cannot nest other critical sections or be nested inside other critical sections.” In the subsequent decades, many authors have adopted this expedient assumption.
The aspect unique to nesting that makes it so difficult to derive accurate blocking bounds is transitive blocking, where jobs are indirectly delayed due to contention for resources that they (superficially) do not even depend on. For example, if a job J1 requires only resource ℓa, but another job J2 holds ℓa while trying to acquire a second resource ℓb in a nested fashion, then J1 is exposed to delays due to contention for ℓb even though it does not require ℓb itself.
While this is a trivial example, such transitive blocking can arise via long transitive blocking chains involving arbitrarily many resources and jobs on potentially all processors. As a result, characterizing the effects of such chains in a safe way and without accruing excessive pessimism is a very challenging analysis problem. BBW:16 [BBW:16] provide more detailed examples of some of the involved challenges.
Despite these difficulties, fine-grained lock nesting arises naturally in many systems [BBW:16] and is usually desirable (or even unavoidable) from an average-case perspective. That is, even though fine-grained locking may not be advantageous from a worst-case blocking perspective, the alternative—coarse-grained locking, where lock scopes are chosen to protect multiple resources such that tasks must never acquire more than one lock at any time—is usually much worse in terms of average-case contention, attainable parallelism, scalability, and ultimately throughput. Robust and flexible support for fine-grained nesting is thus indispensable. While the current state of the art, as discussed in the following, may not yet fully meet all requirements in practice, support for nesting in analytically sound multiprocessor real-time locking protocols is an active area of research and we expect capabilities to continue to improve in the coming years.
9.1 Coarse-Grained Nesting with Group Locks
One easy way of allowing at least some degree of “nested” resource usage, without incurring the full complexity of fine-grained locking, is to (automatically) aggregate fine-grained resource requests into coarser resource groups protected by group locks. That is, instead of associating a lock with each resource (which is the usual approach), the set of shared resources is partitioned into disjoint resource groups and each such resource group is associated with a group lock.
Under this approach, prior to using a shared resource, a task must first acquire the corresponding group lock. Conversely, holding a resource group’s lock entitles a task to use any resource in the group. To eliminate lock nesting, resource groups are defined such that, if any task ever requires access to two resources ℓa and ℓb simultaneously, then ℓa and ℓb are part of the same resource group. More precisely, resource groups are defined by the transitive closure of the “may be held together” relation [BLBA:07]. As a result, no task ever holds more than one group lock.
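The grouping rule just described can be sketched as a small union-find computation over the “may be held together” relation; the resource names and pairs below are hypothetical examples, not taken from any cited protocol.

```python
# Sketch: deriving resource groups as the transitive closure of the
# "may be held together" relation, using union-find. The resource names
# and the held_together pairs below are hypothetical examples.

def resource_groups(resources, held_together):
    """Partition 'resources' so that any two resources that may be
    held simultaneously end up in the same group."""
    parent = {r: r for r in resources}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path compression
            r = parent[r]
        return r

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in held_together:
        union(a, b)

    groups = {}
    for r in resources:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())

# Example: ell_a may be held with ell_b, and ell_b with ell_c, so all three
# collapse into one group even though ell_a and ell_c are never held together.
print(resource_groups(["a", "b", "c", "d"], [("a", "b"), ("b", "c")]))
```

Note how the transitive closure merges resources that are never directly held together, which foreshadows the pathological collapse discussed below.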
The use of group locks was first proposed by R:90 [R:90] in the context of the MPCP, and re-popularized in recent years by the FMLP [BLBA:07]. Both protocols rely exclusively on group locks, in the sense that to date no analysis with support for fine-grained nesting has been presented for either protocol.
From an analysis point of view, group locks are extremely convenient—the synchronization problem (at runtime) and the blocking analysis problem (at design time) both fully reduce to the non-nested cases. As a result, any of the protocols and analyses surveyed in the preceding sections can be directly applied to the analysis of group locks.
However, there are also obvious downsides in practice. For one, resource groups must be explicitly determined at design time, and group membership must be known at runtime (or compiled into the system) so that tasks may acquire the appropriate group lock when requesting a resource. From a software development point of view, this is at least inconvenient, and it may pose significant engineering challenges in complex systems. Furthermore, since its very purpose is to eliminate incremental lock acquisitions, group locking comes with all the scalability and performance problems associated with coarse-grained synchronization.
Last but not least, for certain resources and systems, it may not be possible to define appropriate resource groups. As a pathological example, assume a UNIX-like kernel and consider the file system’s inode objects, which are typically arranged in a tree that reflects the file system’s hierarchy. Importantly, certain file system procedures operate on multiple inodes at once (e.g., the inodes for a file and its parent directory), and since files may be moved dynamically at runtime (i.e., inodes may show up at any point in the tree), virtually any two inodes could theoretically be held simultaneously at some point. As a result, the set of all inodes collapses into a single resource group, with obvious performance implications.
Thus, while group locks can help to let programmers express resource use in a fine-grained manner, clearly more flexible solutions are needed. Specifically, for performance and scalability reasons, non-conflicting requests for different resources should generally be allowed to proceed in parallel, even if some task may simultaneously hold both resources at some other time. We next discuss multiprocessor real-time locking protocols that realize this to varying degrees.
9.2 Early Protocol Support for Nested Critical Sections
Research on support for fine-grained nesting in real-time multiprocessor locking protocols can be grouped into roughly two eras: a period of initial results that lasted from the late 1980s until the mid-1990s, and a renewed focus on the topic that started to emerge in 2010. We discuss protocols from the initial period next, then turn to the more recent developments in Section 9.3 and to asymptotically optimal fine-grained nesting in Section 9.4 below.
The first multiprocessor real-time locking protocol, the DPCP [RSL:88], was in fact also the first protocol to include support for fine-grained nesting, albeit with a significant restriction. Recall from Section 6 that the DPCP executes critical sections centrally on designated synchronization processors. Because the DPCP relies on the uniprocessor PCP on each synchronization processor, and since the PCP supports nested critical sections (and prevents deadlock) [SRL:90], it is in fact trivial for the DPCP to support nested critical sections as long as nesting occurs only among resources assigned to the same processor. Resources assigned to different synchronization processors, however, are not allowed to be nested under the DPCP [RSL:88, R:91].
Consequently, the DPCP’s support for fine-grained nesting is actually not so different from group locks—all nesting must be taken into account up front, and resources assigned to the same synchronization processor form essentially a resource group. In fact, just as there is no parallelism among non-conflicting requests for resources protected by the same group lock, under the DPCP, there is no parallelism in case of (otherwise) non-conflicting requests for resources assigned to the same synchronization processor. (Conversely, group locks can also be thought of as a kind of “virtual synchronization processors.”) The approach followed by the DPCP thus is attractively simple, but not substantially more flexible than group locks.
The DPCP’s same-processor restriction was later removed by RM:95 [RM:95], who in 1995 proposed a protocol that fully generalizes the DPCP and supports fine-grained nesting for all resources. As with the DPCP, for each resource, there is a dedicated synchronization processor responsible for sequencing conflicting requests. However, unlike the DPCP, RM:95’s protocol [RM:95] does not require all critical sections pertaining to a resource to execute on the synchronization processor; rather, the protocol allows for full flexibility: any critical section may reside on any processor. As a result, nested sections that access multiple resources managed by different synchronization processors become possible.
RM:95’s protocol [RM:95] works as follows. To ensure mutual exclusion among distributed critical sections and to prevent deadlock, RM:95 introduced a pre-claiming mechanism that realizes conservative two-phase locking: when a task seeks to enter a critical section, it first identifies the set of all resources that it might require while executing the critical section, and then for each such resource sends a request message to the corresponding synchronization processor. Each synchronization processor replies with a grant message when the resource is available, and once grant messages have been received for all requested resources, the task enters its critical section. As resources are no longer required, the task sends release messages to the respective synchronization processors; the critical section ends when all resources have been released.
To avoid deadlock, synchronization processors further send preempt messages if a request message from a higher-priority task is received after the resource has been already granted to a lower-priority task (and no matching release message has been received yet). There are two possibilities: either the lower-priority task has already commenced execution of its critical section, in which case the preempt message is safely ignored as it will soon release the resource anyway, or it has not yet commenced execution, in which case it releases the resource immediately and awaits another grant message for the just-released resource. Deadlock is impossible because of the protocol’s all-or-nothing semantics: tasks request all resources up front, commence execution only when they have acquired all resources, and while executing a critical section only release resources (i.e., conservative two-phase locking).
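The all-or-nothing semantics of conservative two-phase locking can be sketched as follows; this is a minimal shared-memory approximation that abstracts away RM:95’s message passing and preempt mechanism, and all class and resource names are illustrative.

```python
# Sketch of conservative two-phase locking semantics (all-or-nothing
# pre-claiming), abstracting away RM:95's message-based implementation.
# All names are illustrative.
import threading

class PreclaimingLockManager:
    def __init__(self):
        self._cond = threading.Condition()
        self._held = set()

    def enter_critical_section(self, needed):
        """Block until *all* resources in 'needed' are free, then
        acquire them atomically (the growing phase happens up front)."""
        needed = set(needed)
        with self._cond:
            self._cond.wait_for(lambda: not (needed & self._held))
            self._held |= needed

    def release(self, resources):
        """Shrinking phase: resources may only be released, never
        acquired, while a critical section is in progress."""
        with self._cond:
            self._held -= set(resources)
            self._cond.notify_all()

mgr = PreclaimingLockManager()
mgr.enter_critical_section({"ell_a", "ell_b"})  # waits for both at once
mgr.release({"ell_b"})                          # release incrementally...
mgr.release({"ell_a"})                          # ...but never re-acquire
```

The key point visible in the sketch is that a task must name its entire resource superset before entering the critical section, which is exactly the inflexibility criticized in the following paragraph.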
Compared to the DPCP [RSL:88], RM:95’s protocol [RM:95] is a significant improvement in terms of flexibility and versatility. However, while their protocol allows tasks to use multiple shared resources at once, it does not allow tasks to lock multiple resources incrementally. From a programmer’s point of view, it can be cumbersome (or even impossible) to determine all resources that will be required prior to commencing a critical section. Specifically, if the current state of one of the requested resources impacts which other resources are also needed (e.g., if a shared object contains a pointer to another, a priori unknown resource), then a task must initially lock the superset of all resources that it might need, only to then immediately release whichever resources are not actually needed. Such conservative two-phase locking leads to a lot of unnecessary blocking in the worst case, and is known to suffer from performance penalties even in the average case.
The first work-conserving protocol—in the sense that it always allows non-conflicting requests to proceed in parallel—was developed already in 1992 and is due to SZ:92 [Z:92, SZ:92]. As previously discussed in Section 5.2.3, SZ:92’s protocol includes an online admission test, which rejects lock requests that cannot be shown (at runtime, based on current contention conditions) to be satisfied within a specified waiting-time bound [Z:92]. As a result, SZ:92’s protocol prevents deadlock—even if tasks request resources incrementally and in arbitrary order—as any request that would cause deadlock will certainly be denied by the admission test.
It should be noted that, from the programmer’s point of view, this notion of deadlock avoidance is significantly different from deadlock avoidance as realized by the classic PCP [SRL:90] and SRP [B:91] uniprocessor protocols: whereas the PCP and SRP defer potentially deadlock-causing resource acquisitions, which is logically transparent to the task, SZ:92’s protocol outright rejects such lock requests [Z:92], so that lock-acquisition failures must be handled in the task’s logic.
Concerning blocking bounds, Z:92’s analysis [Z:92] requires that the bound on the maximum length of outer critical sections must include all blocking incurred due to inner (i.e., nested) critical sections. This assumption, which is common also in later analyses [BW:13a, GZBW:17], unfortunately leads to substantial structural pessimism.
For example, consider a scenario in which a job J1 repeatedly accesses two resources ℓa and ℓb in a nested fashion (i.e., locks ℓa first and then locks ℓb, and does so N times across its execution). Now suppose there is another job J2 on a remote core that accesses ℓb just once. Since J1 can incur blocking due to J2’s infrequent critical section when it tries to acquire ℓb while already holding ℓa, it follows that J1’s maximum critical section length w.r.t. ℓa must include J2’s maximum critical section length w.r.t. ℓb. Thus, if J1 accesses ℓa and ℓb in a nested fashion N times, then J2’s critical section will be over-represented in J1’s response-time bound by a factor of N—a safe, but pessimistic bound that grossly overstates the actual blocking penalty due to transitive blocking.
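The over-counting effect of inflation-based analysis can be illustrated with a toy calculation; all numeric parameters below are made up for illustration.

```python
# Toy illustration of the structural pessimism of inflation-based analysis.
# Job J1 nests an access to ell_b inside ell_a, N times per job; a remote
# job J2 accesses ell_b only once. All parameters are hypothetical.
N = 10           # times J1 locks ell_a (with ell_b nested inside)
L_a_inner = 5    # J1's own computation per ell_a critical section (us)
L_b_remote = 50  # J2's single critical section on ell_b (us)

# Inflation: every one of J1's N critical sections on ell_a is padded by
# the worst-case wait for ell_b, so J2's one critical section is charged
# N times in J1's response-time bound.
inflated_blocking = N * (L_a_inner + L_b_remote) - N * L_a_inner

# In reality, J2's single request can delay J1 at most once.
actual_transitive_delay = L_b_remote

print(inflated_blocking, actual_transitive_delay)  # 500 vs. 50
```

With these (made-up) numbers, the inflation-based bound charges ten times the transitive delay that can actually occur, mirroring the factor-of-N over-representation described above.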
Another early protocol that supports fine-grained nesting is CTB:94’s MDPCP [CTB:94] for periodic tasks. Due to the underlying careful definition of inter-processor priority ceilings (which, as discussed in Section 5.2.3, rests on the restriction to periodic tasks), the MDPCP is able to prevent transitive blocking and deadlocks [CT:94] analogously to the PCP [SRL:90]. Furthermore, for the priority ceilings to work correctly, the MDPCP requires that any two resources that might be held together must be shared by exactly the same sets of processors [CTB:94].
Finally, in 1995, TS:95 [TS:95] made an important observation concerning the worst-case s-blocking in the presence of nested non-preemptive FIFO spin locks. Let d denote the maximum nesting depth (i.e., the maximum number of FIFO spin locks that any task holds at a time), where d ≥ 1. TS:95 [TS:95] showed that, under maximum contention, tasks may incur s-blocking for the combined duration of Ω(m^d) critical sections [TS:95]. That is, the potential for accumulated transitive blocking makes the bound on worst-case s-blocking in the presence of nesting exponentially worse (w.r.t. the maximum nesting depth d) relative to the bound on maximum s-blocking in the non-nested case, which is simply O(m) (recall Section 4.1).
Intuitively, this effect arises as follows. Let us say that an outermost (i.e., non-nested) critical section is of level one, and that any critical section immediately nested in a level-one critical section is of level two, and so on. Consider a tower of critical sections, that is, a level-1 critical section, containing exactly one level-2 critical section, containing exactly one level-3 critical section, and so on up to level d. Observe that:
(1) a job Jb blocked on the level-1 lock can be directly delayed by m − 1 earlier-enqueued jobs;
(2) each of which can be directly delayed by m − 1 earlier-enqueued jobs when trying to enter the level-2 nested critical section;
(3) each of which can be directly delayed by m − 1 earlier-enqueued jobs when trying to enter the level-3 nested critical section;
(4) and so on up to level d.
Crucially, all the direct s-blocking incurred by various jobs in steps (2)–(4) also transitively delays Jb, which thus accumulates all delays and therefore incurs s-blocking exponential in d. The actual construction that TS:95 [TS:95] used to establish the lower bound is more nuanced than what is sketched here because non-preemptive execution can actually reduce blocking (a job that occupies a processor while spinning on one lock prevents that processor from serving jobs that would generate contention for locks at other nesting levels).
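The accumulation sketched above can be tallied with a simple sum; as a deliberate simplification of TS:95’s more nuanced construction, this assumes each of the d levels contributes m − 1 earlier-enqueued jobs to the transitive delay of Jb.

```python
# Sketch: counting the critical sections that can transitively delay a job
# J_b in the "tower" construction, under the simplifying assumption that
# each of the d levels contributes m - 1 earlier-enqueued jobs.
def tower_blocking_count(m, d):
    # Level 1 contributes (m - 1) blocking jobs, each of which is delayed
    # by (m - 1) further jobs at level 2, and so on: a geometric accumulation
    # of (m - 1)^1 + (m - 1)^2 + ... + (m - 1)^d critical sections.
    return sum((m - 1) ** level for level in range(1, d + 1))

for d in (1, 2, 3):
    # Grows as Theta(m^d) in the nesting depth d (here with m = 4).
    print(d, tower_blocking_count(4, d))
```

For m = 4 the count grows from 3 (no nesting) to 12 (d = 2) to 39 (d = 3), illustrating the exponential dependence on the nesting depth.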
Even in the most basic case (i.e., simple immediate nesting), where no task ever holds more than two locks at the same time (d = 2), nested FIFO spin locks result in an s-blocking bound that is quadratic in the number of cores, which is clearly undesirable from a scalability point of view. To address this bottleneck, TS:95 [TS:95] proposed an ingenious solution that brings worst-case s-blocking back under control (especially in the special case of d = 2) by using priority-ordered spin locks, but with a twist. Instead of using scheduling priorities, TS:95 let jobs use timestamps as priorities—more precisely, a job’s locking priority is given by the time at which it issued its current outermost lock request (i.e., nested requests do not affect a job’s current locking priority), with the interpretation that an earlier timestamp implies higher priority (i.e., FIFO w.r.t. outermost request timestamps). The necessary timestamps need not actually reflect “time” and can be easily obtained from an atomic counter, such as those used in ticket locks.
For the outermost lock (i.e., the level-one lock), the timestamp order is actually equivalent to FIFO. The key difference is the effect on the queue order in nested locks: when using FIFO-ordered spin locks, a job’s “locking priority” is effectively given by the time of its nested lock request; with TS:95’s scheme [TS:95], the job’s locking priority is instead given by the time of its outermost, non-nested lock request and remains invariant throughout all nested critical sections (until the outermost lock is released).
Because at most m jobs are running at any time, TS:95’s definition of locking priorities ensures that there are never more than m − 1 jobs with higher locking priorities (i.e., earlier timestamps). In the special case of d = 2, this highly practical technique suffices to restore an O(m) upper bound on maximum s-blocking. However, in the general case (d > 2), additional heavy-weight techniques are required to ensure progress [TS:95], and even then TS:95’s method [TS:95] unfortunately does not achieve an O(m) bound in the general case. Interestingly, TS:95 report that they were able to realize a multiprocessor RTOS kernel with a maximum nesting depth of d = 2 [TS:95].
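The effect of outermost-request timestamps on nested queue order can be sketched as follows; the Job class and queue structure are illustrative, and the atomic ticket counter is approximated with a plain (non-concurrent) Python counter.

```python
# Sketch of TS:95-style timestamp locking priorities: a job's priority is
# the (ticket-counter) timestamp of its *outermost* lock request, and all
# nested requests are enqueued by that same timestamp. The data structures
# here are illustrative, not TS:95's actual implementation.
import itertools

_ticket = itertools.count()  # stands in for an atomic ticket counter

class Job:
    def __init__(self, name):
        self.name = name
        self.timestamp = None

    def outermost_request(self):
        self.timestamp = next(_ticket)  # taken once, at the outermost lock
    # Nested requests reuse self.timestamp unchanged.

def enqueue(queue, job):
    """Insert by increasing outermost timestamp (earlier = higher priority)."""
    queue.append(job)
    queue.sort(key=lambda j: j.timestamp)

j1, j2 = Job("J1"), Job("J2")
j1.outermost_request()   # J1 begins its outermost critical section first
j2.outermost_request()
# Even if J2 reaches a nested lock's queue before J1 does, J1 skips ahead
# because its outermost timestamp is earlier:
nested_queue = []
enqueue(nested_queue, j2)
enqueue(nested_queue, j1)
print([j.name for j in nested_queue])  # J1 first, despite enqueueing later
```

Under plain FIFO spin locks, by contrast, J2 would keep its head-of-queue position in the nested lock’s queue, which is exactly the difference discussed next.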
9.3 Recent Advances in Multiprocessor Real-Time Locking with Unrestricted Nesting
Renewed interest in fine-grained locking emerged again in 2010 with FLC:10’s MBWI protocol [FLC:10, FLC:12], which explicitly supports fine-grained locking and nested critical sections, albeit without any particular rules to aid or restrict nesting. In particular, the MBWI (i) does not prevent deadlock and (ii) uses FIFO queues even for nested requests.
As a result of (ii), TS:95’s observation regarding the exponential growth of maximum blocking bounds [TS:95] also transfers to nesting under the MBWI protocol. However, the impact of TS:95’s observation [TS:95] is lessened somewhat in practice once fine-grained blocking analyses (rather than coarse, asymptotic bounds) are applied to specific workloads, since TS:95’s lower bound is based on the assumption of extreme, maximal contention, whereas lock contention in real workloads (and hence worst-case blocking) is constrained by task periods and the number of critical sections per job.
Concerning (i), programmers are expected to arrange all critical sections such that the “nested in” relation among locks forms a partial order—which prevents cycles in the wait-for graph and thus prevents deadlock. This is a common approach and widely used in practice. For instance, the Linux kernel relies on this well-ordered nesting principle to prevent deadlock, and (as a debugging option) employs a locking discipline checker called lockdep to validate at runtime that all observed lock acquisitions are compliant with some partial order.
Notably, to prevent deadlock, it is sufficient for such a partial order to exist; it need not be known (and for complex systems such as Linux it generally is not, at least not in its entirety). For worst-case blocking analysis purposes, however, all critical sections and all nesting relationships must of course be fully known, and based on this information it is trivial to infer the nesting partial order. For simplicity, we assume that resources are indexed in accordance with the partial order (i.e., a job may lock ℓb while already holding ℓa only if a < b).
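A lockdep-style discipline check—record observed “outer → inner” acquisition edges and reject any acquisition that would preclude a partial order—can be sketched as follows; this is a simplified illustration, not the Linux kernel’s actual lockdep implementation.

```python
# Sketch of a lockdep-style runtime check: record observed "outer -> inner"
# acquisition edges and reject any acquisition that would create a cycle
# (i.e., would make a well-ordering of the locks impossible). Simplified
# and illustrative; not Linux's actual lockdep.
class NestingOrderChecker:
    def __init__(self):
        self.edges = {}  # lock -> set of locks acquired while holding it

    def _reachable(self, src, dst):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return False

    def acquire(self, held, new_lock):
        """Record that 'new_lock' is taken while 'held' locks are held;
        raise if this contradicts every possible partial order."""
        for outer in held:
            if self._reachable(new_lock, outer):
                raise RuntimeError(f"order inversion: {outer} -> {new_lock}")
            self.edges.setdefault(outer, set()).add(new_lock)

checker = NestingOrderChecker()
checker.acquire([], "ell_1")
checker.acquire(["ell_1"], "ell_2")      # establishes ell_1 -> ell_2
try:
    checker.acquire(["ell_2"], "ell_1")  # would create a cycle: rejected
except RuntimeError as e:
    print("rejected:", e)
```

As in lockdep, the check needs only the acquisitions actually observed at runtime; the full partial order never has to be written down explicitly.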
Assuming that all nesting is well-ordered, FLC:12 [FLC:12] presented a novel blocking-analysis algorithm for nested critical sections that characterizes the effects of transitive blocking much more accurately than the crude, inflation-based bounds used previously (e.g., recall the discussion of SZ:92’s protocol [Z:92, SZ:92] and Z:92’s analysis [Z:92] in Section 9.2 above). Aiming for this level of accuracy was a major step forward, but unfortunately FLC:12’s algorithm exhibits super-exponential runtime complexity [FLC:12]. As already mentioned, unrestricted nesting is inherently difficult to analyze accurately [WB:14].
BW:13a’s MrsP [BW:13a] is another recent spin-based protocol that supports fine-grained, well-ordered nesting in an otherwise unrestricted manner. While the initial version of the protocol [BW:13a] already offered basic support for nesting, the original analysis (which heavily relies on inflation, similar to Z:92’s approach [Z:92]) left some questions pertaining to the correct accounting of transitive blocking unanswered [BBW:16]. A revised and clarified version of the MrsP with better support for nested critical sections was recently presented by GZBW:17 [GZBW:17], including a corrected analysis of worst-case blocking in the presence of nested critical sections [GZBW:17]. GZBW:17’s revised analysis is still based on execution-time inflation and thus subject to the same structural pessimism as Z:92’s approach [Z:92] (as discussed in Section 9.2). In particular, GZBW:17’s revised analysis [GZBW:17] does not yet incorporate ZGBW:17’s recently introduced, less pessimistic, inflation-free analysis setup [ZGBW:17]. Conversely, ZGBW:17’s improved analysis [ZGBW:17] does not yet support fine-grained nesting.
In recent work following an alternative approach, BBW:16 [BBW:16] developed a MILP-based blocking analysis of the classic MSRP [GLD:01] with fine-grained, well-ordered nesting. In GLD:01’s original definition of the MSRP [GLD:01], nesting of global resources is explicitly disallowed. However, as long as all nesting is well-ordered, the protocol is capable of supporting fine-grained nesting—the lack of nesting support in the original MSRP is simply a matter of missing analysis, not fundamental incompatibility, and can be explained by the fact that in 2001 analysis techniques were not yet sufficiently advanced to enable a reasonably accurate blocking analysis of nested global critical sections. Leveraging a modern MILP-based approach inspired by earlier LP- and MILP-based analyses of non-nested protocols [B:13a, WB:13a, BB:16, YWB:15], BBW:16 [BBW:16] provided the first analysis of the MSRP with support for fine-grained nesting. As a result, the MSRP may now be employed under P-FP scheduling without any nesting restrictions (other than the well-ordered nesting principle, which is required to prevent deadlock).
Interestingly, while the MSRP uses non-preemptive FIFO spin locks, which is precisely the type of lock that TS:95 [TS:95] showed to be vulnerable to exponential transitive blocking, BBW:16’s MILP-based analysis [BBW:16] is effective in analyzing transitive blocking because the MILP-based approach inherently avoids accounting for any critical section more than once [BBW:16]. Thus, while in theory FIFO-ordered spin locks cannot prevent exponential transitive blocking in pathological corner cases with extreme levels of contention, this is less of a concern in practice given a sufficiently accurate analysis (i.e., if the analysis does not over-estimate contention) since well-engineered systems are usually designed to minimize resource conflicts.
While MILP solving is computationally quite demanding, BBW:16’s analysis [BBW:16] comes with the advantage of resting on a solid formal foundation that offers a precise, graph-based abstraction for reasoning about possible blocking delays, and which ultimately enables rigorous individual proofs of all MILP constraints. Given the challenges inherent in the analysis of transitive blocking, BBW:16’s formal foundation and MILP-based analysis approach [BBW:16] provide a good starting point for future analyses of fine-grained nesting in multiprocessor real-time locking protocols.
9.4 Asymptotically Optimal Multiprocessor Real-Time Locking
The most significant recent result in real-time locking is due to WA:12 [WA:12], who in 2012 presented a surprising breakthrough by showing that, with a few careful restrictions, it is possible to control the occurrence of transitive blocking and thereby ensure favorable—in fact, asymptotically optimal—worst-case blocking bounds. Specifically, WA:12 introduced the Real-time Nested Locking Protocol (RNLP) [WA:12], which is actually a meta-protocol that can be configured with several progress mechanisms and lock-acquisition rules to yield either spin- or suspension-based protocols that support fine-grained, incremental, and yet highly predictable nested locking.
Depending on the specific configuration, the RNLP ensures either s-oblivious or s-aware asymptotically optimal maximum pi-blocking, in the presence of well-ordered nested critical sections and for any nesting depth d. Furthermore, the RNLP is widely applicable: it supports clustered JLFP scheduling, and hence also covers the important special cases of G-EDF, G-FP, P-EDF, and P-FP scheduling. Specifically, if applied on top of priority donation [BA:11] (respectively, RSB [B:14b]), the RNLP yields an O(m) (respectively, O(n)) bound on maximum s-oblivious (respectively, s-aware) pi-blocking under clustered JLFP scheduling [WA:12]. The RNLP can also be instantiated on top of priority boosting similarly to the FMLP+ [B:11] under partitioned JLFP scheduling to ensure O(n) maximum s-aware pi-blocking [WA:12], and on top of priority inheritance under global JLFP scheduling to ensure O(m) maximum s-oblivious pi-blocking [WA:12].
Analogously to the s-oblivious case, the RNLP can also be configured to use non-preemptive execution and spin locks to obtain an O(m) bound on maximum s-blocking (again for any nesting depth d) [WA:12]. As this contrasts nicely with TS:95’s Ω(m^d) lower bound in the case of unrestricted nesting [TS:95], and since the spin-based RNLP is slightly easier to understand than suspension-based configurations of the RNLP optimized for either s-oblivious or s-aware analysis, we will briefly sketch the spin-based RNLP variant in the following.
The RNLP does not automatically prevent deadlock and requires all tasks to issue only well-ordered nested requests (w.r.t. a given partial order). The RNLP’s runtime mechanism consists of two main components: a token lock and a request satisfaction mechanism (RSM). Both are global structures, that is, all requests for any resource interact with the same token lock and RSM.
The token lock is a k-exclusion lock that serves two purposes: (i) it limits the number of tasks that can concurrently interact with the RSM, and (ii) it assigns each job a timestamp that indicates when the job acquired its token (similar to the time-stamping of outermost critical sections in TS:95’s earlier protocol [TS:95] based on priority-ordered spin locks). If the RNLP is instantiated as a spin-based protocol or for s-oblivious analysis, then k = m [WA:12]. (Otherwise, in the case of s-aware analysis, k = n [WA:12].)
Before entering an outermost critical section (i.e., when not yet holding any locks), a job must first acquire a token from the token lock. Once it holds a token, it may interact with the RSM. In particular, it may repeatedly request resources from the RSM in an incremental fashion, acquiring and releasing resources as needed, as long as nested requests are well-ordered. Once a job releases its last resource (i.e., when it leaves its outermost critical section), it also relinquishes its token.
In the spin-based configuration of the RNLP, jobs become non-preemptable as soon as they acquire a token, and remain non-preemptable until they release their token. Since non-preemptive execution already ensures that at most m tasks can be non-preemptable at the same time, in fact no further KX synchronization protocol is required; WA:12 refer to this solution as a trivial token lock (TTL) [WA:12]. A TTL simply records a timestamp when a job becomes non-preemptable, at which point it may request resources from the RSM.
The specifics of the RSM differ in minor ways based on the exact configuration of the RNLP, but all RSM variants share the following key characteristics of the spin-based RSM. Within the RSM, there is a wait queue for each resource ℓq, and when a job requests a resource, it enters the corresponding wait queue. As previously seen in TS:95’s protocol [TS:95], jobs are queued in order of increasing timestamps. In the absence of nesting, this reduces again to FIFO queues, but when issuing nested requests, jobs may benefit from an earlier timestamp and “skip ahead” of jobs that acquired their tokens at a later time.
However, there is a crucial deviation from TS:95’s protocol [TS:95] that makes all the difference: whereas in TS:95’s protocol a job at the head of a queue acquires the resource as soon as possible, the RNLP’s RSM may choose to not satisfy a request for a resource even when it is available [WA:12]. That is, the RNLP is non-work-conserving and may elect to withhold currently uncontested resources in anticipation of a potential later request that must not be delayed (which is not entirely unlike the use of priority ceilings in the classic PCP [SRL:90]). Specifically, a job at the head of a resource ℓq’s queue may not acquire ℓq if there exists another token-holding job with an earlier token timestamp that might still request ℓq [WA:12].
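The RSM’s non-work-conserving satisfaction rule can be sketched as a simple predicate; the data structures and the may_request oracle below are illustrative assumptions, not WA:12’s actual formulation.

```python
# Sketch of the RNLP RSM's non-work-conserving satisfaction rule: a request
# at the head of resource ell_q's (timestamp-ordered) queue is satisfied
# only if no token holder with an earlier timestamp might still request
# ell_q. Data structures and the 'may_request' oracle are illustrative.
def may_satisfy(head_job, resource, token_holders, may_request):
    """head_job: (name, timestamp) pair at the head of 'resource's queue.
    token_holders: all (name, timestamp) pairs currently holding a token.
    may_request(job, resource): True if 'job' might still request it
    (derived from the known partial nesting order)."""
    _, ts = head_job
    return not any(
        other_ts < ts and may_request(other, resource)
        for other, other_ts in token_holders
        if (other, other_ts) != head_job
    )

# J1 acquired its token first (timestamp 0) and, per the known nesting
# order, might still issue a nested request for ell_b; so J2 must wait
# even though ell_b is currently free.
holders = [("J1", 0), ("J2", 1)]
may_request = lambda job, res: (job, res) == ("J1", "ell_b")
print(may_satisfy(("J2", 1), "ell_b", holders, may_request))  # -> False
print(may_satisfy(("J1", 0), "ell_b", holders, may_request))  # -> True
```

The predicate makes explicit why the partial nesting order must be available at runtime: without the may_request oracle, the RSM could not tell which currently free resources to withhold.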
As a result of this non-work-conserving behavior and the use of timestamp-ordered wait queues, the RNLP ensures that no job is ever blocked by a request of a job with a later token timestamp, even when issuing nested requests. This property suffices to show O(m) maximum s-blocking per outermost critical section (because there can be at most m − 1 jobs with earlier token timestamps). It bears repeating that the RNLP’s O(m) bound holds for any nesting depth d, whereas TS:95’s work-conserving protocol [TS:95] ensures O(m) maximum s-blocking only for d = 2, and even then TS:95’s protocol exhibits worse constant factors (i.e., is subject to additional s-blocking).
The RNLP’s non-work-conserving RSM behavior has two major implications: first, while the RNLP controls worst-case blocking optimally (asymptotically speaking), it does so at the price of a potential increase in average-case blocking when jobs are denied access to rarely nested, but frequently accessed resources. Second, all potential nesting must be known at runtime (i.e., the partial nesting order must not only exist, it must also be available to the RNLP). This is required so that the RSM can appropriately reserve resources that may be incrementally locked at a later time (i.e., to deny jobs with later token timestamps access to resources that might still be needed by jobs with earlier token timestamps). In practical terms, the need to explicitly determine, store, and communicate the partial nesting order imposes some additional software engineering effort (e.g., at system-integration time).
Subsequently, in work aimed at making the RNLP even more versatile and efficient for practical use, WA:13 [WA:13] introduced a number of extensions and refinements. Most significantly, they introduced the notion of dynamic group locks (DGLs) [WA:13] to the RNLP. As the name suggests, a DGL allows tasks to lock multiple resources in one operation with all-or-nothing semantics, similarly to a (static) group lock (recall Section 9.1), but without the need to define groups a priori, and without requiring that groups be disjoint. In a sense, DGLs are similar to the pre-claiming mechanism of RM:95 [RM:95], but there is one important difference: whereas RM:95 enforce conservative two-phase locking semantics—once a task holds some resources, it cannot acquire any additional locks—in the RNLP, tasks are free to issue as many DGL requests as needed in an incremental fashion. That is, the RNLP supports truly nested, fine-grained DGLs. Notably, introducing DGLs does not negatively affect the RNLP’s blocking bounds, and the original RNLP [WA:12] can thus be understood as a special case of the DGL-capable RNLP [WA:13] where each DGL request pertains to just a single resource (i.e., a singleton “group” lock).
Additionally, WA:13 [WA:13] introduced the possibility to apply the RNLP as a KX synchronization protocol (also with asymptotically optimal blocking bounds). In particular, KX synchronization is possible in conjunction with DGLs, so that tasks can request multiple replicas of different resources as one atomic operation.
As another practical extension of the RNLP, WA:13 [WA:13] introduced the ability to combine both spin- and suspension-based locks such that requests for spin locks are not blocked by requests for semaphores (called “short-on-long blocking” [WA:13]), since critical sections pertaining to suspension-based locks are likely to be much longer (possibly by one or more orders of magnitude) than critical sections pertaining to spin locks.
In 2014, in a further major broadening of the RNLP’s capabilities [WA:12, WA:13], WA:14 presented the Reader-Writer RNLP (RW-RNLP) [WA:14] for nested RW synchronization. Building on the principles of the RNLP and phase-fair locks [BA:09, BA:10a, B:11], WA:14 derived a RW protocol that achieves asymptotically optimal maximum pi- or s-blocking (like the RNLP) and per-request reader blocking (phase-fairness), while allowing for a great deal of flexibility: tasks may arbitrarily nest read and write critical sections, upgrade read locks to write locks, and lock resources incrementally. While a detailed discussion is beyond the scope of this survey, we note that integrating RW semantics into the RNLP, in particular without giving up phase-fairness, is nontrivial and required substantial advances in techniques and analysis [WA:14].
In 2015, JWA:15 [JWA:15] introduced a contention-sensitive variant of the RNLP [WA:12], denoted C-RNLP. In contrast to the original RNLP, and the vast majority of other protocols considered herein, the C-RNLP exploits knowledge of maximum critical section lengths at runtime to react dynamically to actual contention levels. (SZ:92’s protocol [Z:92, SZ:92] also uses maximum critical section lengths at runtime.) At a high level, the C-RNLP dynamically overrides the RNLP’s regular queue order to lessen the blocking caused by the RNLP’s non-work-conserving behavior, but only if it can be shown that doing so will not violate the RNLP’s guaranteed worst-case blocking bounds. Since heavy resource contention is usually rare in practice, contention sensitivity as realized in the C-RNLP can achieve substantially lower blocking in many systems. As a tradeoff, the C-RNLP unsurprisingly comes with higher lock acquisition overheads, which however can be addressed with a novel implementation approach [NAA:18] (see Section 10.4). Furthermore, it requires accurate information on worst-case critical section lengths to be available at runtime, which can be inconvenient from a software engineering perspective.
Overall, the RNLP and its various extensions represent the state of the art w.r.t. support for fine-grained nesting with acquisition restrictions that prevent excessive transitive blocking. Importantly, WA:12 [WA:12] established with the RNLP that asymptotically optimal bounds on maximum s-blocking, maximum s-oblivious pi-blocking, and maximum s-aware pi-blocking are all attainable in the presence of nested critical sections, even for arbitrary nesting depths, which was far from obvious at the time given TS:95’s prior negative results [TS:95].
10 Implementation Aspects
While our focus in this survey is on algorithmic properties and analytical guarantees, there also exists a rich literature pertaining to the implementation of multiprocessor real-time locking protocols and their integration with programming languages. In the following, we provide a brief overview of key topics.
10.1 Spin-Lock Algorithms
The spin-lock protocols discussed in Section 4 assume the availability of spin locks with certain “real-time-friendly” properties (e.g., FIFO-ordered or priority-ordered locks). Spin-lock algorithms widely used in practice include MS:91’s scalable MCS queue locks [MS:91], simple ticket locks [MS:91, L:74], and basic TAS locks, where the former two are instances of FIFO-ordered spin locks, and the latter is an unordered lock (i.e., it is not “real-time-friendly,” but easy to implement and still analyzable [WB:13a]). These lock types are well known, not specific to real-time systems, and covered by excellent prior surveys on shared-memory synchronization [R:86, AKH:03, S:13]. We therefore focus here on spin-lock algorithms designed specifically for use in real-time systems.
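For concreteness, the FIFO ordering provided by a ticket lock can be captured by a minimal single-threaded Python model (illustrative only; a real implementation relies on an atomic fetch-and-increment instruction and busy-waiting on the serving counter):

```python
class TicketLock:
    """Single-threaded model of ticket-lock state transitions; real
    implementations use an atomic fetch-and-increment for take_ticket()."""

    def __init__(self):
        self.next_ticket = 0   # next ticket to hand out
        self.now_serving = 0   # ticket currently allowed to enter

    def take_ticket(self):
        t = self.next_ticket   # atomic fetch-and-increment in hardware
        self.next_ticket += 1
        return t

    def may_enter(self, ticket):
        # a waiter spins until its ticket comes up, yielding FIFO order
        return self.now_serving == ticket

    def release(self):
        self.now_serving += 1  # hand the lock to the next ticket holder
```

The model makes the “real-time-friendly” property explicit: requests are served strictly in the order in which tickets were drawn, which is what enables the per-request blocking bounds assumed in Section 4.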
The most prominent examples in this category are priority-ordered spin locks, which are only rarely (if ever) used in general-purpose systems. The first such locks are due to ML:91 [M:91, ML:91, M:94], who offered a clear specification of “priority-ordered spin locks” and proposed two algorithms that extend two prior FIFO-ordered spin locks, due to B:78 [B:78] and MS:91 [MS:91], respectively, to respect request priorities.
Several authors continued this line of research and proposed refined priority-ordered spin locks in subsequent years. In particular, C:93 [C:93] proposed several scalable FIFO- and priority-ordered queue lock algorithms. C:93 also presented several extensions of the basic algorithms that add support for timeouts, preemptable spinning, and memory-efficient lock nesting (i.e., without requiring a separate queue element for each lock) [C:93]. TS:94 [TS:94] similarly proposed a scheme that allows spinning jobs to be preempted briefly by interrupt service routines, with the goal of ensuring low interrupt latencies in the kernel of a multiprocessor RTOS.
WTS:96 [WTS:96] considered nested priority-ordered spin locks and observed that they can give rise to starvation effects that ultimately lead to unbounded priority inversion. Specifically, they identified the following scenario: when a high-priority job J_H is trying to acquire a lock ℓ_1 that is held by a lower-priority job J_L, and J_L is in turn trying to acquire a (nested) lock ℓ_2 that is continuously used by (at least two) middle-priority jobs (in alternating fashion) located on other processors, then J_L (and implicitly J_H) may remain indefinitely blocked on ℓ_2 (respectively, on ℓ_1). To overcome this issue, WTS:96 proposed two spin-lock algorithms that incorporate priority inheritance. The first algorithm—based on ML:91’s algorithm [M:91, ML:91, M:94]—is simpler; however, it is not scalable (i.e., it is not a local-spin algorithm). WTS:96’s second algorithm restores the local-spin property [WTS:96].
To improve overhead predictability, JH:97 [JH:97] proposed a priority-ordered spin lock that, in contrast to earlier algorithms, ensures that critical sections can be exited in constant time. To this end, JH:97’s algorithm maintains a pointer to the highest-priority pending request, which eliminates the need to search the list of pending requests when a lock is released.
Finally, and much more recently, HJ:16 [HJ:16] proposed a strengthened definition of “priority-ordered spin locks” that forbids races among simultaneously issued requests of different priorities and presented an algorithm that satisfies this stricter specification.
Concerning FIFO-ordered spin locks that support preemptable spinning, as assumed in Section 4.1.2, several authors have proposed suitable algorithms [C:93, TS:94, KWS:97, AJJ:98]. Furthermore, in their proposal of the SPEPP approach (which also relies on preemptable spinning, as discussed in Section 4.1.4), TS:97 [TS:97] provided two implementation blueprints, one based on MCS locks [MS:91] and one based on TAS locks. Notably, even the implementation based on TAS locks ensures FIFO-ordered execution of critical sections because all posted operations (i.e., closures) are processed in the order in which they were enqueued (though not necessarily by the processor that enqueued them) [TS:97].
With regard to RW locks, MS:91a provided the canonical implementation of task-fair (i.e., FIFO) RW locks [MS:91a] as an extension of their MCS queue locks [MS:91]. Several practical phase-fair RW lock implementations were proposed and evaluated by BA:09 [BA:09, BA:10a, B:11]. BJ:11 subsequently gave a stricter specification of “phase fairness” and proposed a matching lock algorithm [BJ:11].
Finally, while not aimed specifically at real-time systems, it is worth pointing out recent work by DH:16 [DH:16], which aims to circumvent the lock-holder preemption problem without resorting to non-preemptive sections or heavy-weight progress mechanisms by leveraging emerging hardware support for transactional memory (HTM). With a sufficiently powerful HTM implementation, it is possible to encapsulate entire critical sections pertaining to shared data structures (but not I/O devices) in an HTM transaction, which allows preempted critical sections to be simply aborted and any changes to the shared resource to be rolled back automatically. As a result, lock holders can be preempted without the risk of delaying remote tasks. However, HTM support is not yet widespread in the processor platforms typically used in real-time systems, and it remains to be seen whether it will become a de facto standard in future multicore processors for embedded systems.
In work on predictable multicore processors, SS:18 [SS:18] presented a pure hardware implementation of a predictable and analysis-friendly spin lock with round-robin semantics and non-preemptable spinning. Interestingly, while a round-robin lock does not ensure FIFO ordering of requests, round-robin access is much simpler and can be realized much more efficiently in hardware, and nonetheless provides worst-case guarantees identical to those of a FIFO spin lock (at most m − 1 blocking critical sections per request, where m denotes the number of processors). In SS:18’s implementation, uncontested lock acquisitions take only two clock cycles and lock release operations proceed in a single clock cycle [SS:18]. In another recent proposal of a processor-integrated hardware synchronization facility, MHKS:19 [MHKS:19] presented an on-chip scratchpad memory with support for time-predictable atomic operations, which can be used to implement spin locks in an efficient and predictable manner amenable to WCET analysis.
B:11 [B:11] discusses how to factor non-preemptive spin-lock overheads into blocking and schedulability analyses. BAGB:17 [BAGB:17] provide an overhead-aware blocking and schedulability analysis for ABBN:14’s FSLM [ABBN:14]. BAGB:17’s analysis [BAGB:17] is based on execution-time inflation, similar to the original analysis of the MSRP [GLD:01], and hence is subject to structural pessimism [WB:13a]. WB:13’s LP-based analysis [WB:13a], which is designed to avoid such structural pessimism, can be used in conjunction with the overhead-accounting techniques proposed by B:11 [B:11].
10.2 Avoiding System Calls
In an operating system with a clear kernel-mode/user-mode separation and protection boundary, the traditional way of implementing critical sections in user mode is to provide lock and unlock system calls. However, system calls typically impose non-negligible overheads (compared to regular or inlined function calls), and hence represent a significant bottleneck.
System-call overhead poses a problem in particular for spin-lock protocols, as one of the primary benefits of spin locks is their lower overheads compared to semaphores. If each critical section requires a system call to indicate the beginning of non-preemptive execution, and another system call to indicate the re-enabling of preemptions, then the overhead advantage is substantially diminished.
To avoid such overheads, LITMUSRT introduced a mechanism [B:11] (in version 2010.1) that allows tasks to communicate non-preemptive sections to the kernel in a way that requires a system call only in the infrequent case of a deferred preemption. The approach works by letting each task share a page of memory, called the task’s control page, with the kernel, similar to the notion of a userspace thread control block (UTCB) found in L4 microkernels. More specifically, to enter a non-preemptive section, a task simply sets a flag in its control page, which it clears upon exiting the non-preemptive section. To indicate a deferred preemption, the kernel sets another flag in the control page. At the end of each non-preemptive section, a task checks the deferred preemption flag, and if set, triggers the scheduler (e.g., via the sched_yield() system call).
To prevent runaway tasks or attackers from bringing down the system, the kernel can simply stop honoring a task’s non-preemptive section flag if the task fails to call sched_yield() within a pre-determined time limit [B:11], which makes the mechanism safe to use even if user-space tasks are not trusted. The control-page mechanism thus allows spin locks to be implemented efficiently in userspace, requiring no kernel intervention even when inter-core lock contention occurs.
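The interaction between task and kernel can be summarized in a few lines of Python (a behavioral sketch with hypothetical field names, not the actual LITMUSRT data layout):

```python
class ControlPage:
    """Models the per-task memory page shared between task and kernel."""
    def __init__(self):
        self.np_flag = False             # set by task: inside NP section
        self.preemption_pending = False  # set by kernel: preemption deferred

def kernel_try_preempt(page):
    """Kernel side: preempt immediately, or defer if task is non-preemptive."""
    if page.np_flag:
        page.preemption_pending = True   # record the deferred preemption
        return False                     # do not preempt now
    return True                          # preempt immediately

def task_exit_np_section(page):
    """Task side: clear the flag; yield only if a preemption was deferred."""
    page.np_flag = False
    if page.preemption_pending:
        page.preemption_pending = False
        return "sched_yield()"           # the only case requiring a syscall
    return None                          # common case: no kernel involvement
```

Note that in the common case (no preemption attempt during the non-preemptive section) the task performs only two flag writes and one flag read, with no system call at all.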
A similar problem exists with semaphores in user mode. However, since semaphores realize blocking by suspending tasks, the kernel is necessarily involved in the worst case (i.e., whenever contention is encountered). Nonetheless, avoiding system calls in user-mode semaphores remains an important average-case optimization. Specifically, since lock contention is rare in well-designed systems, avoiding system calls for uncontested lock and release operations (i.e., in the common case) is key to maximizing throughput in applications with a high frequency of critical sections.
Semaphore implementations that do not involve the kernel in the absence of contention are commonly called futexes (fast userspace mutexes), a name popularized by the implementation in Linux. From a real-time perspective, the main challenge in realizing futexes is maintaining a protocol’s predictability guarantees (i.e., to avoid invalidating known worst-case blocking bounds). With regard to this problem, SVBD:14 [SVBD:14] distinguish between reactive and anticipatory progress mechanisms [SVBD:14], where the former take effect only when contention is encountered, whereas the latter conceptually require actions even before a conflicting lock request is issued. For instance, priority inheritance is a reactive progress mechanism, whereas priority boosting is an anticipatory progress mechanism since a job’s priority is raised unconditionally whenever it acquires a shared resource.
It is easy to combine futexes with reactive mechanisms since the kernel is involved anyway in the case of contention (to suspend the blocking task). In contrast, anticipatory protocols are more difficult to support since the protocol’s unconditional actions must somehow be realized without invoking the kernel in the uncontended case. Possibly for this reason, Linux supports priority-inheritance futexes, but currently does not offer a futex implementation of ceiling protocols.
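To make the distinction concrete, the following single-threaded Python sketch (hypothetical names; the real Linux futex interface differs in many details) models a reactive futex-style lock in which the lock word is manipulated in user space and the kernel is entered only on contention:

```python
UNLOCKED, LOCKED, CONTENDED = 0, 1, 2

class FutexSketch:
    """Models the futex fast path: the lock word lives in user space and
    the kernel is entered only when contention is encountered."""

    def __init__(self):
        self.word = UNLOCKED
        self.kernel_entries = 0       # counts (modeled) system calls

    def lock(self):
        if self.word == UNLOCKED:     # models an atomic compare-and-swap
            self.word = LOCKED        # fast path: no system call
        else:
            self.word = CONTENDED     # mark that a waiter exists
            self.kernel_entries += 1  # slow path: futex_wait-style syscall

    def unlock(self):
        contended = (self.word == CONTENDED)
        self.word = UNLOCKED
        if contended:
            self.kernel_entries += 1  # must wake a waiter via the kernel
```

An anticipatory mechanism such as priority boosting would, if implemented naively, require a kernel entry in lock() even in the uncontended case, which is precisely what the deferred-update techniques discussed below avoid.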
Despite such complications, it is fortunately still possible to realize many anticipatory protocols as futexes by deferring task state updates until the kernel is invoked anyway for some other reason (e.g., a preemption due to the release of a higher-priority job), as has been shown by a number of authors [ZBK:14, Z:13, AAB:15, SVBD:14].
In a uniprocessor context, ZBK:14 [Z:13, ZBK:14, ZK:18, ZK:19] considered how to implement predictable real-time futexes in an efficient and certifiable way in the context of a high-assurance, resource-partitioned separation kernel. Their approach is also relevant in a multiprocessor context because it allows for an efficient, futex-compatible implementation of priority boosting under partitioned scheduling by means of deferred priority changes [ZBK:14]. AAB:15 [AAB:15] later explored similar protocols for uniprocessor FP and EDF scheduling and verified their correctness with a model checker.
Targeting multiprocessor systems, SVBD:14 [SVBD:14] systematically explored the aforementioned classes of reactive and anticipatory real-time locking protocols, and concretely proposed real-time futex implementations of the PCP [SRL:90], the MPCP [R:90], and the partitioned FMLP+ [B:11], which were shown to be highly efficient in practice.
10.3 Implementations of Allocation Inheritance
Allocation inheritance is the progress mechanism that is the most difficult to support on multiprocessors, in particular when realized as task migration, since it implies dynamic and rapid changes in the set of processors on which a lock-holding job is eligible to execute. While this results in nontrivial synchronization challenges within the scheduler, allocation inheritance has been implemented and shown to be practical in several systems.
As already mentioned in Sections 4.1.4 and 10.1, TS:97 [TS:97] provided efficient implementations of the allocation inheritance principle that avoid task migrations by expressing critical sections as closures. However, TS:97’s algorithms still require critical sections to be executed non-preemptively (i.e., they are not independence-preserving).
Concerning realizations of allocation inheritance that allow tasks to remain fully preemptable at all times, SBK:10 [SBK:10] describe an elegant way of implementing allocation inheritance on uniprocessors and mention that their implementation extends to multiprocessors (but do not provide any details). HP:01 [HP:01] discuss implementation and design choices in a multiprocessor context, but do not report on implementation details. Both SBK:10 and HP:01 consider microkernel systems, which are particularly well-suited to allocation inheritance due to their minimalistic kernel environment and emphasis on a clean separation of concerns.
In work on more complex monolithic kernels, BB:12 [BB:12] discuss a prototype implementation in Linux. Allocation inheritance has also been realized several times in the Linux-based LITMUSRT: by FLC:12 when implementing the spin-based MBWI protocol [FLC:12], by B:13 for the suspension-based OMIP [B:13] and MC-IPC protocols [B:14], and by CBHM:15 for the spin-based MrsP [CBHM:15].
CBHM:15 also presented an implementation of the MrsP and allocation inheritance in RTEMS, a static real-time OS without a kernel-mode / user-mode divide targeting embedded multiprocessor platforms, and compared and contrasted the two implementations in LITMUSRT and RTEMS [CBHM:15].
10.4 RTOS and Programming Language Integration
Over the years, a number of authors have explored how to best integrate real-time locking protocols into RTOSs and popular programming languages, to what extent existing theory meets the needs of real systems, and which techniques enable efficient implementations of real-time locking protocols. In the following, we provide a high-level overview of some of the considered directions and questions.
Criticism of programming language synchronization facilities from the perspective of multiprocessor real-time predictability dates all the way back to 1981, when REMC:81 [REMC:81] reviewed the then-nascent ADA standard. Interestingly, REMC:81 argued already then in favor of introducing spin-based synchronization (rather than exclusively relying on suspension-based methods) to avoid scheduler invocations [REMC:81].
More than 20 years later, N:05 [N:05] considered undesirable blocking effects on multiprocessors due to ADA 95’s protected actions. Specifically, N:05 identified that, if low-priority tasks spread across several processors issue a continuous stream of requests for a certain type of ADA operations (namely, entries protected by barriers) to be carried out on a protected object currently locked by a higher-priority task, then, according to the language standard, these operations could potentially all be serviced by the higher-priority task in its exit path (i.e., when trying to release the protected object’s lock) [N:05], which theoretically can lead to unbounded delays. As an aside, TS:97’s SPEPP approach [TS:97] offers an elegant solution to this particular problem since it is starvation-free. Similarly, the MrsP [BW:13a] could be applied in this context, as suggested by BW:13 [BW:13] in their investigation of protected objects in ADA 2012.
Even today, predictable multiprocessor synchronization in ADA remains a point of discussion. LWB:13a [LWB:13a, LWB:13, L:13] revisited the support for analytically sound multiprocessor real-time synchronization in ADA 2012 and still found it to be wanting. At the same time, they also found the multiprocessor real-time locking protocols available in the literature unsatisfactory, in the sense that there is no clear “best” protocol that could be included in the standard to the exclusion of all others. To resolve this mismatch in needs and capabilities, LWB:13a argued in favor of letting programmers provide their own locking protocols, so that each application may be equipped with a protocol most suitable for its needs, and presented a flexible framework for this purpose as well as a number of reference implementations of well-known protocols on top of the proposed framework [LWB:13a, LWB:13, L:13].
Most recently, GZAJ:17 [GZAJ:17] investigated the question of predictable multiprocessor real-time locking within the constraints of the ADA Ravenscar profile [B:99] for safety-critical hard real-time systems. In particular, they compared implementations of the MSRP [GLD:01] (based on non-preemptive sections) and the MrsP [BW:13a] (based on allocation inheritance), and found that the simpler MSRP is preferable in the restricted Ravenscar context, whereas the MrsP is suitable for use in a general, full-scope ADA system.
In work on other programming environments and languages, ZC:04 [ZC:04, ZC:06] investigated a range of multiprocessor real-time locking protocols in the context of a CORBA middleware, and SPS:17 [SS:15, SPS:17] proposed and evaluated hardware implementations of real-time synchronization primitives in a native Java processor for embedded safety-critical systems. Also targeting Java, WLB:11 [WLB:11] studied the multiprocessor real-time locking problem from the point of view of the needs and requirements of the Real-Time Java (RTSJ) and Safety-Critical Java (SCJ) specifications, and found a considerable gap between the (restrictive) assumptions underlying the (at the time) state-of-the-art real-time locking protocols and the broad flexibility afforded by the RTSJ and, to a lesser degree, SCJ specifications. As a step towards closing this gap, WLB:11 [WLB:11] suggested changes to the RTSJ and SCJ specifications that would ease a future integration of analytically sound multiprocessor real-time locking protocols.
The first to discuss in detail the implementation of a multiprocessor real-time locking protocol in an actual RTOS were SZ:92 [Z:92, SZ:92], who proposed a real-time threads package, including support for predictable synchronization as discussed in Section 5.2.3, for use on top of the Mach microkernel, which has since been superseded by later generations of microkernels (e.g., the L4 family).
TS:96 considered the design of a multiprocessor RTOS in light of the scalability of worst-case behavior [TS:96, T:96]. Among other techniques, they proposed a scheme called local preference locks [TS:96], where resources local to a particular processor are protected with priority-ordered spin locks, but request priorities do not depend on task priorities. Instead, the local processor accesses a local preference lock with higher priority than remote processors, which ensures that processors quickly gain access to local resources (i.e., with O(1) s-blocking) even if they are shared with multiple remote processors.
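The effect can be modeled with a simple priority queue in which the home processor always outranks remote requesters (a single-threaded Python sketch with hypothetical names; ties among equal priorities are broken in FIFO order):

```python
import heapq
import itertools

class LocalPreferenceLock:
    """Priority-ordered lock in which the local (home) processor always
    outranks remote processors, regardless of task priorities."""

    def __init__(self, home_cpu):
        self.home_cpu = home_cpu
        self._seq = itertools.count()   # FIFO tie-breaker
        self._waiting = []              # min-heap of (priority, seq, cpu)

    def request(self, cpu):
        prio = 0 if cpu == self.home_cpu else 1   # 0 = preferred
        heapq.heappush(self._waiting, (prio, next(self._seq), cpu))

    def grant_next(self):
        # the home processor, if waiting, is always served first
        return heapq.heappop(self._waiting)[2]
```

Because the home processor always wins, a local request is delayed by at most the critical section currently in progress, independently of how many remote processors share the resource.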
In work throughout the past decade, many locking protocols have been implemented and evaluated in LITMUSRT [B:11, E:15]. Already in 2008, BA:08a [BA:08a] provided a detailed discussion of the implementations of several locking protocols in LITMUSRT, including the FMLP and the MPCP.
Several authors have reported on MPCP implementations. In work targeting Linux with the PREEMPT_RT patch, CDF:14 [CDF:14] provided details on in-kernel implementations of the FMLP [BLBA:07] and a non-preemptive MPCP variant [CO:12a]. A particularly low-overhead implementation of the MPCP for micro-controllers that avoids the need for expensive wait-queue manipulations was proposed by MDPL:14 [MDPL:14]. Targeting a very different system architecture, IYY:17 implemented the MPCP on top of a CAN bus to realize mutual exclusion in a distributed shared memory (DSM) [IYY:17].
In recent overhead-oriented work, NAA:17 [NAA:17, NAA:19] added a fastpath to the RNLP [WA:12] to optimize for the common case of non-nested lock acquisitions, and AVGB:16 [AVGB:16] presented an implementation of spin locks with flexible spin priorities [ABBN:14]. Motivated by the fact that the contention-sensitive C-RNLP [JWA:15] exhibits relatively high acquisition and release overheads due to its complex request-sequencing rules, NAA:18 [NAA:18] went a significant step further and introduced a novel and rather unconventional approach to implementing locking protocols. Specifically, NAA:18 centralized the acquisition and release logic in dedicated lock servers that run the locking protocol in a cache-hot manner. Importantly, in contrast to the centralized semaphore protocols discussed in Section 6, a lock server does not actually centralize the execution of critical sections; rather, it centralizes only the execution of the locking protocol itself, and still lets tasks execute their actual critical sections in-place. In other words, in NAA:18’s design, a lock server decides in which order critical sections are executed, but does not execute any critical sections of tasks. As a result of the cache-hot, highly optimized implementation of the contention-sensitive RNLP, NAA:18 were able to demonstrate a reduction in acquisition and release overheads by over 80% [NAA:18].
To date, multiprocessor real-time locking protocols have received scant attention from a WCET analysis perspective, with G:13’s thesis [G:13] and recent proposals for synchronization support in time-predictable multicore platforms [SPS:17, SS:18, MHKS:19] being notable exceptions.
Last but not least, GAB:16 [GAB:16] recently reported on a verified implementation of priority inheritance with support for nested critical sections in the RTEMS operating system.
11 Conclusion, Further Directions, and Open Issues
Predictable synchronization is one of the central needs in a multiprocessor real-time system, and it is thus not surprising that multiprocessor real-time locking protocols, despite having received considerable attention already in the past, are still a subject of ongoing research. In fact, the field has seen renewed and growing interest in recent years due to the emergence and proliferation of multicore processors as the de facto standard computing platform. Looking back at its history over the course of the past three decades—starting with RSL:88’s pioneering results [RSL:88, R:90, R:91]—it is fair to say that the community has gained a deeper understanding of the multiprocessor real-time locking problem and amassed a substantial body of relevant knowledge. In this survey, we have attempted to systematically document and structure a current snapshot of this knowledge and the relevant literature, in hopes of making it more easily accessible to researchers and practitioners alike.
11.1 Further Research Directions
In addition to the topics discussed in the preceding sections, there are many further research directions related to multiprocessor real-time locking protocols that have been explored in the past. While these topics are beyond the scope of this already long survey, we do mention a few representative publications to provide interested readers with starting points for further exploration.
One important resource in multiprocessor real-time systems that we have excluded from consideration herein is energy. Unsurprisingly, energy management policies, in particular schemes that change processor speeds, can have a significant impact on synchronization [CCK:08, HWZJ:12, FTCS:13, TFCY:16, W:17]. Another architectural aspect that can interact negatively with multiprocessor real-time locking protocols is simultaneous multithreading (SMT) [L:06].
Targeting high-integrity systems subject to both real-time and security requirements, VEHH:13 [VEHH:13] studied a number of real-time locking protocols (including the MPCP [R:90] and the clustered OMLP [BA:11]) from a timing-channel perspective and identified confidentiality-preserving progress mechanisms and locking protocols that prevent shared resources from being repurposed as covert timing channels.
A number of authors have studied multiprocessor real-time synchronization problems from an optimality perspective and have obtained speed-up and resource augmentation results for a number of protocols and heuristics [AE:10, HYC:16, RAB:11, RNA:12, R:13, AR:14, HCR:16, BCHY:17, CBSU:18]. These results are largely based on rather limiting assumptions (e.g., only a single critical section per job), and in several instances pertain to protocols purposefully designed to obtain a speed-up or resource augmentation result, which has limited practical relevance [CBHD:17].
Taking a look at synchronization in real-time systems from a foundational perspective, LS:92 [LS:92] established lower and upper bounds on the number of atomic registers required to realize mutual exclusion with deadlock avoidance.
Targeting periodic workloads in which each task has at most one critical section, SUBC:19 [SUBC:19] explored an unconventional synchronization approach in which an explicit dependency graph of all jobs and critical sections in a hyperperiod is built a priori and all critical sections are sequenced offline with a list scheduling heuristic.
All of the protocols discussed in this survey assume sequential tasks. Going beyond this standard assumption, HBL:12 [HBL:12] studied the multiprocessor real-time locking problem in the context of parallel real-time tasks [H:12, HBL:12]. Exploring a similar direction, DLAG:17 [DLAG:17] proposed an analysis of parallel tasks using spin locks in the context of federated scheduling. Most recently, JGLY:19 [JGLY:19] presented an analysis of parallel tasks using semaphores under the same scheduling assumptions.
Over the years, many results pertaining to the task and resource mapping problems as well as related optimization problems have appeared [TL:94, SBL:94, LDG:04, NBN:09, NNB:10, FM:10, RAB:11, FM:11, HLK:11, N:12, HWZJ:12, WB:13, HZWY:14, SRM:14, BMV:14, ASZD:15, HYC:16, HIS:17, BCHY:17, HTZY:17, DLBC:18, CLYL:19]. Particularly well-known is LNR:09’s task-set partitioning heuristic for use with the MPCP [LNR:09]. Alternative heuristics and strategies have been proposed by (among others) NNB:10 [NNB:10], WB:13 [WB:13], and ASZD:15 [ASZD:15]. Most recently, DLBC:18 [DLBC:18] presented a detailed study of several task- and resource-assignment heuristics following a resource-centric schedulability analysis approach [HYC:16, HCR:16] in the context of a semaphore protocol with centralized critical sections, and CLYL:19 [CLYL:19] presented an ILP-based partitioning solution for the same workload and platform model. Techniques for scheduling task graphs with precedence constraints and end-to-end deadlines in distributed systems [TL:94, SBL:94] are particularly relevant in the context of the DPCP and its related task and resource mapping problems.
Another system configuration aspect that has received considerable attention is the policy selection problem, where the goal is to choose an appropriate synchronization method for a given set of tasks, a set of shared resources, and the tasks’ resource needs [BNNG:11, HLLW:12, HZDL:14, AKNN:15, ASZD:15, BB:16, BCBL:08]. BCBL:08 [BCBL:08] compared spin- and suspension-based locking protocols (namely, the FMLP variants for short and long resources [BLBA:07]) with each other, and also against non-blocking synchronization approaches, in LITMUSRT under consideration of overheads. The choice between spin-based locking protocols and non-blocking alternatives has also recently been considered by ASZD:15 [ASZD:15] and BB:16 [BB:16]. S:03 proposed to reduce the impact of remote blocking by combining locks with a versioning mechanism to reduce critical section lengths [S:03].
In work on the consolidation and integration of legacy systems on multicore platforms, ABN:13 [ABN:12, ABN:13], NN:13 [NN:13], and NBN:11 [NBN:11] explored the use of abstract interfaces that make it possible to represent a component’s resource needs and locking behavior without revealing detailed information about the component’s internals.
Finally, several authors have considered synchronization needs in hierarchical multiprocessor real-time systems (i.e., systems in which there is a hierarchy of schedulers) [NBNB:09, NBN:09, KWR:14, BBB:15, AKBB:16, ABBN:15a, ABBN:15]. In such systems, tasks are typically encapsulated in processor reservations, resource servers, or, in the case of virtualization, virtual machines, and are thus prone to preemptions in the middle of critical sections. As first studied by HA:02 [HA:02] in the context of Pfair-scheduled systems, this poses considerable challenges from a locking point of view [HA:02, HA:02a, H:04, HA:06], and calls for either special rules that prevent lock acquisitions shortly before a job’s or reservation’s (current) budget allocation is exhausted [HA:02], or acceptance of (and appropriate accounting for) the fact that jobs or reservations may overrun their allocated budget by the length of one critical section [BBB:15].
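The first of these two approaches, deferring lock requests that no longer fit into the current budget, can be sketched as follows. This is a minimal illustration in the spirit of the zone-based rules first proposed by HA:02 [HA:02], not their actual definition; all function and parameter names are hypothetical.

```python
# Illustrative sketch (hypothetical API, not the protocol of [HA:02]):
# a reservation may enter a critical section only if its remaining budget
# covers the worst-case critical-section length; otherwise the acquisition
# is deferred to the next budget replenishment.

def may_acquire(remaining_budget: float, wc_cs_length: float) -> bool:
    """Zone check: allow a lock acquisition only if the critical section
    is guaranteed to complete before the current budget is exhausted."""
    return remaining_budget >= wc_cs_length

def acquisition_time(now: float, remaining_budget: float,
                     next_replenishment: float, wc_cs_length: float) -> float:
    """Return the earliest time at which the lock request may proceed."""
    if may_acquire(remaining_budget, wc_cs_length):
        return now               # safe to enter the critical section immediately
    return next_replenishment    # defer the request past the budget boundary
```

The alternative approach [BBB:15] needs no such run-time check, but instead charges each reservation for a possible budget overrun of one critical-section length at analysis time.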
11.2 Open Problems
As already mentioned in Section 1, the “last word” on multiprocessor real-time resource sharing has not yet been spoken, and will likely not be spoken for a long time to come. Without seeking to detract from other interesting directions, we briefly highlight three largely unexplored opportunities for future work.
First, there is a need for more flexible blocking analyses that can handle multiple lock types simultaneously. Practical systems typically use multiple types of locks for different purposes (e.g., both spin locks and semaphores), but while many lock types and real-time locking protocols have been investigated and analyzed in isolation, few results in the literature explicitly account for effects that arise from the combination of different lock types (e.g., blocking bounds for semaphores in the presence of non-preemptive sections due to spin locks [BLBA:07, WA:13]). Worse, few (if any) of the existing analyses, each focused on an individual lock type or protocol, compose soundly without modification. Further advances in this direction will clearly be required to better support the needs of real-world systems.
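To illustrate the kind of interaction term a sound combined analysis must capture, consider the following coarse sketch (purely illustrative; the parameters and the bound itself are hypothetical and not taken from any published analysis): a job that suspends on a semaphore may, upon each resumption, additionally be delayed by one non-preemptive spin section of a local lower-priority job, a term that a naive sum of the two isolated bounds misses.

```python
# Hypothetical, coarse composition of two per-protocol blocking bounds.
# spin_bound:      blocking bound for the job's own spin-lock requests
# sem_bound:       blocking bound for the job's semaphore requests
# num_suspensions: maximum number of semaphore-induced suspensions
# max_np_section:  length of the longest non-preemptive (spin) section
#                  of any local lower-priority job

def composed_blocking_bound(spin_bound, sem_bound,
                            num_suspensions, max_np_section):
    # The two isolated bounds, plus the interaction term: each resumption
    # after a suspension may be delayed by one non-preemptive spin section.
    return spin_bound + sem_bound + num_suspensions * max_np_section
```

A job with a spin-blocking bound of 4, a semaphore-blocking bound of 10, and two suspensions in the presence of non-preemptive sections of length 3 would thus be charged 20 time units rather than the naive 14.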
Second, all major existing blocking analyses pertain to the worst case. While this is clearly required for true hard real-time systems, given that average-case contention in well-designed systems is usually low, the resulting bounds can seem extremely pessimistic relative to observable blocking delays. For firm real-time systems that do not require hard guarantees, or for systems where there are strong economic incentives not to provision on an absolute worst-case basis (which is arguably the majority of systems in practice), there is unfortunately little support in the existing literature. Thus, practitioners today must choose between pessimistic hard-real-time blocking bounds that tend to result in over-provisioning and no analysis at all (i.e., relying purely on measurements instead). To extend the range of systems to which analytically sound blocking bounds are applicable, we will need means for reasoning about anticipated blocking delays that are both more rigorous than average-case observations and less taxing than hard-real-time analyses based on worst-case assumptions at every step.
Last but not least, we would like to highlight the need for a rigorous foundation and formal proofs of correctness for blocking analyses. In particular for worst-case blocking analyses, which by their very nature are intended to be used in critical systems, it is essential to have utmost confidence in the soundness of the derived bounds. However, as blocking bounds become more accurate, task models more detailed, and synchronization techniques more advanced, the required blocking analyses also become more tedious to derive, more challenging to validate, and ultimately more error-prone. If the goal is to support safety-critical systems in practice, and to use multiprocessor real-time locking protocols and their analyses as evidence of system safety in certification processes, then this is a very dangerous trend, in particular in the light of prior missteps that have only recently come to light [YCH:17, CB:17, CNHY:19, GZBW:17]. As a first step towards a higher degree of confidence in the correctness of advanced blocking analyses, recent LP- and MILP-based blocking analyses [B:13, WB:13a, BB:16, YWB:15] offer the advantage that each constraint can be checked and proven correct individually (rather than having to reason about the entire analysis as a whole), which simplifies the problem considerably. However, while this is a much needed improvement, it is clearly not yet enough. In the long term, it will be desirable (if not outright required at some point) for analyses and protocols intended for use in safety-critical systems—such as blocking bounds for multiprocessor real-time locking protocols—to come with a machine-checked proof of soundness, or other equivalent soundness guarantees backed by formal verification methods. Much interesting and challenging work remains to be done before a full formal verification of a multiprocessor real-time locking protocol and its timing properties can become reality.
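The appeal of constraint-wise reasoning can be seen in the following toy example, a hypothetical brute-force stand-in for a real MILP-based analysis in the style of [B:13, WB:13a] (all names and constraints are illustrative, not drawn from any published formulation): a binary variable indicates for each remote critical section whether it delays the job under analysis, the objective maximizes total blocking, and each constraint, such as a FIFO rule bounding how often a given remote task can block per request, can be stated and checked for soundness in isolation.

```python
# Toy MILP-style blocking analysis (illustrative only). For a realistic
# number of variables one would use an (M)ILP solver; here a brute-force
# search over the binary variables suffices.
from itertools import product

def max_blocking(cs_lengths, cs_task, num_requests):
    """Maximize sum of x[i] * cs_lengths[i] subject to the (hypothetical)
    FIFO constraint that each remote task blocks the job under analysis
    at most once per request the job issues."""
    best = 0
    for x in product((0, 1), repeat=len(cs_lengths)):
        # Constraint (checkable in isolation): per remote task, the number
        # of blocking critical sections is at most num_requests.
        per_task = {}
        for xi, task in zip(x, cs_task):
            per_task[task] = per_task.get(task, 0) + xi
        if any(count > num_requests for count in per_task.values()):
            continue
        best = max(best, sum(xi * l for xi, l in zip(x, cs_lengths)))
    return best
```

For instance, with critical sections of lengths 5 and 3 belonging to remote task A and length 2 belonging to remote task B, a single request can be blocked by at most one section per task, yielding a bound of 7; the correctness of the per-task constraint can be argued without reference to any other part of the formulation.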