Virtual Gang based Scheduling of Real-Time Tasks on Multicore Platforms

12/23/2019 ∙ by Waqar Ali, et al. ∙ The University of Kansas

We propose a virtual-gang based parallel real-time task scheduling approach for multicore platforms. Our approach is based on the notion of a virtual-gang, which is a group of parallel real-time tasks that are statically linked and scheduled together by a gang scheduler. We present a light-weight intra-gang synchronization framework, called RTG-Sync, and virtual gang formation algorithms that provide strong temporal isolation and high real-time schedulability when scheduling real-time tasks on multicore platforms. We evaluate our approach both analytically, with generated tasksets against state-of-the-art approaches, and empirically, with a case-study involving real-world workloads on a real embedded multicore platform. The results show that our approach provides a simple but powerful compositional analysis framework, achieves better analytic schedulability, especially when the effect of interference is considered, and is a practical solution for COTS multicore platforms.


I Introduction

High-performance multi-core based embedded computing platforms are increasingly being used in safety-critical real-time applications, such as avionics, robotics and autonomous vehicles. This is fueled by the need to efficiently process computationally demanding real-time workloads (e.g., AI and vision). For such workloads, multicore platforms provide ample opportunity for speedup by allowing these workloads to utilize multiple cores. However, the use of multicore platforms in safety-critical real-time applications brings significant challenges due to the difficulties encountered in guaranteeing predictable timing on these platforms.

In a multicore platform, tasks running concurrently can experience high timing variations due to contention for shared hardware resources. The effect of contention depends heavily on the underlying hardware architecture, which is generally optimized for average performance and thus often shows extremely poor worst-case behavior [5]. Furthermore, which tasks are co-scheduled at any time instant depends on the OS scheduler’s decisions and may vary over time. Therefore, to estimate a task’s worst-case execution time (WCET) through empirical measurements, which is a common industry practice, one might have to explore all feasible co-schedules of the entire taskset under a chosen OS scheduling policy. Any slight change in the schedule, or in any of the tasks, can ripple through the observed task execution times. In other words, the timing of a task is coupled with the rest of the tasks, the OS scheduling policy, and the underlying hardware. For this reason, in use-cases where hard real-time guarantees are a must, such as avionics, it is recommended to disable all but one core of a multicore processor [7], which obviously defeats the purpose of using multicore in the first place.

A common approach to address co-runner-dependent timing variations is to partition shared hardware resources among the tasks. Although partitioning techniques do improve time predictability, studies show that they are often not sufficient to guarantee tight WCET bounds in modern high-performance multicore architectures because many important shared hardware resources, which have a profound impact on task timing, are unpartitionable [40, 5]. Moreover, partitioning generally reduces efficiency and maximum achievable performance, and it is especially ill suited for parallel tasks. For example, a core-based static cache partitioning strategy [34, 22] can incur a massive performance hit for a parallel task when its threads are allocated on different cache partitions.

Gang scheduling was originally proposed in high-performance computing to maximize the performance of parallel tasks [13], and many real-time variants have been studied in the real-time community [6, 14, 15, 20, 12, 42]. In gang scheduling, threads of a parallel task are scheduled only when there are enough cores to schedule them all at the same time. Therefore, gang scheduling reduces scheduling-induced timing variations and synchronization overhead [42]. However, most prior gang scheduling studies do not consider the co-runner-dependent timing variations due to contention in the shared hardware resources, and instead simply assume that WCETs already account for such effects, which may introduce severe pessimism in their analysis [45].

Recently, a restrictive form of gang scheduling policy was proposed [1] to address the problem of shared resource contention. Unlike other real-time gang scheduling policies, it schedules only one real-time gang task at a time, even if there are enough cores left to accommodate other real-time tasks. (In this paper, by a real-time task, we mean a periodically activated task composed of one or more parallel threads. We also use the terms task and process interchangeably in the rest of the paper. See Section II-A for our task model.) By design, this one-gang-at-a-time approach eliminates co-runner-dependent timing variations among the real-time tasks because only the threads of a single parallel real-time task are scheduled at any time, in which case sharing of the hardware resources is beneficial to the task’s performance. The obvious problem of resource under-utilization is partially mitigated by co-scheduling best-effort tasks on any idle cores, with their memory bandwidth usage strictly regulated, using a memory bandwidth regulation mechanism, to a threshold set by the real-time gang task so that their impact on the real-time gang task is strictly bounded.

However, the approach assumes that there are enough best-effort tasks to fill the idling cores, which might not always be the case. More importantly, it does not solve the problem of reduced schedulability of real-time tasks. Given that parallelization of a task often does not scale well, and that more cores are being integrated into modern multicore processors, the approach is unlikely to be a general solution for many systems. The authors claim that the problem can be mitigated by the notion of a virtual gang, which they define as a group of real-time tasks that are always scheduled together as if they were members of a single parallel task. However, they did not show how such a virtual gang can be created, selected, and scheduled in ways that improve real-time schedulability.

In this paper, we first show that the idea of a virtual gang, as proposed in [1], can actually decrease the real-time schedulability of the system substantially. In fact, we show that, in the worst case, scheduling a virtual gang task is equivalent to serializing all member tasks of the gang, which effectively nullifies any schedulability benefit of co-scheduling. We propose a light-weight intra-gang synchronization framework, which we call RTG-Sync, that addresses the problem by ensuring that all member tasks of a virtual gang are synchronously released. It also provides easy-to-use APIs to create and destroy virtual gangs and their memberships.

RTG-Sync provides the following guarantees to the members of each virtual gang task: (1) all member tasks are statically determined and do not change over time; (2) no other real-time tasks can be co-scheduled; (3) best-effort tasks can be co-scheduled on any idle cores, but their maximum memory bandwidth usage is strictly regulated to a threshold value set by the virtual gang. These properties greatly simplify the process of determining task WCETs because, once a virtual gang is created, tasks that do not belong to the virtual gang cannot interfere with the member tasks, regardless of the OS scheduling policy, and the effect of shared hardware resource contention is strictly bounded. In short, RTG-Sync enables compositional timing analysis on multicore platforms. (Timing analysis of a real-time system is considered compositional if the analysis of a component can be carried out independently of other components.)

We present virtual gang formation algorithms to help create virtual gangs and their member tasks from a given real-time taskset with a goal of maximizing system-level real-time schedulability. Lastly, we describe how a classical single-core schedulability analysis can be applied to analyze the scheduling of parallel real-time tasksets on a multicore platform.

For evaluation, we first present schedulability analysis results with randomly generated parallel real-time tasksets. We then present case-study results conducted on a real multicore platform, demonstrating that the proposed framework achieves higher utilization and time predictability.

In summary, we make the following contributions:

  • We establish, with the help of concrete examples, the requirements for supporting the virtual-gang abstraction in a standard operating system.

  • We present a light-weight synchronization framework, RTG-Sync, for ensuring synchronous release of the member tasks of each virtual-gang. (Our framework will be available as open-source.)

  • We present virtual gang formation algorithms, which create virtual gangs and their member tasks from a given taskset to improve system-level real-time schedulability.

  • We implement our system in a real embedded multicore platform and evaluate our approach both analytically with generated tasksets and empirically with a case-study involving real-world workloads.

The rest of the paper is organized as follows. We provide necessary background in Section II. In Section III, we establish the requirements for supporting virtual-gangs in a gang-scheduling framework with the help of motivating examples. In Sections IV and V, we explain the design of RTG-Sync and the gang formation algorithms respectively. In Section VI, we describe the schedulability analysis results using synthetic tasksets. In Section VII, we present our case-study evaluation results. We discuss related work in Section VIII and conclude in Section IX.

II Background

In this section, we provide necessary background.

II-A Rigid Real-Time Gang Task Model

In this paper, we consider the rigid real-time gang task model [15] with implicit deadlines, and focus on scheduling periodic real-time tasks, denoted by τi, on a multicore platform with m identical cores. In the rigid gang model, each real-time task τi is characterized by three parameters (hi, Ci, Ti), where hi represents the number of cores (threads) required by the gang task to run, Ci is the task’s worst-case execution time (WCET), and Ti represents its period, which is also equal to its deadline. The task model is said to be rigid because the number of cores a task needs is fixed and does not change over time. The rigid gang task model is well suited for multi-threaded parallel applications, often implemented using parallel programming frameworks such as OpenMP [10]. We review other task models in Section VIII.
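To make the model concrete, the three parameters can be captured in a small data structure. The sketch below is illustrative only: the class name is ours, and the utilization helper uses the common h·C/T definition rather than anything defined in this paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RigidGangTask:
    """Rigid gang task tau_i = (h_i, C_i, T_i) with implicit deadline D_i = T_i."""
    h: int      # number of cores (threads) the gang needs simultaneously
    C: float    # worst-case execution time (WCET)
    T: float    # period, equal to the deadline

    def utilization(self) -> float:
        # Total utilization across the h cores the task occupies
        # (a common definition; not taken from the paper).
        return self.h * self.C / self.T

tau = RigidGangTask(h=2, C=3.0, T=10.0)
print(tau.utilization())  # 0.6
```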

II-B Priority-based Real-Time Gang Scheduling Algorithms

Gang FTP [15] is a fixed-priority gang scheduling algorithm, which schedules rigid and periodic real-time gang tasks as follows: at each scheduling event, the algorithm schedules the highest priority task among the active (ready) tasks on the first hi available cores (if they exist). The process repeats for the remainder of the active tasks on the remaining available cores. The Gang EDF scheduler [23] works similarly, except that a task’s priority is not fixed but may change dynamically based on the current deadlines of the tasks (a task with a more imminent deadline is given higher priority).

II-C RT-Gang

RT-Gang is a recently proposed open-source real-time gang scheduler, which implements a restrictive form of the Gang FTP scheduling policy in the Linux kernel [1]. The gang scheduling policy is restrictive in the sense that at most one real-time task—which may be composed of one or more parallel threads—can be scheduled at a time. When a real-time task is released, all of its threads are scheduled simultaneously on the available cores—if it is the highest priority real-time task—or none at all if a higher priority real-time task is currently in execution.

Fig. 1: Illustration of the “one-gang-at-a-time” scheduling policy.

Figure 1 compares the “one-gang-at-a-time” scheduling policy of RT-Gang against Gang-FTP with a simple example, in which two real-time tasks τ1 and τ2 are scheduled on a multicore platform. Under Gang-FTP, τ1 and τ2 can be co-scheduled because their combined core requirement is equal to the total number of system cores. Under RT-Gang, such co-scheduling is not possible. All threads of the lower priority task are simultaneously preempted when the higher priority task arrives because real-time tasks must execute on a one-at-a-time basis.

The rationale behind this “simple” gang scheduling policy—one-gang-at-a-time—is to eliminate the problem of shared hardware resource contention among co-executing real-time tasks by design. It also greatly simplifies schedulability analysis because it transforms the (complex) problem of multicore scheduling of real-time tasks into the well-understood unicore scheduling problem. Since each real-time task is guaranteed temporal isolation, its worst-case execution time (WCET) can be tightly bounded, as opposed to the pessimistic WCET estimation required when co-scheduling of real-time tasks is allowed. Note that this restrictive gang scheduling is still strictly better than the Federal Aviation Administration (FAA) recommended industry practice of disabling all but one core [7].

The obvious problem of low CPU utilization—because the single scheduled real-time gang task may not utilize all cores—is partially mitigated by allowing best-effort tasks to be co-scheduled on any idle cores, on the condition that their memory bandwidth usage is throttled to a threshold value set by the real-time gang task, so that the impact of the co-scheduled best-effort tasks on the critical real-time gang task is strictly bounded. For throttling best-effort tasks, RT-Gang implements a hardware performance counter based kernel-level throttling mechanism [46].

Note, however, that although co-scheduling best-effort tasks can improve CPU utilization, it does nothing to improve the low schedulability of real-time tasks under the strict one-gang-at-a-time policy.

II-D Virtual Gang

To improve schedulability of real-time tasks, the authors of RT-Gang [1] introduced the notion of a virtual gang, which they define as a group of real-time tasks that are explicitly linked and scheduled together, as if they were the threads of a single real-time gang task, under the scheduler.

Fig. 2: Virtual gang concept. Adapted from [1].

Figure 2 illustrates the concept of a virtual gang, in which three separate real-time tasks, τ1, τ2, and τ3, form a virtual gang task τv. The virtual gang task τv is then treated as a single real-time gang task by the scheduler.

However, the conditions under which virtual gangs can be created, and how they can be scheduled effectively to improve the real-time schedulability of the system, were not shown in [1].

III Motivation

In this section, we show that the virtual gang concept, as described in [1], does not necessarily improve real-time schedulability of the system. We present two main challenges that need to be addressed for effective virtual gang based real-time scheduling.

(a) No Virtual-Gangs
(b) Synchronized Virtual-Gang
(c) Unsynchronized Virtual-Gang
Fig. 3: Example schedules under different schemes

III-A Need of Synchronization

The first major requirement for a virtual gang is that all member tasks must have equal periods and must be released synchronously.

As for the period requirement, if the linked member tasks of a virtual gang task do not share the same period, then the consolidated gang task cannot be modeled as a single periodic task from the analysis point of view. Therefore, a virtual gang task can only be created when all of its member tasks have a common period.

Task | WCET (C) | Period (T) | # Threads
τ1   | 1        | 10         | 1
τ2   | 2        | 10         | 1
τ3   | 3        | 10         | 1
τ4   | 4        | 10         | 1
TABLE I: Taskset parameters of the illustrative example

As for the synchronous release requirement, consider the taskset in Table I and suppose it is scheduled on a quad-core platform (m=4) under the one-gang-at-a-time policy. When virtual-gangs are not used, each task in the taskset executes as a gang by itself. This results in the scheduling timeline shown in Figure 3(a). In this scheme, the completion time of the taskset is 10 time units, as the tasks execute sequentially, one at a time. Note that even though none of these tasks fully utilizes the cores—each uses only one core, leaving three idle—the idle cores cannot be used to schedule other real-time tasks (they are said to be “locked” by the gang scheduler), which reduces real-time schedulability.

Now we consider the execution of this taskset as a virtual-gang. Assuming that these are the only tasks that share the same period in our system and members of this taskset do not interfere with each other, an intuitive grouping of these tasks for execution as a gang would be to run them at the same time across all four cores in the system. This results in the execution timeline shown in Figure 3(b). In this scheme, the virtual gang completes in just 4 time units, after which other real-time tasks can be scheduled—i.e., improved real-time schedulability.

However, the execution of the taskset in the virtual-gang scheme assumes that the jobs of the member tasks are perfectly aligned. If this is not the case, then the virtual gang task’s execution time increases, as shown in Figure 3(c), and in the worst case it can be as bad as the original schedule without virtual gangs in terms of real-time task schedulability.

III-B Gang Formation Problem

Another major challenge is deciding which tasks should be grouped together when creating virtual gangs.

Assume that in addition to the tasks from Table I, there is one more task τ5 (single-threaded, WCET of 2 time units, period 10) which needs to be scheduled on the quad-core platform. In this case, the taskset must be split into at least two virtual-gangs since all five tasks cannot execute simultaneously on our target system. Hence the problem is to find an optimal grouping of tasks into virtual-gangs such that the execution time of the taskset is minimized. For the simple taskset considered here, it can be seen, with a little trial and error, that a virtual-gang comprising τ2, τ3, τ4, τ5, and another one comprising just τ1, will achieve the desired result. The resulting execution timeline of the taskset is shown in inset (a) of Figure 4.

Fig. 4: Example schedules under different gang formations.

However, if the tasks in a virtual-gang are not carefully selected, the execution time of the taskset can increase significantly. In the example taskset, a virtual-gang comprising τ1, τ2, τ3, τ5, and the other one comprising τ4, leads to an execution time of 7 time units, compared to 5 time units in the previous case, as can be seen in inset (b) of Figure 4.

Given a taskset, the problem of selecting the tasks which should be run together as virtual gangs so that the execution time of the entire taskset is minimized, is non-trivial. The problem is further complicated by the fact that the tasks in a virtual gang can interfere with each other when run concurrently due to shared hardware resource contention, which may require some degree of pessimism in estimating the virtual gang’s WCET.

Without taking the synchronization and gang formation problems into account, a strategy to improve system utilization via virtual-gangs under the one-gang-at-a-time policy may not yield the desired results and may actually deteriorate the system’s performance and real-time schedulability.

IV RTG-Sync

The goal of RTG-Sync is to manage the life-cycle of virtual-gangs, which are scheduled under the “one-gang-at-a-time” scheduling policy. As explained in Section III, this requires a synchronization mechanism to align the periodic execution of the member tasks of a virtual-gang. Under RTG-Sync, this requirement is fulfilled by the intra-gang synchronization framework.

For a typical multi-threaded process (task), synchronization between the threads of the process can easily be achieved using a barrier mechanism available in the parallel programming library it uses (e.g., the OpenMP barrier). However, such a barrier mechanism is tied to the particular parallel programming framework used by that task and is not designed to be used by disparate tasks for system-level scheduling.

In essence, RTG-Sync provides a specially designed system-wide barrier mechanism to each virtual gang so that all its member tasks can be synchronously released and scheduled by the kernel-level gang scheduler simultaneously.
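As a rough user-space sketch of such a mechanism, the code below backs a barrier with a small memory-mapped file that disparate processes open by path, using an advisory file lock to update the arrival count. The file layout and polling loop are our own illustrative assumptions; RTG-Sync’s actual barrier is integrated with its kernel-level gang scheduler rather than implemented this way.

```python
import fcntl
import mmap
import os
import struct
import time

def create_barrier(path, waiters):
    # Server side: back the barrier with an 8-byte memory-mapped file
    # holding (total_waiters, arrived_count).
    with open(path, "wb") as f:
        f.write(struct.pack("ii", waiters, 0))

def barrier_wait(path, poll=0.001):
    # Client side: atomically register arrival under a file lock, then
    # poll until every member process of the virtual-gang has arrived.
    fd = os.open(path, os.O_RDWR)
    mem = mmap.mmap(fd, 8)
    fcntl.flock(fd, fcntl.LOCK_EX)
    total, arrived = struct.unpack("ii", mem[:8])
    mem[:8] = struct.pack("ii", total, arrived + 1)
    fcntl.flock(fd, fcntl.LOCK_UN)
    while True:
        total, arrived = struct.unpack("ii", mem[:8])
        if arrived >= total:
            break
        time.sleep(poll)
    mem.close()
    os.close(fd)
```

In this sketch, a server would call create_barrier(path, n) when a virtual-gang of n tasks is created, and each member task would call barrier_wait(path) just before its periodic execution begins.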

IV-1 RTG-Sync Manager

The RTG-Sync manager consists of two daemons: a client and a server. The purpose of the client daemon is to establish an IPC socket with the server daemon at a predefined location and to relay commands from the user to the server daemon for processing. The server daemon is then responsible for managing the life-cycle of virtual-gangs and the operating system resources associated with them. It does so by providing users with commands for creating and destroying virtual-gangs, which are described below.

The primary service provided by the server is creating virtual-gangs and initializing their associated resources. The server receives the count of processes which need to be run as a single virtual-gang. It then creates a new memory mapped file in a predefined location and uses that file for creating a system-wide barrier with the waiter count equal to the number of constituent processes. A unique ID value is generated for the virtual-gang which is then published to the user program and hashed in an internal database for book-keeping.

Once all members of a virtual-gang have exited, the user can issue the gang deletion command to free all resources associated with the gang. The user provides the virtual-gang ID value to the server via the client program as part of the command. Upon receiving the command, the server looks up the barrier structure associated with the virtual-gang ID in its internal database. Once found, the barrier is deleted and the associated shared memory is released to the operating system.

IV-2 RTG-Sync User-Library

The RTG-Sync user-library provides an easy-to-use interface to programs which are to be run as part of virtual-gangs. The interface consists of two function calls: one for registering a process as part of an established virtual-gang and the other for synchronizing with the other virtual gang members.

The API call to register a process as a virtual-gang member takes the virtual-gang ID value issued by the RTG-Sync manager. Internally, the API uses the RTG-Sync system-call to record the virtual-gang ID value into the calling process’s task-structure. Furthermore, it maps the virtual-gang ID with the system-wide barrier mechanism.

 1  # Spawn RTG-Sync server daemon
 2  ./rtg_server &
 3
 4  # Issue command to create virtual-gang with 4 tasks
 5  id=`./rtg_client -c 4`
 6
 7  # Create member tasks
 8  chrt -f 5 taskset -c 0 ./tau_1 -v ${id} ... &
 9  chrt -f 5 taskset -c 1 ./tau_2 -v ${id} ... &
10  chrt -f 5 taskset -c 2 ./tau_3 -v ${id} ... &
11  chrt -f 5 taskset -c 3 ./tau_4 -v ${id} ...
12
13  # Destroy virtual-gang
14  ./rtg_client -f ${id}
Listing 1: Virtual gang management example under RTG-Sync

Once a process is registered as part of a virtual-gang, the call to synchronize gang members is simple. It takes the barrier pointer returned by the aforementioned API call and uses it to wait on the barrier. This call must be made by the member task just before its periodic execution begins. As soon as the waiter count for the barrier is reached, the member processes are unblocked simultaneously, leading to the desired alignment of their periodic execution.

IV-3 Kernel Modification

RTG-Sync uses the RT-Gang framework to provide gang scheduling inside the Linux kernel. We have made one change to the RT-Gang framework, related to how gang membership is checked inside the gang scheduler. RT-Gang uses the SCHED_FIFO priority value of a process to determine gang membership; i.e., different processes that have the same SCHED_FIFO priority are considered part of the same gang and are allowed to execute simultaneously. This membership criterion has two drawbacks. First, under RMS priority assignment, it forces tasks that have the same period to execute as a single gang. Second, it restricts the gang scheduler to fixed-priority assignment policies. In RTG-Sync, we have modified this criterion to instead use the virtual-gang ID value recorded in the task’s control block to check gang membership, thus avoiding the aforementioned problems.

One caveat of this design choice is that for multi-threaded processes, the virtual-gang registration call, which sets the virtual-gang ID value for the calling process in the process control block, must be made before any threads are spawned by the process. This is needed because in Linux, the child threads inherit the parent process’s task-structure upon creation, and they must have the same virtual gang ID value in their kernel-level task structures to be considered as part of the same gang by the gang scheduler.

IV-4 Example Usage

Listing 1 shows how the framework is actually used for managing virtual-gangs. In this example, the RTG-Sync server is started and run in the background at line-2. At line-5, the RTG-Sync client is used to issue the command to create a virtual-gang with four member tasks. If the command completes successfully, it returns an integer value to the caller, which is the virtual-gang ID. Once this is done, the actual member tasks of the virtual-gang are spawned (lines-8:11). The source code of these tasks is slightly modified to accept the virtual-gang ID value as an input parameter and use it, via the RTG-Sync user-library calls, to register as part of the virtual-gang and synchronize on the shared-memory barrier before starting their periodic execution. Once all tasks of the virtual-gang have exited, the command to destroy the virtual-gang with the given ID value is issued at line-14. Upon receiving this command, the RTG-Sync server removes the barrier created against the virtual-gang ID value and releases the shared memory back to the operating system.

V Gang Formation Algorithm

In this section, we describe the gang formation algorithm of RTG-Sync.

V-a Problem Statement

For a given candidate-set of rigid real-time gang tasks with a shared period and a given multicore platform with m homogeneous CPU cores, the algorithm’s goal is to form a set of virtual gang tasks such that the total completion time of the virtual gangs is minimized. We assume that the member tasks of each virtual gang are synchronously released through the RTG-Sync framework and that each virtual gang’s WCET can be obtained empirically or analytically without excessive pessimism.

We first present a brute-force algorithm to solve the virtual gang formation problem and then describe a heuristic-based algorithm. Before delving into the details of the algorithms, we define key terms which are used in the remainder of this section.

V-A1 System Configuration

In our algorithm, a system configuration, given a candidate-set, describes a unique combination of virtual-gangs which together execute every task from the candidate-set. We use the notation G = {g1, g2, ..., gk} to denote a configuration, where each gj denotes a virtual-gang comprising tasks from the candidate-set.

V-A2 Completion Time

The completion time of a configuration is defined as the time it takes for all the virtual-gangs, which are part of the configuration, to complete their execution in a given period. Under our framework, the completion time of a configuration is equal to the sum of WCETs of the virtual-gangs which are part of the configuration.
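Under the no-interference assumption stated in Section V-B (a virtual-gang’s WCET equals the largest WCET among its members), a configuration’s completion time reduces to a one-line computation. The sketch below, with task names of our own choosing, reproduces the numbers from the Section III-A example: 10 time units when every Table I task runs as a gang by itself, and 4 when all four run as one synchronized virtual-gang.

```python
def completion_time(config, wcet):
    # Absent inter-task interference, a virtual-gang's WCET is the longest
    # WCET among its members; a configuration's completion time is the sum
    # of its gangs' WCETs (gangs run one at a time).
    return sum(max(wcet[t] for t in gang) for gang in config)

# Taskset of Table I: WCETs 1..4 time units, common period, one thread each.
wcet = {"t1": 1, "t2": 2, "t3": 3, "t4": 4}
print(completion_time([["t1"], ["t2"], ["t3"], ["t4"]], wcet))  # 10 (Fig. 3a)
print(completion_time([["t1", "t2", "t3", "t4"]], wcet))        # 4 (Fig. 3b)
```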

 1  Input: Candidate Set (candidateSet), Number of Cores (m)
 2  Output: Best System Configuration
 3  algorithm gang_formation(candidateSet, m)
 4      configs = generate_system_configs(candidateSet, m)
 5      completionTimes = calc_config_times(configs)
 6      rankedConfigs = rank_configs(configs, completionTimes)
 7      bestConfig = pick_best_config(rankedConfigs)
 8      while True do
 9          bestCompletionTime = bestConfig.completionTime
10          measure(bestConfig)
11          if bestConfig.completionTime <= (1 + tolerance) * bestCompletionTime then
12              break
13          else
14              completionTimes = update_completion_times(bestConfig)
15              rankedConfigs = rank_configs(configs, completionTimes)
16              newBestConfig = pick_best_config(rankedConfigs)
17              if newBestConfig == bestConfig then
18                  break
19              else
20                  bestConfig = newBestConfig
21  return bestConfig
Algorithm 1 Brute-Force Algorithm

V-B Brute-Force Algorithm

The brute-force algorithm for finding the best configuration from the candidate-set is stated in Algorithm 1 and revolves around the following key steps. Given the candidate-set, we generate all possible configurations containing all possible pairings of tasks into virtual-gangs (line-4). For this purpose, we use a recursive procedure which, starting from the simplest configuration—in which every task in the candidate-set runs as a gang by itself—successively generates more complex configurations by programmatically pairing tasks into viable virtual-gangs (a virtual-gang is viable if it requires up to m cores to execute).

We compute the completion time of each configuration (line-5). For this step, we assume that the tasks inside a virtual-gang do not interfere with each other. Under this assumption, the execution time of each virtual-gang is equal to the WCET of its longest running constituent task. Once the completion time of each configuration has been computed, all configurations are ranked by completion time, from shortest to longest (lines-6:7). If two configurations have the same completion time, the one comprising the smaller number of virtual-gangs is ranked higher. If a tie still exists between multiple configurations, the one computed first by the algorithm is ranked higher. The rank value is then used to sort the configurations from best to worst. Table II shows the ranked configurations for the taskset used in the illustrative example in Section III-B. For this taskset, there are 51 unique configurations of the candidate-set into virtual-gangs; we show the best three and the worst three configurations in Table II.
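A minimal sketch of the recursive configuration generation (function and variable names are ours): each task either starts a new virtual-gang or joins an existing one, subject to the m-core viability constraint. For the five single-threaded tasks of the running example on a quad-core platform, it enumerates the 51 unique configurations mentioned above.

```python
def viable_configs(tasks, m, cores):
    # Recursively enumerate every unique partition of `tasks` into
    # virtual-gangs whose combined core demand fits on m cores.
    if not tasks:
        yield []
        return
    first, rest = tasks[0], tasks[1:]
    for config in viable_configs(rest, m, cores):
        # Option 1: `first` runs as a gang by itself.
        yield [[first]] + config
        # Option 2: `first` joins an existing gang, if cores permit.
        for i, gang in enumerate(config):
            if sum(cores[t] for t in gang) + cores[first] <= m:
                yield config[:i] + [gang + [first]] + config[i + 1:]

# Five single-threaded tasks on a quad-core platform (Section III-B).
cores = {t: 1 for t in ("t1", "t2", "t3", "t4", "t5")}
configs = list(viable_configs(list(cores), 4, cores))
print(len(configs))  # 51 unique configurations
```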

Iterative Step: We empirically determine the completion time of the best configuration by executing the virtual-gangs which are part of the configuration on the target platform under the synchronization framework of RTG-Sync and measuring their WCETs (line-10). Based on the empirically determined completion time, there are two possibilities. If it is within a specified tolerance threshold (e.g., 20%) of its analytically computed value, the algorithm finishes and the best system configuration has been found (line-11).

If, on the other hand, the empirically determined completion time exceeds the tolerance threshold, the completion times of all configurations which contain one or more of the virtual-gangs from the best configuration whose measured execution times changed are recalculated, and the configurations are re-ranked (lines-14:16). If the best configuration stays the same, the algorithm finishes (line-17). Otherwise, the iterative step is repeated with the new best configuration until the algorithm converges.

Configuration            Completion Time   Gang Count   Rank
{(), (, , , )}           5                 2            1
{(), (, , , )}           6                 2            2
{(), (, , , )}           6                 2            3
{(, ), (), (), ()}       11                4            49
{(, ), (), (), ()}       11                4            50
{(), (), (), (), ()}     12                5            51
TABLE II: Ranked configurations for the illustrative example in Sec III-B. Rank = 1 is highest.

V-B1 Complexity Analysis

The worst-case complexity of the brute-force algorithm is linear in the total number of system configurations due to the while loop on line-8. Given a candidate-set with N tasks and a system with M cores, the worst-case with respect to the number of unique system configurations arises when each task is single-threaded. In this case, the total number of system configurations is upper bounded by the following series sum:

\[ \sum_{k=\lceil N/M \rceil}^{N} S(N, k) \tag{1} \]

where S(N, k) is the Stirling number of the second kind [16] and can be calculated using the following equation [36]:

\[ S(N, k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{j} \binom{k}{j} (k - j)^{N} \tag{2} \]

In the context of virtual-gang formation, each S(N, k) represents the number of unique ways to partition the N tasks into k virtual-gangs; the lower summation limit reflects that a gang of single-threaded tasks can hold at most M tasks, so at least ⌈N/M⌉ gangs are needed. For the illustrative example from Sec III-B with N = 5 and M = 4, Equations 1 and 2 yield 51, which is the same as the number of system configurations listed in Table II.
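Assuming Equation 1 sums Stirling numbers S(N, k) for k from ⌈N/M⌉ to N, the bound can be checked numerically with the explicit-sum formula of Equation 2; function names are ours.

```python
from math import ceil, comb, factorial

def stirling2(n, k):
    # Stirling number of the second kind via the explicit sum (Eq. 2):
    # the number of ways to partition n labeled tasks into k non-empty groups
    return sum((-1) ** j * comb(k, j) * (k - j) ** n
               for j in range(k + 1)) // factorial(k)

def num_configs(n, m):
    # upper bound of Eq. 1: partitions of n single-threaded tasks into
    # virtual-gangs, where a gang can hold at most m tasks, so at least
    # ceil(n/m) gangs are required
    return sum(stirling2(n, k) for k in range(ceil(n / m), n + 1))
```

For the illustrative example, `num_configs(5, 4)` evaluates to 51, matching the number of configurations in Table II.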

V-C Greedy Packing Heuristic

The brute-force algorithm is adequate when the candidate-set is small and the tasks in the candidate-set are heavily parallel. For lightly parallel tasks and large candidate-sets, the complexity of the brute-force algorithm rapidly becomes intractable; for even moderately sized candidate-sets, the number of system configurations can exceed 4 million. For this reason, we present a simple-to-use heuristic for gang formation. The first step in the heuristic is to sort the tasks in the candidate-set in decreasing order of their WCETs. Then we remove the task with the highest WCET, which we call the anchor task, and pack as many tasks with it for co-execution as permissible by the core count of the platform, giving preference to tasks with larger WCETs if multiple tasks can be paired with the anchor task. The tasks which are paired off are removed from the candidate-set. We continue this process until the candidate-set is empty. Once the virtual-gangs are formed by the heuristic, we empirically determine the WCETs of the virtual-gangs under the RTG-Sync synchronization framework. If the WCET is within an acceptable tolerance threshold (e.g., 20%) of the analytically computed WCET, the virtual-gang is accepted. Otherwise, the virtual-gang is rejected and its member tasks are considered as separate gangs. The runtime of the heuristic is determined by the computational complexity of the sorting algorithm, which in our case is O(N log N).
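A minimal sketch of the greedy packing heuristic, assuming each task declares the number of cores it needs; task data and names are illustrative, and the empirical WCET validation step is omitted.

```python
def greedy_pack(tasks, num_cores):
    # tasks: name -> (wcet, cores_needed); returns a list of virtual-gangs
    pool = sorted(tasks, key=lambda t: tasks[t][0], reverse=True)
    gangs = []
    while pool:
        anchor = pool.pop(0)                  # remaining task with largest WCET
        gang, used = [anchor], tasks[anchor][1]
        for t in list(pool):                  # pool is sorted, so larger WCETs
            if used + tasks[t][1] <= num_cores:   # are preferred automatically
                gang.append(t)
                used += tasks[t][1]
                pool.remove(t)
        gangs.append(gang)
    return gangs
```

The initial sort dominates the runtime, giving the O(N log N) bound stated above.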

VI Schedulability Analysis

As stated in Section II, we consider the rigid real-time gang task model [15] with implicit deadlines.

We consider a set of periodic real-time tasks, denoted by Γ = {τ1, τ2, …, τn}, and a multicore platform with m identical cores. Each task τi is characterized by three parameters (hi, Ci, Ti), where hi is the number of cores it needs to run, Ci is the WCET, which may be analytically or empirically determined, and Ti is the period. For RTG-Sync, we assume that each τi represents a virtual gang task, which may be created by a gang formation algorithm described in the previous section.

Because the underlying gang scheduler schedules these virtual gang tasks one-at-a-time, the exact schedulability test for our system is a straightforward application of the standard unicore response time analysis under the rate-monotonic priority assignment scheme [2], as depicted in the following:

\[ R_i^{k+1} = C_i + \sum_{\tau_j \in hp(\tau_i)} \left\lceil \frac{R_i^{k}}{T_j} \right\rceil C_j \tag{3} \]

where hp(τi) denotes the set of tasks with higher priority than τi and the iteration starts from R_i^0 = C_i.

The task-set is schedulable if the calculated response time of each and every task is less than its period.

Fig. 5: Schedulability results of the analyzed policies for different taskset types on 8 cores

VI-A Simulation Setup

In this section, we present schedulability results comparing RTG-Sync with RT-Gang with the help of synthetic taskset simulation. For taskset generation, we begin by uniformly selecting a period T from a fixed range. For each T, up to 10 rigid gang tasks are generated by selecting a WCET Ci at random and a parallelism level hi, which is varied depending on the taskset type, defined below. The utilization of τi is calculated using the relation u_i = (h_i × C_i)/T. If u_i exceeds the remaining utilization budget of the taskset, the WCET is adjusted so that u_i exactly fills the remaining utilization. Otherwise, taskset generation continues until the desired level of utilization is reached.

We consider three types of tasksets in our simulation, based on the allowed level of parallelization for the tasks in the taskset. For a lightly-parallel taskset, hi is uniformly selected from the lower part of the range [1, m] (m is the number of cores as defined earlier). For a heavily-parallel taskset, the value of hi is picked from the upper part of that range. Finally, for a mixed taskset, the parallelization level is selected randomly from the full interval [1, m].

Our taskset generation scheme is similar to the one used in [42]. However, there are two key differences. First, we consider rigid gang tasks instead of bundled gang tasks. This means that we consider the parallelization level of a task to be fixed prior to scheduling and it stays the same throughout the execution of the task. Second, we purposefully generate multiple tasks with the same period so that there is room for virtual-gang formation under RTG-Sync.

We consider the rate-monotonic priority assignment scheme: τi has a higher priority than τj if Ti < Tj. For tasks with the same period, we assign priorities based on the tasks' WCETs. For each taskset type, we calculate the schedulability results under two different scheduling policies. Under the RT-Gang policy, the unicore response time analysis of Equation 3 is applied to calculate the schedulability of the taskset under one-gang-at-a-time scheduling. In the RTG-Sync scenarios, on the other hand, virtual-gangs are first formed from the taskset and Equation 3 is then used to calculate the schedulability of the taskset comprising virtual-gangs. Under RTG-Sync-BFC, virtual-gangs are formed using the brute-force method, whereas in RTG-Sync-GPC, the greedy packing heuristic is used to form virtual-gangs.

VI-A1 Simulation Results

Figure 5 shows the schedulability plots for 8 cores (m = 8). Note that the RT-Gang policy shows a drastic difference in the schedulability trend among the three taskset types. RT-Gang works poorly for lightly parallel tasksets whereas it performs considerably better for heavily parallel ones. This behavior can be explained in light of the one-gang-at-a-time policy of RT-Gang. For heavily parallel tasks, a single gang task can effectively utilize most cores of the multicore platform. For lightly parallel tasks, however, the same policy can leave most of the cores unused, leading to poor schedulability results. For the RTG-Sync policies, this shortcoming of RT-Gang is remedied by means of virtual-gang formation, resulting in better schedulability for all types of tasksets. Moreover, the greedy packing heuristic for virtual-gang formation yields almost exactly the same schedulability results as the "optimal" brute-force algorithm.

We repeated the simulation for 32 cores (m = 32) and obtained similar results, which we do not include here due to space constraints.

VII Evaluation

In this section, we describe the evaluation results of RTG-Sync on a real multicore platform.

VII-A Setup

We use NVIDIA’s Jetson TX-2 [19] board for our evaluation experiments with RTG-Sync. The Jetson TX-2 board has a heterogeneous multicore cluster comprising six CPU cores (4 Cortex-A57 + 2 Denver; we do not use the Denver cores because they lack support for the hardware performance counters necessary to implement the throttling mechanism). We use the included Linux kernel version 4.4, which is patched with the open-source RT-Gang kernel patch [31] to enable real-time gang scheduling at the kernel level. We made several additional modifications to the Linux kernel to keep the virtual gang information in each thread’s kernel data structure. In all our experiments, we put our evaluation platform in maximum performance mode, which involves statically maximizing the CPU and memory bus clock frequencies and disabling the dynamic frequency scaling governor. We also turn off the GUI and networking components and lower the run-level of the system to keep the background system services to a minimum.

VII-B Case-Study

In this case-study, we demonstrate the effectiveness of using virtual-gangs to improve system utilization, compared to the “one-gang-at-a-time” scheduling and Linux’s default scheduler.

Task                    WCET (ms)   Period (ms)   # Threads   Priority
bandwidth-rt            50.0        100.0         4           5
DNN (DeepPicar)         8.2         50.0          2           10
DNN (DeepPicar)         8.2         50.0          2           10
best-effort (Parboil)   N/A         N/A           2           N/A
best-effort (Parboil)   N/A         N/A           2           N/A
TABLE III: Taskset parameters for case-study
Fig. 6: Distribution of job duration for the highest-priority DNN task
(a) Linux Default
(b) RT-Gang
(c) RTG-Sync
Fig. 7: Annotated KernelShark trace snapshots of case-study scenarios for one hyper period

The taskset for the case-study is shown in Table III. It consists of three real-time tasks and two best-effort ones. We use the DNN workload from DeepPicar [4] for two of the real-time tasks. Both DNN tasks use two threads each and have the same period of 50 ms. We use the synthetic bandwidth-rt benchmark as the third real-time task; it uses 4 threads and has a period of 100 ms. bandwidth-rt is designed to be oblivious to shared resource interference, but it creates significant shared hardware resource contention for the DNN tasks under co-scheduling. As per the RMS priority assignment, we assign a higher real-time priority to the DNN tasks than to the bandwidth-rt task. For best-effort tasks, we use two benchmarks from the Parboil benchmark suite [37]. One of the best-effort tasks is significantly more memory intensive than the other. Both best-effort tasks use two threads each and are pinned to disjoint CPU cores.

We evaluate the performance of this taskset on the Jetson TX-2 under three scenarios. The Linux scenario represents the scheduling of the taskset under the vanilla Linux kernel. In the RT-Gang scheme, the real-time tasks are gang scheduled with the one-gang-at-a-time policy. Finally, under RTG-Sync, we create a virtual-gang comprising the two DNN tasks using the RTG-Sync framework.

Figure 6 shows the cumulative distribution function of the job execution times of the highest-priority DNN task under the three compared schemes. Note that this task has the highest real-time priority in our case-study. In this figure, the performance of the task remains highly deterministic under both RT-Gang and RTG-Sync. In both cases, the observed WCET stays within 10% of its solo WCET (i.e., the WCET measured in isolation) from Table III. However, under the baseline Linux kernel (denoted as Linux), the job execution times vary significantly, with the observed WCET approaching 2X the solo WCET.

The difference among the observed performance under the three scenarios can be better explained by analyzing the execution trace of the taskset in one hyper-period of 100 ms, which is shown in Figure 7. Inset 7(a) displays the execution timeline under vanilla Linux. It can be seen that the DNN tasks suffer from two main sources of interference in this scenario. Whenever the execution of the DNN tasks overlaps with that of the bandwidth-rt task, the execution time of the DNN tasks increases. The execution time also increases when the DNN tasks are co-scheduled with the best-effort tasks. Note that the system is not regulated in any way in this scenario; therefore, the effect of shared resource interference is difficult to predict, as evidenced by the CDF plot of Figure 6, which shows highly variable timing behavior. Under RT-Gang, on the other hand, the execution of the DNN tasks is almost completely deterministic. Due to the restrictive one-gang-at-a-time scheduling policy, co-scheduling of the DNN tasks with bandwidth-rt is not possible. Moreover, the shared resource interference from the best-effort tasks is strictly regulated by the kernel-level throttling framework of RT-Gang.

However, under RT-Gang, each DNN task executes as a separate gang by itself, which means that, because of the one-gang-at-a-time policy, two of the system's cores are unusable by real-time tasks while a DNN task is executing. This reduces the share of the platform's total utilization that can be used by other real-time tasks. Although the idle cores are utilized by best-effort tasks, the strict regulation imposed by the DNN tasks means that the best-effort tasks are mostly throttled while co-scheduled with them. Under RTG-Sync, both of these problems are solved by pairing the two DNN tasks into a single virtual-gang. In this case, the system is fully utilizable by real-time tasks. The execution of the virtual DNN gang is completely deterministic due to the synchronization framework of RTG-Sync. Moreover, since best-effort tasks are never co-scheduled with real-time tasks, the throttling framework is not activated, and any slack left by the real-time tasks can be fully utilized by best-effort tasks without throttling.

VII-C Overhead

The runtime overhead due to RTG-Sync can be broken down into two parts. First is the overhead due to synchronization. This overhead is incurred only once, during the setup phase of the real-time tasks that are members of the same virtual-gang, and it does not contribute to the WCETs of the periodic jobs. Second is the kernel-level overhead incurred due to the simultaneous scheduling of real-time tasks by the gang scheduler. Since we use the RT-Gang framework for this purpose, the kernel-level overhead is the same as reported in [1], which showed negligible overhead on a quad-core platform.

VIII Related Work

Parallel real-time tasks are generally modeled using one of the following three models: the fork-join model [23, 33, 29, 8], the DAG model [3, 32], and the gang task model [6, 14, 15, 20, 12]. In the fork-join model, a task alternates between parallel (fork) and sequential (join) phases over time. In the DAG model, which is a generalization of the fork-join model, a task is represented as a directed acyclic graph with a set of associated precedence constraints, which allows more flexible scheduling as long as the constraints are satisfied. Lastly, the gang model of parallel tasks is further divided into three categories. Under the rigid gang model [20, 12, 14], the number of cores required by the gang is determined prior to scheduling and is assumed to stay constant throughout its execution. In the moldable gang model [6], the number of cores required by the gang is determined by the scheduler on a per-job basis, but once determined, it is assumed to stay constant throughout the execution of the job. Finally, in the malleable gang model [9], the number of cores required by the job can change during the job's execution. Recently, a bundled gang model was proposed in [42]; it is a generalization of the rigid gang model that allows more flexible parallel task modeling at the cost of increased analysis complexity.

In the real-time systems community, fixed-priority and dynamic-priority real-time versions of gang scheduling policies, namely Gang FTP and Gang EDF, respectively, have been studied and analyzed [15, 20, 12]. However, these prior real-time gang scheduling policies do not consider interference caused by shared hardware resources in multicore processors. On the other hand, the Isolation Scheduling model [17] and a recently proposed integrated modular avionic (IMA) scheduler design [28] consider shared resource interference and limit co-scheduling to the tasks of the same criticality (in [17]) or those in the same IMA partition (in [28]). However, they do not specifically target parallel real-time tasks and do not allow co-scheduling of best-effort tasks. Also, to the best of our knowledge, none of the aforementioned real-time scheduling policies has been implemented in an actual operating system. Recently, a restrictive form of gang scheduling policy, which limits scheduling to just one gang task at a time, was proposed and implemented in Linux as an open-source project [1, 31]. The gang scheduler, called RT-Gang, provides strong temporal isolation by avoiding and bounding shared resource interference. However, it can significantly under-utilize computing resources in scheduling critical real-time tasks. Our work leverages the open-source RT-Gang scheduler and develops mechanisms and methodologies that improve the real-time schedulability of the system at a marginal cost in execution time predictability.

Many researchers have attempted to make COTS multicore platforms more predictable with OS-level techniques. A majority of prior works focused on partitioning shared resources among the tasks and cores to improve predictability. Page coloring has long been studied to partition the shared cache [24, 25, 47, 35, 11, 41, 21, 43], DRAM banks [44, 26, 39], and the TLB [30]. Some COTS processors [18, 27] support cache-way partitioning [38]. Mancuso et al. [27] and Kim et al. [22] used both coloring and cache-way partitioning for fine-grained cache partitioning. While these shared resource partitioning techniques can reduce space conflicts in some shared resources, and are hence beneficial for predictability, they are often not enough to guarantee strong time predictability on COTS multicore platforms because of many undisclosed yet important shared hardware resources [40, 5]. Furthermore, partitioning techniques generally lower performance and efficiency and are difficult to apply, especially to parallel tasks.

IX Conclusion

We presented a virtual gang based parallel real-time task scheduling approach for multicore platforms. Our approach is based on the notion of a virtual gang, a group of parallel real-time tasks that are statically linked and scheduled together as a single scheduling entity. We presented an intra-gang synchronization framework and virtual gang formation algorithms that enable strong temporal isolation and high real-time schedulability in scheduling parallel real-time tasks on COTS multicore platforms. We evaluated our approach both analytically and empirically on a real embedded multicore platform using real-world workloads. Our evaluation results showed the effectiveness and practicality of our approach. In the future, we plan to extend our approach to support heterogeneous cores and accelerators such as GPUs.

References

  • [1] W. Ali and H. Yun (2019) RT-gang: real-time gang scheduling framework for safety-critical systems. In Real-Time and Embedded Technology and Applications Symposium (RTAS), Cited by: §I, §I, Fig. 2, §II-C, §II-D, §II-D, §III, §VII-C, §VIII.
  • [2] N. Audsley, A. Burns, M. Richardson, K. Tindell, and A. Wellings (1993) Applying new scheduling theory to static priority preemptive scheduling. Software Engineering Journal 8 (5), pp. 284–292. Cited by: §VI.
  • [3] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, L. Stougie, and A. Wiese (2012) A generalized parallel task model for recurrent real-time processes. In Real-Time Systems Symposium (RTSS), pp. 63–72. Cited by: §VIII.
  • [4] M. G. Bechtel, E. McEllhiney, and H. Yun (2018) DeepPicar: a low-cost deep neural network-based autonomous car. In Embedded and Real-Time Computing Systems and Applications (RTCSA), Cited by: §VII-B.
  • [5] M. G. Bechtel and H. Yun (2019) Denial-of-service attacks on shared cache in multicore: analysis and prevention. In Real-Time and Embedded Technology and Applications Symposium (RTAS), Cited by: §I, §I, §VIII.
  • [6] V. Berten, P. Courbin, and J. Goossens (2011) Gang fixed priority scheduling of periodic moldable real-time tasks. In Junior Researcher Workshop Session of the 19th International Conference on Real-Time and Network Systems (RTNS), pp. 9–12. Cited by: §I, §VIII.
  • [7] Certification Authorities Software Team (2016-11) CAST-32A: Multi-core Processors. Technical report Federal Aviation Administration (FAA). Cited by: §I, §II-C.
  • [8] H. S. Chwa, J. Lee, K. Phan, A. Easwaran, and I. Shin (2013) Global edf schedulability analysis for synchronous parallel tasks on multicore platforms. In Euromicro Conference on Real-Time Systems (ECRTS), pp. 25–34. Cited by: §VIII.
  • [9] S. Collette, L. Cucu, and J. Goossens (2008) Integrating job parallelism in real-time scheduling theory. Information Processing Letters 106 (5), pp. 180–187. External Links: ISSN 0020-0190 Cited by: §VIII.
  • [10] L. Dagum and R. Menon (1998) OpenMP: an industry-standard api for shared-memory programming. Computing in Science & Engineering (1), pp. 46–55. Cited by: §II-A.
  • [11] X. Ding, K. Wang, and X. Zhang (2011) SRM-buffer: an os buffer management technique to prevent last level cache from thrashing in multicores. In Proceedings of the Sixth Conference on Computer Systems, EuroSys, pp. 243–256. External Links: Link, Document Cited by: §VIII.
  • [12] Z. Dong and C. Liu (2017) Analysis Techniques for Supporting Hard Real-Time Sporadic Gang Task Systems. In Real-Time Systems Symposium (RTSS), pp. 128–138. External Links: Document, ISBN 9781538614143, ISSN 10528725 Cited by: §I, §VIII, §VIII.
  • [13] D. G. Feitelson and L. Rudolph (1992) Gang scheduling performance benefits for fine-grain synchronization. Journal of Parallel and distributed Computing 16 (4), pp. 306–318. Cited by: §I.
  • [14] J. Goossens and P. Richard (2016) Optimal Scheduling of Periodic Gang Tasks. Transactions on Embedded Systems 3 (1), pp. 1–4. External Links: Document Cited by: §I, §VIII.
  • [15] J. Goossens and V. Berten (2010) Gang FTP scheduling of periodic and parallel rigid real-time tasks. In International Conference on Real-Time Networks and Systems (RTNS), pp. 189–196. External Links: Link Cited by: §I, §II-A, §II-B, §VI, §VIII, §VIII.
  • [16] R. L. Graham, D. E. Knuth, and O. Patashnik (1989) Concrete mathematics: a foundation for computer science. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. External Links: ISBN 0-201-14236-9 Cited by: §V-B1.
  • [17] P. Huang, G. Giannopoulou, R. Ahmed, D. B. Bartolini, and L. Thiele (2015) An isolation scheduling model for multicores. In Real-Time Systems Symposium (RTSS), pp. 141–152. Cited by: §VIII.
  • [18] Intel Improving real-time performance by utilizing cache allocation technology. Note: https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology Cited by: §VIII.
  • [19] Jetson tx2 module. Note: https://developer.nvidia.com/embedded/jetson-tx2 Cited by: §VII-A.
  • [20] S. Kato and Y. Ishikawa (2009) Gang EDF scheduling of parallel task systems. In Real-Time Systems Symposium (RTSS), pp. 459–468. External Links: Link, Document Cited by: §I, §VIII, §VIII.
  • [21] H. Kim, A. Kandhalu, and R. Rajkumar (2013) A coordinated approach for practical os-level cache management in multi-core real-time systems. In Euromicro Conference on Real-Time Systems (ECRTS), pp. 80–89. External Links: Document Cited by: §VIII.
  • [22] N. Kim, B. C. Ward, M. Chisholm, J. H. Anderson, and F. D. Smith (2017) Attacking the one-out-of-m multicore problem by combining hardware management with mixed-criticality provisioning. Real-Time Systems 53 (5), pp. 709–759. Cited by: §I, §VIII.
  • [23] K. Lakshmanan, S. Kato, and R. Rajkumar (2010) Scheduling parallel real-time tasks on multi-core processors. In Real-Time Systems Symposium (RTSS), pp. 259–268. Cited by: §II-B, §VIII.
  • [24] J. Liedtke, H. Hartig, and M. Hohmuth (1997) OS-controlled cache predictability for real-time systems. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 213–224. External Links: Document Cited by: §VIII.
  • [25] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan (2008) Gaining insights into multicore cache partitioning: bridging the gap between simulation and real systems. In IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 367–378. External Links: Document Cited by: §VIII.
  • [26] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu (2012) A software memory partition approach for eliminating bank-level interference in multicore systems. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 367–375. Cited by: §VIII.
  • [27] R. Mancuso, R. Dudko, E. Betti, M. Cesati, M. Caccamo, and R. Pellizzoni (2013) Real-time cache management framework for multi-core architectures. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 45–54. External Links: Document Cited by: §VIII.
  • [28] A. Melani, R. Mancuso, M. Caccamo, G. Buttazzo, J. Freitag, and S. Uhrig (2017) A scheduling framework for handling integrated modular avionic systems on multicore platforms. In Embedded and Real-Time Computing Systems and Applications (RTCSA), pp. 1–10. Cited by: §VIII.
  • [29] G. Nelissen, V. Berten, J. Goossens, and D. Milojevic (2012) Techniques optimizing the number of processors to schedule multi-threaded tasks. In Euromicro Conference on Real-Time Systems (ECRTS), pp. 321–330. Cited by: §VIII.
  • [30] S. A. Panchamukhi and F. Mueller (2015) Providing task isolation via tlb coloring. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 3–13. External Links: Document Cited by: §VIII.
  • [31] RT-Gang code repository. Note: https://github.com/CSL-KU/RT-Gang Cited by: §VII-A, §VIII.
  • [32] A. Saifullah, D. Ferry, J. Li, K. Agrawal, C. Lu, and C. D. Gill (2014) Parallel real-time scheduling of DAGs. Parallel and Distributed Systems, IEEE Transactions on 25 (12), pp. 3242–3252. Cited by: §VIII.
  • [33] A. Saifullah, J. Li, K. Agrawal, C. Lu, and C. Gill (2013) Multi-core real-time scheduling for generalized parallel task models. Real-Time Systems 49 (4), pp. 404–435. External Links: Document, Link Cited by: §VIII.
  • [34] L. Sha, M. Caccamo, R. Mancuso, J. Kim, M. Yoon, R. Pellizzoni, H. Yun, R. B. Kegley, D. R. Perlman, G. Arundale, et al. (2016) Real-time computing on multicore processors. Computer 49 (9), pp. 69–77. Cited by: §I.
  • [35] L. Soares, D. Tam, and M. Stumm (2008) Reducing the harmful effects of last-level cache polluters with an os-level, software-only pollute buffer. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 258–269. External Links: Document Cited by: §VIII.
  • [36] Stirling Number of the Second Kind. Note: http://mathworld.wolfram.com/StirlingNumberoftheSecondKind.html Cited by: §V-B1.
  • [37] J. A. Stratton, C. Rodrigues, I. Sung, N. Obeid, L. Chang, N. Anssari, G. D. Liu, and W. W. Hwu (2012) Parboil: a revised benchmark suite for scientific and commercial throughput computing. Technical report University of Illinois at Urbana-Champaign. Cited by: §VII-B.
  • [38] G. E. Suh, S. Devadas, and L. Rudolph (2002) A new memory monitoring scheme for memory-aware scheduling and partitioning. In International Symposium on High Performance Computer Architecture, pp. 117–128. External Links: Document Cited by: §VIII.
  • [39] N. Suzuki, H. Kim, D. d. Niz, B. Andersson, L. Wrage, M. Klein, and R. Rajkumar (2013) Coordinated bank and cache coloring for temporal protection of memory accesses. In IEEE International Conference on Computational Science and Engineering (CSE), pp. 685–692. External Links: Document Cited by: §VIII.
  • [40] P. K. Valsan, H. Yun, and F. Farshchi (2016) Taming non-blocking caches to improve isolation in multicore real-time systems. In Real-Time and Embedded Technology and Applications Symposium (RTAS), Cited by: §I, §VIII.
  • [41] B. C. Ward, J. L. Herman, C. J. Kenna, and J. H. Anderson (2013) Making shared caches more predictable on multicore platforms. In Euromicro Conference on Real-Time Systems (ECRTS), pp. 157–167. External Links: Document Cited by: §VIII.
  • [42] S. Wasly and R. Pellizzoni (2019) Bundled scheduling of parallel real-time tasks. In Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 130–142. Cited by: §I, §VI-A, §VIII.
  • [43] Y. Ye, R. West, Z. Cheng, and Y. Li (2014) COLORIS: a dynamic cache partitioning system using page coloring. In International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 381–392. External Links: Document Cited by: §VIII.
  • [44] H. Yun, R. Mancuso, Z. Wu, and R. Pellizzoni (2014) PALLOC: DRAM bank-aware memory allocator for performance isolation on multicore platforms. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 155–166. External Links: Document Cited by: §VIII.
  • [45] H. Yun, R. Pellizzon, and P. K. Valsan (2015) Parallelism-aware memory interference delay analysis for cots multicore systems. In 27th Euromicro Conference on Real-Time Systems (ECRTS), pp. 184–195. External Links: Link, Document Cited by: §I.
  • [46] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha (2013) MemGuard: memory bandwidth reservation system for efficient performance isolation in multi-core platforms. In Real-Time and Embedded Technology and Applications Symposium (RTAS), Cited by: §II-C.
  • [47] X. Zhang, S. Dwarkadas, and K. Shen (2009) Towards practical page coloring-based multicore cache management. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys ’09, pp. 89–102. External Links: Link, Document Cited by: §VIII.