Protecting Memory-Performance Critical Sections in Soft Real-Time Applications

02/08/2015 ∙ by Heechul Yun, et al. ∙ BOSE The University of Kansas 0

Soft real-time applications such as multimedia applications often show bursty memory access patterns, regularly requiring high memory bandwidth for a short duration of time. Such a period is often critical for timely data processing; hence, we call it a memory-performance critical section. Unfortunately, in multicore architectures, non-real-time applications on different cores may also demand high memory bandwidth at the same time, which can substantially increase the time spent in memory-performance critical sections. In this paper, we present BWLOCK, user-level APIs and a memory bandwidth control mechanism that can protect such memory-performance critical sections of soft real-time applications. BWLOCK provides simple lock-like APIs to declare memory-performance critical sections. When an application enters a memory-performance critical section, the memory bandwidth control system dynamically limits other cores' memory access rates to protect the memory performance of the application until the critical section finishes. From case studies with real-world soft real-time applications, we found that (1) such memory-performance critical sections do exist and are often easy to identify; and (2) applying BWLOCK to memory-performance critical sections significantly improves the performance of the soft real-time applications at a small or no cost in throughput of non-real-time applications.

1 Introduction

In a multicore system, the performance of an application running on one core can be significantly affected by applications on other cores due to contention in shared hardware resources such as the shared Last-Level Cache (LLC) and DRAM. When the shared resources become bottlenecks, traditional CPU-scheduling-based techniques such as raising priorities [21] or using CPU-reservation-based approaches [9, 16, 1] do not necessarily improve the performance of real-time applications.

In hard real-time systems, such as avionics systems, one solution adopted in the industry has been to disable all but one core in the system [20] to completely eliminate the shared resource contention problem, so that the system can be certified [5] (footnote 1: the current standard for certification is designed for unicore systems [2]). Another approach, adopted in PikeOS, is a time partitioning technique in which only one core is allowed to execute for a set of pre-defined time windows [10].

In the context of soft real-time systems, on the other hand, a certain degree of performance variation due to interference is often tolerable. Furthermore, modern multicore architecture provides a significant amount of parallelism in the processor architecture (e.g., out-of-order core design) and the memory subsystems (e.g., non-blocking caches and multi-bank DRAM) that can absorb a considerable degree of concurrent accesses without noticeable performance impacts [12]. Therefore, it is highly desirable to develop a solution that can provide better real-time performance while still allowing concurrent executions to leverage the full potential of multicore.

In our previous work, we developed a software-based memory access control system, called MemGuard, that allows concurrent memory accesses from multiple cores, up to certain limits, by providing a minimum memory bandwidth guarantee to each core in the system [31]. One problem of this approach, however, is that the reservable amount of bandwidth is very small compared to the peak memory bandwidth. While MemGuard tries to maximize performance via prediction-based bandwidth reclaiming, accurate prediction is challenging, especially for bursty memory access patterns, which are commonly found in many soft real-time applications.

In this paper, we present BWLOCK, a user-level API and memory bandwidth control mechanism to protect the performance of soft real-time applications such as multimedia applications. Our key observation is that interference is most visible when multiple cores have high memory demands at the same time. In such cases, all participating cores are delayed due to queueing and other effects that cannot be hidden by the underlying hardware. Therefore, in order to protect the performance of real-time applications, the system must avoid overload situations while the real-time applications have high memory demands, that is, while they execute memory-intensive regions of code. We call such a region a memory-performance critical section. Fortunately, in many soft real-time applications, such as multimedia applications, memory-performance critical sections are often easy to identify via application-level profiling techniques. For example, using perf in Linux, one can identify functions that have very high memory demands.

Motivated by these observations, BWLOCK provides a lock-like API with which programmers can mark sections of code that are memory-performance critical. While a memory-performance critical section is being executed, BWLOCK limits the amount of memory traffic allowed from the other cores to avoid overloading the memory system. In cases where modifying source code or profiling is not desired or possible, BWLOCK also allows the entire execution of an application to be declared memory-performance critical, so that whenever the application is scheduled on a CPU core, its memory performance is protected. We call the former fine-grained bandwidth locking and the latter coarse-grained bandwidth locking.

We applied BWLOCK to two real-world soft real-time applications, Mplayer and the WebRTC framework (as part of the chromium-browser), to protect their real-time performance in the presence of memory-intensive non-real-time applications. In the case of Mplayer, we achieve near-perfect isolation for Mplayer at a cost of a 17% throughput reduction of the non-real-time applications in the coarse-grained mode, or 17% better real-time performance for Mplayer at a cost of only a 7% throughput reduction of the non-real-time applications in the fine-grained mode. Similar improvements are observed for WebRTC as well.

Our contributions are as follows:

  • We propose an OS mechanism and API that can substantially improve performance of soft real-time applications in a multi-programmed environment such as cloud systems.

  • We present extensive evaluation results using real-world soft real-time applications demonstrating the viability and the practicality of the proposed approach.

The remaining sections are organized as follows: Section 2 provides background on software based memory access control technique and motivating experiments. Section 3 presents the design and implementation of BWLOCK. Section 4 describes the evaluation platforms and the implementation overhead analysis. Section 5 presents case study results using two real-world soft real-time applications. Section 6 discusses limitations and possible improvements. We discuss related work in Section 7 and conclude in Section 8.

2 Background and Motivation

In [31], we proposed a software-based memory bandwidth management system for the Linux kernel called MemGuard. The key idea is to periodically monitor and regulate the memory access rate of each core using per-core hardware performance counters. If, for example, a group of tasks generates too much memory traffic and delays critical real-time tasks, MemGuard can regulate the memory access rates of the cores running the offending tasks. With this regulation mechanism, MemGuard offers a bandwidth reservation service that partitions a fraction of the available memory bandwidth among the cores and ensures that the reserved bandwidth is guaranteed at all times. The bandwidth reservation parameter is chosen statically for each core (although it can be modified at run-time by the system administrator). Once the reservation parameter is chosen, the primary goal of MemGuard is to guarantee the reserved bandwidth of each core. Static partitioning, however, is inherently inefficient when the demands of cores change over time, as unused bandwidth can be wasted. To minimize bandwidth under-utilization due to static partitioning, MemGuard employs a prediction-based bandwidth reclaiming mechanism that dynamically re-distributes unused bandwidth at runtime.

There are, however, a few issues when we apply MemGuard to improve the performance of soft real-time applications. First, MemGuard reserves memory bandwidth on a per-core basis. Therefore, when a core hosts different types of applications (with different memory bandwidth demands), it is difficult to choose appropriate bandwidth reservation parameters. Second, while this restriction is mitigated to a certain degree by the runtime bandwidth re-distribution mechanism, the effectiveness of that mechanism depends on its prediction accuracy, and accurate prediction is in general very challenging. In particular, multimedia soft real-time applications, which we focus on in this paper, often show bursty memory access patterns. Such patterns are difficult to predict without application-level information or a sophisticated learning algorithm, and the latter is difficult to implement efficiently in the kernel.

Figure 1: Memory bandwidth demand changes over time of Mplayer and WebRTC.
Figure 2: Average memory access latency of a Latency benchmark as a function of the memory bandwidth of each co-runner on three different cores.

Figure 1 shows the memory access patterns of two multimedia applications, Mplayer and WebRTC, collected over a 10-second duration (sampled every 1 ms). Both programs process video/audio frames at a regular interval: when processing a new frame, they require high memory bandwidth for a short period of time, while at other times their memory bandwidth demands are low because they are executing compute-intensive instructions or waiting for the next period.

When these soft real-time applications compete for memory bandwidth with other applications running on different cores, the short sections of code that demand high memory bandwidth can suffer a disproportionately high performance impact. When the overall memory demand is low, memory access latencies can often be hidden by a variety of latency hiding techniques (e.g., out-of-order execution) and the abundant memory-level parallelism in modern multicore architectures [12]. However, when the overall demand is beyond a certain point, such techniques are no longer able to hide latencies, and requests pile up in various queues in the system, which substantially slows down all tasks requesting memory. Figure 2 illustrates this phenomenon. In this experiment, we measure the average memory access latency (normalized to run-alone performance) of the Latency [31] benchmark (a pointer-chasing micro-benchmark) while varying the memory bandwidth demand of co-runners on the other three cores of a quad-core Intel Xeon system (the detailed hardware setup is described in Section 4). When the co-runners’ bandwidth is low (100 MB/s), the performance impact on the Latency benchmark is negligible. However, as the memory bandwidth demand of the co-runners increases (100 to 900 MB/s), the observed delay of the Latency benchmark increases exponentially and then saturates (beyond 900 MB/s).
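To make the measurement concrete, the following is a minimal sketch of a pointer-chasing latency kernel in the spirit of the Latency benchmark; it is not the benchmark's actual source. The 64-byte node size, the names build_random_cycle and chase, and the use of rand() for shuffling are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed cache-line size; one node per line so each hop touches a new line. */
#define LINE_SIZE 64

struct node {
    struct node *next;
    char pad[LINE_SIZE - sizeof(struct node *)];
};

/* Link n nodes into one cycle in shuffled order, so that the access
 * sequence is hard for a hardware prefetcher to predict. */
static void build_random_cycle(struct node *nodes, size_t n, unsigned seed)
{
    assert(n > 0);
    size_t *order = malloc(n * sizeof(*order));
    assert(order != NULL);

    for (size_t i = 0; i < n; i++)
        order[i] = i;

    /* Fisher-Yates shuffle of the visiting order. */
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }

    /* Each node points to its successor in the shuffled order. */
    for (size_t i = 0; i < n; i++)
        nodes[order[i]].next = &nodes[order[(i + 1) % n]];

    free(order);
}

/* Perform `hops` dependent loads: the address of each load comes from
 * the previous one, so the loads cannot overlap with each other. */
static struct node *chase(struct node *start, size_t hops)
{
    struct node *p = start;
    while (hops--)
        p = p->next;
    return p;
}
```

Timing chase() over a working set much larger than the LLC and dividing the elapsed time by the hop count would give an average per-access latency, which could then be normalized to the run-alone value as in Figure 2.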

In summary, our observations are as follows: (1) Soft real-time applications such as multimedia applications often show bursty memory access patterns—regularly requiring a high memory bandwidth for a short duration of time. (2) Such a period, which we call a memory-performance critical section, is often critical for timely data processing but it can be disproportionally delayed by bandwidth demanding co-runners on different cores.

These observations motivate us to design a new memory access control system, BWLOCK, described in the next section.

3 BWLOCK

BWLOCK consists of user-level APIs and a kernel-level memory bandwidth control mechanism, and is designed to improve the performance of soft real-time applications (e.g., multimedia applications) on multicore systems. It provides simple lock-like APIs that applications can call to express the importance of memory performance for a given section of code (i.e., a memory-performance critical section). Once an application acquires the lock, which we call a memory bandwidth lock, the kernel-level memory bandwidth control system allows unlimited memory accesses to the requesting task while regulating the maximum allowed memory bandwidth of the other cores to avoid excess bandwidth contention, which could delay the task running the memory-performance critical section.

3.1 System Architecture

Figure 3 shows the overall architecture of the proposed system. At the user level, we provide two APIs, bw_lock() and bw_unlock(), to protect memory-performance critical sections. When bw_lock() or bw_unlock() is called, the kernel updates the calling process’s state so that whenever the CPU scheduler schedules the task, the kernel can determine whether the task is executing a memory-performance critical section. As an alternative to modifying the code, external utilities can set the bandwidth locks of other processes in the system. The per-core bandwidth regulators are activated when one or more cores are executing memory-performance critical sections. In our current implementation, this check is performed periodically (e.g., every 1 ms) by software-based bandwidth regulators. Ideally, however, hardware-assisted mechanisms could support more fine-grained memory access control (see Section 6 for a discussion of potential hardware support).

Figure 3: Overall system architecture of BWLOCK.

3.2 Design and Implementation

API           Description
bw_lock()     Begin a memory-performance critical section
bw_unlock()   End a memory-performance critical section
Table 1: BWLOCK user-level APIs

BWLOCK supports fine-grained and coarse-grained bandwidth locking. In fine-grained mode, programmers use the APIs in Table 1 to declare memory-performance critical sections. This allows fine-grained control over memory performance but requires detailed profiling information to be effective. Such profiling information can often be obtained easily using publicly available tools such as perf in Linux, as we show in our case studies in Section 5. The coarse-grained mode is equivalent to calling bw_lock once, by the program itself or by an external utility, and never releasing it. Then, whenever the process is scheduled, it automatically holds the bandwidth lock. We provide an external tool to set the bandwidth lock of any existing process in the system. Therefore, BWLOCK can be applied to unmodified programs, although the granularity of control is then the entire duration for which the task occupies a CPU core.

It is important to note that, unlike traditional locks used for synchronization [4], which only one task can acquire at a time, a bandwidth lock can be held by multiple tasks, possibly on different cores, at any given time. In other words, if multiple soft real-time applications request a bandwidth lock, all of them are granted it. We chose this design because the primary goal of BWLOCK is to protect soft real-time applications from memory-intensive non-real-time applications. In a sense, our design is a two-level priority system that prioritizes real-time tasks over non-real-time tasks in accessing memory. It can, however, be naturally extended to support multiple levels of priority in the future. For example, instead of being allowed unlimited memory accesses, a task that holds a bandwidth lock could also be regulated depending on a priority value associated with the bandwidth lock.

Figure 4 shows the kernel-level implementation of BWLOCK. We added an integer value, bwlock_val, to the process control block structure of Linux (task_struct) to indicate the status of BWLOCK. The value can be updated via a system call (see syscall_bwlock()). Since the call simply updates an integer value and nothing else, its overhead is very small (an overhead analysis is given in Section 4.2). Each core’s bandwidth regulator (see per_core_period_handler()) periodically checks how many cores are executing memory-performance critical sections (i.e., the current task’s bwlock_val > 0). If one or more cores are executing memory-performance critical sections, only the cores that hold the bandwidth lock can access memory freely (maxperf_budget is an infinite value), while the others are regulated according to minperf_budget. Note that minperf_budget is a system parameter that indicates the maximum amount of memory traffic that can co-exist without significant performance interference. In our current implementation it is 100MB/s. If a PMC (performance monitoring counter) overflow interrupt occurs because a core has exhausted its bandwidth budget, the core is immediately throttled by scheduling a high-priority real-time kernel thread (kthrottle). The throttled core is re-activated by the period handler at the beginning of the next period.

// task structure
struct task_struct {
    int bwlock_val; // 1 - locked, 0 - unlocked
};

// bwlock system call
syscall_bwlock(pid_t pid, int val)
{
    struct task_struct *p;
    if (pid == 0)
        p = current; // current <- calling task
    else
        p = find_process_by_pid(pid);
    p->bwlock_val = val;
    return 0;
}

// periodic handler called by the
// bandwidth regulators
void per_core_period_handler()
{
    // re-activate the suspended core
    if (current == kthrottle)
        deschedule(kthrottle);
    if (nr_bwlocked_cores() > 0) {
        // one or more cores requested bwlock
        if (current->bwlock_val > 0)
            budget = maxperf_budget;
        else
            budget = minperf_budget;
    } else {
        // no cores requested bwlock
        budget = maxperf_budget;
    }
    // program the core's performance counter
    // to overflow at budget memory accesses
}

// PMC overflow handler
void per_core_overflow_handler()
{
    // stall the core till the next period
    // kthrottle <- high priority idle thread
    schedule(kthrottle);
}
Figure 4: BWLOCK kernel implementation
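
The budget decision in Figure 4 can be exercised outside the kernel. The sketch below is a user-space model of that branch structure together with the arithmetic that turns a bandwidth limit into a per-period access count. It is an illustration under stated assumptions, not the BWLOCK source: the 64-byte transfer per counted access, the decimal interpretation of MB, and the names budget_in_lines, core_budget and BUDGET_UNLIMITED are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CORES         4           /* quad-core testbed described in the text */
#define CACHE_LINE       64u         /* assumed bytes moved per counted access */
#define BUDGET_UNLIMITED UINT64_MAX  /* stands in for maxperf_budget */

/* Convert a bandwidth limit into cache-line transfers per period.
 * With MB taken as 10^6 bytes: bytes per period = mb_per_s * period_us. */
static uint64_t budget_in_lines(uint64_t mb_per_s, uint64_t period_us)
{
    assert(period_us > 0);
    return mb_per_s * period_us / CACHE_LINE;
}

/* Count the cores whose current task holds the bandwidth lock. */
static int nr_bwlocked_cores(const int bwlock_val[NR_CORES])
{
    int n = 0;
    for (int i = 0; i < NR_CORES; i++)
        if (bwlock_val[i] > 0)
            n++;
    return n;
}

/* Budget for one core in the next period: a core is limited only when
 * some core holds the lock and this core does not. */
static uint64_t core_budget(const int bwlock_val[NR_CORES], int core,
                            uint64_t minperf_lines)
{
    assert(core >= 0 && core < NR_CORES);
    if (nr_bwlocked_cores(bwlock_val) > 0 && bwlock_val[core] <= 0)
        return minperf_lines;
    return BUDGET_UNLIMITED;
}
```

Under these assumptions, a 100 MB/s limit with a 1 ms period corresponds to 100,000 bytes, or 1562 cache-line transfers, per period. Because holders are never limited, two lock holders on different cores can still contend with each other in this model.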

4 Evaluation Setup

In this section, we present details on the hardware platform and the BWLOCK software implementation. We also provide detailed overhead analysis and discuss performance trade-offs.

4.1 Hardware Platform

We use a quad-core Intel Xeon W3530 based desktop computer as our testbed. The processor has private 32K-I/32K-D (4/8 way) L1 cache, a private 256 KiB (8 way) L2 cache for each core and a shared 8MiB (16 way) L3 cache. The memory controller (MC) is integrated in the processor and connected to a 4GiB 1066 MHz DDR3 memory module. The graphic card is NVIDIA GeForce 8400. We disabled turbo-boost, dynamic power management, and hardware prefetchers for better performance predictability and repeatability.

4.2 Implementation Details and Overhead Analysis

We implemented BWLOCK in Linux version 3.6 (footnote 2: BWLOCK will be publicly available at https://github.com/heechul/bwlock). The kernel’s task_struct is modified according to Figure 4. For memory bandwidth control, BWLOCK uses a modified version of MemGuard kernel module [31].

There are two major sources of overhead in BWLOCK: system calls and interrupt handling. First, in the fine-grained setting, two system calls are required for each memory-performance critical section. In our current implementation, a single system call is used to implement both bw_lock() and bw_unlock(). The system call overhead is small, 125.24 ns on average (over 10,000 iterations), as it simply changes a single integer value in the task’s task_struct.
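
As an illustration of how one system call can back both APIs, the sketch below wraps a single entry point with two thin user-level functions. The system call number and the user-level library code are not given in the text, so bwlock_syscall_stub is a hypothetical stand-in that only records the value in a process-local variable; a real build would issue the BWLOCK system call at that point.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical stand-in for the task's bwlock_val kept by the kernel. */
static int fake_task_bwlock_val;

/* Hypothetical stub for the single BWLOCK system call (pid, val).
 * pid == 0 means "the calling task", mirroring Figure 4. */
static int bwlock_syscall_stub(pid_t pid, int val)
{
    assert(pid == 0);            /* this sketch models the caller only */
    fake_task_bwlock_val = val;  /* the kernel would set p->bwlock_val */
    return 0;
}

/* Both user-level APIs map onto the same call with different values. */
static int bw_lock(void)   { return bwlock_syscall_stub(0, 1); }
static int bw_unlock(void) { return bwlock_syscall_stub(0, 0); }

/* Example: a memory-heavy loop declared as a critical section. */
static long sum_under_bwlock(const int *buf, long n)
{
    long sum = 0;
    bw_lock();
    for (long i = 0; i < n; i++)
        sum += buf[i];
    bw_unlock();
    return sum;
}
```

Because the value is overwritten rather than counted in this sketch, a nested pair of calls would end the critical section at the first bw_unlock(); whether the actual implementation behaves the same way is not stated in the text.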

Second, in our current implementation, a periodic timer handler is used to monitor which cores hold the bandwidth lock, as shown in Figure 4, while the actual access control is performed by a performance counter overflow interrupt handler. Although the overflow handler is not in the critical path of normal program execution, the periodic timer interrupt is pure overhead that is added to the task’s execution time, just like the OS tick timer handler in standard operating systems. We quantified the periodic interrupt handling overhead by measuring the increase in execution time of a benchmark. Table 2 shows the measured overhead (i.e., the percentage increase in execution time) under different period lengths. Based on this result, we use a 1 ms period unless noted otherwise.

Period (us)   Overhead (%)
100           3.5
250           1.5
500           0.9
1000          0.7
2500          0.5
Table 2: Period interrupt handling overhead
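
The overhead metric in Table 2 is a relative increase in execution time, which can be written down directly. The helper names below are hypothetical, and the sketch only restates the definition given in the text; it does not reproduce the measurement itself.

```c
#include <assert.h>

/* Overhead as defined for Table 2: percentage increase of the
 * benchmark's execution time when the period handler is active. */
static double overhead_pct(double t_base, double t_regulated)
{
    assert(t_base > 0.0);
    return (t_regulated - t_base) / t_base * 100.0;
}

/* Number of period interrupts each core handles per second. */
static double interrupts_per_sec(double period_us)
{
    assert(period_us > 0.0);
    return 1e6 / period_us;
}
```

For instance, a benchmark that takes 10.00 s without the handler and 10.07 s with it has an overhead of about 0.7%, and a 1000 us period corresponds to 1000 interrupts per second per core. The timings in this example are made up for illustration and are not measurements from the paper.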

5 Evaluation Results

In this section, we present case-study results using two real-world soft real-time applications, Mplayer (a video player) and WebRTC [11] (a multimedia real-time communication framework for browser-based web applications), to evaluate the effectiveness of BWLOCK.

5.1 Mplayer

Mplayer is a widely used open-source video player. In the following set of experiments, our goal is to protect real-time performance of the Mplayer(s) in the presence of memory intensive co-running applications while still maximizing overall throughput of the co-runners.

In the first set of experiments, one Mplayer instance plays an H264 movie clip with a frame resolution of 1920×816 and a frame rate of 24fps. We slightly modified the source code of Mplayer to obtain the per-frame processing time and other statistics. Decoded video frames are displayed on screen via a standard X11 server process. Therefore, both Mplayer and X11 have soft real-time characteristics.

5.1.1 Profiling

To understand their memory-performance characteristics, we collect function-level profiling information (cache misses and cycles of each function) with the perf tool, which uses hardware performance counters. The profiled information of Mplayer and X11 is shown in Tables 3 and 4, respectively. In both cases, the functions that generate most of the memory traffic were identified: yuv420_rgb32_MMX in Mplayer and sse2_blt in X11. Note that each function is responsible for more than 50% of the total LLC misses of its application while accounting for a much smaller share of CPU cycles (27.8% and 32.85%, respectively). Therefore, they are prime candidates for applying BWLOCK. Because our current implementation relies on periodic bandwidth regulation, it is also important to know the duration of each function: if it is too short, BWLOCK may not be able to regulate co-runners’ memory accesses when needed. Table 5 shows the average and 99th percentile execution times of the functions. Fortunately, both functions are long enough to be regulated by the bandwidth control mechanism of BWLOCK.

LLC misses   Cycles   Function
51.6%        27.8%    yuv420_rgb32_MMX
18.8%        9.3%     prefetch_mmx2
4.5%         7.3%     hl_decode_mb_simple_8
Table 3: Profiled information of Mplayer

LLC misses   Cycles   Function
53.29%       32.85%   sse2_blt
24.13%       24.19%   fbBlt
14.10%       19.61%   sse2_composite_over_8888_88888
Table 4: Profiled information of X11

Average duration   99th pct. duration   Function           Application
2.9ms              4.2ms                yuv420_rgb32_MMX   Mplayer
1.1ms              2.9ms                sse2_blt           X11
Table 5: Timing statistics of memory intensive functions
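
The selection reasoning above (a dominant share of LLC misses, a smaller share of cycles, and a duration that is not too short for periodic regulation) can be written as a simple screening rule. The sketch below is one possible formulation and not a procedure taken from the paper; the struct, the function name is_bwlock_candidate, and both thresholds are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct func_profile {
    const char *name;
    double llc_miss_pct;    /* share of the application's LLC misses */
    double cycles_pct;      /* share of the application's CPU cycles */
    double avg_duration_ms; /* mean duration of one invocation */
};

/* Hypothetical screening rule: the function must account for at least
 * min_miss_pct of LLC misses, be more memory- than compute-heavy, and
 * run on average for at least one regulation period. */
static bool is_bwlock_candidate(const struct func_profile *f,
                                double min_miss_pct, double period_ms)
{
    assert(period_ms > 0.0);
    return f->llc_miss_pct >= min_miss_pct &&
           f->llc_miss_pct > f->cycles_pct &&
           f->avg_duration_ms >= period_ms;
}
```

With an assumed 50% miss threshold and a 1 ms period, the values in Tables 3 to 5 would mark both yuv420_rgb32_MMX (51.6%, 27.8%, 2.9 ms) and sse2_blt (53.29%, 32.85%, 1.1 ms) as candidates.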

5.1.2 Performance comparison

To investigate the effectiveness of BWLOCK, we conducted a set of experiments. We first run Mplayer alone (with the X server) to obtain the baseline performance. To generate memory interference, we use two instances of a memory-intensive synthetic benchmark [31], referred to as bw_write. We also measure their baseline performance in isolation. We then co-schedule all four processes (Mplayer, X11, and two bw_write instances) at the same time in four different configurations. For convenience of monitoring and measurement, each process is assigned to a dedicated core using the CPU affinity facility in Linux. Note that all four processes are single-threaded. In Default, we use a standard vanilla Linux kernel. In MemGuard, we use MemGuard [31]; the memory bandwidth budgets are configured as 450, 450, 100, and 100MB/s for Core0 to Core3, respectively, and predictive bandwidth re-distribution is enabled. Note that Mplayer (Core0) and X11 (Core1) are given larger bandwidth reservations than the co-runners. In BWLOCK(fine), we manually insert bw_lock and bw_unlock in the previously identified memory-intensive functions of Mplayer and X11 (Table 5), as shown in Figure 5. Lastly, in BWLOCK(coarse), neither Mplayer nor X11 is modified, but both are configured to automatically hold the bandwidth lock whenever they are scheduled.

static inline int yuv420_rgb32_MMX
    (SwsContext *c, const uint8_t *src[],
     ..
{
    bw_lock(); // added
    YUV2RGB_LOOP(4)
    bw_unlock(); // added
}
Figure 5: Code modification example for fine-grained application of BWLOCK.
Figure 6: Normalized performance of Mplayer (average frame time) and co-running Bandwidth benchmarks (MB/s): 1 Mplayer and 2 Bandwidth instances.
Figure 7: Per-core memory access patterns. Panels: (a) Default, (b) MemGuard, (c) BWLOCK(fine), (d) BWLOCK(coarse).

Figure 6 shows the results. For Mplayer, performance is measured by the average frame processing time. For the co-running bw_write benchmarks, performance is measured by the aggregate throughput (MB/s) of the two instances. In the figure, performance is normalized to each application’s baseline performance measured in isolation. In Default, Mplayer’s performance suffers significantly (it drops by 51%) due to memory contention with the co-running bw_write instances, which are much less affected (a drop of 22%). This kind of disproportionate performance impact is common in COTS multicore systems and is caused by a combination of application memory characteristics and the DRAM controller’s scheduling policy [25, 19]. In MemGuard, Mplayer’s performance is better protected (it drops by 32%), as more memory bandwidth is reserved for it. However, this comes at the cost of a considerable performance reduction for the co-runners, which achieve only 51% of their baseline performance. In BWLOCK(fine), on the other hand, the performance of both Mplayer and the co-runners improves over MemGuard, by 12% for Mplayer and 13% for the co-runners. This is because the memory-performance critical sections in Mplayer, identified from profiling, are protected from interference by the co-runners’ memory accesses using the explicit bw_lock and bw_unlock calls. Lastly, in BWLOCK(coarse), Mplayer is unmodified, but it automatically holds the bandwidth lock whenever the CPU scheduler schedules it. As a result, Mplayer’s performance is almost identical to its baseline performance. However, because the entire duration of Mplayer’s processing is protected by the bandwidth lock, even while it is not accessing memory, the co-runners’ performance is slightly further degraded.

Figure 7 shows the memory access pattern of each core. The y-axis shows the number of LLC misses of each core in every one-millisecond period. Note that Core2 and Core3 have a constant memory demand when they run in isolation. In Default, whenever Mplayer and/or X11 begin processing and demand high memory bandwidth, all tasks suffer considerable bandwidth contention. In MemGuard, we can observe that Mplayer (and X11) gets more bandwidth than the bw_write instances when needed. However, because MemGuard relies on predictions of future usage, which are difficult to make accurately, the demand is not always satisfied. In both BWLOCK(fine) and BWLOCK(coarse), on the other hand, we can observe that the co-runners are regulated immediately when Mplayer’s memory demand arrives; hence Mplayer achieves performance nearly identical to its baseline in isolation.

Figure 8 shows the frame processing time in the different system configurations. Note that Solo represents Mplayer’s baseline performance measured in isolation. BWLOCK(coarse) mostly overlaps with Solo. BWLOCK(fine) and MemGuard take longer to process frames, and Default, as expected, takes the longest for most frames.

Figure 8: Frame processing time comparison.

5.1.3 Overloaded System

So far, we have assigned one task per core, and neither Mplayer nor X11 consumes 100% of the cycles of its assigned core. In other words, the system is under-utilized. To investigate how BWLOCK performs in an overloaded system, we performed another set of experiments in which each core runs one Mplayer and one bw_write instance (i.e., four Mplayer instances and four bw_write instances) to fully load the system. The performance metrics are the same: the average frame processing time of Mplayer and the aggregate bandwidth of bw_write. Figure 9 shows the results. Notice that, in this experimental setup, all cores run both real-time and non-real-time tasks. Therefore, MemGuard’s core-based bandwidth partitioning, which prioritizes certain cores over the others, is not appropriate. Hence, we only compare the results of Default and the two BWLOCK settings (fine and coarse). As shown in the figure, both BWLOCK settings provide good performance isolation for the Mplayer instances at the cost of more degraded performance for the co-runners, which do not request bandwidth locks. Note that our current BWLOCK implementation does not limit the number of tasks that can hold bandwidth locks at a given time. Therefore, the soft real-time tasks that hold bandwidth locks on different cores can still delay each other through memory contention. The performance reduction of Mplayer in BWLOCK(coarse) does not come from contention with the bw_write instances but entirely from the co-running Mplayer instances; we verified this by comparing it with the result obtained by running only four instances of Mplayer without the bw_write instances.

Figure 9: Normalized performance of Mplayer (average frame time) and co-running Bandwidth benchmarks (MB/s): 4 Mplayer and 4 Bandwidth instances.

5.2 WebRTC

WebRTC is an open-source, plug-in-free RTC (real-time communication) platform for enabling audiovisual, network-based applications between browsers. The goal of this experiment is to provide real-time performance isolation to WebRTC sessions in the presence of memory-intensive co-running applications on multicore platforms. We also investigate the side effects of different isolation mechanisms on the performance of the co-runners. The setup is configured to achieve negligible network congestion by connecting the two communicating hosts directly through a Gigabit Ethernet switch. Hence, the observed performance variability is entirely due to resource contention in the host itself. WebRTC uses the GCC (Google Congestion Control) algorithm to derive the target bit rate of audiovisual streams based on resource contention in the network and at the end hosts [6]. The frame rate and sending bandwidth are adjusted to match the available resources at any given time. The default resolution of 640×480 and frame rate of 30 FPS are used for the experiments, while the threshold bandwidth is increased from the default 2 Mb/s to 4 Mb/s. LBM benchmarks from the SPEC2006 suite are chosen as co-running applications. Since the X11 server is the front end of WebRTC, the two are treated as one group and assigned to share CPU cores in one cgroup, while the LBM co-runners are allocated to the remaining two CPUs, which belong to another cgroup.

5.2.1 Profiling

As with Mplayer, we collected function-level profiling information for WebRTC using the Linux perf tool to understand its memory access pattern. This time we focused only on cache-miss events. The functions sk_memset32_SSE2 and S32A_Opaque_BlitRow32_SSE2 from the Skia library together cause more than 50% (29.29% and 22.95%, respectively) of the cache misses during a WebRTC session. The mean execution time of these functions is 7.5 us, with more than 99% of the samples below 100 us, which is much shorter than the functions profiled in Mplayer. At first sight, these functions do not seem ideal for fine-grained BWLOCK, as the minimum BWLOCK period is 1 ms. We believe, however, that higher-level subroutines invoke these low-level graphics functions a large number of times in bursts, so that the aggregate time during which the functions are active may be on the order of the BWLOCK regulation period. We therefore applied fine-grained BWLOCK to the two functions to understand the effects of fine-grained memory bandwidth regulation, and compared its performance with the other isolation techniques: coarse-grained BWLOCK, MemGuard, and Default (CPUSET partitioning). As with Mplayer, bw_lock and bw_unlock calls are inserted at the entry and exit of the functions; between them, sufficient memory bandwidth (1000 MB/s) is reserved for the corresponding CPU cores, while the bandwidth quota of the other cores is set to 100 MB/s.

Figure 10: Normalized performance of WebRTC (average bandwidth) and co-running LBM benchmarks (MB/s)

5.2.2 Performance

Figure 10 shows the normalized performance of WebRTC and the co-running LBM instances under the different isolation mechanisms. Coarse-grained BWLOCK achieves near-perfect performance isolation for WebRTC from the co-running LBM tasks, albeit at a heavy penalty to the co-runners. The WebRTC process consists of ~20 threads, of which only a couple are involved in encoding and decoding video. Hence, the coarse-grained mechanism over-reserves bandwidth for the WebRTC process, leaving very little spare bandwidth for the co-runners. With fine-grained BWLOCK (using bw_lock and bw_unlock) on the profiled graphics functions, the performance of WebRTC improves without much penalty to the co-running tasks. Since the two profiled functions contribute only around 50% of the cache misses, perfect isolation is not achieved. At the same time, the many non-core threads (threads not involved in encoding and decoding video) do not receive reserved bandwidth, which leaves sufficient room for the LBM co-runners. Some performance penalty is incurred because the very short execution time of the profiled functions leads to increased system call overhead. Using MemGuard with reclaiming and spare sharing enabled, we could achieve perfect isolation for WebRTC, at a penalty of more than 50% for the co-runners. In contrast to MemGuard, which requires a static, predetermined per-core bandwidth allocation, BWLOCK is a dynamic, on-demand mechanism. All of the approaches achieve better real-time WebRTC performance than Default (CPUSET partitioning alone).
Table 6 shows the key internal metrics of WebRTC. The results correspond to the average bandwidth achieved by WebRTC in each scenario. Except for Default (CPUSET partitioning alone) and fine-grained BWLOCK, all configurations provide complete isolation to WebRTC from the co-runners, albeit with varying degrees of penalty to the co-running applications. A clear trade-off emerges: Default, BWLOCK(fine), MemGuard, and BWLOCK(coarse) provide increasing levels of isolation to WebRTC at an increasing penalty to the co-runners. When GCC kicks in under resource contention, it dynamically adapts the stream, reducing the bandwidth and/or the frame rate. Together, these parameters determine the achieved audiovisual quality.

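| Configuration | Round-trip time (ms) | Frame rate (FPS) | Bandwidth (kb/s) |
|---|---|---|---|
| Default | 17.20 | 21.34 | 2917.85 |
| BWLOCK(fine) | 4.22 | 29.58 | 2229.10 |
| MemGuard | 2.24 | 29.98 | 4019.16 |
| BWLOCK(coarse) | 2.22 | 30.00 | 4025.30 |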
Table 6: WebRTC internal performance metrics
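To relate the frame rates in Table 6 to the configured 30 FPS, the following short sketch computes each configuration's frame rate as a fraction of that target. The frame-rate values are copied from the table. The helper name `fraction_of_target` is hypothetical, and this normalization is only an illustration, not necessarily the one used for Figure 10.

```python
TARGET_FPS = 30.0  # frame rate configured for the experiment

# Frame rates (FPS) copied from Table 6.
frame_rate = {
    "Default": 21.34,
    "BWLOCK(fine)": 29.58,
    "MemGuard": 29.98,
    "BWLOCK(coarse)": 30.00,
}


def fraction_of_target(fps: float, target: float = TARGET_FPS) -> float:
    """Achieved frame rate as a fraction of the configured target."""
    return fps / target


if __name__ == "__main__":
    for config, fps in frame_rate.items():
        print(f"{config:15s} {100 * fraction_of_target(fps):6.1f}% of target")
```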

6 Discussion

In this section, we discuss limitations of our approach and future improvements.

6.1 Hardware Assisted Memory Bandwidth Control

A significant limitation of our current approach is that it relies on software-based periodic monitoring and bandwidth control, whose control granularity is limited to the millisecond range by interrupt handling overhead. This means that the detection and application of a bandwidth lock can be delayed by up to one timer period. While this may not be a serious issue for many soft real-time applications, as we have shown in this paper, other applications may not tolerate such delays. This limitation could be overcome with hardware support in the memory controller or the CPU. For example, hardware could expose to the kernel a set of registers that control the memory access priorities in the DRAM controller [19] or the size of the MSHR in the shared cache [8]. BWLOCK could then simply update these registers to protect memory-performance critical sections.
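The delay bound mentioned above can be made concrete with a minimal sketch. It assumes a simplified model in which a lock request only takes effect at the next periodic timer tick. The function name `activation_delay_us` and the tick alignment at multiples of the period are illustrative assumptions, not details taken from the implementation.

```python
import math


def activation_delay_us(request_time_us: float, period_us: float = 1000.0) -> float:
    """Time between a lock request and the next periodic tick that applies it.

    Assumption: ticks occur at integer multiples of period_us, and a request
    arriving exactly on a tick is applied immediately.
    """
    next_tick_us = math.ceil(request_time_us / period_us) * period_us
    return next_tick_us - request_time_us


if __name__ == "__main__":
    for t in (250.0, 1000.0, 1001.0):
        print(t, activation_delay_us(t))
```

In this model the delay is always below one period, and it approaches the full period for a request that arrives just after a tick. A hardware mechanism that applies the lock immediately would remove this term.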

6.2 Application to Hard Real-Time Systems

Although we focus on soft real-time applications, we believe BWLOCK can also be applied to hard real-time systems in some cases. For example, it is possible to designate a single core to execute all hard real-time applications while the other cores execute non-real-time applications. In such a scenario, we can apply BWLOCK to all hard real-time tasks on the designated core so that, while any hard real-time task executes, the maximum memory bandwidth usage of every other core is limited to a fixed bound. Non-real-time tasks and hard real-time tasks can then safely coexist without excessive memory contention. In particular, with the hardware support mentioned earlier, such a design could be used for systems that need certification [5].

7 Related Work

OS-level memory access control was first discussed in the literature by Bellosa [3]. The basic idea is to reserve a fraction of the memory bandwidth for each core [3, 30, 31] (or task [13]) by means of software mechanisms such as a TLB handler [3] or hardware performance counter interrupts [31, 13]. One problem with memory bandwidth reservation is that, because memory bandwidth is partitioned among the cores (or tasks), usable bandwidth can be substantially wasted if a core (or task) does not fully use its reserved bandwidth. The work in [31] partly solves this problem by supporting dynamic reclaiming and sharing, which redistributes memory bandwidth from cores that under-utilize their reservations to cores that need more than their reservations.
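The waste inherent in static reservation, and the idea of redistributing unused bandwidth, can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of [31], which relies on usage prediction. The function names, the proportional split of the surplus, and the example numbers are all assumptions made for this sketch.

```python
from typing import List


def wasted_bandwidth(reserved: List[float], demand: List[float]) -> float:
    """Bandwidth that is reserved but unused under purely static reservation."""
    return sum(max(r - d, 0.0) for r, d in zip(reserved, demand))


def reclaim(reserved: List[float], demand: List[float]) -> List[float]:
    """Grant each core min(demand, reservation), then hand the surplus of
    under-utilizing cores to cores with unmet demand, split in proportion
    to that unmet demand (a hypothetical policy chosen for simplicity).
    """
    surplus = wasted_bandwidth(reserved, demand)
    deficit = [max(d - r, 0.0) for r, d in zip(reserved, demand)]
    total_deficit = sum(deficit)
    granted = []
    for r, d, need in zip(reserved, demand, deficit):
        extra = 0.0
        if need > 0.0 and total_deficit > 0.0:
            extra = min(need, surplus * need / total_deficit)
        granted.append(min(r, d) + extra)
    return granted


if __name__ == "__main__":
    reserved = [300.0, 300.0]   # MB/s per core (made-up example values)
    demand = [100.0, 600.0]
    print(wasted_bandwidth(reserved, demand))
    print(reclaim(reserved, demand))
```

With these example values, core 0 leaves part of its reservation idle under static reservation, and reclaiming passes that surplus to core 1, which still cannot be fully satisfied.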

However, the effectiveness of these techniques depends on the cores' memory access patterns and on the accuracy of future usage predictions. In general, memory bandwidth reservation systems do not utilize the available memory bandwidth efficiently, because the bandwidth reserved for certain real-time tasks results in under-utilization of the memory subsystem. Efficient utilization is essential in many soft real-time systems, where real-time applications are co-scheduled with non-real-time applications. In contrast, BWLOCK allows unrestricted memory accesses most of the time, thereby leveraging the full parallelism available in modern multicore architectures, and limits excessive concurrent memory accesses from non-real-time tasks only when they would likely affect the performance of soft real-time tasks that are executing memory-performance critical sections. We find that such selective regulation utilizes memory bandwidth more efficiently than reservation-based approaches while still providing good real-time performance.

In the context of providing performance isolation in multicore systems, a software-based cache partitioning technique known as page coloring has been extensively studied [22, 32, 7, 17, 24, 28, 29]. The basic idea is to allocate memory pages at certain physical addresses such that each core accesses a different subset of the cache sets. This way, the cache can be effectively partitioned without special cache hardware. A downside of this approach, however, is that it is very costly to change the partition size at runtime. More recently, page coloring has been applied to partition DRAM banks [23, 27, 29]. As with bandwidth partitioning, however, these space-partitioned resources (cache and DRAM bank space) can be wasted if they are not utilized by the cores or tasks they are reserved for. Nevertheless, these space partitioning techniques can reduce the degree of interference experienced by concurrent tasks and are orthogonal to our approach.
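As a rough illustration of page coloring, the sketch below derives the number of page colors from a cache geometry and maps a physical address to its color. It assumes a physically indexed cache with plain modulo set indexing (no address hashing or cache slicing) and 4 KiB pages. The 2 MiB, 16-way geometry and all function names are hypothetical and are not taken from the evaluation platform.

```python
from typing import List

PAGE_SIZE = 4096  # bytes, assumed page size


def num_colors(cache_bytes: int, ways: int, page_size: int = PAGE_SIZE) -> int:
    """Number of page colors: the size of one cache way divided by the page size."""
    return cache_bytes // (ways * page_size)


def page_color(phys_addr: int, colors: int, page_size: int = PAGE_SIZE) -> int:
    """Color of the page containing phys_addr (low bits of the page frame number)."""
    return (phys_addr // page_size) % colors


def colors_for_core(core: int, num_cores: int, colors: int) -> List[int]:
    """Disjoint set of colors assigned to one core (simple round-robin split)."""
    return [c for c in range(colors) if c % num_cores == core]


if __name__ == "__main__":
    colors = num_colors(2 * 1024 * 1024, ways=16)
    print(colors)
    print(page_color(0x12345000, colors))
    print(colors_for_core(0, 4, colors))
```

An allocator that only hands a core pages whose color is in that core's set confines the core to its own subset of cache sets. Changing the partition size means migrating pages to different colors, which is the runtime cost noted above.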

There have been many hardware proposals that allow communication between the system software (OS) and the hardware to make better resource scheduling and allocation decisions. For example, many DRAM controller design proposals allow the OS to set per-core priorities for memory request scheduling [18, 19, 26, 15]. More recently, Intel's new Xeon architecture has begun to expose shared resource allocation interfaces to the OS [14]. These interfaces are currently restricted to partitioning the LLC, but they are generic and could support controlling other shared resources such as DRAM. Such hardware support can be especially useful for BWLOCK, because the software-based periodic bandwidth control mechanism could be replaced by more efficient hardware mechanisms with lower overhead.

8 Conclusion

We have presented BWLOCK, a user-level API and kernel-level memory bandwidth control mechanism designed to protect the performance of soft real-time applications such as multimedia applications. It provides simple lock-like APIs to declare memory-performance critical sections in the application code. When an application enters a memory-performance critical section, BWLOCK automatically regulates the other cores' memory access rates so that they cannot cause excessive memory interference.

We applied BWLOCK to two real-world soft real-time applications, MPlayer and the WebRTC framework, to protect their real-time performance in the presence of memory-intensive non-real-time applications that share the same machine. In both cases, we were able either to achieve near-perfect real-time performance or to choose a less-than-perfect level of real-time performance (still better than vanilla Linux) in exchange for minimal throughput reductions of the non-real-time applications.

Our future work includes hardware-assisted bandwidth control for better control quality and compiler-based automatic identification of memory-performance critical sections in soft real-time applications.

Acknowledgements

This research is supported in part by NSF CNS 1302563. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

  • [1] L. Abeni and G. Buttazzo. Integrating multimedia applications in hard real-time systems. In Real-Time Systems Symposium (RTSS), pages 4–13. IEEE, 1998.
  • [2] Aeronautical Radio Inc. Avionics Application Standard Software Interface (ARINC) 653, 2013.
  • [3] F. Bellosa. Process cruise control: Throttling memory access in a soft real-time environment. Technical Report TR-I4-97-02, University of Erlangen, Germany, July 1997.
  • [4] A. Block, H. Leontyev, B. Brandenburg, and J. Anderson. A flexible real-time locking protocol for multiprocessors. In Embedded and Real-Time Computing Systems and Applications (RTCSA), pages 47–56. IEEE, 2007.
  • [5] Certification Authorities Software Team (CAST). Position Paper CAST-32: Multi-core Processors (Rev 0). Technical report, Federal Aviation Administration (FAA), May 2014.
  • [6] Luca De Cicco et al. Experimental investigation of the google congestion control for real-time flows. In Proceedings of the 2013 ACM SIGCOMM workshop on Future human-centric multimedia networking, pages 21–26. ACM, 2013.
  • [7] X. Ding, K. Wang, and X. Zhang. SRM-buffer: an OS buffer management technique to prevent last level cache from thrashing in multicores. In European Conf. on Computer Systems (EuroSys). ACM, 2011.
  • [8] E. Ebrahimi, C.J. Lee, O. Mutlu, and Y.N. Patt. Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. ACM Sigplan Notices, 45(3):335, 2010.
  • [9] D. Faggioli, M. Trimarchi, F. Checconi, M. Bertogna, and A. Mancina. An implementation of the earliest deadline first algorithm in linux. In Proceedings of the 2009 ACM symposium on Applied Computing, pages 1984–1989. ACM, 2009.
  • [10] S. Fisher. Certifying Applications in a Multi-Core Environment: a New Approach Gains Success. Technical report, SYSGO AG., 2012.
  • [11] Google. WebRTC. http://www.webrtc.org/.
  • [12] J.L. Hennessy and D.A. Patterson. Computer architecture: a quantitative approach. Morgan Kaufmann, 2011.
  • [13] R. Inam, N. Mahmud, M. Behnam, T. Nolte, and M. Sjödin. The Multi-Resource Server for Predictable Execution on Multi-core Platforms. In Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE, April 2014.
  • [14] Intel. Intel®64 and IA-32 Architectures Software Developer Manuals, 2014.
  • [15] Ravi Iyer, Li Zhao, Fei Guo, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan Solihin, Lisa Hsu, and Steve Reinhardt. Qos policies and architecture for cache/memory in cmp platforms. ACM SIGMETRICS Performance Evaluation Review, 35(1):25–36, 2007.
  • [16] S. Kato, R. Rajkumar, and Y. Ishikawa. AIRS: supporting interactive real-time applications on multicore platforms. In Euromicro Conference on Real-Time Systems (ECRTS), 2010.
  • [17] H. Kim, A. Kandhalu, and R. Rajkumar. A coordinated approach for practical os-level cache management in multi-core real-time systems. In Real-Time Systems (ECRTS), pages 80–89. IEEE, 2013.
  • [18] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. Atlas: A scalable and high-performance scheduling algorithm for multiple memory controllers. In High Performance Computer Architecture (HPCA), pages 1–12. IEEE, 2010.
  • [19] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread cluster memory scheduling: Exploiting differences in memory access behavior. In Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM International Symposium on, pages 65–76. IEEE, 2010.
  • [20] O. Kotaba, J. Nowotsch, M. Paulitsch, S. Petters, and H. Theiling. Multicore in real-time systems temporal isolation challenges due to shared resources. In Workshop on Industry-Driven Approaches for Cost-effective Certification of Safety-Critical, Mixed-Criticality Systems (at DATE Conf.), 2013.
  • [21] J. Lehoczky, L. Sha, and Y. Ding. The rate monotonic scheduling algorithm: Exact characterization and average case behavior. In Real Time Systems Symposium (RTSS), pages 166–171. IEEE, 1989.
  • [22] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In High Performance Computer Architecture (HPCA). IEEE, 2008.
  • [23] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu. A software memory partition approach for eliminating bank-level interference in multicore systems. In Parallel Architecture and Compilation Techniques (PACT), pages 367–376. ACM, 2012.
  • [24] R. Mancuso, R. Dudko, E. Betti, M. Cesati, M. Caccamo, and R. Pellizzoni. Real-Time Cache Management Framework for Multi-core Architectures. In Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE, 2013.
  • [25] T. Moscibroda and O. Mutlu. Memory performance attacks: Denial of memory service in multi-core systems. In Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium, page 18. USENIX Association, 2007.
  • [26] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu. Mise: Providing performance predictability and improving fairness in shared main memory systems. In High Performance Computer Architecture (HPCA2013), pages 639–650. IEEE, 2013.
  • [27] N. Suzuki, H. Kim, D. de Niz, B. Andersson, L. Wrage, M. Klein, and R. Rajkumar. Coordinated Bank and Cache Coloring for Temporal Protection of Memory Accesses. In Computational Science and Engineering (CSE), pages 685–692. IEEE, 2013.
  • [28] Y. Ye, R. West, Z. Cheng, and Y. Li. COLORIS: a dynamic cache partitioning system using page coloring. In Parallel Architectures and Compilation Techniques (PACT), pages 381–392. ACM, 2014.
  • [29] H. Yun, R. Mancuso, Z. Wu, and R. Pellizzoni. PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms. In Real-Time and Embedded Technology and Applications Symposium (RTAS), 2014.
  • [30] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha. Memory Access Control in Multiprocessor for Real-time Systems with Mixed Criticality. In Euromicro Conference on Real-Time Systems (ECRTS), 2012.
  • [31] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha. MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms. In Real-Time and Embedded Technology and Applications Symposium (RTAS), 2013.
  • [32] X. Zhang, S. Dwarkadas, and K. Shen. Towards practical page coloring-based multicore cache management. In European Conf. on Computer Systems (EuroSys), 2009.