Memory-Aware Denial-of-Service Attacks on Shared Cache in Multicore Real-Time Systems

05/21/2020 ∙ by Michael Bechtel, et al. ∙ The University of Kansas

In this paper, we identify that memory performance plays a crucial role in the feasibility and effectiveness of denial-of-service (DoS) attacks on shared caches. Based on this insight, we introduce new cache DoS attacks, which can be mounted from user-space and can cause extreme worst-case execution time (WCET) impacts on cross-core victims—even if the shared cache is partitioned—by taking advantage of the platform's memory address mapping information and HugePage support. We deploy these enhanced attacks on two popular embedded out-of-order multicore platforms using both synthetic and real-world benchmarks. The proposed DoS attacks achieve up to 75X WCET increases on the tested platforms.


1 Introduction

Multicore computing platforms are increasingly used in safety-critical cyber-physical systems such as self-driving cars and drones. However, on a multicore platform, a task’s execution time can vary significantly due to contention for shared micro-architectural resources when other tasks run concurrently [12]. Such timing variation can be exploited by attackers. Consider, for example, a scenario where some cores of a multicore platform are reserved for critical real-time tasks while other cores are reserved for user-downloaded third-party programs. Even if the platform’s runtime (OS or hypervisor) partitions cores and memory to isolate the potentially dangerous programs from the critical tasks, as long as they share the same multicore computing platform, an attacker-controlled program may still be able to delay the critical tasks simply by executing code that exhausts shared micro-architectural resources, effectively mounting denial-of-service (DoS) attacks.

Modern multicore processors provide a high degree of parallelism in accessing memory throughout the memory hierarchy. At the cache level, non-blocking caches [21] are used, which can be accessed even while multiple cache misses are outstanding. However, a non-blocking cache becomes inaccessible whenever its internal hardware buffers are exhausted, at which point it cannot accept any further requests. The cache remains blocked until the internal buffers become available again [1, 32]. For a shared last-level cache (LLC), cache blocking is especially problematic because it affects all cores that share the cache: all requests to the cache are rejected, regardless of their origin. The cores must then wait for the cache to unblock, which can take a long time if the cache has to access the slower main memory, which in turn can take hundreds of CPU cycles. Therefore, an attacker who can intentionally induce blocking on the shared LLC can cause massive timing impacts on the remaining cores without ever accessing their resources directly.

Prior work demonstrated the feasibility and severity of micro-architectural DoS attacks on shared non-blocking caches [6, 7, 35]. These works identified two internal cache hardware structures as potential DoS attack vectors: (1) miss-status-holding-registers (MSHRs), which track individual requests generated from cache misses, and (2) WriteBack buffers, which temporarily hold and delay cache write-backs. In these attacks, an attacker simply accesses a large array, which quickly generates a large number of concurrent cache misses and exhausts those cache-internal structures, inducing cache blocking. The same works showed that conventional cache partitioning techniques are ineffective against such DoS attacks, because the targeted internal cache hardware structures may still be shared even when the cache space is partitioned.

In this paper, we first experimentally show that the effectiveness of cache DoS attacks critically depends on memory performance. This is because the basis of a cache DoS attack—inducing cache blocking—relies on how quickly the cache misses are processed by the lower levels of the memory hierarchy. If memory requests are processed quickly, a cache DoS attack is less effective because the cache does not stay blocked for long. If, on the other hand, memory requests are processed slowly, the attack becomes more effective as the likelihood and duration of cache blocking increase.

Based on this insight, we propose memory-aware cache DoS attacks that induce more effective cache blocking by taking advantage of information about the underlying memory hardware. Like prior cache DoS attacks, our new attacks generate continuous cache misses to exhaust the cache's internal shared hardware resources. The difference is that we carefully steer those cache misses to the same DRAM bank to induce bank conflicts. Note that accesses to different DRAM banks can proceed in parallel and are thus fast, whereas accesses to the same bank are serialized and thus slow [42]. Since each memory request then takes longer to finish, the cache stays blocked longer. We further extend these attacks to exploit HugePage support in Linux to directly control physical address bits and to avoid TLB misses, while still mounting the attacks from userspace.

We deploy the proposed memory-aware DoS attacks on two contemporary embedded multicore platforms using both synthetic and real-world representative benchmarks. We find that the proposed DoS attacks are significantly and consistently more effective at increasing a victim task’s WCET than existing state-of-the-art DoS attacks. For instance, on one of the tested embedded multicore platforms, our proposed attacks cause a 75X slowdown to a cross-core victim task, whereas the state-of-the-art cache DoS attacks cause a 21X slowdown.

This paper makes the following contributions:

  • We show that main memory (DRAM) performance plays a crucial role in the feasibility and effectiveness of denial-of-service attacks on shared cache.

  • We propose new cache DoS attacks that leverage a platform’s memory address mapping information and HugePage support to induce prolonged shared-cache blocking by intentionally generating a large number of DRAM bank conflicts.

  • We experimentally demonstrate the effectiveness of the proposed DoS attacks on two embedded multicore processors using both synthetic and real-world applications. The results show that our attacks are significantly more effective than the state-of-the-art.

The remainder of this paper is organized as follows: Section 2 provides the necessary background on cache DoS attacks, memory address mapping, and HugePages. Section 3 defines the threat model. Section 4 motivates targeting memory performance. Section 5 discusses the cache DoS attacks and how they function. Section 6 details how the enhanced DoS attacks work and demonstrates their efficacy on embedded multicore platforms. Section 7 reviews how our enhanced attacks can be mitigated, or potentially prevented. We discuss related work in Section 8 and conclude in Section 9.

2 Background

In this section, we provide the necessary background on non-blocking caches, cache DoS attacks, DRAM address mapping, and HugePages.

2.1 Non-Blocking Cache

Fig. 1: Internal organization of a shared L2 cache. Adapted from Figure 11.10 in [32].

In order to improve cache-level parallelism, modern processors employ non-blocking caches. Figure 1 shows the internal organization of a non-blocking L2 cache, and its two internal hardware structures, Miss-Status-Holding-Registers (MSHRs) and the WriteBack (WB) buffer.

On a non-blocking cache, when a cache miss occurs, an MSHR entry is allocated to record the miss-related information. The MSHR entry is cleared only when the requested cache line is returned from the lower levels of the memory hierarchy (e.g., LLC, DRAM). While an outstanding cache miss is being serviced, the cache can still serve memory requests from the CPU cores or higher-level caches. A non-blocking cache can thus support multiple outstanding cache misses, up to a limit set by the number of MSHR entries, which determines the cache’s memory-level parallelism (MLP). For the remainder of this paper, we use the terms local MLP and global MLP to refer to the number of MSHRs in a private cache and in the shared LLC, respectively.

The WriteBack buffer, on the other hand, holds dirty cache lines that have been evicted from the cache and need to be written back to the next level of the memory hierarchy. Because reads from memory, such as the cache refills triggered by cache misses, are generally more important for application performance, delaying writebacks while reads are being processed can improve system performance by reducing bus contention. The buffered writebacks are sent to memory when no reads are being serviced or when the buffer is full. In this way, a non-blocking cache can efficiently support concurrent accesses most of the time.

Note, however, that when either the MSHRs or the WriteBack buffer becomes full, the entire cache is blocked and rejects all subsequent requests until both structures again have free entries. Unfortunately, unblocking can take a relatively long time as it depends on the response times of the lower memory levels. In the worst case, it can take upwards of hundreds of CPU cycles when accesses to the slower main memory are required, and longer still when the DRAM controller is congested. Cache blocking is especially problematic in shared caches because it affects all cores that share the cache. Even if a task’s memory accesses are all cache hits, the task can still suffer massive slowdowns if the cache is blocked for a significant portion of the time.

2.2 Cache DoS Attacks

Recent works have demonstrated that the internal hardware structures of a non-blocking cache can be exploited to mount denial-of-service (DoS) attacks [35, 6]. Cache DoS attacks are software attacks that target cache internal hardware structures of a shared non-blocking cache. Generally, they are designed to generate as many cache misses and/or write-backs as fast as possible to overflow the cache internal hardware structures to induce cache blocking [35, 6]. When cache blocking occurs on a shared cache in a multicore processor, it affects all cores and can be devastating. For example, it has been shown that on a popular embedded multicore platform, the Raspberry Pi 3, a cache DoS attack could cause over 300X slowdown to a victim task [6], even when the victim task is running on a different core with its own dedicated cache partition and almost all of its memory accesses are cache hits.

2.3 Memory (DRAM) Address Mapping

A DRAM module is organized into ranks and each rank is further divided into multiple banks. A bank contains storage cells, which are organized in rows and columns in a 2D array-like structure. To access data in the storage cells, the corresponding row must be activated, which copies the data of the row into an intermediary buffer, called a row buffer, which acts as a cache. While in the row buffer, the data can be read from/written to efficiently. To access a different row, however, the current row needs to be closed (precharged). Since both activation and precharge take considerable time, accesses to different rows in the same bank can decrease memory performance.
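To a first approximation, a row-buffer hit pays only the column access latency, while a row conflict additionally pays for precharging the old row and activating the new one. With illustrative timing parameters (actual values vary across DRAM devices):

$$t_{\mathrm{hit}} \approx t_{CL}, \qquad t_{\mathrm{conflict}} \approx t_{RP} + t_{RCD} + t_{CL}$$

where $t_{RP}$ is the precharge delay, $t_{RCD}$ the activate-to-access delay, and $t_{CL}$ the column access latency. With, say, $t_{RP} \approx t_{RCD} \approx t_{CL} \approx 15$ ns, a row conflict takes roughly three times as long as a row hit.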

To access specific locations in memory, the system’s DRAM controller employs a memory address mapping scheme that translates a given physical address into DRAM-specific coordinates (module, rank, bank, row, and column). Because DRAM banks can be accessed in parallel, the mapping of DRAM banks is particularly important for performance: if multiple concurrent memory requests are spread over different banks, they can be processed efficiently in parallel; if, however, they are mapped to the same bank, the resulting bank conflicts will slow down memory performance.

Fig. 2: Bits used in the memory address mapping scheme on the Odroid XU4.

Figure 2 shows the address mapping information of the Odroid XU4 platform. On this platform, five physical address bits—7, 13, 14, 15, and 16—are used by the DRAM controller to map the 32 DRAM banks of the system’s main memory. Such DRAM bank mapping information can be obtained experimentally [29, 41] or from platform documentation. If one can control these physical address bits when allocating memory blocks, one can control which DRAM banks the allocated memory blocks reside in.
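As a concrete illustration, the bank index of a physical address on this platform can be computed as in the sketch below, using the five bank bits from Figure 2 (the bit order within the index is immaterial for grouping addresses by bank; the generic, bitmask-driven version used by our attacks appears later in Figure 8):

static inline int xu4_bank_index(unsigned long paddr)
{
    int b0 = (paddr >> 7)  & 0x1;   /* bank bit 7 */
    int b1 = (paddr >> 13) & 0xF;   /* bank bits 13-16 */
    return (b1 << 1) | b0;          /* 5 bits -> 32 banks */
}

Two addresses reside in the same DRAM bank exactly when this function returns the same value for both.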

2.4 HugePages

In a virtual memory-based system, memory is typically allocated at a 4KB page granularity. For large applications, however, the 4KB page size can become a performance bottleneck due to increased address translation overhead (TLB misses, page fault handling, etc.). To address these shortcomings, HugePage support was introduced in the Linux kernel (since v2.6), which offers larger page granularities when allocating memory (e.g., 2MB pages). This reduces the number of pages used by an application and the pressure on the CPU’s TLB, as each TLB entry covers a larger address range (2MB vs. 4KB). As the 64-bit ARM architecture has become mainstream, many ARM-based embedded platforms and operating systems now support HugePages. Note that HugePage support can be beneficial for embedded real-time systems, as it can significantly improve the time predictability of real-time applications.

Fig. 3: Virtual address mappings in 4KB and 2MB pages.

Figure 3 shows the virtual address mappings for 4KB and 2MB page granularities on a 32-bit system. With 4KB pages, the low 12 bits of an address form the page offset; with a 2MB hugepage, the low 21 bits form the offset.

Note that this means allocating a single 2MB hugepage lets a process control the lower 21 bits of the physical address, a much larger portion of the physical address space, without requiring any system privileges. In the following, we exploit this ability to control physical addresses to create effective cache DoS attacks.
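For illustration, the following is a minimal sketch of how an unprivileged process might obtain a 2MB huge page on Linux. This is our own sketch: the paper does not specify its allocation interface, and the code assumes hugetlb pages have been reserved by the administrator (e.g., via the vm.nr_hugepages sysctl); transparent huge pages are an alternative path not shown.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#define HUGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
    /* Request one 2MB huge page; no privileges are required,
     * only that hugetlb pages have been reserved system-wide. */
    void *buf = mmap(NULL, HUGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Bits [20:0] of any virtual address inside buf equal the
     * corresponding physical address bits. */
    munmap(buf, HUGE_SIZE);
    return 0;
}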

3 Threat Model

Fig. 4: Threat model. (Icons by icons8: https://icons8.com/)

We assume both the victim and attacker are co-located on the same multicore processor, but they run on their own dedicated cores, as shown in Figure 4. We assume that each core of the multicore processor has private caches, but all cores share a single shared last-level cache (LLC). We assume that the system’s runtime (OS and hypervisor) provides core, memory, and LLC space partitioning capabilities. For partitioning the LLC, we assume page coloring [41] is implemented in the runtime. We assume that the system supports HugePages. We assume that the attacker can obtain the system’s DRAM bank mapping information, via experimental methods [29, 41, 28, 24] or from a manual. Lastly, we assume that the attacker has no privileges on the system and can only run non-privileged code on its assigned attacker cores.

In this setting, the attacker’s primary goal is to delay execution time of the victim task by mounting denial-of-service attacks on the shared cache.

4 Impact of Memory Performance on Cache DoS Attacks

In this section, we provide experimental evidence of the impact of memory performance on the effectiveness of cache DoS attacks.

As discussed in Section 2, cache DoS attacks [35, 6] attempt to exhaust the internal hardware buffers of a shared non-blocking cache to induce cache blocking. The cache becomes unblocked only when its internal buffers free up, which requires accesses to main memory to complete. We therefore hypothesize that the longer a memory access takes, the longer the cache stays blocked. In other words, our hypothesis is that poor memory performance increases the effectiveness of cache DoS attacks.

To test this hypothesis, we perform the same cache DoS attack experiments as in [6]. That is, we measure the slowdown of a synthetic victim task on Core 0 while co-scheduling cache DoS attackers on Cores 1-3 of a Raspberry Pi 3. As before, we structure the tasks such that the victim’s working set fits inside the LLC while each attacker’s working set is larger than the LLC (so that the attackers’ accesses go to main memory), and we partition the LLC so that the attackers cannot directly affect the victim’s performance through cache evictions.

Unlike the previous experiments, however, we make use of the Pi 3’s ability to manually control the frequency of its LPDDR2 memory module by modifying the sdram_freq parameter in the config.txt file of the boot partition (e.g., sdram_freq=100). By default, the Pi 3’s memory operates at 900 MHz, but it can be configured to run anywhere between 100 and 1000 MHz. In our testing, we vary the memory frequency from 1000 MHz down to 100 MHz in steps of 100 MHz.

Fig. 5: Effect of memory frequency on the slowdown caused by cache DoS attacks.

Figure 5 shows the results. As expected, the operating frequency of the shared memory, and by proxy the memory’s performance, has a significant impact on the effectiveness of cache DoS attacks. As we decrease the frequency, the slowdown experienced by the victim task increases substantially. In particular, when we lower the frequency to only 100 MHz, we observe a more than 900X WCET increase for the victim task. Note that the victim was not supposed to be affected by slow DRAM at all, as its working set fits in its own cache partition. The reason for this “unexpected” massive WCET increase is that slow DRAM accesses prolong shared-cache blocking whenever the attackers generate many concurrent cache misses, even though the attackers run on different cores and access their own dedicated cache partitions.

Based on this finding, we hypothesize that DoS attacks can be more effective if they target main memory, in addition to the shared LLC, with the goal of decreasing its performance.

The simplest and most straightforward way to accomplish this would be to directly lower the performance of the main memory. In a real-world setting, however, attackers would have neither the required privileges nor the physical access to the target machine for such approaches to be viable.

Therefore, in this work, we focus on software approaches to create more effective DoS attacks by generating memory access patterns that are slower to process at the hardware level, and thereby improve the effectiveness of cache DoS attacks. To achieve this, we make use of a system’s memory address mapping information and HugePage support as we detail in the following section.

5 Memory-Aware Cache DoS Attack

In this section, we discuss memory access characteristics of prior cache DoS attacks and their limitations, followed by the proposed memory-aware cache DoS attacks.

5.1 Sequential Attack

for (i = 0; i < mem_size; i += LINE_SIZE) {
    sum += ptr[i];
}

(a) Read attack (BwRead)

for (i = 0; i < mem_size; i += LINE_SIZE) {
    ptr[i] = 0xff;
}

(b) Write attack (BwWrite)
Fig. 6: Sequential memory access attacks. LINE_SIZE = a cache-line size.

Figure 6 shows the code snippets of the prior cache DoS attacks [35, 6], which perform a series of sequential memory accesses over a large array.

The BwRead attack iteratively reads entries of a large one-dimensional array at a cache-line granularity (LINE_SIZE, typically 64 bytes). When mem_size is larger than the size of the LLC, it generates lots of cache-misses, which would access the main memory. On a modern processor, multiple cache-misses can occur concurrently—with the help of out-of-order execution and/or hardware prefetchers—which may stress the MSHRs of the LLC [35, 6].

The BwWrite attack operates in a similar manner, but instead writes a value to each array entry. This will then generate continuous store operations that can also be configured to intentionally miss the LLC. Again, multiple write misses can occur concurrently, which can stress both the MSHRs and the Writeback Buffer of the cache. This is because each missed write can generate up to two memory requests: a read for a cache linefill and a write for a cache writeback [6].

While these attacks are effective at generating a large number of concurrent cache misses—necessary to overflow the cache internal buffers—their sequential access pattern means that the resulting misses can be processed efficiently at the memory level. Concretely, successive memory blocks (64B cache lines) are likely to be allocated in the same DRAM row (e.g., 8KB, which holds 128 consecutive cache lines), so the DRAM can serve them without costly row switching. Such efficient processing at the memory is undesirable from the perspective of a cache DoS attack: the goal of the attack is to induce long cache blocking, and fast memory service shortens the blocking duration.

5.2 Parallel Linked-List Attack

To address the shortcomings of the sequential memory access-based cache DoS attacks, we first introduce parallel linked-list attacks, which generate concurrent random memory accesses.

static int *list[MAX_MLP];
static int next[MAX_MLP];

for (int64_t i = 0; i < iter; i++) {
    switch (mlp) {
    case MAX_MLP:
        ...
    case 2:
        next[1] = list[1][next[1]];
        /* fall-through */
    case 1:
        next[0] = list[0][next[0]];
    }
}

(a) Read attack (LatencyRead)

static int *list[MAX_MLP];
static int next[MAX_MLP];

for (int64_t i = 0; i < iter; i++) {
    switch (mlp) {
    case MAX_MLP:
        ...
    case 2:
        list[1][next[1]+1] = 0xff;
        next[1] = list[1][next[1]];
        /* fall-through */
    case 1:
        list[0][next[0]+1] = 0xff;
        next[0] = list[0][next[0]];
    }
}

(b) Write attack (LatencyWrite)
Fig. 7: Random memory access attacks. MAX_MLP = global MLP of the platform. List entries are randomly shuffled over a large address space.

Figure 7 shows the code snippets for the parallel linked-list attacks: LatencyRead for read and LatencyWrite for write. In both cases, the attacks traverse a set number of linked lists, which can be accessed concurrently on a modern out-of-order core because there is no data dependency between the entries of different lists. Each linked list is then randomly shuffled over a large memory space to prevent data prefetching. As such, the number of linked lists determines the degree of memory-level parallelism (MLP) of the attacks. Note that the parallel-linked list attacks are based on the MLP measurement code in [11].

Like the sequential access attacks, the parallel linked-list attacks are designed to generate concurrent cache misses, which stress the cache's internal hardware buffers and induce cache blocking. Unlike the sequential attacks, however, the successive memory requests produced by the cache misses are likely to map to different DRAM rows, forcing costly row switching and thus less efficient processing in memory. Note, however, that the entries of these linked lists may still map to different DRAM banks, and can then be processed in parallel by the DRAM. In that case, despite the overhead of frequent row switching, the concurrent cache misses may still be processed efficiently, which is undesirable from the perspective of a cache DoS attack.
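The traversal code in Figure 7 assumes each list has already been laid out as a randomly shuffled cycle. As a minimal sketch of that initialization (our illustration, assuming 64-byte cache lines and the int-indexed array layout of Figure 7), one list can be built as follows:

#include <stdlib.h>

#define WORDS_PER_LINE (64 / sizeof(int))  /* 16 ints per 64B line */

/* Lay out `nlines` cache-line-sized slots of `arr` as one randomly
 * ordered cycle, so that next = arr[next] visits every slot exactly
 * once per round in an order prefetchers cannot predict. */
void init_list(int *arr, int nlines)
{
    int *order = malloc(nlines * sizeof(int));
    for (int k = 0; k < nlines; k++)
        order[k] = k;
    for (int i = nlines - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (int k = 0; k < nlines; k++)         /* chain slots into a cycle */
        arr[order[k] * WORDS_PER_LINE] =
            order[(k + 1) % nlines] * WORDS_PER_LINE;
    free(order);
}

Traversal can then start at any slot, e.g., next = 0, and follow next = arr[next], touching a different cache line on each step.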

5.3 DRAM Bank-Aware Parallel Linked-List Attack

To overcome this limitation—efficient memory-level processing—of the prior cache DoS attacks, we propose a memory-aware cache DoS attack. It is based on the parallel linked-list attack code (Section 5.2) but differs in that the entries of the linked lists are constructed such that they are all allocated in the same DRAM bank. The rationale is that when multiple accesses target the same bank, they take longer to service at the DRAM because of the increased bank conflicts and frequent row switching.

To create such linked lists, the attack code allocates a set of 2MB huge pages. It then creates a user-defined number of linked lists and populates them with addresses specifically chosen to map to the same DRAM bank, using the system’s DRAM bank address mapping information (e.g., the Odroid XU4 we test uses physical address bits 13-16 as DRAM bank bits).

int paddr_to_color(unsigned long mask, unsigned long paddr)
{
    int color = 0;
    int idx = 0;
    int c;

    for_each_set_bit(c, &mask, sizeof(unsigned long) * 8) {
        if ((paddr >> c) & 0x1)
            color |= (1 << idx);
        idx++;
    }
    return color;
}
Fig. 8: Physical address coloring used for creating enhanced cache DoS attacks.

Figure 8 shows the address coloring code snippet used by the attacks to find addresses that map to the same memory bank. It checks, for a given address, the value of each bit specified in the mask bitmask, which encodes the platform’s physical address bits that select DRAM banks. If the returned color is zero, meaning the physical address maps to DRAM bank zero, we add the address to the linked list as a new entry; otherwise we discard the address and continue. Since 2MB pages are used for memory allocation, the attack code can effectively control both the physical and virtual addresses of the list entries: the lower 21 bits of a physical address are identical to the 21-bit offset of the corresponding virtual address. On our test platforms, this bit range was sufficient to control the DRAM bank mapping. As a result, all of the generated linked lists, and all of their entries, are allocated in the same memory bank, which in turn generates bank contention.
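Putting this together, the list construction can be sketched as follows. This is our illustration, not the paper's verbatim code: it assumes LINE_SIZE is the cache-line size, HUGE_SIZE is 2MB, huge_page points into a huge page allocated as in Section 2.4, and paddr_to_color is the routine from Figure 8 with a userspace stand-in for the kernel's for_each_set_bit macro. It builds a single list; the actual attacks build several and also shuffle the entries as in Section 5.2 so that successive accesses hit different rows of the targeted bank.

int build_bank0_list(int *huge_page, unsigned long bank_mask)
{
    int head = -1, prev = -1;

    /* Scan the 2MB huge page at cache-line granularity and keep only
     * the lines whose bank color is zero. The page offset stands in
     * for the physical address because bits [20:0] coincide. */
    for (unsigned long off = 0; off < HUGE_SIZE; off += LINE_SIZE) {
        if (paddr_to_color(bank_mask, off) != 0)
            continue;                   /* maps to another bank: discard */
        int idx = off / sizeof(int);    /* array index of this line */
        if (prev < 0)
            head = idx;
        else
            huge_page[prev] = idx;      /* link the previous entry here */
        prev = idx;
    }
    if (prev < 0)
        return -1;                      /* no matching line found */
    huge_page[prev] = head;             /* close the cycle */
    return head;                        /* traverse via next = p[next] */
}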

One notable shortcoming of the proposed memory-aware attacks is that they are ineffective when mounted from in-order cores, such as the Cortex-A53 used by the Raspberry Pi 3. An in-order core cannot traverse multiple linked lists concurrently, and thus cannot generate enough concurrent memory requests to induce cache blocking. We therefore target out-of-order core architectures, which are inherently capable of issuing multiple concurrent memory requests, so that our attacks can reliably induce cache blocking.

6 Evaluation

In this section, we evaluate the effectiveness of the proposed memory-aware cache DoS attacks on two embedded multicore platforms using both synthetic and real-world applications.

6.1 Embedded Multicore Platforms

Platform        | Odroid XU4                       | Raspberry Pi 4 Model B
SoC             | Exynos5422                       | BCM2711
CPU             | 4x Cortex-A7   | 4x Cortex-A15   | 4x Cortex-A72
                | in-order       | out-of-order    | out-of-order
                | 1.4GHz         | 2.0GHz          | 1.5GHz
Private Cache   | 32/32KB        | 32/32KB         | 32/32KB
Shared Cache    | 512KB (16-way) | 2MB (16-way)    | 512KB (16-way)
Local MLP       | 1              | 6               | 6
Global MLP      | 4              | 11              | 19
Memory          | 2GB LPDDR3                       | 4GB LPDDR4
(Peak BW)       | (14.9GB/s)                       | (25.6GB/s)
DRAM bank bits  | 13, 14, 15, 16                   | 8, 11, 12, 13, 14
(Bitmask)       | (0x1E000)                        | (0x7900)

TABLE I: Compared embedded multicore platforms.

We deploy our DoS attacks on two embedded multicore platforms: an Odroid XU4 and a Raspberry Pi 4 Model B. The Odroid XU4 employs a big.LITTLE processor configuration comprised of a smaller in-order 4x Cortex-A7 [4] cluster and a larger out-of-order 4x Cortex-A15 [3] cluster. We do not use the Cortex-A7 cluster on the Odroid XU4, as we are primarily focused on out-of-order architectures. The second platform, the Raspberry Pi 4, has a single cluster of four out-of-order Cortex-A72 cores. The Cortex-A15 and Cortex-A72 each have a local MLP of 6; the A15 has a global MLP of 11 [3, 35], and the A72 has a global MLP of 19 [5]. We verified the global MLP of the A72 cluster using the MLP micro-benchmark described in Appendix A of [35]. For both systems, we use platform-specific DRAM bank bitmasks to construct the enhanced attacks’ linked lists; we reverse-engineered the address mapping schemes of both platforms using the same technique as in [29, 41]. Table I shows the characteristics of the tested platforms. The XU4 runs Ubuntu 18.04 with Linux kernel 4.14, while the Pi 4 runs Raspbian Buster with Linux kernel 4.19; at the time of our experiments, these were the latest officially supported Linux versions for each platform.

6.2 Synthetic Workloads

(a) Odroid XU4(A15)
(b) Raspberry Pi 4 (A72)
Fig. 9: Impacts of cache DoS attacks on the BwRead victim on two multicore platforms. The victim (BwRead) runs on Core 0 while the attackers (X-axis) run on Cores 1-3.
(a) Odroid XU4(A15)
(b) Raspberry Pi 4 (A72)
Fig. 10: Impacts of cache DoS attacks on the LatencyRead victim on two multicore platforms. The victim (LatencyRead) runs on Core 0 while the attackers (X-axis) run on Cores 1-3.

The experimental setup is as follows: we run each victim task alone on a single core, Core 0, to measure its solo response time. We then run the victim task alongside up to three instances of each attacker, scheduled on Cores 1-3, and measure the response times to determine the slowdown each attack caused on the victim relative to the solo case.

For the victim tasks, we use two of the cache DoS attackers—namely BwRead (Section 5.1) and LatencyRead (Section 5.2)—because they differ in their access patterns: BwRead performs sequential accesses while LatencyRead performs random accesses. We configure both victims to fit inside the LLC of each tested platform.

For the attackers, we employ all three cache DoS attack types discussed in Section 5, each in a read-intensive and a write-intensive variant, for a total of six attacking tasks. We configure all attackers’ working set sizes to be larger than the LLC so that they effectively generate shared cache misses.

Figure 9 shows the impacts of the cache DoS attacks on the BwRead victim. On the Odroid XU4’s Cortex-A15 cluster, the DRAM bank-aware BankRead and BankWrite attackers caused the largest slowdowns to the victim task: the worst-case slowdown of the BwRead victim was 30X when paired with three BankRead attackers, and 56X with three BankWrite attackers. On the Raspberry Pi 4, however, the BwRead victim suffers little to no interference from any of the DoS attackers. We believe this difference is due to a combination of two factors. First, the Pi 4 is equipped with a more capable LPDDR4 memory module than the XU4’s LPDDR3 module, which allows faster memory accesses on the Pi 4 and ultimately reduces the amount of time the LLC spends blocked, as explained in Section 4. Second, the Pi 4’s Cortex-A72 employs a stride-based L1D hardware prefetcher that is triggered after accesses to consecutive cache lines [5]. As such, the sequential access pattern of the BwRead victim makes the L1 prefetcher very effective on the Pi 4. The Cortex-A15 of the Odroid XU4, on the other hand, has no L1D prefetcher according to [3]. Hence, the XU4’s victim is more vulnerable to LLC blocking than the Pi 4’s.

Figure 10 shows the impacts of the cache DoS attacks on the LatencyRead victim. On the Odroid XU4, the BankRead and BankWrite attackers achieve up to 40X and 75X slowdowns, respectively, of the LatencyRead victim task. Interestingly, unlike the BwRead victim in Figure 9, the LatencyRead victim is also significantly affected by the BwRead and BwWrite attackers, suffering up to a 33.5X slowdown. This is likely because the random access pattern of the LatencyRead victim reduces the effectiveness of the Cortex-A72 cores’ L1D prefetcher.

6.3 Impact of Cache Partitioning

In this section, we evaluate the effectiveness of a standard cache isolation mechanism in protecting victim performance from the BankRead and BankWrite attacks. Specifically, we employ a kernel-level memory allocator called PALLOC [41], which uses a page coloring technique to control the cache space in which a process’s data is allocated. Using PALLOC, we partition the L2 cache of the XU4 such that each core is given its own private quarter of the cache space. We then run the same LatencyRead vs. BankRead/BankWrite experiments on the XU4’s A15 cluster, as that is where we observed the most interference from our enhanced DoS attacks.

(a) Odroid XU4(A15)
(b) Raspberry Pi 4(A72)
Fig. 11: Effect of cache partitioning on BankRead and BankWrite impacts to a LatencyRead victim.

Figure 11(a) shows the results. Partitioning the cache is ineffective at protecting the performance of the LatencyRead victim: the observed slowdown is practically the same regardless of whether the LLC is partitioned. This is consistent with previous findings [35, 6], as partitioning the cache space does not partition the cache’s internal hardware structures, namely the MSHRs and the WriteBack buffer. The DoS attacks can therefore still induce massive shared-cache blocking despite cache (space) partitioning.

6.4 Impact to Real-World Applications

We also evaluate the effectiveness of our enhanced DoS attacks against real-world applications to assess their practical feasibility.

6.4.1 End-to-End Deep Learning Based Autonomous Vehicle Control

We begin by running the DoS attacks against the DNN used by NVIDIA’s DAVE-2 system [8] and the DeepPicar autonomous car platform [7]. We employ the same basic experimental setup: a single victim instance runs alone on Core 0, and then alongside one to three cache DoS attackers on Cores 1-3. For the DNN, we use 1000 video frames as input and calculate the average frame inference time.

(a) Odroid XU4(A15)
(b) Raspberry Pi 4(A72)
Fig. 12: Impacts of cache DoS attacks on the DeepPicar DNN control task on two multicore platforms. The DNN victim runs on Core 0 while the attackers (X-axis) run on Cores 1-3.

Figure 12 shows the results. On both the XU4’s A15 cluster and the Pi 4, the DNN experiences noticeably more slowdown when run alongside the BankRead and BankWrite attackers. This is especially true for the BankWrite attackers, which cause a 5X slowdown on both platforms. Compared to the synthetic victims, though, the magnitude of the slowdown experienced by the DNN is much smaller. This is because the DNN task accesses the shared LLC less frequently than the synthetic victims and is thus less impacted by the cache DoS attacks.

6.4.2 Real-World Benchmark Suites

We also deploy DoS attacks against real-world benchmarks from the SPEC2017 [2] and SD-VBS [37] benchmark suites. In total, we use 32 benchmarks as victim tasks: 23 from the SPEC2017 suite and all 9 from the SD-VBS suite. For all benchmarks, we employ the same experimental methodology as in Section 6.2: we measure the victim’s execution time first alone in isolation and then together with three instances of each of the three DoS attacker types. For the attackers, we employ only the write versions (BwWrite, LatencyWrite, BankWrite), as they have been shown to generate more contention than their respective read versions.

(a) Odroid XU4 (A15)
(b) Raspberry Pi 4 (A72)
Fig. 13: Impacts of cache write DoS attacks on SPEC2017 and SD-VBS benchmarks. All benchmark victims run on Core 0 while the cache DoS attackers (X-axis) run on Core 1-3.

Figure 13(a) shows the results for the XU4, while Figure 13(b) shows the results for the Pi 4. Like the DNN, the SPEC2017 and SD-VBS benchmarks suffer the most from the memory-aware cache DoS attack. On the XU4, our best memory-aware attack, BankWrite, achieves a geometric mean of 7.5X slowdown (up to 23.8X for cactuBSSN), which is 90% and 123% better than the prior BwWrite and LatencyWrite attacks, respectively. On the Pi 4, BankWrite achieves a geometric mean of 6X slowdown (up to 21.4X for bwaves), which is 43% and 49% better than the BwWrite and LatencyWrite attacks, respectively.

In summary, we find that proposed memory-aware attacks are substantially more effective than prior cache DoS attacks in increasing the execution times of real-world applications.

7 Discussion

In this section, we discuss system design choices and mechanisms that can effectively prevent the cache DoS attacks introduced in this paper. In particular, we focus on the two assumptions that our attacks rely on and on a hardware design choice that can eliminate MSHR contention.

The key assumption underlying our attacks is that the memory address mapping scheme used by the system is known or can be determined by the attacker. Our memory-aware DoS attacks can therefore be hindered by making such mapping information difficult to determine. For example, employing an XOR addressing scheme [43], and the variations used in recent Intel processors, can defeat our current memory-aware cache DoS attacks.
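As an illustration of why such schemes help, consider a sketch of the permutation-based mapping idea of [43] (illustrative bit positions of our choosing, not any particular processor's scheme):

static inline int xor_bank_index(unsigned long paddr)
{
    int bank = (paddr >> 13) & 0xF;  /* nominal bank bits 13-16 */
    int row  = (paddr >> 21) & 0xF;  /* row bits above the 2MB boundary */
    return bank ^ row;               /* effective bank index */
}

Because the row bits above bit 21 lie outside the range a 2MB huge page lets an unprivileged attacker control, addresses with identical bits 13-16 no longer reliably map to the same bank, defeating the bank-aware list construction of Section 5.3.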

Another assumption of our attacks is that HugePages are enabled and can be allocated by the attacker. Without HugePage support, a non-privileged attacker can only control a 4KB address range, which is insufficient to control DRAM bank allocation on most platforms. Therefore, disabling HugePages, or making them accessible only to privileged users, can also defeat our memory-aware DoS attacks, though doing so could increase timing variability due to higher TLB pressure and page fault handling overhead.

Partitioning the cache’s internal structures (MSHRs, WriteBack buffer), as proposed in [35, 9], can fundamentally prevent our DoS attacks, although it requires hardware modifications and is not available in COTS processors. Lastly, memory bandwidth throttling can help protect against our cache DoS attacks, as shown by the OS-based solution presented in [6].

8 Related Work

Most prior work on improving isolation in the real-time community has focused on partitioning shared cache space, and various cache partitioning mechanisms and policies have been studied [17, 38, 16, 25, 15, 39, 22, 34, 18, 10]. However, as shown in prior work [35, 6] and in our experiments in Section 6.3, such partitioning techniques are ineffective against DoS attacks that target the cache’s internal hardware structures, since those structures remain shared among all cores.

Denial-of-service (DoS) attacks have been studied for several types of shared resources in multicore systems. Moscibroda et al. demonstrated DoS attacks on memory (DRAM) controllers. In particular, they found that the widely used FR-FCFS [30] scheduling algorithm, which prioritizes row hits, is susceptible to DoS attacks, and they proposed “fair” scheduling algorithms, which have since been adopted in many memory controllers [26, 27, 19, 33]. Keramidas et al. studied DoS attacks on cache space and proposed a cache replacement policy that allocates less space to such attackers (or cache-“hungry” threads) [14]. Woo et al. investigated DoS attacks on cache bus (between L1 and L2) bandwidth, main memory (front-side) bus bandwidth, and shared cache space on a simulated multicore platform [40]. More recent work focused on the internal hardware buffers of shared non-blocking caches and demonstrated the effectiveness and severity of cache DoS attacks [6, 7, 35]. The work presented here significantly improves on these prior cache DoS attacks by taking advantage of memory address mapping information and HugePage support.

Recently, micro-architectural timing-channel attacks have gained notoriety, as seen in the Meltdown, Spectre, Foreshadow, and ZombieLoad attacks [23, 20, 36, 31]. These attacks generally work by measuring timing differences when accessing certain micro-architectural resources (e.g., caches) whose states have been altered in such a way as to leak secrets. For example, Jiang et al. [13] showed that knowledge of L1 cache bank layouts can be leveraged for a timing-channel attack that intentionally generates L1 bank contention; they were able to retrieve a full 128-bit AES encryption key in less than three minutes. While these attacks also target micro-architectural resources, they differ from DoS attacks in that their goal is to leak secrets rather than to impact real-time performance.

9 Conclusion

In this paper, we introduced memory-aware cache DoS attacks that leverage a system’s memory address mapping information and HugePage support to induce prolonged cache blocking by intentionally creating DRAM bank congestion. Through extensive experiments on two popular embedded multicore platforms, we showed that our new DoS attacks generate significantly larger timing impacts on cross-core victim tasks than prior cache DoS attacks. For future work, we plan to launch our attacks on platforms that employ more sophisticated XOR address mapping schemes and to evaluate their feasibility on server and cloud platforms.

References

  • [1] Memory system in gem5. http://www.gem5.org/docs/html/gem5MemorySystem.html.
  • [2] SPEC CPU2017. https://www.spec.org/cpu2017.
  • [3] ARM. Cortex™-A15 Technical Reference Manual, Rev: r4p0, 2011.
  • [4] ARM. Cortex™-A7 Technical Reference Manual, Rev: r0p5, 2012.
  • [5] ARM. Cortex™-A72 Technical Reference Manual, Rev: r0p3, 2016.
  • [6] Michael G Bechtel and Heechul Yun. Denial-of-service attacks on shared cache in multicore: Analysis and prevention. In RTAS, 2019.
  • [7] Michael Garrett Bechtel, Elise McEllhiney, Minje Kim, and Heechul Yun. DeepPicar: A Low-cost Deep Neural Network-based Autonomous Car. In RTCSA, 2018.
  • [8] Mariusz Bojarski et al. End-to-End Learning for Self-Driving Cars. arXiv preprint arXiv:1604.07316, 2016.
  • [9] Thomas Bourgeat, Ilia Lebedev, Andrew Wright, Sizhuo Zhang, and Srinivas Devadas. Mi6: Secure enclaves in a speculative out-of-order processor. In MICRO, 2019.
  • [10] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA, 2005.
  • [11] D. Eklov, N. Nikolakis, D. Black-Schaffer, and E. Hagersten. Bandwidth bandit: quantitative characterization of memory contention. In PACT, 2012.
  • [12] Arne Hamann. Industrial challenges: Moving from classical to high performance real-time systems. In WATERS, 2018.
  • [13] Zhen Hang Jiang, Yunsi Fei, and David Kaeli. A novel side-channel timing attack on gpus. In GLSVLSI, 2017.
  • [14] Georgios Keramidas, Pavlos Petoumenos, Stefanos Kaxiras, Alexandros Antonopoulos, and Dimitrios Serpanos. Preventing denial-of-service attacks in shared cmp caches. In SAMOS, 2006.
  • [15] Richard E Kessler and Mark D Hill. Page placement algorithms for large real-indexed caches. TOCS, 1992.
  • [16] H. Kim, A. Kandhalu, and R. Rajkumar. A coordinated approach for practical os-level cache management in multi-core real-time systems. In ECRTS, 2013.
  • [17] Namhoon Kim, Bryan C Ward, Micaiah Chisholm, James H Anderson, and F Donelson Smith. Attacking the one-out-of-m multicore problem by combining hardware management with mixed-criticality provisioning. Real-Time Systems, 2017.
  • [18] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In PACT, 2004.
  • [19] Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter. Thread cluster memory scheduling: Exploiting differences in memory access behavior. In MICRO, 2010.
  • [20] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. arXiv preprint, 2018.
  • [21] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In ISCA, 1981.
  • [22] Jochen Liedtke, Hermann Hartig, and Michael Hohmuth. Os-controlled cache predictability for real-time systems. In RTAS, 1997.
  • [23] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown. arXiv preprint, 2018.
  • [24] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu. A software memory partition approach for eliminating bank-level interference in multicore systems. In PACT, 2012.
  • [25] R. Mancuso, R. Dudko, E. Betti, M. Cesati, M. Caccamo, and R. Pellizzoni. Real-Time Cache Management Framework for Multi-core Architectures. In RTAS, 2013.
  • [26] Onur Mutlu and Thomas Moscibroda. Stall-time fair memory access scheduling for chip multiprocessors. In MICRO, 2007.
  • [27] Onur Mutlu and Thomas Moscibroda. Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared dram systems. In ISCA, 2008.
  • [28] H. Park, S. Baek, J. Choi, D. Lee, and S. Noh. Regularities Considered Harmful: Forcing Randomness to Memory Accesses to Reduce Row Buffer Conflicts for Multi-core, Multi-bank Systems. In ASPLOS, 2013.
  • [29] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. Drama: Exploiting dram addressing for cross-cpu attacks. In USENIX Security Symposium, 2016.
  • [30] S. Rixner, W. J Dally, U. J Kapasi, P. Mattson, and J. Owens. Memory access scheduling. In ACM SIGARCH Computer Architecture News, 2000.
  • [31] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. Zombieload: Cross-privilege-boundary data sampling. In CCS, 2019.
  • [32] John Paul Shen and Mikko H Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press, 2013.
  • [33] Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu. Mise: Providing performance predictability and improving fairness in shared main memory systems. In HPCA, 2013.
  • [34] G Edward Suh, Srinivas Devadas, and Larry Rudolph. A new memory monitoring scheme for memory-aware scheduling and partitioning. In HPCA, 2002.
  • [35] Prathap Kumar Valsan, Heechul Yun, and Farzad Farshchi. Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems. In RTAS, 2016.
  • [36] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F Wenisch, Yuval Yarom, Raoul Strackx, and Ku Leuven. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security Symposium, 2018.
  • [37] Sravanthi Kota Venkata, Ikkjin Ahn, Donghwan Jeon, Anshuman Gupta, Christopher Louie, Saturnino Garcia, Serge Belongie, and Michael Bedford Taylor. Sd-vbs: The san diego vision benchmark suite. In IISWC, 2009.
  • [38] B. Ward, J. Herman, C. Kenna, and J. Anderson. Making Shared Caches More Predictable on Multicore Platforms. In ECRTS, 2013.
  • [39] Andrew Wolfe. Software-based cache partitioning for real-time applications. Journal of Computer and Software Engineering, 1994.
  • [40] D Hyuk Woo and HH Lee. Analyzing performance vulnerability due to resource denial of service attack on chip multiprocessors. In CMP-MSI, 2007.
  • [41] H. Yun, R. Mancuso, Z. Wu, and R. Pellizzoni. PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms. In RTAS, pages 155–166, 2014.
  • [42] Heechul Yun, R. Pellizzoni, and P. Valsan. Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems. In ECRTS, 2015.
  • [43] Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality. In MICRO, 2000.