Chunked-Cache: On-Demand and Scalable Cache Isolation for Security Architectures

Shared cache resources in multi-core processors are vulnerable to cache side-channel attacks. Recently proposed defenses each have their own caveats: randomization-based defenses do not fundamentally address the root cause of cache side-channel attacks, and thus remain vulnerable to evolving attack algorithms besides relying on weak cryptographic primitives. Cache partitioning defenses, on the other hand, provide strict resource partitioning and effectively block all side-channel threats. However, they usually rely on way-based partitioning, which is not fine-grained and cannot scale to support a larger number of protection domains, e.g., in trusted execution environment (TEE) security architectures, besides degrading performance and often underutilizing the cache. To overcome the shortcomings of both approaches, we present Chunked-Cache, a novel and flexible set-associative cache partitioning design for TEE architectures. Chunked-Cache enables an execution context to "carve" out an exclusive, configurable chunk of the cache if its execution requires side-channel resilience. If side-channel resilience is not required, mainstream cache resources can be freely utilized. Our solution thus addresses the security-performance trade-off practically by enabling selective and on-demand utilization of side-channel-resilient caches, while providing well-grounded, future-proof security guarantees. We show that Chunked-Cache provides side-channel-resilient cache utilization for sensitive code execution with small hardware overhead, while incurring no performance overhead on the OS. We also show that it outperforms conventional way-based cache partitioning by 43%, while scaling significantly better to support a larger number of protection domains.


I Introduction

The outbreak of micro-architectural attacks has demonstrated the crucial implications of performance-boosting processor optimizations on the security of our computing platforms [53, 80, 60, 56, 52, 65, 102, 30, 27, 26, 58, 2, 1, 90, 68, 92, 94, 14, 15, 83, 13]. One of the most popular features, and also the subject of many recent attacks, are shared resources such as caches. Caches provide orders-of-magnitude faster memory accesses and large last-level-caches (LLCs) are usually shared across multiple processor cores to maximize utilization.

The Problem with Caches

When a sensitive (victim) and malicious (adversary) application run simultaneously on different cores and share the LLC, cache side channels can be exploited by the adversary to leak sensitive information, such as private keys. The timing difference between a cache hit and miss – which is why caches are used in the first place – is the most commonly exploited side channel to infer the memory access patterns of a victim application [37, 101, 34, 44, 33, 43, 46, 64, 31, 71, 35, 36, 100, 93]. In typical side-channel attacks [71, 43, 46, 64, 37, 101] the adversary deduces the victim’s memory access patterns by exploiting that both the victim and adversary compete for shared set-associative cache resources, which are designed in such a way that a larger number of memory lines are mapped to a smaller number of cache ways/entries in each cache set.
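To make this mapping concrete, the following C sketch (ours, with illustrative parameters not tied to any specific processor) shows how a conventional set-associative cache derives the set index from a physical address:

```c
#include <stdint.h>

/* Illustrative parameters: 64-byte lines, 16,384 sets (e.g., a 16 MB,
 * 16-way LLC). These are assumptions for illustration only. */
#define LINE_BITS 6                       /* log2(64-byte cache line) */
#define SET_BITS  14                      /* log2(16,384 sets)        */
#define SET_MASK  ((1u << SET_BITS) - 1)

/* A conventional cache drops the line-offset bits and uses the next
 * SET_BITS bits of the physical address as the set index. All memory
 * lines sharing these index bits compete for the same few ways of one
 * set, which is exactly the contention that side-channel attacks exploit. */
static inline uint32_t set_index(uint64_t paddr) {
    return (uint32_t)((paddr >> LINE_BITS) & SET_MASK);
}
```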

Besides compromising cryptographic implementations [6, 64, 71, 101], more recent attacks have had even stealthier impact such as bypassing address space layout randomization (ASLR) or leaking privacy-sensitive human genome indexing computation [33, 31, 12, 34, 35], leaving millions of platforms vulnerable. Even trusted execution environment (TEE) security architectures which aim to protect sensitive services by compartmentalizing them in isolated execution contexts, called enclaves, e.g., Intel SGX [41, 19] or ARM TrustZone [3], have been shown vulnerable to these attacks, thereby undermining their acclaimed privacy and isolation guarantees [12, 85, 69, 29, 59, 103]. This is alarming since TEE architectures are now widely deployed by major cloud providers, e.g., Microsoft Azure, Google Cloud, Alibaba Cloud and IBM Cloud, to offer confidential computing, where sensitive workloads are protected in enclaves.

The Problem with Recent Cache Defenses

To mitigate cache side-channel attacks, various approaches have been proposed over the years. These solutions range from time-constant cryptographic implementations [24, 25, 55] to software- and hardware-based approaches that modify the cache organization itself. The latter can be broadly classified into either cache partitioning [28, 96, 51, 61, 21] or randomization-based techniques [63, 91, 78, 79, 98, 89] that attempt to obfuscate the relationship between a memory address and the cache location to which it is mapped.

More recently, various schemes for a randomized memory-to-LLC mapping, such as CEASER, ScatterCache, and PhantomCache [91, 78, 79, 98, 89], have been proposed to mitigate these attacks by obfuscating the adversary's view of which cache lines actually get evicted. However, such defenses continue to evict cache lines from a small number of locations in a shared cache, so cache set-based conflicts essentially still occur. While these defenses were shown effective against the eviction set construction algorithms and techniques of their time, subsequent, more efficient eviction set construction algorithms [79] were able to undermine them. Consequently, enhancements to these defenses were proposed [79], only to be rendered ineffective again by yet another attack vector, e.g., weak low-latency cryptographic primitives [75, 8], or alternative attack techniques that exploit design/implementation flaws in the proposed defenses [88].

Caught in an arms race, randomization-based defenses remain only as good as the best attack technique known at the time; they are constructed to mitigate very specific side channels and attack strategies [10], with no future-proof, well-grounded security guarantees. They only make attacks computationally more difficult, but do not address their fundamental root cause, i.e., the sharing of set-associative caches across mutually distrusting processes. These schemes also assume that all execution contexts require side-channel resilience and provide no mechanisms for configuring side-channel resilience selectively, thus taxing the entire system with the resulting performance impact. In practice, however, only a small portion of the workload is usually security-/privacy-sensitive and requires this sophisticated security guarantee.

On the other hand, strict partitioning approaches promise well-grounded security guarantees due to their cache isolation across different execution contexts. However, these approaches usually rely on conventional way-based partitioning [4, 57, 96, 51, 21], and thus, are not fine-grained, cannot scale with an increasing number of execution contexts and large LLCs, or do not provide support for shared memory.

With these limitations in mind, we argue that a more future-proof and practical approach for side-channel resilient cache computing is to address the root cause of these attacks, namely, sharing set-associative cache structures across mutually distrusting execution contexts. Meanwhile, performance, usability, flexibility and scalability should still be preserved. We further observe that, in practice, cache side-channel resilience is most prominently a concern in dedicated security architectures, e.g., TEE security architectures. Thus, it is crucial to develop side-channel-resilient cache designs that cater for the security/functionality requirements of these architectures, e.g., with integrated support for enabling the side-channel resilience (and the performance cost) only for specific execution contexts that require it.

Our Goals

In this work, we aim to selectively enforce clean partitioning of the cache resources across mutually distrusting execution contexts that require side-channel resilience, such that all side channels are blocked (including stealthy cache occupancy channels [86] which are not mitigated by recent works [98, 21]), while maintaining the desired performance requirements.

To address this performance-security trade-off, we propose a new cache design for TEE security architectures, which we call Chunked-Cache, that enables each execution context or domain to “carve” out its exclusive cache sets, if desired. These sets essentially constitute an independent set-associative cache, which we call the domain’s cache chunk, that this domain can utilize exclusively but fully and efficiently, unlike in cache partitioning, e.g., way-based partitioning. A domain can flexibly request and configure 1.) whether it requires side-channel-resilient cache utilization, 2.) for which memory regions, and 3.) the required capacity of this exclusive side-channel-resilient cache chunk. Memory accesses by a domain that requires side-channel-resilient cache utilization are mapped exclusively to its cache chunk, while mainstream cache resources are freely and conventionally utilized whenever side-channel-resilience is not required. Enabling this on-demand flexibility per domain practically requires addressing multiple key challenges. Firstly, efficient design mechanisms are required to configure the memory-to-set mapping at run time for each domain depending on its chunk capacity, while preserving conventional cache behavior for the rest of the execution. Secondly, it must be ensured that the operating system performance is not degraded as cache sets get allocated exclusively to domains. Finally, seamless support must be provided for shared memory between domains to meet the security and functionality requirements of different sensitive applications.

Our Contributions

Our main contributions are as follows:


  • We present Chunked-Cache, a novel cache architecture for TEE security architectures, which enables a selective, flexible and scalable configuration of side-channel resilient caches for execution domains, without degrading the OS performance.

  • We address the performance-security trade-off by enforcing clean cache partitioning that blocks all cache side channels by allocating exclusive cache chunks for different domains. In doing so, future-proof and solid security assurances are guaranteed, while still preserving performance, functionality and compatibility requirements.

  • We extensively evaluate the cycle-accurate performance overhead of Chunked-Cache for compute-intensive SPEC CPU2017 workloads and I/O-intensive real-world applications. We show that it outperforms shared cache utilization in some cases, that the OS performance even improves owing to Chunked-Cache’s flexible cache utilization, and that Chunked-Cache outperforms partitioning (way-based) by 43% while also scaling better to support a larger number of protection domains.

  • We implement and evaluate a hardware prototype of Chunked-Cache. We show that it incurs a minimal 2.3% memory overhead relative to a 16 MB LLC, 1.6% logic overhead relative to a single-core RISC-V processor, and 12.3% LLC power consumption overhead.

II Cache Attacks & Defenses

Next, we briefly introduce recent cache side-channel attacks that are relevant for our work and summarize the shortcomings of recent defenses that our work overcomes.

II-A Cache Side-Channel Attacks

Cache side-channel attacks have been shown to constitute a profound threat that underlies popular attacks such as Spectre [53] and Meltdown [60], besides threatening a wide spectrum of platforms and architectures [59, 64, 43, 104], and even TEE architectures [12, 85, 69, 29, 59, 103]. The attacks usually work by provoking controlled evictions of the victim’s cache line, such that the inherent information leakage from the access-timing difference between cache hits and misses can be exploited by the adversary. This can be achieved using three main approaches:


  • Access-based approaches where the target address is explicitly accessed and flushed [37, 101, 34, 44, 33].

  • Conflict-based approaches where the adversary triggers controlled cache contention in the same cache set as the target address to evict the corresponding victim cache lines [71, 43, 46, 101, 64, 100, 22, 31, 36, 93, 6, 9] (illustrated in the sketch after this list).

  • Occupancy-based approaches [86] where the adversary observes an eviction of its own cache lines and uses this information to infer the size of the victim’s working set.
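To illustrate the conflict-based approach, here is a minimal Prime+Probe-style sketch in C. The eviction set, the timing threshold and the x86 timer are assumptions for illustration; constructing a minimal eviction set is itself the hard part of a real attack and is omitted here.

```c
#include <stdint.h>
#include <x86intrin.h>    /* __rdtscp; assumes an x86 build */

#define WAYS 16
#define HIT_THRESHOLD 80  /* illustrative cycle threshold, platform-specific */

/* ev_set: addresses previously found to map to the victim's cache set
 * (eviction-set construction is assumed to have happened already). */
int probe_reveals_victim_access(volatile uint8_t *ev_set[WAYS]) {
    unsigned aux;
    /* Prime: fill the target set with the attacker's own lines. */
    for (int i = 0; i < WAYS; i++)
        (void)*ev_set[i];

    /* ... victim runs; touching the monitored set evicts our lines ... */

    /* Probe: re-access and time each line; a slow access implies the
     * victim displaced it, leaking one bit of its access pattern. */
    for (int i = 0; i < WAYS; i++) {
        uint64_t t0 = __rdtscp(&aux);
        (void)*ev_set[i];
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 > HIT_THRESHOLD)
            return 1;
    }
    return 0;
}
```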

II-B Recent Defenses and their Shortcomings

Various defenses against side-channel attacks have been proposed, focusing on access-based and conflict-based attacks.

Side-channel Resilient Implementation

This aims at implementing algorithms, e.g., cryptographic algorithms, in a time-constant (and thus side-channel-resilient) fashion [42, 5]. Time-constant algorithms vary between hardware platforms [17] and require considerable effort that does not generalize or scale to all software.

Attack Detection

Other approaches aim to detect attacks in progress by observing hardware performance counters (e.g., cache miss rates) [16, 73] and killing the suspicious process. However, since they are based on heuristics, attacks can only be discovered with a certain probability and no guaranteed protection is provided. Moreover, some attacks have been shown not to cause abnormal cache behavior [34].

Noisy Measurements

Another group of defenses aims to impede a successful attack by preventing the adversary from performing precise time measurements, e.g., by restricting access to timers [72, 74, 67], by injecting noise into the system [95, 39] or by deliberately slowing down the system clock [38, 66]. However, workarounds have been found to create timers [84] or to perform attacks without relying on timers [23]. Moreover, such defenses cannot protect TEE architectures since they assume a strong adversary that can compromise the OS kernel and circumvent such restrictions.

Cache-level Defenses

Other approaches tackle the side-channel problem directly where it originates, i.e., at the cache level. These defenses fall under one of two paradigms: 1.) randomized cache line mapping to make the attacks computationally impractical [91, 78, 79, 98, 89, 97, 63, 62] or 2.) cache partitioning to provide strict isolation [28, 50, 103, 61, 20, 32, 105, 47, 99, 57, 4, 96, 51, 97, 21]. We discuss the works most related to Chunked-Cache in more detail in Section VII.

Randomization-based defenses cannot provide comprehensive, future-proof security guarantees: advances in attack strategies and minimal eviction set construction techniques, as well as alternative attack techniques, have been shown to undermine such defenses [79, 10, 76, 75, 88]. Moreover, many rely on cryptographic primitives that have been shown vulnerable to cryptanalysis, while deploying more secure primitives would further degrade performance [8, 75].

Cache partitioning defenses provide strict resource isolation, which allows solid security guarantees on side-channel protection to be given. However, existing partitioning defenses suffer from high performance penalties, restrictive and inflexible cache utilization [97], and an inability to scale with a larger number of protection domains [96, 51, 32]. Several approaches do not directly cater for the use of shared libraries [28, 96], are architecture-specific [47, 99] or do not defend against occupancy-based attacks. Memory page coloring approaches [20, 50, 28] are impractical since they require invasive modifications to the memory management of commodity software and cannot sufficiently support Direct Memory Access (DMA). Most importantly, existing partitioning defenses to date apply their side-channel cache protection to the entire execution workload, impacting overall system performance, even though this protection is not required in most scenarios.

To fundamentally address all these shortcomings, we propose a modified cache microarchitecture, which we call Chunked-Cache, that provides strict, yet configurable, partitioning across mutually distrusting execution domains. For each domain, Chunked-Cache carves out and isolates an exclusive cache share only as the domain requires. This effectively mitigates all interference across domains, thus defending even against stealthy cache occupancy attacks, unlike recent cache defenses, while activating side-channel resilience only for the sensitive execution domains that require it. All other execution domains can freely utilize mainstream cache resources at the same, or even improved, performance compared to conventional non-secure cache sharing.

III System & Adversary Model

In the following section, we describe our assumptions regarding the system and adversary model.

III-A System Model

Chunked-Cache targets computing systems which implement a TEE security architecture and contain a set-associative cache architecture. In the following, we first present our standard assumptions regarding the cache architecture, followed by our assumptions on the TEE security architecture which are aligned with existing academic [20, 57, 11, 4] and industry solutions [41, 45, 3].

Cache Architecture

In Chunked-Cache, we assume a typical modern set-associative cache architecture with multiple cache levels, where some cache levels are core-exclusive (typically L1 and L2) and others are shared between multiple cores (L3), whereby the L3 can be a sliced cache, e.g., sliced Intel LLCs. While Chunked-Cache could in principle also partition the smaller L1 and L2 caches, we assume that core-exclusive caches are flushed at context switches (as in most recent TEE architectures [4, 20, 57]), and thus deploy Chunked-Cache for the last-level L3 cache. Moreover, we assume that the cache controller can be configured via dedicated configuration registers, in line with typical platforms.

TEE Architecture

We assume that the computing systems which deploy Chunked-Cache implement a TEE architecture. TEE architectures already have established mechanisms for protecting sensitive code in compartmentalized execution contexts called enclaves or Isolated Domains (I-Domain), as we refer to them in this work. All non-sensitive code which does not require enhanced protection is consolidated in a Non-Isolated Domain (NI-Domain). The domains are also each assigned a unique identifier (domain ID). The separation between the I-Domains and the NI-Domain is enforced by access control mechanisms already implemented in the TEE architectures, e.g., at the MMU in Intel SGX [41] or Sanctum [41], at the system bus in CURE [4] or by the Physical Memory Protection (PMP) unit in Keystone [57]. The access control mechanisms are either configured by microcode [41, 45] or by a small software component which consists only of a few thousand lines of code (to be formally verifiable) and which runs in the highest software privilege level of the system [20, 57, 11, 4, 3]. We refer to this component as a trusted software component. The trusted software component is also responsible for all other security-sensitive operations, e.g., assigning the domain IDs, and, in the case of Chunked-Cache, configuring our novel protection mechanisms in the cache controller which we describe in detail in Section IV.

Although I-Domains are security-sensitive, they might still require to share data with another domain, e.g., to enable communication with the operating system. Thus, TEE architectures typically provide the possibility to mark parts of an I-Domain’s memory as shared, whereby this information is again managed by the trusted software component. In many TEE architectures, e.g., TrustZone [3], CURE [4] or AMD SEV [45], security-relevant metadata, which is required to perform access control, is sent as part of every memory request. For Chunked-Cache we assume the same, namely, that the domain ID of the domain issuing a memory access request and the information whether the requested memory address is shared or non-shared, are sent within the memory request.

III-B Adversary Model

Since we focus on the deployment of Chunked-Cache on systems with TEE architectures, we assume the same strong adversary model where the operating system kernel and hypervisor are untrusted [20, 57, 11, 4, 41, 45, 3].

With regard to cache side-channel attacks, we assume the adversary has access to the Chunked-Cache specification and is able to mount access-based and conflict-based side-channel attacks, which are the most sophisticated and applicable cache attacks (cf. Section II-A), to leak information about a sensitive execution domain (I-Domain). Since the adversary is also able to control the OS kernel, we assume a worst-case scenario where an adversary can easily mount the described attacks, i.e., has knowledge about the Chunked-Cache design and specs, and knows the virtual to physical address mapping of the victim domain. Moreover, the adversary can mount attacks from all privilege levels (except the highest privilege level that contains the trusted software component), has access to precise timing measurements and eviction instructions (e.g., clflush), can attack from the same CPU core executing the victim domain or a different core (cross-core), freely interrupt the victim domain and even keep the system noise to a minimum. In contrast to related work [91, 78, 79, 98, 89], we also consider the stealthier cache occupancy-based attacks (cf. Section II-A). Collision-based attacks [9], which exploit cache collisions at the victim caused by the victim’s own cache utilization, are, aligned with related work, kept out of scope. Collision-based attacks have not been widely shown and are very specific to particular software implementations (e.g., table-based).

Apart from cache side-channel attacks, an adversary who compromises the OS kernel has full control over the memory management and thus, can easily map physical memory pages of a victim domain into its own memory. This allows an adversary to perform rogue cache accesses to sensitive data directly without the need of a cache side channel.

In line with related work [28, 96, 51, 61, 21, 97, 63, 91, 78, 79, 98, 89], we do not consider physical attacks on caches, e.g., physical side-channel attacks [54], fault injection attacks [7], and attacks that exploit hardware flaws [90, 48, 77]. We do not consider denial-of-service attacks from a security point of view. However, to avoid a performance impact on the OS, Chunked-Cache ensures that a certain amount of cache resources is always available to the OS (described in Section IV). Based on our system model (Section III-A), we assume that the adversary cannot compromise the trusted software component.

IV Chunked-Cache Design

We first describe the high-level idea of Chunked-Cache, a novel cache microarchitecture that provides flexible and on-demand assignment of cache resources to execution domains (Section IV-A). We follow with a detailed explanation of our design (Section IV-B) and the required cache tag store and cache controller modifications (Section IV-C).

Fig. 1: Computing system with TEE architecture and Chunked-Cache as the shared last-level cache.

IV-A High-Level Design

In Figure 1, we show how Chunked-Cache is integrated as the last-level cache in a computing system which implements a TEE architecture, aligned with our system model detailed in Section III-A. Figure 2 steers the focus to the design of Chunked-Cache itself and illustrates its architecture abstractly. As described in Section III-A, all TEE architectures provide built-in mechanisms to protect sensitive code in Isolated Domains (I-Domains), whereas non-sensitive code is running in a Non-Isolated Domain (NI-Domain).

Fig. 2: Chunked-Cache high-level design: each domain gets an exclusive cache chunk allocated on-demand.

Each active domain (NI-Domain and I-Domains) is uniquely identified by a domain ID. The operating system (OS) and all workloads which do not require protection (and are combined in the NI-Domain) are assigned domain ID 0 by default. Every I-Domain can request exclusive cache resources of the desired capacity, forming the domain's exclusive cache chunk, which is utilized only by the owner domain. The NI-Domain utilizes the cache sets which are not exclusively allocated to I-Domains, which we call the mainstream cache sets.

Each I-Domain requests its dedicated cache chunk consisting of the required number of cache sets, e.g., I-Domain 1 in Figure 2 requested 4 sets. Thus, at I-Domain 1 setup, 4 available (unallocated) sets are located in the cache (the sets with global IDs 8-11 here) and allocated to I-Domain 1 such that they form its cache chunk. The allocated sets are mapped to I-Domain 1's chunk set IDs 0-3, and they are used to exclusively cache all and only the memory accesses issued by I-Domain 1. Enabling each I-Domain to request its desired cache chunk capacity exclusively provides strict partitioning and completely isolates its cache utilization on-demand. Besides enabling selective cache-based side-channel resilience, this also ensures that each I-Domain acquires the performance that corresponds to the cache capacity it has requested, without any competition from other workloads. In contrast to partitioning schemes [61, 57, 4, 96, 51] that provide each domain with only 1 or 2 ways within each set of the full cache structure, Chunked-Cache also partitions the cache, but more efficiently: it carves out a full cache chunk (with all its ways per set) of configurable capacity for the I-Domain and maps all its memory accesses to this cache chunk, thus promising maximum and unshared utilization of the allocated cache chunk. We show in Section VI that Chunked-Cache provides better performance and enhanced scalability than such partitioning schemes.

By allowing each I-Domain a custom and configurable cache chunk capacity on-demand, in contrast to a fixed allocation, Chunked-Cache enables an adaptive security-performance trade-off in the cache microarchitecture. On one hand, non-sensitive workloads can be allowed to freely utilize the shared mainstream cache resources. On the other hand, if side-channel resilience is a concern, a cache chunk with default capacity can be allocated to each I-Domain without any further intervention from the developer. Only if the developer wishes to further optimize the performance of the workload in a particular I-Domain does the cache chunk capacity (its number of sets) need to be calibrated accordingly, i.e., assigning an I-Domain more cache resources if affordable/available.

IV-B Design Details of Chunked-Cache

In the following, we discuss the key design goals and challenges of Chunked-Cache, and the mechanisms we propose to achieve them.

Configurable Per-Domain Isolation Modes

One of our key design goals for Chunked-Cache is to support configurable cache isolation modes that provide different security guarantees, thus catering for different use cases and their requirements. In line with the design paradigm of TEEs, it is not reasonable to assume that all workloads require cache isolation and side-channel resilience. Thus, in Chunked-Cache, we provide 2 different Isolation Modes that each I-Domain can selectively configure for the workload it protects: 1.) Mainstream-Cache Mode: cache isolation and side-channel resilience are not a security requirement, and thus the I-Domain can utilize the mainstream cache; however, the cached I-Domain data must still be protected from malicious OS accesses. 2.) Exclusive-Cache Mode: cache isolation is required since side-channel resilience is a security requirement, and thus an exclusive cache chunk is required by this I-Domain. The latter mode is configured for I-Domain 1 and I-Domain 2 in Figure 2. In addition to the Isolation Mode, the I-Domain can also configure its Shared Memory settings, i.e., whether it needs to share memory regions (and thus cache lines) with the OS, e.g., when using OS services. To cache shared memory, the mainstream cache that the OS uses is utilized. Typically, the developer of the workload decides which Isolation Mode an I-Domain uses and identifies which memory regions need to be shared, which is on par with the requirement in TEE architectures that the developer identify the security-sensitive parts of the overall workload [41, 3, 20]. If a developer is not sure whether cache side-channel attacks are a threat, the Exclusive-Cache Mode should be selected out of caution. At setup, an I-Domain configures: 1.) the desired Isolation Mode for its cache utilization and 2.) its Shared Memory regions, if required. This metadata is securely configured by the trusted component (as shown in Figure 1). The Isolation Mode is communicated to the cache controller at domain setup, whereas the Shared Memory information is transmitted with every memory request, aligned with our assumed system model (Section III-A).
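As a rough illustration, the per-domain metadata configured by the trusted component might be encoded as in the following C sketch; the struct layout and field names are our own assumptions, not the paper's actual register interface:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the per-domain cache configuration written by
 * the trusted software component into the cache controller's configuration
 * registers at domain setup. Field names are ours. */
enum isolation_mode {
    MAINSTREAM_CACHE_MODE = 0, /* shared mainstream sets, domain-ID-tagged */
    EXCLUSIVE_CACHE_MODE  = 1, /* carve out an exclusive cache chunk       */
};

struct domain_cache_config {
    uint8_t  dom_id;           /* unique domain ID; 0 is the NI-Domain     */
    enum isolation_mode mode;  /* communicated once, at domain setup       */
    uint16_t chunk_sets;       /* requested chunk capacity (power of two)  */
    bool     has_shared_mem;   /* shared regions are flagged per request   */
};
```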

Mainstream Cache vs. Shared Memory Support

When an I-Domain is in Mainstream-Cache Mode, it uses the mainstream cache sets also used by the OS (domain ID 0). To prevent a malicious OS from mapping the memory of an I-Domain into its own memory space and accessing it directly in the cache, Chunked-Cache requires that cache lines are tagged with the domain ID. The hardware mechanisms integrated into the Chunked-Cache controller enforce this tagging when caching the data, and ensure that only the owner domain which cached the data can access it. Being hardware-managed, the OS has no means to modify the domain ID stored in the cache lines.

When an I-Domain is also sharing memory with the OS, the corresponding cache lines for the defined Shared Memory regions are cached in the mainstream cache sets, and are to be accessed by both the owner domain and the OS. To support that, cache lines need to be tagged with an additional flag that indicates whether the cache line is shared with the OS. For typical TEE architectures, the developer of the workload protected in the I-Domain configures which of its memory regions are to be shared.

Exclusive-Cache Mode Chunk Set Indexing

The index bits of a memory address are used to locate the cache set to which it is mapped. In a conventional cache, the number of index bits is fixed and depends on the number of sets the cache supports. However, for Chunked-Cache to support cache chunks of different sizes for different domains, configurable set indexing is required.

When an I-Domain is in Exclusive-Cache Mode and requests a number of cache sets for its cache chunk, the number of set index bits that will be used to map its memory lines has to be computed individually for this domain. Therefore, the cache controller keeps track of the global IDs of sets which constitute the cache chunk (Figure 2), and the index bits for each domain. When a memory access is issued by a domain, this metadata is looked up, and the pertinent cache chunk sets correctly indexed. Moreover, when an I-Domain is torn down and its sets are de-allocated, the relevant metadata needs to be updated accordingly, besides flushing and invalidating the cache lines. Chunked-Cache also enables support for dynamic cache allocation, i.e., allocating additional cache sets to an I-Domain’s cache chunk at runtime and reconfiguring the index bits accordingly. In Section IV-C, we describe how the cache microarchitecture and controller are modified to enable this configurability efficiently.
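A minimal sketch of this configurable indexing, anticipating the per-domain metadata described in Section IV-C (the data layout here is our simplification):

```c
#include <stdint.h>

#define LINE_BITS 6  /* log2(64-byte cache line), illustrative */

/* Simplified per-domain lookup state (mirrors the DCAT entry of
 * Section IV-C): the number of index bits and the list of global set
 * IDs forming the domain's chunk. */
struct dcat_entry {
    uint8_t   index_bits;  /* log2(chunk size in sets)          */
    uint16_t *sid_vec;     /* chunk set ID -> global set ID map */
};

/* Map a memory address issued by an I-Domain in Exclusive-Cache Mode to
 * the global ID of the set in its private chunk: the index width depends
 * on the chunk size, and SID-VEC translates to a global set. */
static uint16_t chunk_global_set(const struct dcat_entry *d, uint64_t paddr) {
    uint32_t chunk_idx = (paddr >> LINE_BITS) & ((1u << d->index_bits) - 1);
    return d->sid_vec[chunk_idx];  /* e.g., chunk set 2 -> global set 10 */
}
```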

NI-Domain Chunk Set Indexing

Another design challenge in Chunked-Cache is managing the sets allocated to the OS, which represents the NI-Domain with domain ID 0, such that both flexibility and maximum utilization (as in an unmodified insecure cache architecture) are preserved. At bootup, when no domains are set up yet besides the OS, the OS should ideally be able to utilize all the available cache capacity, i.e., all cache sets are allocated to the OS by default. We refer to these as the mainstream cache sets. Then, once domains are set up and request exclusive cache sets, these get “torn away” from the OS’s cache and are allocated to the domains. This would, however, incur an impractical performance degradation for the OS: every time some of the OS’s cache resources are allocated to another domain, its capacity changes, and with it its set indexing. This renders all memory lines already cached by the OS inaccessible unless complicated remapping is performed. Essentially, the OS would need to cache these memory addresses once again, suffering a high number of cold misses every time a new domain is set up and an unreasonably high performance overhead.

Fig. 3: Chunked-Cache OS-specific chunk set indexing.

To avoid this performance penalty on the OS, the OS is allocated a fixed (sufficiently large) number of the cache sets in Chunked-Cache which remain always dedicated to the OS, while still allowing it to utilize the other cache sets so long as they remain unallocated. We demonstrate this in Figure 3 where the OS is always allocated a fixed number of 8 sets (0-7) which form its principal cache chunk. Since the 8 sets are always available for the OS, the memory address indexing and the number of index bits do not change at runtime. In other words, no OS memory lines cached in this principal cache chunk must ever be flushed out when any other domain requests to allocate additional cache sets, since the OS cache chunk sets are never torn away from the OS. However, the OS can still utilize unallocated sets (sets 12-15) in parallel until they get allocated to another domain, thus also guaranteeing maximum utilization of the available cache resources. This works by indexing cache sets in parallel which are congruent to the set to which a memory address is mapped. In Figure 3, 3 index bits are required to map a memory address to the correct set for a cache chunk of size 8 sets. Thus, if the index bits, e.g., map to set 4, then set 12 can also be utilized by the OS (set ID + OS cache chunk size) to cache that memory line. The same applies for memory lines that are mapped to sets 5, 6 and 7; they also map to sets 13, 14 and 15, respectively. However, memory lines mapped to sets 0-3 cannot utilize the congruent sets 8-11 because these are already allocated to I-Domain 1.
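The congruent-set rule of Figure 3 can be sketched as follows (our simplification, using the figure's 16-set cache with an 8-set OS chunk):

```c
#include <stdint.h>

/* Illustrative parameters matching Figure 3. */
#define OS_CHUNK_SETS 8u
#define TOTAL_SETS    16u

/* The OS always indexes its fixed principal chunk; in parallel it may
 * also use every congruent set (principal index + k * chunk size) that
 * the CST still marks as unallocated. */
static unsigned congruent_sets(uint32_t principal_idx,
                               uint16_t out[TOTAL_SETS / OS_CHUNK_SETS],
                               const uint8_t cst_alloc[TOTAL_SETS]) {
    unsigned n = 0;
    for (uint32_t sid = principal_idx + OS_CHUNK_SETS; sid < TOTAL_SETS;
         sid += OS_CHUNK_SETS) {
        if (!cst_alloc[sid])           /* only if not carved out already */
            out[n++] = (uint16_t)sid;  /* e.g., index 4 -> set 12        */
    }
    return n;
}
```

With Figure 3's allocation (sets 8-11 owned by I-Domain 1, sets 12-15 free), a principal index of 4 yields the extra set 12, while a principal index of 0 yields no congruent set.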

Fig. 4: Chunked-Cache controller operations for cache chunk allocation, de-allocation and access control.

IV-C Cache Tag Store & Cache Controller

Cache lines need to be additionally tagged with the domain ID bits as well as a 1-bit shared flag to enforce access control and moderate sharing with the NI-Domain. For instance, to support 16 parallel active domains, we need to extend the cache tag store with 4 bits to represent the domain ID. We emphasize that the Chunked-Cache design does not limit the number of parallel domains to 16; a larger number is possible but increases the hardware overhead of Chunked-Cache (though only linearly). Moreover, the number of domains only limits how many domains can be simultaneously active on the system. It does not limit how many applications can be protected in I-Domains on the system in general.

To support the configurable set indexing, the allocation/de-allocation of cache sets to different I-Domains and to differentiate between OS (NI-Domain) cache accesses vs. I-Domain accesses, 2 table structures are required by the Chunked-Cache controller which are shown in Figure 5. The Cache Set Status Table (CST) is a 1-bit vector that is indexed by the global set ID (SID) and that stores the status of each set, i.e., whether it is allocated to a domain. The CST is used to query the status of a set when searching for free cache sets to allocate to an I-Domain.

Fig. 5: Chunked-Cache table structures.

The Domain Cache Allocation Table (DCAT) is indexed by the domain ID. It maintains whether this domain is configured by the cache controller (ALLOC), a vector of the global set IDs that form its cache chunk (SID-VEC), and the corresponding number of index bits (INDEX) required to map a memory line to the correct set, i.e., log2 of the number of sets in the cache chunk, as shown in Figure 5.
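In C terms, the two table structures could be sketched as follows; the sizes match the 16 MB prototype described in Section VI-A, and the flat layout is our simplification of the hardware tables:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS    16384  /* 16 MB, 16-way LLC with 64 B lines */
#define NUM_DOMAINS 16
#define MAX_CHUNK   8192   /* max sets per domain (prototype)   */

/* CST: one ALLOC bit per global set ID, packed into a bit vector. */
static uint64_t cst[NUM_SETS / 64];

static inline bool cst_allocated(uint32_t sid) {
    return (cst[sid / 64] >> (sid % 64)) & 1u;
}

/* DCAT: per-domain allocation metadata, indexed by domain ID. */
struct dcat_row {
    bool     alloc;              /* domain configured?                */
    uint8_t  index;              /* log2(number of sets in the chunk) */
    uint16_t sid_vec[MAX_CHUNK]; /* global IDs of the chunk's sets    */
};
static struct dcat_row dcat[NUM_DOMAINS];
```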

We describe next how the Chunked-Cache controller performs these cache management operations, i.e., allocation, de-allocation and access control, as depicted in Figure 4. Figure 4 only shows the sequence of operations for ease of understanding; it does not reflect their temporal nature, i.e., whether they occur sequentially or in parallel.

Cache Allocation & De-allocation

When an I-Domain requests to allocate exclusive cache sets, this request (the domain ID, the number of sets requested (CH-NUM), and the corresponding number of INDEX bits (log2 CH-NUM)) is securely communicated from the trusted component to the cache controller via the cache controller's configuration registers (Section III-A). The domain ID is looked up in the DCAT to check whether it is already allocated and that the maximum number of sets allowed per I-Domain is not exceeded. The maximum/minimum limits for I-Domains are configured by the trusted software component, which ensures that each I-Domain is always assigned at least a minimum cache chunk size.

The CST is queried to locate free sets, which are allocated to the I-Domain by flipping their ALLOC bit, until CH-NUM sets are allocated. If the CST runs out of free sets, this is communicated back to the trusted component in order to modify the cache request. Next, the DCAT is indexed with the domain ID and its metadata is updated by setting the INDEX bits and filling the SID-VEC with the global IDs of the allocated sets.

If a domain requests to de-allocate its cache sets, the DCAT is indexed with the domain ID, its ALLOC flag is reset and the SID-VEC is read out. Next, the CST is indexed with each set ID in the SID-VEC and the sets are de-allocated. For both allocation and de-allocation, the cached memory lines in the relevant sets are invalidated and flushed (if dirty), to remove potentially malicious data in the allocation case and to prevent information leakage in the de-allocation case.
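A behavioral sketch of this allocation flow, continuing the table sketch above (not RTL; error signaling and roll-back are simplified, and the flush helper is hypothetical):

```c
/* Scan the CST for free sets, hand them to the requesting domain, and
 * record them in the DCAT. Roll-back of partially allocated sets on
 * failure is elided for brevity. */
static int allocate_chunk(uint8_t dom_id, uint32_t ch_num) {
    struct dcat_row *row = &dcat[dom_id];
    if (row->alloc || ch_num > MAX_CHUNK)
        return -1;                     /* already configured / too large */

    uint32_t found = 0;
    for (uint32_t sid = 0; sid < NUM_SETS && found < ch_num; sid++) {
        if (!cst_allocated(sid)) {
            cst[sid / 64] |= 1ull << (sid % 64);    /* mark set taken     */
            row->sid_vec[found++] = (uint16_t)sid;  /* chunk idx -> global */
            /* invalidate_and_flush_set(sid);  hypothetical helper        */
        }
    }
    if (found < ch_num)
        return -1;  /* CST ran out of free sets; report to trusted comp. */

    /* ch_num is a power of two in the prototype, so ctz == log2
     * (__builtin_ctz is a GCC/Clang builtin). */
    row->index = (uint8_t)__builtin_ctz(ch_num);
    row->alloc = true;
    return 0;
}
```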

The number of cache sets which are always assigned to the NI-Domain is hardwired, since the circuitry for the parallel tag lookup (described below) must be hardwired and cannot be configurably extended.

Cache Access Management

The domain ID of an incoming cache access request indicates whether it is an access by the NI-Domain (the OS domain with domain ID 0) or by an I-Domain. If it is an OS access, then the index bits are fixed, since the OS's number of cache sets is hardwired (no need to look up its INDEX bits in the DCAT). The OS domain is assigned the least significant cache sets by default, thus the SID-VEC is also not needed. The correct set index in the principal chunk is computed from the memory address in the request. Because it is an OS access, congruent cache sets that are not allocated can also be utilized (see Section IV-B). Thus, they are also computed and their ALLOC status queried in the CST to locate the unallocated sets. The tag stores of the ways in the principal as well as the congruent sets are looked up in parallel to locate a tag match (cache hit), thus impacting neither performance nor routing delay, especially since a large number of principal sets is usually allocated to the NI-Domain, which minimizes the number of congruent sets that are looked up in parallel (1 or 2 more sets). The domain ID and shared tag bits are also checked in parallel: if the cache line belongs to a non-zero domain ID (an I-Domain), the shared tag bit must be 1 to allow the OS to access it.

For an I-Domain (non-zero domain ID), if access is requested to a Shared Memory region or if the I-Domain is in Mainstream-Cache Mode, then the access is treated by the controller as an NI-Domain access where the mainstream and congruent cache sets are accessed. However, at the tag comparison, the issuing domain ID is checked against the cache line's to verify that only the owner domain accesses it. If the access is performed in Exclusive-Cache Mode, the exclusive cache chunk of the domain is accessed: the DCAT is indexed with the domain ID and the SID-VEC and INDEX bits are read out. The chunk set index is computed and used to index into the SID-VEC to map to the correct global set ID. Then, the tag store is accessed for a tag bit comparison.
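Putting these access-management rules together, a behavioral sketch of the routing decision might look as follows (reusing the structures from the sketches above; domain_mode and os_principal_index are hypothetical helpers):

```c
/* Route an access to a global set ID (behavioral, not RTL). `shared`
 * arrives with every memory request per the system model. */
static uint32_t route_access(uint8_t dom_id, bool shared, uint64_t paddr) {
    if (dom_id != 0 && !shared &&
        domain_mode(dom_id) == EXCLUSIVE_CACHE_MODE) {
        /* Exclusive path: per-domain index bits, then SID-VEC maps the
         * chunk-local set index to a global set ID. */
        const struct dcat_row *row = &dcat[dom_id];
        uint32_t idx = (paddr >> LINE_BITS) & ((1u << row->index) - 1);
        return row->sid_vec[idx];
    }
    /* Mainstream path (OS, Mainstream-Cache Mode, or shared memory):
     * fixed OS indexing; unallocated congruent sets are probed in
     * parallel, and the domain ID / shared tags are checked during the
     * tag comparison to enforce access control. */
    return os_principal_index(paddr);
}
```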

Chunked-Cache’s design is independent of the implemented cache replacement policy and thus does not require additional modifications to it. On every cache miss experienced by an I-Domain in Exclusive-Cache Mode, a cache line in the corresponding set of the domain’s exclusive cache chunk is selected for eviction. On cache misses by an I-Domain in Mainstream-Cache Mode or when accessing Shared Memory, and for all misses by the NI-Domain, a cache line in the corresponding set of the mainstream cache is selected.

V Security Considerations

In this section, we discuss how Chunked-Cache protects against the adversary described in Section III-B. One key aspect of Chunked-Cache is that its protection capabilities rely on a strict partitioning of cache resources. Thus, in contrast to related work that relies on probabilistic defenses (e.g., randomized cache line mappings [91, 78, 79, 98, 89]), Chunked-Cache provides certainty that the attacker cannot infer the cache accesses of a victim, provided the partitioning is correctly implemented. The main security goals of Chunked-Cache are to prevent an adversary from accessing (reading/writing) data in the exclusive cache chunk of an I-Domain and to prevent eviction interference between the adversary and the victim domain. In the following, we show how Chunked-Cache achieves these goals with strict cache partitioning, and we discuss why Chunked-Cache’s security guarantees hold even against a strong adversary that has compromised the operating system kernel. Besides these security considerations, we verified the correctness of our implemented Chunked-Cache prototype by explicitly issuing memory requests which try to read, write and evict cached data of I-Domains.

Strict Partitioning of I-Domain Cache Chunks

As described in Section IV, the trusted software component communicates the number of chunk sets which should be assigned to an I-Domain to the Chunked-Cache controller, which configures the DCAT and verifies that each cache chunk set is only assigned to a single I-Domain. At every cache memory access, the cache controller uses the domain ID to index the DCAT and retrieve the list of assigned sets (SID-VEC). Since the assignment of domain IDs and the configuration of the DCAT can only be performed by the trusted software component, the indexing logic of the cache controller will never return a cache set which does not belong to the issuer of the memory request. Thus, an adversary is never able to read an I-Domain’s exclusive sets (its cache chunk), write to them, or evict them. As a result, Chunked-Cache protects against access-based attacks, which require the adversary to flush memory out of the victim’s sets, and conflict-based attacks, which require filling the victim’s sets and thus evicting its cache lines. Moreover, Chunked-Cache’s strict cache resource separation prevents an adversary from observing evictions of its own sets caused by the victim, which protects against occupancy-based attacks, and also strictly prevents the sharing of replacement policy metadata, which has been shown to be exploitable [51]. In general, the adversary can only infer how many cache sets are assigned to an I-Domain, but cannot infer which sets (and therefore which memory addresses) are accessed at which point in time. As described in Section III-B, collision-based attacks are not considered; defending against them architecturally requires locking the victim’s cache lines. Chunked-Cache could be extended to integrate this, though mitigating an attack which is very specific to particular software implementations and has not been widely demonstrated does not justify the resulting large performance overhead.

Chunked-Cache allows for a dynamic assignment of cache sets to I-Domains. Whenever the cache chunk capacity of an I-Domain is modified, all assigned chunk sets are invalidated. This prevents leakage of sensitive I-Domain data when chunk sets are reassigned to another execution domain, and prevents an adversary from injecting malicious data into a set when additional sets are assigned to an I-Domain. The invalidation is, however, only required for the I-Domain whose cache chunk is resized; all other I-Domains are not affected and thus their cache lines do not need to be flushed. The same applies when the cache chunk of an I-Domain is completely de-allocated. An adversary could also try to trick an I-Domain into storing sensitive data in a mainstream cache line that is accessible to the adversary (shared flag bit set). Chunked-Cache prevents this by checking the metadata on every memory request of an I-Domain to verify that the memory region was indeed configured as shared.

Protecting from Compromised NI-Domain

As described in Section III-B, in the adversary model of TEE architectures, the OS (and therefore the NI-Domain) is not trusted, allowing an adversary to map physical memory pages of a victim I-Domain to its own memory space and to directly access it in the cache. If an I-Domain (represented by an enclave) demands side-channel protection (Exclusive-Cache Mode), all data is cached in the exclusive cache chunk and thus, not accessible for the adversary. However, if an I-Domain is not concerned about cache side channels (Mainstream-Cache Mode), the data is cached in the shared mainstream sets and thus, must still be protected from malicious direct accesses. Chunked-Cache prevents those attacks with the domain ID tag which is added to every cache line. On every cache write, the domain ID tag is set to the ID of the write request issuer. Subsequently, on every read request, the ID of the issuer is compared to the stored ID and the request only permitted if both IDs match. Evictions are permitted for every domain to achieve a perfect utilization of the shared cache sets. This is, however, not a security concern since an I-Domain’s data will only be cached in the shared sets if the I-Domain is in Mainstream-Cache Mode or if the data is explicitly shared with the NI-Domain.

VI Implementation & Evaluation

To evaluate Chunked-Cache with respect to its hardware footprint, power consumption overheads, and performance impact, we implemented our design in hardware and on an architectural cycle-accurate simulator.

Methodology

We implemented a hardware RTL model of Chunked-Cache to extend an open-source RISC-V processor and synthesized it to evaluate the storage and logic overhead incurred. We use our hardware implementation to extract the additional cycle latencies incurred by Chunked-Cache for individual cache management and access operations. Then, to evaluate the performance impact of Chunked-Cache on large mixed workloads, we extend an architectural cycle-accurate simulator, the gem5 simulator, with Chunked-Cache and configure it to model a multi-core architecture with a 3-level cache hierarchy that matches our system assumptions (Section III-A). We incorporate the cycle latencies derived from our hardware implementation into our gem5 setup and use it to collect performance measurements on the standard SPEC CPU2017 [18] benchmark suite (aligned with related work [78, 79, 98, 89]) to evaluate the overall performance impact of Chunked-Cache. Complementary to the compute-intensive SPEC benchmarks, we also evaluate Chunked-Cache on the I/O-intensive webserver nginx. To achieve the most realistic results, we conduct our experiments in the full-system simulation mode of gem5, which simulates the user- and kernel-space software as well as I/O devices.

We describe next our hardware implementation (Section VI-A), performance evaluation (Section VI-B), and our hardware and power overhead evaluation (Section VI-C).

VI-A Hardware Implementation

In our hardware model, we extended the cache tag store with a 4-bit domain ID and a 1-bit shared flag to tag the owner domain of each cache line and whether it is shared with the NI-Domain (OS), respectively. We also extended the cache controller with the table structures shown in Figure 5. To track the status of the 16,384 sets of a 16 MB, 16-way LLC, the CST is implemented as a 16,384-bit register that is indexed by the set ID to read out the corresponding 1-bit ALLOC flag. To support set allocation for 16 domains in parallel, the DCAT is implemented as a 16-row, domain-ID-indexed vector structure. We decided on 16 parallel domains in our hardware implementation since this is also the maximum number of enclaves supported in parallel by multiple TEE architectures [57, 4]. We define for our implementation that the maximum number of sets that can be allocated to any domain is 8,192. Thus, per domain, we reserve 4 bits to represent the number of set INDEX bits (to index into up to 8,192 sets), 114,688 bits for the SID-VEC (8,192 sets × 14 bits to represent each set's global ID), and a 1-bit ALLOC flag. We discuss the storage overheads incurred by the tables in Section VI-C.
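These figures can be cross-checked with a back-of-the-envelope storage calculation (our arithmetic, roughly consistent with the 2.3% memory overhead reported in Section I):

tag-store extension: 16,384 sets × 16 ways × (4 + 1) bits = 1,310,720 bits
CST:                 16,384 × 1 bit                       =    16,384 bits
DCAT:                16 × (1 + 4 + 114,688) bits          = 1,835,088 bits
total:               ≈ 3,162,192 bits ≈ 395 KB            ≈ 2.3% of a 16 MB LLC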

We implement the control finite-state machines (FSMs) that receive cache allocation and de-allocation requests and perform the necessary management. For allocation, the FSM cycles through the sets sequentially to allocate free ones to the requesting I-Domain, updating their status in the CST and the corresponding domain status in the DCAT. For de-allocation, another FSM reads the SID-VEC of the pertinent I-Domain from the DCAT, resets its ALLOC flag, and then de-allocates all sets of that I-Domain (by sequentially indexing the CST with the respective set IDs from the SID-VEC). Both allocation and de-allocation occur in powers-of-two set counts in our prototype; this is purely an implementation decision to minimize logic complexity and overhead.

The cache access mechanisms are extended to include the DCAT lookup required by Chunked-Cache to identify which global set IDs belong to the issuing domain and to map the access to the correct set prior to tag lookup. Additionally, for NI-Domain accesses, after mapping to the correct set ID, congruent sets are computed and looked up in the CST in parallel to identify which of them are unallocated.

VI-B Performance Evaluation

In this section, we first describe the latencies from our RTL model which we incorporate into our gem5 implementation. Next, we provide an evaluation of Chunked-Cache’s performance impact using the gem5 implementation.

Cycle Latencies

As described in Section IV, Chunked-Cache introduces a new indexing policy. For I-Domain memory requests in Exclusive-Cache Mode, a lookup in the DCAT is required. For requests in Mainstream-Cache Mode and all NI-Domain (OS) memory requests, the mainstream sets must be looked up. The comparison of the stored domain ID with the requester's is done in parallel with the address tag comparison and thus does not introduce additional latency. We measure an additional latency of 1 cycle for I-Domain requests in Exclusive-Cache Mode, and an additional 2 cycles for NI-Domain requests and I-Domain requests in Mainstream-Cache Mode. For the access latencies of modern LLCs on multi-core systems, we estimate a baseline of 80 cycles, in line with vendor multi-core processors [40].

Whenever an I-Domain gets sets allocated, unallocated sets are looked up and the DCAT is updated. At de-allocation, the sets of the I-Domain must be invalidated (and possibly flushed) and the CST and DCAT updated. For allocation, the overall latency incurred is variable and is a function of: 1.) how many sets (CH-NUM) are requested for allocation, and 2.) how many sets have to be looked up in the CST. In the worst case, this incurs a latency of 16,384 cycles; in the best case, CH-NUM cycles. One additional cycle is incurred to subsequently update the DCAT. The INDEX is computed and communicated by the trusted component in the allocation request, and thus does not contribute additional latency.

For de-allocation, we measure an overall latency of CH-NUM + 2 cycles, where 1 cycle is required to look up the DCAT and another cycle to update it, followed by CH-NUM cycles to de-allocate each set in the CST. In the worst case, a latency of 8,194 cycles is incurred (assuming the maximum of 8,192 sets per domain). However, de-allocating the sets in the CST is done in parallel with invalidating (and flushing, if dirty) the respective cache lines.
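As a concrete instance of these formulas (our arithmetic): de-allocating a 1,024-set chunk takes 1 cycle for the DCAT lookup, 1 cycle for the update, and 1,024 cycles to clear the CST entries, i.e., 1,026 cycles in total, which is negligible compared to the cost of tearing down a domain in a TEE architecture.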

We emphasize that allocating new sets to any I-Domain does not require invalidating or flushing any other sets of the NI-Domain or other I-Domains which would require re-caching them. This is one key design goal of Chunked-Cache since it eliminates this performance overhead on other domains, particularly the NI-Domain. The allocation of sets either happens only once during the I-Domain setup or occasionally when the number of assigned sets is modified at runtime which requires a context switch out of the I-Domain. The Chunked-Cache allocation/de-allocation overheads induced remain negligible when compared with the general overheads of TEE architectures [20, 11, 57, 4]. Therefore, we do not invest in increased logic complexity to optimize the cycle overheads incurred for allocation and de-allocation, since they are not in the critical path, i.e., LLC accesses.

Parameters              | L1 (I&D)      | L2     | L3 (gem5) | L3 (Chunked-Cache)
size                    | 64 KB & 32 KB | 512 KB | 16 MB     | 16 MB
# of sets               | 128 & 64      | 512    | 16,384    | 16,384
associativity           | 8-way         | 16-way | 16-way    | 16-way
access latency (cycles) | 4             | 14     | 80        | 81 / 82

TABLE I: Cache configuration of our gem5 evaluation setup with an inclusive 3-level cache hierarchy.

Mixed-Workload Cycle-Accurate Evaluation

We implement Chunked-Cache on the cycle-accurate gem5 simulator and construct a multi-core system which resembles a modern computing system with an inclusive 3-level cache hierarchy. Each core has access to core-exclusive L1 and L2 caches, and an L3 LLC shared among all cores. For the L1 and L2, we use the unmodified cache implementation provided by gem5, whereas we use our Chunked-Cache implementation for the L3 cache. The configuration parameters of each cache level are shown in Table I. We derive realistic values for the cache sizes, number of cache sets, associativity and access latency in line with modern caches. For the Chunked-Cache L3 cache, we add the latencies collected from our hardware implementation. Constructing a gem5-based multi-core system with a 3-level cache hierarchy in full-system simulation mode to collect representative cycle-accurate traces for large workloads involved significant engineering challenges, as also evidenced by recent works that resort to trace-based simulators for their evaluation with SPEC workloads [78, 79, 98, 89].

Fig. 6: Cache miss rate impact of Chunked-Cache for SPEC benchmarks on an 8-domain setup, compared to a shared L3 cache.

We measure the performance impact of Chunked-Cache on real-world workloads using the standard SPEC CPU2017 benchmarks, with both the SPECspeed 2017 Integer and SPECspeed 2017 Floating Point suites, which represent a wide range of compute-intensive applications such as compilers, video compression, machine learning, and modeling tasks. Since running all of the benchmarks on our full-system cycle-accurate gem5-based simulation setup would be very costly in terms of memory and time, we selected benchmarks from different application domains and with different working set sizes, guided by a memory-centric characterization of the SPEC CPU2017 benchmarks [87]. Moreover, to also cover I/O-intensive workloads, we evaluate the impact of Chunked-Cache on the widely used webserver nginx. We run our experiments for 1 trillion instructions before we start to collect measurements, in order to boot the system, start the benchmarks, and collect more representative metrics. We then run all our experiments for a total of 1 billion instructions in the full-system mode of gem5 and collect statistics to compute the Cycles Per Instruction (CPI) metric, which captures the additional latency effect, and the L3 cache miss rates, which capture the reduced cache capacity effects. Unless stated otherwise for individual experiments, the miss rates are calculated as the geometric mean over the instruction and data miss rates of the page table walker and core.

We compare Chunked-Cache to 1.) a baseline system with an unmodified insecure L3 cache, and 2.) an L3 cache which implements a way-based partitioning scheme in which cache ways are assigned to I-Domains, as provided, e.g., by CATalyst [61] which uses Intel CAT [49], SecDCP [96], DAWG [51], Keystone [57] or CURE [4]. We evaluate Chunked-Cache with a set of experiments that investigate different computing scenarios. First, we show how Chunked-Cache's partitioning influences the performance of mixed workloads when encapsulated in I-Domains (in Exclusive-Cache Mode). Then, we evaluate Chunked-Cache's impact on the NI-Domain (OS-domain) and compare against way-based partitioned cache schemes. We conclude our evaluation with a set of experiments that show the scalability of Chunked-Cache. In general, when compared to the baseline (an unpartitioned L3 cache shared by the same workload), our experiments show a negative effect of Chunked-Cache on the performance of a benchmark when only a small cache chunk is assigned to it. However, this effect vanishes as the cache chunk size increases; at some point, depending on the specific characteristics of a benchmark, the exclusive cache chunk assigned by Chunked-Cache even has a positive effect on its performance, as we show in the following experiments. This gives the developer some freedom to calibrate the performance of a workload by distributing the cache resources accordingly, e.g., to optimize the performance of a particular benchmark, given that the cache resources are available/affordable. All experiments were conducted on an x86 platform equipped with an Intel Xeon Silver 4215 CPU (2.50 GHz) and 186 GB RAM.
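
For clarity, the two reported metrics reduce to the following straightforward computations over the gem5 statistics; the function and variable names below are ours, not gem5 identifiers.

    #include <cmath>
    #include <cstdint>

    // Overall miss rate: geometric mean over the four components, i.e., the
    // instruction and data miss rates of the page table walker and the core.
    double overallMissRate(double itbMR, double dtbMR, double instrMR, double dataMR) {
        return std::pow(itbMR * dtbMR * instrMR * dataMR, 0.25);
    }

    // Cycles Per Instruction from the simulated cycle and instruction counters.
    double cpi(uint64_t cycles, uint64_t instructions) {
        return static_cast<double>(cycles) / static_cast<double>(instructions);
    }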

I-Domain Performance Impact

In the first set of experiments, we evaluate the performance impact Chunked-Cache has on mixed workloads when protected in I-Domains in Exclusive-Cache Mode. We run 7 randomly selected SPEC benchmarks in I-Domains and show our results in Figure 6. The NI-Domain (D0) runs Linux (kernel version 4.19.83) and 2 benchmarks with large working sets (600.perlbench_s and 602.gcc_s). In this experiment, we assign 8,192 sets to the NI-Domain and a varying number of sets to each I-Domain, as indicated in the plot. We chose the number of sets by briefly analyzing the working set size of the benchmark running in each I-Domain and assigning more cache sets to benchmarks with bigger working sets. This is only required when optimizing for performance; otherwise, a default number of sets can be assigned to each benchmark. We observe in the experiment that the overall miss rate decreases significantly for most benchmarks when compared to sharing the L3 cache. This shows that the assignment of a smaller but exclusive cache portion can even reduce the cache miss rate of a workload. Moreover, our results indicate that the number of cache sets required to reduce or completely avoid the impact of Chunked-Cache heavily depends on the characteristics of the workload. In our experiment, the benchmarks 605.mcf_s and 620.omnetpp_s would require more cache sets than the assigned 512 and 1,024 sets to avoid an impact on their cache miss rates. We investigate this in another experiment where we customize the number of sets allocated to an I-Domain for some of the benchmarks and show how the miss rate decreases significantly when increasing the chunk size (Figure 7). In another experiment (Figure 8), we show how the varying chunk sizes also influence the CPI values. As with the miss rates, the CPI generally decreases. We observe, however, some outliers in the collected CPI metrics, owing to the complexity of a full-system multi-core simulation on gem5, which also includes unpredictable kernel runtime behavior in the statistics.

Fig. 7: Cache miss rate impact of Chunked-Cache for SPEC CPU2017 benchmarks (varying sets); compared to a shared L3 cache.
Fig. 8: CPI impact of Chunked-Cache for SPEC CPU2017 benchmarks (increasing sets); compared to a shared L3 cache.

Additionally, to evaluate the impact of Chunked-Cache on I/O-intensive workloads, we conduct experiments in which we run the nginx webserver in one I-Domain and the HTTP benchmarking tool wrk in another I-Domain, while keeping the NI-Domain unmodified. We then use wrk to send HTTP requests to the webserver using 12 threads and 400 open connections. Figure 9 shows the miss rate impact of Chunked-Cache on nginx and wrk when increasing the number of sets from 128 to 2,048. In line with our results on SPEC, the results show how the increase in cache sets leads to a decrease in the overall miss rate. The decrease is already noticeable for a relatively small number of sets, since the exclusive assignment of cache sets prevents nginx and wrk from evicting each other's cache lines.

Fig. 9: Cache miss rate impact of Chunked-Cache for nginx and wrk (increasing sets); compared to a shared L3 cache.

NI-Domain Performance Impact

In the second set of experiments, we focus on the performance impact of Chunked-Cache on workloads executing in the NI-Domain. We again run mixed workloads from the SPEC benchmarks in I-Domains, while running Linux and the 2 memory-intensive benchmarks 600.perlbench_s and 602.gcc_s in the NI-Domain. In Figure 10, we vary the number of sets allocated to the NI-Domain from 2,048 to 8,192 while keeping the sets of the other domains unchanged. For these experiments, we show all 4 miss rate metrics over which we average in the other experiments: the data and instruction miss rates of the page table walker (DTB MR and ITB MR, respectively), and the data and instruction miss rates of the core (Data MR and Instr. MR, respectively). While, in general, all miss rate and CPI metrics decrease compared to the baseline, we observe only a slight improvement when increasing the chunk size from 2,048 to 4,096 and 8,192 sets. This is because even when the number of sets statically allocated to the NI-Domain is rather small, the unallocated sets in the system (mainstream sets) remain available to the NI-Domain. Thus, the performance of the NI-Domain is not significantly impacted, and maximum utilization of the available resources (as in an unmodified insecure cache architecture) is preserved, which was one of the key design goals of Chunked-Cache.

Fig. 10: Miss rate & CPI impact of Chunked-Cache on the NI-Domain (increasing sets); compared to a shared L3 cache.

To investigate this further, we run experiments (same setup) in which we assign 1,024 sets to the NI-Domain and vary the number of unassigned sets. In the first run, all cache sets in the system are allocated, while in the second run, 4,096 sets remain unallocated and thus available to the NI-Domain. Figure 11 shows how the miss rates decrease significantly when 4,096 sets remain unallocated, which demonstrates how Chunked-Cache enables the NI-Domain to utilize unused cache sets.

Fig. 11: Miss rate of Chunked-Cache on the NI-Domain with varying number of unassigned sets; compared to a shared L3 cache.

Comparison with Partitioning-based Schemes

We compare Chunked-Cache to way-based partitioning, the only other strict cache partitioning approach, which we implement on gem5. We run a number of experiments with a 5-domain setup where we assign the same cache capacity to the same benchmark in both the Chunked-Cache and the way-partitioned cache: 1,024 or 2,048 sets in Chunked-Cache and, equivalently, 1 or 2 ways in the way-partitioned setup. Figure 12 shows how, for the same cache capacity, Chunked-Cache outperforms way-based partitioning for randomly selected benchmarks. In fact, for some benchmarks such as 625.x264_s and 644.nab_s, allocating 1,024 sets even outperforms 2 ways (double the cache capacity) on the way-partitioned cache. We calculate an average decrease of 43% in the miss rate for Chunked-Cache vs. the way-partitioned cache for a 1 MB cache capacity (1,024 sets) and a 39% decrease for 2 MB (2,048 sets).
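
The capacity pairing above follows directly from the cache geometry in Table I; the short check below makes the equivalence explicit for our 16-way LLC with 16,384 sets and 64 B lines.

    #include <cstdio>

    int main() {
        const unsigned lineB = 64, ways = 16, totalSets = 16384;
        unsigned chunkMB = 1024 * ways * lineB / (1024 * 1024);    // 1,024-set chunk, all ways
        unsigned wayMB   = totalSets * 1 * lineB / (1024 * 1024);  // 1 way across all sets
        std::printf("1,024-set chunk: %u MB, 1 way: %u MB\n", chunkMB, wayMB);  // both: 1 MB
    }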

Scalability and Dynamic Cache Allocation

In Appendix A, we additionally evaluate Chunked-Cache’s ability to scale and support 32 I-Domains in parallel without degrading the performance of the NI-Domain (OS) and we also demonstrate how Chunked-Cache supports the dynamic allocation of cache sets to an I-Domain during runtime.

Fig. 12: Overall miss rate for SPEC CPU2017 benchmarks with Chunked-Cache; compared to a way-partitioned cache.

VI-C Hardware Footprint and Power Consumption Evaluation

To evaluate the storage and logic overhead incurred by Chunked-Cache, we synthesize our implementation targeting a single-issue single-core RISC-V processor [81] using Xilinx Vivado tools. While this processor does not provide an LLC, this is not an obstacle: we can still extend its existing simple cache controller to implement Chunked-Cache, verify its functionality in cycle-accurate RTL simulations, and evaluate its overheads.

Storage/Memory Overhead

The main contributor to the hardware area overhead of Chunked-Cache is the extra storage required, rather than the logic itself, since storage requires the fabrication of memory, which consumes more gates than hardware logic. The extra storage is needed for the additional tag bits required per cache line (4 bits plus a 1-bit flag), the CST, and the DCAT. In our current prototype implementation, targeting 16 domains, a 16 MB LLC with 16 ways and 16,384 sets, and an allowed maximum of 8,192 sets per domain, the CST consumes 2 KB, the DCAT 224 KB, and the additional tag storage 160 KB, totaling 386 KB. This amounts to a negligible 2.3% storage overhead relative to a 16 MB LLC, which corresponds to approximately 2.7% additional area in fabrication.
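
These totals can be reproduced with back-of-the-envelope arithmetic. The per-structure layouts assumed below (one CST bit per set; one 14-bit physical set pointer per DCAT entry, wide enough to index 16,384 sets, with 8,192 entries for each of the 16 domains) are our inference, chosen to be consistent with the reported figures.

    #include <cstdio>

    int main() {
        const double lines  = 16.0 * 1024 * 1024 / 64;   // 262,144 cache lines in the LLC
        double tagKB  = lines * (4 + 1) / 8 / 1024;      // 5 extra tag bits/line  -> 160 KB
        double cstKB  = 16384.0 * 1 / 8 / 1024;          // 1 bit per set          ->   2 KB
        double dcatKB = 16.0 * 8192 * 14 / 8 / 1024;     // 16 x 8,192 x 14 bits   -> 224 KB
        double totalKB = tagKB + cstKB + dcatKB;         // 386 KB
        std::printf("total: %.0f KB = %.2f%% of a 16 MB LLC\n",
                    totalKB, 100.0 * totalKB / (16.0 * 1024));  // ~2.36%, i.e., ~2.3%
    }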

The capacity of these tables, and thus the storage area overheads, are directly impacted by how the various design/implementation trade-offs are configured in different instantiations of Chunked-Cache, namely 1.) the number of active parallel domains supported (the overhead increases only linearly), 2.) the total L3 cache capacity and its number of sets, and 3.) the maximum number of sets that can be allocated to a domain. For example, to support 32 domains, one more tag store bit is required, costing an additional 0.25 KB relative to the overhead incurred for 16 domains described above. The CST capacity is unaffected, while the DCAT capacity doubles to 448 KB. The power consumption (evaluated below) would increase proportionally.

Logic Overhead

Chunked-Cache requires extra hardware logic for the FSMs that handle cache de-/allocation and look up the tables prior to cache accesses (Section VI-A). We synthesize our hardware implementation using Xilinx Vivado, targeting a ZedBoard Zynq-7000 FPGA board, and estimate a logic overhead of 1.6% relative to the single-core RISC-V processor that we extend. This overhead would diminish relative to a significantly more complex multi-billion-transistor processor with a 3-level cache hierarchy, which is the intended platform for Chunked-Cache. Furthermore, this overhead does not increase with the number of domains supported by Chunked-Cache.

Power Consumption Overhead

We focus here on the power consumption overheads incurred by the extended tag store and the CST and DCAT tables, since they dominate the extra hardware added and contribute the most to the additional (static and dynamic) power consumption. Besides, the power consumed by cache memories is significantly higher than that of logic and is usually the largest contributor to the total power consumed by a chip. We estimate the power consumption overheads of Chunked-Cache in 22 nm technology using the CACTI-6.0 tool [70]. For a 16-way 16 MB cache with a 64 B cache line size, the total leakage power increases from 5056.57 mW (baseline) to 5313.83 mW. The CST and DCAT incur an additional 365 mW, amounting to a total increase of 12.3% in the LLC power consumption. To support OS-specific chunk set indexing, the power consumption increases accordingly. If 2 sets are looked up in parallel (when 8,192 sets are allocated to the OS), the penalty on power consumption is negligible. When 4 or 8 sets are looked up in parallel, the power consumption overhead increases by an additional 5.5% and 27.1%, respectively, relative to the 5056.57 mW baseline. Relative to the overall chip power consumption of modern multi-core processors (90-150 W), the LLC power consumption increases incurred by Chunked-Cache remain reasonable.
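
The 12.3% figure follows from the CACTI numbers above, assuming the 365 mW for the CST and DCAT adds on top of the tag-store-driven leakage increase:

    #include <cstdio>

    int main() {
        const double baseline = 5056.57;   // mW, unmodified 16-way 16 MB LLC
        const double withTags = 5313.83;   // mW, with the extended tag store
        const double tables   = 365.0;     // mW, CST + DCAT
        double increase = (withTags - baseline) + tables;      // 622.26 mW
        std::printf("+%.2f mW = %.1f%% of baseline\n",
                    increase, 100.0 * increase / baseline);    // ~12.3%
    }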

VII Related Work

We categorize cache side-channel defenses that tackle the problem directly in the cache into two broad classes: partitioning-based and randomization-based. In this section, we focus only on the works most relevant to Chunked-Cache, all of which propose hardware changes to the cache architecture.

VII-A Partitioning-based Microarchitectures

The partitioning-based defenses most related to Chunked-Cache propose new cache architectures that assign cache resources (cache lines or ways) exclusively to protected domains. The TEE architectures Keystone [57] and CURE [4] implement way-based partitioning to assign cache ways exclusively to enclaves. SecDCP [96] forms security classes of applications with similar security requirements and assigns cache ways to them. DAWG [51] provides way-based cache partitioning in the context of speculative execution attacks. The main limitation of way-based partitioning is its inability to support a large number of protected domains in parallel, since even large LLCs comprise only a small number of cache ways (up to 16). Moreover, these defenses lead to cache underutilization when the assigned cache ways are not fully utilized by a protected domain, since the unused cache lines are blocked for all other domains on the system.

Chunked-Cache, like a few other approaches [97, 21], partitions the cache at cache-line granularity and is thus more flexible. PLcache [97] assigns cache lines exclusively to processes, which allows for strict and fine-grained partitioning of cache resources. However, PLcache's strict isolation does not allow caching data shared between processes and strongly impacts the overall system performance and the fairness of cache utilization. Moreover, PLcache does not protect against occupancy-based attacks, since the adversary can still infer the victim's memory accesses by observing that the victim is unable to access/evict cache lines.

HybCache [21] assigns cache ways to protected domains (or enclaves) and provides a fully-associative mapping with random replacement for these ways to overcome the cache underutilization problem of way-based partitioning schemes. In contrast to PLcache, HybCache assigns only a subset of the cache resources to the protected domains, which can be reclaimed by non-sensitive domains; thus, a fairer cache utilization is achieved that does not heavily degrade the overall system performance. However, HybCache does not scale practically to large LLCs, since it would incur high power consumption overheads. Moreover, HybCache does not provide strong security guarantees against occupancy-based attacks, since it does not enforce strict partitioning.

In memory page-coloring schemes [28, 50, 103, 61, 20], the mapping from physical memory addresses to cache lines is utilized to ensure that the cache lines used by sensitive applications do not overlap. One problem with page coloring is its high impact on the software memory layout: it cannot fully support DMA and requires modifying the memory management (OS or hypervisor). Moreover, the assignment of cache lines is static, i.e., modifying the number of assigned cache lines at runtime would require altering the physical memory layout of the software, which is highly impractical.

Chunked-Cache, however, provides flexible cache-line partitioning that can scale to support a larger number of protection domains than the number of cache ways. It additionally overcomes the limitations of other cache-line partitioning techniques by providing support for shared memory and by scaling to large LLCs while still providing strict isolation. In contrast to page coloring schemes, Chunked-Cache does not influence the memory layout, is compatible with commodity memory management software, and allows dynamic modification of the chunk sizes during runtime.

VII-B Cryptographic Randomization Defenses

These randomization techniques attempt to avoid the storage overhead of the large randomized mapping tables deployed by earlier defenses [97, 63, 62] by relying on cryptographic primitives to reproducibly generate the randomized mapping. Time-Secure Cache [91] uses a set-associative cache indexed with a keyed function that takes the cache line address and process ID as input. However, a weak low-entropy indexing function is used; thus, frequent re-keying and cache flushing must be performed, which increases complexity and performance impact.

CEASER [78] also uses a keyed indexing function, but without the process ID. It likewise requires frequent re-keying of its index derivation function and re-mapping to limit the time interval available for an attacker to construct an eviction set. Against minimal eviction-set construction algorithms of O(E²) complexity, CEASER has been shown to withstand attacks with a re-keying rate of 1%. However, against eviction-set construction techniques of O(E) complexity [79], the re-keying rate needs to increase to 35%-100%, which incurs prohibitively high performance overheads. To resist these improved attacks, a skewed variant of CEASER, CEASER-S [79], was proposed that divides the cache ways into multiple partitions (skews), with a different encryption key used for each partition. A cache line maps to a different set in each partition, and one of the partitions is chosen randomly for line placement, making minimal eviction-set construction more difficult.

ScatterCache [98] also uses keyed cryptographic indexing, where the cache set indexing is different and pseudo-random for every protected domain but consistent for any given key. Thus, re-keying is still required at regular time intervals to hinder profiling and minimal eviction-set construction efforts.

Phantom-Cache [89] relies on a set of hardware-efficient hash functions and XOR operations to map a cache line to 1 of 8 randomly chosen sets in the cache, each with 16 ways, thus increasing the effective associativity to 128. This requires accessing 128 locations on each cache access to check whether an address is cached, resulting in a high power overhead of 67%.

Defenses based on cryptographic primitives have multiple weaknesses: 1.) They remain only as secure as the best/fastest known attack strategy, i.e., minimal eviction-set construction algorithm [10, 76], with no solid future-proof security guarantees. In fact, recent work [88] has further shown that other attack techniques and workarounds can be used to exploit certain flaws in ScatterCache and CEASER-S and completely undermine their security guarantees. 2.) Their promised security guarantees often rely on the alleged, yet not thoroughly investigated, unpredictability of low-latency cryptographic primitives. The primitives deployed by CEASER, CEASER-S, and ScatterCache have been shown vulnerable to cryptanalysis, which enables the construction of eviction sets without even accessing memory [75, 8]. Deploying primitives that resist formal cryptanalysis is also not practical, since it would incur increased latency in the cache's critical path, further degrading performance. 3.) If the re-keying rate is increased to mitigate novel attacks, the induced performance overhead renders these defenses impractical.

Mirage, a concurrent work, attempts to overcome the vulnerability to newer, faster eviction-set construction algorithms by eliminating set-associative evictions altogether [82]. However, besides still being vulnerable to occupancy-based attacks, Mirage does not support selectively enabling side-channel resilience only for the execution domains that require it, thus incurring a performance slowdown on the entire workload.

Chunked-Cache, in contrast, fundamentally eliminates the described unreliability and inflexibility by providing strict, yet configurable and selective, partitioning across execution domains. This enables each domain to allocate the cache capacity it requires and thus experience the performance it has opted to tolerate. This different paradigm provides well-grounded security assurances that stand the test of advances in cache side-channel attacks, across attack methodologies and complexities, without sacrificing performance. Instead, it provides, by design, the possibility to tune the security-performance trade-off for each domain as desired, without overtaxing the OS.

VIII Conclusion

In this paper, we presented Chunked-Cache, a novel side-channel-resilient cache microarchitecture for TEE architectures that enables each execution domain to flexibly and selectively configure exclusive cache sets only when cache isolation and side-channel resilience are required. Unlike recently proposed randomization-based cache microarchitectures, Chunked-Cache fundamentally mitigates side-channel attacks by enforcing strict cache partitioning, thus providing solid, future-proof security guarantees. It also outperforms way-based partitioning and scales to support a larger number of execution domains, without degrading the performance of the OS. We show how Chunked-Cache incorporates this configurable security-performance trade-off by design in the cache microarchitecture to cater optimally to TEE architectures. Through our security analysis and evaluation, we also show that Chunked-Cache preserves on-demand side-channel security as well as performance, functionality, and usability requirements, at small hardware and memory costs.

Appendix A Additional Experiments

A-A I-Domain Scalability

We also demonstrate how Chunked-Cache scales to a larger number of parallel domains. As described in Section IV, the design of Chunked-Cache supports more parallel domains than the 16 domains we chose for our hardware implementation. Thus, we conduct scaling experiments in which we run the 619.lbm_s benchmark in every I-Domain and increase the number of I-Domains from 4 to 8, 16, and up to 32. Running more I-Domains in parallel is not possible on our evaluation platform, since the gem5 full-system simulation with 32 I-Domains already consumes the complete 186 GB of available RAM, which unavoidably imposes certain limitations on our experiments. Given these constraints, we selected 619.lbm_s as the benchmark because of its relatively small working set. Throughout these experiments, the NI-Domain (which runs the Linux kernel and one instance of 619.lbm_s) gets 8,192 sets assigned. The overall miss rates for the NI-Domain, when scaling from 4 to 32 I-Domains, remain stable at 71.45%, 71.64%, 72.06%, and 71.75%, respectively. Thus, Chunked-Cache can support a high number of I-Domains without degrading the performance of the NI-Domain (OS); running even more domains was limited only by the memory constraints of our evaluation platform.

A-B Dynamic Set Allocation

In another experiment, we analyze how the dynamic set allocation capabilities of Chunked-Cache impact the NI-Domain and I-Domains during runtime. For this, we select a SPEC benchmark (631.deepsjeng_s) that achieves a relatively small average cache miss rate when enough cache sets are available, in order to better demonstrate the behavior of the dynamic set allocation. We run the benchmark in 4 distinct I-Domains and as part of the NI-Domain. We simulate 24 billion cycles on our evaluation platform, which corresponds to 12 s of computing (given that we simulate processors with a clock frequency of 2 GHz). At the beginning of the experiment, the NI-Domain (D0) gets 8,192 sets assigned, the I-Domains D1-D3 get 512 sets each, and the I-Domain D4 only 1 set. Then, during runtime, the size of D4's chunk is modified: after 3 s, the chunk size is increased to 512 sets, after 6 s to 2,048 sets, and after 9 s it is decreased back to 1 set. The chunk sizes of the domains D0, D1, D2, and D3 are kept constant throughout the experiment. We collect miss rate statistics for all domains every 75 ms (150,000,000 cycles) and compute the arithmetic mean over the instruction and data miss rates of the page table walker and core.

The results of the experiment are shown in Figure 13, where we only show the miss rates for D0, D1, and D4, since the results of D2 and D3 are very similar to those of D1. The plot clearly shows how increasing and decreasing the chunk size affects the miss rate of D4. At the beginning, when only 1 set is assigned to D4, the miss rate fluctuates heavily around 80%. At the 3 s mark, when 511 additional sets are assigned to D4, the miss rate almost immediately drops to around 60%, catching up with the miss rates achieved by D1. After another 3 s, when D4's chunk size is increased to 2,048 sets, a low and stable miss rate of 20% is achieved. The fact that D0 experiences the same miss rate with 8,192 sets shows that applications do not always benefit from an increased chunk size; in such cases, available sets are better redistributed to other, benefiting domains to improve overall system performance. After 9 s, the chunk size is decreased to 1 set, which again leads to a heavily fluctuating miss rate of around 80%.

Another interesting takeaway from Figure 13 is that the flushing of all chunk sets, which happens after 6 s, does not negatively influence the miss rate of D4, at least not when collecting miss rate statistics at 75 ms intervals.

Fig. 13: Cache miss rate impact of Chunked-Cache on the NI-Domain (D0) and I-Domains (D1, D4) when dynamically modifying the size of the cache chunk assigned to D4.

References

  • [1] O. Aciiçmez, Ç. K. Koç, and J. Seifert (2007) On the power of simple branch prediction analysis. ACM Symposium on Information, computer and communications security, pp. 312–320. Cited by: §I.
  • [2] O. Acıiçmez, Ç. K. Koç, and J. Seifert (2007) Predicting secret keys via branch prediction. Cryptographers’ Track at the RSA Conference, pp. 225–242. Cited by: §I.
  • [3] ARM Limited (2009) ARM Security Technology – Building a Secure System using TrustZone Technology. Note: http://infocenter.arm.com/help/topic/com.arm.doc.prd29-genc-009492c/PRD29-GENC-009492C_trustzone_security_whitepaper.pdf Cited by: §I, §III-A, §III-A, §III-A, §III-B, §IV-B.
  • [4] R. Bahmani, F. Brasser, G. Dessouky, P. Jauernig, M. Klimmek, A. Sadeghi, and E. Stapf (2020) CURE: A Security Architecture with CUstomizable and Resilient Enclave. arXiv preprint arXiv:2010.15866. Cited by: §I, §II-B, §III-A, §III-A, §III-A, §III-A, §III-B, §IV-A, §VI-A, §VI-B, §VI-B, §VII-A.
  • [5] D. J. Bernstein, T. Lange, and P. Schwabe (2012) The security impact of a new cryptographic library. In International Conference on Cryptology and Information Security in Latin America, pp. 159–176. Cited by: §II-B.
  • [6] D. J. Bernstein (2005) Cache-timing attacks on AES. Cited by: §I, 2nd item.
  • [7] I. Biehl, B. Meyer, and V. Müller (2000) Differential fault attacks on elliptic curve cryptosystem. In CRYPTO, Cited by: §III-B.
  • [8] R. Bodduna, V. Ganesan, P. SLPSK, K. Veezhinathan, and C. Rebeiro (2020) Brutus: Refuting the Security Claims of the Cache Timing Randomization Countermeasure Proposed in CEASER. IEEE Computer Architecture Letters. Cited by: §I, §II-B, §VII-B.
  • [9] J. Bonneau and I. Mironov (2006) Cache-collision Timing Attacks Against AES. In International Conference on Cryptographic Hardware and Embedded Systems (CHES), Cited by: 2nd item, §III-B.
  • [10] T. Bourgeat, J. Drean, Y. Yang, L. Tsai, J. Emer, and M. Yan (2020) End-to-end Quantitative Security Analysis of Randomly Mapped Caches. In Micro, Cited by: §I, §II-B, §VII-B.
  • [11] F. Brasser, D. Gens, P. Jauernig, A. Sadeghi, and E. Stapf (2019) SANCTUARY: ARMing TrustZone with User-space Enclaves. In Annual Network and Distributed System Security Symposium (NDSS), Cited by: §III-A, §III-A, §III-B, §VI-B.
  • [12] F. Brasser, U. Müller, A. Dmitrienko, K. Kostiainen, S. Capkun, and A. Sadeghi (2017) Software Grand Exposure: SGX Cache Attacks Are Practical. In USENIX Workshop on Offensive Technologies (WOOT), Cited by: §I, §II-A.
  • [13] J. V. Bulck, D. Moghimi, M. Schwarz, M. Lippi, M. Minkin, D. Genkin, Y. Yarom, B. Sunar, D. Gruss, and F. Piessens (2020) LVI: Hijacking Transient Execution through Microarchitectural Load Value Injection. In IEEE Symposium on Security and Privacy, Cited by: §I.
  • [14] C. Canella, D. Genkin, L. Giner, D. Gruss, M. Lipp, M. Minkin, D. Moghimi, F. Piessens, M. Schwarz, B. Sunar, J. Van Bulck, and Y. Yarom (2019) Fallout: Leaking Data on Meltdown-resistant CPUs. In ACM SIGSAC Conference on Computer and Communications Security (CCS), Cited by: §I.
  • [15] G. Chen, S. Chen, Y. Xiao, Y. Zhang, Z. Lin, and T. H. Lai (2019) SgxPectre: Stealing Intel Secrets from SGX Enclaves Via Speculative Execution. In IEEE European Symposium on Security and Privacy, Cited by: §I.
  • [16] M. Chiappetta, E. Savas, and C. Yilmaz (2016) Real time detection of cache-based side-channel attacks using hardware performance counters. Applied Soft Computing 49, pp. 1162–1174. Cited by: §II-B.
  • [17] D. Cock, Q. Ge, T. Murray, and G. Heiser (2014) The last mile: An empirical study of timing channels on seL4. In CCS, Cited by: §II-B.
  • [18] S. P. E. Corporation (2017) SPEC CPU 2017. Note: https://www.spec.org/cpu2017 Cited by: §VI.
  • [19] V. Costan and S. Devadas (2016) Intel SGX Explained. Technical report Cryptology ePrint Archive. Report 2016/086. Note: https://eprint.iacr.org/2016/086.pdf Cited by: §I.
  • [20] V. Costan, I. A. Lebedev, and S. Devadas (2016) Sanctum: Minimal Hardware Extensions for Strong Software Isolation. In USENIX Security Symposium, Cited by: §II-B, §II-B, §III-A, §III-A, §III-A, §III-B, §IV-B, §VI-B, §VII-A.
  • [21] G. Dessouky, T. Frassetto, and A. Sadeghi (2020) HybCache: Hybrid Side-Channel-Resilient Caches for Trusted Execution Environments. In USENIX Security Symposium, Cited by: §I, §I, §I, §II-B, §III-B, §VII-A, §VII-A.
  • [22] C. Disselkoen, D. Kohlbrenner, L. Porter, and D. Tullsen (2017) Prime+Abort: A Timer-free High-precision L3 Cache Attack Using Intel TSX. In USENIX Security Symposium, Cited by: 2nd item.
  • [23] C. Disselkoen, D. Kohlbrenner, L. Porter, and D. Tullsen (2017) Prime+Abort: A Timer-Free High-Precision L3 Cache Attack using Intel TSX. In 26th USENIX Security Symposium (USENIX Security 17), pp. 51–67. Cited by: §II-B.
  • [24] G. Doychev, B. Köpf, L. Mauborgne, and J. Reineke (2013) CacheAudit: A Tool for the Static Analysis of Cache Side Channels. In USENIX Security Symposium, Cited by: §I.
  • [25] G. Doychev and B. Köpf (2017) Rigorous Analysis of Software Countermeasures Against Cache Attacks. In SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Cited by: §I.
  • [26] D. Evtyushkin, D. Ponomarev, and N. Abu-Ghazaleh (2016) Jump over ASLR: Attacking branch predictors to bypass ASLR. IEEE/ACM International Symposium on Microarchitecture. Cited by: §I.
  • [27] D. Evtyushkin, R. Riley, N. C. Abu-Ghazaleh, D. Ponomarev, et al. (2018) BranchScope: A New Side-Channel Attack on Directional Branch Predictor. ACM Conference on Architectural Support for Programming Languages and Operating Systems, pp. 693–707. Cited by: §I.
  • [28] M. Godfrey (2013) On The Prevention of Cache-Based Side-Channel Attacks in a Cloud Environment. Master’s Thesis, Queen’s University, Ontario, Canada. Cited by: §I, §II-B, §II-B, §III-B, §VII-A.
  • [29] J. Götzfried, M. Eckert, S. Schinzel, and T. Müller (2017) Cache Attacks on Intel SGX. In European Workshop on Systems Security, Cited by: §I, §II-A.
  • [30] B. Gras, K. Razavi, H. Bos, and C. Giuffrida (2018) Translation Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks. In USENIX Security Symposium, Cited by: §I.
  • [31] B. Gras, K. Razavi, E. Bosman, H. Bos, and C. Giuffrida (2017) ASLR on the Line: Practical Cache Attacks on the MMU. In Annual Network and Distributed System Security Symposium (NDSS), Cited by: §I, §I, 2nd item.
  • [32] D. Gruss, J. Lettner, F. Schuster, O. Ohrimenko, I. Haller, and M. Costa (2017) Strong and Efficient Cache Side-channel Protection Using Hardware Transactional Memory. In USENIX Security Symposium, Cited by: §II-B, §II-B.
  • [33] D. Gruss, C. Maurice, A. Fogh, M. Lipp, and S. Mangard (2016) Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR. In ACM SIGSAC Conference on Computer and Communications Security (CCS), Cited by: §I, §I, 1st item.
  • [34] D. Gruss, C. Maurice, K. Wagner, and S. Mangard (2016) Flush+Flush: A Fast and Stealthy Cache Attack. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), Cited by: §I, §I, 1st item, §II-B.
  • [35] D. Gruss, R. Spreitzer, and S. Mangard (2015) Cache Template Attacks: Automating Attacks on Inclusive Last-level Caches. In USENIX Security Symposium, Cited by: §I, §I.
  • [36] R. Guanciale, H. Nemati, C. Baumann, and M. Dam (2016) Cache Storage Channels: Alias-Driven Attacks and Verified Countermeasures. In IEEE Symposium on Security & Privacy (IEEE S&P), Cited by: §I, 2nd item.
  • [37] D. Gullasch, E. Bangerter, and S. Krenn (2011) Cache Games – Bringing Access-Based Cache Attacks on AES to Practice. In IEEE Symposium on Security & Privacy (IEEE S&P), Cited by: §I, 1st item.
  • [38] W. Hu (1991) Reducing Timing Channels with Fuzzy Time. In IEEE Computer Society Symposium on Research in Security and Privacy, Cited by: §II-B.
  • [39] W. Hu (1992) Reducing timing channels with fuzzy time. Journal of computer security 1 (3-4), pp. 233–254. Cited by: §II-B.
  • [40] (2020) Intel Skylake X. Note: https://www.7-cpu.com/cpu/Skylake_X.html Cited by: §VI-B.
  • [41] Intel Intel Software Guard Extensions. Tutorial slides. Note: Reference Number: 332680-002, revision 1.1https://software.intel.com/sites/default/files/332680-002.pdf Cited by: §I, §III-A, §III-A, §III-B, §IV-B.
  • [42] Intel (2019) Intel Integrated Performance Primitives Cryptography Developer Reference. Note: https://software.intel.com/sites/default/files/ippcp_devref_0.pdf Cited by: §II-B.
  • [43] G. Irazoqui, T. Eisenbarth, and B. Sunar (2015) S$A: A Shared Cache Attack That Works across Cores and Defies VM Sandboxing – and Its Application to AES. In IEEE Symposium on Security & Privacy (IEEE S&P), Cited by: §I, 2nd item, §II-A.
  • [44] G. Irazoqui, T. Eisenbarth, and B. Sunar (2016) Cross Processor Cache Attacks. In ACM Symposium on Information, Computer and Communications Security (ASIACCS), Cited by: §I, 1st item.
  • [45] Kaplan et al. (2016) AMD memory encryption. Note: https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf Cited by: §III-A, §III-A, §III-A, §III-B.
  • [46] M. Kayaalp, N. Abu-Ghazaleh, D. Ponomarev, and A. Jaleel (2016) A High-resolution Side-channel Attack on Last-level Cache. In IEEE/ACM Design Automation Conference (DAC), Cited by: §I, 2nd item.
  • [47] M. Kayaalp, K. N. Khasawneh, H. A. Esfeden, J. Elwell, N. Abu-Ghazaleh, D. Ponomarev, and A. Jaleel (2017) RIC: Relaxed Inclusion Caches for mitigating LLC side-channel attacks. In IEEE/ACM Design Automation Conference (DAC), Cited by: §II-B, §II-B.
  • [48] Z. Kenjar, T. Frassetto, D. Gens, M. Franz, and A. Sadeghi (2020) V0LTpwn: Attacking x86 Processor Integrity from Software. In 29th USENIX Security Symposium (USENIX Security 20), Cited by: §III-B.
  • [49] Khang T Nguyen (2016) Introduction to Cache Allocation Technology in the Intel Xeon Processor E5 v4 Family. Note: https://software.intel.com/articles/introduction-to-cache-allocation-technology Cited by: §VI-B.
  • [50] T. Kim, M. Peinado, and G. Mainar-Ruiz (2012) STEALTHMEM: System-level Protection Against Cache-based Side Channel Attacks in the Cloud. In USENIX Security Symposium, Cited by: §II-B, §II-B, §VII-A.
  • [51] V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, and J. Emer (2018) DAWG: A Defense Against Cache Timing Attacks in Speculative Execution Processors. In IEEE/ACM International Symposium on Microarchitecture (MICRO), Cited by: §I, §I, §II-B, §II-B, §III-B, §IV-A, §V, §VI-B, §VII-A.
  • [52] V. Kiriansky and C. Waldspurger (2018) Speculative Buffer Overflows: Attacks and defenses. arXiv preprint arXiv:1807.03757. Cited by: §I.
  • [53] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, et al. (2019) Spectre attacks: exploiting speculative execution. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 1–19. Cited by: §I, §II-A.
  • [54] Kocher (1996) Timing attacks on implementations of diffie-hellman, rsa, dss, and other systems. In CRYPTO, Cited by: §III-B.
  • [55] B. Köpf, L. Mauborgne, and M. Ochoa (2012) Automatic Quantification of Cache Side-channels. In International Conference on Computer Aided Verification (CAV), Cited by: §I.
  • [56] E. M. Koruyeh, K. N. Khasawneh, C. Song, and N. Abu-Ghazaleh (2018) Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX Security Symposium, Cited by: §I.
  • [57] D. Lee, D. Kohlbrenner, S. Shinde, K. Asanović, and D. Song (2020) Keystone: An Open Framework for Architecting Trusted Execution Environments. In Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys), Cited by: §I, §II-B, §III-A, §III-A, §III-A, §III-B, §IV-A, §VI-A, §VI-B, §VI-B, §VII-A.
  • [58] S. Lee, M. Shih, P. Gera, T. Kim, H. Kim, and M. Peinado (2017) Inferring fine-grained control flow inside SGX enclaves with branch shadowing. USENIX Security Symposium, pp. 16–18. Cited by: §I.
  • [59] M. Lipp, D. Gruss, R. Spreitzer, C. Maurice, and S. Mangard (2016) ARMageddon: Cache Attacks on Mobile Devices. In USENIX Security Symposium, Cited by: §I, §II-A.
  • [60] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn, S. Mangard, P. Kocher, D. Genkin, et al. (2018) Meltdown: reading kernel memory from user space. In 27th USENIX Security Symposium (USENIX Security 18), pp. 973–990. Cited by: §I, §II-A.
  • [61] F. Liu, Q. Ge, Y. Yarom, F. Mckeen, C. Rozas, G. Heiser, and R. B. Lee (2016) CATalyst: Defeating Last-Level Cache Side Channel Attacks in Cloud Computing. In IEEE International Symposium on High Performance Computer Architecture (HPCA), Cited by: §I, §II-B, §III-B, §IV-A, §VI-B, §VII-A.
  • [62] F. Liu and R. B. Lee (2014) Random Fill Cache Architecture. In IEEE/ACM International Symposium on Microarchitecture (MICRO), Cited by: §II-B, §VII-B.
  • [63] F. Liu, H. Wu, K. Mai, and R. B. Lee (2016) Newcache: Secure Cache Architecture Thwarting Cache Side-Channel Attacks. In IEEE/ACM International Symposium on Microarchitecture (MICRO), Cited by: §I, §II-B, §III-B, §VII-B.
  • [64] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee (2015) Last-Level Cache Side-Channel Attacks Are Practical. In IEEE Symposium on Security & Privacy (IEEE S&P), Cited by: §I, §I, 2nd item, §II-A.
  • [65] G. Maisuradze and C. Rossow (2018) ret2spec: Speculative execution using Return Stack Buffers. In ACM SIGSAC Conference on Computer and Communications Security (CCS), Cited by: §I.
  • [66] R. Martin, J. Demme, and S. Sethumadhavan (2012) TimeWarp: Rethinking Timekeeping and Performance Monitoring Mechanisms to Mitigate Side-channel Attacks. In International Symposium on Computer Architecture (ISCA), Cited by: §II-B.
  • [67] R. Martin, J. Demme, and S. Sethumadhavan (2012) Timewarp: rethinking timekeeping and performance monitoring mechanisms to mitigate side-channel attacks. In 2012 39th Annual International Symposium on Computer Architecture (ISCA), pp. 118–129. Cited by: §II-B.
  • [68] A. Moghimi, T. Eisenbarth, and B. Sunar (2018) MemJam: A false dependency attack against constant-time crypto implementations in SGX. Cryptographers’ Track at the RSA Conference, pp. 21–44. Note: 10.1007/978-3-319-76953-0_2 Cited by: §I.
  • [69] A. Moghimi, G. Irazoqui, and T. Eisenbarth (2017) Cachezoom: How SGX amplifies the power of cache attacks. In International Conference on Cryptographic Hardware and Embedded Systems, pp. 69–90. Cited by: §I, §II-A.
  • [70] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P Jouppi (2009) Cacti 6.0: A tool to model large caches.. Note: https://www.hpl.hp.com/techreports/2009/HPL-2009-85.pdf Cited by: §VI-C.
  • [71] D. A. Osvik, A. Shamir, and E. Tromer (2006) Cache Attacks and Countermeasures: The Case of AES. In The Cryptographers’ Track at the RSA Conference on Topics in Cryptology (CT-RSA), Cited by: §I, §I, 2nd item.
  • [72] D. A. Osvik, A. Shamir, and E. Tromer (2006) Cache attacks and countermeasures: the case of AES. In Cryptographers’ track at the RSA conference, pp. 1–20. Cited by: §II-B.
  • [73] M. Payer (2016) HexPADS: a platform to detect “stealth” attacks. In International Symposium on Engineering Secure Software and Systems, pp. 138–154. Cited by: §II-B.
  • [74] C. Percival (2005) Cache missing for fun and profit. BSDCan. Cited by: §II-B.
  • [75] A. Purnal, L. Giner, D. Gruss, and I. Verbauwhede (2021) Systematic Analysis of Randomization-based Protected Cache Architectures. In IEEE Symposium on Security and Privacy, Cited by: §I, §II-B, §VII-B.
  • [76] A. Purnal and I. Verbauwhede (2019) Advanced profiling for probabilistic prime+ probe attacks and covert channels in scattercache. arXiv preprint arXiv:1908.03383. Cited by: §II-B, §VII-B.
  • [77] P. Qiu, D. Wang, Y. Lyu, and G. Qu (2019) Voltjockey: breaching trustzone by software-controlled voltage manipulation over multi-core frequencies. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 195–209. Cited by: §III-B.
  • [78] M. K. Qureshi (2018) Ceaser: Mitigating Conflict-based Cache Attacks via Encrypted-Address and Remapping. In IEEE/ACM International Symposium on Microarchitecture (MICRO), Cited by: §I, §I, §II-B, §III-B, §III-B, §V, §VI, §VI-B, §VII-B.
  • [79] M. K. Qureshi (2019) New Attacks and Defense for Encrypted-Address Cache. In International Symposium on Computer Architecture (ISCA), Cited by: §I, §I, §II-B, §II-B, §III-B, §III-B, §V, §VI, §VI-B, §VII-B.
  • [80] (2018) Reading privileged memory with a side-channel. Note: https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html Cited by: §I.
  • [81] Roa Logic BV (2020) RV12. Note: https://github.com/RoaLogic/RV12 Cited by: §VI-C.
  • [82] G. Saileshwar and M. Qureshi (2021) MIRAGE: Mitigating Conflict-Based Cache Attacks with a Practical Fully-Associative Design. In USENIX Security Symposium, Cited by: §VII-B.
  • [83] M. Schwarz, M. Lipp, D. Moghimi, J. Van Bulck, J. Stecklina, T. Prescher, and D. Gruss (2019) ZombieLoad: Cross-Privilege-Boundary Data Sampling. In ACM SIGSAC Conference on Computer and Communications Security (CCS), Cited by: §I.
  • [84] M. Schwarz, C. Maurice, D. Gruss, and S. Mangard (2017) Fantastic timers and where to find them: high-resolution microarchitectural attacks in javascript. In International Conference on Financial Cryptography and Data Security, pp. 247–267. Cited by: §II-B.
  • [85] M. Schwarz, S. Weiser, D. Gruss, C. Maurice, and S. Mangard (2017) Malware Guard Extension: Using SGX to Conceal Cache Attacks. In Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), Cited by: §I, §II-A.
  • [86] A. Shusterman, L. Kang, Y. Haskal, Y. Meltser, P. Mittal, Y. Oren, and Y. Yarom (2019) Robust website fingerprinting through the cache occupancy channel. In 28th USENIX Security Symposium (USENIX Security 19), pp. 639–656. Cited by: §I, 3rd item.
  • [87] S. Singh and M. Awasthi (2019) Memory Centric Characterization and Analysis of SPEC CPU2017 Suite. arXiv preprint arXiv:1910.00651. Cited by: §VI-B.
  • [88] W. Song, B. Li, Z. Xue, Z. Li, W. Wang, and P. Liu (2021) Randomized Last-Level Caches Are Still Vulnerable to Cache Side-Channel Attacks! But We Can Fix It. In IEEE Symposium on Security & Privacy (IEEE S&P), Cited by: §I, §II-B, §VII-B.
  • [89] Q. Tan, Z. Zeng, K. Bu, and K. Ren (2020) PhantomCache: Obfuscating Cache Conflicts with Localized Randomization. In NDSS, Cited by: §I, §I, §II-B, §III-B, §III-B, §V, §VI, §VI-B, §VII-B.
  • [90] A. Tang, S. Sethumadhavan, and S. Stolfo (2017) CLKSCREW: exposing the perils of security-oblivious energy management. In USENIX Security Symposium, Cited by: §I, §III-B.
  • [91] D. Trilla, C. Hernandez, J. Abella, and F. J. Cazorla (2018) Cache Side-channel Attacks and Time-predictability in High-performance Critical Real-time Systems. In IEEE/ACM Design Automation Conference (DAC), Cited by: §I, §I, §II-B, §III-B, §III-B, §V, §VII-B.
  • [92] J. Van Bulck, F. Piessens, and R. Strackx (2018) Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. USENIX Security Symposium. Cited by: §I.
  • [93] S. Van Schaik, C. Giuffrida, H. Bos, and K. Razavi (2018) Malicious Management Unit: Why Stopping Cache Attacks in Software is Harder Than You Think. In USENIX Security Symposium, Cited by: §I, 2nd item.
  • [94] S. van Schaik, A. Milburn, S. Österlund, P. Frigo, G. Maisuradze, K. Razavi, H. Bos, and C. Giuffrida (2019) RIDL: Rogue In-flight Data Load. In IEEE Symposium on Security and Privacy, Cited by: §I.
  • [95] B. C. Vattikonda, S. Das, and H. Shacham (2011) Eliminating fine grained timers in xen. In Proceedings of the 3rd ACM workshop on Cloud computing security workshop, pp. 41–46. Cited by: §II-B.
  • [96] Y. Wang, A. Ferraiuolo, D. Zhang, A. C. Myers, and G. E. Suh (2016) SecDCP: Secure Dynamic Cache Partitioning for Efficient Timing Channel Protection. In IEEE/ACM Design Automation Conference (DAC), Cited by: §I, §I, §II-B, §II-B, §III-B, §IV-A, §VI-B, §VII-A.
  • [97] Z. Wang and R. B. Lee (2007) New Cache Designs for Thwarting Software Cache-based Side Channel Attacks. In International Symposium on Computer Architecture (ISCA), Cited by: §II-B, §II-B, §III-B, §VII-A, §VII-B.
  • [98] M. Werner, T. Unterluggauer, L. Giner, M. Schwarz, D. Gruss, and S. Mangard (2019) ScatterCache: Thwarting Cache Attacks via Cache Set Randomization. In USENIX Security Symposium, Cited by: §I, §I, §I, §II-B, §III-B, §III-B, §V, §VI, §VI-B, §VII-B.
  • [99] M. Yan, B. Gopireddy, T. Shull, and J. Torrellas (2017) Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Atacks. In International Symposium on Computer Architecture (ISCA), Cited by: §II-B, §II-B.
  • [100] M. Yan, R. Sprabery, B. Gopireddy, C. W. Fletcher, R. Campbell, and J. Torrellas Attack Directories, Not Caches: Side Channel Attacks in a Non-Inclusive World. Note: To appear in the Proceedings of the IEEE Symposium on Security & Privacy (IEEE S&P), May 2019 Cited by: §I, 2nd item.
  • [101] Y. Yarom and K. Falkner (2014) FLUSH+RELOAD: A High Resolution, Low Noise, L3 Cache Side-channel Attack. In USENIX Security Symposium, Cited by: §I, §I, 1st item, 2nd item.
  • [102] Y. Yarom, D. Genkin, and N. Heninger (2017) CacheBleed: a timing attack on OpenSSL constant-time RSA. Vol. 7, pp. 99–112. Cited by: §I.
  • [103] N. Zhang, K. Sun, D. Shands, W. Lou, and Y. T. Hou (2016) TruSpy: Cache Side-Channel Information Leakage from the Secure World on ARM Devices. Note: Cryptology ePrint Archive, Report 2016/980https://eprint.iacr.org/2016/980 Cited by: §I, §II-A, §II-B, §VII-A.
  • [104] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart (2012) Cross-VM Side Channels and Their Use to Extract Private Keys. In ACM SIGSAC Conference on Computer and Communications Security (CCS), Cited by: §II-A.
  • [105] S. Zhao, Q. Zhang, Y. Qin, W. Feng, and D. Feng (2019) SecTEE: A Software-based Approach to Secure Enclave Architecture Using TEE. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 1723–1740. Cited by: §II-B.