A Lightweight Isolation Mechanism for Secure Branch Predictors

05/17/2020 ∙ by Lutan Zhao, et al. ∙ 0

Recently exposed vulnerabilities reveal the necessity to improve the security of branch predictors. Branch predictors record history about the execution of different programs, and such information from different processes is stored in the same structure and thus accessible to each other. This leaves attackers with opportunities for malicious training and malicious perception. Instead of flush-based or physical isolation of hardware resources, we want to achieve isolation of the content in these hardware tables with some lightweight processing using randomization as follows. (1) Content encoding. We propose to use hardware-based thread-private random numbers to encode the contents of the branch predictor tables (both direction and destination histories), a mechanism we call XOR-BP. Specifically, the data is encoded by an XOR operation with the key before being written to the table and decoded after being read from the table. Such a mechanism obfuscates the information, adding difficulty to cross-process or cross-privilege-level analysis and perception. It achieves an effect similar to logical isolation but adds little in terms of space or time overheads. (2) Index encoding. We propose a randomized index mechanism for the branch predictor (Noisy-XOR-BP). Similar to XOR-BP, another thread-private random number is used together with the branch instruction address as the input to compute the index of the branch predictor. This randomized indexing mechanism disrupts the correspondence between the branch instruction address and the branch predictor entry, thus increasing the noise for malicious perception attacks. Our analyses using an FPGA-based RISC-V processor prototype and additional auxiliary simulations suggest that the proposed mechanisms incur a very small performance cost while providing strong protection.

1 Introduction

In modern processors, branch prediction is crucial in effectively exploiting the parallelism of sequential programs for high-performance execution. However, recently exposed vulnerabilities reveal the necessity to improve the security of branch predictors in mainstream commercial processors [kocher2018spectre, evtyushkin2018branchscope, evtyushkin2015covert, aciiccmez2007power]. Take the Spectre V2 vulnerability [kocher2018spectre] as an example: through purposeful training of the branch predictors, an adversary can steer the control flow of a victim onto incorrect speculative paths, which in turn reveals unauthorized data. Another example is the BranchScope attack [evtyushkin2018branchscope], where history information in the branch predictor can be perceived by the attacker to infer sensitive data after the victim has executed. The root cause of these vulnerabilities is that modern processors generally adopt the design principle of resource sharing, and the branch predictor is a typical example. From a security perspective, resource sharing means a possible attack surface. Branch predictors record history about the execution of different programs, and such information from different processes is stored in the same structure and thus accessible to each other. This leaves attackers with opportunities for malicious training and malicious perception.

One kind of defensive idea is to avoid letting key instructions leave traces in the branch predictors. For example, one countermeasure [agosta2007countermeasures] is to convert secret-dependent branch instructions into indirect jumps or other computation instructions to prevent the leakage of sensitive information. As another example, Intel processors allow software to set IA32_SPEC_CTRL.IBRS to 1 after a transition to a more privileged predictor mode [Intel2018specmitigations]. Predicted targets of indirect branches executed in that predictor mode cannot be set by execution in a less privileged predictor mode. However, it is difficult to guarantee full coverage of sensitive branches through manual or automated analyses.

In contrast, isolation is a fundamental way to improve the security of branch predictors. Existing proposals can be largely separated into two categories. ① Logical isolation aims at preventing attacks on the shared hardware resource. Attaching the thread ID to each entry can help eliminate malicious reuse across threads, but it cannot avoid contention. Another mechanism is to flush the whole history table each time a context switch or privilege switch (e.g., a system call) occurs. Evtyushkin et al. proposed to do so in software [evtyushkin2016understanding]. Clearly, this approach introduces a large context switching cost (e.g., 1.2ms per switch in the experiment). Naturally, hard-wiring flush support can bring the cost of flushing down, which was simulated by Lee et al. [lee2017inferring]. Even when the flush operation itself is cheap, the lost performance benefit of a warmed-up branch predictor is yet another cost, which becomes significant in SMT architectures. ② Physical isolation can be implemented by allocating separate branch tables for different threads or privilege levels. BRB [vougioukas2019brb] is a state-of-the-art hardware implementation that provides individual history tables for different programs. Although BRB tries to limit hardware cost, it is in general impractical to assign separate tables to all combinations of threads and privilege levels. Also, physical isolation alone is insufficient, as storage is still multiplexed in time by different threads.

In short, existing methods are not satisfactory. We thus set out to seek a more lightweight solution that achieves isolation of the content with similar or better qualities than these prior approaches. To the best of our knowledge, this paper is the first to propose the XOR-based lightweight isolation mechanism for branch predictors, which can effectively prevent branch-predictor induced malicious code execution and prevent spying on the history information of branch predictors. Our approach has the advantage of simple implementations and better suitability for SMT architecture than predictor flushing upon context switch. It is also more cost effective than private predictors or hardware backups. Overall, this paper makes the following contributions:

  • We propose a lightweight XOR-based isolation mechanism (XOR-BP) for branch predictors. Upon a context switch or privilege change, the hardware dynamically generates a new thread-private random number. When updating the branch predictor (direction and/or address), the predictor’s update is XORed with the thread-private random number before being saved to the table. Similarly, when reading the predictor table, the current random number is used again to decode the stored result. By changing the thread-private random number periodically (upon context switches or privilege changes), this lightweight mechanism effectively achieves content-level isolation among multiple threads.

  • We propose a randomized index mechanism for the branch predictor (Noisy-XOR-BP). Building on XOR-BP, another thread-private random number is used together with the branch instruction address as the input to compute the index of the branch predictor. This randomized indexing mechanism disrupts the correspondence between the branch instruction address and the branch predictor entry, thus increasing the noise for malicious perception attacks.

  • We present a detailed evaluation of our proposed designs. For a single-threaded core, we implemented the above isolation mechanisms on an FPGA prototyping system of a RISC-V out-of-order processor for a realistic evaluation of the performance impact. For SMT microarchitectures, we modeled and evaluated the isolation mechanisms on a Gem5-based simulator and investigated their performance impact for different branch predictors, including the latest TAGE_SC_L [seznec_TAGE_SC_L] predictors.

In the following, we analyze existing attacks on branch predictors (Section 2); discuss the threat model (Section 3); introduce the defense strategy (Section 4) and the design details of the countermeasure (Section 5); analyze the performance impacts (Section 6); summarize related work (Section 7); and conclude this paper (Section 8).

2 Understanding Attacks Via Branch Predictors

2.1 Two types of attacks due to resource sharing

Conventional branch predictor design allows different processes to use the same hardware resources for branch prediction. This creates a side channel just like those exploited in a cache based side channel attack. It allows the attacker to prime the predictor in a certain fashion to facilitate the revelation of a victim’s sensitive information. Additionally, the attacker can perform malicious training in order to influence the victim’s (speculative) execution, which in turn enables or enhances the victim’s information leak. A number of different types of attacks have been constructed:

(1) Reuse based attacks: In structures such as PHT (Pattern History Table), different programs directly access the common resource. Entries set by one process influence another. This type of attack is therefore analogous to reuse based cache attacks [liu2014random]. There are three typical examples.

① BranchScope attack [evtyushkin2018branchscope]: An attacker first locates the shared PHT entry of the secret-dependent branch of the victim and sets its saturating counter to a specific state, such as Weak Taken. The contents of the branch predictor are updated after the victim’s target branch instruction is executed. After switching back to the attacker’s program, the victim’s update to the branch predictor manifests as a measurable difference in execution time. The attacker can thus sense the direction of the target instruction and infer the victim’s execution path.

② Pure malicious training like Spectre V1 and V2 [kocher2018spectre]: Instead of exploiting the side channel of the predictor to obtain leaked information, an attacker carefully trains the shared predictor, either PHT or BTB, to induce the victim thread onto incorrect speculative paths, which in turn reveals unauthorized data via a cache side channel.

③ Branch Shadowing attack [lee2017inferring]: The attacker makes a shadow copy of the victim code and measures whether the target address left in the BTB by a victim branch accelerates the execution of the corresponding shadow branch. In this way the attacker can observe the direction of the victim branch inside SGX.

(2) Contention based attacks: In structures more similar to a cache, such as the BTB, an attack similar to a contention based cache attack can be mounted. One condition for constructing such attacks is that contention results in eviction of the old record. An attacker can learn about the execution of the target branch instructions of a victim by sensing whether contention happens or not for the corresponding entry of the branch prediction table. Because different branches in a typical PHT use and update the same entry, rather than evicting each other’s history, there are no contention based attacks on the PHT.

In an SBPA attack [aciiccmez2007power, aciiccmez2007predicting], the attacker first occupies all the entries in the same set (multiple ways) in the BTB corresponding to the victim’s target branch instruction. When the victim executes, the target branch will have a BTB miss and is thus predicted as Not Taken. According to the BTB’s update mechanism, the BTB will be updated if and only if the target branch is Taken. An update of the BTB will evict an entry primed by the attacker, resulting in a measurable difference in execution time, thus allowing the attacker to learn the execution results of the target branch of the victim.

An attacker can also traverse its own address space and measure execution time to infer the eviction of its branches from the BTB. In the case where only one branch is routinely evicted, the victim must have a branch with the same index and (partial) tag. This allows the attacker to infer virtual addresses of other threads and break KASLR [evtyushkin2016jump].

In summary, the fundamental source of these attacks is that current branch predictors do not have thorough isolation between different processes and privileges.

2.2 Common steps for attacks on branch predictor

By analyzing these two types of attacks, the following common steps can be observed during a typical attack:

Step 1: Locate phase. An attacker first needs to locate the entry of branch predictor for the victim’s target branch instruction. Given the rather fixed indexing design of typical predictor implementation, this is relatively straightforward.

Step 2: Prime phase. Next, the attacker needs to prepare the state of the target entry. This includes priming the whole set in order to sense eviction, or setting a particular value in order to achieve malicious training or to permit future observation of the content changes caused by the execution of the target branch.

Step 3: Probe phase (optional). After the target branch is executed, if the attacker is to obtain information about the target branch, it needs to probe the status of the target entry, usually via execution time analysis.
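To make the three phases concrete, the following is a minimal sketch of a BranchScope-style reuse attack against an unprotected, shared PHT of 2-bit saturating counters. The table size, the index function, and all names (`index`, `update`, `predict`, `attack`) are illustrative assumptions rather than the organization of any real predictor, and the probe reads the prediction directly where a real attacker would infer it from execution time.

```python
# Toy model of a shared, unprotected PHT (assumed: 4K two-bit counters).
PHT_BITS = 12
pht = [1] * (1 << PHT_BITS)   # counter states 0..3; a value >= 2 predicts Taken


def index(pc):
    # Fixed, publicly known indexing by low PC bits: this is what makes
    # the Locate phase straightforward.
    return (pc >> 2) & ((1 << PHT_BITS) - 1)


def update(pc, taken):
    i = index(pc)
    pht[i] = min(3, pht[i] + 1) if taken else max(0, pht[i] - 1)


def predict(pc):
    return pht[index(pc)] >= 2


def attack(victim_pc, victim_taken):
    # Locate: pick an attacker branch that aliases to the victim's entry.
    attacker_pc = victim_pc + (1 << (PHT_BITS + 2))
    # Prime: force the shared counter to Weak Taken (state 2).
    for _ in range(3):
        update(attacker_pc, False)      # saturate down to state 0
    update(attacker_pc, True)
    update(attacker_pc, True)           # 0 -> 1 -> 2
    # The victim executes its secret-dependent branch once.
    update(victim_pc, victim_taken)
    # Probe: state 3 predicts Taken, state 1 predicts Not Taken.
    return predict(attacker_pc)
```

In this model `attack(pc, True)` returns `True` and `attack(pc, False)` returns `False`, so the attacker recovers the victim's branch direction exactly.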

3 Threat Model

This paper has the following assumptions: The attacker thread and the victim thread can run on the same processor core. An attacker can know the source code and address layout of the victim. An attacker has the ability to run the victim program in single-step mode [evtyushkin2018branchscope].

This paper focuses on the defense against reuse based attacks and contention based attacks, which cause malicious training, priming, and perception across different processes and privileges. In addition, branch predictors may also have side-channel leakage in the event of a misprediction. It should be noted, however, that misprediction and thus mis-speculation are perhaps inevitable. This paper does not consider the defense against all speculative execution related vulnerabilities.

A typical branch predictor consists of a set of history/pattern tables, which work together to predict the direction of a conditional branch (Pattern History Table, or PHT), the addresses of indirect branches (Branch Target Buffer, or BTB), and the addresses of function return instructions (Return Address Stack, or RAS). Many commercial processors have adopted a thread-private RAS structure for SMT core, but they still use a shared design for PHT and BTB [Intel2018manual]. Therefore, this paper mainly studies how to isolate PHT and BTB. Nevertheless, our proposed methods still apply to shared RAS.

4 Defense Strategy

This paper focuses on securing branch predictors against the reuse based and contention based attacks. As can be seen from the analysis in Section 2, these attacks primarily exploit the design vulnerability that branch predictors allow concurrent and fine-grain shared accesses from multiple threads and multiple privilege spaces. Our approach is to provide sufficient isolation in the predictor between threads or privilege levels. We first examine existing isolation proposals.

4.1 Analysis of existing isolation mechanisms

(1) Logical isolation aims at preventing attacks on a physically shared branch predictor by allowing only the owner of an entry to access it. This could take the form of adding some sort of thread ID to each entry. For example, a PHT of a commercial processor typically has around 4K entries, each being a 2-bit counter. Adding ASID information (12 bits in an Intel processor [Intel2018manual]) to every entry is a costly approach. Moreover, contention based attacks remain possible if the attacker can discern whether its history has been evicted by a victim branch. Flushing the predictor completely upon context switches or privilege changes is an alternative logical isolation mechanism, which we call Complete Flush in this paper. We investigate its performance impact on our evaluation platform, which is described in detail in Section 6.
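To see why per-entry ASID tagging is costly for a PHT, the storage can be worked out directly. The arithmetic below takes "around 4K entries" to mean exactly 4096, which is an assumption made only for illustration.

```latex
\begin{align*}
\text{untagged PHT} &= 4096 \times 2\ \text{bits} = 8192\ \text{bits} = 1\ \text{KiB} \\
\text{ASID tags}    &= 4096 \times 12\ \text{bits} = 49152\ \text{bits} = 6\ \text{KiB} \\
\text{tagged PHT}   &= 1\ \text{KiB} + 6\ \text{KiB} = 7\ \text{KiB}
\end{align*}
```

Under this assumption the tags alone take six times the storage of the counters they protect, so the tagged table is seven times the size of the original.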

Figure 1: Performance overhead of flushing branch predictor on single-threaded processor. (flush-4M means the predictor is flushed every 4 million cycles).

There are three interesting observations:

Observation 1: The performance impact of flush methods on a single-threaded core is insignificant. For a 2GHz single-threaded core, when the context switching frequency is 250Hz (a typical context-switching frequency in Linux), the average performance loss of flushing predictors upon switching is less than 1% in comparison with the baseline (without isolation), as shown in Figure 1. This indicates that the program usually has enough time in each scheduled execution window to amortize the warm-up of the branch predictor, which is consistent with other findings [lee2017inferring].
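For reference, the clock rate and switching frequency quoted above translate into the following number of cycles between flushes:

```latex
\[
\frac{2 \times 10^{9}\ \text{cycles/s}}{250\ \text{switches/s}}
  = 8 \times 10^{6}\ \text{cycles per scheduling interval}
\]
```

Assuming the flush-NM labels in Figure 1 refer to the same 2GHz clock, flush-4M would therefore correspond to flushing about twice as often as the 250Hz rate.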

Observation 2: The performance loss of Complete Flush on an SMT core gets worse since flushing from different threads interfere with each other. Furthermore, Complete Flush cannot prevent attacks on SMT cores. Figure  2 shows a significant increase in performance loss to flush in an SMT core compared to a single-threaded core. Increasing the number of threads causes more performance degradation. And it cannot provide sufficient isolation to prevent sharing and preemption of history from other hardware threads.

Figure 2: Performance overhead of flushing branch history on an SMT core.

Observation 3: At the cost of a complex hardware implementation, a more precise flush mechanism is better but not sufficient, and it may introduce a new security issue for the PHT. It is tempting to think that if we replace a whole-structure flush with a more precise flush mechanism, the performance degradation problem will be solved. Unfortunately, this does not appear to be the case. We have simulated a design in which each entry of the branch prediction resource is augmented with a thread ID, properly managed to ensure that only entries used by a particular thread are flushed when the thread is swapped out. We show a performance comparison of such a more precise flush mechanism (dubbed Precise Flush) against the more basic variant (Complete Flush) in Figure 3. We see that the performance loss does reduce but remains elevated. Considering the extra storage space and complexity involved in carrying out the more precise flush, it is hard to see this as a satisfactory solution, especially for a 2-bit pattern history table. Furthermore, such a flush mechanism still cannot protect against a contention based attack.

Figure 3: Comparison between Complete Flush and Precise Flush in SMT-2 (Normalized to baseline without any mechanism).

(2) Physical isolation means allocating separate branch tables for different threads and different privilege levels, which aims at eliminating malicious contention and reuse. One approach is providing additional hardware resources to back up (certain portions of) predictor table upon context switches [dhodapkar2001saving]. And backed up table contents will be recovered for the thread swapped in. Nevertheless, there exist two limits:

① Physical isolation is likely to incur non-trivial resource overhead. A milder version of physical isolation could be part of a solution with acceptable overheads. For instance, BRB strives to keep a small (1-3KB) portion of the state backed up [vougioukas2019brb]. ② Constrained by hardware overhead, the number of backup tables is limited. It is impossible for every thread to have a private history table, so attacks remain possible between software programs that end up sharing the same backup table.

4.2 Our design philosophy

Both flush-based logical and physical isolation mechanisms have their own design space. A solution needs to satisfy the following criteria to achieve a secure and yet practical branch predictor design:

  • Enabling isolation between different processes and different privilege spaces;

  • Versatile to accommodate multiple branch predictors;

  • Lightweight implementation to minimize changes to existing branch predictors;

  • Tolerance of interference between SMT hardware threads.

Instead of isolating based on the flush or separated resources, we want to achieve isolation of the content in these hardware tables with some lightweight processing using randomization as follows.

  • Content encoding: We propose to use hardware-based thread-private random numbers to encode the contents of the branch predictor tables (both direction and destination histories). Specifically, the data is encoded (think XOR for now) with the key before being written into the table and decoded after being read from the table. This process is similar to the use of a key to encrypt data, where the key is a hardware-generated thread-private random number that changes periodically (upon context switches or privilege changes). Such a mechanism obfuscates the information, adding difficulty to cross-process or cross-privilege analysis and perception. It achieves a protection effect similar to Precise Flush, but adds little in terms of space or time overheads.

  • Index encoding: In addition to entry encoding, we also use randomization of indexes, which sets up additional obstacles for an attacker in all three steps of locating, priming, and probing the predictor. Index encoding also requires little change to conventional branch predictor designs and incurs low overheads.

Figure 4: Microarchitecture implementation of content encoding and index encoding. The red modules are designed for XOR-BP; the combination of the red and the green ones is for Noisy-XOR-BP. We take the Gshare architecture as the example to describe our design for PHT.

5 Lightweight Content Isolation Mechanisms

The general idea is straightforward: we transform both index and table content with thread-private keys; these keys change under certain conditions. We now discuss the implementation issues. We start with the example of BTB using the xor operation as the encoding/decoding (XOR-BTB).

5.1 Implementation of XOR-BTB

Each active hardware thread context will be allocated a thread-private random number as the key to encode or decode information stored in the table, such as tag and target address. The simplest coding operation is XOR. (We will see later that this can be exchanged for stronger isolation.)

Lookup: When predicting the target of a branch, the processor usually employs partial PC bits to index the BTB and compares the most significant bits of the PC with the tag of each way. If a match is found, the target address saved in the entry will be taken out as the target of the branch. In XOR-BTB, the higher bits of the PC are XORed with the Content Key for tag match (Figure 4 a). Finally, the stored target address (0x40bc9f21) is XORed with the Content Key to produce the actual predicted target (0x80004000). These additional steps are shown in red in the figure.

When a different thread, which holds a different key, executes a branch, it will not obtain the original tag or target address written by the first thread, because their keys differ. This provides a logical isolation of the content of the BTB among different threads.

Update: In a typical out-of-order processor, when an indirect branch instruction reaches the stage of execution, the actual destination will be compared to the predicted address. If they are different, it is considered as a BTB misprediction and the corresponding content in the BTB needs to be updated. In XOR-BTB, both the tag and the target address are encoded (XORed with the thread’s content key) before saved in the BTB. In Figure 4 (a), the actual target address 0x80004000 is XORed with the current content key (0xacbcdf21) to produce the encrypted address 0x40bc9f21, which is stored in the BTB.

Note that in XOR-BTB, the tag of BTB is also encoded lest an attacker could use performance counters as a covert channel to sense possible resource contention [evtyushkin2015covert]. For instance, Intel processors provide BTB-related performance counters [Intel2017perf], enabling an attacker to observe the case which hits the BTB but has a misprediction due to incorrect address.
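The lookup and update paths described above can be sketched as follows. This is a minimal, direct-mapped software model under assumed parameters (32-bit PCs, 512 entries, the low bits of one 32-bit key reused for the tag); `XorBTB`, `KEY_A`, and `KEY_B` are hypothetical names, and the target and first key are the values quoted in the text. Note that a plain 32-bit XOR of 0x80004000 with 0xacbcdf21 yields 0x2cbc9f21 rather than the 0x40bc9f21 quoted from Figure 4, so the sketch relies on the encode/decode round trip and not on that specific stored value.

```python
from dataclasses import dataclass

INDEX_BITS = 9                       # assumed: 512 direct-mapped entries
TAG_BITS = 32 - 2 - INDEX_BITS       # remaining bits of a 32-bit PC
TAG_MASK = (1 << TAG_BITS) - 1
ADDR_MASK = 0xFFFFFFFF


@dataclass
class Entry:
    valid: bool = False
    enc_tag: int = 0                 # tag XOR key, as stored in the table
    enc_target: int = 0              # target XOR key, as stored in the table


class XorBTB:
    def __init__(self):
        self.entries = [Entry() for _ in range(1 << INDEX_BITS)]

    @staticmethod
    def _split(pc):
        index = (pc >> 2) & ((1 << INDEX_BITS) - 1)
        tag = (pc >> (2 + INDEX_BITS)) & TAG_MASK
        return index, tag

    def update(self, pc, target, key):
        # Encode both tag and target with the thread's content key.
        index, tag = self._split(pc)
        self.entries[index] = Entry(True, tag ^ (key & TAG_MASK),
                                    (target ^ key) & ADDR_MASK)

    def lookup(self, pc, key):
        # Encode the PC tag with the current key before comparing,
        # then decode the stored target with the same key.
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry.valid and entry.enc_tag == tag ^ (key & TAG_MASK):
            return (entry.enc_target ^ key) & ADDR_MASK
        return None


btb = XorBTB()
KEY_A = 0xacbcdf21                   # content key quoted in the text
KEY_B = 0x13579bdf                   # hypothetical key of another thread
btb.update(0x80001230, 0x80004000, KEY_A)
```

With these values, `btb.lookup(0x80001230, KEY_A)` returns the original target 0x80004000, while `btb.lookup(0x80001230, KEY_B)` returns `None` because the encoded tags no longer match.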

5.2 Extending to other tables or predictors

The general idea discussed above is applicable to any table structure in a branch predictor, regardless of the specific algorithm involved and the detailed organization of the underlying tables. Also note that the encoding and decoding operations need not be done on a single logical entry of the table. For instance, in a simple 2-bit PHT, a logical entry contains only two bits. Using XOR on these two bits (XOR-PHT) may not yield sufficient obfuscation (Figure 4(b)). Instead, the encoding and decoding can operate on words of arbitrary length, regardless of the logical meaning of the bits (Enhanced-XOR-PHT). For instance, as shown in Figure 5, a 4K-entry PHT with 2 bits per entry can be considered as a 256-entry array of 32-bit words. We could then encode and decode these 32-bit words using 32-bit keys. Indeed, the physical implementation of the table using SRAM most likely uses a wider row already. An alternative view is that different logical entries nearby in the PHT use different keys for encoding/decoding.

Figure 5: Microarchitecture implementation of Enhanced-XOR-PHT.

Update: After the branch instruction is committed, the saturating counter needs to be updated, whether the prediction was correct or not. There are two typical ways to perform such an update, depending on whether a branch reorder buffer (BROB) is used to store the counter value.

  • If a BROB is used, we take the counter value from the buffer, update it, encode the updated value with the key, and store it to PHT;

  • If not, the original counter needs to be read out of the PHT (and decoded) first before being updated, re-encoded, and written back.
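The word-based encoding and the second update path (no BROB, so the counter is read, decoded, updated, re-encoded, and written back) can be sketched as below. The class name `EnhancedXorPHT` and its layout of sixteen counters per 32-bit word are assumptions chosen to match the 4K-entry, 256-word example; the key is assumed to be 32 bits wide.

```python
ENTRIES = 4096                        # 4K two-bit counters
PER_WORD = 16                         # 16 counters packed into one 32-bit word
WORD_MASK = 0xFFFFFFFF


class EnhancedXorPHT:
    def __init__(self):
        # 256 words holding encoded bits; the raw contents are only
        # meaningful after decoding with the key that wrote them.
        self.words = [0] * (ENTRIES // PER_WORD)

    def read(self, i, key):
        word = self.words[i // PER_WORD] ^ key        # decode the whole word
        return (word >> (2 * (i % PER_WORD))) & 0b11

    def update(self, i, taken, key):
        w, shift = i // PER_WORD, 2 * (i % PER_WORD)
        word = self.words[w] ^ key                    # read and decode
        ctr = (word >> shift) & 0b11
        ctr = min(3, ctr + 1) if taken else max(0, ctr - 1)
        word = (word & ~(0b11 << shift)) | (ctr << shift)
        self.words[w] = (word ^ key) & WORD_MASK      # re-encode, write back


pht = EnhancedXorPHT()
key = 0xacbcdf21                      # example 32-bit content key
for _ in range(3):
    pht.update(5, True, key)          # saturates entry 5 at Strong Taken
```

Because the whole word is XORed with the key, each 2-bit counter effectively sees its own two key bits, which is the "different keys for nearby entries" view mentioned earlier.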

5.3 Randomized index

The content randomization mechanism encrypts branch history contents, but the history information still resides at predictable locations in the tables. We propose to add index randomization (Noisy-XOR-BP), which further increases the noise and helps thwart attacks such as contention-based ones. As discussed in Section 2, an attacker would first locate the branch predictor entry corresponding to the target branch instruction (Locate phase), and then train or analyze that entry in the Prime and Probe phases. An important prerequisite for this process is that an attacker can locate the entry based on the address of the branch instruction. That is, the indexing mechanism of existing branch predictors is relatively fixed, which gives an attacker the opportunity to pinpoint the target entry.

The purpose of the Noisy-XOR-BP mechanism is to break the simple and fixed indexing mechanism of existing branch predictors. Like XOR-BP, Noisy-XOR-BP dynamically assigns another random index key private to each thread. The index key is XORed with the lower part of the PC to generate the index whenever there is any table lookup (green part in Figure 4). Since only the encoded index is used for actual table lookup, a contention-based attack will only obtain information of the encoded index.
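A minimal sketch of the index encoding follows, assuming a 12-bit index taken from the low PC bits; `plain_index` and `noisy_index` are hypothetical names, and a Gshare-style predictor would additionally fold in the global history.

```python
INDEX_BITS = 12                       # assumed table of 4K entries
INDEX_MASK = (1 << INDEX_BITS) - 1


def plain_index(pc):
    # Conventional fixed indexing by the lower part of the PC.
    return (pc >> 2) & INDEX_MASK


def noisy_index(pc, index_key):
    # Noisy-XOR-BP: XOR the thread-private index key into the index.
    return plain_index(pc) ^ (index_key & INDEX_MASK)
```

Under a single key the mapping is a permutation of the table, so a thread sees no additional conflicts among its own branches. Two threads with different index keys, however, map the same PC to different entries, so an attacker without the victim's key cannot tell which of its own branches alias with the target branch.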

Note that this index key must be updated in a timely manner. Otherwise, an attacker would still have the opportunity to eventually construct branches which share the same index as the target branch of the victim. In the subsequent description, both Noisy-XOR-BTB and Noisy-XOR-PHT include content and index encoding. In particular, Noisy-XOR-PHT encodes content with the Enhanced-XOR-PHT mechanism.

For the thread-private random number, generation, updating, and exception handling are the same as those used in content randomization. In practice, the hardware random number generator can generate a single random number whose different (possibly overlapping) portions are used as keys in content and index randomization.

5.4 Implementation issues

The isolation of content in our design depends on the security of the key. We need to prevent information leakage not only between different programs, but also between different privilege levels (such as user mode and kernel mode) when running the same program. Thus when a new thread is switched in or when the thread’s privilege level changes, we change the key as follows: We use a dedicated hardware register per hardware thread to record the key. Such a thread private register is invisible to software. Once a context switch or a privilege switch occurs, a new random number will be generated and updated to this private register. All subsequent accesses would use the updated key. Correspondingly, OS and hypervisor also have their own keys. Since there are many mature methods for hardware random generation [intel2014drng, liberty2013truerng], we assume these random numbers can be generated using a dedicated hardware mechanism.
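The key management described here can be modeled in software as follows. This is only a stand-in for the hardware register: `ThreadKeyRegister` and its method names are hypothetical, Python's `secrets` module plays the role of the hardware random number generator, and taking the index key from the top bits of the same random number is one assumed way of using overlapping portions of a single value.

```python
import secrets


class ThreadKeyRegister:
    """Model of the software-invisible, per-hardware-thread key register."""

    def __init__(self, content_bits=32, index_bits=12):
        self.content_bits = content_bits
        self.index_bits = index_bits
        self.rotate()

    def rotate(self):
        # One random number; the index key overlaps its top bits.
        r = secrets.randbits(self.content_bits)
        self.content_key = r
        self.index_key = r >> (self.content_bits - self.index_bits)

    # Both events simply install a fresh key; every later predictor
    # access by this hardware thread uses the new value.
    def on_context_switch(self):
        self.rotate()

    def on_privilege_change(self):
        self.rotate()


reg = ThreadKeyRegister()
reg.on_context_switch()
```

Nothing is flushed when the key rotates: entries written under the old key simply stop decoding to meaningful tags, counters, or targets.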

Now consider the SBPA attack on the BTB as an example. The attacker first primes target addresses in a certain set of the BTB. These addresses are XORed with the attacker's current private key before being stored in the BTB. The attacker then waits for a context switch that allows the victim's execution to leave a footprint in the primed set, and is later context-switched back for the Probe stage. At this point, the private key has been automatically changed by the hardware. The result is that the attacker will sense misses in the BTB for all the primed addresses, regardless of whether the victim evicted any entry in the set.

Finally, note that while we use XOR throughout the discussion, the only requirement for the encoding operation is that it is easily reversible, so that both encode and decode operations are lightweight enough not to cause critical path timing problems. Adding shifting and/or scrambling to the process, or using small lookup tables, are all possible options.
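The effect of the key change on this prime-and-probe sequence can be illustrated with a toy model of a single 4-way BTB set that stores encoded tags. The tag values, both keys, and the function names are hypothetical and fixed only to keep the example deterministic; in hardware the keys would be random.

```python
WAYS = 4
TAG_MASK = 0xFFFF                     # assumed 16-bit partial tag


def prime(btb_set, tags, key):
    # Attacker fills every way with its own tags, encoded under its key.
    btb_set[:] = [t ^ (key & TAG_MASK) for t in tags]


def probe_hits(btb_set, tags, key):
    # Count how many attacker branches still hit under the current key.
    return sum((t ^ (key & TAG_MASK)) in btb_set for t in tags)


def sbpa_round(victim_taken, rotate_key):
    key_prime = 0x1a2b                              # key during Prime
    key_probe = 0x77c4 if rotate_key else key_prime # key during Probe
    tags = [0x10, 0x11, 0x12, 0x13]
    btb_set = [None] * WAYS
    prime(btb_set, tags, key_prime)
    if victim_taken:
        btb_set[0] = 0xBEEF           # victim's encoded tag evicts one way
    return probe_hits(btb_set, tags, key_probe)
```

With a fixed key, `sbpa_round(False, False)` gives 4 hits and `sbpa_round(True, False)` gives 3, which is exactly the difference SBPA measures. With key rotation, both `sbpa_round(False, True)` and `sbpa_round(True, True)` give 0 hits, so the probe no longer depends on the victim's branch.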

5.5 Security analysis

| Structure | Defense Mechanism | Single-threaded core: Reuse | Single-threaded core: Contention | SMT core: Reuse | SMT core: Contention |
|---|---|---|---|---|---|
| BTB | Complete Flush | Defend | Defend | No Protection | No Protection |
| BTB | Precise Flush | Defend | Defend | Defend [1] | No Protection |
| BTB | XOR-BTB | Defend | Defend | Mitigate | No Protection |
| BTB | Noisy-XOR-BTB | Defend | Defend | Defend | Mitigate |
| PHT | Complete Flush | Defend | Defend | No Protection | Defend |
| PHT | Precise Flush | Defend | Defend | Defend | No Protection [2] |
| PHT | XOR-PHT | Mitigate | Defend | No Protection | Defend |
| PHT | Enhanced-XOR-PHT | Defend | Defend | Mitigate | Defend |
| PHT | Noisy-XOR-PHT | Defend | Defend | Mitigate | Defend |

[1] Precise Flush needs a thread ID, with which branches in different hardware threads cannot use each other’s history.

[2] Adding a thread ID brings significant hardware overhead for the PHT. Besides, there are contention based attacks on SMT.

Table 1: Security comparison.

Table 1 summarizes the security of different isolation mechanisms. Complete Flush and Precise Flush represent two costly and rather impractical mechanisms. Yet, because flushing only happens during context or privilege switches, they still fail to protect against certain attacks on SMT cores. In contrast, XOR-based isolation mechanisms are both more secure and more lightweight than these flush-based mechanisms. The specific analysis is detailed below.

  1. Security of Noisy-XOR-BTB

In this analysis, we assume a BTB is W-way set associative with S-bit set index and T-bit tag per entry. Three major scenarios are analyzed.

Scenario 1: Reuse based attacks on single-threaded and SMT cores can be defended. In our design, an entry left by one party is transformed in multiple ways before being used by another party, which makes it exceedingly difficult for a reuse-based attack to maintain controlled manipulation. Take malicious training for example. The attacker wants to lay traps in the BTB to direct the victim to a meaningful location. First, these entries need to result in a BTB hit to be even considered. For that, the (partial) tag left by the attacker has to match the encoded tag of a victim branch. With a T-bit tag and an unknown random key, the chance for one entry to have a BTB hit is $1/2^{T}$. The attacker can certainly lay many such traps, increasing the overall probability to some degree. But there is a second hurdle to clear: the content of the entry needs to lead to a meaningful location, one that contains the malicious code. Since the content is XORed with another unknown key, the probability of leading the victim to a specific address is $1/2^{N}$, N being the number of address bits. Again, the attacker can certainly prepare many such traps and extend the attack over many intervals. But overall, the chance of success is against a very large denominator, on the order of $2^{T+N}$.

The general analysis applies to the case of an SMT core as well. The slight advantage to an attacker there is that they can continue to re-plant new traps in the BTB, whereas in a single-threaded core, those entries are gradually evicted and the attack strength reduces with time in a context switch interval.
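The two hurdles in Scenario 1 can be illustrated with a minimal sketch of content encoding. The struct layout, field widths, and function names below are hypothetical and chosen only for illustration; the sketch assumes one thread-private key for the tag and another for the target, which is how the text above describes the two hurdles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical BTB entry; both fields are stored XOR-encoded. */
typedef struct {
    uint32_t tag;
    uint64_t target;
} btb_entry_t;

/* Hypothetical thread-private keys (assumed to come from a hardware RNG). */
typedef struct {
    uint32_t tag_key;
    uint64_t target_key;
} thread_keys_t;

/* Encode with the writer's keys before storing into the shared table. */
static btb_entry_t btb_encode(uint32_t tag, uint64_t target, thread_keys_t k) {
    btb_entry_t e = { tag ^ k.tag_key, target ^ k.target_key };
    return e;
}

/* Decode with the reader's keys; returns 1 on a tag hit, 0 on a miss. */
static int btb_lookup(btb_entry_t e, uint32_t tag, thread_keys_t k,
                      uint64_t *target) {
    assert(target != NULL);
    if ((e.tag ^ k.tag_key) != tag) {
        return 0;                      /* first hurdle: tag must match */
    }
    *target = e.target ^ k.target_key; /* second hurdle: decoded target */
    return 1;
}
```

An entry written and read under the same keys decodes to the original target, while a reader holding different keys sees a mismatching tag, or, on an accidental tag match, a target scrambled by the key difference.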

Scenario 2: Contention based attacks on single-threaded cores can be defended. Since there are context switches between prime phase and probe phase, a thread’s own history in the previous phase becomes unrecognizable in the latter phase. Thus the attacker cannot sense the conflict caused by the target branch.

Scenario 3: Contention based attacks on SMT cores can be mitigated. In a contention-based attack carried out on a conventional SMT core, an attack like Jump over ASLR [evtyushkin2016jump] can observe evictions of its own entries in the BTB and thus infer the address of a taken branch executed by the victim. In our system, each hardware thread has its own private key for indexing. Without the key of the victim thread, the attacker can no longer infer the branch address.
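A minimal sketch of keyed indexing, under assumptions, shows why an observed eviction no longer reveals the victim's branch address. The function name, the choice of dropping two alignment bits, and the plain XOR mixing are illustrative assumptions; the actual index function of Noisy-XOR-BP may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Compute an S-bit set index from the branch PC and a thread-private key.
 * set_bits corresponds to S in the analysis above. */
static uint32_t btb_set_index(uint64_t pc, uint32_t index_key,
                              unsigned set_bits) {
    assert(set_bits > 0 && set_bits < 32);
    uint32_t mask = (1u << set_bits) - 1u;
    uint32_t pc_bits = (uint32_t)(pc >> 2) & mask; /* assumed alignment */
    return (pc_bits ^ index_key) & mask;           /* key scrambles the set */
}
```

With `set_bits = 7`, the PC `0x80001234` maps to set 13 under key 0 and to set 88 under key `0x55`, so two threads with different keys place the same branch in different sets.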

However, it is conceivable to extend certain attacks such as SBPA as follows. The attacker can prime the entire BTB with its own entries and invoke single-step mode to force the victim to execute a single branch of interest. Consequently, the mere change in any BTB content indicates a BTB update due to a taken branch. In such a hypothetical case, no encryption of the content or the index of the BTB can be of any help, as the attacker is only sensing the fact that there is an update to the BTB, without any interest in the content or index of the update. We note that this type of attack should be handled differently altogether and is beyond the scope of this paper. We only highlight one possible approach here to suggest that it can also be thwarted. The attack described above reduces the demand on the information content from the BTB (or PHT) but relies on highly precise control of the victim's execution (a single branch). A reasonable countermeasure is for the system to detect an extreme reduction of execution speed, and subsequently bypass updates of any microarchitectural resources completely, as these updates are unlikely to matter for execution speed.

 1  /* Shared pointer and function */
 2  static void (*p)();  // A function pointer shared by
 3                       // the attacker thread and the victim thread
 4  void shared_interface(){
 5    p();  // Execution history is recorded in BTB
 6  }
 7  /* ---------------------------------------------- */
 8  /* Attacker Thread */
 9  { // p points to attacker_function() in attacker thread
10    shared_interface();  // Train to execute attacker's p
11    clflush(p);
12    clflush(&side_line); // FLUSH side_line
13    sleep(1);            // Switch to victim thread in Line 22
14    a=side_line;         // Assess the time to RELOAD side_line
15  }
16  void attacker_function(){
17    a=side_line;         // Leave observable traces
18  }
19  /* ---------------------------------------------- */
20  /* Victim Thread */
21  { // p points to victim_function() in victim thread
22    shared_interface();
23
24  }
25  void victim_function(){
26    a=sec;
27  }
Listing 6: PoC code piece of BTB attacks.
  1. Security of Noisy-XOR-PHT

As mentioned in Section 2, there is no contention based attack on the PHT because a new branch outcome updates the older history in place rather than evicting it. Security for two typical scenarios is analyzed as follows.

Scenario 4: Reuse based attacks on single-threaded cores can be defended. First, the attacker cannot manipulate the victim deterministically to execute the wrong path speculatively. Because the private key changes at each context switch, the attacker cannot train the status to any deterministic direction for the target victim thread. Second, Noisy-XOR-PHT poses strict requirements on attacks that try to perceive the direction of the target branch. An attacker can no longer infer the direction through one Prime-Probe operation.

There is one corner case: if an attacker can find a reference branch which employs the same private key as the target branch, and if this reference branch’s direction is easily known (e.g., it is biased), then the attacker can indirectly infer the target branch’s direction. Finding such a reference branch is easy if we use a simple XOR scheme with a fixed key width. The root cause is the fixed mapping relationship between the branch instruction address and content keys index within a thread. Breaking this fixed mapping relationship can be done in a number of relatively simple ways while still allowing easy decoding for normal use. For instance, the index can be selected from PC dynamically and shifted based on the index key.
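One way to break the fixed mapping described above can be sketched as follows. This is only an illustration of the last sentence, not the paper's circuit: the 64-bit content key split into 32 two-bit slices, the function name, and the use of the index key as a shift amount are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define KEY_SLOTS 32u /* assumed: 64-bit content key = 32 two-bit slices */

/* Select the 2-bit content-key slice used for the branch at pc.
 * Which PC bits choose the slice depends on the thread-private index key,
 * so the PC-to-slice mapping changes whenever the keys change. */
static uint8_t content_key_slice(uint64_t pc, uint64_t content_key,
                                 uint32_t index_key) {
    unsigned shift = index_key % 32u;                  /* key-dependent shift */
    unsigned slot = (unsigned)((pc >> shift) % KEY_SLOTS);
    assert(2u * slot < 64u);
    return (uint8_t)((content_key >> (2u * slot)) & 0x3u);
}
```

Under this sketch, two branches that share a key slice under one index key generally stop sharing it after the index key changes, which makes a stable reference branch harder to find.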

Scenario 5: Reuse based attacks on SMT cores can be mitigated. In XOR-PHT, an attacker may obtain sensitive information by analyzing the update sequences of the target branch only. With Noisy-XOR-PHT, the attacker needs to traverse every entry. This traversal prolongs the Prime-Probe operation, which decreases the information leakage bandwidth. Furthermore, existing attacks rely on signals, interrupts, and exceptions to achieve high resolution [evtyushkin2016jump, lee2017inferring, evtyushkin2018branchscope]. All of them involve kernel operations and thus a change of the private key, which effectively defends against these attacks.

 1  /* Shared function */
 2  void shared_interface(int i){
 3    if(i<array_size) // Accessible to attacker and victim
 4      a=sec;
 5    else
 6      a=side_line;   // Leave observable traces
 7  }
 8  /* ---------------------------------------------- */
 9  /* Attacker Thread */
10  {
11    for(int i=array_size;i<2000+array_size;i++){
12      shared_interface(i); // Train Line 3 to Not Taken
13    }
14
15    clflush(&array_size);
16    clflush(&side_line);   // FLUSH side_line
17
18    sleep(1);      // Switch to victim thread to Line 24
19    a=side_line;   // Assess the time to RELOAD side_line
20  }
21  /* ---------------------------------------------- */
22  /* Victim Thread */
23  {
24
25    shared_interface(x);
26
27  }
Listing 7: PoC code piece of PHT attacks.
Figure 8: Implementation of Noisy-XOR-PHT on two example predictors. All the tables here share one index key and one content key, although each table can also have its own index key and content key. (a) Tournament: the first level holds 11 bits of branch pattern history for up to 2048 branches. This 11-bit pattern picks one of 2048 prediction counters. The global predictor is an 8192-entry table of 2-bit saturating counters indexed by the path (or global) history of the last 12 branches. The choice prediction, or chooser, is also an 8192-entry table of 2-bit prediction counters indexed by the path history. (b) TAGE_SC_L: the TAGE component consists of a base predictor (16Kbits prediction, 4Kbits hysteresis) and two bank-interleaved tagged tables featuring respectively ten 12-bit, 1K-entry banks and twenty 16-bit, 1K-entry banks. The respective tag widths are 8 and 11 bits. It also uses a 3000-bit global history length, a 27-bit global path length, sixteen 5-bit USEALT counters, and a 19-bit counter for monitoring the allocation policy. The loop predictor features 256 entries and is 4-way associative (256 x 52 bits). The Multi-GEHL statistical corrector consists of seven GEHL-like components respectively indexed using the global conditional branch history (two tables), the global history of the backward branches (two tables), an IMLI counter indexed table, another IMLI-based GEHL predictor and three local history GEHL components respectively with 256-entry, 16-entry, and 16-entry local history.
  1. Attack & defense experiments

To evaluate the effectiveness of our mechanism, we conduct experiments with proof-of-concept (PoC) attacks for BTB and PHT respectively on our FPGA-based processor prototype (its configuration is introduced in Section 6). The PoC codes (detailed in Listings 6 and 7) proceed in a realistic scenario: cross-thread, within the same address domain.

In our experiment, we repeat the attack for 10,000 iterations. In the PHT attack, we consider 100 attempts of training as one iteration. A successful attack means that the victim branch jumps to the trained direction more than 90 times. For the baseline processor without any defense mechanism, the accuracy of training BTB and PHT is 96.5% and 97.2%, respectively. The accuracy on an Intel E5-2697 v4 processor is higher than 99.9%. With XOR-based isolation, the accuracy of training both BTB and PHT decreases to less than 1%.¹ In summary, our mechanism introduces effective protection against these attacks.

¹ The 1% of apparently successful attacks stem from the limitations of the RISC-V experimental platform and software noise. For example, we determine the success of an attack by observing Flush+Reload cache side channels. However, flushing a cache line precisely is not supported in the RISC-V instruction set, so we evict the whole cache with large arrays. This introduces false-positive measurement noise on successful attacks. However, an adversary cannot exploit this noise to construct attacks.

6 Evaluation

6.1 Methodology

Parameter       FPGA prototype                    Gem5 simulation
ISA             RISC-V                            ALPHA
Frequency       2GHz (FPGA runs at 50MHz)         2.5GHz
Processor type  4-decode, 4-issue, 4-commit       8-decode, 8-issue, 8-commit
Pipeline depth  10 stages                         19 stages
ROB/LDQ/STQ     64/16/16 entries                  352/128/72 entries
Issue Queue     20/16/10 (mem/int/flt)            120
BTB             256 2-way                         1024 4-way
PHT             TAGE: 33 KB,                      TAGE_SC_L: 66.6KB
                6 × 4096 entries,                 or LTAGE: 32KB
                history length:                   or Tournament: 6.3KB
                12, 27, 44, 63, 90, 130           or Gshare: 2KB
ITLB/DTLB       8/8 entries                       64/64 entries
L1 ICache       32KB, 8-way, 64B line             32KB, 4-way, 64B line
L1 DCache       32KB, 8-way, 64B line             48KB, 4-way, 64B line
L2 Cache        1MB, 16-way, 64B line             512KB, 16-way, 64B line
L3 Cache        None                              4MB, 32-way, 64B line
Table 2: OoO Processor Core Configurations.

We built an FPGA prototype based on the open-source Berkeley Out-of-Order RISC-V processor (BOOM) as the main platform for experimental analyses [AsanovicBOOM, WatermanRISCV]. The main parameters of this system are shown in Table 2. On this platform, we implemented the XOR-BP and Noisy-XOR-BP mechanisms to evaluate the performance impacts. Since our processor prototype does not yet support SMT or some branch predictors, we also modeled an out-of-order SMT processor using the cycle-level Gem5 simulator [binkert2011gem5]. This SMT core is modeled after the latest Intel Sunny Cove core [SunnyCove] as shown in Table 2. In addition to Gshare, we experimented with three different branch predictors: Tournament [kessler1999the], LTAGE [seznec_LTAGE], and TAGE_SC_L [seznec_TAGE_SC_L]. The modifications of the Tournament and TAGE_SC_L predictors with Noisy-XOR-PHT are shown in Figure 8. In configuring TAGE_SC_L, we increased the size of the base and the loop predictor moderately as that improves prediction accuracy in our benchmarks.

On the RISC-V based FPGA platform, we randomly selected 12 combinations (case1 to case12 for the single-threaded core in Table 3) from the SPEC CPU 2006 benchmark suite [henning2006spec], each consisting of a target benchmark (first in the entry) and a background benchmark (second in the entry) to build the context switch scenario. Using the train input set, we run these combinations on the Linux operating system and evaluate performance by measuring the execution time of the target benchmark. Our Linux operating system uses the default context switching frequency of 250Hz.

For the simulated SMT-2 platform, we randomly selected 12 pairs of applications from the SPEC CPU 2006 benchmark suite (column SMT-2 in Table 3) to run concurrently. These benchmarks run in the mode of System Call Emulation (SE) with the typical context-switching frequency in Linux. Two billion instructions are used for warm-up, and then we count the execution cycles of the next two billion instructions executed by either thread.

Experimental setups with different isolation mechanisms are named as follows:

Baseline: The original OoO processor using branch predictor without any isolation mechanism.

IsolationMechanism-nM: The name of corresponding isolation mechanism is followed by the period of context switch in cycles due to timer interrupt. The standard Linux switches context every 4 milliseconds (or 8 million 2GHz CPU cycles). For example, XOR-BP-8M represents XOR-BP when the thread is switched every 8 million CPU cycles.

PredictorName-FlushMechanism: The name of the branch predictor is followed by the flush mechanism, which is either CF for Complete Flush or PF for Precise Flush. For example, Gshare-CF represents a processor using the Gshare branch predictor and the Complete Flush isolation mechanism.
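The nM suffix in the naming scheme above follows directly from the core frequency and the timer frequency. The helper below is a small sketch of that arithmetic; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two timer-driven context switches. */
static uint64_t cycles_per_switch(uint64_t core_hz, uint64_t switch_hz) {
    assert(switch_hz != 0);
    return core_hz / switch_hz;
}
```

For a 2GHz core and the default 250Hz timer (one switch every 4 milliseconds), `cycles_per_switch(2000000000, 250)` yields 8,000,000 cycles, which is the 8M in XOR-BP-8M.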

Test Number  Single-threaded core   SMT-2
case1        gcc+calculix           zeusmp+lbm
case2        milc+povray            zeusmp+dealII
case3        bzip2_source+soplex    bwaves+milc
case4        namd+sphinx3           leslie3d+gromacs
case5        hmmer+GemsFDTD         dealII+sjeng
case6        gobmk+libquantum       gromacs+astar
case7        gromacs+GemsFDTD       gobmk+h264ref
case8        mcf+astar              libquantum+milc
case9        soplex+hmmer           gobmk+gromacs
case10       libquantum+calculix    milc+bzip2_source
case11       mcf+perlbench          libquantum+omnetpp
case12       bwaves+namd            zeusmp+gobmk
Table 3: Benchmark sets.

6.2 Evaluation on FPGA-based single-threaded RISC-V core

6.2.1 XOR-BP performance impacts

When an application resumes its execution after being swapped out and back again on Baseline, it will benefit from its residual state in the BTB. With XOR-BTB mechanism, however, each context switch results in a change of the content key. The branch predictor is thus unable to correctly decode any residual state. This in general leads to a performance loss relative to Baseline. Keep in mind, though, that in some special cases, the loss of residual state can actually improve performance as we will see later. Below, we evaluate the performance impacts of XOR-BTB, XOR-PHT, and their combination at different context switching frequencies.

Figure 9: Performance overhead of XOR-BTB and Noisy-XOR-BTB.

As shown in Figure 9, compared with Baseline, the average performance loss of XOR-BTB is less than 0.2%. Noisy-XOR-BTB, which adds a randomized index, introduces no additional performance loss. The highest performance loss approaches 1.0% in case6 (the combination of gobmk and libquantum). For this pair, there are many more residual BTB entries (between 500 and 800 entries) upon switching back than for, say, the namd and sphinx3 pair (30 to 300 entries). These entries contribute to the high prediction accuracy of the BTB, which reaches 85.2% and 99.3% while running on Baseline. For such cases, the effect of flushing the BTB (with a key change) usually results in a noticeable performance penalty.

For case2 (the combination of milc and povray), flushing the BTB has the unusual effect of actually improving the execution speed. It turns out, while the miss rate of BTB increases due to flushing, it helps to overturn some incorrect direction predictions as the processor in our implementation simply reverts to fall-through prediction when the target is unavailable.

Figure 10: Performance overhead of XOR-PHT and Noisy-XOR-PHT.

The performance overhead of XOR-PHT is shown in Figure 10, where we see that the average overhead is less than 1.1%. The performance loss decreases gradually as the context switch interval increases. Following a key change, the same thread faces essentially randomized PHT content. However, because a 2-bit counter takes only a short time to warm up, the average performance overhead is relatively insignificant. For example, the static conditional branch instruction ratios of case1 (gcc+calculix) are relatively high at 12.1% and 8.1% respectively, and their PHT prediction accuracies are 90.1% and 94.0% respectively, which explains why the performance loss of this combination is the highest of our twelve cases. Another example is case7 on the single-threaded core. The proportions of static conditional branch instructions are just 4.8% and 7.6% for gromacs and GemsFDTD respectively, and the accuracy of PHT prediction is 88.9% for gromacs. Thus, every time the context switches, gromacs overwrites the results of GemsFDTD training, so the XOR-PHT mechanism has little impact on this combination.
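The short warm-up of a 2-bit counter after a key change can be checked with a small model. The sketch below assumes a standard 2-bit saturating counter (values 2 and 3 predict taken) stored XORed with a 2-bit slice of the content key; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Standard 2-bit saturating counter update. */
static uint8_t ctr_update(uint8_t ctr, int taken) {
    assert(ctr <= 3);
    if (taken) return (uint8_t)(ctr < 3 ? ctr + 1 : 3);
    return (uint8_t)(ctr > 0 ? ctr - 1 : 0);
}

/* The table stores counter XOR key; reads decode with the current key. */
static uint8_t pht_write(uint8_t ctr, uint8_t key) { return (ctr ^ key) & 0x3; }
static uint8_t pht_read(uint8_t stored, uint8_t key) { return (stored ^ key) & 0x3; }

/* Count taken outcomes needed until the entry predicts taken again,
 * when it was written under old_key and is now read under new_key. */
static int warmup_updates(uint8_t ctr, uint8_t old_key, uint8_t new_key) {
    uint8_t stored = pht_write(ctr, old_key);
    int n = 0;
    while (pht_read(stored, new_key) < 2) {
        uint8_t next = ctr_update(pht_read(stored, new_key), 1);
        stored = pht_write(next, new_key);
        n++;
    }
    return n;
}
```

In this model a strongly-taken counter that decodes to strongly-not-taken after a key change needs two taken outcomes to recover, and no combination of counter value and keys needs more than two.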

Figure 11: Performance overhead of XOR-BP and Noisy-XOR-BP.

When combining the protection on BTB and the direction predictor, the performance impact is largely additive. The performance impact of the resulting configuration Noisy-XOR-BP is shown in Figure 11. Overall, the average performance loss is less than 1.3%. The largest performance loss is about 2.5% (case1), largely due to cost in PHT isolation seen earlier.

Overall, the additional impact of the XOR de/encoding mechanism is very small, and the main performance impact comes from PHT isolation. It should be noted that this low performance overhead is similar to the results shown in Figure 1, and that the flush-based approach is not expensive for a single-threaded core.

Test     Number of            Test     Number of
Number   privilege switches   Number   privilege switches
case1    4.9                  case7    1.7
case2    7.0                  case8    2.0
case3    1.9                  case9    1.8
case4    2.0                  case10   2.7
case5    1.7                  case11   3.5
case6    1.6                  case12   1.9
Table 4: The number of privilege switches per million cycles.

6.2.2 Breakdown of two types of switches

Taking Noisy-XOR-BP-12M as an example, Table 4 shows the number of privilege changes per million cycles. We found that the number of privilege changes is much larger than the number of context switches (0.08 per million cycles). Similar phenomena can be observed in other cases. It is not difficult to understand that within a time slice, the execution of the program may incur multiple privilege changes due to system calls, exception handling, etc. This indicates that the main factor in determining the impact of isolation may be the characteristics of the program itself, not the timer interrupt frequency setting. This also explains one observation in Figure 11: as the timer switch frequency changes, there is no significant fluctuation in the performance loss of Noisy-XOR-BP for most test cases.
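As a quick check of the comparison above, the context switch rate for a 12M-cycle interval and its ratio to the lowest (case6) and highest (case2) privilege switch rates in Table 4 work out as follows:

```latex
\[
\frac{10^{6}}{12 \times 10^{6}} \approx 0.083 \ \text{context switches per million cycles}
\]
\[
\frac{1.6}{0.083} \approx 19, \qquad \frac{7.0}{0.083} \approx 84
\]
```

Privilege switches are therefore roughly 19 to 84 times more frequent than timer-driven context switches in this configuration.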

Figure 12: Performance cost of three isolation mechanisms on four different predictors on an SMT core. CF and PF refer to complete and precise flush, respectively.

6.3 Simulation-based evaluation on SMT cores

Figure 12 shows performance impacts of three defence mechanisms (Complete flush, Precise flush, and Noisy-XOR-BP) on four different branch predictors (Gshare, Tournament, LTAGE, and TAGE-SC-L, with a measured baseline MPKI of 8.45, 5.17, 4.10, and 3.99 respectively). Each bar is showing the performance degradation compared to the same predictor without any protection. Three observations can be made.

  1. There is a non-trivial range of performance impacts. In some cases, the combination of an application's sensitivity to branch predictor behavior and the frequency of key changes can result in more than 20% performance degradation. But on average, the performance cost for protection, at a few percent, is quite reasonable.

  2. Although there are exceptions, in general, Noisy-XOR-BP incurs lower performance impact than both flush mechanisms. Compared to complete flush, performance loss due to Noisy-XOR-BP is 26 to 37% lower. Keep in mind that the flush mechanisms are more costly to implement and provide less protection against attacks than our proposal.

  3. Systems with a more accurate predictor tend to show more performance impact due to protection. But on average, the increase is not dramatic: it goes from 2.3% for the least accurate predictor to 4.9% for the most accurate.

6.4 Hardware Cost Estimation

XOR-BP and Noisy-XOR-BTB are implemented at Register-Transfer Level (RTL) on the RISC-V processor. Based on TSMC 28nm technology, we use the Synopsys ASIC design flow and synthesis tools to assess the timing and area cost. Table 5 shows the timing and area overhead of Noisy-XOR-BP with different configurations, compared with the original BTB and PHT. It can be seen that Noisy-XOR-BP has minor area and timing cost. For example, for a 2-way BTB with 256 entries per way, the timing cost of Noisy-XOR-BTB increases by 0.94% and the area cost by 0.15% (synthesized at the TT corner using Design Compiler).

BTB (2w128 means 2-way with 128 entries in each way)
         2w128          2w256          2w512
Timing   0.70%          0.94%          1.46%
Area     0.24%          0.15%          0.13%

PHT (TAGE Predictor)
         1024 entries   2048 entries   4096 entries
         per table      per table      per table
Timing   2.10%          1.98%          2.01%
Area     0.11%          0.09%          0.03%
Table 5: Area and timing evaluation.

7 Related Work

7.1 Comparison with cache side channel attacks

There are two types of cache side channels: contention based channels and reuse based channels [liu2014random]. For contention based channels, there is a meaningful map between addresses and entries in the cache or table that allows the attacker to achieve manipulation. Randomizing the mapping, therefore, is one way to raise the barrier to attack. RPCache [wang2007new], CEASER [qureshi2018ceaser], and ScatterCache [wernerscattercache] all use some form of randomization. For a reuse based channel, a typical attack is Flush+Reload [yarom2014flush]. By flushing specific lines, the attacker can force lengthy memory accesses on certain addresses shared between the attacker and the victim. Countermeasures to such attacks include limiting the use of flush or de-correlating caching from demand access via Random Fill Cache [liu2014random].

Unlike caches, branch prediction tables need both index and content encoding. The content encoding only affects the accuracy of the branch predictor, not the correctness of the programs.

7.2 Countermeasures for branch predictor attacks

Several countermeasures have been proposed. First, we can reduce information leakage through the side channel. For instance, branches that involve sensitive information can be transformed into safe instructions that do not leave a mark in the branch predictors [agosta2007countermeasures]. Limiting performance counter usage can reduce the information obtained by the attacker [guide2011intel]. InvisiSpec [yan2018invisispec], SafeSpec [khasawneh2019safespec], MuonTrap [sam2020muontrap], Conditional Speculation [li2019conditional], Efficient Invisible Speculative Execution [sakalis2019efficient], CleanupSpec [saileshwar2019cleanupspec], STT [yu2019speculative], ConTExT [schwarz2020context], Speculative Data-Oblivious Execution [jiyong2020speculative] and SpecShield [barber2019specshield] prevent speculative computation from generating visible microarchitectural state. Alternatively, Retpoline [turner2018retpoline], FENCE [Intel2018Mitigations, ARM2018Mitigations, AMD2018Mitigations], and Context-Sensitive Fencing [taram2019context] reduce the chance that untrusted code is speculatively executed.

Second, the predictor table can be flushed to contain randomized new results. Performing this by software during context switch can bring non-trivial overhead [evtyushkin2016understanding]. Such expensive operations can be limited to only the sensitive processes [hu1992lattice]. The impact on performance and prediction accuracy of flushing predictor table in hardware has been studied [evers1996using, pasricha2003improving]. Not surprisingly, the longer the context switch interval, the smaller the impacts. These observations are consistent with ours.

Third, using more dedicated hardware is a general approach to isolate states from different processes. For example, sensitive applications in SGX can be allocated with their own branch predictor tables [evtyushkin2018branchscope]. Earlier work on performance improvement considered saving and restoring compressed branch prediction information [dhodapkar2001saving] or providing thread-private branch predictors on SMT processors [ramsay2003exploring]. BRB is a proposal to retain partial predictor state in on-chip SRAM to swap in with context [vougioukas2019brb].

Finally, detecting (and subsequently freezing or killing) malicious processes is a possible general defense mechanism [evtyushkin2018branchscope]. While some specific methods have been discussed [gianvecchio2007detecting], accurately detecting malicious processes remains a difficult, not-yet-practical approach.

8 Conclusion

Branch prediction is crucial in modern high-performance microprocessors. But the basic design that allows sharing of branch predictors between threads leaves opportunities for malicious training and perception.

Instead of traditional approaches based on flushing or using separate hardware resources, this paper explores isolation of the content in hardware tables via (1) content and (2) index encoding. In content encoding, we use hardware-based thread-private random numbers to encode both direction and destination histories. With index encoding, we use another thread-private random number as part of the input to compute the index of the tables, disrupting the correspondence between the branch instruction address and its table entry. The combination of these actions achieves excellent logical isolation of branch histories between threads and privilege levels while adding little in terms of circuit area or delay of the critical circuit path.

Our evaluation using an FPGA-based RISC-V processor prototype and additional auxiliary simulations demonstrates that the proposed mechanism adds less than 5% slowdown on average. Compared to flush-based protection mechanisms that offer less protection in SMT cores and can incur far more circuit area, our design reduces the performance penalty by about 20-50% depending on the branch predictor.

References