You Shall Not Bypass: Employing data dependencies to prevent Bounds Check Bypass

05/22/2018 ∙ by Oleksii Oleksenko, et al. ∙ 0

A recently discovered class of microarchitectural attacks called Spectre has drawn the attention of the security community because these attacks can overcome many traditional defense mechanisms, such as bounds checking. One of the attacks, Bounds Check Bypass, cannot be efficiently mitigated at either the system or the architectural level and requires changes in the application itself. So far, the proposed mitigations have involved serialization, which reduces the usage of CPU resources and causes high overheads. In this work, we propose a method that delays only the vulnerable instructions, without the need to completely serialize execution. Our prototype, implemented as an LLVM pass, causes 60% overhead, compared to 440% for full serialization.


1 Introduction

In 2017, multiple research groups independently discovered a new class of microarchitectural attacks, later called Spectre [4]. These attacks target speculative execution, a feature of modern processors that improves CPU utilization by executing certain code paths speculatively, before the CPU knows which of the paths is correct. For example, if an application has a conditional jump, the CPU could start executing one of the branches before it knows the value of the condition. It may later find out that the prediction was wrong, at which point it will discard the computed results. However, the CPU will not discard the changes in the microarchitectural state, including the cached data. The Spectre attacks take advantage of this property to circumvent the existing protection mechanisms and leak secret data.

The original Spectre paper [4] described two attacks: Bounds Check Bypass (BCB) and Branch Target Injection (BTI). While the latter attack has been patched by a microcode update, the former remains unfixed. BCB is viewed as an application vulnerability, and Intel explicitly states that its mitigation is the responsibility of software developers [2].

To mitigate Bounds Check Bypass, Intel official guidelines [2] suggest using the LFENCE instruction as an explicit serialization point. The application developer has to identify vulnerable parts of the application and manually harden them with LFENCEs to prevent speculation. However, as demonstrated by the long history of memory errors, vulnerabilities can stay unnoticed (and unpatched) for a long time.

To eliminate all opportunities for an attack, we have to protect the entire application. A naive way of doing so would be to add LFENCEs after every conditional branch. Although effective from the security standpoint, this approach significantly reduces CPU utilization and causes high overheads. Our experiments show a runtime overhead of up to 440% on Phoenix benchmarks [5].

    1  i = input[0];
    2  if (i < size) {
    3      secret = foo[i];
    4
    5      baz = bar[secret]; }

(a) Example in C

    a1 = f0(input)
    if (condition)
        secret = read(a1)
        a2 = f2(secret)
        read_or_write(a2)

(b) Generalized code pattern

Figure 1: Code snippets vulnerable to Bounds Check Bypass.

In this report, we discuss and evaluate approaches to preventing speculation on a more fine-grained level: using data dependencies.

     1  i = input[0];
     2
     3
     4  if (i < 42) {
     5
     6
     7
     8      address = i * 8;
     9      secret = *address;
    10
    11
    12      baz = 100;
    13      baz += *secret;}

(a) Vulnerable code

     1  i = input[0];
     2
     3
     4  if (i < 42) {
     5      LFENCE;
     6
     7
     8      address = i * 8;
     9      secret = *address;
    10
    11
    12      baz = 100;
    13      baz += *secret;}

(b) LFENCE-based serialization

     1  i = input[0];
     2
     3  PUSH rax;
     4  if (i < 42) {
     5      LAHF;
     6      XOR rax, r15;
     7      POP rax;
     8      address = i * 8;
     9      secret = *address;
    10      XOR r15, secret;
    11      XOR r15, secret;
    12      baz = 100;
    13      baz += *secret;}

(c) LAHF-based data dependency

     1  i = input[0];
     2  all_ones = 0xFFFF…;
     3  mask = all_ones;
     4  if (i < 42) {
     5      CMOVGE 0, mask;
     6
     7
     8      address = i * 8;
     9      secret = *address;
    10      secret &= mask;
    11
    12      baz = 100;
    13      baz += *secret;}

(d) Speculative load hardening

     1  i = input[0];
     2
     3  XOR i, r15;
     4  if (i < 42) {
     5
     6
     7
     8      address = i * 8;
     9      secret = *address;
    10      XOR r15, secret;
    11      XOR r15, secret;
    12      baz = 100;
    13      baz += *secret;}

(e) Dependency on arguments

Figure: Approaches to preventing Bounds Check Bypass. Listings (c) and (e) assume that R15 is exclusively reserved for creating data dependencies.

2 Bounds Check Bypass

In essence, Bounds Check Bypass is a buffer over- or under-read that succeeds even in the presence of traditional protection measures. Consider the code snippet in Figure 1a: Without the bounds check on line 2, an adversary with control over the input can force the load on line 3 to read from any address, including those beyond the array foo. Traditionally, vulnerabilities of this type were mitigated with bounds checks (such as the one on line 2) that permit the access only if the address is within the object bounds.

However, there is an issue with this approach. The underlying assumption of bounds checking is that the instructions run in order, which is not the case in modern pipelined CPUs with branch prediction. Such a CPU can run the check in parallel with the vulnerable load if it predicts that the check is not likely to fail. Later, it will find out that the prediction was wrong and discard the speculated load, but, as Spectre [4] has shown, its cache traces will stay. The adversary can access the traces by launching a side-channel attack [6, 7].

Since side-channel attacks reveal only the accessed address and not the loaded value, the load on line 3 is not sufficient. In our example, only the second load (line 5) will leak the secret: The adversary will observe an access to bar[secret], after which deriving the value of secret is only a matter of subtraction.

In summary, the vulnerable pattern (see Figure 1b) consists of an adversary-controlled load (line 3) followed by a memory access (line 5) based on the loaded value. A runtime check (line 2) must protect the load; otherwise, the pattern turns into a traditional buffer overflow.

Figure 2: Performance (runtime) overhead with respect to native version. (Lower is better.)

Figure 3: IPC (instructions/cycle) numbers for native and protected versions. (Higher is better.)

Figure 4: Increase in number of instructions with respect to native version. (Lower is better.)
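The pattern from Figure 1a can be written out as a small C function. This is only a sketch for illustration: the array sizes, the function name `victim`, and the demo values are assumptions that do not come from the paper. Architecturally the function is correct (an out-of-bounds index never reaches the loads); the leak arises only when the CPU runs both loads speculatively before the check resolves, which plain C cannot show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical data for illustration; names and sizes are not from the paper. */
#define FOO_SIZE 16
static uint8_t foo[FOO_SIZE];
static uint8_t bar[256];

/* Returns bar[foo[i]] if i is in bounds, otherwise 0. */
uint8_t victim(size_t i) {
    uint8_t baz = 0;
    if (i < FOO_SIZE) {           /* bounds check (line 2 of Figure 1a) */
        uint8_t secret = foo[i];  /* first load (line 3): may run speculatively */
        baz = bar[secret];        /* second load (line 5): address depends on secret */
    }
    return baz;
}

/* Usage example: the architectural result is always safe. */
void victim_demo(void) {
    foo[3] = 7;
    bar[7] = 42;
    assert(victim(3) == 42);             /* in-bounds access */
    assert(victim(FOO_SIZE + 100) == 0); /* out-of-bounds access is rejected */
}
```

The second load is what makes the gadget exploitable: the cache line touched in `bar` encodes the value of `secret`, even if the result is later discarded.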

3 Preventing speculation

The most straightforward way of defending against Bounds Check Bypass is to prevent the speculation itself. Consider the example in listing (a) of the figure on preventing Bounds Check Bypass: the illegal memory access will never happen if the read (line 9) runs strictly after the comparison (line 4). There are two alternative approaches to enforcing this property: completely preventing speculation via serialization instructions, and delaying loads by adding artificial data dependencies.

3.1 Serialization

Intel documentation [2] proposes to patch vulnerable regions of code by explicitly serializing them with an LFENCE, an instruction that ensures that all prior instructions execute before it and all later instructions execute after it (see footnote 1). We can force the load to wait for the comparison by adding an LFENCE in between (see listing (b) of the figure on preventing Bounds Check Bypass). To protect the application entirely, we would have to add an LFENCE after every comparison (see footnote 2) or, more precisely, after every conditional branch. The approach is, however, excessive because it delays all the instructions after the comparison, not only the vulnerable load. In listing (b), lines 8 and 12 do not access memory and can safely run in parallel with the comparison. And yet, the LFENCE delays them too.

Footnote 1: Before the publication of Spectre, the documentation described LFENCE as an instruction that only prevents reordering of loads and not of other instructions. Afterward, Intel revealed that LFENCE is, in fact, a full serialization instruction [2, 3].

Footnote 2: Note that we use comparison only as an example. In practice, any operation that modifies EFLAGS could be used as a branch condition.
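A manual LFENCE patch of this kind might look as follows in C. This is a sketch under assumptions: the function and macro names are hypothetical, `_mm_lfence` is the SSE2 compiler intrinsic for LFENCE, and the fallback for non-x86 targets is only a compiler barrier that offers no protection against hardware speculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__SSE2__)
#include <emmintrin.h>  /* _mm_lfence (SSE2) */
#define SPECULATION_BARRIER() _mm_lfence()
#else
/* Assumption: on other targets this only stops compiler reordering. */
#define SPECULATION_BARRIER() __asm__ __volatile__("" ::: "memory")
#endif

/* bar must have at least 256 entries, since it is indexed by a byte. */
uint8_t victim_lfence(const uint8_t *foo, size_t size,
                      const uint8_t *bar, size_t i) {
    uint8_t baz = 0;
    if (i < size) {
        SPECULATION_BARRIER();    /* loads below wait until the check resolves */
        uint8_t secret = foo[i];
        baz = bar[secret];
    }
    return baz;
}

/* Usage example with made-up data. */
void victim_lfence_demo(void) {
    uint8_t foo[4] = {1, 2, 3, 4};
    uint8_t bar[256] = {0};
    bar[3] = 99;
    assert(victim_lfence(foo, 4, bar, 2) == 99);
    assert(victim_lfence(foo, 4, bar, 4) == 0);
}
```

The barrier sits on the taken path directly after the branch, which mirrors the placement in the LFENCE-based listing; everything after it is delayed, including work that never touches memory.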

3.2 Artificial data dependencies

A more efficient approach is to allow most of the instructions to benefit from speculation and delay only those that read from memory (we assume that any data in memory could be security sensitive). Modern CPUs already have a mechanism for ensuring that one instruction runs strictly after another without delaying the rest of the instructions: data dependency. The approach is to reuse this mechanism by adding an artificial data dependency between conditional jumps and later loads. In the previous example (listing (a) of the figure on preventing Bounds Check Bypass), if we add a data dependency between the comparison (line 4) and the loads (lines 9 and 13), the loads will be delayed while the benign operations on lines 8 and 12 can still benefit from parallelism.

4 Ways of introducing a data dependency

The idea behind the dependency-based approaches is to delay all instructions using the secret until the comparison is resolved, by masking the secret with a value that is data-dependent on EFLAGS. There are two ways to get such a value: either by reading EFLAGS directly (LAHF instruction) or by using a conditional move. Alternatively, the secret could be masked with the comparison arguments, although this provides much weaker ordering guarantees.

4.1 Dependency via LAHF

The simplest way to introduce the dependency is to use LAHF, an instruction that copies the status flags from the low byte of EFLAGS into AH, the second-lowest byte of RAX (listing (c) of the figure on preventing Bounds Check Bypass, line 5). We could reserve a register (e.g., R15) and modify it using the stored flags (e.g., via XOR) to create a data dependency (line 6). Later, we XOR the secret with R15 twice (lines 10–11), thus making all further instructions using the secret dependent on the comparison, but without actually changing the secret’s value. The main issue of this approach is that we have to temporarily store (line 3) and restore (line 7) the value of RAX every time we invoke LAHF. We cannot reserve RAX as we did with R15 because many instructions rely on this register. Consequently, it increases the runtime cost of the protection.
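The double XOR works because x ^ d ^ d == x for any d, so the secret keeps its value while becoming data-dependent on d. The following C model illustrates only this arithmetic; the names `tie_to_dependency` and `dep` (standing in for R15) are hypothetical. An optimizing compiler is free to fold the two XORs away, so the `volatile` qualifier here merely discourages that, and a real implementation has to insert the instructions at the machine level, as a compiler pass would.

```c
#include <assert.h>
#include <stdint.h>

/* Models lines 10-11 of the LAHF-based listing: XOR the secret with the
 * reserved value twice. `dep` stands in for R15 (hypothetical name). */
uint64_t tie_to_dependency(uint64_t secret, uint64_t dep) {
    volatile uint64_t tmp = secret;
    tmp ^= dep;  /* secret now depends on dep, value is scrambled */
    tmp ^= dep;  /* second XOR restores the original value */
    return tmp;
}

/* Usage example: the value is unchanged whatever dep holds. */
void tie_to_dependency_demo(void) {
    assert(tie_to_dependency(0x1234u, 0u) == 0x1234u);
    assert(tie_to_dependency(0x1234u, 0xFFu) == 0x1234u);
    assert(tie_to_dependency(0u, UINT64_MAX) == 0u);
}
```

Because the result is independent of the contents of the reserved register, the flags can be mixed into it repeatedly without ever resetting it.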

4.2 Dependency via conditional move

To avoid the cost of saving and restoring RAX, we could use a conditional move, which is the approach used by Speculative Load Hardening (SLH) [1]. SLH creates the dependency via CMOV, an instruction that performs a move depending on the status flags in the EFLAGS register. In listing (d) of the figure on preventing Bounds Check Bypass, the secret is masked (line 10) with a value that is set to zero (line 5) if the comparison and the conditional move mismatch (i.e., if we have a misprediction). This has a double effect. First, similarly to the LAHF-based defense, SLH makes the loads data-dependent on the comparison, which prevents the speculation. Second, SLH zeroes out the loaded value (lines 5 and 10) in case of misspeculation. Although this is redundant on current hardware, future generations of Intel CPUs may introduce a value prediction feature that can speculate even in the presence of a data dependency. Since the mask can have only one of two values (either all ones or zero), there is no need for a double XOR, and a single AND is sufficient (line 10).
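The masking logic can be modeled in C as below. This is a sketch of the idea only, not the SLH implementation: the function names are hypothetical, and a C compiler may simplify the re-evaluated condition, whereas the hardening described above is applied at the instruction level with CMOV.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models lines 5 and 10 of the SLH listing: the mask is all ones on a
 * correctly predicted path and zero on a mispredicted one. */
uint64_t harden_loaded_value(uint64_t loaded, int check_failed) {
    uint64_t mask = ~(uint64_t)0;   /* all ones */
    mask = check_failed ? 0 : mask; /* models CMOVGE 0, mask */
    return loaded & mask;           /* single AND instead of two XORs */
}

/* bar must have at least 256 entries, since it is indexed by a byte. */
uint8_t victim_masked(const uint8_t *foo, size_t size,
                      const uint8_t *bar, size_t i) {
    uint8_t baz = 0;
    if (i < size) {
        /* Re-evaluate the branch condition as data on the taken path. */
        uint64_t secret = harden_loaded_value(foo[i], i >= size);
        baz = bar[secret];
    }
    return baz;
}

/* Usage example with made-up data. */
void victim_masked_demo(void) {
    uint8_t foo[4] = {1, 2, 3, 4};
    uint8_t bar[256] = {0};
    bar[2] = 77;
    assert(harden_loaded_value(0xAB, 0) == 0xAB); /* correct prediction */
    assert(harden_loaded_value(0xAB, 1) == 0);    /* misprediction: zeroed */
    assert(victim_masked(foo, 4, bar, 1) == 77);
    assert(victim_masked(foo, 4, bar, 9) == 0);
}
```

On a mispredicted path the second load would access `bar[0]` regardless of the loaded value, so its address no longer encodes the secret.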

4.3 Dependency on arguments

The approach used by SLH could be simplified even further. Instead of creating a dependency on EFLAGS, we could add a dependency on the comparison arguments (see listing (e) of the figure on preventing Bounds Check Bypass, line 3). Hence, the comparison can run in parallel with the loads, while the dependency ensures that the leaky load will start only when the arguments are either in registers or in the L1 cache. In this case, the speculation window will likely last only 1–2 cycles. Although this approach may prevent the leak in many cases, it does not provide any strict ordering guarantees. If the CPU reorders the instructions such that the comparison begins after the loads (e.g., because of an internal hardware hazard), the attack can still succeed. If the attacker comes up with a way to delay the comparison reliably, this strategy becomes ineffective.

5 Evaluation

In this section, we evaluate the performance impact of the approaches discussed in Sections 3 and 4. We used the author’s implementation of SLH, and we implemented the other two approaches on our own.

Experimental Setup. The experiments were carried out on a machine with a 4-core Intel processor operating at 3.3 GHz (Haswell microarchitecture) with 32GB of RAM, a 256GB SATA-based SSD, and running Linux kernel 4.15. Each core has private 32KB L1 and 256KB L2 caches, and all cores share an 8MB L3 cache. We used the largest available datasets provided by the Phoenix benchmark suite. As for compilers, we used LLVM 7.0 for SLH and LLVM 5.0 for the other two approaches.

The numbers are normalized against the native LLVM of the corresponding version. For all measurements, we report the average over ten runs and geometric mean for the “gmean” across benchmarks.
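The per-benchmark normalization can be sketched as follows. The function names and the timings are invented for illustration and are not measurements from the paper; the sketch covers only the averaging over runs and the overhead relative to the native version, not the geometric mean across benchmarks.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of the measured runtimes of one benchmark. */
double mean_runtime(const double *runs, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t k = 0; k < n; ++k) {
        sum += runs[k];
    }
    return sum / (double)n;
}

/* Overhead in percent relative to the native version:
 * 0 means no slowdown, 100 means twice the native runtime. */
double overhead_percent(double native_mean, double protected_mean) {
    return (protected_mean / native_mean - 1.0) * 100.0;
}

/* Usage example with made-up timings (seconds), ten runs each. */
void overhead_demo(void) {
    double native_runs[10]    = {2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0};
    double protected_runs[10] = {10.8, 10.8, 10.8, 10.8, 10.8,
                                 10.8, 10.8, 10.8, 10.8, 10.8};
    double o = overhead_percent(mean_runtime(native_runs, 10),
                                mean_runtime(protected_runs, 10));
    assert(o > 439.999 && o < 440.001); /* 5.4x runtime is a 440% overhead */
}
```

Under this definition, a 440% overhead corresponds to a protected version that takes 5.4 times as long as the native one.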

Performance. Figure 2 shows performance overheads of the LFENCE-based defense, SLH, and dependency on arguments, measured across the Phoenix benchmarks [5]. As we see, adding LFENCEs after every conditional branch is extremely expensive and causes a 440% slowdown on average. Such a high overhead appears because LFENCE virtually disables speculative execution. As Figure 3 shows, the application cannot use the available instruction-level parallelism to its full extent: With the LFENCEs, the average number of instructions per cycle (IPC) drops from ~2.3 to ~0.5. As for the dependency on arguments and SLH, they delay only memory accesses, and therefore IPC drops only to ~2.

IPC is not the only influential factor, though. histogram has lower overhead with SLH, yet the IPC is lower than with the dependency on arguments. Here, the overhead is mainly caused by additional instructions (see Figure 4). SLH uses only a single AND for masking, whereas the other approaches need two XORs. This effect is more pronounced in histogram, which has more loads and fewer loops than the other benchmarks.

As for the extreme cases, pca contains large loops with many arithmetic operations (mainly vectorized) on its hot path, hence the speculation has less influence on it. On the other side of the spectrum are kmeans and string_match. Here, the high overheads are caused by tight loops on the hot path. Both SLH and the dependency on arguments force the loops to run sequentially, thus reducing the level of parallelism (see Figure 3). SLH versions are slower because SLH uses a more expensive instruction to instrument conditional branches (CMOV instead of XOR).

6 Conclusion

We presented an overview of possible approaches to preventing Bounds Check Bypass by creating artificial data dependencies between conditional jumps and subsequent memory loads. By allowing benign instructions to run in parallel with the jumps, these approaches achieve much better utilization of the available CPU resources than serialization with LFENCEs. In our experiments, they introduce 60% overhead, while the LFENCE-based defense causes a 440% slowdown.

References

2 Bounds Check Bypass

In essence, Bounds Check Bypass is a buffer over- or under-read that succeeds even in the presence of traditional protection measures. Consider the code snippet in Figure 1a: Without the bounds check on line 2, an adversary with control over the input can force the load on line 3 to read from any address, including those beyond the array foo. Traditionally, vulnerabilities of this type were mitigated with bounds checks (such as the one on line 2) that permit the access only if the address is within the object bounds. However, there is an issue with this approach. The underlying assumption of bounds checking is that the instructions run in order, which is not the case in modern pipelined CPUs with branch prediction. Such a CPU can run the check in parallel with the vulnerable load if it predicts that the check is not likely to fail. Later, it will find out that the prediction was wrong and discard the speculated load, but, as Spectre [4] has proven, its cache traces will stay. The adversary can access the traces by launching a side-channel attack [6, 7]. Since side-channel attacks only reveal the accessed address and not the loaded value, the load on line 3 is not sufficient. In our example, only the second load (line 5) will leak the secret: The adversary will observe an access to bar[secret] after which deriving the value of secret is only a matter of subtraction. In summary, the vulnerable pattern (see Figure 1b) consists of an adversary-controlled load (line 3) followed by a memory access (line 5) based on the loaded value. A runtime check (line 2) must protect the load; otherwise, the pattern turns into a traditional buffer overflow. Data dependency? Reordering? Speculation? Figure 2: Performance (runtime) overhead with respect to native version. (Lower is better.) Figure 3: IPC (instructions/cycle) numbers for native and protected versions. (Higher is better.) Figure 4: Increase in number of instructions with respect to native version. (Lower is better.)

3 Preventing speculation

The most straightforward way of defending against Bounds Check Bypass is to prevent the speculation itself. Consider the example in Figure LABEL:fig:preventa: illegal memory access will never happen if the read (line 9) runs strictly after the comparison (line 4). There are two alternative approaches to enforcing this property: completely preventing speculation via serialization instructions, and delaying loads by adding artificial data dependencies.

3.1 Serialization

Intel documentation [2] proposes to patch vulnerable regions of code by explicitly serializing them with an LFENCE, an instruction that ensures that all prior instructions execute before it, and all later instructions—after111Before the publication of Spectre, the documentation described LFENCE as an instruction that only prevent reordering of loads and not other instructions. Afterward, Intel uncovered that LFENCE is, in fact, a full serialization instruction [2, 3].. We can force the load to wait for the comparison by adding an LFENCE in-between (see Figure LABEL:fig:preventb). To protect the application entirely, we would have to add an LFENCE after every comparison222Mind that we use comparison only as an example. In practice, any operation that modifies EFLAGS could be used as a branch condition. or, more precisely, after every conditional branch. The approach is, however, excessive because it delays all the instructions after the comparison, not only the vulnerable load. In Figure LABEL:fig:preventb, lines 8 and 12 do not access memory and can safely run in parallel with the comparison. And yet, the LFENCE delays them too.

3.2 Artificial data dependencies

A more efficient approach is to allow most of the instructions to benefit from speculation and delay only those that read from memory (we assume that any data in memory could be security sensitive). Modern CPUs already have a mechanism for ensuring that one instruction runs strictly after another without delaying the rest of the instructions: data dependency. The approach is to reuse this mechanism by adding an artificial data dependency between conditional jumps and later loads. In the previous example (Figure LABEL:fig:preventa), if we add a data dependency between the comparison (line 4) and the loads (lines 9 and 13), the load will be delayed while the benign operations on lines 8 and 12 can still benefit from parallelism.

4 Ways of introducing a data dependency

The idea behind the dependency-based approaches is to delay all instructions using the secret until the comparison is resolved by masking the secret with a value data-dependent on EFLAGS. There are two ways to get such value: either by reading EFLAGS directly (LAHF instruction) or by using a conditional move. Alternatively, the secret could be masked with comparison arguments, although it provides much weaker ordering guarantees.

4.1 Dependency via LAHF

The simplest way to introduce the dependency is to use LAHF, an instruction that stores the value of EFLAGS into RAX (Figure LABEL:fig:preventc, line 5). We could reserve a register (e.g., R15) and modify it using the stored flags (e.g., via XOR) to create a data dependency (line 6). Later, we twice XOR the secret with R15 (lines 10–11) thus making all further instructions using the secret dependent on the comparison, but without actually changing the secret’s value. The main issue of this approach is that we have to temporary store (line 3) and restore (line 7) the value of RAX every time we invoke LAHF. We cannot reserve RAX as we did with R15 because many instructions rely on this register. Correspondingly, it increases the runtime cost of the protection.

4.2 Dependency via conditional move

To avoid the cost of keeping the RAX state, we could use a conditional move, which is the approach used by Speculative Load Hardening (SLH) [1]. SLH creates the dependency via CMOV, an instruction that performs a move based on the value of one of the status flags in the EFLAGS register. In Figure LABEL:fig:preventd, the secret is masked (line 10) with a value that may be set to zero (line 5) if the comparison and conditional move mismatch (i.e., if we have a misprediction). It has a double effect. First, similarly to LAHF-based defence, SLH makes the loads data dependent on the comparison, which prevents the speculation. Second, SLH zeroes out the loaded value (lines 5 and 10) in case of misspeculation. Although it is redundant on current hardware, future generations of Intel CPUs may introduce a value prediction feature that can speculate even in the presence of data dependency. Since the mask could have only one of the two values—either all ones or zero—there is no need to make a double XOR and a single AND is sufficient (line 10).

4.3 Dependency on arguments

The approach used by SLH could be simplified even further. Instead of creating a dependency on EFLAGS, we could add a dependency on the comparison arguments (see Figure LABEL:fig:prevente, line 3). Hence, the comparison can run in parallel with the loads, while the dependency ensures that the leaky load will start only when the arguments are either in registers or in L1 cache. In this case, the speculation window will likely last only 1–2 cycles. Although this approach may prevent the leak in many cases, it does not provide any strict guarantees of ordering. If the CPU reorders the instructions such that the comparison begins after the loads (e.g., because of an internal hardware hazard), the attack can still succeed. If the attacker comes up with a way to delay the comparison reliably, it will render this strategy ineffective.

5 Evaluation

In this section, we evaluate the performance impact of the approaches descussed in 3 and 4. We used the author’s implementation of SLH and we implemented the other two approaches on our own. Experimental Setup. The experiments were carried out on a machine with a 4-core Intel processor operating at 3.3 GHz (Haswell microarchitecture) with 32GB of RAM, a 256GB SATA-based SDD, and running Linux kernel 4.15. Each core has private 32KB L1 and 256KB L2 caches, and all cores share an 8MB L3 cache. We used the largest available datasets provided by the Phoenix benchmark suite. As of compilers, we used LLVM 7.0 for SLH and LLVM 5.0 for the other two approaches.

The numbers are normalized against the native LLVM of the corresponding version. For all measurements, we report the average over ten runs and geometric mean for the “gmean” across benchmarks.

Performance. Figure 2 shows performance overheads of the LFENCE-based defense, SLH, and dependency on arguments, measured across the Phoenix benchmarks [5]. As we see, adding LFENCEs after every conditional branch is extremely expensive and causes 440% slowdown on average. Such a high overhead appears because LFENCE virtually disables speculative execution. As Figure 3 shows, the application cannot use the available instruction-level parallelism to its full extent: With the LFENCEs, the average number of instructions per cycle (IPC) drops from ~2.3 to ~0.5. As of the dependency on arguments and SLH, they delay only memory accesses and therefore, IPC drops only to ~2. IPC is not the only influencial factor, though. histogram has lower overhead with SLH, yet the IPC is lower than with the dependency on arguments. Here, the overhead is mainly caused by additional instructions (see Figure 4). SLH uses only a single AND for masking, whereas the other approaches need two XORs. Histogram, having more loads and fewer loops than other benchmarks, has this effect more pronounced. As of the extreme cases, pca

contains large loops with many arithmetic operations (mainly vectorized) on its hot path, hence the speculation has less influence on it. On the other side of the spectrum are

kmeans and string_match. Here, the high overheads are caused by tight loops on the hot path. Both SLH and the dependency on arguments force the loops to run sequentially thus reducing the level of parallelism (see Figure 3). SLH versions are slower because SLH uses a more expensive instruction to instrument conditional branches (CMOV instead of XOR).

6 Conclusion

We presented an overview of possible approaches to preventing Bounds Check Bypass by creating artificial data dependencies between conditional jumps and subsequent memory loads. Because of allowing benign instructions to run in parallel with the jumps, these approaches achieve much better utilization of the available CPU resources in comparison to serialization with LFENCEs. In our experiments, they introduce 60% overhead, while LFENCE-based defense causes 440% slowdown.

References

3 Preventing speculation

The most straightforward way of defending against Bounds Check Bypass is to prevent the speculation itself. Consider the example in Figure LABEL:fig:preventa: illegal memory access will never happen if the read (line 9) runs strictly after the comparison (line 4). There are two alternative approaches to enforcing this property: completely preventing speculation via serialization instructions, and delaying loads by adding artificial data dependencies.

3.1 Serialization

Intel documentation [2] proposes to patch vulnerable regions of code by explicitly serializing them with an LFENCE, an instruction that ensures that all prior instructions execute before it, and all later instructions—after111Before the publication of Spectre, the documentation described LFENCE as an instruction that only prevent reordering of loads and not other instructions. Afterward, Intel uncovered that LFENCE is, in fact, a full serialization instruction [2, 3].. We can force the load to wait for the comparison by adding an LFENCE in-between (see Figure LABEL:fig:preventb). To protect the application entirely, we would have to add an LFENCE after every comparison222Mind that we use comparison only as an example. In practice, any operation that modifies EFLAGS could be used as a branch condition. or, more precisely, after every conditional branch. The approach is, however, excessive because it delays all the instructions after the comparison, not only the vulnerable load. In Figure LABEL:fig:preventb, lines 8 and 12 do not access memory and can safely run in parallel with the comparison. And yet, the LFENCE delays them too.

3.2 Artificial data dependencies

A more efficient approach is to allow most of the instructions to benefit from speculation and delay only those that read from memory (we assume that any data in memory could be security sensitive). Modern CPUs already have a mechanism for ensuring that one instruction runs strictly after another without delaying the rest of the instructions: data dependency. The approach is to reuse this mechanism by adding an artificial data dependency between conditional jumps and later loads. In the previous example (Figure LABEL:fig:preventa), if we add a data dependency between the comparison (line 4) and the loads (lines 9 and 13), the load will be delayed while the benign operations on lines 8 and 12 can still benefit from parallelism.

4 Ways of introducing a data dependency

The idea behind the dependency-based approaches is to delay all instructions using the secret until the comparison is resolved by masking the secret with a value data-dependent on EFLAGS. There are two ways to get such value: either by reading EFLAGS directly (LAHF instruction) or by using a conditional move. Alternatively, the secret could be masked with comparison arguments, although it provides much weaker ordering guarantees.

4.1 Dependency via LAHF

The simplest way to introduce the dependency is to use LAHF, an instruction that stores the value of EFLAGS into RAX (Figure LABEL:fig:preventc, line 5). We could reserve a register (e.g., R15) and modify it using the stored flags (e.g., via XOR) to create a data dependency (line 6). Later, we twice XOR the secret with R15 (lines 10–11) thus making all further instructions using the secret dependent on the comparison, but without actually changing the secret’s value. The main issue of this approach is that we have to temporary store (line 3) and restore (line 7) the value of RAX every time we invoke LAHF. We cannot reserve RAX as we did with R15 because many instructions rely on this register. Correspondingly, it increases the runtime cost of the protection.

4.2 Dependency via conditional move

To avoid the cost of keeping the RAX state, we could use a conditional move, which is the approach used by Speculative Load Hardening (SLH) [1]. SLH creates the dependency via CMOV, an instruction that performs a move based on the value of one of the status flags in the EFLAGS register. In Figure LABEL:fig:preventd, the secret is masked (line 10) with a value that may be set to zero (line 5) if the comparison and conditional move mismatch (i.e., if we have a misprediction). It has a double effect. First, similarly to LAHF-based defence, SLH makes the loads data dependent on the comparison, which prevents the speculation. Second, SLH zeroes out the loaded value (lines 5 and 10) in case of misspeculation. Although it is redundant on current hardware, future generations of Intel CPUs may introduce a value prediction feature that can speculate even in the presence of data dependency. Since the mask could have only one of the two values—either all ones or zero—there is no need to make a double XOR and a single AND is sufficient (line 10).

4.3 Dependency on arguments

The approach used by SLH could be simplified even further. Instead of creating a dependency on EFLAGS, we could add a dependency on the comparison arguments (see Figure LABEL:fig:prevente, line 3). Hence, the comparison can run in parallel with the loads, while the dependency ensures that the leaky load will start only when the arguments are either in registers or in L1 cache. In this case, the speculation window will likely last only 1–2 cycles. Although this approach may prevent the leak in many cases, it does not provide any strict guarantees of ordering. If the CPU reorders the instructions such that the comparison begins after the loads (e.g., because of an internal hardware hazard), the attack can still succeed. If the attacker comes up with a way to delay the comparison reliably, it will render this strategy ineffective.

5 Evaluation

In this section, we evaluate the performance impact of the approaches discussed in Sections 3 and 4. We used the authors’ implementation of SLH and implemented the other two approaches ourselves.

Experimental Setup. The experiments were carried out on a machine with a 4-core Intel processor operating at 3.3 GHz (Haswell microarchitecture) with 32GB of RAM and a 256GB SATA-based SSD, running Linux kernel 4.15. Each core has private 32KB L1 and 256KB L2 caches, and all cores share an 8MB L3 cache. We used the largest available datasets provided by the Phoenix benchmark suite. As for compilers, we used LLVM 7.0 for SLH and LLVM 5.0 for the other two approaches.

The numbers are normalized against the native build produced by the corresponding LLVM version. For all measurements, we report the average over ten runs; “gmean” denotes the geometric mean across benchmarks.
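The normalization and aggregation can be sketched as below. This is an illustration under stated assumptions, not the authors' measurement scripts: the function names are hypothetical, the overhead is assumed to be the ratio of mean hardened runtime to mean native runtime, and the runtimes in the usage example are made up.

```python
import math


def overhead_ratio(hardened_runs, native_runs):
    """Mean hardened runtime divided by mean native runtime."""
    mean_hardened = sum(hardened_runs) / len(hardened_runs)
    mean_native = sum(native_runs) / len(native_runs)
    return mean_hardened / mean_native


def gmean(ratios):
    """Geometric mean of per-benchmark ratios, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Made-up runtimes in seconds for two hypothetical benchmarks.
    ratios = [
        overhead_ratio([2.0, 2.2], [1.0, 1.1]),
        overhead_ratio([3.0, 3.3], [2.0, 2.2]),
    ]
    print(gmean(ratios))
```

The geometric mean is the usual choice for normalized ratios because it is not dominated by a single benchmark with a very large slowdown.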

Performance. Figure 2 shows the performance overheads of the LFENCE-based defense, SLH, and the dependency on arguments, measured across the Phoenix benchmarks [5]. Adding LFENCEs after every conditional branch is extremely expensive and causes a 440% slowdown on average. Such a high overhead appears because LFENCE virtually disables speculative execution. As Figure 3 shows, the application cannot use the available instruction-level parallelism to its full extent: with the LFENCEs, the average number of instructions per cycle (IPC) drops from ~2.3 to ~0.5. The dependency on arguments and SLH delay only memory accesses, and therefore the IPC drops only to ~2. IPC is not the only influential factor, though. histogram has a lower overhead with SLH, yet its IPC is lower than with the dependency on arguments. Here, the overhead is mainly caused by additional instructions (see Figure 4). SLH uses only a single AND for masking, whereas the other approaches need two XORs. This effect is more pronounced in histogram because it has more loads and fewer loops than the other benchmarks. As for the extreme cases, pca contains large loops with many arithmetic operations (mainly vectorized) on its hot path, hence speculation has less influence on it. On the other side of the spectrum are kmeans and string_match. Here, the high overheads are caused by tight loops on the hot path. Both SLH and the dependency on arguments force the loops to run sequentially, thus reducing the level of parallelism (see Figure 3). The SLH versions are slower because SLH uses a more expensive instruction to instrument conditional branches (CMOV instead of XOR).

6 Conclusion

We presented an overview of possible approaches to preventing Bounds Check Bypass by creating artificial data dependencies between conditional jumps and subsequent memory loads. Because they allow benign instructions to run in parallel with the jumps, these approaches achieve much better utilization of the available CPU resources than serialization with LFENCEs. In our experiments, they introduce a 60% overhead, whereas the LFENCE-based defense causes a 440% slowdown.

References
