As arbitrarily shrinking process technology and increasing processor clock frequencies is not possible due to physical limitations, performance improvements in modern processors are made by increasing the number of cores or by optimizing the instruction pipeline. Out-of-order execution and speculative execution are among the biggest contributors to the performance and efficiency of modern processors. Out-of-order execution allows processing instructions in an order deviating from the one specified in the instruction stream. To fully utilize out-of-order execution, processors use prediction mechanisms, e.g., for branch directions and targets. This predicted control flow is commonly called speculative execution. However, predictions might be wrong, and virtually any instruction can raise a fault, e.g., a page fault. In these cases, already executed instructions have to be unrolled, and their results have to be discarded. Such instructions are called transient instructions (Lipp2018meltdown, ; Kocher2019spectre, ; Vanbulck2018, ; Weisse2018foreshadowNG, ).
Transient instructions are never committed, i.e., they are never visible on the architectural level. Until the discovery of transient-execution attacks, e.g., Spectre (Kocher2019spectre, ), Meltdown (Lipp2018meltdown, ), and Foreshadow (Vanbulck2018, ; Weisse2018foreshadowNG, ), they were not considered a security problem. These attacks exploit transient execution, i.e., the execution of transient instructions, to leak secrets. This is accomplished by accessing secrets in the transient-execution domain and transmitting them via a microarchitectural covert channel to the architectural domain.
The original Spectre attack (Kocher2019spectre, ) used a cache covert channel to transmit data from the transient-execution domain to the architectural domain. However, other covert channels can be used, e.g., instruction timings (Kocher2019spectre, ; Schwarz2018netspectre, ), register contention (Kocher2019spectre, ), branch-predictor state (Evtyushkin2018BranchScope, ), or the TLB (Kiriansky2018speculative, ; Schwarz2018netspectre, ). For other covert channels (Xu2011, ; Wu2012, ; Wu2014, ; Guri2015bitwhisper, ; Evtyushkin2016RNG, ; Liu2015, ; Irazoqui2016Cross, ; Ge2016, ; Gruss2016Flush, ; Maurice2017Hello, ; Pessl2016, ; Schwarz2017Timers, ), it is still unclear whether they can be used.
Several countermeasures have been proposed against transient-execution attacks, often relying on software workarounds. However, many countermeasures (Yan2018InvisiSpec, ; khasawneh2018safespec, ; Kiriansky2018dawg, ; AMDSpecAnalysis, ; IntelSpecAnalysis, ) only try to prevent the cache covert channel of the original Spectre paper (Kocher2019spectre, ). This includes the officially suggested workaround from Intel and AMD (IntelSpecAnalysis, ; AMDSpecAnalysis, ) to prevent Spectre variant 1 exploitation. Schwarz et al. (Schwarz2018netspectre, ) showed that this is insufficient.
In this paper, we introduce a new type of countermeasure. Our approach, ConTExT, precisely prevents secret data from being used in the transient-execution domain without aborting or preventing transient execution. Architecturally, the secret data is still reachable. However, the secret data is not available when executing in the transient-execution domain. We show that our approach is efficient and still runs non-dependent instructions out-of-order or speculatively. Moreover, we show that our approach effectively prevents not only all Spectre attacks but also very recent microarchitectural data sampling attacks (Schwarz2019ZL, ; Minkin2019, ; vanSchaik2019, ; Schwarz2019STL, ).
Implementing ConTExT in CPUs only requires repurposing one page-table entry bit (one of the currently unused ones) as a non-transient bit. Instead of the actual value, the CPU uses a dummy value (‘0’) when accessing a non-transient memory location during transient execution. Additionally, to protect register contents as well, we also introduce a non-transient bit per register. Same as for the memory locations, the CPU will use a dummy value during transient execution instead of the actual register content.
We annotate variables that can hold secrets in the source code. With compiler and linker support, we propagate this information into the binary, resulting in a separate binary section for secrets. For this section, the operating system sets the memory mapping to non-transient. We split the stack into a protected stack and an unprotected stack. The protected stack is marked as non-transient and is used as temporary memory by the compiler, e.g., for register spills, while local variables are moved to the unprotected stack. Thus, there is no performance impact for local variables. Preventing leakage only requires a developer to identify the assets, i.e., secret values, inside an application. This is much easier than identifying all code locations which potentially leak secret values.
To emulate the minimal hardware adaptions ConTExT requires, we over-approximate it via ConTExT-light, a software-only solution which partially emulates the behavior using existing features of commodity CPUs. ConTExT-light relies on the property that values stored in uncacheable memory can generally not be used inside the transient-execution domain (Eclypsium2018smm, ; Lipp2018meltdown, ), except for cases where the value is architecturally in registers, or microarchitecturally in the load buffer, store buffer, or line fill buffer. It is an over-approximation of ConTExT, yet it does not provide complete protection on commodity systems due to leakage from these buffers. Thus, while it does not provide the same protection guarantees, it allows obtaining a loose upper bound for the worst-case performance overhead of the hardware solution. As ConTExT only requires the annotation of secrets inside the program, it can be easily added to any existing C/C++ program to protect secrets from being leaked via transient-execution attacks. In contrast to the software approximation ConTExT-light, ConTExT also inherently protects against microarchitectural data sampling attacks (Schwarz2019ZL, ; Minkin2019, ; vanSchaik2019, ; Schwarz2019STL, ), as leakage is, by design, prevented on the register level and the state of caches and buffers does not matter.
We evaluate the security of ConTExT on all known Spectre attacks. Due to its principled design, ConTExT prevents the leakage of secret data in all cases. The overhead is less than , which is lower than the overhead of the currently recommended and deployed countermeasures (IntelSpecAnalysis, ; ARMSpecAnalysis, ; AMDSpecAnalysis, ; Microsoft2018Spectre, ; Reis2018siteisolation, ; Chromium2018SiteIsolation, ; Carruth2018Hardening, ; Larabel2018stibp, ; Tkachenko2018ibrs_performance, ). To further support the performance analysis, we extended the Bochs emulator with the non-transient bits for registers and page tables and extended it with a cache simulator. With the hardware extension, the overhead of ConTExT is below for most real-world workloads.
Concurrent to our work, NVIDIA patented a proposal closely related to our design (Boggs2019memory, ). However, they do not provide protection for registers, but only for memory locations.
The contributions of this work are:
We propose ConTExT, a hardware-software co-design for considerate transient execution, fully mitigating transient-execution attacks.
We show that on all levels only minimal changes are necessary. The proposed hardware changes can be partially emulated on commodity hardware.
We demonstrate that ConTExT prevents all known Spectre variants, even if they do not rely on the cache for the covert channel.
We evaluate the performance of ConTExT and show that the overhead is lower than the overhead of state-of-the-art countermeasures.
The remainder of this paper is organized as follows. In Section 2, we provide background information. Section 3 presents the design of ConTExT. Section 4 details our approximate proof-of-concept implementation on commodity hardware. Section 5 provides security and performance evaluations. Section 6 discusses the context of our work. We conclude our work in Section 7.
2. Background
In this section, we give an overview of transient execution. We then discuss known transient execution attacks. We also discuss proposed defenses and their shortcomings.
2.1. Transient Execution
To simplify processor design and to allow superscalar processor optimizations, modern processors first decode instructions into simpler micro-operations (μOPs) (Fog2016, ). With these μOPs, one optimization is not to execute them in-order as given by the instruction stream but to execute them out-of-order as soon as the execution unit and required operands are available. Even in the case of out-of-order execution, instructions are retired in the order specified by the instruction stream. This necessitates a buffer, called reorder buffer, where intermediate results from μOPs can be stored until they can be retired as intended by the instruction stream.
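The interplay of out-of-order completion and in-order retirement can be illustrated with a minimal sketch. The model below is an assumption made for illustration only (the names `ReorderBuffer`, `allocate`, `complete`, and `retire` are hypothetical) and does not describe any real microarchitecture: results become available in arbitrary order, but only the oldest finished entries may retire.

```python
from collections import OrderedDict


class ReorderBuffer:
    """Toy reorder buffer: entries are allocated in program order."""

    def __init__(self):
        # Maps a uop id to its result (None while the uop has not finished).
        self.entries = OrderedDict()

    def allocate(self, uop_id):
        self.entries[uop_id] = None

    def complete(self, uop_id, result):
        # Execution may finish in any order.
        self.entries[uop_id] = result

    def retire(self):
        """Retire finished uops from the head only, preserving program order."""
        retired = []
        while self.entries:
            uop_id, result = next(iter(self.entries.items()))
            if result is None:
                break  # the oldest uop is unfinished, so younger ones must wait
            retired.append(uop_id)
            self.entries.popitem(last=False)
        return retired


rob = ReorderBuffer()
for uop in ("u0", "u1", "u2"):
    rob.allocate(uop)

rob.complete("u2", 7)   # the youngest uop finishes first
first = rob.retire()    # nothing retires: u0 is still pending
rob.complete("u0", 1)
second = rob.retire()   # only u0 retires; u1 still blocks u2
rob.complete("u1", 3)
third = rob.retire()    # u1 and u2 retire in program order
```

In this sketch, the result of `u2` sits in the buffer until both older μOPs have retired, which is the window in which a real processor may still have to discard it.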
In general, software is almost never purely linear but contains (conditional) branches. Without speculative execution, a processor would have to wait until the branch is resolved before execution can be continued, drastically reducing performance. To increase performance, speculative execution allows a processor to predict the most likely outcome of the branch using various predictors and continue executing along that direction until the branch is resolved.
At runtime, a program has different ways to branch, e.g., conditional branches or indirect calls. Intel provides several structures to predict branches (Intel_opt, ), e.g., the Branch History Buffer (BHB) (Bhattacharya2017perf, ), the Branch Target Buffer (BTB) (Lee2017BranchShadowing, ; Evtyushkin2016ASLR, ), the Pattern History Table (PHT) (Fog2016, ), and the Return Stack Buffer (RSB) (Fog2016, ; Maisuradze2018spectre5, ; Koruyeh2018spectre5, ). On multi-core CPUs, Ge et al. (Ge2016, ) have shown that the branch prediction logic is not shared among physical cores, preventing one physical core from influencing the prediction on another.
Speculation is not limited to branches. Processors can, e.g., speculate on the existence of data dependencies (Horn2018spectre4, ). In the case where the prediction was correct, the instructions in the reorder buffer are retired in-order. If the prediction was wrong, the results are squashed, and a rollback is performed by flushing the pipeline and the reorder buffer. During that process, all architectural but no microarchitectural changes are reverted. Any instruction that is executed out-of-order or speculatively but never committed architecturally is called a transient instruction. Transient execution may have measurable microarchitectural side effects.
2.2. Transient Execution Attacks & Defenses
While transient execution does not influence the architectural state, the microarchitectural state can change. Attacks that exploit these microarchitectural state changes to extract sensitive information are called transient execution attacks. So-called Spectre-type (Kocher2019spectre, ) attacks exploit different prediction mechanisms, while Meltdown-type (Lipp2018meltdown, ; Vanbulck2018, ) attacks exploit transient execution following a CPU exception.
Kocher et al. (Kocher2019spectre, ) first introduced two variants of Spectre attacks. The first exploits the PHT and the BHB such that the processor mispredicts the code path following a conditional branch. If the transiently executed code loads and leaks the secret, it is called a Spectre gadget. Kiriansky and Waldspurger (Kiriansky2018speculative, ) extended this attack from loads to stores, enabling transient buffer overflows and, thus, extending the number of possible Spectre gadgets.
Variant 2 (Kocher2019spectre, ) targets indirect branches and poisons the BTB with attacker-chosen destinations, leading to transient execution of the code at this attacker-chosen destination. An attacker mistrains the processor by performing indirect branches within the attacker’s own address space to the chosen destination address, regardless of what resides at this location. Chen et al. (Chen2018SGXpectre, ) showed that this can also be exploited in SGX.
For a memory load, the processor checks the store buffer for stored values to this memory location. Variant 4 (Horn2018spectre4, ), Speculative Store Bypass, exploits the case where the processor transiently uses a stale value because it could not find the updated value in the store buffer, e.g., due to aliasing.
SpectreRSB (Koruyeh2018spectre5, ) and ret2spec (Maisuradze2018spectre5, ) are Spectre variants targeting the RSB, a small hardware stack of return addresses pushed during recent call instructions. When a ret is executed, the top of the RSB is used to predict the return address. An attacker can force misspeculation in various ways, e.g., by overfilling the RSB, or by overwriting the return address on the software stack.
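The overfilling case can be sketched with a small model. The following is an illustrative assumption, not a description of a specific CPU: the RSB is modelled as a fixed-size circular stack (the names `ReturnStackBuffer` and `count_mispredictions` and the default capacity of 16 entries are hypothetical), and the behavior of real hardware on overflow or underflow may differ.

```python
class ReturnStackBuffer:
    """Toy RSB: a fixed-size circular stack of predicted return addresses."""

    def __init__(self, capacity=16):  # capacity is an assumed value
        self.slots = [None] * capacity
        self.top = 0  # index of the next free slot, wraps around

    def on_call(self, return_address):
        # A call pushes its return address; old entries are overwritten on overflow.
        self.slots[self.top % len(self.slots)] = return_address
        self.top += 1

    def on_ret(self):
        # A ret pops the most recent entry and uses it as the prediction.
        self.top -= 1
        return self.slots[self.top % len(self.slots)]


def count_mispredictions(depth, capacity):
    """Nest `depth` calls, then return from all of them and count wrong predictions."""
    rsb = ReturnStackBuffer(capacity)
    software_stack = []
    for level in range(depth):
        address = 0x1000 + level
        software_stack.append(address)  # architectural return address
        rsb.on_call(address)
    wrong = 0
    while software_stack:
        actual = software_stack.pop()
        if rsb.on_ret() != actual:
            wrong += 1
    return wrong


fits = count_mispredictions(depth=4, capacity=4)
overfilled = count_mispredictions(depth=6, capacity=4)
```

In this model, every call nested deeper than the buffer capacity overwrites an older entry, so the outermost returns receive stale predictions. The second technique mentioned above, overwriting the return address on the software stack, would correspond to changing `actual` without touching the RSB.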
All of the attacks discussed above have three things in common. First, they all use transient execution to access data that they would not access in normal, considerate execution. Second, they use this data to influence the microarchitectural state, which can be observed using microarchitectural attacks, e.g., Flush+Reload (Yarom2014, ). Third, all are executed locally on the victim machine, requiring the attacker to run code on the machine. Schwarz et al. (Schwarz2018netspectre, ) extended the original Spectre attack with a remote component and demonstrated that the microarchitectural state of the AVX2 unit can be used instead of the cache state to leak data.
Meltdown-type attacks exploit deferred handling of exceptions and do not exploit misspeculation but use other techniques to execute instructions transiently. Between the occurrence of an exception and it being raised, instructions can be executed transiently that access data retrieved by the faulting instructions. The original Meltdown attack (Lipp2018meltdown, ) exploited the deferred page fault following a user/supervisor bit violation, allowing an attacker to leak arbitrary memory. A variation of this attack allows an attacker to read system registers (ARMSpecAnalysis, ; IntelSpecAnalysis, ). Van Bulck et al. (Vanbulck2018, ; Weisse2018foreshadowNG, ) demonstrated that this problem also applies to other page-table bits, namely the present and the reserved bits. Canella et al. (Canella2019, ) analyzed different exception types, based on Intel’s (Intel_vol3, ) classification of exceptions as faults, traps, and aborts. They found that all known Meltdown variants so far have exploited faults, but not traps or aborts.
Since the discovery of Spectre, many different defenses have been proposed. The easiest and most radical solution would be to entirely (or selectively) disable speculation at the cost of a huge decrease in performance (Kocher2019spectre, ). Intel and AMD proposed a similar solution by using serializing instructions on both outcomes of a branch (AMDSpecAnalysis, ; IntelSpecAnalysis, ). Evtyushkin et al. (Evtyushkin2018BranchScope, ) proposed to allow a developer to annotate branches that could leak sensitive data, which are then not predicted. Unfortunately, on Intel CPUs, serializing branches does not prevent microarchitectural effects such as powering up AVX units or TLB fills (Schwarz2018netspectre, ).
For mitigating the RSB attack vector, Intel proposes RSB stuffing (Intel2018retpoline, ). Upon each context switch, the RSB is filled with the address of a benign gadget.
Google Chrome limits the amount of data that can be extracted by introducing site isolation (Chromium2018SiteIsolation, ). Site isolation relies on process isolation, i.e., each site is executed in its own process. Thus, Spectre attacks cannot leak secrets of other sites. Speculative Load Hardening (Carruth2018Hardening, ) and YSNB (Oleksenko2018ysnb, ) are similar proposals, both limiting speculation by introducing data dependencies between the array access and the condition.
SafeSpec (khasawneh2018safespec, ) and InvisiSpec (Yan2018InvisiSpec, ) introduce additional shadow hardware for speculation. The results of transient instructions are only made visible to the actual hardware when the processor has determined that the prediction was correct. Both methods require major changes to the hardware.
DAWG (Kiriansky2018dawg, ) is another proposal requiring major hardware changes. The idea is to partition the cache to create protection domains which are disjoint across ways and metadata partitions. In addition to hardware changes, the approach requires changes to the replacement policy and cache coherence protocol to incorporate the protection domain.
All local Spectre variants so far use either Flush+Reload (Yarom2014, ; Kocher2019spectre, ; Horn2018spectre4, ; Koruyeh2018spectre5, ; Maisuradze2018spectre5, ) or Prime+Probe (Osvik2006, ; Trippel2018MeltdownPrime, ) to extract information from the covert channel, requiring access to a high-resolution timer. Thus, a defense mechanism is to reduce the accuracy of timers (Microsoft2018edge, ; Pizlo18, ; Chromium18mitigations, ; Wagner18firefox, ) and eliminate methods to construct different timers (Schwarz2017Timers, ).
To mitigate Spectre variant 2, both Intel and AMD extended the ISA with mechanisms to control indirect branches (AMDspex_whitepaper, ; IntelMitigations, ), namely Indirect Branch Restricted Speculation (IBRS), Single Thread Indirect Branch Predictors (STIBP), and Indirect Branch Predictor Barrier (IBPB). With IBRS, the processor enters a special mode and predictions cannot be influenced by operations outside of it. STIBP restricts the sharing of branch prediction mechanisms among hyperthreads. IBPB allows flushing the BTB. Future processors implement enhanced IBRS (Intel2018retpoline, ), a hardware mitigation for Spectre variant 2. With retpoline (Turner2018retpoline, ), Google proposes an alternative technique to protect against branch poisoning by ensuring that the return instruction predicts to a benign endless loop through the RSB.
To mitigate Spectre variant 4, Intel provides a microcode update to disable the speculation on the store buffer check (IntelMitigations, ). The new feature, called Speculative Store Bypass Disable (SSBD), is also supported by AMD (AMDssbd_whitepaper, ). ARM introduced a new barrier (SSBB) which prevents loads after the barrier from bypassing a store using the same virtual address before the barrier (ARMSpecAnalysis, ). Future ARM CPUs will feature a configuration control register that prevents the re-ordering of stores and loads. This feature is called Speculative Store Bypass Safe (SSBS) (ARMSpecAnalysis, ).
So far, all the proposed defense mechanisms against Spectre attacks either require substantial hardware changes or only consider cache-based covert channels. In the latter case, an attacker can circumvent the defense by using a different covert channel. This focus on cache covert channels only and the huge decrease in performance caused by state-of-the-art Spectre defenses show the necessity for the development of efficient and effective defenses.
To mitigate Meltdown, Gruss et al. (Gruss2017KASLR, ) proposed KAISER, a kernel modification unmapping most of the kernel space while running in user mode. The idea of KAISER has been integrated into all major operating systems, e.g., in Linux as KPTI (LWN_kpti, ), in Windows as KVA Shadow (Ionescu2017Twitter, ), and in Apple’s xnu kernel as double map (Levin2012, ). With the PCID and ASID support of modern processors, the performance overheads appear acceptable for real-world use cases (Gregg2018kpti, ). Additionally, to mitigate Foreshadow (Vanbulck2018, ) on SGX enclaves, microcode updates are necessary. To fully mitigate Foreshadow-NG (Weisse2018foreshadowNG, ), several further steps need to be implemented. The kernel must use non-present page-table entries more carefully, e.g., not store the swap disk page frame number there for swapped-out pages. When using EPTs (extended page tables), the hypervisor must make sure that the L1 cache does not contain any secrets when switching into a virtual machine. Hence, the defenses for Meltdown are complete but expensive.
2.3. Taint Analysis
Taint tracking is used to track data-flow dependencies on a hardware level (Chow2004, ; Song2008, ), binary-level (Cheng2006, ; Schwartz2010, ), or source level (Shankar2001, ). Taint analysis has a wide range of security applications: detecting vulnerabilities, by tracking untrusted user input; malware analysis, analyzing information flows in binaries; test case generation, automatically generating inputs. This can be either done statically (Arzt2014, ; Wang2008Still, ) or dynamically (Newsome2005, ; Qin2006, ).
Dynamic taint analysis allows tracking the information flow between sources and sinks (Schwartz2010, ). Any value that depends on data derived from a tainted source, e.g., user input, is considered tainted. Values that are not derived from tainted sources are considered untainted. A policy defines how taint flows as the program executes and how new taints are introduced. Over-approximation can occur when tainting a value that is not derived from a taint source.
Taint tracking has also been proposed on a hardware level (Venkataramani2008, ), yet not in the context of speculative execution.
3. Design of ConTExT
In this section, we present the design of ConTExT, a considerate transient execution technique.
The idea of ConTExT is to introduce a new type of memory mappings, namely non-transient mappings. The non-transient option indicates that the mapping contains secrets which must not be accessed within the transient-execution domain. Consequently, non-transient values must not be used in transient operations, neither directly nor in a derived form. Thus, there cannot be any perturbations of the microarchitectural CPU state which might disclose non-transient values via side channels. To track whether a value is non-transient and must be protected, registers also track the non-transient state. To ensure not only the original but also derived values are protected, this information is propagated to the results of operations using these values, until the secret is destroyed, e.g., by overwriting it.
A processor with ConTExT mitigates all speculative execution and microarchitectural data sampling attacks, as the processor cannot use non-transient registers during transient execution anymore. However, code that already exposes information architecturally, e.g., by branching based on secrets, is considered out of scope.
ConTExT is a multi-level countermeasure which works on the application-, compiler-, operating-system-, and hardware-level. An application developer annotates secret values in the source code, which the compiler groups inside the binary and marks as secret.
Besides annotation of secrets, it would also be possible to architecturally define groups of secrets, e.g., by defining all userspace memory and user input as secret as proposed by Taram et al. (Taram2019, ). However, this can be very expensive, and consequently, related work is also investigating annotation-based protection mechanisms (Yu2019data, ).
When the operating system loads the binary, memory regions containing the annotated secrets are marked non-transient. All subsequent tracking of secrets is done by the hardware. The operating system only has to be aware of secret register states when handling interrupts and context switches. Other than these minimal changes, there are no additional adaptions required on any level of the software stack.
The full protection of ConTExT requires small hardware changes, which retrofit mechanisms that already exist in today’s CPUs, i.e., no re-design is required. Moreover, the change is fully backwards compatible with existing hardware and software (applications, libraries, and operating systems). As hardware changes cannot be conducted on commodity CPUs, we evaluate ConTExT based on ConTExT-light, an over-approximation which only requires software changes. As illustrated in Figure 1, ConTExT is a more considerate variant of transient execution.
ConTExT protects secrets which are stored in cache and DRAM, i.e., attackers cannot access data from memory locations marked as non-transient during transient execution. It also protects registers if they have been filled with data from protected cache or DRAM locations or from other protected registers. ConTExT-light cannot protect secrets while they are architecturally stored in registers of running threads, or microarchitecturally in the load buffer, store buffer, or line fill buffer. With ConTExT-light, an attacker can still leak data from these microarchitectural structures. We only use it to obtain a loose upper bound for the performance overheads of ConTExT.
ConTExT is a multi-level countermeasure consisting of 3 major components which we describe in this section: non-transient memory mappings (Section 3.1), secret tracking (Section 3.2), and the required compiler changes (Section 3.3).
3.1. Non-Transient Memory Mappings
We present three possible implementations of non-transient memory mappings, i.e., memory mappings which indicate that the values cannot be used during transient execution. All variants allow integrating ConTExT into the current architecture while maintaining backwards compatibility, i.e., if the operating system is not aware of ConTExT, the changes have no side effects. Hence, to implement ConTExT, only one of the following variants has to be implemented.
Currently Reserved Page-Table Entry Bit
There is already sufficient space to store the non-transient bit in the page tables of commodity CPUs. On Intel 64-bit (IA-32e) systems, each page-table entry has 64 bits, but the defined maximum physical address only has 52 bits. However, most processors do not support full 52 bits, but only up to 46 bits, which allows working with up to 64 TB of physical RAM if the hardware supported it.
Figure 2 shows a page-table entry for x86-64. Besides the already used bits, there are the 6 bits between bit 46 and 51, which are currently reserved for future use. This future use could be the extension of the physical page number if more physical memory is supported in future CPU generations. However, it could also be the repurposing of one of the bits (e.g., the last reserved bit) as a non-transient bit. This reduces the theoretical maximum amount of supported memory by a factor of 2. Thus, instead of 4 PB (the limit of a 52-bit physical address), CPUs could only support 2 PB of physical memory. The repurposing of a reserved bit is automatically backwards-compatible, as the reserved bits currently have to be ‘0’. Hence, using such a bit does not have any undesirable side effects on legacy software.
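The arithmetic and the bit manipulation behind this variant can be sketched as follows. This is a minimal illustration under stated assumptions: the choice of bit 51 follows the "last reserved bit" example in the text, and the helper names (`mark_non_transient`, `is_non_transient`, `addressable_bytes`) as well as the example entry value are hypothetical.

```python
MAX_PHYS_ADDR_BITS = 52   # architectural maximum stated in the text
NON_TRANSIENT_BIT = 51    # assumption: the last reserved bit is repurposed

NON_TRANSIENT_MASK = 1 << NON_TRANSIENT_BIT
PIB = 1 << 50             # one pebibyte in bytes


def mark_non_transient(pte):
    """Return the page-table entry with the non-transient bit set."""
    return pte | NON_TRANSIENT_MASK


def is_non_transient(pte):
    """Check whether the page-table entry marks a non-transient mapping."""
    return bool(pte & NON_TRANSIENT_MASK)


def addressable_bytes(address_bits):
    """Amount of physical memory addressable with the given number of bits."""
    return 1 << address_bits


# Capacity before and after giving up one physical-address bit.
before_pib = addressable_bytes(MAX_PHYS_ADDR_BITS) // PIB
after_pib = addressable_bytes(MAX_PHYS_ADDR_BITS - 1) // PIB

# Hypothetical legacy entry: all reserved bits are zero, so it is never
# interpreted as non-transient, which is why the change is backwards-compatible.
legacy_pte = 0x0000_0000_1234_5067
secret_pte = mark_non_transient(legacy_pte)
```

Because legacy software must leave the reserved bits at zero, every existing entry reads as a regular mapping in this sketch, and clearing the bit restores the original entry unchanged.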
Currently Ignored Page-Table Entry Bit and Control Register
An alternative to using one of the reserved bits is to use one of the ignored bits. These bits can be freely used by the operating system; thus, simply repurposing them is not possible. However, if the feature has to be actively enabled, the operating system is aware of the changed semantics of the specific ignored bit. Note that this approach was already taken for several other page-table bits, e.g., the protection key and the global bit are enabled via CR4 and are ignored otherwise. Hence, we also propose enabling the feature using a bit in one of the CPU control registers, e.g., CR4, EFER, or XCR0. These registers are already used for enabling and disabling security-related features, such as NX (no-execute) or SMAP (supervisor mode access prevention). Moreover, these registers still have up to 54 unused control bits which can be used to enable and disable the non-transient bit.
An advantage of repurposing an ignored bit is that CPU vendors do not lose potential address space bits. That is, this approach is compatible with physical address spaces of up to 4 PB in future hardware. However, the approach comes with the limitation that operating systems cannot freely use the retrofitted ignored bit anymore, as it is now used as the non-transient bit.
Memory Type using Page-Attribute Table
A third alternative is to retrofit the Page-Attribute Table (PAT), a processor feature allowing the operating system to reconfigure various attributes for classes of pages. The PAT allows specifying the memory type of a memory mapping. On x86, there are currently 6 different memory types which define the cache policy of the memory mapping.
| Encoding | Type | Description |
|---|---|---|
| 0 | UC | Strong uncacheable, never cached |
| 1 | WC | Write Combining (subsequent writes are combined and written once) |
| 2 | NS | Non-transient, cannot be read in the transient-execution domain |
| 4 | WT | Write Through (reads cached, writes written to cache and memory) |
| 5 | WP | Write Protected (only reads are cached) |
| 6 | WB | Write Back (reads/writes are cached) |
| 7 | UC- | Uncacheable, overwritten by MTRR |
Table 1 shows the memory types which can be set using the PAT, including our newly proposed non-transient memory type. The PAT itself provides 8 entries for memory types. Such a PAT entry is applied to a memory mapping via the 3 page-table-entry bits ‘3’ (write through), ‘4’ (uncacheable), and ‘7’ (PAT). Combined into a 3-bit number, these 3 bits select one of the 8 entries of the PAT.
Thus, to apply the non-transient memory type to a memory mapping, the OS sets one of the PAT entries to the non-transient memory type ‘2’. Then, this PAT entry can be applied through the existing page-table bits to any memory mapping. As the PAT supports 8 entries, and there are currently only 6 memory types (7 if the non-transient type is included), it is still possible to use all supported memory types concurrently on different pages, i.e., the approach is fully backwards-compatible.
An advantage of this approach is that no semantic changes have to be made to page-table entries, i.e., all bits in a page-table entry keep their current meaning. However, this variant may require more changes in the operating system, as Linux already utilizes all of the PAT entries (some memory types are defined twice).
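The selection of a PAT entry through the three page-table-entry bits can be sketched as follows. The sketch assumes the usual x86 ordering in which the PAT bit is the most significant and the write-through bit the least significant bit of the index. The PAT layout, the choice of entry 7 for the non-transient type, and all helper names are hypothetical and serve illustration only.

```python
# Memory type encodings from Table 1; NS (2) is the proposed non-transient type.
UC, WC, NS, WT, WP, WB, UC_MINUS = 0, 1, 2, 4, 5, 6, 7

# Page-table-entry bit positions named in the text.
PWT_BIT, PCD_BIT, PAT_BIT = 3, 4, 7


def pat_index(pte):
    """Combine the three page-table-entry bits into a 3-bit PAT index."""
    pwt = (pte >> PWT_BIT) & 1
    pcd = (pte >> PCD_BIT) & 1
    pat = (pte >> PAT_BIT) & 1
    return (pat << 2) | (pcd << 1) | pwt


def memory_type(pte, pat_entries):
    """Look up the memory type that applies to the mapping described by `pte`."""
    return pat_entries[pat_index(pte)]


def with_pat_index(pte, index):
    """Return a copy of `pte` whose PWT/PCD/PAT bits select PAT entry `index`."""
    cleared = pte & ~((1 << PWT_BIT) | (1 << PCD_BIT) | (1 << PAT_BIT))
    return (cleared
            | ((index & 1) << PWT_BIT)
            | (((index >> 1) & 1) << PCD_BIT)
            | (((index >> 2) & 1) << PAT_BIT))


# Assumed PAT layout for illustration: entry 7 is (hypothetically)
# reprogrammed by the operating system to the non-transient type.
pat_entries = [WB, WT, UC_MINUS, UC, WB, WT, UC_MINUS, NS]

plain_pte = 0x1000 | 0x3                    # example entry, PAT index 0
secret_pte = with_pat_index(plain_pte, 7)   # same entry, now non-transient
```

Only the three existing selector bits differ between `plain_pte` and `secret_pte`, which reflects the point that no page-table bit changes its meaning in this variant.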
3.2. Secret Tracking
Non-transient mappings ensure that non-transient memory locations cannot be accessed during transient execution. However, we still need to protect secret data that is already loaded into a register. Registers in commodity CPUs do not have a memory type or protection. Thus, we require changes to the hardware to implement protection of registers. Based on patents from Intel (Intel2014TaintTracking, ), VMWare (VMWare2013TaintTracking, ), and NVIDIA (Boggs2019memory, ), we expect such tracking features to be implemented in future CPUs. Venkataramani et al. (Venkataramani2008, ) proposed a technique that also taints registers, however, to mitigate architecturally and functionally correct behavior rather than overly eager speculative execution.
For ConTExT, we introduce one additional non-transient bit per register, i.e., a taint (Section 2.3). The non-transient bit indicates whether the value stored in the register is non-transient or not. A register is either entirely non-transient or not at all. The taint generally propagates from memory to registers and from registers to registers. The rationale behind this is that results of operations on secret data have to be considered secret as well. Accessing only parts of a tainted register, e.g., eax instead of rax, still copies the taint from the source register to the target register and taints the entire target register, as we only have a single non-transient bit per register. This is also true for taint propagation in any other use of a tainted register.
We keep taint propagation very simple and consider only instructions with registers as destination operands. If any non-transient memory location is used as a source operand to an instruction, the instruction taints the destination registers, i.e., the non-transient bit is set for every destination register. Similarly, if any non-transient register is used as a source operand to an instruction, the instruction also taints the destination registers. Thus, if a secret is loaded into a register, it is tracked through all register operations.
The taint is not propagated if the destination operand(s) are memory location(s), as all memory locations already have a non-transient bit managed by the operating system.
There are not only operations which taint registers, but also operations which untaint registers. Replacing the entire content of a register without using non-transient memory or registers untaints the register. We do this to avoid over-tainting registers, a problem pointed out in earlier works (Slowinska2009, ). In particular, all immediate or untainted values which replace the content of a register untaint the register. Writing a tainted register to a normal memory location, i.e., a memory location which is not marked as non-transient, also untaints the register. The rationale behind this is that if registers are spilled to normal (insecure) memory locations, a potential secret can be leaked anyway. If such a memory operation happens unintentionally, it is a bug in the program and has to be fixed at the software level. In many cases, however, this will be intentional behavior, as the programmer decided that the register does not contain a secret anymore. For instance, the output of a cryptographic cipher does not need protection from transient-execution attacks. Thus, the automated untainting keeps the number of tainted registers small.
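The propagation and untainting rules described so far can be illustrated with a small software model. The following Python sketch is only an illustration under simplifying assumptions: the class `TaintModel`, its method names, the alias table, and the page-number bookkeeping are hypothetical and not part of the proposed hardware.

```python
from dataclasses import dataclass, field

# Hypothetical alias table: a partial register maps to its full register,
# because there is only a single non-transient bit per register.
ALIASES = {"eax": "rax", "ax": "rax", "al": "rax", "ebx": "rbx"}

PAGE_SHIFT = 12  # assumption: 4 KiB pages


@dataclass
class TaintModel:
    """Toy model of one non-transient (taint) bit per register."""
    secret_pages: set = field(default_factory=set)  # non-transient page numbers
    tainted: set = field(default_factory=set)       # currently tainted registers

    def _full(self, reg):
        return ALIASES.get(reg, reg)

    def _is_secret(self, addr):
        return (addr >> PAGE_SHIFT) in self.secret_pages

    def _set(self, reg, taint):
        if taint:
            self.tainted.add(self._full(reg))
        else:
            self.tainted.discard(self._full(reg))

    def is_tainted(self, reg):
        return self._full(reg) in self.tainted

    def load(self, dst, addr):
        """Memory to register: the destination inherits the mapping's bit."""
        self._set(dst, self._is_secret(addr))

    def op(self, dst, *srcs):
        """Register to register: tainted if any source register is tainted."""
        self._set(dst, any(self.is_tainted(s) for s in srcs))

    def mov_imm(self, dst):
        """Replacing the whole content with an immediate untaints."""
        self._set(dst, False)

    def store(self, src, addr):
        """Memory destination: no propagation; a spill to normal memory
        untaints the source register."""
        if not self._is_secret(addr):
            self._set(src, False)
```

In this model, an instruction such as `add rbx, rax` would be expressed as `m.op("rbx", "rbx", "rax")`, since the destination is also a source operand.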
Taint Propagation across Memory Operations
As the taint bit is an additional bit for each register, it can only be propagated to other registers, not to memory. If an operation writes a secret (tainted) register to the memory, the taint bit is irrecoverably lost. While this is intended if the developer explicitly writes values to memory, it might have undesirable consequences if this happens implicitly, due to the inner workings of the compiler. In Section 3.3, we introduce the required changes to the compiler which ensure that non-transient values are never spilled to transient memory locations accidentally.
However, the compiler inevitably still has to temporarily store (insecure) registers within memory regions marked as non-transient. With the solution as described so far, we would over-approximate and taint more and more registers over time by spilling them to non-transient memory locations and reading them back from there. Hence, spilling registers is not a security problem (tainted registers are never untainted, only untainted registers are tainted), but a loss in performance due to unnecessarily tainted registers.
Optimizing Performance via Caching
To prevent this potential performance loss, we propose an additional change to the cache to reduce the impact of the taint over-approximation. We introduce one additional bit per 64 bits to the cache, i.e., 8 additional bits per cache line. This allows us to store the register-taint information transparently in the cache. Whenever a register is written to non-transient memory, the taint bit of the register is stored in the corresponding cache line. When reading from memory, the bit stored in the cache line has precedence over the information from the TLB, i.e., the cache overrides the taint bit defined by the memory mapping. The information in the cache allows the hardware to temporarily keep track of the taint information of a register if the register value is moved to the stack. This happens if register values are spilled on the stack, exchanged via the stack, or upon function calls.
Evicting the cache line that holds a spilled register value is never a security issue. An evicted cache line only loses the information that a register was not tainted. Thus, if the cache line is evicted, values read back from it are automatically tainted, as the taint bit defined by the memory mapping applies again.
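The interplay between the per-word taint bits in the cache and the taint defined by the memory mapping can be sketched in a few lines of Python. This is a simplified model, not the hardware design: the class `TaintCache`, the 64-byte line size, the 4 KiB page size, and the choice to allocate lines only on stores are assumptions made for illustration.

```python
LINE_SIZE = 64   # assumption: 64-byte cache lines (8 words of 64 bits)
WORD_SIZE = 8    # one taint bit per 64-bit word
PAGE_SIZE = 4096 # assumption: 4 KiB pages


class TaintCache:
    """Toy model: the cached taint bit overrides the mapping's taint bit."""

    def __init__(self, secret_pages):
        self.secret_pages = set(secret_pages)  # non-transient page numbers
        self.lines = {}  # line base address -> list of 8 taint bits

    def _mapping_taint(self, addr):
        return (addr // PAGE_SIZE) in self.secret_pages

    def store(self, addr, register_tainted):
        """Spill a register; remember its taint bit for non-transient memory."""
        if not self._mapping_taint(addr):
            return
        base = addr - addr % LINE_SIZE
        # Words never written in this model stay conservatively tainted.
        bits = self.lines.setdefault(base, [True] * (LINE_SIZE // WORD_SIZE))
        bits[(addr % LINE_SIZE) // WORD_SIZE] = register_tainted

    def load(self, addr):
        """Return the taint that a register loaded from addr receives."""
        base = addr - addr % LINE_SIZE
        if base in self.lines:
            return self.lines[base][(addr % LINE_SIZE) // WORD_SIZE]
        return self._mapping_taint(addr)

    def evict(self, addr):
        """Eviction drops the cached bits; the mapping's taint applies again."""
        self.lines.pop(addr - addr % LINE_SIZE, None)
```

Spilling an untainted register and reloading it keeps it untainted while the line is cached; after `evict`, the same load yields a tainted register, which costs performance but never leaks.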
3.2.1. Taint Control
Besides the automated tainting and untainting of registers, ConTExT provides a privileged interface to modify the taint of registers. This interface is necessary for the operating system to save and restore taint values upon context switches.
A straightforward solution would be to introduce new instructions in the ISA. However, we try to keep the hardware changes to a minimum, especially changes which are not hidden in the microarchitecture. Hence, we propose instead to use model-specific registers (MSR) to access the taint information of registers.
To read and write the current taint information of all registers, we introduce an MSR IA32_TAINT. The taint bit of every register directly maps to one bit of this MSR, which allows the operating system to read and write all taint bits in a single operation. As there are only 56 registers (16 general purpose, 8 floating point, 32 vector) which have to be tracked, one 64-bit MSR is sufficient to read or write all taint bits at once.
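As a rough illustration of how 56 taint bits fit into one 64-bit MSR value, the following Python sketch packs and unpacks a set of tainted registers. The bit layout and the register names (`gpr0`, `fp0`, `vec0`, and so on) are assumptions chosen for this example; the text only states that every register maps to one bit.

```python
# Hypothetical layout: bits 0-15 general purpose, 16-23 floating point,
# 24-55 vector registers; bits 56-63 stay unused.
REGISTERS = (
    [f"gpr{i}" for i in range(16)]
    + [f"fp{i}" for i in range(8)]
    + [f"vec{i}" for i in range(32)]
)


def pack_taint(tainted):
    """Encode a set of tainted register names as a 64-bit integer."""
    value = 0
    for bit, name in enumerate(REGISTERS):
        if name in tainted:
            value |= 1 << bit
    return value


def unpack_taint(value):
    """Decode a 64-bit integer back into the set of tainted registers."""
    return {name for bit, name in enumerate(REGISTERS) if (value >> bit) & 1}
```

With this layout, the operating system could save or restore all taints with a single MSR read or write.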
MSRs can only be accessed indirectly, using an instruction (e.g., rdmsr on x86), and require registers both to specify the MSR and as source and destination operands. On an interrupt, the first thing to save should be the IA32_TAINT MSR, because it contains the taints of the previous context. However, as registers must not be clobbered in the interrupt routine, all the registers used in the interrupt handler have to be saved first. We resolve this problem by automatically copying the IA32_TAINT to an additional MSR, IA32_SHADOW_TAINT, on every interrupt. This ensures that the taint of all registers is preserved before any taint is potentially modified by a register operation in the interrupt handler. The IA32_SHADOW_TAINT can then be treated like any other register, i.e., the operating system can save it into a kernel structure upon a context switch.
When returning from an interrupt, the CPU restores the values from IA32_SHADOW_TAINT to the register taint values. Hence, with this mechanism, we ensure that an interrupt does not influence the taint value of any register. This also works for the unlikely event of nested interrupts, i.e., if an interrupt is interrupted by a different interrupt. The only critical region in such a case is if the first interrupt has not yet locally saved the IA32_SHADOW_TAINT MSR, and the second interrupt overwrites the MSR. However, as long as no register is untainted within this critical region (the time window between the first interrupt and the second interrupt), there can be no leakage. In Section 3.3, we show that this situation can be avoided solely in software.
3.3. Software Support
We propose changes to applications, compilers, and operating systems to leverage the hardware extensions introduced in Section 3.1 and Section 3.2. The idea is that application developers annotate secrets in their applications. The annotations are processed by the compiler and then forwarded to the operating system to establish the correct memory mappings (Section 3.1).
The compiler parses the annotations of secrets. The secrets identified this way are allocated inside a dedicated section of the binary. The compiler marks this section as non-transient. The operating system maps this section from the binary using a non-transient memory mapping.
Besides parsing the annotations, our modified compiler ensures that it never spills data from registers marked as secret into unprotected memory. Otherwise, an attacker could leak the spilled secrets from memory. Still, it is unavoidable that the compiler spills registers to memory, e.g., to preserve register contents over function calls. Furthermore, due to the calling convention, some (possibly secret) values have to be passed over the stack. Hence, we have to assume that the stack contains secrets. As a consequence, the stack has to be mapped using a non-transient memory mapping as well.
To reduce the performance impact of a non-transient stack, we modify the compiler to only use the non-transient stack if really necessary. This non-transient stack only contains register spills, possibly function arguments, and return values. All other values are stored at a different memory location, the unprotected stack. This concept is similar to the SafeStack (Kuznetsov2014CPI, ), and our implementation even reuses parts of the SafeStack infrastructure of modern compilers. The difference to SafeStack, where only “unsafe” memory allocations (e.g., buffers) are stored on the SafeStack, is that we move all variables normally allocated on the stack to the unprotected stack. Thus, for ConTExT, only the absolute minimum is stored on the non-transient stack, e.g., return addresses. By only moving local variables to the unprotected stack, and leaving return addresses and function arguments on the stack, we do not break ABI compatibility with existing binaries. Thus, a developer can still use external libraries without recompiling them, and libraries compiled for ConTExT can be used in ordinary unprotected applications.
Moving local variables from the stack to a different memory location does not impact the runtime of the application and even gives additional protection against memory-corruption attacks (Kuznetsov2014CPI, ).
For ConTExT, the operating system is in charge of setting up non-transient memory mappings. As the operating system parses the binary, it can directly set up the non-transient memory mappings which are marked as such by the compiler. The operating system requires additional small changes. The operating system has to save and restore taint values on context switches. The hardware already saves the current taint value of all registers into the IA32_SHADOW_TAINT MSR upon interrupts. Thus, the operating system only has to read this register and save it together with all other saved registers.
As interrupts can be interrupted by other interrupts, e.g., a normal interrupt can be interrupted by a non-maskable interrupt (NMI), there is a critical section between reading the MSR and saving the result. If registers are untainted in this section, a nested interrupt would lose the taint information as it overwrites the IA32_SHADOW_TAINT MSR. However, if registers are not untainted in this section, no taint information can be lost. Hence, we have to initialize the registers required to read the MSR in a way which does not destroy the taint. For this purpose, we define that the rep prefix for arithmetic and logical operations on registers preserves the taint. The listing lst:iret shows (pseudo-)assembly code which prepares the registers with the required immediate values. Generally, overwriting a register with an immediate or by using an idiom, e.g., xor rax,rax, untaints the register. However, the rep prefix prevents the untainting here.
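The critical section and the effect of the rep prefix can be reproduced in a toy model. The Python sketch below is an assumption-laden illustration: `TaintCpu` and `handle_interrupt` are hypothetical names, the two MSRs are modeled as plain integers, and the nested interrupt is reduced to a bare copy and restore.

```python
class TaintCpu:
    """Toy model of IA32_TAINT and IA32_SHADOW_TAINT as plain integers."""

    def __init__(self, taint):
        self.taint = taint   # models IA32_TAINT
        self.shadow = 0      # models IA32_SHADOW_TAINT

    def interrupt(self):
        # Hardware copies the taint bits on every interrupt.
        self.shadow = self.taint

    def iret(self):
        # Hardware restores the taint bits when returning.
        self.taint = self.shadow

    def zero_register(self, index, rep_prefix):
        # `xor reg, reg` untaints; the rep-prefixed variant preserves the taint.
        if not rep_prefix:
            self.taint &= ~(1 << index)


def handle_interrupt(cpu, rep_prefix, nested):
    """Run a simplified handler and return the taint bits after returning."""
    cpu.interrupt()
    # Critical section: prepare register 0 before reading the shadow MSR.
    cpu.zero_register(0, rep_prefix)
    if nested:
        cpu.interrupt()  # e.g., an NMI overwrites the shadow MSR
        cpu.iret()
    saved = cpu.shadow   # the OS saves the value (end of critical section)
    cpu.taint = 0        # the handler may now use registers freely
    cpu.shadow = saved   # the OS writes the saved value back
    cpu.iret()
    return cpu.taint
```

Starting from taint bits `0b11`, the model returns `0b11` when the handler uses the rep-prefixed idiom, but `0b10` when a nested interrupt arrives after a plain `xor` has untainted register 0, which is exactly the loss the rep prefix is meant to prevent.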
In addition to the context switch, the operating system has to flush the cache when the content of a non-transient memory is initially loaded from the binary. This is important as the initial data transfer to the memory page is not done through the non-transient user-space mapping. Thus, the operating system has to either disable the cache before this operation or flush the corresponding cache lines afterwards. This functionality is already present in the x86 ISA and supported by modern operating systems, thus, there is no further change required.
4. Implementation of ConTExT
In this section, we present our implementation of both ConTExT and ConTExT-light, which we use for the evaluation (Section 5). As we cannot change real x86 hardware or emulate the hardware changes required for ConTExT on commodity hardware, we opted for a hardware simulation of our changes using a full-system emulator (Section 4.1). While this does not allow measuring performance in terms of runtime, it allows measuring performance in terms of the number of memory accesses, non-transient memory accesses, taint over-approximations, etc., for real-world benchmarks.
For ConTExT-light, we present a method to partially emulate the non-transient memory mapping behavior on commodity hardware by retrofitting uncacheable memory mappings. Thus, in Section 4.2, we present an open-source proof-of-concept implementation of ConTExT-light which can already be used and evaluated on commodity hardware. However, ConTExT-light does not provide complete security guarantees, as secrets can still be leaked from registers, line fill buffers, load buffers, and store buffers. ConTExT also inherently protects against the recent attacks on microarchitectural buffers (Schwarz2019ZL, ; Minkin2019, ; vanSchaik2019, ; Schwarz2019STL, ): as ConTExT prevents leakage from registers, the state of microarchitectural buffers does not matter.
4.1. Hardware Simulation
We simulated ConTExT using the open-source x86-64 emulator Bochs (Lawton1996bochs, ) to get as close as possible to functionally extending a real x86-64 processor with our features: non-transient memory mappings (Section 3.1) as well as secret tracking (Section 3.2). We incorporated hardware and behavioral changes in our ConTExT-enabled Bochs.
To support secret tracking, a few minor hardware changes are required. Mostly, these are single bits to track whether a register is non-transient. These bits are required in every page-table entry, TLB entry, and register. Furthermore, we introduce additional bits per cache line to minimize the performance cost of register spills (Section 3.2).
Page-Table Entry. To distinguish non-transient from normal memory mappings, we have to mark every memory mapping accordingly in the PTE. For backwards- and future-compatibility, repurposing one of the ignored bits is the best choice (Section 3.1). If this bit is set, we treat the memory mapping as a region which may contain secrets.
Translation Look-aside buffer. For performance reasons, modern CPUs cache page-table entries in the TLB. Consequently, we need an additional non-transient bit in the TLB, caching the bit of the page-table entry. In Bochs, caching of page-table entries is also implemented as a TLB-like structure allowing the simulated hardware to automatically transfer the added bit from the PTE to the TLB. Thus, for cached page-table entries, memory accesses use the cached non-transient bit from the TLB.
Cache. Bochs only implements an instruction cache but no data cache, which plays a vital role in our design to cache taint information (Section 3.2). Hence, we extended Bochs with data cache emulation by implementing an 8-way (inclusive) last-level cache. As the exact eviction strategy is unknown (Gruss2016Row, ), we used LRU as a good approximation, as it has been used in Intel CPUs until Ivy Bridge (Gruss2016Row, ). In our emulated cache, we added 8 taint bits per cache line.
Model-Specific Registers. As described in Section 3.2, we added two new MSRs to Bochs. Accesses to IA32_TAINT are directly mapped to the taint bits of the registers, allowing the operating system to read and write all at once. To save the current taint state on interrupts (Section 3.2.1), we ensure data consistency between the two MSRs, i.e., a write to IA32_TAINT also (atomically) updates IA32_SHADOW_TAINT. This enables us to implement secure context switches (see the listing lst:iret).
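To make the LRU approximation concrete, the following Python sketch models a single 8-way cache set whose lines carry 8 taint bits each. It is not the Bochs extension itself: the class `LruSet`, its interface, and the conservative default of fully tainted new lines are assumptions for illustration.

```python
from collections import OrderedDict


class LruSet:
    """Toy model of one 8-way set with LRU replacement and taint bits."""

    def __init__(self, ways=8):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> list of 8 taint bits, LRU first

    def access(self, tag, taint_bits=None):
        """Touch a line; return (hit, evicted_tag or None)."""
        evicted = None
        if tag in self.lines:
            hit = True
            self.lines.move_to_end(tag)  # mark as most recently used
        else:
            hit = False
            if len(self.lines) >= self.ways:
                evicted, _ = self.lines.popitem(last=False)  # drop LRU line
            self.lines[tag] = [True] * 8  # assumption: conservative default
        if taint_bits is not None:
            self.lines[tag] = list(taint_bits)
        return hit, evicted

    def taint_bits(self, tag):
        """Return the cached taint bits of a line, or None if not cached."""
        return self.lines.get(tag)
```

Filling the set with eight distinct tags, touching the first one again, and then accessing a ninth tag evicts the second tag, since it has become the least recently used line.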
All behavioral changes are only enabled if the operating system supports and enables ConTExT using the corresponding bit in the control register (Section 3.1). However, taint tracking is enabled unconditionally as it happens implicitly without additional cost. This applies to all operations which transfer data from memory to registers or from registers to registers. In our proof-of-concept implementation, we added the taint tracking to 368 out of 557 instructions implemented in Bochs. If no memory mapping is marked as non-transient, then no register can be tainted. Thus, taint tracking simply has no effect if there is no operating system support.
In addition to the hardware emulation for ConTExT, we implemented ConTExT-light (Section 3) for Linux. Our implementation of ConTExT-light consists of two parts, a kernel module and a runtime library. For the full ConTExT, we provide a compiler extension that minimizes performance penalties of register spills.
For the proof of concept, we emulate non-transient memory mappings via uncacheable memory mappings. Uncacheable memory can generally not be accessed inside the transient-execution domain (Eclypsium2018smm, ; Lipp2018meltdown, ). Lipp et al. (Lipp2018meltdown, ) observed the only exception where memory can be read during transient execution despite being marked as uncacheable: if an attacker can issue a legitimate load of the target address in parallel on another hyperthread running on the same physical core as the attack, the memory content can still be leaked. However, as opposed to ConTExT, ConTExT-light does not claim to protect secrets while they are architecturally stored in registers of running threads. Thus, this exception does not violate the security guarantees of ConTExT-light.
We opted to implement the operating-system changes as a kernel module for compatibility with a wide range of kernels. The kernel module is responsible for setting up non-transient memory mappings. As our proof-of-concept implementation relies on uncacheable memory, we do not retrofit page-table bits but use the page-attribute table to declare a memory mapping as uncacheable.
The kernel module provides an interface for the runtime library (Section 4.2) to set up non-transient memory mappings. This allows keeping the changes in the kernel space minimal as most of the logic and parsing can be implemented in user space. The kernel module ensures that the page-attribute table contains an uncacheable (UC) entry by reprogramming the page-attribute table if this is not already the case. If the runtime library requests a mapping to be marked non-transient via the kernel-module interface, the page-table entry is modified to reference the UC entry in the page-attribute table. Subsequently, the corresponding TLB entry is flushed. We do not flush all cache lines of the mapping, as this would incur additional overhead. Thus, the developer (or runtime library) has to take care that values stored on pages marked as non-transient are not cached before they are marked as non-transient.
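The page-table manipulation described above can be sketched as bit arithmetic. On x86, the PWT (bit 3), PCD (bit 4), and PAT (bit 7) bits of a 4 KiB page-table entry together select one of the eight entries of the page-attribute table, and the memory type UC is encoded as 0. The Python sketch below only models this selection; the function names are hypothetical, and a real kernel module would additionally have to reprogram the PAT if no UC entry exists and flush the TLB entry.

```python
PTE_PWT = 1 << 3  # page-level write-through
PTE_PCD = 1 << 4  # page-level cache disable
PTE_PAT = 1 << 7  # PAT bit position for 4 KiB pages only
PAT_TYPE_UC = 0x00  # encoding of the uncacheable (UC) memory type

# Power-on default of the IA32_PAT MSR (entries: WB, WT, UC-, UC, repeated).
DEFAULT_PAT = 0x0007040600070406


def pat_entries(pat_msr):
    """Return the memory type of each of the eight PAT entries."""
    return [(pat_msr >> (8 * i)) & 0x7 for i in range(8)]


def find_uc_index(pat_msr):
    """Return the index of the first UC entry, or None if there is none."""
    for index, memory_type in enumerate(pat_entries(pat_msr)):
        if memory_type == PAT_TYPE_UC:
            return index
    return None


def mark_uncacheable(pte, pat_msr):
    """Return the PTE with its PAT-index bits pointing at the UC entry."""
    index = find_uc_index(pat_msr)
    if index is None:
        raise ValueError("PAT has no UC entry; it must be reprogrammed first")
    pte &= ~(PTE_PWT | PTE_PCD | PTE_PAT)
    if index & 1:
        pte |= PTE_PWT
    if index & 2:
        pte |= PTE_PCD
    if index & 4:
        pte |= PTE_PAT
    return pte
```

With the power-on default PAT, the first UC entry has index 3, so `mark_uncacheable` sets both PWT and PCD and clears the PAT bit.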
The runtime library sets up all static and dynamic non-transient memory mappings via the kernel-module interface. Our proof-of-concept runtime library supports C and C++ applications and can be included as a single header file. The header file provides a keyword, nospec, to annotate variables as secrets. This keyword ensures that the linker allocates the variables in a dedicated secret section in the ELF binary. Moreover, the header file registers a constructor function which is executed before the actual application, to initialize ConTExT at runtime.
When the application starts, the runtime library identifies all memory mappings in the secret section from the ELF binary. These memory mappings are then set to non-transient (uncacheable) using the kernel module.
The runtime library is only active on application startup and does not influence the application during runtime. During runtime, it is only used if the developer requests dynamic non-transient memory. For this purpose, the runtime library provides the functions malloc_secure and free_secure. These functions mark the allocated memory immediately as non-transient.
For the full ConTExT with hardware support, we also need compiler support. We extended the LLVM compiler (DBLP:conf/cgo/LattnerA04, ) in version 8.0.0 to not use the stack for local variables, but move them to a different part of the memory which we refer to as unprotected stack. The normal stack is marked as non-transient to not leak temporary variables and function parameters the compiler puts on the stack. Thus, to reduce the performance impact, we allocate local variables which are defined by the developer in the unprotected stack, which is not marked as non-transient.
Our implementation is based on the already existing SafeStack extension (Kuznetsov2014CPI, ). We modify the heuristics to move not only specific but all user-defined variables from the non-transient stack to the unprotected stack (SafeStack in the original extension). Allocations coming from function parameters and register spills are put on the non-transient stack.
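The resulting placement policy can be summarized in a few lines. The following Python sketch is a schematic restatement of the rule, not the LLVM pass: the enumeration `AllocationKind` and the function `choose_stack` are hypothetical names introduced for this example.

```python
from enum import Enum, auto


class AllocationKind(Enum):
    USER_VARIABLE = auto()       # local variable defined by the developer
    REGISTER_SPILL = auto()      # compiler-generated spill slot
    FUNCTION_PARAMETER = auto()  # argument passed over the stack
    RETURN_ADDRESS = auto()      # pushed by the call instruction


def choose_stack(kind):
    """Return which stack an allocation of the given kind is placed on."""
    if kind is AllocationKind.USER_VARIABLE:
        return "unprotected"
    # Everything the compiler or ABI places implicitly may hold secrets.
    return "non-transient"
```

Only developer-defined variables leave the non-transient stack in this scheme, which is why return addresses and stack-passed arguments stay where existing binaries expect them.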
In this section, we evaluate ConTExT and ConTExT-light with respect to their security properties and their performance. We evaluate ConTExT on our modified Bochs emulator, and ConTExT-light on a Lenovo T480s (Intel Core i7-8650U, DRAM) running Ubuntu 18.04.1 with kernel version 4.15.0.
We generally assume that the operating system is trusted, as it handles the non-transient memory mappings. First, we explain how ConTExT can be used to protect against all transient-execution attacks, and how current commodity hardware can be retrofitted to partially emulate ConTExT. Second, we show the limitations of ConTExT.
5.1.1. Security of ConTExT
The security guarantees of ConTExT are built on two assumptions: the application developer correctly annotated all secrets as such, and the application does not actively leak secrets (by writing them to memory locations not marked as non-transient). For the evaluation, we distinguish two cases, based on whether the secret values are used architecturally in the application or not while an attacker mounts a transient-execution attack.
Architecturally Unused Secrets
A secret is architecturally unused if the secret is only stored in a non-transient memory region, i.e., there is no part of the secret which is stored in a register, cache, or normal memory region. For example, this is the case if the secret was not used by the time of an attack. However, the application can also be in such a state although the secret has already been used in the past. If all traces of the secret in normal memory or the cache are already overwritten (or evicted), the application returns again to the state where secrets are architecturally unused.
In this state, an attacker can only target the secret itself and not an unprotected copy of it. It is clear that such an attack cannot be successful, as, by definition, transiently executed code cannot retrieve the value from a non-transient memory region. Hence, ConTExT is secure if its implementation fulfills this property.
Architecturally Used Secrets
If the entire secret, or parts of it, are stored in a register, cache, or a memory region not marked as non-transient, the secret is considered architecturally used. In this case, an attacker can target any unprotected copy of the secret, not only the original secret stored in the non-transient memory region. However, an attack fails if the target is marked as secret, i.e., by a non-transient memory mapping, tainted register, or tainted cache line.
If a non-transient memory region is loaded into a register, the register is tainted and, thus, it cannot be targeted. Moreover, the taint is also applied to the corresponding cache line and TLB entry. Any register-to-register operation which copies the secret also copies the taint. Similarly, an operation which copies the secret to a non-transient memory region is also secure. Such operations include, for example, register spills to the stack, temporary storage of registers in local variables, or secrets as function arguments (depending on the calling convention). Tainted registers can only be untainted by destroying their content, i.e., overwriting them with non-secret values. Overwriting a register with an immediate or by using an idiom, e.g., xor rax,rax, generally untaints the register. Using the rep prefix on arithmetic or logical register operations preserves the taint.
Thus, registers cannot be untainted while containing a secret. However, over-approximation can lead to more tainted registers than necessary.
Operations which copy the secret to a memory region not marked as non-transient could be attacked. However, such operations are never implicitly generated by the compiler, as the compiler only uses the stack as temporary memory. Thus, such an operation has to be explicitly defined by the application developer, which violates the assumption that the application does not actively leak secrets.
A remaining scenario is the context switch of the application with used secrets. In such a case, the application is stopped by the operating system, and the current register content is saved to the kernel. As the operating system is aware of register taints, and also considered trusted, it can leverage the taint saving mechanism described in Section 3.2.1. The registers can again be saved in a non-transient memory region to prevent transient-execution attacks on the saved registers. When returning from the kernel, all registers are first tainted (an over-approximation, as they are restored from a non-transient stack), but the original taint is restored just before the end of the context switch. Thus, registers containing secrets are always tainted and cannot be targeted.
5.1.2. Security Limitations of ConTExT-light
As ConTExT-light is implemented using uncacheable memory, we evaluated the security properties of uncacheable memory regarding transient execution. In our experimental setup, we mark a memory mapping as uncacheable using the PAT. We verified that the memory mapping is actually uncacheable.
Unfortunately, even with uncacheable mappings, secrets can still be leaked if the data is already in a register. Similarly, if the data is currently in the load buffer or store buffer, because of other operations on the processor (prefetching, speculative execution, architectural accesses), the data can still be leaked. If the data is in the line fill buffer, again because of other operations on the processor, it can also be accessed like in a regular Spectre or Meltdown attack. Consequently, ConTExT-light does not provide the same security guarantees as ConTExT. ConTExT inherently protects against the microarchitectural data sampling attacks (Schwarz2019ZL, ; Minkin2019, ; vanSchaik2019, ; Schwarz2019STL, ) by preventing leakage from registers. The opportunistic and incorrect loading of registers from microarchitectural buffers is thus not a security problem anymore.
ConTExT can only be effective if used correctly by the application developer, i.e., if the developer marks all secrets as secret and does not actively leak secrets. However, even if used correctly, there are certain limitations which mostly result from a trade-off between performance and security. In the following paragraphs, we point out where application developers must take care to not accidentally leak secrets.
ConTExT does not support tainting registers which are used to steer the control flow, i.e., the instruction pointer or the flags register. Hence, the application developer should be careful to not introduce higher-order leakage through these registers. This reasoning is sound: if the control flow depends on the secret, the code is inherently not side-channel resilient, i.e., other side channels such as cache attacks can already be used to extract the secret.
Instructions such as CRC32 might also leak secrets if a secret value is used as input either directly, or in combination with an attacker-known value. However, as this is again a secret-dependent operation, the developer has to ensure that this does not leak any secrets.
Another responsibility of the developer is that secret values are not actively copied to memory locations not marked as non-transient. This cannot be prevented by either the compiler or the hardware, as it is often necessary, e.g., the tainted output of a crypto operation (ciphertext) is not secret anymore and can be written to normal memory.
As ConTExT-light is only a partial emulation of ConTExT, it comes with some limitations compared to ConTExT. The largest difference to ConTExT is that secrets in registers, the load buffer, the store buffer, and the line fill buffer are not protected. Thus, if a secret is in one of these microarchitectural structures, it remains susceptible to transient-execution attacks.
We evaluated the performance of ConTExT-light as a loose upper bound for the performance overhead of ConTExT. We also evaluate the performance overhead of ConTExT based on our full-system emulation in Bochs. The SPECspeed 2017 evaluation for the unprotected stack of ConTExT is performed on an i7-8700K machine and all other evaluations are performed on an i7-8650U machine. Both systems run Ubuntu Linux 18.04.1 with kernel 4.15.0.
We evaluated the software implications of our proposed hardware changes using our modified version of Bochs and a modified Linux kernel, based on kernel version 4.15. For the Linux kernel, we only had to modify 52 lines in 9 files to support the save and restore of register taints on context switches. These small changes result in a negligible performance overhead on context switches, e.g., for syscalls.
The latency of syscalls increases by a constant value, which is 48 cycles (averaged over syscall invocations). On a standard Ubuntu Linux installation we observed between and syscalls per second on average while performing regular office tasks. On our test system, we observe an overhead on the system load of around at this syscall rate. The highest syscall rate observed for real-world use cases at Netflix was reported to be around syscalls per second (Gregg2018kpti, ). On our test system, we observe an overhead on the system load of around at this syscall rate.
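How a constant per-syscall cost translates into system load can be estimated with simple arithmetic. The Python sketch below assumes that the 48 cycles simply add up and that load is measured as a fraction of one core's cycles; the function name and the example inputs are hypothetical and are not the measured values of this evaluation.

```python
TAINT_SAVE_RESTORE_CYCLES = 48  # constant per-syscall cost stated in the text


def load_overhead(syscalls_per_second, clock_hz):
    """Fraction of one core's cycles spent saving and restoring taints."""
    return syscalls_per_second * TAINT_SAVE_RESTORE_CYCLES / clock_hz


# Hypothetical inputs for illustration only: 100,000 syscalls per second
# on a core running at 2 GHz.
example = load_overhead(100_000, 2_000_000_000)
```

Under these assumptions, the overhead grows linearly with the syscall rate and shrinks with the clock frequency.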
We evaluated the impact of the unprotected stack of ConTExT using the SPECspeed 2017 integer benchmark (SpecCPU2017, ). Table 2 shows that similarly to the original SafeStack implementation (Kuznetsov2014CPI, ), the resulting performance overhead is on average and in the worst case .
Table 2 (excerpt): 648.exchange2_s: no result (would require Fortran runtime).
These results are not surprising, as only addresses of variables change. This only requires very little runtime code for maintaining a second stack pointer. Thus, the small performance overhead is mostly due to the setup time for the additional non-transient stack.
We furthermore evaluated the performance impact introduced by the non-transient stack. As a baseline we consider the case where we only have one non-transient stack and compare it to our design where the non-transient stack is only an additional stack to the regular unprotected one. Based on Intel Pin (Luk2005pin, ), we implemented our own plugin to trace all memory accesses. With the plugin, we evaluated how much memory the non-transient stack consumes. For this purpose we ran the GNU Core Utilities, once compiled with the unmodified compiler, and once compiled with our extended LLVM compiler. Even for these lightweight applications, we measured a reduction of average non-transient stack memory by . The modified LLVM compiler sustained an average non-transient stack usage of , whereas the applications compiled with a vanilla compiler consumed on average on the single non-transient stack. Moreover, for 64 out of the 91 tested applications (), the compiler extension reduced the non-transient stack usage to only , which is below the smallest memory region that can be set non-transient, the size of one virtual page (). The reason for these reductions is that the stack is not used anymore for storing user-defined variables. Hence, the compiler extension makes it practical to deploy ConTExT with the additional non-transient stack.
We evaluated the performance impact of ConTExT-light, both for unmodified applications as well as applications where we annotate secret values as such. For unmodified applications, we do not expect any runtime overhead, except for a constant initialization overhead.
We confirmed this assumption experimentally. The average initialization overhead when starting an application with our current non-optimized implementation is .
For applications with annotated secret values, there is a performance overhead for architectural accesses to the secret. Without ConTExT-light, the secret could be stored in L1, L2, or L3 cache, or in the main memory. Hence, the maximum overhead for a memory access is the difference between an L1 cache hit and a cache miss. The minimum overhead for a memory access is zero (cache miss in both cases). In practice we often see a cache miss instead of an L3 cache hit, which makes an average overhead of 100 cycles on our test system. We evaluated the performance by encrypting a message using OpenSSL’s RSA. We verified that indeed all memory allocations in OpenSSL use the secure functions using ltrace and single-stepping. The performance overhead we measured when annotating all buffers that may (temporarily) contain secrets in an RSA encryption is ( , ). This is not surprising as RSA performs many in-place operations in one secure buffer, and hence, higher overheads are expected. However, this overhead is in the same range as the overhead of the recommended mitigation strategy for Spectre-PHT attacks alone, 62– for serialization barriers (Carruth2018Hardening, ). Aditional overheads are caused by Spectre-BTB mitigations, STIBP (30–) (Larabel2018stibp, ) and IBRS (20–) (Tkachenko2018ibrs_performance, ), as well as mitigations for other Spectre and Meltdown variants. Hence, ConTExT is a viable alternative as its overhead are inherently lower than the ones we observe with ConTExT-light, and ConTExT-light already is in the range of state-of-the-art mitigation approaches. ConTExT improves the performance of ConTExT-light by regular caching and by hiding the latency of register loads, hence, the performance will be significantly higher.
ConTExT is not a defense for commodity systems, as it requires changes across all layers. Yet, compared to all other defenses, it is the first proposal to achieve complete protection (Miller2018taxonomy, ; Canella2018, ). Concurrent to our work, NVIDIA patented a similar idea (Boggs2019memory, ). However, they focus solely on the protection of memory locations, not speculating on memory that might contain secrets. In contrast to their work, we provide protection on a register level, allowing speculative cache and register fills. This clearly has a lower performance impact. Still, the various patents in this area (Intel2014TaintTracking, ; VMWare2013TaintTracking, ; Boggs2019memory, ) give us additional confidence in the practicality of our approach.
Naturally, ConTExT is particularly interesting in cases where isolation is not clear, for example, to protect a sandbox environment from the sandboxed code. There are different ways to select which secrets to protect. One extreme would be to generally mark all data as secret. As this is not practical, related works either restrict it to an architecturally already defined group or let the user annotate secrets. Taram et al. (Taram2019, ) defined all userspace memory and user input as secret. However, this can be very expensive, and consequently, Yu et al. (Yu2019data, ) proposed a less expensive annotation-based protection mechanism. While this is an important discussion, it is orthogonal to this work. Our work shows that if we can mark secrets, we can provide complete protection. From a problem which is, according to Mcilroy et al. (Mcilroy2019, ), currently not solvable in software, ConTExT shifts the landscape such that the problem is not easy to solve, but solvable in software.
Dealing with Edge Cases
As we described in Section 5.1.3, ConTExT does not support tainting registers which are used to steer the control flow, i.e., the instruction pointer or the flags register. If the control flow depends on the secret, the code is inherently not side-channel resilient, as other side channels such as cache attacks can already be used to extract the secret. Hence, we consider only cases where the secret is not leaked via the control flow. In this case, the commonality of all remaining transient-execution attacks is that the secret moves through a register. ConTExT does not prevent any operations from loading data into registers, but it prevents values from being passed on from tainted registers.
There are many elements in a processor that could generally leak data such that a register contains a secret. No matter where the data was leaked from (the memory, the cache, the line fill buffer, the load buffer, the store buffer, or just another register), if the register is tainted, ConTExT does not execute any operation that depends on the value from that register. Hence, under the assumption that the secret has to move through a register (or already be in a register), the protection ConTExT provides is complete. Only violating this assumption would allow bypassing ConTExT. To the best of our knowledge, there is no mechanism on x86-64 that would allow performing an indexed array access without loading the index into a register. This supports our assumption.
As ConTExT prevents the value from being passed on from the tainted register, we do not have any edge cases around the various microarchitectural elements.
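The rule discussed above, that loads into registers always complete while a tainted register is never passed on to a dependent operation during transient execution, can be sketched as a toy model. The Python code below is an illustration under simplifying assumptions, not a description of the actual microarchitecture: the class `ToyCore`, its fields, and the flat address-to-value memory are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCore:
    """Toy register-taint model; hypothetical, not the real microarchitecture."""
    memory: dict                               # address -> value
    secret_addrs: set                          # addresses assumed to be marked non-transient
    regs: dict = field(default_factory=dict)   # register name -> value
    tainted: set = field(default_factory=set)  # names of tainted registers

    def _write(self, dst, value, taint):
        """Write a register and set or clear its taint."""
        self.regs[dst] = value
        if taint:
            self.tainted.add(dst)
        else:
            self.tainted.discard(dst)

    def load(self, dst, addr):
        """A load always completes; the register is tainted if the source is secret."""
        self._write(dst, self.memory[addr], addr in self.secret_addrs)

    def indexed_load(self, dst, base, index_reg, transient):
        """Load memory[base + regs[index_reg]].

        Returns False, without touching memory, if the access is transient
        and depends on a tainted register; otherwise performs the access,
        propagates the taint, and returns True.
        """
        index_tainted = index_reg in self.tainted
        if transient and index_tainted:
            return False  # the tainted value is not passed on
        addr = base + self.regs[index_reg]
        self._write(dst, self.memory[addr],
                    index_tainted or addr in self.secret_addrs)
        return True


core = ToyCore(memory={0: 7, 100: 1, 107: 2}, secret_addrs={0})
core.load("rax", 0)  # the secret lands in rax, and rax becomes tainted
blocked = core.indexed_load("rbx", 100, "rax", transient=True)
allowed = core.indexed_load("rbx", 100, "rax", transient=False)
```

In this model, the secret-dependent array access (the encoding step of a typical covert channel) is withheld while `transient` is set and only proceeds architecturally, regardless of how the secret reached the register.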
ConTExT likely cannot be implemented (efficiently) in microcode or microcode updates. The reason is that the behavior in the critical path when forwarding a value from a register to a dependent instruction has to be modified. To the best of our knowledge, there is no microcode involved in this part for performance reasons.
Our approach is oblivious to virtualization. EPTs equally contain non-transient bits. Identical to the way several other page table bits are combined (e.g., the non-executable bit), if any bit in the hierarchy is set to non-transient, the page is non-transient. Naturally, the extensions we implemented on the operating system level would have to be identically implemented on the hypervisor level. We leave this implementation effort for future work.
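The combination rule for the non-transient bit can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names. Treating the guest page-table walk and the EPT walk as one combined hierarchy is an assumption made for the sketch, based on the statement that EPTs equally contain non-transient bits.

```python
def walk_is_non_transient(level_bits):
    """Combine the non-transient bits of all levels of one page-table walk.

    As with the non-executable bit, a single set bit anywhere in the
    hierarchy makes the whole mapping non-transient.
    """
    return any(level_bits)


def page_is_non_transient(guest_bits, ept_bits=()):
    """Assumed combination for virtualized systems: the page is
    non-transient if either the guest walk or the EPT walk marks it."""
    return walk_is_non_transient(guest_bits) or walk_is_non_transient(ept_bits)


# One entry per paging level, top level first (hypothetical example values).
native = page_is_non_transient([False, False, True, False])
virtualized = page_is_non_transient([False] * 4, [False, True, False, False])
unmarked = page_is_non_transient([False] * 4, [False] * 4)
```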
Implementation of the Microarchitectural Changes
While a microarchitectural implementation would be interesting, it is not necessary to see the practicality of our work. We already have uncacheable memory mappings, which are marked in the page table. Uncacheable memory is not used during speculative execution, although if it is already in a cache, line fill buffer, load buffer, or store buffer, it might be leaked. Hence, there is already a mechanism in current processors which is very similar to the one we propose. While uncacheable memory is much slower than what we propose, it clearly shows that an implementation is possible.
In this paper, we presented ConTExT, a technique to effectively and efficiently prevent the leakage of secrets during transient execution. The basic idea of ConTExT is to transform Spectre from a problem that cannot be solved purely in software (Mcilroy2019, ), to a problem that is not easy to solve, but solvable in software. For this, ConTExT requires minimal modifications of applications, compilers, operating systems, and the hardware. We implemented these in applications, compilers, and operating systems, as well as in a processor simulator.
Mitigating all transient-execution attacks with a principled approach of course costs performance. We provide an approximate proof-of-concept for ConTExT which we use on commodity systems to obtain a loose upper bound for the performance overhead. As seen in our security evaluation, ConTExT is a first proposal for a principled defense tackling the root cause of transient-execution attacks. ConTExT has no performance overhead for regular applications and even with the over-approximation of ConTExT-light, for security-critical applications, which is below the combined overhead of recommended state-of-the-art mitigation strategies. The overhead with ConTExT is below for most real-world workloads. Our work shows that transient execution can be made secure while maintaining a high system performance.
We want to thank Jon Masters for discussions on Spectre mitigations. Concurrent to our research, Jon Masters independently published a brief sketch of a similar idea. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 681402). It was also supported by the Austrian Research Promotion Agency (FFG) via the project ESPRESSO, which is funded by the province of Styria and the Business Promotion Agencies of Styria and Carinthia. It was also supported by the Austrian Research Promotion Agency (FFG) via the K-project DeSSnet, which is funded in the context of COMET – Competence Centers for Excellent Technologies by BMVIT, BMWFW, Styria and Carinthia. It was also supported by the Austrian Research Promotion Agency (FFG) via the competence center Know-Center (grant number 844595), which is funded in the context of COMET – Competence Centers for Excellent Technologies by BMVIT, BMWFW, and Styria. Additional funding was provided by a generous gift from Intel and a generous gift from ARM. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding parties.
- (1) AMD. AMD64 Technology: Speculative Store Bypass Disable, 2018. Revision 5.21.18.
- (2) AMD. Software Techniques for Managing Speculation on AMD Processors, 2018. Revision 7.10.18.
- (3) AMD. Software techniques for managing speculation on AMD processors, 2018.
- (4) ARM Limited. Vulnerability of Speculative Processors to Cache Timing Side-Channel Mechanism, 2018.
- (5) Arzt, S., Rasthofer, S., Fritz, C., Bodden, E., Bartel, A., Klein, J., Le Traon, Y., Octeau, D., and McDaniel, P. FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. ACM SIGPLAN Notices (2014).
- (6) Bhattacharya, S., Maurice, C., Bhasin, S., and Mukhopadhyay, D. Template attack on blinded scalar multiplication with asynchronous perf-ioctl calls. ePrint 2017/968, 2017.
- (7) Boggs, D. D., Segelken, R., Cornaby, M., Fortino, N., Chaudhry, S., Khartikov, D., Mooley, A., Tuck, N., and Vreugdenhil, G. Memory type which is cacheable yet inaccessible by speculative instructions, Jan. 2019. US Patent App. 16/022,274.
- (8) Canella, C., Van Bulck, J., Schwarz, M., Lipp, M., von Berg, B., Ortner, P., Piessens, F., Evtyushkin, D., and Gruss, D. A Systematic Evaluation of Transient Execution Attacks and Defenses. arXiv:1811.05441 (2018).
- (9) Carruth, C. RFC: Speculative Load Hardening (a Spectre variant #1 mitigation), Mar. 2018.
- (10) Chen, G., Chen, S., Xiao, Y., Zhang, Y., Lin, Z., and Lai, T. H. SGXPECTRE Attacks: Leaking Enclave Secrets via Speculative Execution. arXiv:1802.09085 (2018).
- (11) Cheng, W., Zhao, Q., Yu, B., and Hiroshige, S. Tainttrace: Efficient flow tracing with dynamic binary rewriting. In ISCC (2006).
- (12) Chow, J., Pfaff, B., Garfinkel, T., Christopher, K., and Rosenblum, M. Understanding data lifetime via whole system simulation. In USENIX Security (2004).
- (13) Standard Performance Evaluation Corporation. SPEC CPU 2017, https://www.spec.org/cpu2017/ 2017.
- (14) ECLYPSIUM. System Management Mode Speculative Execution Attacks, https://blog.eclypsium.com/2018/05/17/system-management-mode-speculative-execution-attacks/ May 2018.
- (15) Evtyushkin, D., and Ponomarev, D. Covert channels through random number generator: Mechanisms, capacity estimation and mitigations. In CCS (2016).
- (16) Evtyushkin, D., Ponomarev, D., and Abu-Ghazaleh, N. Jump over ASLR: Attacking branch predictors to bypass ASLR. In MICRO (2016).
- (17) Evtyushkin, D., Riley, R., Abu-Ghazaleh, N., and Ponomarev, D. BranchScope: A new side-channel attack on directional branch predictor. In ASPLOS’18 (2018).
- (18) Fog, A. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers, 2016.
- (19) Ge, Q., Yarom, Y., Cock, D., and Heiser, G. A Survey of Microarchitectural Timing Attacks and Countermeasures on Contemporary Hardware. Journal of Cryptographic Engineering (2016).
- (20) Gregg, B. KPTI/KAISER Meltdown Initial Performance Regressions, 2018.
- (21) Gruss, D., Lipp, M., Schwarz, M., Fellner, R., Maurice, C., and Mangard, S. KASLR is Dead: Long Live KASLR. In ESSoS (2017).
- (23) Gruss, D., Maurice, C., Wagner, K., and Mangard, S. Flush+Flush: A Fast and Stealthy Cache Attack. In DIMVA (2016).
- (24) Guri, M., Monitz, M., Mirski, Y., and Elovici, Y. Bitwhisper: Covert signaling channel between air-gapped computers using thermal manipulations. In IEEE CSF (2015).
- (25) Horn, J. speculative execution, variant 4: speculative store bypass, 2018.
- (26) Intel. Intel 64 and IA-32 Architectures Software Developers Manual, Volume 3 (3A, 3B & 3C): System Programming Guide.
- (27) Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual, 2017.
- (28) Intel. Intel Analysis of Speculative Execution Side Channels, https://software.intel.com/security-software-guidance/api-app/sites/default/files/336983-Intel-Analysis-of-Speculative-Execution-Side-Channels-White-Paper.pdf July 2018.
- (29) Intel. Retpoline: A Branch Target Injection Mitigation, June 2018. Revision 003.
- (30) Intel. Speculative Execution Side Channel Mitigations, May 2018. Revision 3.0.
- (31) Ionescu, A. Windows 17035 Kernel ASLR/VA Isolation In Practice (like Linux KAISER)., https://twitter.com/aionescu/status/930412525111296000 2017.
- (32) Irazoqui, G., Eisenbarth, T., and Sunar, B. Cross processor cache attacks. In AsiaCCS (2016).
- (33) Jaeyeon, J., and Zhu, Y. Sensitive data tracking using dynamic taint analysis, https://patents.google.com/patent/US9548986B2 2014.
- (34) Khasawneh, K. N., Koruyeh, E. M., Song, C., Evtyushkin, D., Ponomarev, D., and Abu-Ghazaleh, N. SafeSpec: Banishing the Spectre of a Meltdown with Leakage-Free Speculation. arXiv:1806.05179 (2018).
- (35) Kiriansky, V., Lebedev, I., Amarasinghe, S., Devadas, S., and Emer, J. DAWG: A Defense Against Cache Timing Attacks in Speculative Execution Processors. ePrint 2018/418 (May 2018).
- (36) Kiriansky, V., and Waldspurger, C. Speculative Buffer Overflows: Attacks and Defenses. arXiv:1807.03757 (2018).
- (37) Kocher, P., Horn, J., Fogh, A., Genkin, D., Gruss, D., Haas, W., Hamburg, M., Lipp, M., Mangard, S., Prescher, T., Schwarz, M., and Yarom, Y. Spectre attacks: Exploiting speculative execution. In S&P (2019).
- (38) Kocher, P., Horn, J., Fogh, A., Genkin, D., Gruss, D., Haas, W., Hamburg, M., Lipp, M., Mangard, S., Prescher, T., Schwarz, M., and Yarom, Y. Spectre attacks: Exploiting speculative execution. In S&P (2019).
- (39) Koruyeh, E. M., Khasawneh, K., Song, C., and Abu-Ghazaleh, N. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In WOOT (2018).
- (40) Kuznetsov, V., Szekeres, L., Payer, M., Candea, G., Sekar, R., and Song, D. Code-Pointer Integrity. In OSDI (2014).
- (41) Larabel, M. Bisected: The Unfortunate Reason Linux 4.20 Is Running Slower, Nov. 2018.
- (42) Lattner, C., and Adve, V. S. LLVM: A compilation framework for lifelong program analysis & transformation. In IEEE / ACM International Symposium on Code Generation and Optimization – CGO 2004 (2004), pp. 75–88.
- (43) Lawton, K. P. Bochs: A portable PC emulator for UNIX/X. Linux Journal (1996).
- (44) Leake, E. N., and Pike, G. Taint tracking mechanism for computer security, https://patents.google.com/patent/US8875288B2 2013.
- (45) Lee, S., Shih, M., Gera, P., Kim, T., Kim, H., and Peinado, M. Inferring fine-grained control flow inside SGX enclaves with branch shadowing. In USENIX Security Symposium (2017).
- (46) Levin, J. Mac OS X and IOS Internals: To the Apple’s Core. John Wiley & Sons, 2012.
- (47) Lipp, M., Schwarz, M., Gruss, D., Prescher, T., Haas, W., Fogh, A., Horn, J., Mangard, S., Kocher, P., Genkin, D., Yarom, Y., and Hamburg, M. Meltdown: Reading Kernel Memory from User Space. In USENIX Security Symposium (2018).
- (48) Liu, F., Yarom, Y., Ge, Q., Heiser, G., and Lee, R. B. Last-Level Cache Side-Channel Attacks are Practical. In S&P (2015).
- (49) Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. Pin: building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN notices (2005).
- (50) LWN. The current state of kernel page-table isolation, https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/ Dec. 2017.
- (51) Maisuradze, G., and Rossow, C. ret2spec: Speculative execution using return stack buffers. In CCS (2018).
- (52) Maurice, C., Weber, M., Schwarz, M., Giner, L., Gruss, D., Alberto Boano, C., Mangard, S., and Römer, K. Hello from the Other Side: SSH over Robust Cache Covert Channels in the Cloud. In NDSS (2017).
- (53) Mcilroy, R., Sevcik, J., Tebbi, T., Titzer, B. L., and Verwaest, T. Spectre is here to stay: An analysis of side-channels and speculative execution. arXiv:1902.05178 (2019).
- (54) Microsoft. Mitigating speculative execution side-channel attacks in Microsoft Edge and Internet Explorer, Jan. 2018.
- (55) Miller, M. Mitigating speculative execution side channel hardware vulnerabilities, Mar. 2018.
- (56) Minkin, M., Moghimi, D., Lipp, M., Schwarz, M., van Bulck, J., Genkin, D., Gruss, D., Sunar, B., Piessens, F., and Yarom, Y. Fallout: Reading Kernel Writes From User Space.
- (57) Newsome, J., and Song, D. X. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. In NDSS (2005).
- (58) Oleksenko, O., Trach, B., Reiher, T., Silberstein, M., and Fetzer, C. You Shall Not Bypass: Employing data dependencies to prevent Bounds Check Bypass. arXiv:1805.08506 (2018).
- (59) Osvik, D. A., Shamir, A., and Tromer, E. Cache Attacks and Countermeasures: the Case of AES. In CT-RSA (2006).
- (60) Pardoe, A. Spectre mitigations in MSVC, 2018.
- (61) Pessl, P., Gruss, D., Maurice, C., Schwarz, M., and Mangard, S. DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks. In USENIX Security Symposium (2016).
- (62) Pizlo, F. What Spectre and Meltdown mean for WebKit, Jan. 2018.
- (63) Qin, F., Wang, C., Li, Z., Kim, H.-s., Zhou, Y., and Wu, Y. LIFT: A low-overhead practical information flow tracking system for detecting security attacks. In MICRO (2006).
- (64) Reis, C. Mitigating spectre with site isolation in chrome, https://security.googleblog.com/2018/07/mitigating-spectre-with-site-isolation.html 2018.
- (65) Schwartz, E. J., Avgerinos, T., and Brumley, D. All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask). In S&P (2010), IEEE.
- (66) Schwarz, M., Canella, C., Giner, L., and Gruss, D. Store-to-Leak Forwarding: Leaking Data on Meltdown-resistant CPUs. arXiv:1905.05725 (2019).
- (67) Schwarz, M., Lipp, M., Moghimi, D., Van Bulck, J., Stecklina, J., Prescher, T., and Gruss, D. ZombieLoad: Cross-Privilege-Boundary Data Sampling. arXiv:1905.05726 (2019).
- (69) Schwarz, M., Schwarzl, M., Lipp, M., and Gruss, D. NetSpectre: Read Arbitrary Memory over Network. arXiv:1807.10535 (2018).
- (70) Shankar, U., Talwar, K., Foster, J. S., and Wagner, D. Detecting format string vulnerabilities with type qualifiers. In USENIX Security Symposium (2001).
- (71) Slowinska, A., and Bos, H. Pointless tainting?: evaluating the practicality of pointer tainting. In EuroSys (2009).
- (72) Song, D., Brumley, D., Yin, H., Caballero, J., Jager, I., Kang, M. G., Liang, Z., Newsome, J., Poosankam, P., and Saxena, P. BitBlaze: A new approach to computer security via binary analysis. In International Conference on Information Systems Security (2008).
- (73) Taram, M., Venkat, A., and Tullsen, D. Context-sensitive fencing: Securing speculative execution via microcode customization. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019).
- (74) The Chromium Projects. Actions required to mitigate Speculative Side-Channel Attack techniques, 2018.
- (75) The Chromium Projects. Site Isolation, 2018.
- (76) Tkachenko, V. 20-30% Performance Hit from the Spectre Bug Fix on Ubuntu, Jan. 2018.
- (77) Trippel, C., Lustig, D., and Martonosi, M. MeltdownPrime and SpectrePrime: Automatically-Synthesized Attacks Exploiting Invalidation-Based Coherence Protocols. arXiv:1802.03802 (2018).
- (78) Turner, P. Retpoline: a software construct for preventing branch-target-injection, 2018.
- (79) Van Bulck, J., Minkin, M., Weisse, O., Genkin, D., Kasikci, B., Piessens, F., Silberstein, M., Wenisch, T. F., Yarom, Y., and Strackx, R. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security Symposium (2018).
- (80) van Schaik, S., Milburn, A., Österlund, S., Frigo, P., Maisuradze, G., Razavi, K., Bos, H., and Giuffrida, C. RIDL: Rogue In-flight Data Load. In S&P (May 2019).
- (81) Venkataramani, G., Doudalis, I., Solihin, Y., and Prvulovic, M. Flexitaint: A programmable accelerator for dynamic taint propagation. In IEEE HPCA (2008).
- (82) Wagner, L. Mitigations landing for new class of timing attack, Jan. 2018.
- (83) Wang, X., Jhi, Y.-C., Zhu, S., and Liu, P. Still: Exploit code detection via static taint and initialization analyses. In Annual Computer Security Applications Conference (2008).
- (84) Weisse, O., Van Bulck, J., Minkin, M., Genkin, D., Kasikci, B., Piessens, F., Silberstein, M., Strackx, R., Wenisch, T. F., and Yarom, Y. Foreshadow-NG: Breaking the Virtual Memory Abstraction with Transient Out-of-Order Execution, 2018.
- (85) Wu, Z., Xu, Z., and Wang, H. Whispers in the Hyper-space: High-speed Covert Channel Attacks in the Cloud. In USENIX Security Symposium (2012).
- (86) Wu, Z., Xu, Z., and Wang, H. Whispers in the Hyper-space: High-bandwidth and Reliable Covert Channel Attacks inside the Cloud. IEEE/ACM Transactions on Networking (2014).
- (87) Xu, Y., Bailey, M., Jahanian, F., Joshi, K., Hiltunen, M., and Schlichting, R. An exploration of L2 cache covert channels in virtualized environments. In CCSW’11 (2011).
- (88) Yan, M., Choi, J., Skarlatos, D., Morrison, A., Fletcher, C. W., and Torrellas, J. InvisiSpec: Making Speculative Execution Invisible in the Cache Hierarchy. In MICRO (2018).
- (89) Yarom, Y., and Falkner, K. Flush+Reload: a High Resolution, Low Noise, L3 Cache Side-Channel Attack. In USENIX Security Symposium (2014).
- (90) Yu, J., Hsiung, L., El Hajj, M., and Fletcher, C. W. Data Oblivious ISA Extensions for Side Channel-Resistant and High Performance Computing. In NDSS (2019).