Revizor: Testing Black-box CPUs against Speculation Contracts

05/14/2021
by   Oleksii Oleksenko, et al.

Speculative vulnerabilities such as Spectre and Meltdown expose speculative execution state that can be exploited to leak information across security domains via side-channels. Such vulnerabilities often stay undetected for a long time as we lack the tools for systematic testing of CPUs to find them. In this paper, we propose an approach to automatically detect microarchitectural information leakage in commercial black-box CPUs. We build on speculation contracts, which we employ to specify the permitted side effects of program execution on the CPU's microarchitectural state. We propose a Model-based Relational Testing (MRT) technique to empirically assess the CPU compliance with these specifications. We implement MRT in a testing framework called Revizor, and showcase its effectiveness on real Intel x86 CPUs. Revizor automatically detects violations of a rich set of contracts, or indicates their absence. A highlight of our findings is that Revizor managed to automatically surface Spectre, MDS, and LVI, as well as several previously unknown variants.

1 Introduction

The instruction set architecture (ISA) specifies the functional behavior of a CPU but abstracts from its implementation details (microarchitecture). This abstraction enables rapid development of hardware optimizations without requiring changes to the software stack; unfortunately, it also obscures the security impact of these optimizations. Indeed, over the last decade researchers discovered numerous microarchitectural zero days, including Spectre-style attacks that use microarchitectural state to exfiltrate secret information obtained during transient execution [20, 23]. The problem is expected to get worse as Moore’s law subsides and CPU architects are compelled to apply ever more aggressive optimizations [34].

A number of approaches have been proposed to automate the detection of speculative leaks (e.g., Medusa [26], SpeechMiner [49], and CheckMate [38]). However, they search for instances of specific known vulnerabilities, and the ISA abstraction might still hide unknown ones.

Speculation contracts (short: contracts) [16] have been proposed as a way out of this situation by providing a specification of the microarchitectural side effects. Contracts declare which ISA operations an attacker can observe through a side channel, and which operations can speculatively change the control/data flow. For example, a contract may state: an attacker can observe addresses of memory stores and loads, and the CPU may mispredict the targets of conditional jumps. If a CPU implementation permits the attacker to observe more than that (e.g., addresses of loads after mispredicted indirect jumps), the CPU violates the contract, indicating an unspecified leak in the microarchitecture.

For software developers, contracts are a foundation for microarchitecturally-secure programming: they spell out the assumptions required for checking that mitigations are effective and code is free of leaks. For example, a recent survey [7] classifies existing tools for detecting speculative vulnerabilities in the language of contracts. For hardware developers, contracts can provide a target specification that describes the permitted microarchitectural effects of the CPU’s operations, without putting further constraints on the hardware implementation. Thus, contracts hold the promise to achieve for speculative vulnerabilities what consistency models have provided for memory consistency [3].

However, so far speculation contracts have only been used for establishing security guarantees of small white-box models of CPUs with toy ISAs [16]. It has been an open challenge to test contract compliance of real-world CPUs, with complex ISAs and absent (or intractable) models of the microarchitecture.

Approach. In this paper, we propose a method and a tool for testing real-world CPUs against speculation contracts. Our method, called Model-based Relational Testing (MRT), is a randomized search for contract violations, i.e., for counterexamples to contract compliance.

Here a counterexample is an instruction sequence together with a pair of inputs that produce the same observations according to the contract (contract trace), but different microarchitectural side-effects on the CPU (hardware trace). A counterexample shows that more information is leaked into the microarchitecture than what is specified by the contract.

MRT searches for contract violations by generating random instruction sequences (test cases), together with random inputs to them, and checking if any of them constitutes a counterexample. A key observation is that this check does not require an explicit model of the microarchitecture. This is because one only needs to compare traces of the same kind, that is, contract traces to contract traces, and hardware traces to hardware traces. This enables side-stepping the need for establishing a connection between them via a model of the microarchitecture (as done in [16]) and enables testing commercial black-box CPUs. However, the search for counterexamples on real-world CPUs poses a new set of challenges:

The first challenge is to cope with the intractable search space: ISAs typically include hundreds of instructions, dozens of registers, and permit large memory spaces. This creates an intractable number of possibilities for both test cases and for inputs to them. Moreover, there are no means to measure coverage for black-box CPUs, which precludes a guided search. We solve this problem by using an incremental generation process that aims to create ample opportunities for speculation: (1) We perform testing in rounds, where we start by generating short instruction sequences with few basic blocks, a small subset of registers, and where we confine all memory accesses to a narrow memory range. (2) After each round without counterexample, we invoke a diversity analysis that counts the number of tested instruction patterns that we expect to induce speculative leaks. This analysis triggers reconfiguration of the test generator to gradually expand the search space in subsequent testing rounds.

The second challenge is to obtain deterministic hardware traces from modern high-performance CPUs with complex and unpredictable microarchitectures. For this we (1) create a low-noise measurement environment where we execute test cases in complete isolation and perform a side-channel attack (e.g., Prime+Probe on the L1D cache) to detect leakage into the microarchitecture, and (2) we control the microarchitectural context using a technique we call priming: Priming collects traces for a large number of pseudorandom inputs to the same test case in sequence. In this way, execution with one input effectively sets the microarchitectural context for the next input. This enables collection of hardware traces with predictors primed in a diverse but deterministic fashion, which is key to obtaining comprehensive and stable hardware traces.

The third challenge is to generate contract traces for complex ISAs such as x86. To tackle this challenge we implement executable contracts by instrumenting an existing ISA emulator with a checkpointing mechanism similar to [31], which enables us to explore correct and mispredicted execution paths, and to record the contract-prescribed observations during program execution.

Tool & Evaluation. We implement MRT in a testing framework called Revizor (the name comes from a classical play by Nikolai Gogol about a government inspector arriving incognito in a corrupt town). The current implementation of Revizor supports only Intel x86, which we chose as a worst-case target for our method: a superscalar CPU with several unpatched microarchitectural vulnerabilities, no detailed descriptions of speculation mechanisms, and no direct control over the microarchitectural state.

We evaluated Revizor on two different microarchitectures, Skylake and Coffee Lake, and with different microcode patches. We test those targets against a sequence of increasingly permissive contracts. This gradually filters out common violations, and enables narrowing down on more subtle violations. The key highlights of our evaluation are:

  1. When testing a patched Skylake against a restrictive contract that states that speculation exposes no information, Revizor detects a violation within a few minutes. Inspection shows the violation stems from leakage during branch prediction, i.e., it is a representative of Spectre V1.

  2. When testing Skylake with V4 patch disabled against a contract that permits leakage during branch prediction (and is hence not violated by V1), Revizor detects a violation due to address prediction, i.e., a representative of Spectre V4.

  3. When further weakening the contract to permit leaks during both types of speculation, Revizor still detects a violation. This violation is a novel (minor) variant of Spectre where the timing of variable-latency instructions (which is not permitted to leak according to the contract) leaks into L1D through a race condition induced by speculation.

  4. When making microcode assists possible during collection of the hardware traces, Revizor surfaces MDS [42, 5] on the same CPU and LVI-Null [41] on a CPU with a microcode patch against MDS.

  5. When used to validate an assumption that stores do not modify the cache state until they retire, made in recent defence proposals [51, 44], Revizor discovered that this assumption does not hold in Coffee Lake.

  6. In terms of speed, Revizor processes over 200 test cases per hour for complex contracts, each tested with several hundred inputs, which enables discovery of Spectre V1, V4, MDS, and LVI-Null in under two hours, on average. This shows that finding contract violations is practical, and sometimes even fast. The reason is that the counterexample search is not akin to finding a needle in a haystack: microarchitectural leaks manifest in many programs, and it is sufficient to generate only one of them.

Summary. In summary, starting from simple contracts, Revizor could automatically generate gadgets that represent all three of the known types of speculative leakage: speculation of control flow, address prediction, and speculation on hardware exceptions. The analysis is robust and did not produce false positives, demonstrating the practicality of testing complex real-world CPUs against speculation contracts.

The source code is publicly available under:

2 Background: Contracts

2.1 Hardware Traces and Side-channel leakage

We consider an abstract side-channel attack model whereby an adversary can use side-channels [40, 50, 32] to extract secret information about a victim program execution. Specifically, we focus on microarchitectural side-channels, such as cache timing. We define a hardware trace as a sequence of all the observations made through the side-channel after each instruction during a program execution.

We represent the hardware trace as the output of a function

    htrace = Attack(prog, data, ctx)

that takes three parameters: (1) the victim program prog; (2) the input data processed by the victim program (i.e., the architectural state, including registers and main memory); (3) the microarchitectural context ctx in which the program executes.

The information exposed by a hardware trace depends on the assumed side-channel and threat model.

Example:

If the threat model includes attacks on a data cache, htrace is composed of the cache set indices used by prog's loads and stores. If it includes attacks on an instruction cache, htrace contains the addresses of executed instructions.

A program leaks information via side-channels when its hardware traces depend on the input data: we assume the attacker knows prog and can manipulate ctx, hence any difference between the hardware traces implies a difference in data, which effectively exposes information to the attacker.

Intuitively, hardware traces encompass the microarchitectural leaks during the program execution on a given CPU, including speculative execution. For example, the trace will record a sensitive memory access during a branch misprediction, such as the leak exploited in Spectre [20].

2.2 Legitimate Information Exposure as a Contract

We now show how speculation contracts can be used to specify the information legitimately exposed by each instruction.

A speculation contract [16] specifies the information that can be exposed by a CPU during a program execution under a given threat model. For each instruction in the CPU ISA (or a subset thereof), a contract describes the information exposed by the instruction's execution (observation clause) and the externally-observable speculation that the instruction may trigger (execution clause). When a contract covers only a subset of the ISA, the leakage of the unspecified instructions is undefined.

Instruction   Observation Clause   Execution Clause
Load          expose: ADDRESS      None
Store         expose: ADDRESS      None
Cond. Jump    None                 speculate: if (INVERTED_CONDITION) { IP = IP + TARGET }
Other         None                 None

Table 1: Summary of MEM-COND. Note that the execution clause describes the speculative behavior of a conditional jump: the jump takes place (IP is updated) if the condition is false, the opposite of the non-speculative execution.
Example:

Consider a contract called MEM-COND (summarized in Table 1). It models an L1D cache side channel on a CPU with branch prediction. The contract prescribes that the addresses of all memory accesses may be exposed (hence MEM), which is encoded in the observation clauses of loads and stores. It also prescribes that conditional branches may be mispredicted (hence COND), encoded in their execution clause.

A contract trace ctrace contains the sequence of all the observations the contract allows to be exposed after each instruction during a program execution, including the instructions executed speculatively. Conversely, the information that is not exposed via the contract trace is supposed to be kept secret.

We represent a contract as a function Contract that maps the program and its input to a contract trace:

    ctrace = Contract(prog, data)

    1 z = array1[x]   # base of array1 is 0x100
    2 if (y < 10)
    3   z = array2[y] # base of array2 is 0x200

Figure: Example of Spectre V1
Example:

Consider the program in the figure above, executed with input data={x=10, y=20}. The MEM-COND contract trace is ctrace=[0x110, 0x220], representing that the load at line 1 exposes the accessed address during normal execution, and the load at line 3 exposes its address during the speculative execution triggered by the branch at line 2.
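For concreteness, the following Python sketch computes the MEM-COND trace of the example program; the array bases mirror the figure, and the function is illustrative rather than part of Revizor:

    ARRAY1_BASE, ARRAY2_BASE = 0x100, 0x200

    def mem_cond_ctrace(x: int, y: int) -> list:
        # MEM: the addresses of loads and stores are exposed.
        trace = [ARRAY1_BASE + x]       # line 1, architectural load
        # COND: the branch may be mispredicted, so the load on line 3
        # is exposed on both the correct and the mispredicted path.
        trace.append(ARRAY2_BASE + y)   # line 3, possibly speculative
        return trace

    # Matches the example's trace [0x110, 0x220] (reading the inputs as hex).
    assert mem_cond_ctrace(x=0x10, y=0x20) == [0x110, 0x220]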

A CPU complies [16] with a contract when its hardware traces (collected on the actual CPU) leak at most as much information as the contract traces. Formally, we require that whenever any two executions of any program have the same contract trace (implying the difference between inputs is not exposed), the respective hardware traces should also match.

Definition 1.
A CPU complies with a contract Contract if, for all programs prog, all input pairs (data, data'), and all initial microarchitectural states ctx:

    Contract(prog, data) = Contract(prog, data')  ⟹  Attack(prog, data, ctx) = Attack(prog, data', ctx)
This approach is called relational reasoning, and is natural for expressing information flow properties [8]. In the corresponding terminology [36], Def 1 requires that any program that is non-interferent with respect to a contract must also be non-interferent on the CPU.

Conversely, a CPU violates a contract if there exists a program and two inputs that agree on their contract traces but disagree on the hardware traces. We call the triple (prog, data, data') a contract counterexample. The counterexample witnesses that an adversary can learn more information from hardware traces than what the contract specifies. A counterexample indicates a potential microarchitectural vulnerability that was not accounted for by the contract.

Example:

Consider a contract, called MEM-SEQ, which allows exposure of memory accesses (similarly to MEM-COND), but limits it to only non-speculative accesses. A CPU that leaks on speculatively executed branches will violate MEM-SEQ. Its counterexample is the program in the figure above together with inputs data1={x=10, y=20} and data2={x=10, y=30}: the contract trace for both inputs is ctrace=[0x110]. However, if the CPU mispredicts the branch (line 2) and speculatively accesses memory (line 3), the hardware traces will diverge (htrace1=[0x110, 0x220] and htrace2=[0x110, 0x230]). Yet this is not a counterexample to MEM-COND, because its contract traces already expose the memory accesses on both paths of the branch.

2.3 Concrete Contracts of Speculation

A contract is constructed from a combination of an observation clause and an execution clause. We first describe the individual clauses, and then show how they form concrete contracts.

Observation clauses:


  • MEM (Memory Address): exposes the addresses of data loads and stores. Represents a data cache timing side-channel attack.

  • CT (Constant-Time): extends MEM by additionally exposing the program counter. Represents both data and instruction cache attacks. Based on a typical threat model for constant-time programming (hence the name), except that it does not expose the execution time of variable-latency operations.

  • ARCH (Architectural Observer): extends CT by additionally exposing the values loaded from memory. Represents a same-address-space attack, such as assumed in the Speculative Taint Tracking paper [51].

Execution clauses:


  • SEQ: observations are only collected during sequential execution (in-order, nonspeculative). This is a model of a processor that allows speculation but constrains the information leaked during the speculation when combined with the appropriate observation clause.

  • COND: observations are also collected after conditional jump misprediction. That is, they are collected from both correct and mispredicted paths. The length of the mispredicted path is limited by a predefined speculation window.

  • BPAS: observations are collected after store bypass: all stores are speculatively skipped. The mis-speculated execution rolls back after the speculation window as in COND.

  • COND-BPAS: Combination of COND and BPAS.

Full contracts. We illustrate how the clauses form a contract with examples:

Example:

CT-COND exposes addresses of all memory accesses and of all control-flow transitions, including those on mispredicted paths of conditional branches. CT-COND models a CPU vulnerable to Spectre V1 attacks.

Example:

ARCH-SEQ exposes addresses and values of non-speculative loads and stores. There is a subtle difference from MEM-SEQ: while MEM-SEQ disallows speculative leakage of any values, ARCH-SEQ disallows leakage of only speculatively loaded values. This is equivalent to transient noninterference [51].

3 Challenges of Testing Contract Compliance

In this work, we leverage contracts to check compliance of complex commercial CPUs under realistic threat models. Assuming that a contract properly exposes the expected information leakage in a CPU, finding a counterexample would signify an unexpected, hence potentially exploitable, leakage.

While the original paper [16] proved compliance for an abstract CPU with a toy assembly language, testing compliance of real hardware CPUs with complex ISAs poses significant challenges.

3.1 How to find a counterexample?

The search space for counterexamples is all possible programs, inputs, and all microarchitectural contexts. Such an immense search space cannot be explored exhaustively, thus requiring a targeted search.

CH1: Binary Generation. While a contract prescribes which instructions are permitted to speculate and expose information, we search for unexpected speculation and leakage; thus we need to collect traces that encompass all the instructions. Furthermore, a particular sequence of instructions is usually required to produce an observable leakage, so we need to test different instruction sequences. Moreover, to trigger an incorrect speculation (e.g., a branch misprediction), we need to prime the microarchitectural state in diverse ways. All of this calls for a search strategy that tests diverse instruction sequences with diverse inputs, prioritizing those that are likely to leak or to produce speculation.

CH2: Input Generation. For an input to be useful in forming a counterexample, we need another input that produces the same contract trace; such inputs are called effective inputs. Ineffective inputs, which produce a unique contract trace, constitute wasted effort as they cannot, by definition, reveal a contract violation. This challenge calls for a more structured input generation approach rather than a simple random one, as the probability that multiple random inputs will produce the same contract trace is low.

3.2 How to get stable hardware traces on a real CPU?

CH3: Collection of Hardware Traces. CPUs have no direct interface to record information leaked in hardware traces, such as addresses accessed in a speculative path. Thus, we have to perform indirect sampling-based measurements, which are inevitably imprecise and incomplete.

CH4: Uncontrolled Microarchitectural State. Black-box CPUs normally provide no direct way to set the microarchitectural context for test execution as required by Def 1. For example, branch predictors are not accessible architecturally, and some are not even disclosed. Moreover, speculation depends on multiple, often unknown factors, such as fine-grained power saving [27, 35] or contention on shared resources. Thus, speculation can happen nondeterministically and cause divergent traces without a real information leak (false positive). On the other hand, if the speculation is never triggered during the measurement, speculative leaks cannot be observed, leading to false compliance (false negative).

CH5: Noisy Measurements. The measurements are influenced by neighbour processes on the system, by hardware mechanisms (e.g., prefetching), and by the inherent imprecision of the measurement tools (e.g., timing measurements). This challenge differs from CH4 as it affects the measurement precision rather than the program execution. The noise may result in divergence between otherwise equivalent traces, leading to a false positive.

3.3 How to produce contract traces?

CH6: Collection of Contract Traces. All contracts in [16] are defined for a toy assembly language; it is unclear how to collect traces for a contract describing a complex ISA. To allow a realistic compliance check, we need to work with real binaries generated via a standard compiler toolchain. Hence, we need a method to automatically collect contract-prescribed observations for a given program executed with a given input.

4 Model-based Relational Testing

Figure 1: Main flow of Model-based Relational Testing.

We present Model-based Relational Testing (MRT), our approach to identifying contract violations in black-box CPUs. Here we provide a high-level description, with the technical details to follow (5). Figure 1 shows the main steps.

Test case and input generation. We sample the search space of programs, inputs, and microarchitectural states to find counterexamples. The generated instruction sequences (test cases) are drawn from the ISA subset described by the contract. The test cases and the respective inputs to them are generated to achieve high diversity and to increase the speculation or leakage potential (5.1 and 5.2).

Diversity-guided generation. The testing process is performed in rounds, where earlier rounds exercise smaller search space (i.e., shorter instruction sequences, fewer basic blocks) to speed up testing. After each round that did not yield a counterexample, we invoke a test case diversity analysis which may trigger reconfiguration of the test generator to produce richer test cases, gradually expanding the search space (5.6).

Collecting contract traces. We implement an executable Model of the contract to allow automatic collection of contract traces for standard binaries. For this, we modify a functional CPU emulator to implement speculative control flow based on a contract’s execution clause, and to record traces based on its observation clause (5.4).

Collecting hardware traces. We collect hardware traces by executing the test case on the CPU under test and measuring the observable microarchitectural state changes during the execution according to the threat model. The executor employs several methods to achieve consistent and repeatable measurements (5.3).

Relational Analysis. We analyze the contract and hardware traces to identify potential violations of Def 1. This requires relational reasoning:

  1. We partition inputs into groups, which we call input classes. All inputs within a class have the same contract trace. Thus, input classes correspond to the equivalence classes of equality on contract traces. Classes with a single (ineffective) input are discarded.

  2. For each class, we check if all inputs within a class have the same hardware trace.

If the check fails on any of the classes, we found a counterexample that witnesses contract violation (5.5).
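A minimal Python sketch of this check, assuming ctraces[i] and htraces[i] hold the contract and hardware traces collected for input i (an illustrative data layout, not Revizor's API):

    from collections import defaultdict

    def find_violations(ctraces: list, htraces: list) -> list:
        # 1. Partition inputs into classes by contract-trace equality.
        classes = defaultdict(list)
        for i, ctrace in enumerate(ctraces):
            classes[tuple(ctrace)].append(i)

        violations = []
        for inputs in classes.values():
            if len(inputs) < 2:
                continue  # ineffective input: cannot form a counterexample
            # 2. Within a class, all hardware traces must match.
            if len({tuple(htraces[i]) for i in inputs}) > 1:
                violations.append(inputs)  # same ctrace, different htraces
        return violations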

5 Design and Implementation

We build Revizor, a tool that implements MRT for practical end-to-end testing of x86 CPUs against speculation contracts. We describe the individual components of Revizor and how they address the challenges outlined in 3.

5.1 Test Case Generator

The task of the test case generator is to sample the search space of all possible programs. As described in CH1, the sampling should be diverse, so that we have a chance to observe unexpected leakage or speculation. Fully random generation, however, might produce incorrect programs, e.g., with invalid control flow or memory accesses, leading to unhandled exceptions during their execution. This is why we rely on a randomized generation algorithm that imposes a certain structure on the generated instruction sequence and its memory accesses. It works as follows:

  1. Generate a random Directed Acyclic Graph (DAG) of basic blocks;

  2. Add jump instructions (terminators) at the end of each basic block to ensure the control flow matches the DAG; this confines the control flow.

  3. Add random instructions from the ISA subset selected for the test;

  4. Instrument instructions to avoid faults:

    1. mask memory addresses to confine them within a dedicated memory region, which we call sandbox;

    2. modify division operands to avoid division by zero;

  5. Compile the test case into a binary.

The total number of instructions, functions, and basic blocks per test, as well as the tested instruction (sub)set are specified by the user. We borrow the ISA description from nanoBench [2].
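The following Python sketch illustrates steps 1–3 under simplifying assumptions (forward-only edges, one conditional plus one direct terminator per block); the instrumentation and compilation steps are omitted, and the function is illustrative rather than Revizor's actual generator:

    import random

    def generate_test_case(n_blocks: int, n_instr: int, isa_subset: list) -> str:
        lines = []
        for b in range(n_blocks):
            lines.append(f".bb{b}:")
            # Step 3: random instructions from the tested ISA subset.
            for _ in range(n_instr // n_blocks):
                lines.append(random.choice(isa_subset))
            # Steps 1-2: forward-only jumps keep the control-flow graph acyclic.
            if b < n_blocks - 1:
                target = random.randint(b + 1, n_blocks - 1)
                lines.append(f"JNS .bb{target}")   # conditional edge
                lines.append(f"JMP .bb{b + 1}")    # fallthrough edge
        return "\n".join(lines)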

    1  OR RAX, 468722461
    2  AND RAX, 0b111111000000
    3  LOCK SUB byte ptr [R14 + RAX], 35
    4  JNS .bb1
    5  JMP .bb2
    6  .bb1: AND RCX, 0b111111000000
    7  REX SUB byte ptr [R14 + RCX], AL
    8  CMOVNBE EBX, EBX
    9  OR DX, 30415
    10 JMP .bb2
    11 .bb2: AND RBX, 1276527841
    12 AND RDX, 0b111111000000
    13 CMOVBE RCX, qword ptr [R14 + RDX]
    14 CMP BX, AX

Figure: Randomly generated test case
Example:

The figure above shows a test case produced in multiple steps: (1) the generator created a DAG with three nodes; (2) connected the nodes by placing either conditional or direct jumps (lines 4–5, 10); (3) added random instructions until a specified size was reached (lines 1, 3, 7–9, 13, 14); (4) masked the memory accesses and aligned them to the sandbox base in R14 (lines 2, 6, 12); (5) compiled the test case.

Improving input effectiveness. Using many hardware registers and a large sandbox results in low input effectiveness (CH2), as it increases the likelihood of unique contract traces that cannot be used for relational testing. To improve input effectiveness, the generator produces programs with only four registers, confines the memory sandbox to one or two 4K memory pages, and aligns memory accesses to a cache line (64B). To test different alignments, the accesses are further offset by a random value between 0 and 64 (the same within a test case but different across test cases).
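As a concrete illustration of the masking instrumentation: the AND masks seen in the generated test cases keep only bits 6–11 of an address, selecting one of 64 cache lines within a 4KB page. A sketch of the equivalent transformation (the helper and its parameters are illustrative):

    LINE_MASK = 0b111111000000  # keep bits 6-11: one of 64 cache lines in a 4KB page

    def confine(addr: int, sandbox_base: int, offset: int = 0) -> int:
        # Confines an arbitrary address to the sandbox and aligns it to a 64B
        # cache line; `offset` models the per-test-case alignment offset (0..63).
        return sandbox_base + (addr & LINE_MASK) + offset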

5.2 Input Generator

An input is a set of values to initialize the architectural state, which includes registers (including FLAGS) and the memory sandbox. The values are random 32-bit numbers which we generate using a PRNG.

Improving input effectiveness. Higher entropy of the PRNG leads to lower input effectiveness (CH3.1), because the probability of finding colliding contract traces decreases. We performed experiments where we tune the entropy of the PRNG to maximize the number of contract classes that are covered by at least two inputs. We expect that more sophisticated techniques for creating inputs, e.g. based on symbolic execution, can further improve effectiveness.
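A sketch of such a generator, with entropy_bits as the tuning knob (the parameterization is illustrative, not Revizor's):

    import random

    def gen_input(seed: int, n_values: int, entropy_bits: int) -> list:
        # Each value occupies a 32-bit slot, but only `entropy_bits` of it vary;
        # fewer varying bits make it more likely that two different inputs fall
        # into the same contract-trace class, i.e., become effective inputs.
        rng = random.Random(seed)
        return [rng.getrandbits(entropy_bits) for _ in range(n_values)]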

5.3 Executor

The executor has three goals: (1) collect hardware traces when executing test cases on the CPU (CH3), (2) set the microarchitectural context for the execution (CH4), and (3) eliminate measurement noise (CH5).

Collecting hardware traces. To collect traces we employ methods used by side-channel attacks, but in a fully controlled environment. This allows us to record hardware traces corresponding to the measurements of a powerful worst-case attacker, and spot all consistently-observed leaks via the microarchitectural state. The process involves the following steps:

  1. Load the test case into a dedicated region of memory,

  2. Set memory and registers according to the inputs,

  3. Prepare the side-channel (e.g., prime cache lines),

  4. Invoke the test case,

  5. Measure the microarchitectural changes (e.g., probe cache lines) via the side-channel, thus producing a trace.

The measurement (steps 2–5) repeats for all inputs, thus producing a hardware trace for each test case-input pair.
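The measurement loop can be sketched as follows; `executor` stands for a hypothetical interface to the kernel module, and the method names are illustrative, not Revizor's actual API:

    def collect_htraces(test_case: bytes, inputs: list, executor) -> list:
        executor.load(test_case)                # step 1: load the test case
        htraces = []
        for inp in inputs:
            executor.set_arch_state(inp)        # step 2: registers + sandbox memory
            executor.prime_side_channel()       # step 3: e.g., prime the L1D sets
            executor.run()                      # step 4: invoke the test case
            htraces.append(executor.probe())    # step 5: probe -> bitmap of cache sets
        return htraces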

Our implementation supports several measurement modes:

  • Prime+Probe [32], Flush+Reload [50], and Evict+Reload [14] modes use the corresponding attack on the L1D cache.

  • In *+Assist mode, the executor additionally includes microcode assists: it clears the “Accessed” bit in one of the accessible pages such that the first store or load to it triggers an assist (this is possible in attacks on SGX enclaves, e.g., LVI [41]).

Example:

The hardware trace produced by running the executor in L1D Prime+Probe mode is a sequence of bits, each representing whether a specific cache set was accessed by the test case. E.g., the following trace indicates observed memory accesses to sets 0, 4, and 5: 10001100000000000000000000000000

Setting the microarchitectural context. We cannot directly control the microarchitectural context before the test execution (CH4). To deal with this we developed a technique called priming: the executor collects traces for a large number of pseudorandom inputs (5.2) to the same test case in a sequence. In this way, execution with one input effectively sets the microarchitectural context for the next input. This enables collection of hardware traces with predictors primed in a diverse but deterministic fashion, which is key to obtaining traces that are stable enough for equality checks.

However, priming may result in undesirable artifacts. Recall that MRT searches for inputs data_i and data_j from the same input class, but with divergent hardware traces:

    Attack(prog, data_i, ctx_i) ≠ Attack(prog, data_j, ctx_j)

Due to priming, however, a divergence of hardware traces can also be caused by differences in the microarchitectural contexts ctx_i and ctx_j. For example, earlier inputs can train branch predictors in a way that would prevent speculation for the later inputs.

To filter such cases and verify that the divergence is caused by the difference in inputs and not by the difference in contexts, we swap data_i and data_j in the priming sequence, which enables us to test data_j with the context ctx_i and vice versa. That is, we check:

    Attack(prog, data_j, ctx_i) = Attack(prog, data_i, ctx_i)  and  Attack(prog, data_i, ctx_j) = Attack(prog, data_j, ctx_j)

If this condition holds, we discard the divergence as a measurement artifact; otherwise we report a contract violation.

Example:

Consider two inputs data_1 and data_2 from the same input class with different hardware traces, and assume that, in the original sequence of inputs, data_1 was at position 100 and data_2 at position 200. For priming, the executor additionally tests a sequence with data_2 at position 100 and data_1 at position 200. The executor will consider the divergence a false positive if data_1 at position 200 produces the same trace as data_2 at position 200, and vice versa.
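The swap check can be sketched in Python as follows, where `run(test_case, seq, k)` is a hypothetical helper returning the hardware trace of the input at position k after replaying the priming sequence `seq`:

    def is_measurement_artifact(run, test_case, seq: list, i: int, j: int) -> bool:
        # Inputs seq[i] and seq[j] are in the same input class but their
        # hardware traces diverged; swap them to re-test each input in the
        # other's microarchitectural context.
        swapped = list(seq)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        return (run(test_case, swapped, i) == run(test_case, seq, i) and
                run(test_case, swapped, j) == run(test_case, seq, j))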

Eliminating measurement noise. Hardware traces in the same input class may also diverge (and thus be incorrectly considered a contract violation) due to several additional sources of inconsistency, which we eliminate as follows:

  1. Eliminating measurement noise (CH5). The executor uses performance counters for cache attacks, reading the L1D miss counter before and after probing a cache line. This proved to give more stable results than timing readings.

  2. Eliminating external software noise (CH5). We run the executor as a kernel module (based on nanoBench [2]). A test is executed on a single core, with hyperthreading, prefetching, and interrupts disabled. The executor also monitors System Management Interrupts (SMIs) [18] and discards measurements polluted by an SMI.

  3. Reducing nondeterminism (CH4). We repeat each measurement (50 times in our experiments) after several rounds of warm-up, and discard one-off traces as likely caused by noise. We then take the union of all traces collected from the executions of a test case with the same input, which encompasses all consistently observed variants of speculative behavior under different microarchitectural contexts.

Example:

Consider again the test case from 5.1. If the branch in line 4 is speculated differently across runs, one and the same input may produce different traces:

00001010000001000000000000000001

00001000000001000000000000000001

The first trace is with a misprediction (which adds cache set 6), and the second without. The merged trace is: 00001010000001000000000000000001

Discarding all outliers observed only once during the test might miss rare cases that reveal real leaks. However, we found it necessary from a practical perspective: each reported violation requires manual investigation, and since the outliers turned out to be notoriously hard to reproduce and verify, we opted to focus on the leaks that are easier to distinguish from the noise.

5.4 Model

The Model's task is to automate the collection of contract traces (CH6). We achieve this by executing test cases on an ISA-level emulator modified according to the contract: the emulator implements the contract's execution clause, e.g., exploring all speculative execution paths followed by a rollback, and it collects observations based on the observation clause. The resulting trace is the list of observations collected while executing a test case with a single input. We base our implementation on Unicorn [33], a customizable x86 emulator, which we modified to implement the clauses listed in 2.3.

Observation Clauses. When the emulator executes an instruction listed in the observation clause, it records its exposed information into the trace. This happens during both normal and speculative execution, unless the contract states otherwise.

Example:

Consider the test case from 5.1 and the contract MEM-SEQ. As prescribed by the contract, the model records the accessed addresses when executing lines 3, 7, and 13. Suppose the branch (line 4) was not taken, the store (line 3) accessed 0x100, and the load (line 13) accessed 0x340. Then the contract trace is ctrace=[0x100, 0x340].
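A minimal sketch of the observation-clause instrumentation on top of Unicorn's Python bindings, assuming the test case and sandbox are already mapped and initialized (setup elided):

    from unicorn import (Uc, UC_ARCH_X86, UC_MODE_64,
                         UC_HOOK_MEM_READ, UC_HOOK_MEM_WRITE)

    ctrace = []

    def observe_mem(uc, access, address, size, value, user_data):
        # MEM observation clause: expose the address of every load and store.
        ctrace.append(address)

    emu = Uc(UC_ARCH_X86, UC_MODE_64)
    # ... map the test-case binary and the sandbox, set registers from the input ...
    emu.hook_add(UC_HOOK_MEM_READ | UC_HOOK_MEM_WRITE, observe_mem)
    # emu.emu_start(code_start, code_end)  # afterwards, ctrace holds the MEM-SEQ trace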

Execution Clauses are implemented similarly to the speculation exposure mechanism introduced in SpecFuzz [31]: Upon encountering an instruction with a non-empty execution clause (e.g., a branch in MEM-COND), the emulator takes a checkpoint. The emulator then simulates speculation as described by the clause until (1) the test case ends, (2) the first serializing instruction is encountered, or (3) the maximum possible speculation depth is reached. Then, it rolls back and continues normal execution.

As multiple mispredictions may happen together, the emulator supports nested speculation through a stack of checkpoints: When starting a simulation, the checkpoint is pushed, and afterwards, the emulator rolls back to the topmost checkpoint.

Practically, however, nested speculation greatly reduces the testing speed, which is why we disable nesting by default. This artificially reduces the amount of leakage permitted by the contract, potentially causing false violations (since hardware traces would still include nested speculation). To identify such false violations, Revizor re-executes all reported violations with nesting enabled.
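The checkpoint stack can be sketched as follows; `emu` stands for an emulator with full state save/restore (a hypothetical interface, simplified from the Unicorn-based implementation):

    class SpeculationSimulator:
        def __init__(self, emu, max_nesting: int):
            self.emu = emu
            self.max_nesting = max_nesting
            self.checkpoints = []  # stack: innermost speculation on top

        def on_speculatable_instruction(self, mispredicted_target: int):
            if len(self.checkpoints) < self.max_nesting:
                self.checkpoints.append(self.emu.save_state())  # push checkpoint
                self.emu.set_pc(mispredicted_target)            # simulate misprediction

        def rollback(self):
            # Invoked when the speculation window expires, a serializing
            # instruction is reached, or the test case ends.
            self.emu.restore_state(self.checkpoints.pop())      # resume correct path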

5.5 Analyzer

The analyzer compares traces using relational analysis (4). As hardware traces are obtained as the union of observations collected from the same input in different microarchitectural contexts (5.3), we relax the requirement of hardware-trace equality to a subset relation. Specifically, we consider two traces equivalent if every observation included in one trace is also included in the other.

The intuition behind this heuristic is as follows. If a mismatch is caused by an inconsistent execution of a speculative path among the inputs, one of the inputs executed fewer instructions; therefore fewer observations appear in its trace, but those that do appear match the other trace. In contrast, if the mismatch is caused by a secret-dependent instruction, the traces contain the same number of observations, but their values differ. To validate this intuition, we manually examined multiple such examples and did not observe any real violation.
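In code, the relaxed comparison amounts to a subset check over observation sets (a sketch, assuming merged traces are represented as Python sets):

    def equivalent(htrace_a: set, htrace_b: set) -> bool:
        # Two merged hardware traces are treated as equal if one is contained
        # in the other; a genuine mismatch requires observations on both sides
        # that the other side lacks.
        return htrace_a <= htrace_b or htrace_b <= htrace_a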

5.6 Test Diversity Analysis

If a testing round did not detect any violation, we need to decide how to improve the chances of finding one in the next round. As we test black-box CPUs we cannot measure coverage of the exercised CPU features to guide the test generation in the next round.

Instead, we seek to estimate the likelihood of exercising new speculative paths with the current configuration of the test case generator by analyzing the diversity of the tests we ran so far (CH1). We capture test diversity using a new measure called pattern coverage, which counts data and control dependencies that are likely to cause pipeline hazards. We expect higher pattern coverage to correlate with higher chances of surfacing speculative leaks. Therefore, if a testing round does not improve the pattern coverage of the tests so far, new speculative paths are unlikely to be explored. To facilitate generation of more diverse tests, Revizor then increases the number of instructions and basic blocks per test. We now discuss this approach in more detail.

Patterns of instructions. We define patterns in terms of instruction pairs. To simplify the counting of pattern coverage we require that the instructions are consecutive, which corresponds to the worst case for creating hazards. We distinguish three types:

  1. A memory dependency pattern is two memory accesses to the same address. We consider 4 patterns: store-after-store, store-after-load, load-after-store, load-after-load.

  2. A register dependency pattern is when one instruction uses a result of another instruction. We consider 2 patterns: dependency over a general-purpose register, and over FLAGS.

  3. A control dependency pattern is an instruction that modifies the control flow followed by any other instruction. In this paper we consider 2 patterns: conditional and unconditional jumps. Larger instruction sets may include indirect jumps, calls, returns, etc.

We say that a program with an input matches a pattern if that pattern is found in two consecutive instructions in the corresponding instruction stream. Since a single input cannot form a counterexample, a pattern is covered if a program and two inputs in the same input class match the pattern.
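The following Python sketch shows how the three pattern types can be detected over an executed instruction stream; the Instr record is a hypothetical simplification of what the Model records:

    from dataclasses import dataclass, field

    @dataclass
    class Instr:
        kind: str                  # e.g. "load", "store", "add", "jcc", "jmp"
        address: int = 0           # memory address, if kind is load/store
        writes: set = field(default_factory=set)  # registers written
        reads: set = field(default_factory=set)   # registers read

    def matched_patterns(stream: list) -> set:
        patterns = set()
        for prev, curr in zip(stream, stream[1:]):  # consecutive pairs only
            if prev.kind in ("load", "store") and curr.kind in ("load", "store") \
                    and prev.address == curr.address:
                patterns.add(f"{curr.kind}-after-{prev.kind}")  # memory dependency
            if prev.writes & curr.reads:
                patterns.add("register-dependency")
            if prev.kind in ("jcc", "jmp"):
                patterns.add(f"control-{prev.kind}")            # control dependency
        return patterns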

                 Target 1  Target 2  Target 3    Target 4    Target 5   Target 6       Target 7    Target 8
CPU              Skylake   Skylake   Skylake     Skylake     Skylake    Skylake        Skylake     Coffee Lake
V4 patch         off       off       off         on          on         on             on          on
Instruction Set  AR        AR+MEM    AR+MEM+VAR  AR+MEM+VAR  AR+MEM+CB  AR+MEM+CB+VAR  AR+MEM      AR+MEM
Executor Mode    P+P       P+P       P+P         P+P         P+P        P+P            P+P+Assist  P+P+Assist
Table 2: Description of the experimental setups (P+P = Prime+Probe).

               Target 1  Target 2  Target 3     Target 4  Target 5  Target 6     Target 7  Target 8
CT-SEQ         ✗         ✓ (V4)    ✓ (V4)       ✗         ✓ (V1)    ✓ (V1)       ✓ (MDS)   ✓ (LVI-Null)
CT-BPAS        –         ✗         ✓ (V4-var)†  –         ✓ (V1)    ✓ (V1)       ✓ (MDS)   ✓ (LVI-Null)
CT-COND        –         ✓ (V4)    ✓ (V4)       –         ✗         ✓ (V1-var)†  ✓ (MDS)   ✓ (LVI-Null)
CT-COND-BPAS   –         ✗         ✓ (V4-var)†  –         ✗         ✓ (V1-var)†  ✓ (MDS)   ✓ (LVI-Null)
Table 3: Testing results. ✓ means Revizor detected a violation; ✗ means Revizor detected no violations within 24h of testing; – means we did not repeat the experiment as a stronger contract was already satisfied. In parentheses are the Spectre-type vulnerabilities revealed by the detected violations; † marks violations that represent a novel speculative vulnerability.

To provide opportunities for interaction between different speculation types, we count not just individual patterns, but also their combinations.

Implementation. We implement tracking of patterns as part of the Model (5.4): While collecting contract traces of a test case, the model also records the executed instructions and the addresses of memory accesses. These data are later analyzed to find the patterns in the instruction streams.

Coverage Feedback. We use pattern combination coverage as feedback to the test generator. We begin with test cases of a small size, a small number of basic blocks, and few inputs per test case (e.g., 10 instructions, 2 blocks, 50 inputs). We continue until all individual patterns are covered. Then we increase these sizes by constant factors (e.g., 15 instructions, 3 blocks, 75 inputs) and continue testing until all combinations of patterns are covered, and so on.

5.7 Postprocessor

When a violation is detected, the test case is passed to the postprocessor, which minimizes the test case in three stages:

First, the postprocessor creates a minimal input sequence: It removes inputs until it finds the smallest sequence to correctly prime the microarchitectural state for the violation. Second, it creates a minimal test case: It removes one instruction at a time while checking for violations. Third, it minimizes the speculative part: It adds LFENCEs, starting from the last instruction, while checking for violations. The resulting region without fences is the location of leakage.
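Stage two, for instance, is a straightforward reduction loop; `violates` is a hypothetical predicate that re-runs the full MRT check on a candidate test case:

    def minimize_test_case(instructions: list, violates) -> list:
        i = 0
        while i < len(instructions):
            candidate = instructions[:i] + instructions[i + 1:]
            if violates(candidate):
                instructions = candidate  # instruction was irrelevant: keep removal
            else:
                i += 1                    # instruction is needed for the violation
        return instructions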

    1 AND RAX, 0b111111000000 ; LFENCE
    2 LOCK SUB byte ptr [R14 + RAX], 35
    3 JNS .bb1 ; LFENCE
    4 JMP .bb2
    5 .bb1: AND RCX, 0b111111000000
    6 REX SUB byte ptr [R14 + RCX], AL
    7 .bb2: LFENCE

Figure: Minimized test case, representative of Spectre V1.
Example:

The figure above is a minimized version of the test case from 5.1. The region without LFENCEs is the location of the leakage: the store (line 2) delays the jump (line 3), thus sufficiently prolonging the speculation. The jump mispredicts and goes to line 5. This causes speculative execution of SUB (line 6), which has a memory operand and thus leaks the value of RCX.

6 Evaluation

In this section, we demonstrate Revizor’s ability to expose contract violations and automatically identify speculative execution vulnerabilities in two generations of Intel CPUs.

6.1 Experimental Setup

We test multiple CPUs, ISA subsets, and threat models against several contracts. The experiments are summarized in Table 2.

CPUs (rows 1 and 2). We run our experiments on two machines: the first has an Intel Core i7-6700 CPU (Skylake), the second an Intel Core i7-9700 CPU (Coffee Lake). We analyze Skylake with the Spectre V4 microcode patch enabled and disabled. Coffee Lake has a hardware MDS patch.

Instruction Sets (row 3). We build our test cases from the following subsets of x86 (we do not consider bit count, bit test, and shift instructions because Unicorn sometimes emulates them incorrectly):

  • AR: in-register arithmetic, including logic and bitwise;

  • MEM: memory operands and loads/stores;

  • VAR: variable-latency operations (divisions);

  • CB: conditional branches.

This yields the following numbers of unique instructions: AR—325; AR+MEM—678; AR+MEM+VAR—687; AR+CB—359; AR+MEM+CB—710; AR+MEM+CB+VAR—719.

Threat Models (row 4). We tested contracts against two threat models, Prime+Probe and Prime+Probe+Assist (see 5.3). Note that we use a 4KB sandbox, so Flush+Reload and Evict+Reload would produce equivalent traces, as the 64 L1D cache sets correspond to the 64 memory blocks in a 4KB region.

Configuration. Generation started from 8 instructions, 2 memory accesses, and 2 basic blocks per test case; 2 bits of input entropy; 50 inputs per test case. The parameters increased over testing rounds.

6.2 Testing Results

We report our findings when testing the targets in Table 2 against different contracts. We tested each for 24 hours or until the first violation was found. The results are in Table 3.

Target 1: Baseline. As a baseline, we test the narrowest instruction set, AR, containing only arithmetic operations, on Skylake (with the V4 patch disabled) using the weakest threat model (P+P without assists). We expect the target to comply with the most restrictive contract (CT-SEQ). The experiments confirm it: Revizor did not detect violations (column 1 of Table 3). Since the other contracts are more liberal, the target also complies with them. This experiment shows that Revizor successfully mitigates measurement noise and filters the artifacts of non-deterministic execution, producing no false violations.

Target 2: Memory Accesses. When augmenting the instruction set with memory accesses to AR+MEM (for the same CPU and threat model), Revizor detects violations of CT-SEQ and CT-COND (B.2). Upon manual inspection, we identify those violations as representative of Spectre V4 (Speculative Store Bypass) [12]. Revizor does not detect violations of CT-BPAS and CT-COND-BPAS, which is expected as they permit the store bypass.

Target 3: Variable-latency Instructions. When further augmenting the instruction set with divisions (the only variable-latency instructions in the base x86 [1]) to AR+MEM+VAR, Revizor finds violations of CT-BPAS and CT-COND-BPAS. Upon inspection, they reveal a novel variant of Spectre V4 that leaks the timing of division (not permitted to be exposed according to the contract). We discuss this variant in 6.3.

Target 4: V4 Patch. We change the experiment described in Target 3 by enabling the V4 patch on Skylake. Our experiments do not surface any contract violations, showing that the V4 patch is effective, including against the novel V4 variant.

Targets 5–6: Conditional Branches. When augmenting AR+MEM with conditional branches to AR+MEM+CB, Revizor detects violations of CT-SEQ and CT-BPAS (B.1). Upon inspection, these are representative of Spectre V1 [20]. Revizor detects no violations of CT-COND and CT-COND-BPAS, which is expected as the contracts permit exposing accesses during the execution of a mispredicted branch.

When further augmenting the instruction set with variable-latency instructions to AR+MEM+CB+VAR, Revizor detects violations of CT-COND and CT-COND-BPAS. Similar to Target 3, the violations represent novel variants of Spectre V1 (6.3).

Target 7: Microcode Assists. We now perform experiments with a different threat model, corresponding to an adversary that can cause microcode assists. To test the assists in isolation, we test AR+MEM, and we enable the V4 patch to avoid violations caused by V4. Revizor now detects violations of all contracts, which we identify as representative of MDS [37, 5] (B.3).

Target 8: MDS Patch. We repeat the experiment in Target 7, but now on Coffee Lake, which has a hardware MDS patch. Revizor detected violations on it as well, which we identify as LVI-Null [41], a known vulnerability of the MDS patch (B.4).

Summary. We see that Revizor successfully discovered several known and also unknown vulnerabilities, fully automatically, without manual intervention.

6.3 Novel Variants Discovered

Revizor discovered two new types of speculative leakage of the instruction latency. As they represent variations on Spectre V1 and V4, and the existing defences prevent them, we did not report them to Intel. Yet they should be considered when developing future defences, hence we describe them next.

The latency of some operations (e.g., division) depends on their operand values. The timing difference exposes the values to the attacker who can measure the program’s execution time. However, as Revizor discovered, the timing can also impact the cache state, thus leaking through caches as well.

The figure below shows a simplified version of the V1 variant (the complete test case is in B.5). The key observation is that the leakage happens due to a race condition:

  • if the variable-latency operation (line 1) is faster than the branch instruction (line 2), then the memory access (line 3) could leave a speculative cache trace.

  • otherwise, the speculation will be squashed before the operation completes, and the memory access will not be executed.

As such, the hardware traces expose not only the accessed address, but also the latency of the operation at line 1.

The discovered V4 variant exploits the same race condition; we expect it to affect other speculative vulnerabilities as well.

    1 b = variable_latency(a)
    2 if (...)       # misprediction
    3   c = array[b] # executed if the latency is short

Figure: New Spectre V1 variant (V1-Var), found by Revizor.

6.4 Validating Assumptions about Speculative Execution

Several defence proposals (STT [51], KLEESpectre [44]) assume that stores do not modify the cache state until they retire. We used Revizor to validate this assumption: we modified CT-COND to capture the assumption in the contract trace, and tested our CPUs against it (A.5). Revizor discovered no violations on Skylake, but found a counterexample on Coffee Lake (B.6). It looked similar to Spectre V1, except that the trace was left by a speculative store. This is evidence that the assumption was wrong and speculative stores can modify the cache state.

6.5 Detection Time

Contract-permitted   V4-type        V1-type       MDS-type      LVI-type
leakage              (Target 2)     (Target 5)    (Target 7)    (Target 8)
None                 73'25" (.7)    4'51" (.9)    5'35" (.7)    7'40" (1.1)
V4                   N/A            3'48" (.7)    6'37" (.8)    3'06" (1.0)
V1                   140'42" (.6)   N/A           7'03" (.8)    3'22" (.3)
Table 4: Detection time: the testing time elapsed before the first detected violation. The numbers are means over 10 measurements; in parentheses are coefficients of variation. Most vulnerabilities are automatically detected within minutes. The second and third rows show that detection remains fast even with multiple leakage types in a test case (details in 6.5).

We next measure the time required to find a counterexample. We test each of the targets in 6.2 that had violations (Targets 2, 5, 7, 8) against CT-SEQ, 10 times each, and report the average time until the first violation (row 1 of Table 4). (We did not measure the detection time of the variants discovered in Targets 3 and 6, as they are too rare for repeated measurements.)

Revizor detected most violations in under 10 minutes while still using short test cases. This demonstrates the importance of diversity-driven feedback. Revizor took longer to find V4-like violations as they require a longer speculation window, and the hardware predictor is less prone to misprediction.

Coping with multiple types of speculative leakage. We measure how fast Revizor detects a violation when two types of speculative leakage are present in the test case but one of them is permitted by the contract; that is, Revizor has to detect an unexpected leakage while ignoring an expected one. The second and third rows of Table 4 show the detection time when Spectre V4 and V1, respectively, are permitted by the contract and present in the test case, while testing against CT-BPAS (V4 patch disabled). We observe that these additional leakages did not prevent detection of the vulnerabilities, albeit sometimes slowing it down due to reduced input effectiveness.

Number of Inputs to Violation. We analyze the number of random inputs required to surface a violation of CT-SEQ with manually-written test cases representing the Spectre and MDS vulnerabilities. Table 5 reports the average over 100 experiments, each with a different input generation seed. Revizor detected all violations with few inputs (i.e., in less than a second), illustrating the importance of further research on targeted test case generation.

Violation V1 V1.1 V2 V4 V5-ret MDS-LFB MDS-SB
Type [20] [19] [20] [12] [25, 21] [42, 37] [5]
# Inputs 6 6 4 62 2 2 12
Table 5: Detection of known vulnerabilities on manually-written test cases. # Inputs is the average minimal number of random inputs necessary to surface a violation.

6.6 Contract Sensitivity

    (a) CT-SEQ violation:        (b) ARCH-SEQ violation:
    a = array1[b]                if (...)
    if (...)                       a = array1[b]
      c = array2[a]                c = array2[a]

Figure: Subtle difference in sensitivity of different contracts.

The classic Spectre V1 exploit [20] relies on two speculative loads, where the address of the second leaks the value loaded by the first, as in listing (b) above. Hardware defenses based on speculative taint tracking (STT) [51, 47] prevent such leaks, but they do not intend to prevent leaks of non-speculatively loaded data, as in listing (a). As both examples would violate CT-SEQ, this contract cannot be used to test STT-like defenses. Instead, we implement ARCH-SEQ (described in 2.3), which only forbids leakage of speculatively loaded data. When testing Skylake against ARCH-SEQ, Revizor indeed reports violations corresponding to the classic V1 gadget (b) and does not report violations for (a).

7 Scope and Limitations

False contract violations (false positives). If the model incorrectly emulates the ISA, it leads to false positives. For this reason, we excluded from the tests some instructions that are not implemented correctly in Unicorn. Nondeterminism in the executor may also cause false positives. However, we inspected a few counterexamples in each of the experiments described in 6 and found no false positives.

False contract conformance (false negatives). In several tests, Revizor did not detect violations (Table 3). This indicates, but does not prove, the absence of leaks: it merely shows that the explored space contained no counterexamples. Deeper testing is an open question left to future work.

Generation of effective inputs. Revizor applies several restrictions to improve input effectiveness (CH2). This limits the test diversity and might cause false negatives. To eliminate them, future work may develop a targeted generation method that ensures effectiveness via program analysis, similar to Spectector [15] or Scam-V [28].

Pattern coverage. We used hazards as a proxy for speculation. Yet a hazard is not a sufficient condition for speculative leakage. For example, to trigger Spectre V1, the branch predictor must be mistrained, and the speculation must be long enough to leave a trace. These preconditions are hard to control on commercial CPUs; thus, high pattern coverage does not guarantee that speculation was exercised. Improved heuristics for estimating the speculation opportunities in generated test cases might lead to better results.

Instruction set. Revizor currently supports a subset of x86. A real-world testing campaign would require the complete set.

Other side-channels. Revizor currently supports only attacks on L1D caches. For other side-channels, we would have to implement them within the executor (e.g., execution port attacks require reading the port load). For certain speculative attacks, the executor would also have to be modified (e.g., Meltdown requires handling of page faults).

8 Related Work

Several papers investigated testing against speculative vulnerabilities. Nemati et al. [29] showed a method for detecting Spectre V1 in ARM CPUs. Medusa [26] is a fuzzer for detecting variants of MDS. SpeechMiner [49] is a tool for analyzing speculative vulnerabilities. All of them target specific attacks, while Revizor detects vulnerabilities without prior knowledge. ABSynthe [13] and Osiris [46] automatically discover new side channels. This makes them complementary to Revizor, which discovers unknown leakage on a known side channel.

White-box testing was previously investigated as well. Zhang et al. [52] proposed to annotate Verilog with information-flow properties and thus enable property verification. Fadiheh et al. [10] proposed a SAT-based bounded model checker to find covert channels in RTL designs (in our terminology, they check RTL against ARCH-SEQ). CheckMate [38] searches for pre-defined vulnerability patterns in CPU designs. These tools are not applicable to testing of commercial black-box CPUs. Formal models for (parts of) ISAs were developed in several works [4, 9, 11], although models of microarchitectural aspects are only emerging. For instance, Coppelia [53] generates software exploits for CPU designs.

Random testing (fuzzing) was applied to traditional side channels. CT-Fuzz [17] detects information leaks in software by measuring hardware traces. DifFuzz [30] detects leaks by varying secret values and searching for a variation in the traces. Both use relational analysis, similar to MRT, but they do not consider speculation and focus on testing software. Nemati et al. [28] fuzzed a CPU against a model of side-channel leakage. Their approach is similar to MRT, but their leakage model does not encompass speculation and is applied to a simpler, in-order CPU (Cortex-A53).

Architectural Fuzzing. Several tools fuzz for violations of the ISA. RFuzz [22] is a tool for fuzzing on the RTL level. TestRIG [48] is a tool for random testing of RISC-V designs. Neither can detect microarchitectural information leakage.

Coverage. The architectural testing community has developed several coverage metrics: RFuzz [22] used a metric based on control multiplexers, and the Hardware Fuzzing Pipeline [39] used a metric based on HDL line coverage. These metrics are inapplicable to black-box testing as they require CPU introspection. Happens-before models [24] are loosely related to pattern coverage, as a dependency pattern is a happens-before pattern with data- or control-dependent nodes.

Speculative leaks in software. Several tools target the detection of speculative leaks in software [45, 17, 15, 31, 6, 43]. They all rely on (sometimes implicit) assumptions about the speculation mechanisms in hardware. Revizor gives a first principled foundation for validating such assumptions on black-box CPUs.

9 Conclusion

We presented Model-based Relational Testing (MRT), a technique to detect violations of speculation contracts in black-box CPUs. We implemented MRT in a framework called Revizor, and used it to test Intel CPUs against a wide range of contracts. Our experiments show that Revizor effectively finds contract violations without reporting false positives. The detected violations include known vulnerabilities such as Spectre, MDS, and LVI, as well as novel variants. This demonstrates that MRT is a promising approach for third-party assessment of microarchitectural security in black-box CPUs.

Our work opens several avenues for future research, such as white-box analysis of emerging CPUs and mechanisms for secure speculation, coverage, and targeted testing, for which the open-source release of Revizor will provide a solid foundation.

References

  • [1] A. Abel and J. Reineke (2019). uops.info: Characterizing latency, throughput, and port usage of instructions on Intel microarchitectures. In ASPLOS.
  • [2] A. Abel and J. Reineke (2020). nanoBench: A low-overhead tool for running microbenchmarks on x86 systems. In ISPASS.
  • [3] J. Alglave (2012). A formal hierarchy of weak memory models. Formal Methods in System Design.
  • [4] A. Armstrong, T. Bauereiss, B. Campbell, A. Reid, K. E. Gray, R. M. Norton, P. Mundkur, M. Wassell, J. French, C. Pulte, S. Flur, I. Stark, N. Krishnaswami, and P. Sewell (2019). ISA Semantics for ARMv8-A, RISC-V, and CHERI-MIPS. In POPL.
  • [5] C. Canella, D. Genkin, L. Giner, D. Gruss, M. Lipp, M. Minkin, D. Moghimi, F. Piessens, M. Schwarz, B. Sunar, J. Van Bulck, and Y. Yarom (2019). Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS.
  • [6] S. Cauligi, C. Disselkoen, K. v. Gleissenthall, D. Tullsen, D. Stefan, T. Rezk, and G. Barthe (2020). Constant-Time Foundations for the New Spectre Era. In PLDI.
  • [7] S. Cauligi, C. Disselkoen, D. Moghimi, G. Barthe, and D. Stefan (2021). SoK: Practical Foundations for Spectre Defenses. arXiv:2105.05801.
  • [8] M. R. Clarkson and F. B. Schneider (2010). Hyperproperties. Journal of Computer Security.
  • [9] U. Degenbaev (2012). Formal Specification of the x86 Instruction Set Architecture. Ph.D. thesis, Universität des Saarlandes.
  • [10] M. R. Fadiheh, D. Stoffel, C. W. Barrett, S. Mitra, and W. Kunz (2019). Processor hardware security vulnerabilities and their detection by unique program execution checking. In DATE.
  • [11] S. Goel, W. A. Hunt, and M. Kaufmann (2017). Engineering a Formal, Executable x86 ISA Simulator for Software Verification. In Provably Correct Systems.
  • [12] Google Project Zero (2018). Speculative Execution, Variant 4: Speculative Store Bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528. Accessed: May 2021.
  • [13] B. Gras, C. Giuffrida, M. Kurth, H. Bos, and K. Razavi (2020). ABSynthe: Automatic Blackbox Side-channel Synthesis on Commodity Microarchitectures. In NDSS.
  • [14] D. Gruss, R. Spreitzer, and S. Mangard (2015). Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches. In USENIX Security.
  • [15] M. Guarnieri, B. Köpf, J. F. Morales, J. Reineke, and A. Sanchez (2020). Spectector: Principled Detection of Speculative Information Flows. In S&P.
  • [16] M. Guarnieri, B. Köpf, J. Reineke, and P. Vila (2021). Hardware-Software Contracts for Secure Speculation. In S&P.
  • [17] S. He, M. Emmi, and G. Ciocarlie (2020). ct-fuzz: Fuzzing for Timing Leaks. In ICST.
  • [18] Intel Corporation (2019). Intel 64 and IA-32 Architectures Software Developer's Manual.
  • [19] V. Kiriansky and C. Waldspurger (2018). Speculative Buffer Overflows: Attacks and Defenses. arXiv:1807.03757.
  • [20] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom (2019). Spectre Attacks: Exploiting Speculative Execution. In S&P.
  • [21] E. M. Koruyeh, K. N. Khasawneh, C. Song, and N. Abu-Ghazaleh (2018). Spectre Returns! Speculation Attacks using the Return Stack Buffer. In WOOT.
  • [22] K. Laeufer, J. Koenig, D. Kim, J. Bachrach, and K. Sen (2018). RFUZZ: Coverage-Directed Fuzz Testing of RTL on FPGAs. In ICCAD.
  • [23] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg (2018). Meltdown: Reading Kernel Memory from User Space. In USENIX Security.
  • [24] D. Lustig, M. Pellauer, and M. Martonosi (2014). PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models. In MICRO.
  • [25] G. Maisuradze and C. Rossow (2018). ret2spec: Speculative Execution Using Return Stack Buffers. In CCS.
  • [26] D. Moghimi, M. Lipp, B. Sunar, and M. Schwarz (2020). Medusa: Microarchitectural Data Leakage via Automated Attack Synthesis. In USENIX Security.
  • [27] A. Naveh, E. Rotem, A. Mendelson, S. Gochman, R. Chabukswar, K. Krishnan, and A. Kumar (2006). Power and Thermal Management in the Intel Core Duo Processor. Intel Technology Journal.
  • [28] H. Nemati, P. Buiras, A. Lindner, R. Guanciale, and S. Jacobs (2020). Validation of Abstract Side-Channel Models for Computer Architectures. In CAV.
  • [29] H. Nemati, R. Guanciale, P. Buiras, and A. Lindner (2020). Speculative Leakage in ARM Cortex-A53. arXiv:2007.06865.
  • [30] S. Nilizadeh, Y. Noller, and C. S. Pasareanu (2019). DifFuzz: Differential Fuzzing for Side-Channel Analysis. In ICSE.
  • [31] O. Oleksenko, B. Trach, M. Silberstein, and C. Fetzer (2020). SpecFuzz: Bringing Spectre-type Vulnerabilities to the Surface. In USENIX Security.
  • [32] D. A. Osvik, A. Shamir, and E. Tromer (2006). Cache Attacks and Countermeasures: The Case of AES. In CT-RSA.
  • [33] N. A. Quynh and D. H. Vu (2015). Unicorn: Next Generation CPU Emulator Framework. In BlackHat USA.
  • [34] J. R. S. Vicarte, P. Shome, N. Nayak, C. Trippel, A. Morrison, D. Kohlbrenner, and C. W. Fletcher (2021). Opening Pandora's Box: A Systematic Study of New Ways Microarchitecture Can Leak Private Data. In ISCA.
  • [35] E. Rotem, E. Weissmann, B. Ginzburg, A. Naveh, N. Shulman, and R. Ronen (2019). Mechanism for saving and retrieving micro-architecture context. US Patent App. 16/259,880.
  • [36] A. Sabelfeld and A. C. Myers (2003). Language-based information-flow security. IEEE Journal on Selected Areas in Communications 21(1), pp. 5–19.
  • [37] M. Schwarz, M. Lipp, D. Moghimi, J. Van Bulck, J. Stecklina, T. Prescher, and D. Gruss (2019). ZombieLoad: Cross-Privilege-Boundary Data Sampling. In CCS.
  • [38] C. Trippel, D. Lustig, and M. Martonosi (2018). CheckMate: Automated Exploit Program Generation for Hardware Security Verification. In MICRO.
  • [39] T. Trippel, K. G. Shin, A. Chernyakhovsky, G. Kelly, D. Rizzo, and M. Hicks (2021). Fuzzing Hardware Like Software. arXiv:2102.02308.
  • [40] E. Tromer, D. A. Osvik, and A. Shamir (2010). Efficient Cache Attacks on AES, and Countermeasures. Journal of Cryptology.
  • [41] J. Van Bulck, D. Moghimi, M. Schwarz, M. Lipp, M. Minkin, D. Genkin, Y. Yarom, B. Sunar, D. Gruss, and F. Piessens (2020). LVI: Hijacking Transient Execution through Microarchitectural Load Value Injection. In S&P.
  • [42] S. van Schaik, A. Milburn, S. Österlund, P. Frigo, G. Maisuradze, K. Razavi, H. Bos, and C. Giuffrida (2019). RIDL: Rogue In-flight Data Load. In S&P.
  • [43] M. Vassena, K. v. Gleissenthall, R. G. Kici, D. Stefan, and R. Jhala (2020). Automatically Eliminating Speculative Leaks from Cryptographic Code with Blade. CoRR.
  • [44] G. Wang, S. Chattopadhyay, A. K. Biswas, T. Mitra, and A. Roychoudhury (2020). KLEESpectre: Detecting Information Leakage through Speculative Cache Attacks via Symbolic Execution. TOSEM.
  • [45] G. Wang, S. Chattopadhyay, I. Gotovchits, T. Mitra, and A. Roychoudhury (2019). oo7: Low-overhead Defense against Spectre Attacks. IEEE Transactions on Software Engineering.
  • [46] D. Weber, A. Ibrahim, H. Nemati, M. Schwarz, and C. Rossow (2021). Osiris: Automated Discovery of Microarchitectural Side Channels. In USENIX Security.
  • [47] O. Weisse, I. Neal, K. Loughlin, T. F. Wenisch, and B. Kasikci (2019). NDA: Preventing Speculative Execution Attacks at Their Source. In MICRO.
  • [48] J. Woodruff, A. Joannou, P. Rugg, H. Xia, J. Clarke, H. Almatary, P. Mundkur, R. Norton-Wright, B. Campbell, S. Moore, and P. Sewell (2018). TestRIG: Framework for Testing RISC-V Processors with Random Instruction Generation. https://github.com/CTSRD-CHERI/TestRIG. Accessed: May 2021.
  • [49] Y. Xiao, Y. Zhang, and R. Teodorescu (2020). SpeechMiner: A Framework for Investigating and Measuring Speculative Execution Vulnerabilities. In NDSS.
  • [50] Y. Yarom and K. Falkner (2014). Flush+Reload: A High Resolution, Low Noise, L3 Cache Side-Channel Attack. In USENIX Security.
  • [51] J. Yu, M. Yan, A. Khyzha, A. Morrison, J. Torrellas, and C. W. Fletcher (2019). Speculative Taint Tracking (STT): A Comprehensive Protection for Speculatively Accessed Data. In MICRO.
  • [52] D. Zhang, Y. Wang, G. E. Suh, and A. C. Myers (2015). A Hardware Design Language for Timing-Sensitive Information-Flow Security. In ASPLOS.
  • [53] R. Zhang, C. Deutschbein, P. Huang, and C. Sturton (2018). End-to-End Automated Exploit Generation for Validating the Security of Processor Designs. In MICRO.

Appendix A Complete Contract Specifications

This appendix details all contracts used in the paper. The concrete instructions included in each class depend on the target instruction set.

A.1 Mem-Seq

observation_clause: MEM
  - instructions: MemReads
    format: READ (ADDR), DEST
    observation: expose(ADDR)
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    observation: expose(ADDR)
execution_clause: SEQ
  None

A.2 Mem-Cond

observation_clause: MEM
  - instructions: MemReads
    format: READ (ADDR), DEST
    observation: expose(ADDR)
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    observation: expose(ADDR)
execution_clause: COND
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    execution: {   # branch misprediction
      if NOT CONDITION then
          IP := IP + DEST
      fi
    }

A.3 Ct-Seq

observation_clause: CT
  - instructions: MemReads
    format: READ (ADDR), DEST
    observation: expose(ADDR)
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    observation: expose(ADDR)
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    observation: {    # expose branch targets
      if CONDITION then
          expose(DEST)
      else
          expose(IP)
      fi
    }
execution_clause: SEQ
  None

A.4 Ct-Cond

observation_clause: CT
  - instructions: MemReads
    format: READ (ADDR), DEST
    observation: expose(ADDR)
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    observation: expose(ADDR)
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    observation: {
      if CONDITION then
          expose(DEST)
      else
          expose(IP)
      fi
    }
execution_clause: COND
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    execution: {
      if NOT CONDITION then
          IP := IP + DEST
      fi
    }

A.5 Ct-Cond-NonspeculativeStore

observation_clause: CT-NonspeculativeStore
  - instructions: MemReads
    format: READ (ADDR), DEST
    observation: expose(ADDR)
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    observation: {   # do not expose speculative stores
      if NOT IN_SPECULATION then
        expose(ADDR)
      fi
    }
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    observation: {
      if CONDITION then
          expose(DEST)
      else
          expose(IP)
      fi
    }
execution_clause: COND
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    execution: {
      if NOT CONDITION then
          IP := IP + DEST
      fi
    }

A.6 Ct-Bpas

observation_clause: CT
  - instructions: MemReads
    format: READ (ADDR), DEST
    observation: expose(ADDR)
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    observation: expose(ADDR)
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    observation: {
      if CONDITION then
          expose(DEST)
      else
          expose(IP)
      fi
    }
execution_clause: BPAS
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    execution:
      # speculatively skip stores
      IP := IP + instruction_size()

A.7 Ct-Cond-Bpas

observation_clause: CT
  - instructions: MemReads
    format: READ (ADDR), DEST
    observation: expose(ADDR)
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    observation: expose(ADDR)
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    observation: {
      if CONDITION then
          expose(DEST)
      else
          expose(IP)
      fi
    }
execution_clause: COND-BPAS
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    execution: {
      if NOT CONDITION then
          IP := IP + DEST
      fi
    }
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    execution:
      # speculatively skip stores
      IP := IP + instruction_size()

A.8 Ctr-Seq

observation_clause:
  - instructions: RegReads   # expose register values
    format: READ SRC, DEST
    observation: expose(SRC)
  - instructions: MemReads
    format: READ (ADDR), DEST
    observation: expose(ADDR)
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    observation: expose(ADDR)
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    observation: {
      if CONDITION then
          expose(DEST)
      else
          expose(IP)
      fi
    }
execution_clause:
  None

A.9 Arch-Seq

observation_clause:
  - instructions: RegReads
    format: READ SRC, DEST
    observation: expose(SRC)
  - instructions: MemReads
    format: READ (ADDR), DEST
    observation: {
      expose(ADDR)
      expose(read(ADDR))   # expose loaded values
    }
  - instructions: MemWrites
    format: WRITE SRC, (ADDR)
    observation: expose(ADDR)
  - instructions: CondBranches
    format: COND_BR CONDITION (DEST)
    observation: {
      if CONDITION then
          expose(DEST)
      else
          expose(IP)
      fi
    }
execution_clause:
  None

Appendix B Examples of Contract Violations

This appendix shows examples of contract counterexamples generated by Revizor. The test case in B.1 is shown as originally generated; the remaining listings are automatically minimized versions.

B.1 Instance of Spectre V1

LEA R14, [R14 + 12] # instrumentation
MFENCE # instrumentation
.test_case_main.entry:
JMP .bb0
.bb0:
CMOVNL ECX, ECX
AND RBX, 0b0111111000000 # instrumentation
ADD RBX, R14 # instrumentation
ADC dword ptr [RBX], -67100032
NOT RAX
JP .bb1 # < ------------------ mispredicted branch
JMP .test_case_main.exit
.bb1:
AND RBX, 1048197274
ADD AX, 5229
AND RCX, 0b0111111000000 # instrumentation
ADD RCX, R14 # instrumentation
ADC EAX, dword ptr [RCX] # < - speculative leak
{load} ADD RCX, RCX
.test_case_main.exit:
LEA R14, [R14 - 12] # instrumentation
MFENCE # instrumentation

B.2 Minimized Instance of Spectre V4

LEA R14, [R14 + 8] # instrumentation
MFENCE # instrumentation
.test_case_main.entry:
JMP .bb0
.bb0:
MOV RDX, 0 # instrumentation
OR BX, 0x1c # instrumentation
AND RAX, 0xff # instrumentation
CMOVNZ BX, BX
AND RBX, 0b0111111000000 # instrumentation
ADD RBX, R14 # instrumentation
XOR AX, AX
AND RBX, 0b0111111000000 # instrumentation
ADD RBX, R14 # instrumentation
SETNS byte ptr [RBX]  # < ------ delayed store
AND RAX, 0b0111111000000 # instrumentation
ADD RAX, R14 # instrumentation
MOVZX EDX, byte ptr [RAX]  # < - store bypass
AND RDX, 0b0111111000000 # instrumentation
ADD RDX, R14 # instrumentation
AND RCX, qword ptr [RDX]   # < - leakage
LEA R14, [R14 - 8] # instrumentation
MFENCE # instrumentation

B.3 Minimized Instance of MDS

LEA R14, [R14 + 48] # instrumentation
MFENCE # instrumentation
.test_case_main.entry:
JMP .bb0
.bb0:
ADC EAX, -2030331421
AND RAX, 0b1111111000000 # instrumentation
ADD RAX, R14 # instrumentation
MOV CX, word ptr [RAX]  # < ---- microcode assist
AND RCX, 0b1111111000000 # instrumentation
ADD RCX, R14 # instrumentation
SUB EAX, dword ptr [RCX]   # < - speculative leak
LEA R14, [R14 - 48] # instrumentation
MFENCE # instrumentation

B.4 Minimized Instance of LVI-NULL

LEA R14, [R14 + 8] # instrumentation
MFENCE # instrumentation
.test_case_main.entry:
JMP .bb0
.bb0:
SUB AX, 27095
AND RAX, 0b1111111000000 # instrumentation
ADD RAX, R14 # instrumentation
ADD BL, byte ptr [RAX]  # < -- zero injection
AND RBX, 0b1111111000000 # instrumentation
ADD RBX, R14 # instrumentation
OR dword ptr [RBX], -1193072838 # < - leak
LEA R14, [R14 - 8] # instrumentation
MFENCE # instrumentation

B.5 Minimized Instance of Spectre V1-Var

LEA R14, [R14 + 28] # instrumentation
MFENCE
.test_case_main.entry:
JMP .bb0
.bb0:
NOP
NOP
CDQ
SETZ CL
ADD EDX, 117
REX ADD BL, BL
SETNLE AL
SUB RBX, RBX
TEST AL, 29
MOV RDX, 0 # instrumentation
OR RBX, 0x6d # instrumentation
AND RAX, 0xff # instrumentation
DIV RBX  # < ---- variable-latency operation
{disp32} JNO .bb1
.bb1:
AND RCX, 0b111111000000 # instrumentation
ADD RCX, R14 # instrumentation
MOVZX EDX, byte ptr [RCX]
AND RAX, RAX
AND RAX, 0b111111000000 # instrumentation
ADD RAX, R14 # instrumentation
SBB qword ptr [RAX], 39412116
TEST ECX, ECX
AND RAX, 0b111111000000 # instrumentation
ADD RAX, R14 # instrumentation
MOV qword ptr [RAX], 81640764
REX NEG AL
CMC
OR RDX, 37323177
JNP .bb2  # < ------- mispredicted branch
JMP .test_case_main.exit
.bb2:
REX SBB AL, AL
SBB EAX, 74935583
AND RDX, 0b111111000000 # instrumentation
ADD RDX, R14 # instrumentation
CMOVS RDX, qword ptr [RDX]     # < - leak
AND RAX, 0b111111000000 # instrumentation
ADD RAX, R14 # instrumentation
MOV qword ptr [RAX], 23088010
AND RBX, 0b111111000000 # instrumentation
ADD RBX, R14 # instrumentation
LOCK AND word ptr [RBX], 5518 # < - leak
.test_case_main.exit:
LEA R14, [R14 - 28] # instrumentation
MFENCE # instrumentation

B.6 Minimized Instance of Speculative Store Eviction

LEA R14, [R14 + 32] # instrumentation
MFENCE # instrumentation
.test_case_main.entry:
JMP .bb0
.bb0:
AND RBX, 0b0111111000000 # instrumentation
ADD RBX, R14 # instrumentation
LOCK AND dword ptr [RBX], EAX
JZ .bb1 # < ---------------- mispredicted branch
JMP .test_case_main.exit
.bb1:
AND RAX, 0b0111111000000 # instrumentation
ADD RAX, R14 # instrumentation
MOV qword ptr [RAX], 3935 # < - speculative leak
.test_case_main.exit:
LEA R14, [R14 - 32] # instrumentation
MFENCE # instrumentation

B.7 Minimized Arch-Seq Violation

LEA R14, [R14 + 60] # instrumentation
MFENCE # instrumentation
.test_case_main.entry:
JMP .bb0
.bb0:
{load} REX MOV DL, DL
MOVSX RBX, BX
{store} AND BX, BX
ADD EAX, 11839320
AND RDX, 0b0111111000000 # instrumentation
ADD RDX, R14 # instrumentation
SETS byte ptr [RDX]
TEST AX, 6450
AND RDX, 0b0111111000000 # instrumentation
ADD RDX, R14 # instrumentation
SUB word ptr [RDX], CX
{disp32} JB .bb1   # < ----- branch misprediction
JMP .test_case_main.exit
.bb1:
SUB AX, -29883
AND RAX, 0b0111111000000 # instrumentation
ADD RAX, R14 # instrumentation
ADD EAX, dword ptr [RAX] # < -- speculative load
AND RAX, 0b0111111000000 # instrumentation
ADD RAX, R14 # instrumentation
CMOVLE RBX, qword ptr [RAX] # < - speculative leak
.test_case_main.exit:
LEA R14, [R14 - 60] # instrumentation
MFENCE # instrumentation

3 Challenges of Testing Contract Compliance

In this work, we leverage contracts to check compliance of complex commercial CPUs under realistic threat models. Assuming that a contract properly exposes the expected information leakage in a CPU, finding a counterexample would signify an unexpected, hence potentially exploitable, leakage.

While the original paper [16] proved compliance for an abstract CPU with a toy assembly language, testing compliance of a real hardware CPU with a complex ISA poses significant challenges.

3.1 How to find a counterexample?

The search space for counterexamples comprises all possible programs, all possible inputs, and all microarchitectural contexts. Such an immense search space cannot be explored exhaustively, thus requiring a targeted search.

CH1: Binary Generation. While a contract prescribes which instructions are permitted to speculate and expose information, we search for unexpected speculation and leakage; thus, we need to collect traces that encompass all the instructions. Furthermore, a particular sequence of instructions is usually required to produce an observable leakage, so we need to test different instruction sequences. Moreover, to trigger incorrect speculation (e.g., a branch misprediction), we need to prime the microarchitectural state in diverse ways. All of this calls for a search strategy that tests diverse instruction sequences with diverse inputs, prioritizing those that are likely to leak or to produce speculation.

CH2: Input Generation. For an input to be useful in forming a counterexample, we need another input that produces the same contract trace. Such inputs are called effective inputs. Ineffective inputs, which produce a unique contract trace, constitute wasted effort, as they cannot, by definition, reveal a contract violation. This challenge calls for a more structured input generation approach rather than a simple random one, as the probability that multiple random inputs will produce the same contract trace is low.

3.2 How to get stable hardware traces on a real CPU?

CH3: Collection of Hardware Traces. CPUs have no direct interface to record information leaked in hardware traces, such as addresses accessed in a speculative path. Thus, we have to perform indirect sampling-based measurements, which are inevitably imprecise and incomplete.

CH4: Uncontrolled Microarchitectural State. Black-box CPUs normally have no direct way to set the microarchitectural context for test execution as required by Def. 1. For example, branch predictors are not accessible architecturally, and some are not even disclosed. Moreover, speculation depends on multiple, often unknown factors, such as fine-grained power saving [27, 35] or contention on shared resources. Thus, speculation can happen nondeterministically and cause divergent traces without a real information leak (false positive). On the other hand, if speculation is never triggered during the measurement, speculative leaks cannot be observed, leading to false compliance (false negative).

CH5: Noisy Measurements. The measurements are influenced by neighboring processes on the system, by hardware mechanisms (e.g., prefetching), and by the inherent imprecision of the measurement tools (e.g., timing measurements). This challenge differs from CH4 as it affects the measurement precision rather than the program execution. The noise may result in divergence between otherwise equivalent traces, leading to a false positive.

3.3 How to produce contract traces?

CH6: Collection of Contract Traces. All contracts in [16] are defined for a toy assembly language; it is unclear how to collect traces for a contract describing a complex ISA. To allow a realistic compliance check, we need to work with real binaries generated via a standard compiler toolchain. Hence, we need a method to automatically collect contract-prescribed observations for a given program executed with a given input.

4 Model-based Relational Testing

Figure 1: Main flow of Model-based Relational Testing.

We present Model-based Relational Testing (MRT), our approach to identifying contract violations in black-box CPUs. Here we provide a high-level description, with the technical details to follow (5). Figure 1 shows the main steps.

Test case and input generation. We sample the search space of programs, inputs, and microarchitectural states to find counterexamples. The generated instruction sequences (test cases) consist of instructions from the ISA subset described by the contract. The test cases and their respective inputs are generated to achieve high diversity and to increase the potential for speculation or leakage (5.1 and 5.2).

Diversity-guided generation. The testing process is performed in rounds, where earlier rounds exercise a smaller search space (i.e., shorter instruction sequences, fewer basic blocks) to speed up testing. After each round that did not yield a counterexample, we invoke a test case diversity analysis, which may trigger reconfiguration of the test generator to produce richer test cases, gradually expanding the search space (5.6).

Collecting contract traces. We implement an executable Model of the contract to allow automatic collection of contract traces for standard binaries. For this, we modify a functional CPU emulator to implement speculative control flow based on a contract’s execution clause, and to record traces based on its observation clause (5.4).

Collecting hardware traces. We collect hardware traces by executing the test case on the CPU under test and measuring the observable microarchitectural state changes during the execution according to the threat model. The executor employs several methods to achieve consistent and repeatable measurements (5.3).

Relational Analysis. We analyze the contract and hardware traces to identify potential violations of Def. 1. This requires relational reasoning:

  1. We partition inputs into groups, which we call input classes. All inputs within a class have the same contract trace. Thus, input classes correspond to the equivalence classes of equality on contract traces. Classes with a single (ineffective) input are discarded.

  2. For each class, we check if all inputs within a class have the same hardware trace.

If the check fails on any of the classes, we have found a counterexample that witnesses a contract violation (5.5).
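For concreteness, the following Python sketch illustrates this relational check (a minimal sketch, not the actual Revizor implementation; contract_trace and hardware_trace are hypothetical stand-ins for the model and the executor):

from collections import defaultdict

def find_counterexample(test_case, inputs, contract_trace, hardware_trace):
    # Step 1: partition inputs into input classes by contract trace.
    classes = defaultdict(list)
    for inp in inputs:
        classes[contract_trace(test_case, inp)].append(inp)
    # Step 2: within each effective class, hardware traces must agree.
    for members in classes.values():
        if len(members) < 2:
            continue  # ineffective input: discard the class
        htraces = [hardware_trace(test_case, inp) for inp in members]
        if any(h != htraces[0] for h in htraces):
            return members  # counterexample: witnesses a violation
    return None

(5.5 later relaxes the equality check on hardware traces to a subset relation.)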

5 Design and Implementation

We build Revizor, a tool that implements MRT for practical end-to-end testing of x86 CPUs against speculation contracts. We describe the individual components of Revizor and how they address the challenges outlined in 3.

5.1 Test Case Generator

The task of the test case generator is to sample the search space of all possible programs. As described in CH1, the sampling should be diverse, so that we have a chance to observe unexpected leakage or speculation. Fully random generation, however, might produce incorrect programs, e.g., with invalid control flow or memory accesses, leading to unhandled exceptions during their execution. This is why we rely on a randomized generation algorithm that imposes a certain structure on the generated instruction sequence and its memory accesses. It works as follows:

  1. Generate a random Directed Acyclic Graph (DAG) of basic blocks;

  2. Add jump instructions (terminators) at the end of each basic block to ensure that the control flow matches the DAG, thereby confining the control flow;

  3. Add random instructions from the ISA subset selected for the test;

  4. Instrument instructions to avoid faults:

    1. mask memory addresses to confine them within a dedicated memory region, which we call sandbox;

    2. modify division operands to avoid division by zero;

  5. Compile the test case into a binary.

The total number of instructions, functions, and basic blocks per test, as well as the tested instruction (sub)set are specified by the user. We borrow the ISA description from nanoBench [2].

1  OR RAX, 468722461
2  AND RAX, 0b111111000000
3  LOCK SUB byte ptr [R14 + RAX], 35
4  JNS .bb1
5  JMP .bb2
6  .bb1: AND RCX, 0b111111000000
7  REX SUB byte ptr [R14 + RCX], AL
8  CMOVNBE EBX, EBX
9  OR DX, 30415
10 JMP .bb2
11 .bb2: AND RBX, 1276527841
12 AND RDX, 0b111111000000
13 CMOVBE RCX, qword ptr [R14 + RDX]
14 CMP BX, AX

Figure: Randomly generated test case.

Example: The figure above shows a test case example, produced in multiple steps: (1) the generator created a DAG with three nodes; (2) it connected the nodes by placing either conditional or direct jumps (lines 4-5, 10); (3) it added random instructions until a specified size was reached (lines 1, 3, 7-9, 13, 14); (4) it masked the memory accesses and aligned them to the sandbox base in R14 (lines 2, 6, 12); (5) it compiled the test case into a binary.

Improving input effectiveness. Using many hardware registers and a large sandbox results in low input effectiveness (CH2), as it increases the likelihood of unique contract traces that cannot be used for relational testing. To improve input effectiveness, the generator produces programs with only four registers, confines the memory sandbox to one or two 4K memory pages, and aligns memory accesses to a cache line (64B). To test different alignments, the accesses are further offset by a random value between 0 and 64 (the same within a test case but different across test cases).
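For illustration, a minimal sketch of the architectural effect of this masking instrumentation (the constant and function names are ours, not Revizor's):

LINE_MASK = 0b111111000000  # keep cache-line-aligned offsets within the sandbox

def masked_address(reg_value, sandbox_base, tc_offset):
    # Effect of the AND+ADD instrumentation pair: confine the access to
    # the sandbox (base in R14), cache-line aligned, plus a per-test-case
    # offset in [0, 64) used to test different alignments.
    return (reg_value & LINE_MASK) + sandbox_base + tc_offset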

5.2 Input Generator

An input is a set of values to initialize the architectural state, which includes registers (including FLAGS) and the memory sandbox. The values are random 32-bit numbers which we generate using a PRNG.

Improving input effectiveness. Higher entropy of the PRNG leads to lower input effectiveness (CH2), because the probability of finding colliding contract traces decreases. We performed experiments in which we tuned the entropy of the PRNG to maximize the number of contract classes covered by at least two inputs. We expect that more sophisticated techniques for creating inputs, e.g., based on symbolic execution, can further improve effectiveness.
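A minimal sketch of such an entropy-limited input generator (all parameter names and sizes are hypothetical; the real generator may differ):

import random

def generate_input(n_regs=7, sandbox_words=1024, entropy_bits=16, seed=0):
    # Draw values from a reduced range (entropy_bits < 32) so that
    # distinct inputs are more likely to collide on contract traces.
    prng = random.Random(seed)
    registers = [prng.getrandbits(entropy_bits) for _ in range(n_regs)]
    memory = [prng.getrandbits(entropy_bits) for _ in range(sandbox_words)]
    return registers, memory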

5.3 Executor

The executor has three goals: (1) collect hardware traces when executing test cases on the CPU (CH3), (2) set the microarchitectural context for the execution (CH4), and (3) eliminate measurement noise (CH5).

Collecting hardware traces. To collect traces we employ methods used by side-channel attacks, but in a fully controlled environment. This allows us to record hardware traces corresponding to the measurements of a powerful worst-case attacker, and spot all consistently-observed leaks via the microarchitectural state. The process involves the following steps:

  1. Load the test case into a dedicated region of memory,

  2. Set memory and registers according to the inputs,

  3. Prepare the side-channel (e.g., prime cache lines),

  4. Invoke the test case,

  5. Measure the microarchitectural changes (e.g., probe cache lines) via the side-channel, thus producing a trace.

The measurement (steps 2-5) is repeated for all inputs, thus producing a hardware trace for each pair of test case and input.

Our implementation supports several measurement modes:

  • Prime+Probe [32], Flush+Reload [50], and Evict+Reload [14] modes use the corresponding attack on the L1D cache.

  • In *+Assist mode, the executor additionally triggers microcode assists: it clears the "Accessed" bit in one of the accessible pages such that the first store or load to this page triggers an assist (this is possible in attacks on SGX enclaves, e.g., LVI [41]).

Example: The hardware trace corresponding to running the executor in L1D Prime+Probe mode is a sequence of bits, each representing whether a specific cache set was accessed by the test case. E.g., the following trace indicates observed memory accesses to sets 0, 4, and 5: 10001100000000000000000000000000
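For illustration, a small helper that decodes such a trace (a sketch, assuming the leftmost bit corresponds to cache set 0, as in the example above):

def accessed_sets(trace_bits: str):
    # Bit k (left to right) is 1 iff L1D cache set k was accessed.
    return [k for k, bit in enumerate(trace_bits) if bit == "1"]

assert accessed_sets("10001100000000000000000000000000") == [0, 4, 5]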

Setting the microarchitectural context. We cannot directly control the microarchitectural context before the test execution (CH4). To deal with this, we developed a technique called priming. Priming collects traces for a large number of pseudorandom inputs (5.2) to the same test case, executed in a sequence. In this way, execution with one input effectively sets the microarchitectural context for the next input. This enables the collection of hardware traces with predictors primed in a diverse but deterministic fashion, which is key to obtaining traces that are stable enough for equality checks.

However, priming may result in undesirable artifacts. To see this, recall that MRT searches for two inputs i1 and i2 from the same input class, but with divergent hardware traces. Due to priming, however, a divergence of hardware traces can also be caused by differences in the microarchitectural contexts c1 and c2 in which the two inputs were executed. For example, earlier inputs can train branch predictors in a way that would prevent speculation for the later inputs. To filter out such cases and verify that the divergence is caused by the difference in inputs and not by the difference in contexts, we swap i1 and i2 in the priming sequence, which enables us to test i1 with the context c2 and vice versa. If the traces match after the swap (i.e., each input reproduces the other's trace in the other's context), we discard the divergence as a measurement artifact; otherwise, we report a contract violation.

Example: Consider two inputs from the same input class but with different hardware traces, and assume that, in the original sequence of inputs, the first was at position 100 and the second at position 200. For priming, the executor re-runs the sequence with the two inputs swapped. It considers the divergence a false positive if the first input, now executed at position 200, produces the same trace as the second input originally produced at position 200, and vice versa.
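A minimal sketch of this check (run is a hypothetical stand-in for the executor measuring a whole priming sequence and returning one trace per position):

def is_priming_artifact(seq, pos_a, pos_b, run):
    original = run(seq)            # traces in the original order
    swapped_seq = list(seq)
    swapped_seq[pos_a], swapped_seq[pos_b] = swapped_seq[pos_b], swapped_seq[pos_a]
    swapped = run(swapped_seq)     # traces after swapping the two inputs
    # Divergence is an artifact if each input reproduces the other's
    # trace when executed in the other's microarchitectural context.
    return (swapped[pos_b] == original[pos_b] and
            swapped[pos_a] == original[pos_a])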

Eliminating measurement noise. Hardware traces in the same input class may also diverge (and thus be incorrectly considered a contract violation) due to several additional sources of inconsistency, which we eliminate as follows:

  1. Eliminating measurement noise (CH5). The executor uses performance counters for the cache attacks, reading the L1D miss counter before and after probing a cache line. This proved to give more stable results than timing readings.

  2. Eliminating external software noise (CH5). We run the executor as a kernel module (based on nanoBench [2]). A test is executed on a single core, with hyperthreading, prefetching, and interrupts disabled. The executor also monitors System Management Interrupts (SMIs) [18] and discards measurements polluted by an SMI.

  3. Reducing nondeterminism (CH4). We repeat each measurement (50 times in our experiments) after several rounds of warm-up, and discard one-off traces as likely caused by noise. We then take the union of all traces collected from executions of a test case with the same input, which encompasses all consistently observed variants of speculative behavior under different microarchitectural contexts.

Example: Consider again the test case in 5.1. If the branch in line 4 is speculated differently across the runs, one input may produce different traces:

00001010000001000000000000000001

00001000000001000000000000000001

The first trace is with a misprediction (it additionally includes cache set 6), the second without. The merged trace is: 00001010000001000000000000000001
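A sketch of this merging step, assuming traces are represented as 32-bit strings as in the examples above:

from collections import Counter

def merge_traces(traces, min_count=2):
    # Discard one-off traces as likely noise, then take the union
    # (bitwise OR) of the remaining traces.
    counts = Counter(traces)
    merged = 0
    for trace, seen in counts.items():
        if seen >= min_count:
            merged |= int(trace, 2)
    return format(merged, "032b")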

Discarding all outliers observed only once during a test might miss rare cases that reveal real leaks. However, we found it necessary from a practical perspective: each reported violation requires manual investigation, and since the outliers turned out to be notoriously hard to reproduce and verify, we opted to focus on the leaks that are easier to distinguish from noise.

5.4 Model

The model's task is to automate the collection of contract traces (CH6). We achieve this by executing test cases on an ISA-level emulator modified according to the contract: the emulator implements the contract's execution clause (e.g., exploring all speculative execution paths, followed by a rollback), and it collects observations based on the observation clause. The resulting trace is a list of observations collected while executing a test case with a single input. We base our implementation on Unicorn [33], a customizable x86 emulator, which we modified to implement the clauses listed in 2.3.

Observation Clauses. When the emulator executes an instruction listed in the observation clause, it records its exposed information into the trace. This happens during both normal and speculative execution, unless the contract states otherwise.

Example: Consider the test case in 5.1 and the contract MEM-SEQ. As prescribed by the contract, the model records the accessed addresses when executing lines 3, 7, and 13. Suppose the branch (line 4) was not taken, the store (line 3) accessed 0x100, and the load (line 13) accessed 0x340. Then the contract trace is ctrace = [0x100, 0x340].

Execution Clauses are implemented similarly to the speculation exposure mechanism introduced in SpecFuzz [31]: Upon encountering an instruction with a non-empty execution clause (e.g., a branch in MEM-COND), the emulator takes a checkpoint. The emulator then simulates speculation as described by the clause until (1) the test case ends, (2) the first serializing instruction is encountered, or (3) the maximum possible speculation depth is reached. Then, it rolls back and continues normal execution.

As multiple mispredictions may happen together, the emulator supports nested speculation through a stack of checkpoints: When starting a simulation, the checkpoint is pushed, and afterwards, the emulator rolls back to the topmost checkpoint.

Practically, however, nested speculation greatly reduces the testing speed, which is why we disable nesting by default. This artificially reduces the amount of leakage permitted by the contract, potentially causing false violations (since hardware traces would still include the effects of nested speculation). To identify such false violations, Revizor re-executes all reported violations with nesting enabled.
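A minimal sketch of the checkpoint-stack mechanism (the state API is entirely hypothetical; the actual implementation hooks into Unicorn):

class SpeculationSimulator:
    def __init__(self, max_depth=250):  # max_depth is an assumed bound
        self.checkpoints = []           # stack enables nested speculation
        self.max_depth = max_depth

    def start_speculation(self, state, mispredicted_target):
        self.checkpoints.append(state.snapshot())
        state.jump(mispredicted_target)  # follow the simulated misprediction

    def should_rollback(self, state, steps_in_speculation):
        return (state.at_serializing_instruction()
                or state.test_case_ended()
                or steps_in_speculation >= self.max_depth)

    def rollback(self, state):
        state.restore(self.checkpoints.pop())  # return to topmost checkpoint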

5.5 Analyzer

The analyzer compares traces using relational analysis (4). As hardware traces are obtained as the union of observations collected for the same input in different microarchitectural contexts (5.3), we relax the requirement of hardware-trace equality to a subset relation. Specifically, we consider two traces equivalent if every observation included in one trace is also included in the other trace.

The intuition behind this heuristic is as follows. If a mismatch is caused by inconsistent execution of a speculative path across inputs, one of the inputs executed fewer instructions; therefore, fewer observations appear in its trace, but those that do appear match. In contrast, if the mismatch is caused by a secret-dependent instruction, the traces contain the same number of observations, but their values differ. To validate this intuition, we manually examined multiple such examples and did not observe any real violations.
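Expressed as code, this relaxed comparison amounts to a containment check between traces viewed as sets of observations (a sketch):

def equivalent(htrace_a: set, htrace_b: set) -> bool:
    # Two union-merged hardware traces are considered equivalent if one
    # contains every observation of the other.
    return htrace_a <= htrace_b or htrace_b <= htrace_a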

5.6 Test Diversity Analysis

If a testing round did not detect any violation, we need to decide how to improve the chances of finding one in the next round. As we test black-box CPUs, we cannot measure coverage of the exercised CPU features to guide the test generation in the next round.

Instead, we seek to estimate the likelihood of exercising new speculative paths with the current configuration of the test case generator by analyzing the diversity of the tests run so far (CH1). We capture test diversity with a new metric called pattern coverage, which counts data and control dependencies that are likely to cause pipeline hazards. We expect higher pattern coverage to correlate with higher chances of surfacing speculative leaks. Therefore, if a testing round does not improve the pattern coverage of the tests so far, new speculative paths are unlikely to be explored, and Revizor increases the number of instructions and basic blocks per test to facilitate the generation of more diverse tests. We now discuss this approach in more detail.

Patterns of instructions. We define patterns in terms of instruction pairs. To simplify the counting of pattern coverage we require that the instructions are consecutive, which corresponds to the worst case for creating hazards. We distinguish three types:

  1. A memory dependency pattern is two memory accesses to the same address. We consider 4 patterns: store-after-store, store-after-load, load-after-store, load-after-load.

  2. A register dependency pattern is when one instruction uses a result of another instruction. We consider 2 patterns: dependency over a general-purpose register, and over FLAGS.

  3. A control dependency pattern is an instruction that modifies the control flow followed by any other instruction. In this paper we consider 2 patterns: conditional and unconditional jumps. Larger instruction sets may include indirect jumps, calls, returns, etc.

We say that a program with an input matches a pattern if that pattern is found in two consecutive instructions of the corresponding instruction stream. Since a single input cannot form a counterexample, a pattern is covered only if a program and two inputs in the same input class match the pattern.
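A sketch of counting pattern matches over an instruction stream (the instruction attributes are hypothetical; Revizor's internal representation may differ):

def matched_patterns(instruction_stream):
    patterns = set()
    for prev, curr in zip(instruction_stream, instruction_stream[1:]):
        # 1. Memory dependency: two consecutive accesses to the same address.
        if prev.mem_kind and curr.mem_kind and prev.mem_address == curr.mem_address:
            patterns.add(f"{curr.mem_kind}-after-{prev.mem_kind}")
        # 2. Register dependency: curr reads a result of prev.
        shared = set(prev.dest_regs) & set(curr.src_regs)
        if "FLAGS" in shared:
            patterns.add("flags-dependency")
        if shared - {"FLAGS"}:
            patterns.add("register-dependency")
        # 3. Control dependency: prev modifies the control flow.
        if prev.is_cond_branch:
            patterns.add("conditional-jump")
        elif prev.is_uncond_branch:
            patterns.add("unconditional-jump")
    return patterns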

Target 1 Target 2 Target 3 Target 4 Target 5 Target 6 Target 7 Target 8
CPU Skylake Skylake Coffee Lake
V4 patch off on on
Instruction Set AR AR+MEM AR+MEM+VAR AR+MEM+VAR AR+MEM+CB AR+MEM+CB+VAR AR+MEM
Executor Mode Prime+Probe Prime+Probe+Assist
Table 2: Description of the experimental setups.
Target 1 Target 2 Target 3 Target 4 Target 5 Target 6 Target 7 Target 8
CT-SEQ ✓ (V4) ✓ (V4) ✓ (V1) ✓ (V1) ✓ (MDS) ✓ (LVI-Null)
CT-BPAS ✓ (V4-var