Systematic Prevention of On-Core Timing Channels by Full Temporal Partitioning

02/24/2022
by   Nils Wistoff, et al.

Microarchitectural timing channels enable unwanted information flow across security boundaries, violating fundamental security assumptions. They leverage timing variations of several state-holding microarchitectural components and have been demonstrated across instruction set architectures and hardware implementations. Analogously to memory protection, Ge et al. have proposed time protection for preventing information leakage via timing channels. They also showed that time protection calls for hardware support. This work leverages the open and extensible RISC-V instruction set architecture (ISA) to introduce the temporal fence instruction fence.t, which provides the required mechanisms by clearing vulnerable microarchitectural state and guaranteeing a history-independent context-switch latency. We propose and discuss three different implementations of fence.t and implement them on an experimental version of the seL4 microkernel and CVA6, an open-source, in-order, application class, 64-bit RISC-V core. We find that a complete, systematic, ISA-supported erasure of all non-architectural core components is the most effective implementation while featuring a low implementation effort, a minimal performance overhead of approximately 2

Background

Timing Channels

Information channels that let isolated applications communicate but are not intended for information transfer are called covert channels [Lampson_73]. Microarchitectural timing channels are covert channels that leverage the timing behaviour of microarchitectural components to transfer information. They may generally occur whenever multiple applications compete for shared hardware resources [Ge2018Survey]. Well-known examples of exploitable hardware resources are data caches, where the latency of a memory access varies depending on how the cache was previously used [Hu_92, percival2005cache]. Attacks have also been demonstrated on further components such as instruction caches [Aciiccmez2007ICache], branch predictors [Aciiccmez2007BP], and translation lookaside buffers (TLBs) [Gras2018TLB].

Threat Model

We examine covert-channel leakage under a confinement scenario [Lampson_73]: An untrusted program possesses a secret, and the operating system (OS) encapsulates the program’s execution in a security domain that only allows communication across defined channels to trusted components (e.g., an encryption service). The untrusted program contains a Trojan that is actively trying to leak the secret via a covert channel. Note that a Trojan need not hide in malicious code; it can also be constructed by control-flow hijacking of innocent code through exploiting bugs or speculatively executed gadgets, as in a Spectre attack [Kocher2018spectre]. A second, unconfined, and also untrusted security domain contains a spy which is trying to read the secret leaked by the Trojan. This setup is illustrated in the figure below.

Figure: Threat model: a Trojan actively leaking data via shared hardware resources.

The intentional leakage by the Trojan represents the worst case. If we can prevent this attack, we preclude any other leakage using the same mechanism, including side channels, where leakage originates from an unwitting victim rather than a Trojan. We assume that the Trojan and spy time-share one processor core, so cross-core leakage is out of the scope of this paper. We only consider microarchitectural timing channels. Covert channels that abuse other characteristics, such as power draw, are not covered in this work.

Time Protection

Time protection is a principled approach to preventing timing channels [Ge2019a]. While the established notion of memory protection prevents interference between security domains through unauthorised memory accesses, time protection aims to prevent interference that affects observable timing behaviour. Time protection requires that all shared hardware resources, including non-architectural ones, be partitioned between security domains, either temporally (secure time multiplexing) or spatially. Ge et al. show that (physically-addressed) off-core caches can be effectively partitioned through cache colouring [kessler1992colouring], which leverages the associative cache lookup to force different partitions into disjoint subsets of the cache. They demonstrate that colouring is effective in preventing cache channels in both intra-core and cross-core attacks and comes with low overhead. Spatial partitioning is generally impractical for on-core resources. For performance reasons, on-core resources are limited and are designed to be well utilised by a single program, so partitioning approaches usually result in unacceptable performance degradation. Furthermore, on-core resources are generally indexed by virtual addresses, which cannot be coloured by the OS. This leaves temporal partitioning as the only viable approach for on-core resources.
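As a rough illustration of cache colouring, the sketch below computes how many page colours a physically indexed, set-associative cache offers and which colour a physical address falls into. The function names, the 4 KiB page size, and the 8-way associativity in the example are assumptions made for illustration; they are not taken from the system described here.

```python
def num_colours(cache_bytes: int, ways: int, page_bytes: int = 4096) -> int:
    """Number of page colours of a physically indexed, set-associative cache.

    One way of the cache covers cache_bytes / ways bytes of the address
    space; every page-sized slice of that range maps to a disjoint group
    of cache sets, i.e. one colour.
    """
    way_bytes = cache_bytes // ways
    return max(1, way_bytes // page_bytes)


def page_colour(paddr: int, cache_bytes: int, ways: int,
                page_bytes: int = 4096) -> int:
    """Colour of the physical page that contains paddr."""
    page_number = paddr // page_bytes
    return page_number % num_colours(cache_bytes, ways, page_bytes)


# Hypothetical geometry: a 512 KiB, 8-way cache with 4 KiB pages.
EXAMPLE_CACHE_BYTES = 512 * 1024
EXAMPLE_WAYS = 8
```

An OS that hands out physical pages of disjoint colours to different security domains keeps those domains in disjoint cache sets, which is the spatial partitioning described above.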

Security Requirements

Based on the findings of Ge et al. [Ge2019a], we propose two security requirements for on-core time protection by temporal partitioning. First, before handing a resource to a different domain, it must be reset to a state that is independent of execution history. Therefore, the OS must have the means to reset all microarchitectural state, which in practical terms requires an extension to the hardware-software contract to refer (in an abstract way) to such non-architectural state. Ge et al. specifically show that contemporary Intel and Arm processors lack the mechanisms required for implementing time protection [Ge2018]. For temporal partitioning, hardware must provide a (set of) mechanism(s) that allow a reset of all non-architectural state that depends on previous execution and may impact future timing. A second requirement for time protection is a secret-independent context switch latency. This requirement is particularly (but not exclusively) relevant for processors featuring a write-back L1 cache: before we can reset this component, we need to write back all dirty cache lines. The number of dirty cache lines, and hence the latency of said reset and the whole context switch routine, directly depends on previous execution and can be probed to transfer information. The latency of the context switch routine, including the reset of the non-architectural state mentioned above, needs to be independent of the previous execution.

Implementing On-Core Time Protection

In the following, we present different approaches to implementing time protection for shared on-core hardware resources. We use the RISC-V architecture for our analysis because of its openness, extensibility, and the availability of modifiable open-source implementations such as CVA6 (see Hardware Platform). The Baseline Architecture section describes a baseline system where time protection is implemented using existing resources on unmodified hardware without additional architectural features. In the Temporal Fence Instruction section, we propose an ISA extension that provides software with the necessary means to partition shared on-core hardware resources temporally (the first requirement of on-core time protection), and we discuss three different implementation approaches. Finally, the Time Padding section addresses the second requirement of time protection, enabling a context-switch latency that does not depend on execution history.

Baseline Architecture

Ge et al. [Ge2019a] report that neither the x86 nor the Arm architecture provides sufficient mechanisms for implementing time protection. Arm provides targeted L1 cache flushes but no mechanism for flushing other microarchitectural state. The x86 architecture provides branch control mechanisms for clearing the state of the branch predictors [Intel2018IBC]. For partitioning the L1 cache on this architecture, the authors implemented software flushing by touching all cache lines, similar to the prime phase of the prime-and-probe attack. Such an approach is expensive and brittle, as it must make assumptions about the replacement policy which may not hold in reality. Unsurprisingly, they find that this defence is incomplete, leaving residual channels that the OS is unable to close. With RISC-V, the situation is presently worse, as the specification of cache management is still under discussion. While implementations generally support some cache management, this is not yet standardised. To explore this aspect, we implement a “software only” defence (in the following referred to as Software, SW), where the OS uses only mechanisms defined in the ISA as presently specified. This forces the OS to resort to the priming approach in an attempt to erase any microarchitectural state left by the Trojan’s execution.

Temporal Fence Instruction

As we show in the Security Analysis, this “software only” defence is insufficient since some microarchitectural timing channels remain. Moreover, it comes at a great performance overhead. Therefore, we propose the temporal fence instruction, fence.t, to let an OS access the hardware mechanisms required for time protection. We note that other semantics to realise the temporal fence are conceivable as well: for instance, a combination of multiple instructions and registers could be used. For our proof of concept, we stick to a single instruction that both triggers the reset of vulnerable microarchitectural state (the first requirement) and guarantees a history-independent context switch latency (the second requirement). Except for increased cycle and instruction counters, fence.t has no architectural effects. For evaluation purposes, we encode fence.t as a U-type instruction with the opcode custom-0. fence.t optionally takes a 20-bit immediate value, which is a bitmap selecting the components that should be reset. In the following, we present three different implementation strategies for fence.t.
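To make the encoding concrete, the following sketch assembles a 32-bit instruction word in the RISC-V U-type layout (imm[31:12], rd[11:7], opcode[6:0]) using the custom-0 major opcode. The assignment of bitmap bits to components and the choice of rd = 0 are assumptions for illustration only; the text does not specify them.

```python
CUSTOM_0_OPCODE = 0b0001011  # RISC-V custom-0 major opcode (bits 6:0)

# Hypothetical bit positions in the 20-bit component bitmap.
FLUSH_L1D = 1 << 0
FLUSH_L1I = 1 << 1
FLUSH_TLB = 1 << 2
FLUSH_BP = 1 << 3


def encode_fence_t(component_mask: int, rd: int = 0) -> int:
    """Build a U-type instruction word: imm[31:12] | rd[11:7] | opcode[6:0].

    component_mask is the 20-bit immediate selecting components to reset.
    rd = 0 is an assumption; the instruction has no architectural result.
    """
    if not 0 <= component_mask < (1 << 20):
        raise ValueError("component mask must fit in 20 bits")
    if not 0 <= rd < 32:
        raise ValueError("rd must be a register index in 0..31")
    return (component_mask << 12) | (rd << 7) | CUSTOM_0_OPCODE


# Example: select the (hypothetical) L1D and TLB bits.
word = encode_fence_t(FLUSH_L1D | FLUSH_TLB)
```

Such a word could be emitted from assembly with a `.word` directive until an assembler knows the mnemonic.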

Basic Flush

In the first version of fence.t, we exclusively flush the principal components used by the prime-and-probe attack in the Security Analysis: the L1 data and instruction caches, the TLBs, and the branch predictors (branch history table, BHT, and branch target buffer, BTB). If present, prefetchers are cleared and write-buffers are drained. We write back any dirty state (for write-back caches), invalidate the caches and TLBs, and purge the branch predictors. To preserve the computational correctness of in-flight instructions, we also flush the pipeline. In the following, we refer to this approach as the Basic Flush.

Full Flush

Motivated by our security analysis of the basic flush, we identify several secondary, stateful components deeply embedded in CVA6 with a possible timing impact. In particular, these are

  • a linear-feedback shift register (LFSR) per cache (L1 data, L1 instruction) that generates a pseudo-random number sequence for the cache-replacement policy,

  • a pseudo-least-recently-used (PLRU) tree in each TLB that identifies a replacement candidate,

  • a round-robin memory arbiter that arbitrates cache accesses between the load unit, the store unit, and the memory-management unit,

  • two round-robin arbiters in the write buffer of the write-through L1 data cache, which choose an entry to serve (lookup or write back) next.

We extend fence.t to a Full Flush by adding support to clear the state of these components as well.

Microreset

Finally, we propose a principled and systematic approach to enforce complete temporal partitioning. The key idea is to clear all on-core state by default and to explicitly exclude architectural state. We call this mechanism Microreset, as it exclusively resets non-architectural (microarchitectural) state. All flip-flops in the design are extended by an additional clear input. By asserting this input on a fence.t, we can guarantee that all state held in flip-flops in the design is set back to a predefined state. On-core state that is not resettable (such as SRAMs) must be cleared separately. To ensure computational correctness, architectural state needs to be retained, either by saving it before Microreset (e.g. write-back of an L1 cache) or by explicitly excluding it from the Microreset. Hence, we design the controller to proceed in the following six steps:

  1. Save the program counter. To resume execution after Microreset from the correct location, we store the address of the instruction following fence.t in a register that is excluded from Microreset (we consider this architectural state).

  2. Save locally modified (dirty) architectural state. In particular, this concerns components such as write-back L1 caches: to preserve the contents of dirty cache lines, we need to write them back before clearing the cache. As we will discuss later, this is the most costly step of fence.t.

  3. Drain pending transactions. Next, we need to wait for all pending external transactions to complete without issuing or accepting new ones. This way, we prevent violating any handshake protocols or losing data that we began to write back in the previous step.

  4. Clear components that are not cleared on reset. Not all components can be fully reset to a predefined state. For instance, SRAMs such as those found in caches are generally not resettable. To temporally partition these components, they need to be cleared separately, e.g. by a finite-state machine (FSM) that overwrites their contents line by line.

  5. Assert Microreset. We now clear all flip-flops containing non-architectural state. For this purpose, we assert the clear input of all flip-flops in the design. We explicitly exclude flip-flops that hold architectural state; the state of these components is preserved during Microreset. This approach removes the risk of omitting state that could create a timing channel. While identifying the architectural state is potentially one of the biggest challenges of this approach, for CVA6 it turned out to be relatively straightforward: we explicitly exclude the integer and floating-point register files, the control and status register (CSR) file, and the controller, which is driving Microreset. The figure below shows the resulting setup. In other designs, such as out-of-order cores with merged register files, identifying the architectural state might be more challenging but should, in general, still be feasible with reasonable effort.

  6. Continue execution from the saved program counter. Finally, we de-assert the Microreset and continue fetching from the next program counter.

Figure: Illustration of the Microreset.
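The six steps can be walked through on a toy software model. The sketch below is purely illustrative: the `ToyCore` class, its fields, and the reset values are hypothetical stand-ins and do not reflect the CVA6 hardware design, which implements this sequence in a hardware controller.

```python
from dataclasses import dataclass, field

# Hypothetical predefined values that non-architectural flip-flops take on clear.
RESET_VALUES = {"lfsr": 1, "plru": 0, "arbiter": 0}


@dataclass
class ToyCore:
    pc: int = 0x1000                                        # architectural
    regs: list = field(default_factory=lambda: [0] * 32)    # architectural
    memory: dict = field(default_factory=dict)              # backing store
    dirty_lines: dict = field(default_factory=dict)         # write-back L1: addr -> data
    pending: list = field(default_factory=list)             # outstanding (addr, data) writes
    sram: list = field(default_factory=lambda: [0] * 8)     # non-resettable cache SRAM
    flops: dict = field(default_factory=lambda: {"lfsr": 0x5A, "plru": 5, "arbiter": 2})


def microreset(core: ToyCore, insn_bytes: int = 4) -> ToyCore:
    # 1. Save the address of the instruction following the fence.
    resume_pc = core.pc + insn_bytes
    # 2. Write back dirty architectural state.
    core.pending.extend(core.dirty_lines.items())
    core.dirty_lines.clear()
    # 3. Drain pending transactions without accepting new ones.
    while core.pending:
        addr, data = core.pending.pop(0)
        core.memory[addr] = data
    # 4. Clear components that a reset does not reach (SRAM), entry by entry.
    for index in range(len(core.sram)):
        core.sram[index] = 0
    # 5. Assert the clear input of all non-architectural flip-flops.
    core.flops = dict(RESET_VALUES)
    # 6. Resume from the saved program counter; regs were never touched.
    core.pc = resume_pc
    return core
```

In this model, two cores with different execution histories end up with identical `flops` and `sram` contents after `microreset`, while `regs` and `memory` keep the program's architectural results.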

Time Padding

When introducing time protection in the Security Requirements section, we asserted that the context switch latency needs to be independent of previous execution (the second requirement). In RISC-V, the context switch routine is initiated by a timer interrupt of the core-local interruptor (CLINT). While this interrupt is generated at a fixed period independently of the microarchitectural state, any machine-mode or kernel code following it until the end of fence.t may be delayed by cache misses, mispredictions etc. Hence, to prevent a dependency of the context switch latency on previous execution, we pad the interval between the CLINT’s timer interrupt and the completion of fence.t to a worst-case latency. We implement this mechanism by adding a custom CSR that takes a 32-bit value. We stall the completion of fence.t until the number of cycles given by this CSR has elapsed since the CLINT timer interrupt. Padding can be disabled by setting the CSR to 0.
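The padding rule can be stated as a small function. This is a sketch of the behaviour described above, not of the hardware; the function and parameter names are hypothetical, and the case where the work finishes after the padding deadline is an assumption (the model simply completes late, which is why the padding value should be a true worst case).

```python
def fence_completion_cycle(irq_cycle: int, ready_cycle: int, pad_cycles: int) -> int:
    """Cycle at which the temporal fence completes.

    irq_cycle:   cycle of the timer interrupt that started the context switch
    ready_cycle: cycle at which the flush work itself has finished
    pad_cycles:  value of the padding register; 0 disables padding
    """
    if pad_cycles == 0:
        return ready_cycle                      # padding disabled
    deadline = irq_cycle + pad_cycles
    # Stall until the deadline; if the work overran it, finish when ready.
    return max(ready_cycle, deadline)


# Two histories with different flush durations complete at the same cycle.
fast = fence_completion_cycle(irq_cycle=1000, ready_cycle=1300, pad_cycles=500)
slow = fence_completion_cycle(irq_cycle=1000, ready_cycle=1450, pad_cycles=500)
```

As long as the padding value is at least the worst-case latency, the completion cycle depends only on the interrupt time and not on how long the preceding code and flush took.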

Evaluation

Hardware Platform

Figure: Hardware platform. The CVA6 core contains the L1 data cache (write-through or write-back) and the L1 instruction cache, backed by a write-back L2 cache and DRAM.

We evaluate the channels and defences on CVA6, an open-source, RV64GC, 6-stage RISC-V core developed at ETH Zürich and currently maintained by OpenHW Group [Zaruba2019] (CVA6 was formerly known as Ariane). It is implemented in SystemVerilog and publicly available on GitHub [GitHub:Ariane]. It features three privilege levels and address translation, and thus supports full-fledged operating systems. Its configurability, simplicity, and openness make it a good candidate for architectural exploration.

Setup

We instantiate the CVA6 core on a Xilinx Kintex-7 FPGA (Digilent Genesys II), running at 50 MHz. We configure two versions of CVA6: one with a write-through L1 data cache and one with a write-back cache. Both versions feature an 8-way, 32 KiB L1 data cache and a 4-way, 16 KiB L1 instruction cache. The caches use 16-byte lines and a pseudo-random replacement strategy driven by an 8-bit LFSR. The L1 data cache is accessed by the load, store, and memory-management units, with concurrent accesses arbitrated with a round-robin policy. The branch predictor has a 64-entry BHT and a 16-entry BTB. There are two single-level, fully associative TLBs, one for data and one for instructions, with 16 entries each, using a PLRU replacement policy. Our system-on-chip features a 512 KiB write-back L2 cache [wolfgang2019llc-thesis] that is connected to DRAM. The hardware platform figure shows the memory architecture. We partition the L2 cache by colouring [kessler1992colouring], which precludes channels in the memory backend and allows us to focus on channels resulting from on-core state.

Security Analysis

Prime and Probe

Figure: A prime-and-probe attack on a cache. Starting from an unknown cache state, the spy primes the cache (Step 1), the Trojan encodes the secret by replacing some of the spy's entries (Step 2), and the spy probes the cache (Step 3).

Techniques for exploiting covert channels are well established; for our scenario of intentional leakage, the prime-and-probe attack [percival2005cache] is simple and effective. We stress that our proposed mechanism addresses the root cause of covert channels, and we therefore expect equal efficacy against other attacks such as evict-and-time and evict-and-reload. In a prime-and-probe attack, the spy first forces the exploited hardware resource into a known state (prime). For the data cache it traverses a large buffer (in cache-line-sized strides for efficiency); for the instruction cache it executes a series of linked jumps. The TLBs are similarly primed by accessing or jumping with page-size strides. (This is a somewhat simplified description: in general, it is necessary to randomise the access order to prevent interference from prefetching, but that is not an issue on our processor.) The branch predictors are primed by a series of conditional branches (BHT) or by executing multiple indirect jumps (BTB). With a correctly-sized priming buffer, this leaves the hardware resources in a state where further accesses by the spy within the same address range are fast. This state is illustrated in the second element of the prime-and-probe figure, where all entries in the buffer have been primed by the spy.

At the end of its time slice, the OS preempts the spy and switches to the application that contains the Trojan, which accesses a subset of the hardware resource to encode the secret. Given a cache of $N$ lines, the Trojan can transmit a secret $s \le N$, the input signal, by touching $s$ cache lines, thereby replacing the spy’s content. The resulting state is illustrated in the third element of the prime-and-probe figure. More complex encodings are possible to increase the amount of data transferred in a time slice (the channel capacity), but for our purposes, the simple encoding is sufficient, as we want to prevent any leakage. When execution switches back to the spy, it again traverses (probes) the whole buffer, observing its execution time. Each entry replaced by the Trojan’s execution leads to a cache miss and results in an increase in probe time. If the latency of a hit is $t_{\mathrm{hit}}$ and that of a miss is $t_{\mathrm{miss}}$, the total latency increase is $\Delta t = s \cdot (t_{\mathrm{miss}} - t_{\mathrm{hit}})$. For our simple encoding scheme, the output signal is the total probe time, which is linearly correlated to the input signal. A more sophisticated encoding scheme could, for example, exploit the time measurements of each individual access and thus extract more information.
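The linear relation between the secret and the probe time can be reproduced with a toy model of the three attack phases. The sketch below is an idealised simulation, assuming one cache entry per buffer element, no noise, and hypothetical hit and miss latencies; it is not the attack code used in the evaluation.

```python
def probe_time(secret: int, n_lines: int = 256, t_hit: int = 1, t_miss: int = 10) -> int:
    """Total probe latency observed by the spy for a given Trojan secret."""
    if not 0 <= secret <= n_lines:
        raise ValueError("secret must be between 0 and n_lines")
    # Step 1 (prime): the spy fills every line with its own data.
    owner = ["spy"] * n_lines
    # Step 2 (encode): the Trojan touches `secret` lines, evicting the spy.
    for line in range(secret):
        owner[line] = "trojan"
    # Step 3 (probe): the spy re-accesses every line and sums the latencies.
    total = 0
    for line in range(n_lines):
        total += t_hit if owner[line] == "spy" else t_miss
        owner[line] = "spy"          # a miss refills the line with spy data
    return total


baseline = probe_time(0)
increase = probe_time(10) - baseline   # 10 evicted lines, 9 extra cycles each
```

In this model the increase over the baseline equals the number of evicted lines times the difference between miss and hit latency, which is the linear correlation the spy exploits.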

Measuring Leakage

We adopt the approach of Ge et al. [Ge2018] for quantifying and evaluating leakage and prevention strategies. For attack $i$, the Trojan encodes as input value a randomly chosen secret, $s_i$, and the spy subsequently measures as the output value its probe latency, $t_i$. The values $s_i$ and $t_i$ can be regarded as samples of the random variables $S$ and $T$. A covert channel exploits the correlation of the two random variables: if the output $T$ is correlated with the input $S$, there is a covert channel that transfers information from the Trojan to the spy. We use a sample size (number of repeated attacks) of 1 million. To assess leakage, we use a combination of two indicators: the channel matrix for visualisation and the discrete mutual information as a quantitative metric.

Channel Matrix

The channel matrix represents the conditional probability of observing a particular output value, $t$, given input value $s$. The conditional probability distribution $P(t \mid s)$ can be computed directly from the measured sample pairs $(s_i, t_i)$. We represent the channel matrix as a heat map: inputs vary horizontally and outputs vertically, and bright colours indicate high, dark colours low probability. A variation of colour along any horizontal line through the graph indicates a dependence of the output on the input, and thus a channel. For example, the channel matrix of the unmitigated write-through L1 data cache shows a clear diagonal pattern indicating a channel: if the spy observes a probe time of 85,000 cycles, it can infer with high confidence that the Trojan has encoded a value between 170 and 180. Repeating the experiment can increase the spy’s confidence.
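As an illustration of how a channel matrix can be derived from sample pairs, the sketch below estimates the conditional distribution of the output for each input by counting. It assumes the outputs are already discretised (for example, binned cycle counts); the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def channel_matrix(samples):
    """Estimate P(output | input) from (input, output) sample pairs.

    Returns a nested dict: matrix[input][output] -> conditional probability.
    """
    counts = defaultdict(Counter)
    for secret, latency in samples:
        counts[secret][latency] += 1
    matrix = {}
    for secret, row in counts.items():
        row_total = sum(row.values())
        matrix[secret] = {latency: n / row_total for latency, n in row.items()}
    return matrix


# Hypothetical measurements: input 0 mostly yields 10 cycles, input 1 yields 20.
example = channel_matrix([(0, 10), (0, 10), (0, 12), (1, 20)])
```

Each row of the result sums to one; plotting the rows side by side as a heat map gives the visualisation described above.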

Mutual Information

For quantifying channel capacity we use discrete mutual information $\mathcal{M}$, the amount of information gained about a random variable by observing another, possibly correlated random variable [Shannon_48]. Intuitively, mutual information is the difference of the information gained by observing the random variable $T$ without and with knowledge of the second random variable $S$, that is, $\mathcal{M} = H(T) - H(T \mid S)$. If both random variables are highly correlated (i.e., there exists a covert channel), the information gained by observing $T$ with knowledge of $S$ is low and $\mathcal{M}$ is high. Conversely, if both random variables are uncorrelated, $\mathcal{M} = 0$. The unit for $\mathcal{M}$ is bits; as most of our channel capacities are small, we use millibits (mb) in our measurements.
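A simple plug-in estimate of discrete mutual information from sample pairs can be sketched as follows. This is a textbook estimator written for illustration under the assumption of already-discretised outputs; it is not the estimator of the tool used for the measurements, and the function name is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information_bits(samples):
    """Plug-in estimate of I(S; T) in bits from a list of (s, t) tuples."""
    n = len(samples)
    joint = Counter(samples)                    # counts of each (s, t) pair
    count_s = Counter(s for s, _ in samples)    # marginal counts of inputs
    count_t = Counter(t for _, t in samples)    # marginal counts of outputs
    total = 0.0
    for (s, t), c in joint.items():
        # p(s,t) * log2( p(s,t) / (p(s) p(t)) ), with counts substituted.
        total += (c / n) * log2(c * n / (count_s[s] * count_t[t]))
    return total


# Output fully determined by a 1-bit input: one bit of information.
correlated = mutual_information_bits([(0, 0), (1, 1)] * 50)
# Output independent of the input: no information.
independent = mutual_information_bits([(0, 0), (0, 1), (1, 0), (1, 1)] * 25)
```

Multiplying the result by 1000 gives the value in millibits.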

Zero Leakage Upper Bound

Since all measurements are affected by noise, $\mathcal{M}$ will mostly not be zero, even if there is no channel. We use a Monte Carlo simulation for estimating the apparent channel produced by this noise. Specifically, we pick uniformly random pairs of input and output values, and thus remove any correlation between them, while retaining their original value ranges and spreads. Any mutual information that is measured from this data can only be due to noise. We repeat this process 1000 times and then compute the 95% confidence interval $\mathcal{M}_0$ for an experiment without a channel. Notably, $\mathcal{M}_0$ can differ strongly between experiments, as it depends on the range and distribution of the measured values. We conclude that a channel is present if $\mathcal{M} > \mathcal{M}_0$; otherwise, the result is consistent with no channel. We use the leakiEst tool [Chothia2013] to compute mutual information $\mathcal{M}$ and zero leakage upper bounds $\mathcal{M}_0$.
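The idea behind the zero leakage upper bound can be sketched with a permutation test: shuffling the outputs destroys any input-output correlation while keeping both value distributions, so the mutual information measured on shuffled data reflects noise only. The shuffle, the use of the 95th percentile as the bound, and all names below are assumptions of this sketch; the tool used for the measurements may proceed differently.

```python
import random
from collections import Counter
from math import log2


def mi_bits(pairs):
    """Plug-in estimate of discrete mutual information in bits."""
    n = len(pairs)
    joint = Counter(pairs)
    count_s = Counter(s for s, _ in pairs)
    count_t = Counter(t for _, t in pairs)
    return sum((c / n) * log2(c * n / (count_s[s] * count_t[t]))
               for (s, t), c in joint.items())


def zero_leakage_bound(samples, rounds=1000, quantile=0.95, seed=0):
    """Mutual information that noise alone reaches in `quantile` of rounds."""
    rng = random.Random(seed)
    inputs = [s for s, _ in samples]
    outputs = [t for _, t in samples]
    estimates = []
    for _ in range(rounds):
        rng.shuffle(outputs)                 # break any input-output pairing
        estimates.append(mi_bits(list(zip(inputs, outputs))))
    estimates.sort()
    return estimates[min(rounds - 1, int(quantile * rounds))]


def channel_present(samples, **kwargs):
    """Decision rule: a channel is reported if the measured value exceeds the bound."""
    return mi_bits(samples) > zero_leakage_bound(samples, **kwargs)
```

For example, `channel_present([(i % 4, i % 4) for i in range(400)], rounds=50)` reports a channel, since the measured two bits far exceed what shuffled data produces.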

Testbench

Figure: HW/SW stack of the evaluation framework. Application: Channel Bench (Trojan and spy); supervisor: seL4 microkernel; hardware platform: CVA6 RISC-V core.

Ge’s Channel Bench [Ge2019phd, GitHub:channel-bench] provides a minimal OS and data collection infrastructure; we port it to RISC-V and adapt it to CVA6. Channel Bench uses attack implementations from the Mastik toolkit [Yarom2016], running on an experimental version of seL4 [Klein2014_seL4] that supports time protection. The resulting stack of our evaluation framework is shown in the figure above.

L1 Data Cache

Figure: Channel matrices and corresponding mutual information ($\mathcal{M}$ [mb] / $\mathcal{M}_0$ [mb]) for the write-through L1D. Each panel plots the secret (horizontal axis) against the probe time in cycles (vertical axis), with colour indicating probability. Panels: Unmitigated (1629/0.5); Software (1165/0.5); Basic Flush (10.7/1.6); Full Flush (248.4/0.1); Microreset (21.9/27.8).
| Component | WT L1: None | WT L1: SW | WT L1: Basic flush | WT L1: Full flush | WT L1: Microreset | WB L1: None | WB L1: SW | WB L1: Basic flush | WB L1: Full flush | WB L1: Microreset |
|---|---|---|---|---|---|---|---|---|---|---|
| L1D | 1629 / 0.5 | 1165 / 0.5 | 10.7 / 1.6 | 248.4 / 10.0 | 21.9 / 27.8 | 1620 / 0.5 | 770 / 1 | 36.0 / 36.3 | 34.4 / 36.8 | 32.6 / 37.7 |
| L1I | 1891 / 0.5 | n/a | 9.5 / 1.4 | 33.4 / 42.0 | 42.0 / 42.5 | 1893 / 0.5 | n/a | 9.1 / 3.0 | 43.0 / 48.7 | 13.5 / 13.4 |
| DTLB | 378 / 0.1 | n/a | 1.7 / 4.4 | 4.8 / 5.7 | 4.3 / 7.9 | 363 / 0.1 | n/a | 69.0 / 90.0 | 37.6 / 91.4 | 60.9 / 91.6 |
| BTB | 3611 / 0.1 | n/a | 54.8 / 134.4 | 84.2 / 156.5 | 137.6 / 161.7 | 3690 / 0.1 | n/a | 85.2 / 158.2 | 92.3 / 181.3 | 83.1 / 162.7 |
| BHT | 3933 / 0.4 | n/a | 137.4 / 160.5 | 161.0 / 160.0 | 0 / 0 | 4147 / 0.2 | n/a | 118.8 / 167.2 | 99.8 / 162.2 | 0 / 0 |

Table : Timing channel capacities (M) and their corresponding zero-leakage upper bounds (M_0) for the unmitigated design and the discussed mitigation mechanisms in millibit [mb]; each cell gives M / M_0 for the write-through (WT) or write-back (WB) L1. Leaking channels are highlighted.

The channel-matrix figure above shows the results of Channel Bench for the write-through L1 data cache under the different implementation approaches of time protection. We use the write-through L1 data cache as an example for an in-depth security analysis and comparison of the proposed mechanisms. Most of the following observations also hold for the other microarchitectural components.

Unmitigated

As a baseline, we use the original, unmodified CVA6 core and run our testbench without any further on-core time protection in seL4. The Unmitigated panel shows the resulting channel matrix. A clear correlation between the Trojan’s secret and the spy’s execution time is visible, indicating the presence of a covert channel. This is confirmed by the mutual information M, which at more than 1.6 bit per iteration is clearly above the zero-leakage upper bound M_0. To illustrate this channel’s bandwidth, let us assume a 256-bit AES key, two concurrently running applications, and a time slice of 1 ms. The AES key could then be leaked in less than 320 ms. More efficient encodings could achieve even higher throughput.
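The bandwidth estimate can be reproduced with a few lines of arithmetic. The sketch below assumes that one iteration spans two 1 ms time slices (one for the Trojan, one for the spy) and that the channel is used at its measured capacity of 1629 mb per iteration; the function name and parameters are ours, chosen for illustration.

```python
import math

def leak_time_ms(key_bits, capacity_mb_per_iter, slice_ms, slices_per_iter=2):
    """Time needed to transfer key_bits over a channel of the given capacity."""
    bits_per_iter = capacity_mb_per_iter / 1000.0   # millibit -> bit
    iterations = math.ceil(key_bits / bits_per_iter)
    return iterations * slices_per_iter * slice_ms

# 256-bit key, 1629 mb per iteration, 1 ms time slices
print(leak_time_ms(256, 1629, 1.0))
```

With these inputs, 158 iterations are needed, which stays below the 320 ms quoted in the text.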

Mitigation Using Existing Architecture

We next evaluate the software-only approach, which uses only existing instructions to mitigate the timing channel. As the Software panel shows, this decreases the channel’s capacity without fully closing it. One explanation for this behaviour lies in the replacement policy of the data cache: CVA6 pseudo-randomly selects a cache entry for eviction in case of a collision. As a result, the OS cannot reliably evict all data cache entries on a context switch. It is possible to re-iterate the prime sequence, but the security guarantees remain limited and the performance costs increase rapidly, as the SW row of the context-switch latency table shows. We conclude that the current architecture does not provide the OS with sufficient means to enforce time protection, and that hardware support is needed.
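To see why re-iterating the prime sequence helps only gradually, consider an idealised cache set with W ways and uniformly random victim selection. A line left behind by the previous domain survives one conflicting miss with probability (W-1)/W, and hence k misses with probability ((W-1)/W)^k. The sketch below is a simplification under stated assumptions: the associativity of 8 is an assumed value, the pseudo-random policy is treated as truly uniform, and the set is assumed to be full already.

```python
def survival_probability(ways, misses):
    """Probability that one particular line is never chosen as victim."""
    return ((ways - 1) / ways) ** misses

WAYS = 8  # assumed associativity, for illustration only
for passes in (1, 2, 4, 8):
    misses = passes * WAYS  # one conflicting miss per way and pass
    print(passes, survival_probability(WAYS, misses))
```

The probability shrinks geometrically with each extra pass but never reaches zero, while the cost of priming grows linearly with the number of passes.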

Basic Flush

The channel matrix for the basic flush is shown in the channel-matrix figure above. While the overall appearance of the channel matrix is flat, some patterns along the x-axis remain. Additionally, the mutual information is clearly above the zero-leakage upper bound, confirming a residual channel. A closer analysis reveals that the timing of memory accesses is determined not only by the state of the cache itself, but also by that of further stateful components, such as the LFSR providing a pseudo-random index sequence for the cache replacement policy, and the round-robin memory arbiters of the core. Concurrently with our work, Vila et al. [Vila2020Flushgeist] made similar observations on an Intel core.
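The role of the LFSR can be illustrated with a toy model. The sketch below is not CVA6's implementation: the 8-bit LFSR polynomial, the seed, and the class and function names are assumptions made for illustration, and the model consults the LFSR on every miss. It shows that invalidating the cache lines alone leaves the replacement state dependent on how many misses the previous domain caused.

```python
def lfsr_step(state):
    """Advance an 8-bit Fibonacci LFSR (taps 8, 6, 5, 4) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
    return (state >> 1) | (bit << 7)

class ToyCacheSet:
    """One cache set whose victim way is picked by an LFSR (toy model)."""
    def __init__(self, ways=8, seed=0x5A):
        self.ways = ways
        self.seed = seed
        self.lfsr = seed
        self.valid = [False] * ways

    def miss(self):
        # Simplification: every miss advances the LFSR and uses it as index.
        self.lfsr = lfsr_step(self.lfsr)
        victim = self.lfsr % self.ways
        self.valid[victim] = True
        return victim

    def flush(self, reset_lfsr):
        self.valid = [False] * self.ways   # invalidate all lines
        if reset_lfsr:
            self.lfsr = self.seed          # also clear the secondary state

def first_victim_after_switch(trojan_misses, reset_lfsr):
    cache = ToyCacheSet()
    for _ in range(trojan_misses):
        cache.miss()
    cache.flush(reset_lfsr)
    return cache.miss()  # first victim way seen by the next domain

basic = {first_victim_after_switch(n, reset_lfsr=False) for n in range(16)}
full = {first_victim_after_switch(n, reset_lfsr=True) for n in range(16)}
print(len(basic), len(full))
```

When only the lines are invalidated, the first victim way observed after the switch varies with the Trojan's activity; once the LFSR is reset as well, it is the same in every run.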

Full Flush

The full flush clears these secondary components as well. While this approach closes most channels, a binary channel such as the one in the channel-matrix figure above (M = 248.4 mb) sporadically reappears in the write-through L1 data cache. The channel does not exist consistently for each measurement; we observe that its appearance depends on the initial hardware state. Running Channel Bench on CVA6 in RTL simulation, we find that a single-cycle flush of the targeted components is insufficient. As CVA6 is a pipelined design, the flush signal may reach the various components at different points in time. If the components are not reset synchronously, information can flow from a component that is not yet reset to another component that has already been reset, and thus persist. A possible solution is to assert the flush signal for multiple cycles so that it propagates through the whole design before being de-asserted, as we do for Microreset. We do not explore this path further for the full flush.

Another channel identified through this analysis is the miss handler of the L1 data cache. When it receives a new request just before fence.t is executed, it waits for the cache’s write-back and flush procedure to complete before serving the request. The response is discarded later on, but it still leaves a trace on state that was already reset by fence.t. This trace depends on the request that was issued before fence.t, and therefore on previous execution.

Although these channels might appear very small and impractical at first sight, they become more prominent as additional sources of noise are removed with the reset of other components. It is also important to highlight that our L1 data cache attack does not target these channels specifically; they are only visible as a side effect of that attack. An attack directly targeting the presented channels would presumably achieve a much higher capacity.
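The reset-skew problem described above can be illustrated with a two-register toy model. This is a sketch under our own assumptions, not CVA6 RTL: register A holds history-dependent state, register B accumulates whatever A holds unless B is being flushed, and the flush pulse reaches A one cycle later than B. All names and delays are hypothetical.

```python
def run_flush(secret, flush_cycles, delay_a=1, delay_b=0, total_cycles=6):
    """Simulate a flush pulse of flush_cycles cycles arriving with skew."""
    a, b = secret, 0
    for cycle in range(total_cycles):
        flush_a = delay_a <= cycle < delay_a + flush_cycles
        flush_b = delay_b <= cycle < delay_b + flush_cycles
        next_a = 0 if flush_a else a
        # B latches A's contents (sticky OR) whenever it is not held in reset.
        next_b = 0 if flush_b else (b | a)
        a, b = next_a, next_b
    return a, b

print(run_flush(0b1011, flush_cycles=1))  # single-cycle pulse
print(run_flush(0b1011, flush_cycles=2))  # pulse covers the skew
```

With a single-cycle pulse, B is released from reset while A still holds the old value and captures it; holding the pulse for at least as many cycles as the skew leaves both registers clear.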

Microreset

Finally, Microreset yields the expected result, as the Microreset panel demonstrates. The channel is consistently closed across configurations and attacks, as the capacity table confirms.
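Whether a channel counts as closed is judged by comparing the measured mutual information M against the zero-leakage bound M_0. As a rough illustration of what M captures, the sketch below estimates the mutual information between secret and observed time from discrete samples, assuming the observed frequencies stand in for the true distribution. This is a simplification: it is a plain histogram estimator written for this example, not the estimator used by Channel Bench, and it does not compute M_0.

```python
import math
from collections import Counter

def mutual_information_bits(samples):
    """Estimate I(secret; time) in bits from a list of (secret, time) pairs."""
    n = len(samples)
    joint = Counter(samples)
    count_s = Counter(s for s, _ in samples)
    count_t = Counter(t for _, t in samples)
    mi = 0.0
    for (s, t), c in joint.items():
        p_st = c / n
        mi += p_st * math.log2(p_st / ((count_s[s] / n) * (count_t[t] / n)))
    return mi

# Four secrets; timing either reveals the secret or is constant.
leaky = [(s, 100 + s) for s in range(4) for _ in range(10)]
closed = [(s, 100) for s in range(4) for _ in range(10)]
print(mutual_information_bits(leaky), mutual_information_bits(closed))
```

A timing that fully determines one of four secrets carries 2 bit, whereas a constant timing carries none, which corresponds to a flat channel matrix.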

Further Components

Besides the L1 data cache, we analyse prime-and-probe attacks on the L1 instruction cache, the data TLB, the BTB, and the BHT; see the capacity table. They confirm our findings from the L1 data cache: the unmodified design leaks significant amounts of data (e.g. more than 4 bit per iteration for the BHT with the write-back L1), while executing fence.t using Microreset during a context switch reliably closes all channels.

Context-Switch Latency

[Timeline diagram: Spy, Context Switch, Trojan, Context Switch, Spy. A CLINT interrupt triggers each context switch; the measured time span runs from the end of the spy’s first slot to the start of its second.]

Figure : Time span measured by the spy in the context-switch latency channel.

[Figure panels: channel matrices for the context-switch latency. Horizontal axis: Secret (0 to 257); vertical axis: Time (cycles); colour scale: Probability.]

Figure : Unmitigated. (M = 1297.9 mb, M_0 = 0.5 mb)
Figure : fence.t without time padding. (M = 7257.2 mb, M_0 = 0.4 mb)
Figure : fence.t with time padding. (M = 1.4 mb, M_0 = 1.6 mb)
Figure : Evicted time measured by the spy over the secret number of L1 data cache lines written by the Trojan.

To evaluate leakage through the context-switch latency, we configure the spy to measure the time span during which it is evicted, as shown in the timeline figure above. In this interval, the Trojan application runs for a fixed time slice before the OS performs a context switch back to the spy. The Unmitigated panel shows the secret number of L1 data cache lines that the Trojan writes on the horizontal axis and the corresponding eviction duration measured by the spy on the vertical axis. A clear correlation between both is visible, indicating a covert channel. When the Trojan writes to cache lines, it evicts any kernel data stored at the same location. Thus, the context-switch routine incurs cache misses, which increase the context-switch latency.

Resetting the microarchitectural state by executing fence.t on a context switch without accounting for the latency worsens the situation: the panel without time padding shows a covert channel close to its theoretical upper capacity limit of 8 bit. As fence.t needs to write back the L1 data cache’s dirty cache lines sequentially, its latency directly depends on the secret number of previously written cache lines.

We proposed padding to the worst-case execution latency earlier. Hence, we measure the latency for a fully dirty write-back cache in four parts: the cycles from the CLINT timer interrupt until the start of the execution of fence.t, the cycles for writing back and invalidating the dirty L1 data cache, the cycles spent draining pending transactions, and the cycles for which Microreset is asserted. (The majority of the first part, approximately 1800 cycles, is spent reconfiguring the CLINT. Scheduling takes around 800 cycles, and switching to the new thread, e.g. changing the address space, takes another 320 cycles.) We conservatively round the resulting total, from the CLINT timer interrupt until the end of fence.t, up to obtain the pad time. Similarly, we determine a worst-case upper bound of 3,700 cycles for the write-through L1 data cache. As the panel with time padding shows, this removes any dependence of the context-switch latency on previous execution and microarchitectural state.
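The effect of padding can be sketched as follows. All cycle counts in this example are hypothetical values picked for illustration and are not the measurements from our evaluation; the function names are ours as well. The unpadded latency grows with the number of dirty lines, whereas the padded latency is the same for every secret as long as the pad time covers the worst case.

```python
def raw_latency(dirty_lines, base_cycles=3000, cycles_per_line=40):
    """Hypothetical switch latency: fixed part plus sequential write-back."""
    return base_cycles + cycles_per_line * dirty_lines

def padded_latency(dirty_lines, pad_cycles):
    """Latency observed when the switch completes only at the pad deadline."""
    raw = raw_latency(dirty_lines)
    if raw > pad_cycles:
        raise ValueError("pad time is below the worst-case latency")
    return pad_cycles

MAX_LINES = 256                       # hypothetical number of cache lines
worst_case = raw_latency(MAX_LINES)   # fully dirty cache
pad = 14_000                          # worst case rounded up conservatively

unpadded = {raw_latency(n) for n in range(MAX_LINES + 1)}
padded = {padded_latency(n, pad) for n in range(MAX_LINES + 1)}
print(len(unpadded), len(padded))
```

The unpadded set contains one distinct latency per secret, while the padded set collapses to a single value, so the spy learns nothing from the eviction time.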

Costs

Context-Switch Latency

To evaluate the context-switch latency, we use the inter-address-space IPC benchmark from sel4bench [github:sel4bench]. Analogously to the covert-channel evaluation above, we pad fence.t to the worst-case latency. However, as the benchmark uses the process-initiated fastpath context-switch routine, the beginning of the context-switch routine is, in contrast to that evaluation, not defined by a CLINT timer interrupt. Hence, for our performance evaluation, we use the privilege-level switch from U-mode as the start of the pad interval. This event approximates the CLINT timer interrupt for full context switches.

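| Mitigation | Variant | Write-Through L1: Mean | Write-Through L1: SD | Write-Back L1: Mean | Write-Back L1: SD |
|---|---|---|---|---|---|
| None | Hot | 514 | 0 | 423 | 0 |
| None | Cold/Dirty | 1243 | 2 | 1827 | 217 |
| SW | | 40650 | 2 | 41152 | 15 |
| fence.t | Basic flush | 4073 | 1 | 22402 | 5 |
| fence.t | Full flush | 4067 | 1 | 22402 | 5 |
| fence.t | Microreset | 4125 | 0 | 22450 | 0 |

Table : seL4 fastpath context-switch latency in cycles without mitigation, mitigated using existing architecture (SW), and with fence.t. fence.t is padded for a worst-case slowpath context switch.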

The table above shows the context-switch latencies for different configurations. Hot assumes that the core very recently executed a context switch; hence the caches contain kernel data, the branch predictors are trained, etc. No on-core mitigations against timing channels are in place. Cold/Dirty, on the other hand, is the context-switch latency for an untrained and dirty microarchitecture. This is the case, for instance, after a resource-intensive application has evicted all kernel information during its time slice. We assume that this is the more common case in practice. We emulate this behaviour by resetting the microarchitecture and, if the cache is configured as write-back, polluting the L1 data cache from user space between context switches. Finally, SW and fence.t are the resulting context-switch latencies after configuring seL4 to mitigate on-core timing channels using the approaches described earlier.

The software mitigation (SW) is significantly more expensive than the fence.t approaches while only partially mitigating the L1 data cache channel. A more extensive mitigation would make this approach even more costly. Comparing Microreset to both flush approaches, there are no significant performance differences. The reason is that most components of CVA6 are reset in parallel, so flushing or resetting more state usually does not require more cycles. The dominating factor for both approaches is the write-back of the L1 data cache, padded for the worst case. This means that a principled reset generally does not imply higher costs than a selective flush.

Compared to the cold, unmitigated case, fence.t adds less than 21,000 cycles to the context-switch routine. Assuming a processor running at 1 GHz and a context-switch frequency of 1 kHz, this increase corresponds to an overhead of about 2.1%. Decreasing the frequency of fence.t (e.g. in a hypervisor scenario or by clustering mutually trusting applications into security domains) would decrease the relative overhead accordingly. It is important to note that fence.t is padded for the worst-case slowpath context switch, while Cold/Dirty gives the fastpath latency. When applied to a slowpath context switch, the overhead would be smaller. In addition to the direct costs shown here, the dirty cache would cause indirect costs resulting from cache misses and write-backs experienced after the context switch, while after fence.t, execution continues with a clean cache. As such, the table overestimates the performance impact of fence.t.
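The overhead figure follows directly from the cycle counts. The short sketch below redoes the arithmetic, assuming a 1 GHz clock and a 1 kHz context-switch frequency (one switch per 1 ms time slice); the function name is ours.

```python
def relative_overhead(extra_cycles, clock_hz, switch_hz):
    """Fraction of all cycles spent on the extra context-switch work."""
    cycles_per_period = clock_hz / switch_hz
    return extra_cycles / cycles_per_period

# Upper bound from the text: fewer than 21,000 additional cycles.
print(f"{relative_overhead(21_000, 1e9, 1e3):.1%}")
# Measured difference for the write-back L1: Microreset minus Cold/Dirty.
print(f"{relative_overhead(22450 - 1827, 1e9, 1e3):.2%}")
# Ten times fewer fence.t executions reduce the overhead tenfold.
print(f"{relative_overhead(21_000, 1e9, 1e2):.2%}")
```

The last line illustrates the remark on switch frequency: the relative overhead scales linearly with how often fence.t is executed.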

Indirect Costs

Resetting the microarchitectural state on a context switch potentially removes an application’s information from the on-core state that is still required at a later point, resulting in indirect costs due to cache misses or mis-speculation once the application is re-scheduled. However, Ge et al. have shown that application-specific information is mostly evicted from the on-core state (L1 caches, branch predictors, etc.) after one or several time slices of execution of other applications, and that the indirect costs of a reset on context switch are therefore limited. A difference in indirect costs between Microreset and the selective flush can generally only be caused either by a poor choice of reset values for one of the two approaches, or by a residual timing channel.

Hardware Overhead

We synthesise the original and modified versions of CVA6 in GlobalFoundries 22FDX technology at 1 GHz under worst-case conditions (0.72 V, 125 °C). We convert the results to gate equivalents (GE), a technology-independent unit for the complexity of a circuit. The area overhead of our modifications is negligible at 0.4%, with the controller being the largest addition at around 1.6 kGE compared to a total core area of 1.2 MGE. There is no significant impact on the critical path.

Related Work

The L1 data cache is the focus of several previous works on on-core timing channel mitigation. One possible approach is spatial partitioning of the cache, proposed by Page [Page2005Partition] and followed up on by Domnitser et al. [Domnitser2012NoMo] and Dessouky et al. [Dessouky2021ChunkedCache]. Since L1 caches are relatively small and time-shared between applications, spatial partitioning of these components is not very efficient. Wang and Lee [Wang2007DynamicCache1, Wang2008DynamicCache2] propose a dynamic, randomised cache remapping to mitigate targeted cache collisions. An improved implementation was presented by Qureshi [Qureshi2018CEASER]. We find this approach insufficient, as in general, randomisation merely adds noise to a communication channel without fundamentally closing it. As Constable and Unterluggauer [Constable2021SEED] demonstrate, cache line mappings can be designed to prevent specific attacks (i.e. prime-and-probe on the index of an accessed cache line). Besides imposing impractical constraints on the cache layout, this approach is not suited for other attacks, such as prime-and-probe on the number of accessed cache lines, which we use in this work. None of the works mentioned above consider microarchitectural components besides the L1 data cache. Tiwari et al. [Tiwari2009ExecutionLeases] propose a major architectural modification to fundamentally separate data from control flow.

Our work extends that of Ge et al., who propose time protection and the need for flushing all microarchitectural on-core state on a partition switch, and demonstrate the need for hardware support [Ge2018, Ge2019a, Ge2019phd], which is what our temporal fence provides. There exist several approaches in a similar direction: Bourgeat et al. [Bourgeat2019MI6] present a processor with a purge instruction, similar to our fence.t, that flushes on-core microarchitectural components to secure enclaves. Li et al. [Li2020SIMF] propose FenceX, an instruction similar to the basic flush version of fence.t presented in this work, which we found insufficient to close all timing channels reliably. Escouteloup et al. [Escouteloup2021Dome] present an ISA extension that allows the allocation of hardware resources to security domains. They implement and evaluate their proposal bare-metal on an embedded RISC-V core designed to model known microarchitectural vulnerabilities, whereas this work targets an existing, application-class RISC-V core, optimised for efficiency and running a full operating system.

To the best of our knowledge, all previous works on temporal partitioning first identify vulnerable microarchitectural components and then add them to a partition set. A major drawback of this approach is that it remains difficult to make hard claims, such as that all microarchitectural covert channels are closed. Our Microreset works the other way around: all stateful components are reset by default, and only a selected set of architectural state is explicitly excluded from the reset. Furthermore, we also consider the flush latency itself, since neglecting it can open significant new channels, as demonstrated in our context-switch latency evaluation.

Conclusions

In this work, we present the temporal fence instruction, fence.t, which allows an OS to reliably prevent on-core timing channels. We propose and compare different hardware implementations of fence.t, ranging from a basic flush of well-known vulnerable microarchitectural components, over an exhaustive flush of manually identified vulnerable secondary components, to a systematic erasure of all non-architectural state, which we call Microreset. Evaluating these mechanisms on the open-source RV64GC CVA6 core with different cache configurations and running an experimental, unverified version of the seL4 microkernel, we find that fence.t with Microreset is the only approach that consistently closes all timing channels, while offering a low implementation effort, a 2.1% performance impact for a typical 1 GHz system with a 1 ms context-switch period, and negligible hardware costs.