ShadowGuard : Optimizing the Policy and Mechanism of Shadow Stack Instrumentation using Binary Static Analysis

02/18/2020
by Buddhika Chamith, et al.

A shadow stack validates on-stack return addresses and prevents arbitrary code execution vulnerabilities due to malicious returns. Several recent works demonstrate that without shadow stack protection, control-flow integrity – a related security hardening scheme – is vulnerable to attacks. These benefits notwithstanding, shadow stacks have not found mass adoption due to the high overheads they impose. In this work, we re-examine the performance viability of shadow stacks as a binary hardening technique. Our work is inspired by the design principle of separating mechanism and policy. Existing research on shadow stacks focuses on optimizing the implementation of the shadow stack, which is the mechanism. At the policy level, we define Return Address Safety (RA-Safety) to formally capture the impact of memory writes on return addresses. Based on RA-Safety, we design safe function elision and safe path elision, two novel algorithms to optimize the instrumentation policy for shadow stacks. These two algorithms statically identify functions and control-flow paths that will not overwrite any return address, so we can safely elide instrumentation on them. Finally, we complement the above policy improvements with register frame, binary function inlining, and dead register chasing, three new mechanism optimizations. We evaluate our new shadow stack implementation, ShadowGuard, with SPEC 2017 and show that it reduces the geometric mean overhead from 8.1% to 2.4% over an unoptimized shadow stack. We also evaluate several hardened server benchmarks, including the Apache HTTP Server and Redis, and the results show that the above techniques significantly reduce the latency and throughput overheads.


1 Introduction

Shadow stacks prevent control-flow hijacking attacks which are based on modifying a return address on the call stack. Shadow stacks maintain a copy of the regular stack in a tamper-proof region, and check that both copies match on all function returns. This scheme is effective against code reuse attacks like Return-oriented programming (ROP) [27] and return-to-libc [35].
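To make the basic mechanism concrete, the following is a minimal C++ sketch of the push and check operations a shadow stack conceptually performs around a call; the names and the thread-local layout are our own illustration, not the instrumentation emitted by any particular tool.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Illustrative shadow stack: a thread-local array of saved return addresses.
constexpr std::size_t kShadowDepth = 4096;
thread_local std::uintptr_t shadow_stack[kShadowDepth];
thread_local std::size_t shadow_top = 0;

// Executed at function entry: save a copy of the on-stack return address.
inline void shadow_push(std::uintptr_t return_address) {
  shadow_stack[shadow_top++] = return_address;
}

// Executed at function exit: the return address about to be used must match
// the saved copy; otherwise the stack has been tampered with.
inline void shadow_check(std::uintptr_t return_address) {
  if (shadow_top == 0 || shadow_stack[--shadow_top] != return_address) {
    std::fprintf(stderr, "return address mismatch: aborting\n");
    std::abort();
  }
}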

Precisely checking return edges using shadow stacks also helps to secure other coarse-grained mitigation techniques such as control-flow integrity (CFI) [1], which locks down the application's run-time control flow to a statically determined, safe (but over-approximate) control-flow graph. For example, recent work on Control-Flow Bending [6] shows that CFI solutions which do not use shadow stacks can be subverted even if they use a finer-grained return-edge approximation that allows returning to any of the hardened function's call sites.

Even with the aforementioned benefits, however, shadow stacks have not found mass adoption due to high overheads. Dang et al. [9] report a geometric mean overhead of 12% on the SPEC 2006 CPU benchmarks. Historically, any mitigation technique imposing overheads of more than 5% [29] has not found general adoption. Another recent work [5] implements a compiler-based shadow stack using a dedicated register and shows that lower overheads of around 5% can be achieved, albeit by sacrificing compatibility with unprotected code and requiring source code. For real-world adoption, however, compatibility with unprotected code is highly desired, since the hardened application may be required to work with other closed-source libraries.

In this work, we re-examine the performance viability of shadow stacks that preserve compatibility with unprotected code. We observe that when it comes to overhead reduction, prior work largely stops at optimizing the shadow stack instrumentation. From an instrumentation policy vs. mechanism standpoint, such optimizations can be thought of as improving the shadow stack mechanism. While such optimizations can result in significant overhead reductions, we observe that under a practical threat model, it is possible to statically establish safety guarantees for certain code regions in the application, which allows us to elide shadow stack instrumentation for certain control-flow paths, or even for entire functions—resulting in further overhead reductions. Instrumentation elision describes a particular instrumentation policy for shadow stacks, which does not sacrifice security and complements existing research that optimizes the shadow stack mechanism.

We base our instrumentation-elision policy on the Return Address Safety (RA-Safety) property. We define RA-Safety on functions (RA-Safety_f) and on basic blocks (RA-Safety_bb). We can informally state RA-Safety as follows: a given function or basic block satisfies RA-Safety if its execution (1) does not cause stack mutations above the local stack frame, starting from the local return address, and (2) does not result in calls to functions that do not satisfy RA-Safety. So, by definition, any code region which satisfies RA-Safety will not modify any of the return addresses on the run-time call stack.

The simplest example of a function satisfying RA-Safety is a leaf function that does not write to memory. A straightforward static analysis can identify such functions. Even if a function mutates the stack, we can still perform data-flow analysis on the stack pointer at each stack write and conservatively determine whether any write breaks RA-Safety. In addition, we analyze the function call graph to extend RA-Safety to non-leaf functions.
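As a concrete (hypothetical) illustration of what this analysis looks for, the first leaf function below would be classified as RA-Safe, while the second would not, since the destination of its write cannot be bounded:

// RA-Safe: reads memory, writes only registers and locals that sit below
// the return address in the function's own frame.
int sum3(const int* v) {
  int s = 0;
  for (int i = 0; i < 3; ++i) s += v[i];
  return s;
}

// Not RA-Safe: writes through a pointer whose stack height is unknown to
// the analysis, so the write could in principle reach a return address.
void store_flag(int* out, int flag) {
  *out = flag;
}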

RA-Safety_bb extends the safety property to the intra-procedural control-flow graph in order to find complete—from function entry to return—control-flow paths which satisfy RA-Safety. We develop an algorithm to restrict the shadow stack instrumentation to cover only non-RA-Safe intra-procedural control flows (i.e., control flows containing non-RA-Safe basic blocks), thus eliding instrumentation on safe paths.

Next, we complement our instrumentation-elision policy with several improvements to the shadow stack instrumentation, i.e., the mechanism. For leaf functions, we implement the register frame optimization, which utilizes an unused register to hold the shadow stack frame. The dead register chasing optimization moves instrumentation away from typical instrumentation locations (such as function entry) in search of dead registers which can be used as scratch space in the instrumentation code, thus reducing the instructions emitted to save and restore register state. Finally, we develop a heuristic for inlining short leaf functions and thereby avoid instrumentation within them.

We implemented our new shadow stack, ShadowGuard, using Dyninst [25], an established binary analysis and instrumentation tool. In our initial experiments, however, Dyninst incurred prohibitive overheads even with empty instrumentation (a NOP)—over 100% in some applications. We therefore improved Dyninst’s instrumentation trampoline generation and exception-handling support and brought the framework overhead close to zero. We are attempting to upstream our changes to mainstream Dyninst and hope that our improvements can benefit other security researchers.

Our evaluations on SPEC 2017 CPU benchmarks show that our policy and mechanism optimizations reduce the geometric mean overhead from 8.1% to 2.4% over the unoptimized instrumentation. We also benchmarked hardened Redis and Apache HTTP servers, and with our optimized shadow stack, the throughput degradation in various benchmark settings was significantly lower.

Figure 1: A call tree with a mix of safe and unsafe functions with respect to memory writes. Functions shown in gray feature memory writes which break RA-Safety. Even though c does not contain unsafe writes, it calls the unsafe f and hence breaks RA-Safety. The dotted line marks the frontier between safe and unsafe functions (with respect to RA-Safety) obtained by traversing safe functions from the bottom up.

In summary, in this paper we present a novel shadow stack implementation named ShadowGuard, which follows a two-pronged approach for reducing shadow-stack overheads using (1) instrumentation-elision and (2) instrumentation code optimizations, both based on binary static analysis, with the following specific contributions:

  • We define RA-Safety under a practical threat model and describe a set of static analyses which determine the RA-Safety of a given function or basic block.

  • We devise safe function elision and safe path elision—two algorithms that perform instrumentation elision on inter- and intra-procedural control flows, respectively, based on RA-Safety.

  • We formally prove that safe function elision and safe path elision preserve run-time return-edge safety by not skipping instrumentation checks on any unsafe function return.

  • We describe several novel shadow-stack instrumentation optimizations which complement the overhead reductions obtained with instrumentation elision.

  • We significantly reduce Dyninst’s binary instrumentation overhead which improves its utility for security research.

2 Instrumentation Elision Overview

Instrumentation elision reduces shadow-stack overhead by cutting down the dynamic instruction footprint of the instrumentation. We base safe function elision on RA-Safety_f and safe path elision on RA-Safety_bb, as follows:

Safe Function Elision

Consider the call tree in Figure 1. Here, leaf functions d and e are safe with respect to RA-Safety and hence eligible for instrumentation elision. Furthermore, RA-Safety is a structurally defined property on the call graph—that is, a function is safe only if all of its callees are safe, given that there are no unsafe memory writes within the function itself. Since b does not have unsafe callees, b also satisfies RA-Safety. However, function c is unsafe due to its call to f, even though it does not feature unsafe memory writes itself. In practice, this means that function f may overwrite function c's return address, since f may write above its own stack frame. In Section 4 we describe an iterative algorithm for propagating RA-Safety through the call graph.
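The same structural reasoning can be seen on a small hypothetical code fragment whose call structure mirrors Figure 1 (function names chosen to match the figure):

#include <cstddef>

// e: leaf with no memory writes                    -> RA-Safe
int e(int x) { return x + 1; }

// b: calls only RA-Safe callees, no unsafe writes  -> RA-Safe
int b(int x) { return e(x) * 2; }

// f: writes through an unbounded pointer           -> not RA-Safe
void f(char* buf, std::size_t i) { buf[i] = 0; }

// c: contains no unsafe writes itself, but calls the unsafe f,
// so RA-Safety cannot be established for c either.
int c(char* buf) { f(buf, 128); return 0; }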

This bottom-up instrumentation elision typically results in significant overhead reductions, since functions deep in the call graph, at or near the leaves, (1) tend to be short-lived, so the instrumentation overhead as a percentage of function execution time is higher, and (2) are more frequently executed, contributing more to the total execution frequency of the instrumentation.

Safe Path Elision

We further propagate safety through the intra-procedural control-flow graph using RA-Safety_bb. Now let us assume the unsafe leaf function f has the control-flow graph shown in Figure 2. Any control-flow path containing the RA-Safety-breaking basic block 4 may overwrite the return address of the current frame or of any other frame above it. However, there are paths, such as 1 -> 6 and 1 -> 2 -> 3 -> 5 -> 6, which are safe. With typical entry and exit block instrumentation at blocks 1, 6 and 7, we would be doing redundant shadow stack checks on such safe paths. If these safe paths are frequently taken, the redundant instrumentation cost would be significant.

Safe path elision removes such inefficiencies by lowering the instrumentation from function entry to cover only path segments which feature unsafe basic blocks. In Figure 2, the naive solution is to introduce shadow stack instrumentation covering all incoming and outgoing edges of basic block 4. However, such a solution is not ideal: since basic block 4 is contained within a loop, the stack check would execute as many times as the loop iterates. In Section 6 we develop an algorithm for safe path elision with the following guarantee—after instrumentation lowering, any unsafe run-time control-flow graph walk (i.e., cycles included) will execute exactly one shadow stack check. Additionally, the algorithm ensures that stack operations happen in pairs and in the correct order (i.e., a stack pop must follow a push).

Figure 2: An intra-procedural control-flow graph featuring a mix of safe and unsafe basic blocks. Basic block 4, shown in gray, features unsafe memory writes. Sink blocks (e.g., return blocks) are shown in double rectangles. Any control-flow path going through basic block 4 is unsafe.

Furthermore, we can extend safe path elision to unsafe non-leaf functions. Let us consider function c in Figure 1, which is unsafe by virtue of calling the unsafe f. None of its basic blocks contain unsafe memory writes. So we extend the definition of RA-Safety_bb so that any basic block featuring a call to an unsafe function is unsafe. Now we can consider the control-flow paths containing the unsafe basic block involving the function call to f and apply safe path elision. In Section 6 we provide a safety proof for safe path elision with respect to both intra-procedural and inter-procedural control flow.

3 Threat Model and RA-Safety

We assume the following threat model for instrumentation elision.

Threat Model

We assume that the process is running with non-executable data and non-writable code, which is enforced by the hardware [17]. We assume that an attacker may cause arbitrary memory writes within the thread-local stack or within the heap. However, a thread may not overwrite another thread’s stack in order to cause a meaningful exploit. We note that this threat model is similar to the ones assumed in other shadow stack research [5, 9], but is somewhat more relaxed than the threat models assumed in previous literature for control-flow integrity [6]. We believe this model is still practical because:

  1. Attempts at overwriting different thread stacks with contiguous buffer overflows in the heap or within a thread-local stack should, with high probability, lead to accesses to unmapped or write-protected memory before non-local thread stacks are reached, due to the sparsity of mapped virtual-memory regions and thread-stack guard pages [31].

  2. Leveraging non-contiguous writes would be difficult, since reliably identifying stack references containing return addresses requires precise knowledge of the stack-frame layout of a thread at a given point in time, and requires precisely timing the address overwrite due to the ephemeral nature of stack frames.

Under the above threat model, we now formally define RA-Safety for basic blocks (RA-Safety_bb) and for functions (RA-Safety_f).

RA-Safety

Let f be a function and bb be a basic block within f. Let i_k, 1 ≤ k ≤ n, be an instruction within f, given that f contains n instructions. Let BB_f be the set of basic blocks belonging to f, and let I_bb ⊆ {i_1, …, i_n} be the set of instructions whose range belongs to bb. Let the set of inter-procedural targets (i.e., functions) incident on the outgoing edges of the basic block bb be given by F_bb. Note that we consider basic blocks ending with returns as containing no outgoing edges, while basic blocks terminating with direct or indirect calls have a nonempty F_bb. Assuming downward stack growth, let the stack-height of a stack memory reference be the address offset of the reference with respect to the stack-frame start address. We define the stack frame to contain the return address of the caller; hence the stack-height at function entry is -word-size, where word-size is the size of a memory reference in the system.

Now, we define a flat data-flow lattice of stack heights and the identity abstraction function α, such that α(x) = x.

We also define variables sh_k which hold the stack-height of the destination memory operand of each i_k. We then assume a map h : i_k ↦ sh_k that gives the fixed-point values for sh_k at each i_k after carrying out a stack-height analysis at each instruction in function f.

Next, we define a flat lattice {SAFE, UNSAFE} and an abstraction function which maps a stack-height to UNSAFE if the corresponding write may reach the return address of the local frame or any address above it (including the unknown height ⊤), and to SAFE otherwise.

Now, we assume block-level and function-level flow functions F_b(bb, h, d) and F_f(f, h, d): given a basic block bb (respectively, a function f), the stack-height map h, and incoming data-flow information d, each returns outgoing data-flow information over the lattice {SAFE, UNSAFE}.

Let d_bb and d_f contain the basic-block and function data-flow information for bb and f, respectively. We define F_b and F_f as follows: F_b marks bb as UNSAFE if any instruction in I_bb writes to a stack-height that is not strictly below the local return address (at stack-height -word-size), or if any function in F_bb is UNSAFE; F_f marks f as UNSAFE if any basic block in BB_f is UNSAFE.

Now we can define RA-Safety_f and RA-Safety_bb for a function and a basic block, respectively:

RA-Safety_bb(bb) holds iff fix(F_b)(bb) = SAFE, and RA-Safety_f(f) holds iff fix(F_f)(f) = SAFE.

Here fix is the least fixed point of each flow function for a given basic block or function under a given stack-height analysis h and initial flow conditions d_bb and d_f, respectively.
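Written out compactly in the notation above, our reading of the two predicates is the following (the inequality reflects the stated convention that the local return address sits at stack-height -word-size, and writes of unknown height ⊤ are treated as unsafe):

\begin{align*}
\mathit{RA\text{-}Safety}_{bb}(bb) \iff{}
  & \forall i_k \in I_{bb} \text{ with a memory destination}:\;
      h(i_k) \neq \top \;\wedge\; h(i_k) < -\mathit{word\text{-}size} \\
  & {}\wedge\; \forall g \in F_{bb}:\; \mathit{RA\text{-}Safety}_{f}(g) \\[4pt]
\mathit{RA\text{-}Safety}_{f}(f) \iff{}
  & \forall bb \in BB_f:\; \mathit{RA\text{-}Safety}_{bb}(bb)
\end{align*}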

4 RA-Safety Implementation

RA-Safety for basic blocks and functions can be calculated using an iterative algorithm based on work lists. We first calculate the strongly connected components of the call graph, with special handling for indirect calls described below, and carry out the iterative algorithm on the resulting directed acyclic graph with a bottom-up post-order traversal. CalculateRASafety outlines the algorithm.

1:procedure CalculateRASafety(program, h)
2:     callGraph ← CallGraph(program)
3:     sccs ← ConnectedComponents(callGraph)
4:     result ← {}
5:     for scc in BottomUpPostOrder(sccs) do
6:          ProcessComponent(scc, h, result)
7:     return result
8:
9:procedure ProcessComponent(scc, h, result)
10:     fns ← scc.functions
11:     blocks ← fns.blocks
12:     workList ← blocks
13:     while workList ≠ ∅ do
14:          bb ← workList.remove()
15:          safety ← BlockSafety(bb, h, result)
16:          if safety ≠ result[bb] then
17:               for pred in InterProcPreds(bb) do
18:                    workList.add(pred)
19:               result[bb] ← safety
20:     for fn in fns do
21:          result[fn] ← FunctionSafety(fn, result)

The input to the algorithm is a stack-height analysis h which maps each instruction to the stack height of its destination operand. For instructions not featuring a memory write, the height is undefined. Also, the analysis must be sound, meaning that h may map a stack write to the unknown height ⊤, but cannot map a non-stack write to a concrete stack height. In order to obtain h, we use the stack-height analysis in Dyninst’s DataflowAPI [10], which is an intra-procedural analysis that maps each register at each instruction to a stack-height lattice value. The stack-height analysis tracks the effect of each instruction on the stack pointer and its aliases. Since the lattice is flat, if a function dynamically grows the stack in a loop, we will map stack writes in the loop to ⊤.

In line 15, BlockSafety determines whether a basic block contains unsafe memory writes or unsafe function calls. For binary analysis, it is extremely difficult to precisely determine the call targets of indirect calls. Therefore, we assume that indirect calls are unsafe, and our implementation returns UNSAFE when the basic block contains an indirect call.

In line 17, InterProcPreds gives the set of inter-procedural predecessors of a given basic block. It returns the empty set for any non-entry basic block. Here, indirect call edges are irrelevant, as a function that contains an indirect call is assumed to be unsafe. The fixed-point data-flow information in the result map returned from CalculateRASafety can be used to determine the RA-Safety of any function or basic block, as shown in Section 3.
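A C++ sketch of the block-level check referenced at line 15 is shown below; the data structures and the unsafe-height test are our own simplification (the real implementation queries Dyninst's DataflowAPI for the stack heights):

#include <unordered_map>
#include <vector>

enum class Safety { kSafe, kUnsafe };

struct Function;
struct Instruction { bool writes_memory; bool height_known; long stack_height; };
struct Block {
  std::vector<Instruction> insns;
  std::vector<const Function*> callees;  // nonempty only if the block ends in a call
  bool has_indirect_call;
};
struct Function { std::vector<Block*> blocks; };

// A block is unsafe if it makes an indirect call, if any write may reach the
// return address (unknown or non-local stack height), or if any known callee
// is currently marked unsafe.
Safety BlockSafety(const Block& bb, long word_size,
                   const std::unordered_map<const Function*, Safety>& fn_safety) {
  if (bb.has_indirect_call) return Safety::kUnsafe;
  for (const Instruction& insn : bb.insns) {
    if (!insn.writes_memory) continue;
    if (!insn.height_known || insn.stack_height >= -word_size)
      return Safety::kUnsafe;
  }
  for (const Function* callee : bb.callees) {
    auto it = fn_safety.find(callee);
    if (it == fn_safety.end() || it->second == Safety::kUnsafe)
      return Safety::kUnsafe;
  }
  return Safety::kSafe;
}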

5 Safe Function Elision

Using RA-Safety_f, we perform Safe Function Elision—instrumentation elision on functions. Before formally stating the safety property of Safe Function Elision, we define Validation-Soundness, which captures the soundness of instrumentation with respect to return-edge safety.

Validation-Soundness

Any return address modification on the call-stack will be detected by some shadow-stack instrumentation check.

Now, given a function satisfying RA-Safety_f, we claim the following.

Safe Function Elision Property

Eliding instrumentation at any function which satisfies RA-Safety_f preserves validation-soundness.

We provide a proof for the above claim.

Proof : Given a leaf function f_n, let us assume a run-time call stack S with f_n on the top of the stack (i.e., f_n is being executed on the processor). Let us assume S contains call-stack frames s_1, …, s_n, with s_n being the top-of-stack frame, belonging to function f_n. Also, let f_i be the function associated with the call-stack frame s_i. Let r_i be the stack reference to the return address at call-stack frame s_i.

First, we present the following lemma about .

Lemma 5.1.

Given a call stack S and the associated call chain f_1, …, f_n, if f_n does not satisfy RA-Safety_f, then no f_i with i < n satisfies RA-Safety_f.

Proof : We provide an inductive proof on the call chain. Let us consider the function f_{n-1}. There must be a basic block bb in f_{n-1} ending with a call to f_n. According to the flow function F_b, bb does not satisfy RA-Safety_bb due to the unsafe inter-procedural edge to f_n. Now, according to F_f, f_{n-1} does not satisfy RA-Safety_f since bb is unsafe. Now let us assume f_i does not satisfy RA-Safety_f for some i ≤ n-1. Following a similar argument as before, we can show that f_{i-1} does not satisfy RA-Safety_f. Hence no f_i with i < n satisfies RA-Safety_f. ∎

Now, we do case analysis on the leaf function f_n.

Case 1: f_n satisfies RA-Safety_f. By definition of RA-Safety_f, f_n cannot contain writes to any r_i. Hence, instrumentation elision at f_n will not miss a return address modification to any r_i.

Case 2: f_n does not satisfy RA-Safety_f. We consider two sub-cases here.

Function f_n overwrites r_n: Since the function is not RA-Safe, we do not elide instrumentation at f_n. Therefore, the shadow-stack check at f_n will detect the return address modification.

Function f_n overwrites r_i for some i < n: The check at f_n will not detect this overwrite. But according to Lemma 5.1, f_i does not satisfy RA-Safety_f. Hence, no instrumentation elision is performed at f_i. Therefore, the return address modification at r_i will be detected at f_i’s return.

So far we have proved that the Safe Function Elision Property holds for any leaf function and any non-leaf function with a call to an unsafe function. The final case is a non-leaf function featuring unsafe memory writes but not performing any unsafe child calls. In this case, by definition, the function does not satisfy RA-Safety_f, so its instrumentation will not be elided. Hence, the Safe Function Elision Property holds for any given function. ∎

6 Safe Path Elision

In this section we extend instrumentation elision to the intra-procedural CFG. Given that some of the basic blocks within a function satisfy RA-Safety_bb, certain control-flow paths through the function may be safe. If these safe control-flow paths are taken frequently, we can reduce instrumentation overhead by eliding shadow-stack instrumentation on those paths. For example, a memoized function is one instance where such an invocation pattern is typical: the initial computation may lead to an unsafe memory write, which memoizes the result, but subsequent invocations fetch the memoized value with just a memory read (no unsafe memory write).
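A hypothetical memoized lookup makes the pattern concrete: the frequently taken cache-hit path only reads memory, while the miss path performs the writes (and a call) that the analysis must treat as unsafe.

#include <cstdint>
#include <cstring>

struct Cache { char key[64]; std::uint64_t value; bool valid; };

// Hot path (cache hit): read-only, lies on an RA-Safe control-flow path.
// Cold path (miss): writes through the cache pointer and calls slow_hash,
// which the static analysis must treat as potentially unsafe.
std::uint64_t cached_hash(Cache* c, const char* key,
                          std::uint64_t (*slow_hash)(const char*)) {
  if (c->valid && std::strcmp(c->key, key) == 0)
    return c->value;
  c->value = slow_hash(key);
  std::strncpy(c->key, key, sizeof(c->key) - 1);
  c->key[sizeof(c->key) - 1] = '\0';
  c->valid = true;
  return c->value;
}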

We perform instrumentation lowering in order to push the instrumentation down in the intra-procedural CFG so that it covers any unsafe control-flow path segment. In order to preserve validation-soundness and avoid performance degradations, instrumentation lowering needs to satisfy the following constraints.

  • Lowering must cover every unsafe control-flow path with exactly one shadow-stack check.

  • Shadow-stack operations should match in the correct order when traversing any control flow path (i.e., a stack push must be followed by a stack pop in that order).

1:procedure DoInstrumentationLowering(g, h)
2:     (clone, cloneMap) ← CloneCFG(g)
3:     visited ← ∅
4:     for b in Exits(clone) do
5:          AppendStackPop(b)
6:     DFSVisit(Entry(g), nil, cloneMap, h, visited)
7:
8:procedure DFSVisit(b, e, cloneMap, h, visited)
9:     if b is not RA-Safe then
10:          e' ← RedirectEdge(e, cloneMap[b])
11:          height ← StackHeightAtEntry(b, h)
12:          AppendStackPush(e', height)
13:          return
14:     for out in OutEdges(b) do
15:          if Target(out) ∉ visited then
16:               visited ← visited ∪ {Target(out)}
17:               DFSVisit(Target(out), out, cloneMap, h, visited)

The algorithm DoInstrumentationLowering performs the lowering given an intra-procedural CFG g and a stack-height analysis result h. Here we retain the definitions from Section 3. In addition, we define the following: for a CFG g, let Exits(g) be the set of exit blocks of the function, and let Entry(g) be its entry basic block. Given a basic block b, let OutEdges(b) be the set of intra-procedural outgoing control-flow edges of b.

As the first step we clone the CFG with CloneCFG. It returns a cloned graph and a map from original graph vertices to the cloned vertices. At line 5, we call AppendStackPop which instruments each exit block of the cloned graph with a stack pop.

Next, we traverse the original graph in depth first order and introduce stack push instrumentation at the first unsafe basic block encountered at a given control-flow path. At line 10 we reroute the control flow to transition from the original graph to the cloned graph with RedirectEdge. We then instrument the transition edge with stack push instrumentation by calling AppendStackPush at line 12. AppendStackPush requires the stack height at the basic block entry so that it can properly generate the instructions to retrieve the return address. Once the unsafe node has been processed we immediately return without visiting its children. This guarantees that any given control flow will only encounter a single stack push. Figure 3 shows the resulting CFG after performing instrumentation on fig. 2.

The implementation of DoInstrumentationLowering requires the capabilities of instrumenting control-flow edges, cloning CFG elements including basic blocks and edges, and redirecting control-flow edges. Dyninst supports all these operations through structured CFG editing [3]. In our implementation, we add another pass to count the number of safe control-flow paths. If a function does not contain any safe control-flow path, we do not perform DoInstrumentationLowering to avoid unnecessary code cloning.

Figure 3: Figure 2 after instrumentation lowering has been performed. The original graph is on the left and the cloned graph on the right. The transition edge, shown as a dotted line, features a stack push. All the exit blocks of the cloned graph contain a stack pop at the end.

Safe Path Elision Property

Instrumentation lowering, as presented above, preserves validation-soundness.

Now we prove the above safety property.

Proof : We start by defining the following.

Tainted Region : Let w = b_1, b_2, …, b_m be a run-time graph walk (i.e., cycles allowed) of a given intra-procedural CFG, where each b_i is a basic block and b_1 is the function entry node. Let there be a non-empty set B such that B = { b_i : b_i does not satisfy RA-Safety_bb }. Given B, we now define the tainted region of the above walk as b_t, …, b_m, where b_t is the first basic block not satisfying RA-Safety_bb when traversing the walk starting from the left.

Now we present a couple of lemmas.

Lemma 6.1.

Any entry-to-exit run-time graph walk of an instrumentation-lowered CFG has its tainted region covered by exactly one shadow stack check (i.e., a stack push at the head of the region and a stack pop at the tail).

Proof : DoInstrumentationLowering adds a stack-push-instrumented transition edge to the cloned graph at the first unsafe block of a given control-flow path traversed in depth-first order. Since this unsafe block is, by definition, the head of the tainted region for this path, the stack push occurs just before the start of the tainted region. The rest of the tainted region lies within the cloned graph. Since the cloned graph (a) does not contain back edges to the original CFG that would cause repeated traversals of the transition edge, and (b) does not have any stack push instrumentation, no nested stack pushes are possible. Now, every exit block of the cloned graph is instrumented with a stack pop at the end. Hence, the lowered instrumentation covers the tainted region of any given control-flow graph walk with exactly one matching shadow stack push-pop pair, in that order. ∎

Lemma 6.2.

Any unsafe basic block occurring within a run-time control graph walk must lie within a tainted region.

Proof : Given a graph walk, an unsafe basic block must either be the first unsafe basic block of the walk or a later unsafe basic block. In either case, by definition, the basic block falls within the tainted region of the walk. ∎

Now we consider a run-time call stack S with stack frames s_1, …, s_n and a call chain f_1, …, f_n, defined as in Section 5. Let w_i be the control graph walk executed in function f_i within the context of the stack frame s_i. Now let us consider the case where w_n contains a tainted region. We prove validation-soundness by considering the following two cases.

Case 1: f_n overwrites r_n. Here the return address of the current frame is overwritten within the tainted region. By Lemma 6.1, this mutation will be validated by the shadow stack check which covers the tainted region, (possibly) extended to an exit block (if the last block of the region is not an exit block).

Case 2: f_n overwrites r_i for some i < n. Here we do induction over the call chain. Since w_n contains one or more unsafe basic blocks (by the existence of a tainted region), by definition of RA-Safety_f, f_n does not satisfy RA-Safety_f. Now, given the graph walk w_{n-1} within function f_{n-1}, the block calling f_n contains an inter-procedural edge to the unsafe f_n. Hence, by definition of RA-Safety_bb, that block does not satisfy RA-Safety_bb. Now, by Lemma 6.2, it must fall within a tainted region. Also, by Lemma 6.1, this tainted region (extended until a return block) will be covered with a shadow stack check. Now assume w_j contains a tainted region. With a similar argument as above, we can show that w_{j-1} contains an extended tainted region, including the block calling f_j, which will be covered with a shadow stack check. Hence, by induction, the stack modification at r_i will be validated by the shadow stack check which covers the extended tainted region in w_i.

The above two cases apply to any tainted region of any w_i in a call-stack activation S. ∎

7 Optimizing the Mechanism

In the previous two sections, we presented a new shadow stack instrumentation policy that elides instrumentation without compromising run-time safety. Now we look into how we can improve the instrumentation mechanism, that is, how we can optimize the shadow stack instrumentation itself, in order to obtain further overhead reductions.

7.1 Basic Operations

We implement shadow stack operations in a shared library that we LD_PRELOAD. Shadow stack initialization for the main thread happens in the library's load-time constructor, and child-thread stacks get initialized in the interposed pthread_create function that we provide. Figure 4 illustrates the shadow stack layout, with the segment register gs pointing to the bottom of the stack. Zeroed guard words are used to detect run-time stack underflows, while the scratch area is used to hold a temporary register value in the register frame optimization described in Section 7.2. We assume that the shadow region is protected with information hiding [30]. We discuss the implications of this assumption in Section 10.
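A minimal sketch of the standard LD_PRELOAD interposition pattern for pthread_create is shown below; the initializer name and the trampoline-based argument passing are our own illustration under assumptions, not the runtime's actual code.

#include <dlfcn.h>
#include <pthread.h>

// Hypothetical per-thread initializer provided by the shadow stack runtime.
extern "C" void shadow_stack_init_for_current_thread();

namespace {
struct ThreadArgs { void* (*fn)(void*); void* arg; };

void* thread_trampoline(void* raw) {
  ThreadArgs args = *static_cast<ThreadArgs*>(raw);
  delete static_cast<ThreadArgs*>(raw);
  shadow_stack_init_for_current_thread();  // set up this thread's shadow stack
  return args.fn(args.arg);
}
}  // namespace

// Interposed pthread_create: forwards to the real implementation, but routes
// the new thread through a trampoline that initializes its shadow stack first.
extern "C" int pthread_create(pthread_t* tid, const pthread_attr_t* attr,
                              void* (*fn)(void*), void* arg) {
  using real_fn = int (*)(pthread_t*, const pthread_attr_t*, void* (*)(void*), void*);
  static real_fn real =
      reinterpret_cast<real_fn>(dlsym(RTLD_NEXT, "pthread_create"));
  return real(tid, attr, thread_trampoline, new ThreadArgs{fn, arg});
}

Compiled into a shared object and loaded via LD_PRELOAD, such a definition shadows libpthread's pthread_create for the hardened process.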

Figure 4: Shadow memory region layout. Guard words are zeroed out. Accessing the top of the stack requires a double de-reference. The scratch space can be accessed with a single memory access via gs:0x8.

The push and pop operations mirror regular call stack operations. In order to support exceptions and setjmp/longjmp, we repeatedly pop values from the shadow stack until a match or an underflow occurs. Section 7.1 shows the instruction sequence for push. The pop is similar (though in reverse order and with the loop) and is provided in the appendix for completeness. We also note that it is straightforward to extend our implementation to also save and validate the frame pointer register, which additionally detects attacks related to recursive calls.
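In C-like form, the pop/validate behavior described above is roughly the following (a sketch of the semantics only, not the emitted instruction sequence; the names are ours):

#include <cstdint>
#include <cstdlib>

extern thread_local std::uintptr_t* shadow_sp;  // current shadow stack top

// Validate a return: pop entries until the actual return address matches.
// Entries skipped correspond to frames unwound by exceptions or longjmp;
// reading a zeroed guard word means underflow, i.e., no matching entry exists.
inline void shadow_pop_and_check(std::uintptr_t actual_return_address) {
  for (;;) {
    std::uintptr_t saved = *--shadow_sp;
    if (saved == actual_return_address) return;  // legitimate return
    if (saved == 0) std::abort();                // guard word: underflow
  }
}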

As per Section 7.1 the instruction sequence for a push consists of 9 instructions and 6 memory accesses. The pop operation incurs a similar overhead leading to about 20 instructions and 12 memory accesses per function call from shadow stack instrumentation.

# (Scratch registers and gs offsets shown are representative.)
# Create scratch registers
push   %r11
push   %r10
# Retrieve return address from the stack
mov    0x10(%rsp), %r10
# Retrieve the shadow stack pointer
mov    %gs:0x0, %r11
# Save return address to shadow stack
mov    %r10, (%r11)
# Move and save the shadow stack pointer
lea    0x8(%r11), %r11
mov    %r11, %gs:0x0
# Restore scratch registers
pop    %r10
pop    %r11

Instruction sequence for the basic shadow stack push operation.

7.2 Leaf Function Optimizations

We find that leaf functions often contain a small number of instructions, but are frequently executed, causing significant overhead. Therefore, we design two novel optimizations for leaf functions.

Register Frame Optimization: Instead of storing the return address to the shadow stack, we use a free register to implement a register stack frame. Section 7.2 shows an example of a register frame push operation. Compared to the basic stack push operation shown in  Section 7.1, a register frame push operation contains only 2 instructions.

We make two observations for the first instruction in Section 7.2. First, we perform a register usage analysis of the leaf function to identify a free general purpose register (GPR). Note that we still need to save and restore the value of the chosen GPR because the compiler may not always emit ABI-compliant code (due to hand-coded assembly, etc.). Second, we save the chosen GPR to the scratch space in the shadow memory region, which is isolated from the original program. It is unsafe to save the chosen GPR on the execution stack: the save and restore of the GPR are separated by an unsafe memory write; otherwise, we would have elided instrumentation in this function. If the attacker is aware of our defense, the attacker may control the unsafe write to overwrite the saved GPR on the stack and thus control this GPR after its restore.

# (The register %r11 is representative.)
# Save the register to the scratch space in
# the shadow memory region
mov    %r11, %gs:0x8
# Retrieve return address
mov    (%rsp), %r11

Instruction sequence for the register frame push operation.

Inlining Leaf Functions: We found that some small leaf functions are still not inlined even at the -O3 optimization level. Inlining such function calls can completely remove the corresponding instrumentation. To ensure correctness, we still instrument the original leaf function in order to catch any indirect calls to it. As it is difficult to implement a general function inlining mechanism for binary rewriting, we inline every direct function call to a leaf function that contains straight-line code.

7.3 Dead Register Chasing

The basic shadow stack push and pop operations each need two GPRs as scratch space. We leverage register liveness analysis to identify dead registers and, if available, use them as scratch space, thus avoiding context saves for the scratch registers.

We observe that we can move our instrumentation to realize more dead registers compared to fixing our instrumentation at function entry and exit. For example, in Section 7.3 if we instrument at the function entry, %rbx and %r10 are live because their values are then pushed to the stack. If we move our shadow stack push operation after the push instruction sequence, we have two dead registers. In this scenario, we need to adjust the offset for retrieving return address from the stack as we have pushed two GPRs to the stack. Currently, we only move instrumentation within the same basic block and we cannot move instrumentation beyond unsafe memory writes or unsafe function calls.

A:
    # No dead registers at function entry
    push   %rbx
    push   %r10
    # <- %rbx and %r10 are dead here; instrumentation moved to this point
    #    must fetch the return address from 0x10(%rsp) instead
    mov    $0x1234, %rbx
    mov    ...

An example showing the benefits of moving instrumentation for more dead registers.

8 Improving Dyninst

We chose Dyninst as the instrumentation tool for implementing our shadow stack because it provides the capabilities required for implementing safe path elision, and it offers a unified framework for applying binary analysis to optimize the instrumentation.

However, we observed that the baseline overhead of Dyninst (i.e., with empty instrumentation) is prohibitive, with overheads of more than 100% in some applications. In this section, we highlight some key improvements that we made to Dyninst in order to reduce this baseline overhead.

First, we briefly summarize Dyninst’s operation and define some terminology. When a user instruments a function f, Dyninst will copy the entire body of f to a newly allocated memory region. We call the copied f the relocated function, which will hold the instrumentation. When relocating a function, Dyninst needs to adjust instructions in f to compensate for the new execution location, including PC-relative addressing and return addresses on the stack for exception unwinding. Dyninst then installs trampolines in the original f to transfer execution from the original f to the relocated f. Lastly, Dyninst adds its run-time library libdyninstAPI_RT.so as a dependency of the rewritten binary, which provides some runtime routines needed by Dyninst instrumentation.

8.1 Improving Exception Unwinding

Dyninst currently compensates for exception unwinding by emulating a function call [4]. As function calls in instrumentation are in a new memory region, there will be no .eh_frame information corresponding to these new addresses for unwinding the stack, nor .gcc_except_table information for finding the code blocks that catch exceptions. Therefore, a three-instruction sequence, call next_insn; add [%rsp], $off; jmp $target, is used to emulate a call instruction call $target, where call next_insn pushes the current PC to the stack and $off represents the address distance between the relocated call instruction and the original call instruction. In this way, Dyninst ensures that the instrumentation is transparent to stack unwinding and catching exceptions.

Unfortunately, such call emulation significantly increases the overhead of making a function call. As Dyninst has to emulate every function call whose callee may throw an exception, a large number of function calls are emulated for C++ programs. Empirically, we observed call emulation caused about 30% overhead in some of the SPEC CPU 2017 C++ benchmarks.

To remove the overhead of call emulation, we observe that exception unwinding is not a common operation. So, we defer the translation from relocated PC to original PC from call time to stack unwind time. Specifically, we first compile libunwind with exception unwinding support and pre-load libunwind to ensure exception unwinding uses libunwind instead of libstdc++.so. We then wrap the x86_64_step function from libunwind, which unwinds a stack frame given the register state. We wrap this function in libdyninstAPI_RT.so to translate the recovered return address from relocated code to original code. This additional translation cost is negligible compared to other stack unwinding operations. Hence our improved exception unwinding has negligible overhead.

8.2 Trampoline Installation and Execution

Dyninst currently installs trampolines at the function entry block, the target blocks of a jump table, and blocks that catch exceptions. The binary analysis component in Dyninst is responsible for identifying these blocks [23]. These trampolines are critical for the correctness of instrumentation because indirect calls, indirect jumps, and exception unwinding will cause the control flow to go back to executing the original code. In most cases, a trampoline is implemented as a direct branch, which is a five-byte-long instruction on x86-64. In cases where there is not enough space for a five-byte branch instruction, Dyninst installs a one-byte breakpoint instruction and registers a signal handler to transfer the execution. The signal handler and the signal registration are in libdyninstAPI_RT.so.

We observed two performance problems caused by trampolines. First, frequent execution of breakpoint instructions is prohibitively expensive. Second, frequent execution of trampolines indicated that the execution was bouncing between the original and relocated code, which increased the i-cache pressure. To address these two problems, we implemented the following optimizations:

Multi-branch trampoline: On x86-64, the five-byte branch has a 32-bit displacement, which is typically large enough to branch to relocated code. The idea of a multi-branch trampoline is to make use of the two-byte branch instruction, which has only an 8-bit displacement and enables us to jump to nearby padding bytes, which may hold a five-byte branch instruction. In practice, the compiler frequently emits padding bytes to align branch targets and function starts. So, this optimization effectively reduces the use of the breakpoint trampoline.

Jump table rewriting: Dyninst statically resolves jump table targets by calculating a symbolic expression for the indirect jump target [23]. Using the symbolic expression, we can plug in the addresses of relocated target blocks and solve for the values that should be written into the jump table. The rewritten jump tables will ensure that the execution will stay in relocated code. With jump table rewriting, there is no need to install trampolines at jump table target blocks any more.

Function pointer rewriting: Indirect calls will redirect execution back to the original code and thus require trampolines at function entries. We reduce the execution frequency of such trampolines by rewriting function pointers identified in code and data sections. For this, we scan each quad-word (8 bytes) in the .data and .rodata sections. If a quad-word matches the entry address of an instrumented function, we modify the quad-word to point to the entry of the relocated code. Similarly, we scan each instruction in .text; if we find an immediate operand or a PC-relative addressing operand that matches an instrumented function entry, we adjust it accordingly. This optimization does not guarantee that all pointers in the program are adjusted, so it is still required to install trampolines at all instrumented function entries. However, it reduces the number of times that trampolines are executed.
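A C++ sketch of the data-section part of this scan might look as follows (our simplification; the actual pass also rewrites matching immediates and PC-relative operands in .text, as described above):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Maps original function entry addresses to their relocated entry addresses.
using RelocationMap = std::unordered_map<std::uint64_t, std::uint64_t>;

// Scan a writable copy of a data section (.data / .rodata) one quad-word at a
// time; any value equal to an instrumented function's entry is rewritten to
// point at the relocated copy. Returns the number of pointers rewritten.
std::size_t RewriteFunctionPointers(std::vector<std::uint8_t>& section,
                                    const RelocationMap& relocated) {
  std::size_t rewritten = 0;
  for (std::size_t off = 0; off + sizeof(std::uint64_t) <= section.size();
       off += sizeof(std::uint64_t)) {
    std::uint64_t value;
    std::memcpy(&value, section.data() + off, sizeof(value));
    auto it = relocated.find(value);
    if (it == relocated.end()) continue;
    std::memcpy(section.data() + off, &it->second, sizeof(std::uint64_t));
    ++rewritten;
  }
  return rewritten;
}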

8.3 Customized Instrumentation Setup

Dyninst often emits additional instructions around the instrumentation to ensure the instrumentation does not break the original program: saving and restoring register state (GPRs and XMM registers), adjusting for the stack red zone, aligning the stack pointer, and guarding against recursive instrumentation. While Dyninst has internal heuristics to skip certain steps, they do not satisfy our needs. We observed that Dyninst may emit as many as 100 additional instructions per instrumentation site. So we modified Dyninst to allow users to specify which steps are necessary.

As shown in Section 7.2 and Section 7.3, our instrumentation sequences properly save and restore the register state they use. Moreover, our instrumentation does not call any other functions, so there is no need to guard against recursive instrumentation or to align the stack pointer. Since our leaf function instrumentation does not write to the stack, there is no need to protect the stack red zone. For non-leaf functions, when we instrument at function entry or exit, our instrumentation is correct because the red-zone space is not used for passing parameters or return values. When we find that a non-leaf function contains a stack red-zone write, we apply safe path elision for this function only if we find two dead registers, to avoid writes to the stack. In summary, we configure Dyninst to emit no additional instructions, without hurting the correctness of instrumentation.

9 Evaluation

We evaluated our techniques with the SPEC CPU 2017 standard benchmarks and with a couple of real-world server applications (Redis and the Apache HTTP server). The benchmarks were run on a machine with an Intel(R) Xeon(R) CPU E5-2695 at 2.10GHz and 128GB memory, running Red Hat 7. We used Dyninst 10.1 for carrying out the shadow stack instrumentation.

9.1 SPEC CPU 2017 Results and Analysis

The SPEC CPU 2017 benchmarks were compiled with GCC-6.4.0, with the default optimization level (-O3). The benchmarks were run five times each. Figure 5 shows the overheads for SPEC CPU 2017. FULL mode is the baseline unoptimized shadow stack with no instrumentation elision. LIGHT mode corresponds to the fully optimized shadow stack with all policy and mechanism optimizations applied. Our optimizations yield significant overhead reductions over the baseline. LIGHT mode has a 2.4% geometric mean overhead and a 14.1% maximal overhead (600.perlbench_s), while FULL mode has an 8.1% geometric mean overhead and a 26.8% maximal overhead (620.omnetpp_s).

Figure 5: SPEC 2017 results. FULL mode represents the results of instrumenting every function without any of our optimizations. LIGHT mode represents the results of instrumentation with all optimizations enabled.

Policy vs. Mechanism

In order to break down the impact of the various policy and mechanism optimizations, we selected seven high overhead benchmarks from the suite and measured the overhead with a subset of optimizations enabled. Figure 6 shows the effects of policy vs. mechanism optimizations. We make three observations.

First, EMPTY represents Dyninst’s run-time overhead due to trampolines and relocated code. We measure this by instrumenting all functions with an empty instrumentation. With our improvements in Section 8, the EMPTY overhead is now less than 2% on average (down from over 100% in some cases). We believe these improvements are necessary for making Dyninst acceptable for binary hardening use cases.

Second, we observe that safe function elision (SFE) always yields an overhead reduction. Also, in most cases safe path elision (SPE) complements SFE with roughly cumulative overhead reductions. However, we attribute the small overhead jumps with SPE to the way edge instrumentation is currently implemented in Dyninst. At present, Dyninst introduces the stack push on the transition edge (see fig. 3) as a separate basic block and does not inline it into the source block even when it is possible to do so, which results in an additional branch.

Third, we observe that our mechanism optimizations (MO) also yield significant overhead reductions, even more so than the policy optimizations (PO). But more importantly, combined together, PO and MO result in overhead reductions greater than when individually applied (except for 631.deepsjeng_s).

Figure 6: Results for seven high overhead benchmarks in SPEC 2017 with partial optimization enabled. EMPTY represents relocating all functions with empty instrumentation. FULL and LIGHT are copied from  Figure 5 for reference. SFE represents applying only safe function elision described in  Section 5. PO represents applying SFE and safe path elision described in  Section 6. MO represents our shadow stack mechanism optimizations described in  Section 7.

In Figure 5 and Figure 6, we observed several negative overhead numbers for certain benchmarks, such as 621.wrf_s in FULL mode and 657.xz_s in LIGHT mode. Our optimizations not only change the policy and mechanism of shadow stacks, but also change the layout of the executable, which can impact i-cache efficiency and branch prediction accuracy. We believe such a layout change can accidentally improve performance. It would be interesting to investigate the interaction between instrumentation-induced code layout changes and execution efficiency; other existing work has also reported negative overheads for a small subset of benchmarks [5]. We leave this investigation as future work.

Individual Optimizations

Next, we investigate the exclusive contribution to overhead reduction from each optimization. The exclusive contribution can be measured by disabling one optimization with all other optimizations enabled. The results are shown in Figure 7. We make the following observations.

First, each optimization makes a unique contribution to reducing overhead, even though in certain cases an optimization may have a slight negative impact on overhead reduction. We already explained how SPE may lead to extra branch instructions and, thus, jumps in overhead. Binary function inlining may cause extra i-cache pressure. However, taken as a whole, all optimizations lead to a positive overhead reduction.

Second, the overhead reductions from different optimizations are not additive. For example, SPE and DRC may conflict: SPE moves instrumentation from function entry to a later position in the function. It is possible that we have sufficient dead registers at function entry, but no dead registers where the instrumentation is lowered. If the safe control-flow paths are infrequently executed, this lowering may result in a performance loss. Path execution frequency is a dynamic property which can only be obtained via a path profile [2]. This leads to the idea of profile-guided instrumentation, where instrumentation can be made optimal with respect to both safe control-flow paths and dead registers using profile data. We leave this as future work.

Finally, safe function elision is not shown separately in this figure, as safe path elision is defined based on the assumption that safe function elision is enabled.

Figure 7: Results for seven high overhead benchmarks in SPEC 2017 with partial optimization enabled. LIGHT results are copied from  Figure 5 for reference. Safe path elision, register frame, binary function inlining, and dead register chasing are denoted as SPE, RF, INLINE, and DRC, respectively.

Applicability of Optimizations

To better understand the applicability of our optimizations, we also collected statistics on how many functions each major optimization applies to. Table 1 shows that our policy optimizations (SFE + SPE) cumulatively apply to about 35% of functions on average. Nevertheless, they result in significant overhead reductions since these optimizations capture frequently executed leaf and near-leaf functions.

Benchmark SFE SPE RF Total
600.perlbench_s 14% 31% 1% 2336
602.gcc_s 20% 34% 2% 12157
603.bwaves_s 2% 2% 0% 42
605.mcf_s 19% 21% 12% 43
607.cactuBSSN_s 35% 15% 1% 2716
619.lbm_s 3% 7% 3% 29
620.omnetpp_s 18% 10% 5% 5434
621.wrf_s 24% 8% 33% 7274
623.xalancbmk_s 30% 13% 7% 3875
625.x264_s 13% 17% 27% 540
628.pop2_s 11% 12% 7% 2007
631.deepsjeng_s 37% 17% 12% 115
638.imagick_s 23% 25% 1% 2351
641.leela_s 29% 17% 11% 335
644.nab_s 13% 20% 6% 236
648.exchange2_s 14% 14% 0% 14
649.fotonik3d_s 7% 5% 4% 125
654.roms_s 45% 10% 4% 333
657.xz_s 25% 27% 5% 362
mean 20% 15% 7%
Table 1: Function instrumentation statistics. The “SFE”, “SPE”, and “RF” columns present the percentages of the total functions to which safe function elision, safe path elision, and the register frame optimization are applied. All optimizations are non-overlapping.

Memory Write Classification

Our safe path and function elision are based on RA-Safety, which uses stack-height analysis in order to determine unsafe memory writes. An unsafe memory write or an indirect function call will cause a basic block to be unsafe, thus ruling out both safe path and safe function elision. Table 2 breaks down the percentages of different types of memory writes in each benchmark. The “Global” column represents memory writes that write to a statically determined address; these memory writes are typically PC-relative addressed, with an address in the .data or .rodata sections. Such global writes are guaranteed to not overwrite any return addresses. The “Unsafe” column combines several types of memory writes, including (a) stack writes with negative stack heights (i.e., writes to a non-local frame) and (b) memory writes which our stack analysis was unable to classify as stack writes. In practice, this second category may contain heap memory writes, writes to memory references passed via function arguments, etc. Our current analysis yields unsafe write percentages ranging from 6% to 66%. We can potentially improve the effectiveness of safe function elision and safe path elision by improving the precision of this memory classification by:

  • incorporating a heap points-to analysis so that heap writes can be ruled out from the unsafe set

  • extending the analysis to be inter-procedural so that heap and stack references passed as arguments can be tracked and ruled out as safe at access sites.

We leave this as future work.

Benchmark Stack Global Unsafe Total
600.perlbench_s 49% 9% 42%
602.gcc_s 55% 9% 36%
603.bwaves_s 67% 0% 33%
605.mcf_s 32% 1% 66%
607.cactuBSSN_s 55% 9% 36%
619.lbm_s 81% 1% 18%
620.omnetpp_s 49% 3% 48%
621.wrf_s 82% 2% 16%
623.xalancbmk_s 56% 1% 43%
625.x264_s 38% 0% 62%
628.pop2_s 76% 4% 20%
631.deepsjeng_s 44% 9% 47%
638.imagick_s 53% 0% 47%
641.leela_s 43% 0% 57%
644.nab_s 46% 13% 40%
648.exchange2_s 52% 3% 45%
649.fotonik3d_s 86% 8% 6%
654.roms_s 84% 4% 12%
657.xz_s 37% 1% 62%
Table 2: Memory write classification. The “Stack” column represents safe stack writes which only access the local stack frame. The “Global” column represents writes to globals. “Unsafe” covers the rest of the memory writes.

9.2 Server Benchmark Results and Analysis

We also measured the overhead of shadow stack protection on a couple of server benchmarks. We use the throughput reduction as the main overhead metric for this evaluation. In these experiments, we only instrument the server binary; we do not instrument the benchmark client. We used the following server setups:

Redis: We compiled Redis-5.0.7 [32] with GCC-6.4.0 at its default optimization level (-O2). Redis provides a benchmark client, redis-benchmark. We ran it with the default input parameters, which means 50 parallel connections and 100,000 total requests. The experiment was repeated 10 times. Table 3 presents the results for the LIGHT mode and the FULL mode. The results show that our optimizations significantly reduce the overhead; with LIGHT mode, the largest throughput reduction is 7.07%, for the GET operation.

Operation LIGHT FULL
PING_INLINE -1.76% 12.70%
PING_BULK 5.48% 13.41%
SET 3.24% 7.48%
GET 7.07% 19.24%
INCR 0.84% 4.91%
LPUSH -6.81% 16.12%
RPUSH -7.84% 10.88%
LPOP -5.25% 17.43%
RPOP -5.98% 12.64%
SADD 4.21% 16.42%
HSET -1.15% 16.47%
SPOP -6.41% 23.95%
LPUSH -3.42% 16.88%
LRANGE_100 -2.38% 12.25%
LRANGE_300 0.83% 15.41%
LRANGE_450 0.74% 13.77%
LRANGE_600 1.11% 11.82%
MSET (10 keys) 1.71% 12.99%
Table 3: Redis benchmark throughput reduction results. The numbers are averaged over 10 runs.

Apache HTTP Server: We ran ab (the Apache HTTP server benchmarking client) against the httpd server instances, transmitting a 150KB HTML file with 1, 4, and 8 clients. Table 4 shows that our optimizations are effective for reducing overhead.

No. of clients Latency increase Throughput decrease
LIGHT FULL LIGHT FULL
1 1.0% 6.0% 1.0% 7.0%
4 -0.7% 2.0% 0.3% 3.0%
8 -0.8% 0.5% -1.0% 2.0%
Table 4: Apache HTTP server benchmark results.

9.3 RIPE64 Benchmark

RIPE [34] is a runtime intrusion prevention evaluator implementing a large set of buffer overflow attacks. RIPE64 [13] is a port of the original 32-bit implementation to 64 bits. Among all attacks implemented in RIPE64, 24 attacks are triggered by overflowing a buffer on the stack to overwrite a return address on the stack. Without our instrumentation, 23 attacks always succeeded and 1 attack often succeeded. With our instrumentation, all 24 attacks always failed. This further validates our theoretical result about the soundness of RA-Safety-based instrumentation elision.

10 Discussion and Related Work

Memory Protection: We assume that the shadow region is protected via information hiding [30], similar to previous literature [5, 9]. However, recent works [24, 18, 15] demonstrate that information hiding is insufficient to prevent information leakage even on 64-bit systems. Unfortunately, current schemes for implementing safe memory regions on x64 incur high overheads [20]. Implementing cheap, safe memory regions on x64 remains an interesting direction for future research.

Hardware implementations: The upcoming Intel Control-flow Enforcement Technology (CET) [19] includes a hardware shadow stack. However, it will be at least several years before CET-enabled hardware reaches mass adoption, since no CET-enabled hardware has been released yet. Even then, it will remain important to protect programs running on legacy hardware. Intel Processor Trace (PT) has also been shown to be effective for implementing shadow stacks [16].

Compiler protections: Our techniques could also be implemented as a compiler protection. In fact, some compilers have recently implemented shadow stack protection [12]. Also, all major compilers feature -fstack-protector stack canary protection [33] that leverages certain heuristics in order to skip the canary checks in certain functions [11]. However, to our knowledge, ours is the first work to explore an instrumentation-elision policy with a well-defined threat model and a theoretical study of soundness. If implemented within a compiler, our instrumentation policies may yield greater overhead reductions, since compiler-based memory analysis and indirect call analysis [21] can be more precise than our binary analyses.

Binary instrumentation: We use static binary rewriting as opposed to run-time binary instrumentation, which can be applied without a pre-processing step. Even though there has been some recent work on lightweight run-time instrumentation [14, 7, 8], since we still need to carry out heavyweight static analyses as part of the instrumentation, we move all of the instrumentation tasks to before run time.

Static analysis performance: Our current static analyses finish in a few minutes. With more precise binary analyses of memory writes and of indirect calls such as C++ virtual function calls [28, 26, 36], analysis time may become a concern. Dyninst has started parallelizing its binary analysis with multi-threading [22], which is promising for addressing increased analysis time.

11 Conclusion

In this paper we presented several new techniques for reducing shadow stack overhead, inspired by the design principle of separating mechanism and policy. We demonstrated that it is possible to derive an instrumentation policy that elides instrumentation on provably safe code regions. We also demonstrated how this elision policy complements improvements to the instrumentation mechanism, which we realized through several novel code optimizations. Our techniques are general and can be used to improve the performance of existing compiler-based and other software shadow stacks.

References

  • [1] Martín Abadi, Mihai Budiu, Úlfar Erlingsson, and Jay Ligatti. Control-flow integrity principles, implementations, and applications. ACM Transactions on Information and System Security (TISSEC), 13(1):4, 2009.
  • [2] Thomas Ball and James R Larus. Efficient path profiling. In Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29, pages 46–57. IEEE, 1996.
  • [3] Andrew R. Bernat and Barton P. Miller. Structured binary editing with a CFG transformation algebra. In 2012 19th Working Conference on Reverse Engineering (WCRE), pages 9–18, Kingston, ON, Canada, Oct. 2012.
  • [4] Andrew R. Bernat, Kevin A. Roundy, and Barton P. Miller. Efficient, sensitivity resistant binary instrumentation. In The International Symposium on Software Testing and Analysis (ISSTA), Toronto, Canada, July 2011.
  • [5] Nathan Burow, Xinping Zhang, and Mathias Payer. SoK: Shining light on shadow stacks. In 2019 IEEE Symposium on Security and Privacy (SP), pages 985–999. IEEE, 2019.
  • [6] Nicholas Carlini, Antonio Barresi, Mathias Payer, David Wagner, and Thomas R Gross. Control-flow bending: On the effectiveness of control-flow integrity. In 24th USENIX Security Symposium (USENIX Security 15), pages 161–176, 2015.
  • [7] Buddhika Chamith, Bo Joel Svensson, Luke Dalessandro, and Ryan R Newton. Living on the edge: rapid-toggling probes with cross-modification on x86. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 16–26, 2016.
  • [8] Buddhika Chamith, Bo Joel Svensson, Luke Dalessandro, and Ryan R Newton. Instruction punning: Lightweight instrumentation for x86-64. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 320–332, 2017.
  • [9] Thurston HY Dang, Petros Maniatis, and David Wagner. The performance cost of shadow stacks and stack canaries. In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, pages 555–566. ACM, 2015.
  • [10] Dyninst Developers. Dataflowapi programmer’s guide. https://dyninst.org/sites/default/files/manuals/dyninst/dataflowAPI.pdf, 2016. [Online; accessed 13-February-2020].
  • [11] GCC Developers. "Strong" stack protection for GCC, 2014 (accessed February 11, 2020).
  • [12] LLVM Developers. Shadow Call Stack, 2019 (accessed February 11, 2020).
  • [13] RIPE64 Developers. RIPE64: a 64-bit port of the Runtime Intrusion Prevention Evaluator, 2019 (accessed February 11, 2020).
  • [14] Alexis Engelke and Josef Weidendorfer. Using llvm for optimized lightweight binary re-writing at runtime. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 785–794. IEEE, 2017.
  • [15] Dmitry Evtyushkin, Dmitry Ponomarev, and Nael Abu-Ghazaleh. Jump over ASLR: Attacking branch predictors to bypass ASLR. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–13. IEEE, 2016.
  • [16] Xinyang Ge, Weidong Cui, and Trent Jaeger. Griffin: Guarding control flows using intel processor trace. In Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Xi’an, China, April 2017.
  • [17] Hector Marco Gisbert and Ismael Ripoll. On the effectiveness of NX, SSP, RenewSSP, and ASLR against stack buffer overflows. In 2014 IEEE 13th International Symposium on Network Computing and Applications, pages 145–152. IEEE, 2014.
  • [18] Enes Göktaş, Robert Gawlik, Benjamin Kollenda, Elias Athanasopoulos, Georgios Portokalidis, Cristiano Giuffrida, and Herbert Bos. Undermining information hiding (and what to do about it). In 25th USENIX Security Symposium (USENIX Security 16), pages 105–119, 2016.
  • [19] Intel. Control-flow Enforcement Technology Specification, 2019 (accessed February 12, 2020).
  • [20] Koen Koning, Xi Chen, Herbert Bos, Cristiano Giuffrida, and Elias Athanasopoulos. No need to hide: Protecting safe regions on commodity hardware. In Proceedings of the Twelfth European Conference on Computer Systems, pages 437–452, 2017.
  • [21] Kangjie Lu and Hong Hu. Where does it go? refining indirect-call targets with multi-layer type analysis. In 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS), London, United Kingdom, Nov. 2019.
  • [22] Xiaozhu Meng, Jonathon M. Anderson, John Mellor-Crummey, Mark W. Krentel, Barton P. Miller, and Srđan Milaković. Parallelizing binary code analysis, 2020.
  • [23] Xiaozhu Meng and Barton P. Miller. Binary code is not easy. In The International Symposium on Software Testing and Analysis (ISSTA), Saarbrücken, Germany, July 2016.
  • [24] Angelos Oikonomopoulos, Elias Athanasopoulos, Herbert Bos, and Cristiano Giuffrida. Poking holes in information hiding. In 25th USENIX Security Symposium (USENIX Security 16), pages 121–138, 2016.
  • [25] Paradyn Project. Dyninst: Putting the Performance in High Performance Computing, http://www.dyninst.org.
  • [26] Andre Pawlowski, Moritz Contag, Victor van der Veen, Chris Ouwehand, Thorsten Holz, Herbert Bos, Elias Athanasopoulos, and Cristiano Giuffrida. MARX: Uncovering class hierarchies in C++ programs. In 24th Annual Symposium on Network and Distributed System Security (NDSS), San Diego, California, USA, Feb. 2017.
  • [27] Ryan Roemer, Erik Buchanan, Hovav Shacham, and Stefan Savage. Return-oriented programming: Systems, languages, and applications. ACM Transactions on Information and System Security (TISSEC), 15(1):2, 2012.
  • [28] Edward J. Schwartz, Cory F. Cohen, Michael Duggan, Jeffrey Gennari, Jeffrey S. Havrilla, and Charles Hines. Using logic programming to recover C++ classes and methods from compiled executables. In 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS), Toronto, Canada, Nov. 2018.
  • [29] Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. SoK: Eternal war in memory. In 2013 IEEE Symposium on Security and Privacy, pages 48–62. IEEE, 2013.
  • [30] PaX Team. PaX address space layout randomization (ASLR), 2003.
  • [31] Qualys Research Team. The Stack Clash, 2017 (accessed November 11, 2019).
  • [32] The Redis Project. An open source, in-memory data structure store, https://redis.io/.
  • [33] Perry Wagle, Crispin Cowan, et al. StackGuard: Simple stack smash protection for GCC. In Proceedings of the GCC Developers Summit, pages 243–255. Citeseer, 2003.
  • [34] John Wilander, Nick Nikiforakis, Yves Younan, Mariam Kamkar, and Wouter Joosen. RIPE: Runtime intrusion prevention evaluator. In Proceedings of the 27th Annual Computer Security Applications Conference (ACSAC). ACM, 2011.
  • [35] Rafal Wojtczuk. The advanced return-into-lib (c) exploits: Pax case study. Phrack Magazine, Volume 0x0b, Issue 0x3a, Phile# 0x04 of 0x0e, 2001.
  • [36] Chao Zhang, Chengyu Song, Kevin Zhijie Chen, Zhaofeng Chen, and Dawn Song. VTint: Protecting virtual function tables' integrity. In 22nd Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, Feb. 2015.

Appendix A Appendix

The following shows the structure of the full shadow stack check performed before a function return; it unwinds the shadow stack until the on-stack return address matches a shadow entry. Register choices and the thread-local location of the shadow stack pointer are shown representatively.

    # Create scratch registers
    push   %r10
    push   %r11
    # Load the shadow stack pointer
    mov    %gs:0x0, %r10
    L3:
    # Move back the shadow stack pointer
    lea    -0x8(%r10), %r10
    # Load the value from shadow stack
    mov    (%r10), %r11
    # Store the shadow stack pointer
    mov    %r10, %gs:0x0
    # Compare return address with
    # the one in shadow stack
    cmp    0x10(%rsp), %r11
    je     L1
    # Keep comparing until reaching
    # the bottom of the shadow stack
    cmpl   $0x0, (%r10)
    jne    L3

With the register frame optimization, the fast path compares the on-stack return address against the value held in a register; only on a mismatch does the check fall back to unwinding the in-memory shadow stack. The register holding the frame (shown here as %r15), the scratch registers, and the shadow stack pointer location are representative.

    # Compare return address on the stack
    # with the value in the register frame
    cmp    (%rsp), %r15
    je     L1
    # Create scratch registers for unwinding
    # the shadow stack
    push   %r10
    push   %r11
    # Load the shadow stack pointer
    mov    %gs:0x0, %r10
    L4:
    # Move back the shadow stack pointer
    lea    -0x8(%r10), %r10
    # Load the value from shadow stack
    mov    (%r10), %r11
    # Store the shadow stack pointer
    mov    %r10, %gs:0x0
    # Compare return address with current
    # shadow stack frame
    cmp    0x10(%rsp), %r11
    je     L2
    # If we reach the bottom of the shadow
    # stack, an unmatched return address has
    # been found, triggering a SIGILL
    cmpl   $0x0, (%r10)
    jne    L4