Spatial safety violations are the root cause of many security attacks (Shacham, 2007; Bletsch et al., 2011; One, 1996; Nergal, 2001; Checkoway et al., 2009; Evans et al., 2015; Strackx et al., 2009; Conti et al., 2015; Biondo et al., 2018). Attackers can exploit spatial safety bugs to hijack an application’s control flow or steal sensitive information (e.g., passwords). Beyond security issues, spatial safety is also important to ensure the expected behavior of an application. For example, unintentionally accessing an out-of-bounds location can cause unintended behavior or program crashes that are hard to debug.
Spatial safety is just one aspect of reliability. Managed languages, such as Java and C#, offer better reliability by providing complete (spatial and temporal) memory and type safety. However, C doesn’t guarantee any of these safety properties by default. Despite the lack of memory safety, C and C++ are still preferred over managed languages for systems applications because managed languages are less efficient. Consequently, many performance-sensitive applications are still vulnerable to security exploits and are therefore not reliable. In this work, we propose a mechanism to enforce spatial safety for C applications.
Several techniques have been proposed to enforce spatial safety for C/C++ applications. At a high level, these techniques can be categorized into pointer-based (Necula et al., 2002; Condit et al., 2003; Nagarakatte et al., 2009; Jim et al., 2002; Austin et al., 1994; Xu et al., 2004) and object-based (Jones and Kelly, 1997; Ruwase and Lam, 2004; Dhurjati and Adve, 2006; Kuvaiskii et al., 2017; Akritidis et al., 2009; Younan et al., 2010; Duck and Yap, 2016; Duck et al., 2017) approaches.
Pointer-based approaches track the bounds of sub-objects and can detect sub-object overflows. Even with hardware support (12; 17), these approaches incur high CPU and memory overheads because they need to store and update bounds information for every pointer. Oleksenko et al. (Oleksenko et al., 2018) have reported around 75% CPU and 125% memory overheads for SPEC benchmarks for the Intel MPX (17) implementation of the ICC compiler.
In object-based approaches, spatial safety checks ensure that the memory access using a pointer is within the heap/stack/global allocation bounds. These approaches have low memory overheads because they don’t need to store the bounds for every pointer. Finding the base or limit of an object using a pointer is challenging because the pointer can be an internal address of an object. Initial approaches (Jones and Kelly, 1997; Ruwase and Lam, 2004; Dhurjati and Adve, 2006) used a splay-tree-based lookup to check if the pointer points to a location within object bounds. For efficiency, later works (Akritidis et al., 2009; Younan et al., 2010; Duck and Yap, 2016; Duck et al., 2017) enforced spatial safety at loose allocation bounds rather than the actual allocation bounds. These works pad the actual allocation size to satisfy an alignment property and keep track of alignment instead of the actual allocation size. Figure 1 shows the object layout in these approaches. An important drawback of these approaches is that they allow applications to access the padded area that is not within the actual bounds of the objects. This would allow unintended behavior that may remain undetected.
Our goal is to enforce bounds checking for the actual bounds of the object. We call this property complete object-bounds protection. SGXBounds (Kuvaiskii et al., 2017) is so far the most efficient technique (41% CPU and 0.4% memory overheads for SPEC CPU2006) that provides complete object-bounds protection, but it restricts the application’s usable address space to 32-bit on a 64-bit platform. SGXBounds uses the remaining 32 bits to store the upper bound of the object. This allows SGXBounds to compute the upper bound directly from the pointer itself without any expensive search. The fundamental weakness of this approach is that it can’t support larger address space because the upper bound can’t fit in the unused bits of the virtual address. As a consequence, the application’s address space gets restricted to 32-bit.
We propose CGuard, a framework that provides object-bounds protection without restricting the virtual address space. Figure 2 shows the layout of an object and pointer in our scheme in a general case. CGuard stores the size of an object before the base address of the object and attaches a tag to every pointer to efficiently locate the base address. CGuard uses the spare 16 bits of a virtual address available in the x86_64 hardware to store the tag. In the tag, CGuard stores the relative offset of the pointer with respect to the base address of the object referred to by the pointer. To find the base address, CGuard simply subtracts the offset from the pointer value. For objects whose offsets cannot fit in the spare bits, CGuard uses a custom allocator that allows it to find the base of an object in just one memory access. However, unlike SGXBounds, CGuard needs to update the offset in the tag on pointer arithmetic. CGuard performs static analysis to reduce the number of tag updates. The mean CPU and memory overheads incurred by CGuard for SPEC CPU2017 are 42.1% and 1.1%, respectively.
Spatial safety mechanisms for managed languages are well understood. The size of an object is stored along with the object. Managed languages don’t allow pointer arithmetic, enabling the mechanisms to discover the size of the object at all program points statically. On the other hand, C allows programmers to create interior pointers, store them in memory, pass them to other routines, and return them to a caller. This makes the static tracking of base pointers very hard. In our approach, the tag information needs to be updated only if a statically known potential interior pointer escapes the static scope. Thus, our scheme allows programmers to control the overhead of spatial safety. If the usage of interior pointers is restricted to the static scope, our tagged pointers are equivalent to normal pointers, and the spatial safety handling mechanism is similar to that of a managed language.
In summary, we make the following contributions.
An approach based on pointer tagging to provide object-bounds protection for C applications at low overheads.
An LLVM-based implementation and performance evaluation for the SPEC, Phoenix, and Apache webserver benchmarks.
Detection and reporting of bugs in the SPEC CPU2017 and Phoenix-2.0 benchmark suites.
As described earlier, CGuard stores the size of an object in an object-header placed just before the base address of the object. To compute the object bounds, we need to find the base address of the object at runtime. Consider the following example:
In this example, the argument arr in bar is an interior pointer. To compute the bounds of arr at line-4, we need to find the base address of arr. To locate the base address, we store the offset from the base address in the tag area of the pointer. Here, the tag area of argument arr contains value 20. Using this information, CGuard can compute the base address by simply subtracting offset (20) from the virtual address of arr. CGuard doesn’t update the tag area for every pointer. For example, at line-4, after computing the address of arr[i] for the memory access, CGuard doesn’t need to update the tag because it statically knows that arr and &arr[i] belong to the same array, and it can compute the base address from the argument arr at line-1. We call arr the static-base of &arr[i]. Similarly, the static-base of &newarr[i] at line-5 is newarr at line-3. CGuard statically analyzes the routine to identify the static-base for every pointer. Whenever a pointer escapes the static scope, it may become a static-base in other parts of the program. For example, at line-13, &x leaves the static scope and becomes the static-base at line-3. Therefore, we update the tag before storing &x in var at line-13. Similarly, CGuard updates the tag of &x (at line-14) and &arr[i] (at line-8) before passing to and returning from the bar routine. CGuard doesn’t need to update the tag while storing the return value of bar in var at line-14. This is because the return value of a function is a static-base, and it already has the correct offset in its tag area.
A problem with this approach is that the maximum offset gets restricted by the number of bits in the tag field (CGuard uses 15 bits to store the offset). For objects whose offsets can’t fit into 15 bits, CGuard uses a segmented heap (Section 2.3). In this case, the base of the object is computed using the alignment property of the segmented heap. Another problem is that C doesn’t distinguish between a pointer and an array. For example, argument n in bar at line-2 is a pointer to a structure element; however, CGuard needs to add a bounds check at line-7 before accessing the structure field because it could be an array of structures. In contrast, in managed languages, the type system can distinguish between an object and an array of objects. Therefore, object accesses don’t need to perform explicit bounds checks. To eliminate the need for these bounds checks, we expect all static-bases to point to a memory area that is large enough to store at least one element of the corresponding array. We call this property the size-invariant property. For example, in the above example, CGuard requires the argument n to point to a memory area that is at least “sizeof(struct node)” long. We found that for most benchmarks, this property already holds. In our scheme, programs that don’t satisfy the size-invariant property may have to pay an additional performance penalty.
In our scheme, changing the pointer layout further complicates the pointer comparison and subtraction operations. Now, the same pointers may have different offsets in their tag area depending on their static bases. For example, the equality checks at line-6 will fail because newarr and arr contain different offsets in their tag area. To handle this correctly, CGuard resets the tags in the pointer operands during these operations. CGuard also resets the tag before every memory access. CGuard uses custom wrappers to invoke system library routines. These wrappers reset the tag field from the pointer arguments because the unmodified library doesn’t understand CGuard’s pointer layout. Finally, CGuard inserts dynamic checks before memory accesses to abort the program if the memory accesses are not within the object-bounds.
Figure 3 shows the architecture of CGuard. CGuard takes the intermediate representation (IR) of a program as input. The IR is in static single assignment (SSA) form. We incorporate our spatial safety logic in the IR to generate the checked IR. The checked IR is compiled to an executable. At load time, the executable is linked with our custom library that implements wrappers, custom library routines, and the custom allocator. In the rest of this section, we explain our scheme in detail.
2.2. Identifying static-base
For every pointer x, we statically identify a pointer y from which x is derived. We call y the static-base of x. We discuss below our algorithm to find the static-base for different kinds of definitions in the IR.
For a pointer arithmetic or typecast operation x, we recursively backtrack through all arithmetic and typecast operations to obtain a pointer y that is not the result of pointer arithmetic or a typecast operation. In this case, the static-base of x is the static-base of y.
For an integer-to-pointer typecast x, if we can statically correlate x with a previous pointer-to-integer operation y, we infer the static-base of x as the static-base of y. If a corresponding pointer-to-integer operation is not found, x is treated as the static-base of itself.
Pointers loaded from memory, the return value of a function call, function arguments, stack allocations, and global variables are also the static-bases of themselves.
The SSA representation contains phi-nodes to merge the definitions coming from multiple predecessor basic-blocks. In this case, we add a new phi-node that merges the static-bases of the definitions coming from these predecessors.
z_sb = phi <sb(x), pred1>, <sb(y), pred2>
z = phi <x, pred1>, <y, pred2>
In the above example, z is a phi-node that merges definitions x and y coming from basic blocks pred1 and pred2. We add a new phi-node z_sb, the static-base of z, that merges the static-bases of x and y denoted using sb(x) and sb(y). In general, there can be any number of predecessors.
The IR contains the instruction select that emulates the ternary operator as shown below.
z_sb = select cond, sb(x), sb(y)
z = select cond, x, y
In this example, select takes condition cond and definitions x and y as input and creates a new definition z. At runtime, z will be equal to x or y depending on the value of cond. To find the static-base, we add an additional select instruction that takes cond, sb(x), and sb(y) as inputs and creates a new definition z_sb, the static-base of z.
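The backtracking rule for pointer arithmetic and typecasts can be illustrated with a minimal sketch over a toy IR. The `Value` type, the `ValueKind` enumeration, and the `static_base` function are hypothetical names introduced only for illustration; they are not LLVM's API or CGuard's implementation. Phi and select nodes are omitted because, as described above, they require inserting a parallel node rather than following a single operand.

```c
#include <assert.h>
#include <stddef.h>

/* Toy IR (hypothetical): each value records how it was defined. */
typedef enum {
    V_ARITH,    /* pointer arithmetic on `operand`          */
    V_CAST,     /* typecast of `operand`                    */
    V_LOAD,     /* pointer loaded from memory               */
    V_CALL_RET, /* return value of a function call          */
    V_ARG,      /* function argument                        */
    V_ALLOCA,   /* stack allocation                         */
    V_GLOBAL    /* global variable                          */
} ValueKind;

typedef struct Value {
    ValueKind kind;
    const struct Value *operand; /* pointer operand for V_ARITH / V_CAST, else NULL */
} Value;

/* Walk back through arithmetic and casts until reaching a definition
 * that is its own static-base (load, call result, argument, ...). */
const Value *static_base(const Value *v) {
    while ((v->kind == V_ARITH || v->kind == V_CAST) && v->operand != NULL) {
        v = v->operand;
    }
    return v;
}
```

For instance, a cast of `&arg[i]` resolves to the argument `arg` itself, while `arg` resolves to itself.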
2.3. Tagged pointer
We use the spare higher 16-bits of virtual address on x86_64 hardware to store the tag. Conceptually, a tagged-pointer has the following structure.
In the rest of the paper, we will refer to the tagged-pointer type using tag_t. The lower 48-bits of a pointer contain the actual address, represented using the address field in the tag_t. The invalid field is used to mark a pointer invalid (as discussed later in this section). The maximum offset that can be stored in the 15-bits offset field is MAX_OFFSET (0x7FFF). The allocation for a size larger than or equal to MAX_OFFSET is performed from the segment-based allocator (discussed in the next paragraph). The offset field in the static-base tagged pointer contains the offset relative to the actual base address of the object referred by the static-base. If the relative offset is equal to MAX_OFFSET, then the base address is computed using the alignment property of the segment-base allocator.
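The field widths described above can be sketched in C using explicit masks. This is only an illustration: the text gives the widths (48-bit address, 15-bit offset, 1-bit invalid) but not the exact bit positions, so placing the offset in bits 48 to 62 and the invalid flag in bit 63 is an assumption, as are all helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; the text specifies only the field widths. */
#define ADDR_BITS   48
#define ADDR_MASK   ((UINT64_C(1) << ADDR_BITS) - 1)
#define MAX_OFFSET  UINT64_C(0x7FFF)        /* largest 15-bit offset */
#define INVALID_BIT (UINT64_C(1) << 63)

typedef uint64_t tag_t; /* tagged pointer held as a 64-bit integer */

static inline uint64_t tag_address(tag_t p) { return p & ADDR_MASK; }
static inline uint64_t tag_offset(tag_t p)  { return (p >> ADDR_BITS) & MAX_OFFSET; }
static inline int      tag_invalid(tag_t p) { return (p & INVALID_BIT) != 0; }

/* Pack the three fields into one 64-bit value (hypothetical helper). */
static inline tag_t make_tagged(uint64_t address, uint64_t offset, int invalid) {
    return (address & ADDR_MASK)
         | ((offset & MAX_OFFSET) << ADDR_BITS)
         | (invalid ? INVALID_BIT : 0);
}

/* Clear the whole tag, leaving only the 48-bit address. */
static inline uint64_t reset_tag(tag_t p) { return p & ADDR_MASK; }
```

With this layout, a static-base whose offset field is below MAX_OFFSET yields its object base as `tag_address(p) - tag_offset(p)`.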
The segment-based allocator maintains a list of segments that are shared across all threads. A segment (Figure 4) is a 4GB (configurable at compile time) contiguous virtual address space. The starting address of a segment is aligned to 4GB. The segment is divided into fixed-size slots. Both the size and alignment of a slot are $2^k$ (a power of two). The value of k can vary across segments. The first few pages of the segment are used to store the metadata (e.g., a bitmap to track free slots). Initially, the virtual addresses are reserved for the entire segment. Physical pages are mapped only during the actual allocation. For every allocation, a slot is returned to the caller. Because a slot can be much larger than an actual allocation size, we only map the number of physical pages that are sufficient for the allocation size. The physical pages are reclaimed during the deallocation.
CGuard manages stack allocations of sizes greater than or equal to MAX_OFFSET using malloc and free. For these objects, CGuard replaces the calls to stack allocation API with calls to malloc and insert free when the objects go out of scope. We also insert object-headers before stack and global allocations.
Updating the pointer tag:
The tag is updated every time a pointer escapes the static scope as a result of it being passed to a function, stored in memory, or returned to a caller. We don’t track a pointer if it escapes after being cast to an integer. Instead, we expect that the program casts it back to a pointer before the escape if the integer is out-of-bounds or modified due to some arithmetic operations on the integer. After the escape, the pointer may become a static-base in other parts of the program. For example, in our static-base identification logic (Section 2.2), a loaded value is identified as a static-base. After a pointer is stored in memory, it can be loaded at different parts of the program and treated as a static-base. We update the pointer tag before the escape to ensure that all the tagged static-bases always contain the correct offset.
The update_tag routine in Figure 5 updates the tags for escaping pointers. In this routine, base is the actual base address of the object, ptr is the escaping pointer, access_size is the size of the type of array element that ptr is pointing to, and the limit is the upper bound of the object. The return value of update_tag is a tagged pointer (for the address in ptr) with the correct offset. The update_tag routine stores the relative offset of the ptr address with respect to the base address in the return value if the relative offset is less than or equal to MAX_OFFSET. Otherwise, MAX_OFFSET is stored in the return value. If the address of ptr is not within the bounds, the invalid bit is set in the return value. Notice that update_tag is only needed if ptr is not a statically known alias of the static-base of ptr, or if access_size is larger than the size of the array element type pointed by the static-base of ptr. The access_size check is needed for the size-invariant property discussed in Section 2.5.
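The behavior described for update_tag can be approximated by the following sketch. It is not the routine from Figure 5: the bit positions of the tag fields, the treatment of `limit` as an exclusive upper bound, and the choice to store MAX_OFFSET when the pointer lies below `base` are all assumptions made for illustration, and the name `update_tag_sketch` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tag layout: 48-bit address, 15-bit offset, invalid flag in bit 63. */
#define ADDR_BITS   48
#define ADDR_MASK   ((UINT64_C(1) << ADDR_BITS) - 1)
#define MAX_OFFSET  UINT64_C(0x7FFF)
#define INVALID_BIT (UINT64_C(1) << 63)

/* base:        actual base address of the object
 * ptr:         escaping pointer (any existing tag is ignored)
 * access_size: size of the element type that ptr points to
 * limit:       upper bound of the object (assumed exclusive) */
uint64_t update_tag_sketch(uint64_t base, uint64_t ptr,
                           uint64_t access_size, uint64_t limit) {
    uint64_t addr = ptr & ADDR_MASK;
    uint64_t tagged = addr;

    /* Size-invariant: the whole element [addr, addr + access_size)
     * must lie inside [base, limit); otherwise mark the pointer invalid. */
    int in_bounds = addr >= base && addr < limit && access_size <= limit - addr;
    if (!in_bounds) {
        tagged |= INVALID_BIT;
    }

    /* Store the relative offset if it fits, else MAX_OFFSET.
     * Storing MAX_OFFSET for addr < base is an assumption of this sketch. */
    uint64_t offset = MAX_OFFSET;
    if (addr >= base && addr - base <= MAX_OFFSET) {
        offset = addr - base;
    }
    return tagged | (offset << ADDR_BITS);
}
```

For a 100-byte object, a pointer 20 bytes past the base with a 4-byte element type keeps a valid tag with offset 20, whereas a pointer 98 bytes past the base is marked invalid because its last two bytes fall outside the object.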
Handling out-of-bounds pointers:
Dereferencing an out-of-bounds pointer is illegal; however, its creation is not. CGuard can compute the base of an out-of-bounds static-base in the range [base, base+MAX_OFFSET). If a pointer goes out-of-bounds in any way, but the corresponding static-base follows the above property, CGuard can always retrieve the actual base address of the pointer. The base computation logic is discussed in Section 2.4. If an out-of-bounds pointer escapes the static scope, CGuard sets the invalid field in the pointer tag (Figure 5). This is needed for two reasons:
If the offset is greater than or equal to MAX_OFFSET or less than 0, CGuard can’t track an out-of-bounds pointer just using the offset field.
CGuard relies on the invalid bit to throw an out-of-bounds exception for pointers that don’t satisfy the size-invariant (Section 2.5).
2.4. Computing bounds and inserting checks
CGuard computes the base address of a pointer definition using the tagged static-base. The base computation logic (get_base) is shown in Figure 6. If the offset is less than MAX_OFFSET, get_base subtracts the offset in the tag from the address of the static-base pointer. Otherwise, if the static-base is invalid (i.e., out-of-bounds), get_base can’t retrieve the actual base and returns NULL. If the offset is equal to MAX_OFFSET and the address belongs to the range of global variables, get_base calls the allocator API get_base_allocator (discussed in Section 3), which doesn’t rely on the pointer tag to obtain the base. It also maintains a small cache to avoid calls to the allocator API, which works well in practice because there are only a few global variables of size greater than or equal to MAX_OFFSET across all of our benchmarks. Finally, for segment-based allocation, the base address of the object is computed using the alignment property of the segments. All slots in a segment are aligned to $2^k$. The starting address of a slot is computed by resetting the lower k bits of the pointer address. The first eight bytes of a segment contain slot_mask ($\sim(2^k - 1)$). The starting address of the object slot is computed by ‘anding’ the pointer address and the slot_mask. The starting address of a slot is the object-header. The base address of the object is computed by skipping the object-header.
If the static-base is an integer-to-pointer typecast, the offset field can be incorrect due to untracked integer arithmetic operations performed on the integer. To handle this case, if the static-base is an integer-to-pointer instruction or a phi or select node that depends on an integer-to-pointer instruction, CGuard backtracks all operations on the integer to check if it is involved in any arithmetic. If so, CGuard uses the allocator API to find the base. If an integer that escapes the static scope with an incorrect tag may be accessed later, we rely on the application to cast it to a pointer before letting it escape.
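The two main paths of the base computation (offset subtraction and segment alignment) can be sketched as follows. This is not the get_base routine of Figure 6. The tag bit positions and the 8-byte object-header size are assumptions, the global-variable path through get_base_allocator is omitted, and the slot mask is passed as a parameter instead of being read from the first eight bytes of the segment so that the sketch does not touch real memory. The names `get_base_sketch` and `slot_mask_for` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tag layout: 48-bit address, 15-bit offset, invalid flag in bit 63. */
#define ADDR_BITS   48
#define ADDR_MASK   ((UINT64_C(1) << ADDR_BITS) - 1)
#define MAX_OFFSET  UINT64_C(0x7FFF)
#define INVALID_BIT (UINT64_C(1) << 63)
#define HEADER_SIZE UINT64_C(8)   /* assumed size of the object-header */

/* Mask that clears the lower k bits of an address. */
uint64_t slot_mask_for(unsigned k) {
    return ~((UINT64_C(1) << k) - 1);
}

/* Returns the object base for a tagged static-base, or 0 (NULL) if the
 * base cannot be recovered. slot_mask stands in for the value stored at
 * the start of the pointer's segment. */
uint64_t get_base_sketch(uint64_t tagged, uint64_t slot_mask) {
    uint64_t addr = tagged & ADDR_MASK;
    uint64_t offset = (tagged >> ADDR_BITS) & MAX_OFFSET;

    if (offset < MAX_OFFSET) {
        return addr - offset;          /* small object: subtract the offset */
    }
    if (tagged & INVALID_BIT) {
        return 0;                      /* out-of-bounds static-base */
    }
    uint64_t slot_start = addr & slot_mask;  /* slots are 2^k-aligned */
    return slot_start + HEADER_SIZE;         /* skip the object-header */
}
```

With 1 MB slots (k = 20), any interior address of a large object maps back to the start of its slot with a single mask operation, regardless of how far the pointer is from the base.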
Bounds check: Our bounds check logic is shown below.
The arguments to the bounds_check routine are the pointer (ptr) (without tag) that is being accessed, the base address (base) of the object referred by ptr (e.g., obtained using get_base), the upper bound of the memory access (ptrlimit), and the upper bound of the object (limit). The upper bound of the object is computed by adding the object size, obtained from the object-header, to the base address of the object. If ptr doesn’t lie between base and limit, the program is aborted.
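A check with these four arguments might look like the sketch below. It is an illustration under stated assumptions rather than the authors' listing: `ptrlimit` and `limit` are treated as exclusive upper bounds, the object-header is assumed to be a single 8-byte size field immediately before the base, and the names `access_in_bounds`, `limit_from_header`, and `bounds_check_sketch` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Upper bound of the object: base plus the size stored in the header.
 * Assumes an 8-byte size field located immediately before the base. */
uint64_t limit_from_header(uint64_t base) {
    const uint64_t *header = (const uint64_t *)(uintptr_t)base - 1;
    return base + *header;
}

/* True if the access [ptr, ptrlimit) lies inside the object [base, limit).
 * The ptr <= ptrlimit test guards against address wraparound. */
int access_in_bounds(uint64_t ptr, uint64_t base,
                     uint64_t ptrlimit, uint64_t limit) {
    return ptr >= base && ptr <= ptrlimit && ptrlimit <= limit;
}

/* Abort the program on an out-of-bounds access. */
void bounds_check_sketch(uint64_t ptr, uint64_t base,
                         uint64_t ptrlimit, uint64_t limit) {
    if (!access_in_bounds(ptr, base, ptrlimit, limit)) {
        abort();
    }
}
```

Separating the predicate from the aborting wrapper keeps the comparison logic easy to exercise in isolation.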
CGuard enforces the size-invariant to eliminate checks when only the first element of an array or pointer to a structure element is accessed. This invariant requires all static-bases to point to a memory area that is large enough to store at least one element of the corresponding array. For example, if char *a is a static base, then a must point to a memory area that is at least one byte long; if the type of a is unsigned long long *, it must point to a memory area that is at least eight bytes long. If a pointer escapes the static scope, we invalidate the pointer if the size-invariant doesn’t hold. The corresponding logic is shown in the update_tag routine (Figure 5). Here, access_size is the size of the array element type. If ptr.address and ptr.address + access_size - 1 are not within the object-bounds, the invalid bit in the tag is set. This allows CGuard to remove bounds-check when only the first element of the array is accessed, since CGuard doesn’t reset the invalid bit for these accesses.
In our experiments, we found that the size-invariant holds for the majority of the benchmarks (Section 4.4). For the benchmarks that violate the invariant, the problem can be addressed either by using a smaller type for the static-base and external typecasts whenever needed or by using extra allocation. Consider the following example:
In this example, the size-invariant requires bar to pass an object of at least sizeof(struct info) to foo. Because the size-invariant doesn’t hold, CGuard invalidates the parameter passed to foo. The hardware generates an access violation when foo tries to dereference the invalid pointer. In this case, the bounds check will be performed in the signal handler as discussed in the next section. However, signal handling is expensive. These cases can be efficiently handled using code refactoring.
One way to fix this problem is to allocate at least sizeof(struct info) memory for the variable arr in the bar routine. This approach incurs memory overhead. Another way is to rewrite the foo and bar routines as follows:
In this case, because the type of argument in foo is int*, bar doesn’t invalidate the parameter passed to foo. CGuard adds dynamic checks in foo while dereferencing i because based on the size-invariant, it knows only that i is at least four bytes long. This approach doesn’t have any memory overhead but has a CPU overhead due to bounds checking. If types can’t be modified, an additional type attribute can be used to disable or pick a different size for the size-invariant optimization for a given type. We plan to implement the type attribute in the future.
2.6. Recovery from size-invariant errors
A legal memory access can cause an access violation if the size-invariant property is violated at runtime. Consider the following example:
Here, field_i of argument n is within the bounds, but the argument was invalidated because it didn’t satisfy the size-invariant property. At function entry, %rdi contains the argument n. At line-1, the address of field_i is computed. At line-3, CGuard resets the offset field in the pointer tag. At line-4, the actual dereference happens. Because the size-invariant property doesn’t hold, the hardware will throw an exception at line-4. At this point, the signal handler in the userspace is called. To recover from the fault, we perform a bounds check in the signal handler, for which we need to compute the base address. The base address can be computed using the fault address, tag bits, and the offset from the static-base. However, as we can see, the tag information is lost at this point because the %rdi register that was originally holding the tag has been overwritten at line-3. To obtain the tag bits, we have modified the compiler to ensure that the value of the potential fault address with the tag remains live during the access violation. In the modified assembly, at line-8, the compiler saves the content of %rdi in the %r11 register, which is live during the memory access.
In addition, CGuard generates metadata that is used by the signal handler to emulate the bounds check. The metadata includes the base register and displacement of the potential excepting instruction (%rdi and 0), the register that contains the tag (%r11), the offset from the static-base (0x10), the size of the memory access (4), and the length of the excepting instruction. Using this information, CGuard performs the bounds check in the signal handler. If the bounds check succeeds, CGuard generates a stub corresponding to the excepting instruction. The first instruction in the stub is the excepting instruction. The stubs are cached and reused for future faults to the same instruction pointer. Before calling the stub, the signal handler saves the address of the next instruction (address of line-11) and the contents of the base register (%rdi) on the stack (i.e., the stack pointer before the exception). It then sets the instruction pointer to the starting address of the stub and resets the invalid bit in the base register (%rdi) before returning from the signal handler. After returning from the signal handler, the stub code is executed that executes the excepting instruction and restores the value of the base register (%rdi) before returning to the original code (line-11).
Using this approach, we can only recover from those accesses in which the offset field in the tag is less than MAX_OFFSET. We can’t retrieve the base address for large objects because the invalid bit is used for both size-invariant violation and an out-of-bounds address. The additional overheads for these changes are in the range 0-5% for the SPEC benchmarks.
2.7. Library calls
We assume that system libraries are safe. Since library code can’t interpret our tagged pointers, we add wrappers around library calls to mediate between an instrumented application binary and unmodified system libraries, as shown in Figure 3. We trust most of the library functions to use pointer arguments safely. For some library routines, we insert bounds check to ensure spatial safety.
For many library calls, our compiler simply resets the tags in the pointer arguments before calling the target function. However, this is not always sufficient. For example, library functions may return an interior pointer, perform a callback to the application routine compiled using CGuard, and return their internal objects. In addition, the internal fields of an argument may contain tagged pointers. CGuard uses a custom implementation to handle these cases correctly.
2.8. Object initialization and memory accesses
If an object is not initialized properly, the application may access any arbitrary memory location. To prevent such cases, we initialize the pointer fields in all allocations (including stack and global variables) with NULL. Furthermore, if a global variable is initialized with an interior pointer, we also update the corresponding tag in the initialization.
If memory access is guarded by a bounds check, we reset the pointer tag before the memory access; otherwise, we only reset the offset field to catch the invalid accesses using pointers that don’t satisfy the size-invariant (Section 2.5).
For indirect calls with memory operands, we reset the pointer tag. If the address of a function leaves the static scope, we make it invalid. Marking the function addresses invalid disallows read/write on these addresses; however, the execution of invalid addresses is allowed using an indirect call. As a result, the application can execute any arbitrary virtual address using an indirect call. The existing mechanism for protecting indirect calls (Abadi et al., 2009; Zeng et al., 2011) can be used alongside our scheme to enforce control flow integrity for indirect calls.
We implemented CGuard as a compiler pass in the LLVM-10.0.0 compiler. We used JEMALLOC-5.2.1 as our allocator. We extended the JEMALLOC allocator to allocate large objects from our segment-based allocator as discussed in Section 2. We discuss below our implementation and some optimizations to reduce the CPU overheads.
Finding base: As we discussed in Section 2, for pointers whose tag may be incorrect due to untracked arithmetic operations on integers, CGuard relies on the allocator’s get_base_allocator routine, which takes an internal address of an object and returns the base address of the object. To find the base address for heap objects, we leveraged JEMALLOC’s existing radix-tree implementation to find the extent (a large contiguous area to allocate fixed-size objects) corresponding to an address. The starting address and the allocation size of an extent are used to compute the starting address of the object. For large heap objects, the base address is computed using the alignment property of the segments.
To support the base finding for stack variables, we register stack objects with the allocator when they are created and deregister them when they are destroyed. Note that this is required only for the stack variables that escape the static scope or are cast to an integer. For global and static variables, at load time, the allocator walks global objects in different sections of the executable as specified in the executable format and stores the bases in sorted order so that the base can be found using binary search during program execution.
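The lookup over sorted global bases can be pictured with a small binary search that returns the greatest recorded base not exceeding a given address. This is a sketch only: `find_global_base` is a hypothetical name, and a complete lookup would presumably also confirm that the address falls within the size of the object found, which this sketch does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* bases: object base addresses in ascending order (built at load time).
 * Returns the greatest base <= addr, or 0 if addr precedes every base. */
uint64_t find_global_base(const uint64_t *bases, size_t n, uint64_t addr) {
    size_t lo = 0, hi = n;          /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bases[mid] <= addr) {
            lo = mid + 1;           /* candidate found; look further right */
        } else {
            hi = mid;
        }
    }
    return lo == 0 ? 0 : bases[lo - 1];
}
```

The search takes a logarithmic number of comparisons in the number of global objects, which is why keeping the bases sorted at load time is sufficient.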
Variable-length arguments: Our current implementation doesn’t ensure safety for variable-length arguments. In our implementation, we have assumed that the variable number of arguments is always 16. To correctly handle this case, we require the caller to pass the number of arguments for every function call. For this purpose, we can use a spare caller-saved register, which is not used to pass the argument.
Memory-mapped files and shared memory: We don’t support memory-mapped files and some shared-memory APIs. However, we support ANONYMOUS mmap by allocating one extra page for storing the object-header. Notice that mmap always returns a page-aligned address.
Loop optimization: If i) a pointer is always accessed inside a loop, ii) the pointer address only depends on the induction variable and the values outside the loop, iii) the lower bound, upper bound, and the step count of the induction variable are known, iv) the loop executes at least once, and v) the loop condition is the only way to exit from the loop, then we move the bounds check outside the loop.
The example below demonstrates our optimization.
In this program, arr[i+k] is always accessed inside the loop, i is the induction variable, and k and j are defined outside the loop. The lower bound, upper bound, and the step count of i are zero, j, and one, respectively. The loop executes at least once because j is greater than zero. In this case, we can statically compute the lower bound and upper bound of all possible accesses within the loop, i.e., arr[k] and arr[k+j]. Thus, we can remove the check inside the loop and place one check outside the loop to check that memory accesses from arr[k] to arr[k+j] are safe. To prevent the underflow and overflow of lower and upper bounds, we add an assertion that the upper bound is greater than the lower bound.
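The effect of the transformation can be shown with a hand-written C sketch of a loop after hoisting. It is not compiler output: the function name `sum_window` is hypothetical, the explicit `arr_len` parameter stands in for the object size that CGuard would read from the object-header, the loop is assumed to run for i in [0, j), and an error return replaces the abort a real check would perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums arr[k] .. arr[k + j - 1] with one hoisted range check instead of
 * a check per iteration. Returns 0 on success, -1 if the check fails. */
int sum_window(const int *arr, size_t arr_len, size_t k, size_t j, long *out) {
    if (j == 0) {
        return -1;                 /* hoisting assumes at least one iteration */
    }
    size_t lo = k;                 /* index of the first access              */
    size_t hi = k + j;             /* one past the index of the last access  */
    if (hi <= lo) {
        return -1;                 /* k + j wrapped around                    */
    }
    if (hi > arr_len) {
        return -1;                 /* some access would be out of bounds      */
    }

    long sum = 0;
    for (size_t i = 0; i < j; i++) {
        sum += arr[i + k];         /* no per-iteration check needed here      */
    }
    *out = sum;
    return 0;
}
```

The comparison `hi <= lo` plays the role of the assertion that the upper bound is greater than the lower bound, catching index arithmetic that overflows before the range check is applied.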
Updating pointer tag: We need to compute the base and limit to update the pointer’s tag. The tag update is performed for escaping pointers (i.e., during store, call, and return). Because we initialize all pointer fields with NULL during allocation and mark all pointers initialized with a constant integer or a function address as invalid, we can easily filter a valid pointer during our base and limit computation. However, there are some cases when the pointer doesn’t refer to a valid object, e.g., an object that has been freed. In this case, accessing the object to fetch the size may cause an access violation. To handle this case, we always access the base address in the base finding handler for escaping pointers and catch the access violations using a signal handler in the userspace. In the signal handler, if the access violation is encountered at an instruction pointer that belongs to our base finding algorithm for escaping pointers, we change the instruction pointer in such a way that the base finder algorithm returns NULL. This eventually leads to the invalidation of the pointer during the tag update. We never encountered such cases in any of our benchmarks.
4.1. Experimental setup and benchmarks
We ran our experiments on a machine running Ubuntu-20.04.2 equipped with an 8-core 3.6 GHz Intel i9-9900k processor, 32GB RAM, 1-Gigabit Ethernet controller, and 512GB SSD drive for persistent storage. We disabled hyper-threading during our experiments. We measured CPU overheads using the SPEC CPU2017 (Bucek et al., 2018) benchmarks. We used the reference input size for SPEC. For multicore performance, we used Phoenix-2.0 (Yoo et al., 2009) and the Apache-2.4.46 webserver. For Apache, we also instrumented apr-1.7.0 and apr-util-1.6.1 for spatial safety checks. We configured Phoenix and Apache not to use memory-mapped files. In addition, we configured Apache to use “anonymous MMAP” for shared memory instead of “System V shared memory APIs”. Phoenix (Yoo et al., 2009) reports that for the kmeans, pca, and histogram benchmarks, the pthread version is more scalable than the map-reduce version. We ran the pthread version for these benchmarks and the map-reduce version for the rest. We used the large input set in our evaluations. For matrix-multiply, pca, and kmeans, we used input sizes of 2000x2000, 3000x3000, and 200000 respectively to make them run for at least a second. These benchmarks have a very short execution time even for the large input set. For the security evaluation, we ran the BugBench (Lu et al., 2005) benchmark suite.
To measure the execution time, we took the median of five runs for every benchmark. To report the memory overhead, we used the “Maximum resident set size” reported by the “/usr/bin/time -v” command. We used the geometric mean to compute the average overhead. For the server experiment, we ran the client on a different machine (with a 1-Gbps network card) and directly connected both machines. For scalability experiments, we disabled CPU cores using the CPU hotplug feature in the Linux kernel. For native results, we used the unmodified versions of the LLVM-10.0.0 compiler and the JEMALLOC-5.2.1 allocator on which our implementation is based. We compiled all our benchmarks with the O3 optimization level. The GeoMean label in our graphs represents the geometric mean.
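As a small illustration of this measurement procedure, the sketch below takes the median of five timings and expresses an instrumented run as a percentage overhead over a native run. The function names are hypothetical, and the overhead formula (ratio minus one) is an assumption about how the percentages are derived; the text does not spell it out.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of five timings; sorts a private copy of the input. */
double median_of_five(const double runs[5]) {
    double tmp[5];
    for (int i = 0; i < 5; i++) tmp[i] = runs[i];
    qsort(tmp, 5, sizeof tmp[0], cmp_double);
    return tmp[2];
}

/* Overhead of an instrumented run relative to a native run, in percent
   (assumed definition: instrumented / native - 1). */
double overhead_percent(double instrumented, double native) {
    assert(native > 0.0);
    return (instrumented / native - 1.0) * 100.0;
}
```

For example, timings of 10.2, 9.8, 10.0, 11.5, and 9.9 seconds have a median of 10.0 seconds, so the single slow run does not skew the reported number.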
Figure 7 shows the runtime overheads for SPEC benchmarks with and without the size-invariant optimization. With all optimizations, the overheads are in the range of 1-245%. The geometric mean is 42.1%, as shown in the last column. Perlbench has the worst overhead at 245%, whereas lbm shows merely 1.2% overhead. SGXBounds reported 41% overhead inside the SGX enclaves (McKeen et al., 2013) and 55% overhead outside the enclaves for SPEC CPU2006. Their average overhead also includes C++ benchmarks; therefore, a direct comparison is not possible. Outside the enclave, SGXBounds overheads for lbm and mcf are around -50% (better than native) and 30%, compared to our overheads of 1.2% and 47.9% for these benchmarks. Inside the enclave, SGXBounds reported around 5% overhead for lbm and 1% overhead for mcf. Interestingly, SGXBounds reported that lbm also performs better than the native version for the AddressSanitizer implementation in the LLVM compiler. They attributed this speedup to the change in memory layout. Perlbench and gcc are the two worst-performing benchmarks in our experiments. These benchmarks were not evaluated by SGXBounds because they require custom modifications in the source code. We also require custom changes for these benchmarks, as described in Section 4.4. The overheads of gcc, mcf, and imagick are 170.9%, 106%, and 68.3% without the size-invariant optimization, compared to 107.2%, 47.9%, and 34.1% with it. This shows that the size-invariant optimization is useful.
Our memory overhead for SPEC is 1.16% (Figure 8), which is slightly higher than the 0.4% overhead reported by SGXBounds. gcc and perlbench are the worst-performing benchmarks with overheads of 104% and 17%, respectively. For gcc, our memory overhead is mainly due to the source code modifications related to the size-invariant (discussed in Section 4.4). To confirm this, we ran the native run with our custom allocator. The memory overhead in this case was 2%. We performed a similar experiment for perlbench and observed 16% overhead. This confirms that source code refactoring is not the reason for the memory overhead in perlbench. To validate that the overhead is not due to our segment-based allocation, we modified the original allocator to allocate eight additional bytes for every allocation. With the modified allocator, the overhead was the same as with our custom allocator. This indicates that the overhead is primarily due to the small objects for which the overhead of object headers is high.
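To make the small-object argument concrete, here is a minimal sketch of an allocator wrapper that adds eight bytes to every allocation. The names `hdr_malloc`, `hdr_size`, `hdr_free`, and `hdr_overhead_percent` are hypothetical, and storing the object size in the extra eight bytes is an assumption made for illustration; the experiment above only states that eight additional bytes were allocated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HDR_BYTES 8u  /* eight additional bytes per allocation */

/* Allocate size bytes plus an 8-byte header that records the size.
   Note: the returned pointer is only guaranteed to be 8-byte aligned. */
void *hdr_malloc(size_t size) {
    if (size > SIZE_MAX - HDR_BYTES) return NULL;
    uint64_t *raw = malloc(size + HDR_BYTES);
    if (raw == NULL) return NULL;
    raw[0] = (uint64_t)size;
    return raw + 1;
}

/* Read the recorded size back from the header. */
size_t hdr_size(const void *p) {
    return (size_t)((const uint64_t *)p)[-1];
}

/* Release an allocation obtained from hdr_malloc. */
void hdr_free(void *p) {
    if (p != NULL) free((uint64_t *)p - 1);
}

/* Header cost relative to the payload, in percent. */
double hdr_overhead_percent(size_t size) {
    assert(size > 0);
    return 100.0 * HDR_BYTES / (double)size;
}
```

Under these assumptions a 16-byte object pays 50% extra for its header while a 4096-byte object pays about 0.2%, which is why workloads dominated by small live objects show the highest memory overheads.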
To test the scalability of our approach, we ran the Phoenix benchmark suite with 1, 2, 4, and 8 CPUs. Figure 9 shows the execution time overhead of CGuard with respect to the native execution. Phoenix’s average CPU and memory overheads are 26.3% and 1.6% on a single core and 19.9% and 5.9% on eight cores, respectively. As expected, our performance doesn’t degrade significantly as the number of cores increases. However, we observed a sharp decrease in overheads with an increasing number of CPUs in the histogram and linear-regression benchmarks. This is because neither benchmark fully utilizes the CPUs on multiple cores, thus leaving scope for CGuard to steal some CPU cycles. The CPU utilization for histogram in the native run on 1, 2, 4, and 8 cores is 99%, 139%, 177%, and 205%, compared to 99%, 151%, 205%, and 251% for CGuard. A similar pattern exists for the linear-regression benchmark as well. For this benchmark suite, the additional CPU overheads after disabling the size-invariant optimization were within 10%, except for kmeans, for which the additional overhead is around 25%.
For the Phoenix benchmark suite, SGXBounds performs better than CGuard. For kmeans, SGXBounds reported around 60% overhead compared to 148% overhead in our approach. For the remaining benchmarks, the CPU overheads in SGXBounds were less than 10%.
We observed large variations in the memory overheads for the kmeans and matrix-multiply benchmarks (Figure 10). The overheads vary between 47-108% for kmeans and 5-25% for matrix-multiply. The memory consumption of these benchmarks is very small: 10MB for kmeans and 53MB for matrix-multiply. We believe that the page table pages corresponding to our custom heap segments add a few extra MBs, which is prominent because of the small memory footprint. To validate our hypothesis, we ran these benchmarks with relatively large inputs, and the resulting overheads of kmeans and matrix-multiply were in the ranges 57-62% and 2-6%. To further validate that the high overheads in kmeans are not due to our segment-based allocation, we ran the native version with a modified allocator that allocates eight extra bytes for each allocation. In this case, we found that the memory overheads for kmeans with respect to the native execution were in the range of 1-4%. This means that the memory overheads in kmeans are mainly due to the large number of small live objects.
To further validate the usability of our tool for real applications, we ran the Apache webserver. Using a 1Gbps network card, we couldn’t saturate all the cores even with concurrent requests. In the native run, the network card could only saturate three cores, so we ran our experiments with an increasing number of cores. We ran the ab tool on the client machine and enabled the KeepAlive feature in the requests. To find the right concurrency level, we tried different parameters until we observed either a reduction or no significant change in the throughput. During these experiments, we ran our instrumented server and used its default pages. We obtained different concurrency levels for different numbers of cores.
The corresponding figure plots the results for 1, 2, 3, 4, 5, and 8 cores. The first and second bars correspond to overheads with and without the size-invariant optimization. We observed 29.7% overhead with the size-invariant optimization and 36.6% overhead without it when the CPUs were fully saturated (i.e., with fewer than four cores). Our numbers started improving when the cores were partially saturated in the native run. With eight cores, we observed only 0.9% overhead. The relative standard deviations for this experiment were in the range of 0.25-1.47% across all runs.
| Benchmark | Access violation points |
|---|---|
| bc | bc.c:1425; util.c:270,577; storage.c:177 |
To test the effectiveness of CGuard, we ran the BugBench (Lu et al., 2005) benchmark suite, which contains a set of buggy applications, some of which have spatial safety bugs. Table 1 shows all the program points at which CGuard detected out-of-bounds accesses for the BugBench, SPEC, and Phoenix benchmark suites. We found all the bugs reported in the BugBench code repository. In addition, CGuard also detected spatial safety violations in the gcc and x264 benchmarks from the SPEC CPU2017 benchmark suite. In gcc, the global variable hard_regno_nregs is accessed using a negative index. The check for the negative index is performed after the variable access. Importantly, the AddressSanitizer implementation in LLVM could not detect this bug in gcc. In x264, the global variables INIT_FLD_MAP_I and INIT_FLD_LAST_I are accessed at an index that is outside the bounds of these objects. These variables are passed at lines 90 and 91 in context_ini.c. In the string_match benchmark from Phoenix, fdata_keys, which is allocated with size finfo_keys.st_size at line 259 in string_match.c, is accessed in a loop. This loop has an incorrect bound check in the loop condition that allows the program to access an additional byte past the original allocation size. Our post-evaluation inspection revealed that the bug reported for perlbench (2) in SPEC CPU2006 has already been fixed in SPEC CPU2017. Therefore, CGuard did not report it.
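The two bug patterns described above (an index check that comes after the access, and a loop condition that is off by one) can be illustrated with a short sketch. None of this is the actual gcc or Phoenix source: `regs_table`, `lookup_reg`, and the two counting helpers are hypothetical, and the `<=` form of the faulty loop condition is an assumption, since the text does not show the original condition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NREGS 8
static const int regs_table[NREGS] = {1, 1, 2, 2, 4, 4, 8, 8};  /* made-up data */

/* Buggy ordering, shown only as a comment because running it with a
 * negative index is undefined behavior:
 *
 *     int v = regs_table[idx];   // the read happens first
 *     if (idx < 0) return -1;    // the check comes too late
 *     return v;
 */

/* Corrected ordering: validate the index before touching the array. */
int lookup_reg(int idx) {
    if (idx < 0 || idx >= NREGS) return -1;
    return regs_table[idx];
}

/* Number of indices a loop with condition `i <= len` visits: len + 1,
   i.e., one element past an allocation of len elements. */
size_t visited_with_le(size_t len) {
    assert(len < SIZE_MAX);  /* otherwise i <= len never becomes false */
    size_t n = 0;
    for (size_t i = 0; i <= len; i++) n++;
    return n;
}

/* Number of indices a loop with condition `i < len` visits: exactly len. */
size_t visited_with_lt(size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) n++;
    return n;
}
```

In both patterns the stray access touches memory adjacent to the object, so it rarely crashes on its own and is typically caught only by a tool that knows the object's bounds.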
| Benchmark | Type | Source code modification |
|---|---|---|
| Perlbench | a | hv.h:48; pad.c:2808; MD5.c:184; op.c:8401 |
| | c | pp_pack.c:3038; av.c:159; pp_hot.c:3175; regcomp.c:16274; perly.c:408 |
| gcc | a | tree-ssa-operands.c:130,133; tree-ssa-sccvn.c:1542,1580,1610; tree.c:2102,863,865,958,1467,1584,3604,9411; sbitmap.c:82; sparseset.c:38; rtl.c:199,341; reload1.c:915; gimple.c:148; cpp_symtab.c:173 |
| | c | c-common.c:5296; dominance.c:1339; ggc-page.c:571; pointer_set.c:67 |
For most benchmarks, we didn’t need to refactor source code. Table 2 provides a summary of our changes. At a broad level, we distributed these changes into three categories: a) changes related to the size-invariant, b) changes related to a pointer comparison converted to an integer comparison by the frontend, and c) other changes.
Most changes were related to the size-invariant, and these were the easiest to fix. We found that for most of these cases, CGuard threw an exception at the allocation point itself.
In some cases, the frontend generated an integer comparison instead of a pointer comparison. In most cases, these conversions were done for “!”-style comparisons. We refactored the code by rewriting these comparisons so that the frontend generated a pointer comparison. In the future, we plan to extend the frontend to avoid the need for these changes.
In gcc, pointers are used as integers in comparisons, array indexes, and hash table keys. In all these cases, we changed the source code to reset the pointers’ tags. In the Phoenix map-reduce library, a pointer is accessed in assembly. To support this case, we reset the tag before accessing the memory.
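A minimal sketch of what resetting a tag before an integer use might look like is given below. The layout is an assumption made for illustration (a 16-bit tag in the top bits of a 64-bit address, below which 48 address bits remain); the names `set_tag`, `get_tag`, `reset_tag`, and `ptr_hash` are hypothetical, and addresses are modeled as `uint64_t` values to keep the example portable.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 48  /* assumption: tag occupies the top 16 bits */
#define ADDR_MASK ((UINT64_C(1) << TAG_SHIFT) - 1)

/* Attach a tag to an untagged address. */
uint64_t set_tag(uint64_t addr, uint16_t tag) {
    assert((addr & ~ADDR_MASK) == 0);  /* caller passes an untagged address */
    return addr | ((uint64_t)tag << TAG_SHIFT);
}

/* Extract the tag bits. */
uint16_t get_tag(uint64_t tagged) {
    return (uint16_t)(tagged >> TAG_SHIFT);
}

/* Clear the tag so the value can be used as a plain integer. */
uint64_t reset_tag(uint64_t tagged) {
    return tagged & ADDR_MASK;
}

/* Hash-table key derived from a pointer: hashing the untagged address
   makes the key independent of whatever tag the pointer carries. */
uint64_t ptr_hash(uint64_t tagged) {
    return reset_tag(tagged) * UINT64_C(0x9E3779B97F4A7C15);
}
```

Without the reset, two pointers to the same address with different tags would compare unequal as integers and land in different hash buckets, which is the kind of breakage the source changes avoid.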
To summarize, most benchmarks didn’t require any refactoring. Even for large applications such as Apache, we needed refactoring at only one place. This indicates that our technique can be used in practice. We also ran gcc, perlbench, and apache without the size-invariant modifications. CGuard could successfully run perlbench and apache without any additional overheads. This is because the parts of the code that require size-invariant modifications are not on the hot path. However, we couldn’t run gcc because it uses a custom allocator that uses the system allocator in the backend. As a result, most of the objects are large objects, for which CGuard can’t retrieve the base address during size-invariant violations.
5. Limitations and future work
CGuard relies on the programmer to typecast an integer to a pointer if an integer with an inconsistent tag escapes the static scope and is accessible in the future. In our experiments, we found that this practice is generally followed (see Section 4.4). However, our automated solution caused applications to break at multiple places. Therefore, this task is left to the developer. We also assume that implicit integer-to-pointer typecasts are safe. In the rare case that the size-invariant property is violated due to an implicit typecast, some bugs may remain undetected.
At a more general level, CGuard assumes that the developer’s intent is not malicious. It also assumes a weaker form of type safety (as discussed in the previous paragraph) and temporal safety. Existing works have similar limitations. In the SGXBounds (Kuvaiskii et al., 2017) approach, if the limit of the tagged pointer is modified using an integer, the bounds check may incorrectly succeed or fail at runtime. In the BaggyBounds (Akritidis et al., 2009), PAriCheck (Younan et al., 2010), and Low Fat Pointers (Duck and Yap, 2016; Duck et al., 2017) approaches, an out-of-bounds pointer can be created and accessed using integer arithmetic. These works also require source code refactoring. The primary reason behind such limitations is that it is hard to statically track arithmetic operations on an escaped integer that is also a pointer.
We don’t support sub-object overflows because that requires storing the bounds information for each pointer, which adds additional memory and CPU overheads.
In the future, we will investigate whether our work can be extended to support temporal safety. However, existing techniques (Boehm and Weiser, 1988) for temporal safety can be used alongside our approach with minor modifications for tagged pointers.
6. Related work
Jones and Kelly (Jones and Kelly, 1997) proposed the idea of object-bounds protection. However, it didn’t allow the creation of an out-of-bounds pointer. CRED (Ruwase and Lam, 2004) improved on this work by supporting an in-bounds pointer derived from an out-of-bounds pointer. However, both of these works suffered from CPU overheads due to the splay-tree-based implementation for bounds checking. Dhurjati and Adve (Dhurjati and Adve, 2006) reduced the CPU overheads by using per-pool splay-trees instead of a global splay-tree.
Baggy Bounds (Akritidis et al., 2009), PAriCheck (Younan et al., 2010), and Low Fat Pointers (Duck and Yap, 2016; Duck et al., 2017) further reduce the CPU overheads by adding extra padding to objects that allow them to locate the base address without an expensive search. However, these works don’t provide complete object-bounds protections because they allow the applications to access the padded area. These works have also used the pointer tagging approach. SGXBounds (Kuvaiskii et al., 2017) supports complete object-bounds protection but restricts the application address space to 32-bit on a 64-bit platform. Delta Pointers (Kroes et al., 2018) further reduces the CPU overheads of SGXBounds by only detecting overflows. Both SGXBounds and Delta Pointers use pointer tagging, and they store the tag in the virtual address of the pointers, similar to us.
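The padding trade-off can be sketched as follows, in the style of the power-of-two scheme used by Baggy Bounds. The 16-byte minimum slot size and the names `padded_size`, `slot_base`, and `slot_check` are assumptions made for illustration, not the exact parameters or interfaces of any of the cited systems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a request up to the next power of two (the padded slot size).
   Returns 0 if the request is too large to round up. */
size_t padded_size(size_t request) {
    size_t s = 16;  /* assumed minimum slot size */
    while (s < request) {
        if (s > SIZE_MAX / 2) return 0;
        s <<= 1;
    }
    return s;
}

/* With slots aligned to their own size, the base address is found by
   clearing the low bits; no search structure is needed. */
uintptr_t slot_base(uintptr_t addr, size_t slot) {
    assert(slot != 0 && (slot & (slot - 1)) == 0);  /* power of two */
    return addr & ~((uintptr_t)slot - 1);
}

/* A check against the padded slot accepts every byte of the slot,
   including the padding that follows the real object. */
int slot_check(uintptr_t addr, uintptr_t base, size_t slot) {
    return addr >= base && addr - base < slot;
}
```

For a 40-byte request the slot is 64 bytes, so an access at offset 50 passes the check even though it lies beyond the object; this is the incompleteness that the paragraph above refers to.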
Another line of work provides spatial safety for pointer-bounds. These approaches can detect sub-object overflow at the cost of high CPU and memory overheads because they need to store and update bounds for every pointer.
CCured (Necula et al., 2002; Condit et al., 2003) statically categorized the pointers into SAFE, SEQ, and WILD. SAFE pointers are normal pointers and don’t require any checks. SEQ and WILD pointers are fat-pointers that store the bounds information of pointers and objects and require runtime checks. Cyclone (Jim et al., 2002) uses fat-pointers and also provides programmers a variety of pointer qualifiers to control the runtime checks. SoftBound (Nagarakatte et al., 2009) stores per-pointer metadata in a disjoint address space for better compatibility. SafeC (Austin et al., 1994) and Xu et al. (Xu et al., 2004) also track bounds for every pointer and can also detect temporal safety bugs in addition to spatial safety bugs.
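As a rough illustration of the fat-pointer idea shared by these systems, the sketch below carries a base and size alongside each pointer and checks every load. It is not the representation used by CCured, Cyclone, or SoftBound; the `fat_ptr` layout (base, size, and a signed offset) and the function names are hypothetical choices that keep the example free of out-of-bounds pointer arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A pointer that carries its own bounds. */
typedef struct {
    char *base;     /* start of the object or sub-object */
    size_t size;    /* its size in bytes */
    ptrdiff_t off;  /* current offset from base; may be out of range */
} fat_ptr;

/* Create a fat pointer covering size bytes starting at obj. */
fat_ptr fat_make(void *obj, size_t size) {
    assert(obj != NULL);
    fat_ptr f = {(char *)obj, size, 0};
    return f;
}

/* Pointer arithmetic only moves the offset; bounds travel unchanged. */
fat_ptr fat_add(fat_ptr f, ptrdiff_t delta) {
    f.off += delta;
    return f;
}

/* Is an access of n bytes at the current offset inside the bounds? */
bool fat_ok(fat_ptr f, size_t n) {
    if (f.off < 0 || (size_t)f.off > f.size) return false;
    return n <= f.size - (size_t)f.off;
}

/* Checked one-byte load; returns false instead of reading out of bounds. */
bool fat_load(fat_ptr f, char *out) {
    if (!fat_ok(f, 1)) return false;
    *out = f.base[f.off];
    return true;
}
```

Because the bounds belong to the pointer rather than to the allocation, a fat pointer created for an 8-byte array field rejects offset 8 even when the enclosing struct continues past it. That is how pointer-based schemes catch sub-object overflows, and the three words per pointer are the memory cost mentioned above.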
We presented CGuard, a tool that provides complete object-bounds protection for C applications at low CPU and memory overheads. CGuard requires applications to obey a weak form of type-safety. Our evaluation revealed that for most applications, this property holds. The changes needed for applications that did not satisfy the property were minor. CGuard was able to detect spatial safety violations in widely used benchmarks. In particular, it detected a bug in gcc that was not reported in any other works to the best of our knowledge. This evaluation demonstrates that our approach is effective and can scale to real applications.
-  (2009) Control-flow integrity principles, implementations, and applications. ACM Transactions on Information and System Security (TISSEC) 13 (1), pp. 1–40. Cited by: §2.8.
-  (2018 (accessed April 25, 2021)) AddressSanitizerFoundBugs. Note: https://github.com/google/sanitizers/wiki/AddressSanitizerFoundBugs#Spec_CPU_2006 Cited by: §4.3.
-  (2009) Baggy bounds checking: an efficient and backwards-compatible defense against out-of-bounds errors.. In USENIX Security Symposium, Vol. 10. Cited by: §1, §1, §5, §6.
-  (1994) Efficient detection of all pointer and array access errors. In Proceedings of the ACM SIGPLAN 1994 conference on Programming Language Design and Implementation, pp. 290–301. Cited by: §1, §6.
-  (2018) The guard’s dilemma: efficient code-reuse attacks against intel sgx. In 27th USENIX Security Symposium (USENIX Security 18), pp. 1213–1227. Cited by: §1.
-  (2011) Jump-oriented programming: a new class of code-reuse attack. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, pp. 30–40. Cited by: §1.
-  (1988) Garbage collection in an uncooperative environment. Software: Practice and Experience 18 (9), pp. 807–820. Cited by: §5.
-  (2018) SPEC cpu2017: next-generation compute benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pp. 41–42. Cited by: §4.1.
-  (2009) Can dres provide long-lasting security? the case of return-oriented programming and the avc advantage.. EVT/WOTE 2009. Cited by: §1.
-  (2003) CCured in the real world. ACM SIGPLAN Notices 38 (5), pp. 232–244. Cited by: §1, §6.
-  (2015) Losing control: on the effectiveness of control-flow integrity under stack attacks. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 952–963. Cited by: §1.
-  (2008) Hardbound: architectural support for spatial safety of the c programming language. ACM SIGOPS Operating Systems Review 42 (2), pp. 103–114. Cited by: §1.
-  (2006) Backwards-compatible array bounds checking for c with very low overhead. In Proceedings of the 28th international conference on Software engineering, pp. 162–171. Cited by: §1, §1, §6.
-  (2017) Stack bounds protection with low fat pointers.. In NDSS, Vol. 17, pp. 1–15. Cited by: §1, §1, §5, §6.
-  (2016) Heap bounds protection with low fat pointers. In Proceedings of the 25th International Conference on Compiler Construction, pp. 132–142. Cited by: §1, §1, §5, §6.
-  (2015) Missing the point (er): on the effectiveness of code pointer integrity. In 2015 IEEE Symposium on Security and Privacy, pp. 781–796. Cited by: §1.
-  (2013 (accessed April 25, 2021)) Introduction to intel memory protection extensions. Note: https://software.intel.com/content/www/us/en/develop/articles/introduction-to-intel-memory-protection-extensions.html Cited by: §1.
-  (2002) Cyclone: a safe dialect of c.. In USENIX Annual Technical Conference, General Track, pp. 275–288. Cited by: §1, §6.
-  (1997) Backwards-compatible bounds checking for arrays and pointers in c programs.. In AADEBUG, Vol. 97, pp. 13–26. Cited by: §1, §1, §6.
-  (2018) Delta pointers: buffer overflow checks without the checks. In Proceedings of the Thirteenth EuroSys Conference, pp. 1–14. Cited by: §6.
-  (2017) SGXBOUNDS: memory safety for shielded execution. In Proceedings of the Twelfth European Conference on Computer Systems, pp. 205–221. Cited by: §1, §1, §5, §6.
-  (2005) Bugbench: benchmarks for evaluating bug detection tools. In Workshop on the evaluation of software defect detection tools, Vol. 5. Cited by: §4.1, §4.3.
-  (2013) Innovative instructions and software model for isolated execution.. Hasp@ isca 10 (1). Cited by: §4.2.
-  (2009) SoftBound: highly compatible and complete spatial memory safety for c. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 245–258. Cited by: §1, §6.
-  (2002) CCured: type-safe retrofitting of legacy code. In Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 128–139. Cited by: §1, §6.
-  (2001) The advanced return-into-lib(c) exploits: pax case study. In Phrack Magazine, Volume 11, Issue 0x58, Cited by: §1.
-  (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. ACM Sigplan notices 42 (6), pp. 89–100. Cited by: §6.
-  (2018) Intel mpx explained: a cross-layer analysis of the intel mpx system stack. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2 (2), pp. 1–30. Cited by: §1.
-  (1996) Smashing the stack for fun and profit. Phrack magazine 7 (49), pp. 14–16. Cited by: §1.
-  (1991) Purify: fast detection of memory leaks and access errors. In Proc. of the Winter 1992 USENIX Conference, Cited by: §6.
-  (2004) A practical dynamic buffer overflow detector.. In NDSS, Vol. 2004, pp. 159–169. Cited by: §1, §1, §6.
-  (2012) AddressSanitizer: a fast address sanity checker. In 2012 USENIX Annual Technical Conference (USENIXATC 12), pp. 309–318. Cited by: §6.
-  (2007) The geometry of innocent flesh on the bone: return-into-libc without function calls (on the x86). In Proceedings of the 14th ACM conference on Computer and communications security, pp. 552–561. Cited by: §1.
-  (2009) Breaking the memory secrecy assumption. In Proceedings of the Second European Workshop on System Security, pp. 1–8. Cited by: §1.
-  (2004) An efficient and backwards-compatible transformation to ensure memory safety of c programs. In Proceedings of the 12th ACM SIGSOFT Twelfth International Symposium on Foundations of Software Engineering, pp. 117–126. Cited by: §1, §6.
-  (2009) Phoenix rebirth: scalable mapreduce on a large-scale shared-memory system. In 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 198–207. Cited by: §4.1.
-  (2010) PAriCheck: an efficient pointer arithmetic checker for c programs. In Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security, pp. 145–156. Cited by: §1, §1, §5, §6.
-  (2011) Combining control-flow integrity and static analysis for efficient and validated data sandboxing. In Proceedings of the 18th ACM conference on Computer and Communications Security, pp. 29–40. Cited by: §2.8.