libmpk: Software Abstraction for Intel Memory Protection Keys

11/18/2018 ∙ by Soyeon Park, et al. ∙ 0

Intel memory protection keys (MPK) is a new hardware feature to support thread-local permission control on groups of pages without requiring modification of page tables. Unfortunately, its current hardware implementation and software support suffer from security, scalability, and semantic-gap problems: (1) MPK is vulnerable to protection-key-use-after-free and protection-key corruption; (2) MPK does not scale due to hardware limitations; and (3) MPK is not perfectly compatible with mprotect() because it does not support permission synchronization across threads. In this paper, we propose libmpk, a software abstraction for MPK. libmpk virtualizes protection keys to eliminate the protection-key-use-after-free and protection-key corruption problems while supporting a tremendous number of memory page groups. libmpk also prevents unauthorized writes to its metadata and supports inter-thread key synchronization. We apply libmpk to three real-world applications: OpenSSL, JavaScript JIT compiler, and Memcached for memory protection and isolation. An evaluation shows that libmpk introduces negligible performance overhead (<1%) compared with insecure versions, and improves their performance by 8.1x over secure equivalents using mprotect(). The source code of libmpk will be publicly available and maintained as an open source project.


1 Introduction

Operating systems (OSs) rely on the memory management unit (MMU) to define and enforce a process’s access right to a memory page. The page table maintains a page table entry (PTE) for each page specifying the permission, e.g., readable, writable, or executable, and the MMU checks it to determine the legitimacy of each memory access. OSs can change the permission by updating PTEs and flushing the translation lookaside buffer (TLB) to reload the updated PTEs into the TLB.

Alternatively, some instruction set architectures (ISAs), e.g., ARM [3] and IBM Power [18], allow OSs to assign a key to each page, and define the access rights of a process using a CPU register. This effectively classifies the memory into multiple groups with the same access rights. Changing a process’s access right to a group of pages is as easy as updating a CPU register, which commonly takes tens of cycles at most.

Maintaining page access rights is important to prevent attackers from accessing and manipulating arbitrary memory locations, and thus to eliminate information leakage and memory corruption attacks. An arbitrary read vulnerability in an application can leak the application’s sensitive information kept in memory. For example, the Heartbleed vulnerability of the OpenSSL library [27] allows attackers to steal the sensitive data of applications using the library, including cryptographic private keys and passwords. Further, protecting control data from memory corruption attacks helps to prevent arbitrary code execution. Using information leakage vulnerabilities, attackers aim to find the locations of control data (e.g., return addresses and virtual function tables) and corrupt them with arbitrary write vulnerabilities to carry out control-flow hijacking attacks [36].

A number of software- or hardware-based mechanisms have been proposed to ensure the confidentiality and integrity of the memory. Some of them [31, 8, 32] are designed to prevent a class of attacks completely. However, because such mechanisms are costly, others [7, 38, 16, 29, 13] focus on minimizing the impact of a single vulnerability by isolating only a small, critical portion of code and data; these are practical but have coverage and/or scalability problems.

Recently, Intel deployed a hardware feature known as memory protection keys (MPK). MPK has three advantages over page-table-based mechanisms: (1) performance, (2) group-wise control, and (3) per-thread view. First, MPK utilizes a protection key rights register (PKRU) to maintain the access rights of individual protection keys associated with specific memory pages: read/write, read-only, or no access. Page-table-based mechanisms must flush the TLB and update kernel-level memory-management data structures (virtual memory areas, VMAs) to change access rights, taking more than 1,000 cycles. In contrast, MPK only requires executing a non-privileged instruction, WRPKRU, to update PKRU, taking around 20 cycles (§ 2.3). In addition, MPK enables execute-only memory [24] because its access rights are orthogonal to whether the page is executable.

Second, MPK supports up to 16 different protection keys; that is, there can be up to 16 different memory page groups consisting of memory pages with the same protection keys. (Footnote 1: The default group (0) is special, so only 15 groups are effective in general.) The pages that compose a group do not need to be contiguous, and their access right can be changed at once. This group-wise control allows developers to change access rights to memory page groups in a fine-grained manner according to the types and contexts of data stored in them. For example, a server application can associate different memory page groups with different sessions or clients to protect them individually.

Third, MPK allows each thread (i.e., each hyperthread) to have a unique PKRU, realizing per-thread memory view. Even if two threads share the same address space, their access rights to the same page can differ (e.g., read/write versus read-only).

Unfortunately, we find that the current hardware implementation of Intel MPK and its standard library and kernel support suffer from (1) security, (2) scalability, and (3) semantic-gap problems, hindering its widespread adoption. First, MPK suffers from protection-key-use-after-free and protection-key corruption problems. The protection key assigned to a page group can be reused after deallocation via the pkey_free() system call. However, pkey_free() does not reset the protection key field of the pages associated with the deallocated key, resulting in ambiguity when the deallocated key is reassigned to a different group using pkey_alloc(). Also, attackers who can exploit an arbitrary write vulnerability can corrupt MPK protection keys stored in a variable to manipulate an application into changing the permission of a target page group.

Second, MPK does not scale because its PKRU can manage up to 16 protection keys due to a hardware restriction. When an application tries to allocate more than 16 protection keys, pkey_alloc() just returns an error, implying that the application itself should implement its own mechanism if it has to deal with more than 16 memory page groups.

Third, MPK has a semantic gap with the conventional mprotect() because, unlike mprotect(), which works at the process level, MPK works at the thread level, which results in potential security and performance problems. For example, the Linux kernel implements an execute-only memory feature with MPK: mprotect(addr, len, PROT_EXEC). However, this MPK-based execute-only memory does not consider inter-thread permission synchronization, which mprotect() should ensure by nature. As a result, another thread that did not execute mprotect() may still be able to read the execute-only memory. Further, since some applications still assume a process-level memory permission model, they cannot benefit from MPK’s efficiency and group-wise control unless they synchronize access rights across threads by themselves.

In this paper, we propose libmpk, a secure, scalable, and semantic-gap-mitigated software abstraction to fully utilize MPK in a practical manner. libmpk implements (1) protection key virtualization to eliminate the protection-key-use-after-free problem and to support a virtually infinite number of memory page groups, (2) metadata protection to prevent attackers from tricking MPK by corrupting the protection keys in memory, and (3) inter-thread key synchronization to ensure the semantics of mprotect() with MPK.

First, libmpk provides virtual protection keys to applications to hide real hardware keys from them. This design avoids the protection-key-use-after-free problem by scheduling the mappings between virtual protection keys and hardware protection keys. With the virtual key scheduling, libmpk scales MPK to support a virtually infinite number of protection keys with the same semantics: group-wise control and per-thread view.

Second, libmpk protects its metadata from corruption. libmpk makes all virtual protection keys read-only by hardcoding them into the code and enforcing direct calls. All other important metadata, including the mappings between virtual and hardware keys, is maintained in the kernel to avoid corruption. In addition, libmpk is designed across the layers, kernel and user, to protect its internal metadata from malicious overwrites through privilege separation while minimizing the performance cost by avoiding unnecessary system calls.

Third, libmpk provides an efficient inter-thread key synchronization technique so that MPK can serve as an efficient alternative to mprotect() with the same semantics. It is 1.7–3.8x faster than mprotect() as the number of 4KB pages varies from 1 to 1,000 (§ 6.2). This performance improvement comes from our lazy PKRU synchronization technique and the absence of TLB flushes and VMA updates.

To show the effectiveness and practicality of libmpk, we apply it to three real-world applications: OpenSSL library, JavaScript just-in-time (JIT) compiler, and Memcached. First, we modify the OpenSSL library to create secure memory pages for storing cryptographic keys to mitigate information leakage. Second, we modify three JavaScript JIT compilers (SpiderMonkey, ChakraCore, and V8) to protect the code cache from memory corruption by enforcing the W⊕X security policy. Third, we modify Memcached to secure almost all its data, including the slab and hash table, whose size can be several gigabytes. The evaluation results show that libmpk and its applications have negligible overhead (<1%).

We summarize the contributions of this paper as follows:

  • Comprehensive study. We study the design, functionality, and characteristics of Intel MPK in detail. Further, we identify the critical challenges of utilizing MPK: security (protection-key-use-after-free and protection-key corruption), scalability (a limited number of hardware protection keys), and semantic difference (thread view versus process view).

  • Software abstraction. We design and implement libmpk, a software abstraction to fully utilize MPK. The protection key virtualization, metadata protection, and inter-thread key synchronization of libmpk allow applications to effectively overcome the three challenges.

  • Case studies. We apply libmpk to three applications, OpenSSL library, JavaScript JIT compiler, and Memcached, to show its effectiveness and practicality. libmpk secures all of them with a few modifications and negligible overhead.

Organization. § 2 explains the current hardware and software support for MPK. § 3 describes the limitations of MPK. § 4 depicts the design of libmpk to effectively resolve all the explained problems. § 5 shows real-world applications that can benefit from libmpk to improve their security. § 6 evaluates the security and performance characteristics of libmpk and the applications. § 7 discusses the limitations of libmpk and possible approaches to overcome them. § 8 introduces related work and § 9 concludes this paper.

2 Intel MPK Explained

In this section, we describe the hardware design of Intel MPK and current kernel and library support. Also, we check the performance characteristics of MPK to show its efficiency.

Figure 1: An example showing how MPK checks the permission of a logical core (hyperthread) on a specific memory page according to PKRU and page permissions. The intersection of the permissions determines whether a data access will be allowed. An instruction fetch is independent of the PKRU.

2.1 Hardware Primitives

Intel MPK updates the permission of a group of pages by associating a protection key to the group and changing the access rights of the protection key instead of individual memory pages (Figure 1). We explain the hardware primitives of MPK.

Protection key field in page table entry. MPK assigns a unique protection key to a memory page group so that the permission of all its pages can be updated at once. Intel CPUs with MPK utilize four previously unused bits of each page table entry (bits 59 to 62) to store a memory page’s corresponding key value. Thus, MPK supports up to 16 different page groups. Since only supervisor code can access and change PTEs, the Linux kernel (from version 4.6) provides a new system call, pkey_mprotect(), to allow applications to assign or change the keys of their memory pages (§ 2.2).

Protection key rights register (PKRU). A CPU with MPK uses the value of PKRU to determine its access right to each page group. Two bits per key represent the right: the access disable (AD) and write disable (WD) bits. The value of (AD, WD) represents a thread’s permission to a page group: read/write (0, 0), read-only (0, 1), or no access (1, x). PKRU exists for each hyperthread to provide a per-thread view.
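The PKRU encoding can be sketched as plain bit arithmetic. The sketch below assumes the layout documented by Intel, in which protection key i owns bit 2i (AD) and bit 2i+1 (WD) of the 32-bit PKRU; the constant and function names are hypothetical and do not touch the real register. It also models the check of Figure 1 for data accesses, where both the page-table permission and the PKRU must allow the access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit layout is taken from Intel's documentation. */
#define PKRU_AD   0x1u   /* access disable */
#define PKRU_WD   0x2u   /* write disable  */
#define NUM_PKEYS 16u

/* Return a copy of pkru in which `key` carries the given (AD, WD) bits. */
static uint32_t pkru_with_rights(uint32_t pkru, unsigned key, uint32_t rights)
{
    assert(key < NUM_PKEYS && rights <= (PKRU_AD | PKRU_WD));
    unsigned shift = 2u * key;            /* key i owns bits 2i and 2i+1 */
    pkru &= ~(0x3u << shift);
    return pkru | (rights << shift);
}

/* Extract the (AD, WD) bit pair of `key`. */
static uint32_t pkru_rights(uint32_t pkru, unsigned key)
{
    assert(key < NUM_PKEYS);
    return (pkru >> (2u * key)) & 0x3u;
}

/* Data access to a readable page: page permission and PKRU must both allow it. */
static bool data_access_allowed(bool page_writable, uint32_t pkru,
                                unsigned key, bool is_write)
{
    uint32_t r = pkru_rights(pkru, key);
    if (r & PKRU_AD)
        return false;                     /* no access at all */
    if (is_write)
        return page_writable && !(r & PKRU_WD);
    return true;                          /* read is allowed */
}
```

For instance, pkru_with_rights(0, 1, PKRU_WD) sets only bit 3, which makes the group of key 1 read-only for the thread that loads this value into its PKRU.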

Instruction set. MPK introduces two new instructions to manage the PKRU: (1) WRPKRU to update the protection information of the PKRU and (2) RDPKRU to retrieve the current protection information from the PKRU. WRPKRU uses three registers as input: the EAX register containing new protection information to overwrite the PKRU, and the other two registers, ECX and EDX, filled with zeroes. RDPKRU also uses the three registers for its operation: it returns the current PKRU value via the EAX register while overwriting the EDX register with 0. The ECX register also should be filled with 0 to execute RDPKRU correctly. Note that, currently, Intel does not document why WRPKRU and RDPKRU use zeroed ECX and EDX registers during their execution.

2.2 Kernel Integration and Standard APIs

The Linux kernel has supported MPK since version 4.6, and glibc has supported it since version 2.27. They focus on how to manage protection keys and how to assign them to particular PTEs. The Linux kernel provides three new system calls: pkey_mprotect(), pkey_alloc(), and pkey_free(). It also changes the behavior of mprotect() to provide execute-only memory. Further, glibc provides two userspace functions, pkey_get() and pkey_set(), to retrieve and update the access rights of a given protection key. Table 1 summarizes the APIs.

Name                 Cycles     Description
pkey_alloc()         186.3      Allocate a new pkey
pkey_free()          137.2      Deallocate a pkey
pkey_mprotect()      1,104.9    Associate a pkey with memory pages
pkey_get()/RDPKRU    0.5        Get the access right of a pkey
pkey_set()/WRPKRU    23.3       Update the access right of a pkey
Ref. mprotect(): 1,094.0 / MOVQ (rbx to rdx): 0.0 / MOVQ (rdx to xmm): 2.09
Table 1: Overhead of MPK instructions, system calls, and standard library APIs. Ref. shows the overhead of mprotect() and normal register move instructions for comparison. Each component is repeated 10 million times and the microbenchmarks are executed 10 times.

pkey_mprotect(). The pkey_mprotect() system call extends the mprotect() system call to associate a protection key with the PTEs of a specified memory region while changing its page protection flag. Interestingly, pkey_mprotect() does not allow a user thread to reset a protection key to zero, the default protection key value assigned to newly created memory pages. We anticipate this is in order to minimize the misuse of MPK, i.e., denying access to new pages could result in an application crash.

pkey_alloc() and pkey_free(). The Linux kernel provides two other new system calls to allocate and deallocate memory protection keys: pkey_alloc() and pkey_free(). When a user thread invokes pkey_alloc() with an access right, the kernel allocates and returns a protection key with the corresponding permission according to a 16-bit bitmap that tracks which protection keys are allocated. When a user thread invokes pkey_free(), the kernel simply marks the freed key as available in the bitmap. The pkey_mprotect() function examines the bitmap afterward to prohibit the use of non-allocated keys.

Execute-only memory. The Linux kernel supports execute-only memory with MPK. If a user thread invokes mprotect() with only PROT_EXEC, the kernel (1) allocates a new protection key, (2) disables the read and write permission of the key, and (3) assigns the key to the given memory region.

2.3 Quantifying Characteristics of Intel MPK

To evaluate the overhead and benefits of MPK, we measure (1) the overhead of the MPK instructions, (2) the overhead of the MPK system calls, and (3) the overhead of mprotect() for contiguous memory and sparse memory.

Environment. Our system consists of two Intel Xeon Gold 5115 CPUs (each CPU has 20 logical cores at 2.4GHz) and 192GB of memory. Linux kernel version 4.14 configured for MPK is installed on this system.

Instruction latency. We measure the latency of RDPKRU and WRPKRU to identify their micro-architectural characteristics. Table 1 summarizes the results. The latency of RDPKRU is similar to that of reading a general register, but the latency of WRPKRU is high. We anticipate that WRPKRU performs serialization (e.g., pipeline flushing) to avoid potential memory access violations due to out-of-order execution. To confirm this, we insert a varying number of ADD instructions before (W1) and after (W2) WRPKRU and measure the overall latency (Figure 2). The results show that W2 is always slower than W1, implying that the instructions executed right after WRPKRU fail to benefit from out-of-order execution due to the serialization.

Figure 2: Effect of WRPKRU serialization on simple (i.e., ADD) instructions either preceding (W1) or succeeding (W2) WRPKRU, plotted against the number of instructions (average of 10 million repetitions).

System calls. We measure the latency of the four Linux system calls for MPK (Table 1). The latency of mprotect() and pkey_mprotect() on a 4 KB page is almost the same because they use the same function do_mprotect_pkey() internally. pkey_alloc() and pkey_free() are fast since they involve only simple operations in the kernel, and the domain switching between kernel and userspace dominates their time costs.

Contiguous versus sparse memory pages. Using MPK to change page permission only involves an update on the PKRU and thus is independent of the number of targeted pages and their sparseness. To show the performance benefit of MPK over mprotect(), we check how the number and sparseness of the targeted pages affect the performance of mprotect(). To construct contiguous memory pages, we call mmap() once with a given memory size. For sparse memory pages, we call mmap() several times, each with a size of one page. Figure 3 shows that the overhead of mprotect() increases in proportion to the number of pages. This is because the number of pages affects how many VMAs mprotect() needs to look up for the permission update. Moreover, the overhead of mprotect() becomes high when it is invoked on sparse memory pages because we need to call mprotect() separately for each of them, introducing frequent context switches between kernel and userspace.

Figure 3: Overhead of mprotect() on contiguous and sparse memory as a function of the number of pages (average cost of 10 million repetitions). Protecting contiguous pages takes less time than protecting sparse pages.

Summary. Intel MPK allows a thread to rapidly change the per-thread access rights to a group of pages associated with the same protection key by updating a thread-local register, PKRU, which takes only around 20 cycles. Unlike mprotect(), its performance is independent of the number of pages composing a group and their sparseness.

3 Challenges of Utilizing Intel MPK

We have seen that Intel MPK can protect sensitive data from unintended memory access and expedite mprotect() for large memory (§ 2). However, several challenges, induced by either MPK’s inherent features or the current software interfaces, stand in the way of using it in real-world applications. In this section, we explain the challenges of using MPK in terms of security, scalability, and synchronization.

3.1 Potential Security Problems

Currently, MPK suffers from two potential security problems caused by its interface design.

Protection key use after free. pkey_free() just removes a protection key from the key bitmap and does not update the corresponding PTEs. Once a key is freed by pkey_free(), the kernel will hand it out again regardless of whether it is still associated with some pages. If a program obtains, through pkey_alloc(), a key that is still associated with some memory pages, the new page group will include pages that it is not supposed to. A developer can run into this vulnerable situation unknowingly because the current kernel implementation neither handles it automatically nor checks whether a freed key is still associated with some pages. The developer community has also recognized the problem and recommends not freeing protection keys [12, 1]. Handling this problem superficially (i.e., wiping protection keys in PTEs) without a fundamental design change to memory management in the kernel would introduce huge performance overhead, because it requires traversing the page table and VMAs to find and update the entries associated with a freed key, and flushing all corresponding TLB entries.
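The use-after-free scenario can be reproduced in a toy model. The sketch below is not kernel code: it keeps a 16-bit allocation bitmap and a per-page key field, and its toy_* names are hypothetical. It assumes that allocation hands out the lowest free key and that freeing only clears the bitmap bit, as described above.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PKEYS 16
#define NUM_PAGES 8

/* Toy model of the kernel state; every name here is hypothetical. */
struct toy_mm {
    uint16_t alloc_bitmap;          /* bit i set => key i is allocated */
    uint8_t  page_pkey[NUM_PAGES];  /* key stored in each page's PTE   */
};

static void toy_init(struct toy_mm *mm)
{
    mm->alloc_bitmap = 1u;          /* key 0 is the default key */
    for (int i = 0; i < NUM_PAGES; i++)
        mm->page_pkey[i] = 0;
}

/* Like pkey_alloc(): return the lowest free key, or -1 if none is left. */
static int toy_pkey_alloc(struct toy_mm *mm)
{
    for (int k = 1; k < NUM_PKEYS; k++) {
        if (!(mm->alloc_bitmap & (1u << k))) {
            mm->alloc_bitmap |= (uint16_t)(1u << k);
            return k;
        }
    }
    return -1;
}

/* Like pkey_free(): clear the bitmap bit only; the PTEs keep the old key. */
static void toy_pkey_free(struct toy_mm *mm, int key)
{
    assert(key > 0 && key < NUM_PKEYS);
    mm->alloc_bitmap &= (uint16_t)~(1u << key);
}

/* Like pkey_mprotect(): tag a page with an allocated key. */
static int toy_pkey_mprotect(struct toy_mm *mm, int page, int key)
{
    assert(page >= 0 && page < NUM_PAGES && key >= 0 && key < NUM_PKEYS);
    if (!(mm->alloc_bitmap & (1u << key)))
        return -1;
    mm->page_pkey[page] = (uint8_t)key;
    return 0;
}

/* Count the pages currently tagged with `key`. */
static int toy_group_size(const struct toy_mm *mm, int key)
{
    int n = 0;
    for (int i = 0; i < NUM_PAGES; i++)
        n += (mm->page_pkey[i] == key);
    return n;
}
```

In this model, tagging page 0 with a fresh key, freeing the key, allocating again, and tagging page 5 leaves toy_group_size() at 2: the stale page 0 silently joins the new group.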

Protection key corruption. The existing OS support makes developers store allocated protection keys in the application’s memory. This design leaves allocated protection keys vulnerable to corruption. For example, after obtaining a key by invoking pkey_alloc(), an application stores the key in its memory to use it later when it assigns the key to more pages with pkey_mprotect() or switches the permission using WRPKRU. An attacker who manages to corrupt such a key could manipulate a piece of code that has access to pages with one key into corrupting pages with another key.

3.2 Limited Hardware Resources

As Intel MPK relies on the PKRU register for permission changes, it currently supports only 16 keys. It is the responsibility of developers to ensure that an application never creates more than 16 page groups at the same time. This implies that developers have to examine the number of active page groups at runtime, counting those used by both the application itself and the third-party libraries it depends on. Otherwise, the program may fail to properly benefit from MPK. This issue undermines the usability of MPK and discourages developers from utilizing it actively. A hardware solution such as adopting a larger register (e.g., one as wide as an AVX register) does not scale well either, because MPK stores the key in unused bits of each PTE. For example, 512 protection keys would demand 9 bits in page table and TLB entries, requiring enlarged entries or fewer address bits.
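The arithmetic behind this limit can be checked with a short sketch. It assumes, as in the current design, that each key needs two rights bits (AD and WD) in the rights register and that every PTE must hold enough bits to name any key; the function names are hypothetical.

```c
#include <assert.h>

/* Bits needed in each page table entry to name one of `nkeys` keys. */
static unsigned pte_key_bits(unsigned nkeys)
{
    assert(nkeys >= 1 && nkeys <= (1u << 31));
    unsigned bits = 0;
    while ((1u << bits) < nkeys)
        bits++;
    return bits;
}

/* Bits needed in a PKRU-like register that holds (AD, WD) for every key. */
static unsigned rights_register_bits(unsigned nkeys)
{
    return 2u * nkeys;
}
```

With 16 keys this gives 4 PTE bits and a 32-bit register, which matches today's PKRU; with 512 keys it gives 9 PTE bits and a 1,024-bit register.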

3.3 Semantic Differences

To change the permission of any page group, MPK modifies the value of the PKRU. However, the value is only effective in a single thread because the PKRU, as a register, is inherently thread-local. As a result, different threads in a process can have different permissions for the same page group. This thread-local nature helps to improve security for applications that require isolation of memory access among different threads, but it makes the semantics of a PKRU change differ from those of mprotect(). mprotect() semantically guarantees that page permissions are synchronized among all threads in a process, a guarantee on which some applications rely. This difference not only makes it difficult to accelerate mprotect() with MPK, but also breaks the guarantee of the execute-only memory implemented on top of mprotect() in recent kernels: the MPK-based execute-only mprotect() does not perform the synchronization among threads that developers expect from mprotect(). Even when the kernel successfully allocates a key for the execute-only page, another thread might still have read access to it due to the lack of synchronization. To make MPK a drop-in replacement for mprotect() in terms of both security and usability, developers need to synchronize the PKRU values among all the threads.
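The semantic gap can be illustrated with a toy model that keeps one PKRU value per thread. This is a sketch under stated assumptions: the structure and function names are hypothetical, no real registers or threads are involved, and the bit layout (two bits per key, AD in the lower bit) is the one documented by Intel.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_THREADS 4
#define PKRU_AD 0x1u   /* access disable */

/* Toy process: one PKRU value per thread (hypothetical model). */
struct toy_process {
    uint32_t pkru[NUM_THREADS];
};

static uint32_t set_rights(uint32_t pkru, unsigned key, uint32_t rights)
{
    unsigned shift = 2u * key;
    return (pkru & ~(0x3u << shift)) | (rights << shift);
}

/* What WRPKRU does: only the calling thread's view changes. */
static void thread_local_update(struct toy_process *p, unsigned tid,
                                unsigned key, uint32_t rights)
{
    assert(tid < NUM_THREADS && key < 16 && rights <= 0x3u);
    p->pkru[tid] = set_rights(p->pkru[tid], key, rights);
}

/* What mprotect()-like semantics require: every thread sees the change. */
static void process_wide_update(struct toy_process *p, unsigned key,
                                uint32_t rights)
{
    assert(key < 16 && rights <= 0x3u);
    for (unsigned t = 0; t < NUM_THREADS; t++)
        p->pkru[t] = set_rights(p->pkru[t], key, rights);
}

/* Nonzero if thread `tid` may read pages tagged with `key`. */
static int thread_can_read(const struct toy_process *p, unsigned tid,
                           unsigned key)
{
    assert(tid < NUM_THREADS && key < 16);
    return !((p->pkru[tid] >> (2u * key)) & PKRU_AD);
}
```

After thread 0 disables access to key 5 with thread_local_update(), thread 1 can still read the group; only process_wide_update() gives the behavior that callers of mprotect() expect.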

4 Software Abstraction of Intel MPK

libmpk provides a secure and usable abstraction for MPK by overcoming the challenges (§ 3). A developer can use MPK easily by either adding calls to libmpk APIs or replacing existing mprotect() calls with those of libmpk. By decoupling the protection keys from APIs, libmpk is immunized against protection-key-use-after-free and protection-key corruption. Also, libmpk allows an application to create more than 16 page groups by virtualizing the protection keys, and provides a light-weight inter-thread PKRU synchronization mechanism. Figure 4 illustrates an overview of libmpk. The current version of libmpk consists of 1.5K lines of code in total.

Threat model. We assume an adversary who can corrupt non-control user data. Arbitrary execution is beyond our scope because it allows the adversary to request system calls to directly manipulate permission.

Goals. To utilize MPK for domain-based isolation and as a substitute for mprotect(), we have to overcome the three challenges: (1) security problems due to insecure key management, (2) hardware resource limitations, and (3) different semantics from mprotect(). libmpk adopts three approaches, (1) key virtualization, (2) metadata protection, and (3) inter-thread key synchronization, that effectively solve the challenges.

Figure 4: libmpk overview. mpk_init() pre-allocates hardware keys and initializes the metadata table. mpk_mmap() creates a page group with metadata, and mpk_munmap() destroys the page group and the corresponding metadata. mpk_begin() and mpk_end() provide domain-based thread-local isolation. mpk_mprotect() synchronizes permission changes globally.

Name             Arguments                                   Description
mpk_init()       evict_rate                                  Initialize libmpk with an eviction rate
mpk_mmap()       vkey, addr, len, prot, flags, fd, offset    Allocate a page group for a virtual key
mpk_munmap()     vkey                                        Unmap all pages related to a given virtual key
mpk_begin()      vkey, prot                                  Obtain thread-local permission for a page group
mpk_end()        vkey                                        Release the permission for a page group
mpk_mprotect()   vkey, prot                                  Change the permission for a page group globally
mpk_malloc()     vkey, size                                  Allocate a memory chunk from a page group
mpk_free()       size                                        Free a memory chunk allocated by mpk_malloc()
Table 2: libmpk APIs.

4.1 libmpk API

libmpk provides the eight APIs shown in Table 2. To utilize libmpk, an application first calls mpk_init() to obtain all the hardware protection keys from the kernel and initialize its metadata. mpk_mmap() allocates a page group for a virtual key, which should be a constant integer that the developer passes. mpk_munmap() destroys a page group by freeing the group’s virtual key and unmapping all of its pages. libmpk maintains the mappings between virtual keys and pages to avoid scanning all pages at this destruction step. On top of these, libmpk also provides a simple heap over each page group (mpk_malloc() and mpk_free()), so that a developer can also use one or more page groups to create a heap memory region for sensitive data.

    #define GROUP_1 100
    #define GROUP_2 101

    int domain_based_isolation() {
        mpk_init(-1);  // default eviction rate: 100%
        char *addr = (char *)mpk_mmap(GROUP_1, NULL, 0x1000,
                                      PROT_READ | PROT_WRITE,
                                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        // page permission: rw- & pkey permission: --

        mpk_begin(GROUP_1, PROT_READ | PROT_WRITE);
        // page permission: rw- & pkey permission: rw

        // write data in GROUP_1

        mpk_end(GROUP_1);
        // page permission: rw- & pkey permission: --

        printf("%s\n", addr);  // SEGMENTATION FAULT
    }

    int quick_permission_change() {
        mpk_init(0.5);  // set cache eviction rate: 50%
        void *addr = mpk_mmap(GROUP_2, NULL, 0x1000,
                              PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        // page permission: rw- & pkey permission: --

        mpk_mprotect(GROUP_2, PROT_READ | PROT_WRITE | PROT_EXEC);
        // page permission: rwx & pkey permission: rw
    }

Figure 5: Example code for libmpk APIs.

libmpk provides two usage models for developers. The first model, a thread-local domain-based isolation model, allows an application to temporarily unlock a page group only for the calling thread. mpk_begin() and mpk_end() are the APIs for this model which make a page group accessible and inaccessible, respectively. The second model allows an application to quickly change the access rights to a page group, by replacing mprotect() with mpk_mprotect(). Figure 5 is example code showing how to use libmpk APIs.

4.2 Protection Key Virtualization

libmpk enables an application to create more than 16 page groups by virtualizing the hardware protection keys. When an application creates a new page group by calling mpk_mmap(), the virtual key passed as an argument is associated with newly allocated metadata for the new group. The application uses the virtual key to obtain or release the permission, or to free the group, and is prohibited from manipulating hardware keys. The exact hardware key that a page is associated with is hidden from the program and the developer.

libmpk handles the mappings between virtual keys and hardware keys like a cache (Figure 6). If a virtual key is already associated with a hardware key, the virtual key is in the cache, so further accesses with it incur little latency. If the virtual key is not associated with a hardware key, libmpk either evicts another virtual key or, to avoid the cost of eviction, leaves the cache untouched and simply calls mprotect() to change the permission. How often libmpk evicts rather than calls mprotect() is determined by the eviction rate. This cache structure makes a virtual key whose permission changes frequently likely to stay mapped to a hardware key.

libmpk provides two policies to determine the mappings between virtual and hardware keys. When an application unlocks a page group thread-locally by calling mpk_begin(), libmpk always maps the group’s virtual key to a hardware key and uses it to grant access to the calling thread. libmpk maintains the mapping until the thread calls mpk_end() to release the access. Because the mapping is pinned for that interval and hardware keys are limited, libmpk cannot guarantee that a calling thread always obtains access. That is, if all hardware keys are actively used, libmpk is no longer able to provide any key. In this case, mpk_begin() raises an exception and lets the calling thread handle it (e.g., by sleeping until a key is available). If a page group is not in use by any thread, libmpk can evict the group by changing its protection key to 0 (default) and revoking its page permission to disallow subsequent accesses. Unlike mpk_mprotect(), mpk_begin() must evict a page group when every key is taken, so that the requested page group is permitted thread-locally.

The second policy, used by mpk_mprotect(), also needs to map the virtual key to a hardware key, but not exclusively. Even when the page group is accessible, libmpk can unmap a hardware key and rely solely on the page attributes because all threads have the access. Hence, libmpk maps only the page groups whose access rights change frequently. If libmpk fails to find an available hardware key when it handles mpk_mprotect(), it unmaps the least recently used (LRU) key and reuses it. The hardware key of the evicted page group becomes 0. To avoid excessive overhead due to frequent unmapping, a developer can configure an eviction rate to control whether a hardware key has to be evicted according to how frequently its permission is updated. In our approach, enforcing execute-only permission is not straightforward, because the conventional approach (i.e., mprotect()) does not support execute-only permission. Therefore, mpk_mprotect() reserves one key for execute-only pages when an application first creates them, and does not evict this key until all execute-only pages disappear. Every incoming execute-only permission request is thus guaranteed a hardware protection key. If execute-only page groups already exist, further execute-only permission requests merge the incoming page groups with the existing ones to share the reserved key.

(a) Hit case: (1) A thread calls mpk_begin() or mpk_mprotect() with a vkey; (2) libmpk returns the corresponding pkey immediately.
(b) Miss case: (1) A thread retrieves a vkey, but no corresponding pkey exists (pkey=0); (2) libmpk evicts the LRU pkey. In addition, mpk_begin() updates the page permission of the evicted and loaded page groups using mprotect(); (3) libmpk returns the new pkey.
Figure 6: Key virtualization in libmpk. vkey and pkey represent a virtual key and its corresponding hardware protection key associated with a page group. #threads indicates the number of threads running in parallel inside a particular domain.
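
The hit and miss cases above can be sketched as a small LRU cache. This is only an illustrative Python model, not libmpk's implementation: the class and method names (KeyCache, lookup, unpin) are hypothetical, the figure of 15 usable hardware keys assumes key 0 is kept as the default, and the pin flag stands in for the exclusive mapping that mpk_begin() holds until mpk_end().

```python
from collections import OrderedDict

NUM_HW_KEYS = 15  # assumption: 16 hardware keys, with key 0 kept as the default


class KeyCache:
    """Toy model of a vkey -> pkey cache with LRU eviction (hypothetical names)."""

    def __init__(self, num_hw_keys=NUM_HW_KEYS):
        self.free = list(range(1, num_hw_keys + 1))  # unused hardware keys
        self.mapping = OrderedDict()                 # vkey -> pkey, least recent first
        self.pinned = set()                          # vkeys held exclusively (mpk_begin-like)

    def lookup(self, vkey, pin=False):
        """Return (pkey, hit). Raises RuntimeError if every key is pinned."""
        if vkey in self.mapping:                     # hit case
            self.mapping.move_to_end(vkey)
            hit = True
        else:                                        # miss case
            if self.free:
                pkey = self.free.pop(0)
            else:
                victim = next((v for v in self.mapping if v not in self.pinned), None)
                if victim is None:
                    raise RuntimeError("no hardware key available")
                pkey = self.mapping.pop(victim)      # evicted group falls back to key 0
            self.mapping[vkey] = pkey
            hit = False
        if pin:
            self.pinned.add(vkey)
        return self.mapping[vkey], hit

    def unpin(self, vkey):
        """Release an exclusive mapping (mpk_end-like) so it may be evicted again."""
        self.pinned.discard(vkey)
```

In this model a non-pinned lookup never fails, which mirrors the mpk_mprotect() policy, while a pinned lookup fails once every key is held, which mirrors the exception raised by mpk_begin().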

4.3 Metadata Protection

None of libmpk’s metadata should be corruptible by an attacker. The first type of metadata is the virtual key that an application uses to call libmpk’s API functions. libmpk assumes those virtual keys are hardcoded in the application binary, and the application never uses indirect calls to access libmpk’s APIs. libmpk verifies this by checking the binary at load time and the call site upon each invocation, ensuring that all direct invocations of libmpk use hardcoded virtual keys.

The second one is libmpk’s internal metadata: the mappings between virtual keys and hardware keys, and the page group information. To protect these from malicious overwrites, libmpk maps one physical page into two virtual pages: a read-only page for the application and a writable page for the kernel, and stores the metadata in that page. libmpk slightly modifies existing system calls (e.g., mmap(), munmap(), and mprotect()) to manage the metadata in the kernel. Except for metadata management, all of libmpk’s management logic is located in userspace to minimize unnecessary overhead from domain switching between kernel and userspace.
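
The idea of exposing the same storage through a writable view and a read-only view can be illustrated in userspace with Python's mmap module. This is only an analogy under stated assumptions: both views here live in one process and are backed by a temporary file, whereas libmpk's writable view belongs to the kernel; the variable names are ours.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE

with tempfile.TemporaryFile() as f:
    f.truncate(PAGE)                       # one page of backing storage
    fd = f.fileno()
    writer = mmap.mmap(fd, PAGE, access=mmap.ACCESS_WRITE)  # trusted writer's view
    reader = mmap.mmap(fd, PAGE, access=mmap.ACCESS_READ)   # application's view

    writer[0:4] = b"vk=7"                  # update through the writable mapping
    seen = bytes(reader[0:4])              # the update is visible in the read-only view

    try:
        reader[0:4] = b"evil"              # writes through the read-only view are refused
        blocked = False
    except TypeError:
        blocked = True

    reader.close()
    writer.close()
```

The point of the sketch is only that a single copy of the metadata suffices: the reader needs no synchronization step to observe updates, and it has no way to modify them through its own mapping.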

Figure 7: PKRU synchronization: (1) mpk_mprotect() calls do_pkey_sync() to update the PKRU values of remote threads; (2) do_pkey_sync() adds hooks to the threads’ task_work; (3) do_pkey_sync() kicks all the running threads for synchronization; (4) do_pkey_sync() returns to its caller; (5) the threads update their PKRUs when they are scheduled to run.

4.4 Inter-thread Key Synchronization

libmpk implements an inter-thread PKRU synchronization technique, do_pkey_sync(), for mpk_mprotect() for two purposes: (1) to ensure no thread has read access to an execute-only page and (2) to replace the existing page-table-based mprotect() for performance. do_pkey_sync() guarantees that a PKRU update is globally visible and effective as soon as it returns. Intuitively, this requires synchronous inter-thread communication; the calling thread needs to send messages to the other threads and wait until they update the PKRU values and acknowledge it, which is costly.

We minimize the inter-thread PKRU synchronization latency in a lazy manner, leveraging the fact that the PKRU values are used only in userspace. If a remote thread is not currently scheduled, it does not need to receive the message immediately. Even if the thread is currently scheduled, it is enough for the thread to pick up the new PKRU values when it returns to userspace. If the calling thread can create a hook that the other threads will invoke right before jumping back to userspace and ensure that they are not in userspace, we can guarantee that all the other threads have the new PKRU values when do_pkey_sync() returns.

Figure 7 illustrates the overall procedure of mpk_mprotect(). do_pkey_sync() utilizes an existing hooking point in the Linux kernel to force the remote threads to update the PKRU values right before returning to userspace and ensures that all threads use the new PKRU values by sending rescheduling interrupts. In Linux, a thread can have a list of callback functions (task_work) that it will invoke at designated points, including the return to userspace. A thread can register a callback for another by calling task_work_add(). In this way, do_pkey_sync() lets the remote threads update PKRU values lazily. Although do_pkey_sync() still needs to send inter-processor interrupts to ensure that no other thread uses the old PKRU value after a certain point, our evaluation shows that the overall latency of mpk_mprotect() is shorter than that of mprotect() (§ 6.2).
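
The lazy scheme can be modeled in a few lines. The following Python sketch is a simulation under simplifying assumptions, not kernel code: the Thread class, its fields, and the Python do_pkey_sync() function are hypothetical stand-ins, and "kicking" a thread is reduced to marking it as having left userspace.

```python
class Thread:
    """Hypothetical stand-in for a kernel task with a per-thread PKRU value."""

    def __init__(self, tid, pkru):
        self.tid = tid
        self.pkru = pkru
        self.in_userspace = True
        self.task_work = []          # callbacks run on return to userspace

    def enter_kernel(self):
        self.in_userspace = False

    def return_to_userspace(self):
        while self.task_work:        # run pending hooks before user code resumes
            self.task_work.pop(0)(self)
        self.in_userspace = True


def do_pkey_sync(threads, caller, new_pkru):
    """Toy model: queue a PKRU update on each remote thread, then kick running ones."""
    caller.pkru = new_pkru
    kicked = []
    for t in threads:
        if t is caller:
            continue
        # Step 2: register a hook that applies the new value lazily.
        t.task_work.append(lambda th, v=new_pkru: setattr(th, "pkru", v))
        if t.in_userspace:           # Step 3: stands in for a rescheduling interrupt
            t.enter_kernel()
            kicked.append(t.tid)
    return kicked                    # Step 4: return without waiting for acknowledgments
```

The property this model preserves is that, once do_pkey_sync() returns, no remote thread is in userspace with the old PKRU value: each one is either outside userspace with a pending hook or has already applied the update.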

Application         Protection   Protected data    #pkeys   #vkeys   Changed LoC
OpenSSL             Isolation    Private key       1        1        83
JIT (key/page)      W⊕X          Code cache        15       > 15     CC 10 | SM 18
JIT (key/process)   W⊕X          Code cache        1        1        CC 18 | SM 24 | v8 134
Memcached           Isolation    Slab, hashtable   2        2        117

Table 3: Three real-world applications of libmpk. To enable W⊕X in JavaScript engines, we use two approaches: using a virtual key for every page in the code cache (one key per page) and using a single protection key for all the pages in the code cache (one key per process). CC and SM indicate the Microsoft ChakraCore and Mozilla SpiderMonkey JS engines, respectively. More code modification is required in Google v8, where W⊕X is not originally supported. pkeys and vkeys mean protection keys and virtual keys, respectively.

5 Applications

We demonstrate the security benefit, efficiency, and usability of libmpk by augmenting three types of popular applications: an SSL library, three JavaScript just-in-time (JIT) compilation engines, and an in-memory key-value store. Table 3 summarizes the mechanisms (e.g., page isolation or W⊕X) that we aim to provide as well as the protected data (e.g., key or code). Evaluation results for these secure applications are in § 6.

5.1 OpenSSL

Transport layer security (TLS) or secure sockets layer (SSL) is one of the most important security features for preventing attackers from eavesdropping on network traffic by means of encryption. TLS/SSL relies on public-key cryptosystems, such as RSA, to authenticate communication parties and to exchange a session key between them.

OpenSSL is a popular open-source library that implements the SSL and TLS protocols to offer encrypted HTTPS and other secure communication. Although it is a widely used open-source project, it still suffers from diverse security threats such as code execution, memory corruption, and leakage of sensitive information. Information leak bugs are especially powerful because a web server holds a variety of sensitive data (e.g., cryptographic keys, passwords, and personal private data). For example, the Heartbleed bug [27] is an information leak bug in OpenSSL that allows attackers to leak sensitive information, including private keys.

We apply libmpk to OpenSSL to protect its private keys from potential information leakage by storing the keys in isolated memory pages. More specifically, we first identify all data types that can store private keys (e.g., EVP_PKEY) and replace their heap memory allocation function, OPENSSL_malloc(), with mpk_malloc() to store them in an isolated memory region. Next, we find all functions that need to access private keys (e.g., pkey_rsa_decrypt()) and let them access the isolated memory region by inserting mpk_begin() and mpk_end() before and after their call sites. Note that it is possible to wrap each individual, legitimate access to the isolated memory region with mpk_begin() and mpk_end() to minimize the attack window, but this hurts performance and programmability, so we do not take that approach in this paper.
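
The wrapping pattern can be pictured as a scoped unlock of one thread's PKRU. The sketch below is a Python model under stated assumptions: ThreadPKRU and unlocked() are hypothetical names, not libmpk APIs. It relies only on the architectural layout of PKRU, which holds two bits per key: access-disable (AD) at bit 2k and write-disable (WD) at bit 2k+1.

```python
from contextlib import contextmanager

AD, WD = 0b01, 0b10  # per-key PKRU bits: access-disable, write-disable


class ThreadPKRU:
    """Toy per-thread PKRU: two bits per key, 16 keys, 32 bits in total."""

    def __init__(self):
        self.value = 0
        for key in range(1, 16):          # start with every non-default key locked
            self.value |= AD << (2 * key)

    def set(self, key, bits):
        shift = 2 * key
        self.value = (self.value & ~(0b11 << shift)) | (bits << shift)

    def can_read(self, key):
        return not (self.value >> (2 * key)) & AD

    def can_write(self, key):
        return not (self.value >> (2 * key)) & (AD | WD)


@contextmanager
def unlocked(pkru, key, writable=False):
    """Stand-in for an mpk_begin()/mpk_end() pair around a call site."""
    pkru.set(key, 0 if writable else WD)  # open the domain for this thread only
    try:
        yield
    finally:
        pkru.set(key, AD)                 # close it again, even on exceptions
```

Because each thread has its own ThreadPKRU value in this model, unlocking key 1 in one thread leaves every other thread's view of that key unchanged, which is the property the private-key isolation depends on.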

5.2 Just-in-time (JIT) Compilation

JIT compilation dynamically translates interpreted script languages, e.g., JavaScript and ActionScript, into native machine code or bytecode to avoid the overhead of full compilation and repeated interpretation. However, it can suffer from security problems because it relies on writable code, which potentially results in arbitrary execution. To support JIT compilation, the code cache that stores code generated at runtime needs to be writable by a JIT compilation thread to let it write and update compiled code, and be executable by an execution thread. This implies that if attackers are able to compromise the JIT compilation thread, they make the execution thread execute the code they provide.

ChakraCore [28] and SpiderMonkey [30] mitigate the abovementioned problem by enforcing the W⊕X security policy on the code cache with mprotect(). They make the code cache writable while disallowing execution when they are updating code, and, after the code has been updated, they make the code cache executable while disallowing writes. However, they can suffer from race condition attacks [39] because they use mprotect() to change page permissions; that is, when a thread makes the code cache writable with mprotect(), other threads compromised by attackers can also manipulate the code cache with the same permission.

We apply libmpk to three popular JavaScript engines (SpiderMonkey, ChakraCore, and v8) to enforce the W⊕X security policy without the race condition problem while ensuring better performance. Unlike mprotect(), libmpk enables per-thread permission such that it can ensure only a JIT compilation thread has write access to the code cache. We propose two approaches to implement the W⊕X policy with libmpk.

One key per page. A context-free solution is to replace mprotect() with libmpk APIs to perform fast permission switches on targeted pages in the code cache. All the protection keys are initialized with read-only permission when a new thread is created. We dedicate one protection key to a page the first time it is re-protected via mprotect(), and change its page permission to rwx. Later, we only need to call mpk_begin() and mpk_end() before and after the JIT compiler updates the corresponding page. Based on the observation that mostly only one page is updated at a time, we still invoke mprotect() if multiple pages change permission.

One key per process. We also propose a new approach specialized for JavaScript engines, where only a single protection key is used to protect the entire code cache. More specifically, when pages are first committed from the reserved memory region into the code cache, they are assigned the protection key and their page permission is set to rwx. Whenever any page in the code cache is to be updated, the script engine needs to call mpk_begin() and mpk_end(). Although more pages become temporarily writable, the security of the code cache is ensured thanks to the per-thread view of the protection key.
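
The difference between process-wide and per-thread permissions can be stated as two tiny predicates. This Python sketch is illustrative only; the function names and the thread labels used with it are hypothetical, and it ignores everything except which threads may write to a code page at a given moment.

```python
def writable_threads_mprotect(threads, page_perm):
    """Page-table permissions are process-wide: every thread sees the same bits."""
    return [t for t in threads if "w" in page_perm]


def writable_threads_mpk(threads, page_perm, pkru_write_ok):
    """With a protection key, a thread also needs write access in its own PKRU."""
    return [t for t in threads if "w" in page_perm and pkru_write_ok.get(t, False)]
```

For threads labeled "jit", "exec", and "compromised", the first function returns all three while the page is marked writable, which is the race window described above; the second returns only the threads whose own PKRU currently grants write access, even though the page itself stays rwx.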

5.3 In-Memory Key-Value Store

In-memory key-value stores, such as Memcached, are widely used to manage a large amount of data in memory to ensure low latency and high throughput. Since performance is their most important requirement, they usually do not adopt security techniques that hinder it, even if they store sensitive information. In particular, security techniques whose performance depends on the size of data (e.g., mprotect() and encryption) are avoided. This implies that, if an in-memory key-value store has arbitrary read or write vulnerabilities, attackers are able to leak or corrupt sensitive information.

We apply libmpk to an in-memory key-value store, Memcached, to secure almost its entire memory. libmpk’s performance is independent of the size of the memory to secure, so it can work efficiently with Memcached even when the size of in-memory data is several gigabytes. More specifically, we secure slabs that contain actual values and hash tables that maintain key-value mappings by replacing Memcached’s malloc() function with mpk_malloc(), and let legitimate functions (e.g., ITEM_key() and assoc_find()) access them by wrapping their call sites with mpk_begin() and mpk_end(). Note that, in our current implementation, we assign two different keys to slabs and hash tables, respectively, to narrow the attack surface. It is possible to use more keys to secure slabs in a fine-grained manner, e.g., differentiating them according to their sizes.

6 Evaluation

In this section, we evaluate libmpk in terms of its security implication and performance by answering the following questions:

  • What security guarantees does libmpk provide? (§ 6.1)

  • Does libmpk solve the security, scalability, and semantic-gap problems that existing MPK APIs suffer from without introducing much performance overhead? (§ 6.2)

  • Does libmpk have negligible performance impact and outperform mprotect() in real-world applications? (§ 6.3)

The same system environment explained in § 2.3 is used for performance evaluations.

6.1 Security Evaluation

We first evaluate the security benefits of libmpk regarding memory protection and isolation. For OpenSSL and Memcached, libmpk provides domain-based isolation to protect memory space that stores sensitive data. The permission for the particular memory space set by libmpk is locally effective, which also prevents malicious accesses from other compromised threads. For example, libmpk manages to prevent memory leakage from a protected domain to the outside. All attack attempts that exploit a memory corruption vulnerability to leak or ruin sensitive data stored in the isolated memory space are killed by segmentation faults because they lack proper permission. To verify this, we mimic the Heartbleed vulnerability by deliberately introducing a heap-out-of-bounds read bug and inserting special characters as a decoy private key placed next to the victim heap region. When the vulnerability is triggered, OpenSSL hardened by libmpk crashes with an invalid memory access. However, libmpk cannot fully mitigate memory leakage that originates inside the protected domain. Thus, developers should carefully design the domain to minimize the potential attack surface when using libmpk in their applications.

libmpk can be used by JavaScript JIT compilers to guarantee W⊕X on the pages in the code cache. Unlike mprotect(), libmpk is immune to race condition attacks launched by compromised threads running in parallel due to the thread-local effectiveness of protection keys. When the JIT compiler relies on libmpk to switch the permission of a code page for later updates, other threads controlled by attackers cannot write malicious shellcode into the page simultaneously. To verify this, we introduce two custom JavaScript APIs for arbitrary memory read and write to SpiderMonkey and ChakraCore, and test a simple PoC that leverages these two APIs to locate the page of a function being compiled and write shellcode to the corresponding page. Both SpiderMonkey and ChakraCore crash with a segmentation fault at the end.

6.2 Microbenchmarks

We run several microbenchmarks to understand the performance behavior of APIs in libmpk.

Cache performance. libmpk introduces a cache to enable protection of more than 16 page groups; the cache’s performance is affected by its eviction rate and hit rate, and by the number of virtual keys in use. We run the following two microbenchmarks to check the cache performance.

[Figure 8 plot: latency against hit rates (%), with hit, miss, and mprotect components, for <#threads, eviction rate> = <1, 25%>, <4, 25%>, <1, 50%>, <4, 50%>, <1, 100%>, <4, 100%>.]
Figure 8: Latency of libmpk’s key cache with various hit rates, eviction rates, and different numbers of threads. mpk_mprotect() and mprotect() are invoked on a 4 KB page. The red line marks the overhead of mprotect(). When the hit rate is 100%, mpk_mprotect() is 12.2× faster than mprotect() for one thread and 3.11× faster for four threads.

Hit rate and eviction rate. The first benchmark measures cache performance with different hit rates, eviction rates, and numbers of threads. We run the benchmark with both one thread and four threads, where each thread warms up by filling the key cache to avoid cold misses and invokes mpk_mprotect() on one page a hundred times after 15 entries are filled. Figure 8 presents the evaluation results, where (1) the green box indicates the overhead incurred by a cache hit, which is dominated by the time spent on WRPKRU and on maintaining internal data structures; (2) the blue box indicates the overhead incurred by a cache miss, which is dominated by the time spent on key eviction. More specifically, mpk_mprotect() needs to unset the protection key that is to be evicted and bind a new virtual key to it. We test the microbenchmark with three eviction rates, where the eviction rate is the fraction of cache misses that eventually lead to key eviction. If a cache miss occurs without key eviction, mprotect() is invoked to change the permission of the pages.

Experimental results show that mpk_mprotect() outperforms mprotect() except when the cache hit rate is below 25% with an eviction rate above 50%. This is because, unlike mprotect(), mpk_mprotect() does not merge and split the VMAs of targeted pages. mpk_mprotect() becomes slower when tested with four threads, but is still comparable with mprotect(), whose latency also increases in a multi-threaded program.
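
One way to read these results is through a simple expected-cost model. The notation below is ours, not the paper's, and the model is a simplification: h is the hit rate, e the eviction rate, and T_hit, T_evict, and T_mprotect are the per-call latencies of a cache hit, a miss that evicts a key, and a plain mprotect() call, respectively. It ignores any lookup cost paid on a miss that falls back to mprotect().

```latex
\mathbb{E}[T_{\text{mpk\_mprotect}}]
  = h\,T_{\text{hit}}
  + (1-h)\bigl[\,e\,T_{\text{evict}} + (1-e)\,T_{\text{mprotect}}\bigr]

% mpk_mprotect() is faster on average than mprotect() when
\mathbb{E}[T_{\text{mpk\_mprotect}}] < T_{\text{mprotect}}
\iff
h\,T_{\text{hit}} + (1-h)\,e\,T_{\text{evict}} < (h + e - h e)\,T_{\text{mprotect}}
```

Under this model, a low hit rate combined with a high eviction rate makes the eviction term dominate, which matches the region in which the measurements show mpk_mprotect() losing to mprotect().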

Number of virtual keys. To evaluate how the number of used virtual keys affects the cache performance of libmpk, we re-implement W⊕X in ChakraCore with the one key per page approach (see § 5.2) and set the eviction rate to 100%. To introduce an increasing number of pages to be protected (i.e., an increasing number of virtual keys to be used) during the execution of ChakraCore, we design a simple microbenchmark. The microbenchmark consists of a set of JavaScript files from 1.js to N.js, each of which contains one more hot function, invoked 100,000 times, than the previous file. For each such hot function, ChakraCore allocates one more executable page to store the native code and performs nine permission switches on the page through one virtual key at runtime. Without any hot function, ChakraCore allocates one page in the code cache. We run the original ChakraCore (version 1.9.0.0-beta) and the modified one with our microbenchmarks (from 1.js to 35.js), and record the total time cost of changing the permission of the pages in the code cache (i.e., the execution time of VirtualProtect() and that of mpk_begin() and mpk_end()). Each JavaScript file is executed 200 times, and the average time is presented in Figure 9.

[Figure 9 plot: x-axis, number of hot functions; series: mprotect() and libmpk.]
Figure 9: Average time cost to update permission when original and modified ChakraCore JIT-compile an increasing number of hot functions demanding distinct virtual keys.

The result shows that with the libmpk-based implementation of W⊕X, the time cost of permission switches increases linearly as more hot functions are emitted and thus more virtual keys are allocated to protect the code pages of the hot functions. In particular, after 15 virtual keys are allocated (marked in red), the time cost increases slightly faster than before (marked in blue) due to cache eviction. Nevertheless, ChakraCore hardened by libmpk is still 3.2× faster than the original ChakraCore using mprotect() to enforce W⊕X.

Memory overhead. libmpk dedicates memory space to store its internal data structures for maintaining the metadata of the page groups under protection (see § 4.3). Each mpk_mmap() allocates 32 bytes of memory to store the information of a new page group (e.g., base address and size). libmpk maintains a hashmap to store the mapping between virtual keys and hardware keys for fast query and access. In the current implementation, we pre-allocate 32 KB of memory for the hashmap, and its size automatically expands when a program invokes mpk_mmap() more than about 4,000 times.
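
A quick back-of-the-envelope calculation makes these figures concrete. The constants below come from the text; the function name, the assumption that KB means 1,024 bytes, and the derived bytes-per-mapping value are ours and should be read as rough estimates rather than properties of the implementation.

```python
GROUP_RECORD_BYTES = 32          # per mpk_mmap() call, from the text
HASHMAP_BYTES = 32 * 1024        # pre-allocated hashmap, from the text (KB taken as 1024 B)
CALLS_BEFORE_GROWTH = 4000       # "about 4,000", from the text


def metadata_bytes(num_groups):
    """Bytes of page-group records plus the initial hashmap (growth is not modeled)."""
    return num_groups * GROUP_RECORD_BYTES + HASHMAP_BYTES


# Inferred, not stated in the text: roughly 8 bytes of hashmap space per mapping.
bytes_per_slot = HASHMAP_BYTES / CALLS_BEFORE_GROWTH
```

Under these assumptions, 4,000 page groups cost about 157 KB of metadata in total, which is small next to the memory those groups protect.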

[Figure 10 plot: x-axis, number of threads; series: mprotect() (4,000 KB), mprotect() (400 KB), mprotect() (40 KB), mprotect() (4 KB), and mpk_mprotect().]
Figure 10: Latency of inter-thread permission synchronization using mpk_mprotect() and mprotect() calls on memory of varying sizes. mpk_mprotect() outperforms mprotect() by 1.73× for a single page and 3.77× for 1,000 pages.

Synchronization latency. Figure 10 shows the latency of inter-thread permission synchronization using mpk_mprotect() and mprotect() on memory of varying sizes. mpk_mprotect() is 1.73× faster than mprotect() when updating the permission of a single page. The latency of mprotect() increases with the number of pages it changes due to expensive operations on managing VMAs. Compared to mpk_mprotect(), mprotect() takes at least 3.78× as long to change the permission of 1,000 pages. The performance overhead of mpk_mprotect() is independent of the number of pages whose permission is updated. Figure 10 also shows that when there are many threads, the latency of both mprotect() and mpk_mprotect() increases; mprotect() flushes more TLBs, whereas mpk_mprotect() creates many hooks in the kernel.

6.3 Application Benchmarks

We measure the performance overhead of libmpk in practice by evaluating three applications proposed in § 5.

OpenSSL. The Apache HTTP server [11] (httpd) uses OpenSSL to implement SSL/TLS protocols. To evaluate the overhead caused by libmpk, which is introduced to protect private keys, we use ApacheBench to test httpd with both the original OpenSSL library and the modified one with libmpk. ApacheBench is launched 10 times and each time sends 1,000 requests of different sizes from four concurrent clients to the server. We choose DHE-RSA-AES256-GCM-SHA256 with a 1024-bit key as the cipher suite in the evaluation. Figure 11 presents the evaluation result. On average, libmpk only introduces 0.58% performance overhead in terms of throughput. The negligible overhead mainly comes from internal data structure maintenance in libmpk.

[Figure 11 plot: x-axis, size of each request (KB); series: original and libmpk.]
Figure 11: Throughput of original httpd and httpd hardened by libmpk. libmpk slows down httpd by at most 0.53%.
[Figure 12 plots: (a) SpiderMonkey and (b) ChakraCore; series: mprotect(), key/page, and key/process.]
Figure 12: Octane benchmark scores of SpiderMonkey and ChakraCore with original and libmpk-based W⊕X solutions. libmpk outperforms the original, mprotect()-based defense by at most 4.75% (SpiderMonkey) and 31.11% (ChakraCore).
[Figure 13 plot: series: No prot., libmpk, and SDCG.]
Figure 13: Octane benchmark scores of original v8 and two modified versions of v8 ensuring W⊕X by SDCG and libmpk. libmpk only introduces 0.81% overall performance overhead for W⊕X in v8, compared with 6.68% caused by SDCG.

Just-in-time compilation. We applied the two proposed W⊕X solutions based on libmpk, namely, one key per page and one key per process (§ 5.2), to both SpiderMonkey (version 59.0) and ChakraCore (version 1.9.0.0-beta) and evaluated their performance with the Octane benchmark [15], which involves heavy JIT-compilation workloads at runtime. Each JavaScript program in the benchmark was directly executed by the original and modified script engines 20 times, and the average score was recorded. Figure 12 shows the final results.

For SpiderMonkey, both libmpk-based approaches outperform the mprotect()-based approach on the total score, by 0.38% and 1.26%, respectively, which is consistent with the claim from Firefox developers that enabling W⊕X with mprotect() in SpiderMonkey introduces less than 1% overhead for the Octane benchmark. The reason is that SpiderMonkey is designed to get rid of unnecessary mprotect() calls when its JIT compiler works. The performance scores of nearly all the programs increase with one key per page (by at most 3.60% on Box2D) and one key per process (by at most 4.75% on Box2D), except for SplayLatency protected by one key per page, whose score drops by 1.36%. When a large number of new executable pages are allocated at runtime but updated only a few times afterward, the script engine fails to benefit from fast permission switches through WRPKRU, but suffers from intensive cache eviction when one key per page is applied.

Our two libmpk-based approaches improve ChakraCore by 1.01% and 4.39% on the total score of the Octane benchmark, respectively. ChakraCore is suitable for libmpk-based W⊕X solutions since it only makes one page writable at a time regardless of emitted code size. Note that one key per page increases the performance score of ChakraCore by at most 7.96% when testing SplayLatency, while one key per process improves the performance by at most 31.11% on Box2D. Nevertheless, the score of zlib decreases by 2.12% when one key per process is applied. This is because when new executable pages are committed, they are protected with the single protection key, which requires an extra invocation of pkey_mprotect() on multiple pages. If these pages are rarely updated afterward, the introduced pkey_mprotect() calls hurt the overall performance.

The mprotect()-based approach is vulnerable to the race condition attack identified by SDCG [39] (see § 5.2). SDCG protects the JIT code pages of v8 with W⊕X by emitting the code in a dedicated process. The code pages are not writable in other processes, which prevents the attack. To demonstrate the performance advantage of our in-process libmpk-based approaches, which are also free of race condition attacks, we applied one of our approaches, one key per process, to Google v8 (version 3.20.17.1 used in [39]) and evaluated the performance through the Octane benchmark as well. Figure 13 presents the performance comparison among original v8, v8 with SDCG, and v8 with libmpk. Note that v8 did not originally deploy W⊕X to protect its code cache. Our approach only introduces 0.81% overall performance loss, compared with 6.68% caused by SDCG.

To summarize, our libmpk-based approaches, which are free of the race condition attack, outperform the mprotect()-based approach currently applied in practice to enforce W⊕X protection on code cache pages with negligible overhead.

[Figure 14 plots: x-axis, #connections (250, 500, 750, 1000); series: original, mpk_begin, mpk_mprotect, and mprotect.]
Figure 14: Throughput and unhandled concurrent connections of original Memcached and three versions of Memcached whose key-value pairs are protected by mpk_begin(), mpk_mprotect(), and mprotect(). mpk_begin()’s overhead is negligible in comparison to the original. mpk_mprotect() outperforms mprotect() by 8.1× while ensuring the same semantics.

In-memory key-value store. To study the performance overhead of libmpk when protecting large memory, we evaluate the modified Memcached whose key-value pairs are isolated by libmpk. More specifically, the modified Memcached pre-allocates 1 GB memory, which is used instead of slab pages allocated by glibc malloc() to store key-value pairs. Besides the original Memcached, we also evaluate the Memcached whose key-value pairs are protected by mprotect(). To study the performance of mpk_mprotect() in real-world applications, we also create the Memcached guarded by libmpk with permission synchronized as another evaluation target for comparison. Each aforementioned version of Memcached launches with four concurrent threads, and we connect to it remotely through twemperf [40]. We create from 250 to 1,000 connections per second, and 10 requests are sent during each connection.

Figure 14 presents the evaluation results. The modified Memcached hardened by libmpk only has 0.01% overhead in terms of data throughput and almost no overhead regarding concurrent connections processed per second, which indicates that libmpk performs well even when protecting a huge number of pages. By contrast, mprotect() introduces nearly 89.56% overhead in terms of data throughput when protecting 1 GB memory in Memcached, and a large number of unhandled concurrent connections accumulate in this case. This is because mprotect() involves page table traversal, which is expensive when dealing with a large number of pages. To evaluate the synchronization service of libmpk in practice, we also run Memcached protected by mpk_mprotect(). This design ensures the same semantics but outperforms mprotect() by 8.1× regarding throughput.

libmpk provides the same functionality as mprotect() with much better performance when protecting large memory regions. Moreover, in multi-threaded applications, using mprotect() to ensure in-thread memory isolation requires a lock, which is not required when using libmpk because its permissions are inherently thread-local.

7 Discussion

We discuss potential attacks on both Intel MPK and libmpk.

Rogue data cache load (Meltdown). We found Intel MPK can suffer from the rogue data cache load, also known as the Meltdown attack [25, 19]. The rogue data cache load is possible because current Intel CPUs check the access permission to a specific memory page after they have loaded it into the cache. MPK is not an exception because Intel CPUs check the access rights of PKRU when checking the page permission at the same pipeline phase. This allows attackers to infer the content of a present (accessible) page even when its protection key has no access right. Since Intel is considering hardware-level mitigation techniques against the rogue data cache load [19], we believe this problem will be solved in the near future.

Control flow hijacking. Developers call pkey_mprotect() to set a protection key for a group of pages followed by a series of WRPKRUs to change the permission of the page group. These two default interfaces provide a new attack surface when the control flow of a compromised process has been hijacked. More specifically, attackers can change the PKRU through WRPKRU to get permission for accessing an isolated memory space. The protection key for specific memory space can also be changed by attackers through pkey_mprotect(). To provide firm protection on isolated memory space, developers can adopt sandboxing techniques [10, 44, 23] to prevent attackers from invoking these two operations and harden the control flow of their applications [22, 45, 2, 21].

8 Related Work

libmpk abstracts MPK, which can primarily be used for memory isolation in the context of security. Though only the latest Intel processors provide MPK, other architectures also support similar hardware features for grouping pages, as discussed in § 1. We introduce proposed applications of MPK and ARM Domain, and other memory isolation mechanisms designed for different applications. Note that some of them have threat models that are different from ours.

MPK applications. While conducting our study, we noticed that there were a few ongoing studies using MPK to achieve different security goals. Burow et al. [6] leverage both MPK and memory protection extension (MPX) to efficiently isolate the shadow stack. ERIM [41] utilizes MPK to isolate sensitive code and data. In addition, XOM-Switch [46] relies on MPK to enable execute-only memory for unmodified binaries. Our effort on providing a software abstraction for MPK is orthogonal to these studies, which are all potential applications of libmpk. These schemes can leverage libmpk to achieve secure and scalable key management and thus create as many sensitive memory regions as required.

Memory protection with similar hardware features. Memory protection mechanisms have been leveraging new hardware features for efficiency (e.g., software-based fault isolation [42, 34]). ARM has a hardware feature named Domain [3] and IBM Power supports Storage Protection [18]. ARMlock [47], FlexDroid [35], and Shreds [7] rely on ARM Domain [3] to isolate untrusted program modules, third-party libraries, and sensitive code modules, respectively.

Although they share a similar high-level concept with MPK, their underlying designs give them different low-level behaviors and potential benefits. Domain differs from MPK in two respects: the way of defining and the way of switching permissions. To change the permission for one or more page groups using Domain, a thread updates a register called the Domain Access Control Register (DACR), which defines a running thread’s access rights to a particular page group. Unlike MPK, Domain does not allow an application to update the register by itself. Instead, it requires the application to invoke a system call since only the OS kernel can update DACR. Furthermore, Domain does not support execute-only pages because it does not allow a thread to define additional access rights for a page group. To make a page execute-only using MPK, a program can make the page executable through mprotect() and forbid page access by using MPK. By contrast, if a program creates non-readable pages through Domain, the processor cannot fetch any instructions from that page. IBM Storage Protection allows a program to create 32 different page groups, and uses two special registers to determine the permissions on every page group a running thread owns, as MPK does. Similar to MPK, Storage Protection has a restriction on the number of page groups under control. Moreover, there does not exist any software abstraction to overcome this limitation. Nevertheless, unlike MPK, Storage Protection provides protection keys for kernel memory space.
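
The execute-only construction with MPK follows from how the two permission sources combine: the page table must allow an access, and PKRU additionally restricts data accesses but not instruction fetches. The Python predicate below is a simplified model of that rule; the function name and the string encoding of permissions are our own.

```python
def allowed(access, page_perm, pkru_bits):
    """access: 'r', 'w', or 'x'; page_perm: e.g. 'rx'; pkru_bits: (AD, WD) for the page's key."""
    ad, wd = pkru_bits
    if access not in page_perm:          # the page table must permit the access first
        return False
    if access == "x":                    # instruction fetches ignore PKRU
        return True
    if ad:                               # access-disable blocks all data accesses
        return False
    return not (access == "w" and wd)    # write-disable blocks only writes
```

For a page mapped as 'rx' whose key has access-disable set, allowed('x', ...) is true while allowed('r', ...) is false, which is exactly the execute-only behavior described above.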

Software-based fault isolation (SFI). SFI [42] prohibits unintended memory accesses by inserting address-masking instructions just before load and store instructions. Sandboxing mechanisms such as Native Client (NaCl) [34, 14] rely on SFI to isolate untrusted code. Code-Pointer Integrity [22] also uses SFI to protect code pointers from unsanitized memory accesses. MemSentry [20] provides a unified memory isolation framework based on hardware features to reduce the performance overhead of SFI. SFI enables an application to partition its memory into multiple regions, but the cost of address masking limits the shape of partitions, which are commonly contiguous pieces of memory. By contrast, MPK enables an application to partition memory into regions of arbitrary shape. Further, unlike with MPK, the address-masking overhead of SFI grows with the number of isolated memory regions.
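
Address masking can be sketched in a few lines of C. This is a simplified illustration, not the exact instruction sequence used by any of the cited systems: it assumes a single sandbox whose size is a power of two and whose base is aligned to that size. The function name sfi_mask and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Force `addr` into the sandbox [base, base + size).
 * Assumes `size` is a power of two and `base` is size-aligned, so one
 * AND and one OR suffice; an SFI rewriter would emit the equivalent
 * instructions before each load and store. */
static inline uint64_t sfi_mask(uint64_t addr, uint64_t base, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0); /* power of two */
    assert((base & (size - 1)) == 0);              /* aligned base */
    return base | (addr & (size - 1));
}
```

An address already inside the sandbox passes through unchanged, while any other address is redirected to some location inside it. Because a single mask can only describe one aligned, contiguous range, this sketch also suggests why SFI partitions are commonly contiguous and why supporting more regions requires more checking per access.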

Multiple virtual address spaces. Using multiple virtual address spaces (i.e., multiple page tables) for a program can protect the memory of sensitive or untrusted components from the others. SMV [17] uses multiple page tables to isolate the memory of threads in a single process from each other, which is similar to [5, 33, 43]. Other systems [26, 4, 9] also provide different memory views to individual threads or small execution units using separate page tables. Kenali [37] protects sensitive data with a page-table-based isolation mechanism that creates a separate page table for each thread. Unlike libmpk, these mechanisms suffer from non-negligible performance overhead due to slow and frequent page table switches.

9 Conclusion

Intel MPK supports efficient per-thread permission control on groups of pages. However, MPK's current hardware implementation and software interfaces suffer from security, scalability, and semantic-gap problems. To solve these problems, we propose libmpk, a secure and scalable software abstraction of MPK that mitigates the semantic gap and lets developers perform fast memory protection and domain-based isolation in their applications. Evaluation results show that libmpk incurs negligible performance overhead (<1%) for domain-based isolation and achieves better performance as a substitute for mprotect() when applied to real-world applications: OpenSSL, a JavaScript JIT compiler, and Memcached.

References

  • [1] Linux kernel, v4.20, 2018. https://elixir.bootlin.com/linux/v4.20-rc1/source/mm/mprotect.c#L630.
  • [2] Abadi, M., Budiu, M., Erlingsson, U., and Ligatti, J. Control-flow integrity. In Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS) (Alexandria, VA, Nov. 2005).
  • [3] ARM. ARM(R) Architecture Reference Manual ARMv7-A and ARMv7-R edition, 2018.
  • [4] Bittau, A., Marchenko, P., Handley, M., and Karp, B. Wedge: Splitting Applications into Reduced-Privilege Compartments. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI) (San Francisco, CA, Apr. 2008).
  • [5] Brumley, D., and Song, D. Privtrans: Automatically partitioning programs for privilege separation. In Proceedings of the 13th USENIX Security Symposium (Security) (San Diego, CA, Aug. 2003).
  • [6] Burow, N., Zhang, X., and Payer, M. Shining light on shadow stacks. arXiv preprint arXiv:1811.03165 (2018).
  • [7] Chen, Y., Reymondjohnson, S., Sun, Z., and Lu, L. Shreds: Fine-grained Execution Units with Private Memory. In Proceedings of the 37th IEEE Symposium on Security and Privacy (Oakland)
  • [8] Dang, T. H., Maniatis, P., and Wagner, D. Oscar: A Practical Page-Permissions-Based Scheme for Thwarting Dangling Pointers. In Proceedings of the 26th USENIX Security Symposium (Security) (Vancouver, BC, Canada, Aug. 2017).
  • [9] El Hajj, I., Merritt, A., Zellweger, G., Milojicic, D., Achermann, R., Faraboschi, P., Hwu, W.-m., Roscoe, T., and Schwan, K. SpaceJMP: Programming with Multiple Virtual Address Spaces. In Proceedings of the 21st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (Atlanta, GA, Apr. 2016).
  • [10] Ford, B., and Cox, R. Vx32: Lightweight User-level Sandboxing on the x86. In Proceedings of the 2008 USENIX Annual Technical Conference (ATC) (Boston, MA, June 2008).
  • [11] Foundation, A. S. Apache HTTP Server Project, 2018. https://httpd.apache.org/.
  • [12] Foundation, F. S. The GNU C Library, 2018. https://www.gnu.org/software/libc/manual/html_mono/libc.html#Memory-Protection.
  • [13] Frassetto, T., Jauernig, P., Liebchen, C., and Sadeghi, A. IMIX: in-process memory isolation extension. In Proceedings of the 27th USENIX Security Symposium (Security)
  • [14] Google. NaCl SFI model on x86-64 systems. https://developer.chrome.com/native-client/reference/sandbox_internals/x86-64-sandbox.
  • [15] Google. The JavaScript Benchmark Suite for the modern web, 2017. https://developers.google.com/octane.
  • [16] Guan, L., Liu, P., Xing, X., Ge, X., Zhang, S., Yu, M., and Jaeger, T. TrustShadow: Secure Execution of Unmodified Applications with ARM TrustZone. In Proceedings of the 15th ACM International Conference on Mobile Computing Systems (MobiSys) (Niagara Falls, NY, June 2017).
  • [17] Hsu, T. C.-H., Hoffman, K., Eugster, P., and Payer, M. Enforcing Least Privilege Memory Views for Multithreaded Applications. In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS) (Vienna, Austria, Oct. 2016).
  • [18] IBM. Power ISA(TM) Version 3.0 B, 2017.
  • [19] Intel. Intel Analysis of Speculative Execution Side Channels, 2018.
  • [20] Koning, K., Chen, X., Bos, H., Giuffrida, C., and Athanasopoulos, E. No Need to Hide: Protecting Safe Regions on Commodity Hardware. In Proceedings of the 12th European Conference on Computer Systems (EuroSys) (Belgrade, Serbia, Apr. 2017).
  • [21] Koo, H., Chen, Y., Lu, L., Kemerlis, V. P., and Polychronakis, M. Compiler-assisted Code Randomization. In Proceedings of the 39th IEEE Symposium on Security and Privacy (Oakland) (San Francisco, CA, May 2018).
  • [22] Kuznetsov, V., Szekeres, L., Payer, M., Candea, G., Sekar, R., and Song, D. Code-Pointer Integrity. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (Broomfield, Colorado, Oct. 2014).
  • [23] Li, Y., McCune, J. M., Newsome, J., Perrig, A., Baker, B., and Drewry, W. MiniBox: A Two-Way Sandbox for x86 Native Code. In Proceedings of the 2014 USENIX Annual Technical Conference (ATC) (Philadelphia, PA, June 2014).
  • [24] Lie, D., Thekkath, C., Mitchell, M., Lincoln, P., Boneh, D., Mitchell, J., and Horowitz, M. Architectural Support for Copy and Tamper Resistant Software. In Proceedings of the 9th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (Cambridge, MA, Nov. 2000).
  • [25] Lipp, M., Schwarz, M., Gruss, D., Prescher, T., Haas, W., Fogh, A., Horn, J., Mangard, S., Kocher, P., Genkin, D., Yarom, Y., and Hamburg, M. Meltdown: Reading Kernel Memory from User Space. In Proceedings of the 27th USENIX Security Symposium (Security)
  • [26] Litton, J., Vahldiek-Oberwagner, A., Elnikety, E., Garg, D., Bhattacharjee, B., and Druschel, P. Light-Weight Contexts: An OS Abstraction for Safety and Performance. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (Savannah, GA, Nov. 2016).
  • [27] Mehta, N., and Codenomicon. The Heartbleed Bug, 2014. http://heartbleed.com/.
  • [28] Microsoft. ChakraCore is the core part of the Chakra Javascript engine that powers Microsoft Edge, 2018. https://github.com/Microsoft/ChakraCore.
  • [29] Mogosanu, L., Rane, A., and Dautenhahn, N. MicroStache: A Lightweight Execution Context for In-Process Safe Region Isolation. In Proceedings of the 21st International Symposium on Research in Attacks, Intrusions and Defenses (RAID) (Crete, Greece, Sept. 2018), pp. 359–379.
  • [30] Mozilla. Spidermonkey, 2018. https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey.
  • [31] Nagarakatte, S., Martin, M. M., and Zdancewic, S. Watchdog: Hardware for safe and secure manual memory management and full memory safety. In Proceedings of the 40th ACM/IEEE International Symposium on Computer Architecture (ISCA) (Portland, Oregon, June 2012).
  • [32] Nagarakatte, S., Zhao, J., Martin, M. M., and Zdancewic, S. SoftBound: Highly compatible and complete spatial memory safety for C. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (Dublin, Ireland, June 2009).
  • [33] Provos, N., Friedl, M., and Honeyman, P. Preventing Privilege Escalation. In Proceedings of the 12th USENIX Security Symposium (Security) (Washington, DC, Aug. 2003).
  • [34] Sehr, D., Muth, R., Biffle, C., Khimenko, V., Pasko, E., Schimpf, K., Yee, B., and Chen, B. Adapting Software Fault Isolation to Contemporary CPU Architectures. In Proceedings of the 19th USENIX Security Symposium (Security) (Washington, DC, Aug. 2010).
  • [35] Seo, J., Kim, D., Cho, D., Kim, T., and Shin, I. FlexDroid: Enforcing In-App Privilege Separation in Android. In Proceedings of the 2016 Annual Network and Distributed System Security Symposium (NDSS)
  • [36] Snow, K. Z., Monrose, F., Davi, L., Dmitrienko, A., Liebchen, C., and Sadeghi, A.-R. Just-in-time code reuse: On the effectiveness of fine-grained address space layout randomization. In Proceedings of the 34th IEEE Symposium on Security and Privacy (Oakland) (San Francisco, CA, May 2013).
  • [37] Song, C., Lee, B., Lu, K., Harris, W. R., Kim, T., and Lee, W. Enforcing Kernel Security Invariants with Data Flow Integrity. In Proceedings of the 2016 Annual Network and Distributed System Security Symposium (NDSS)
  • [38] Song, C., Moon, H., Alam, M., Yun, I., Lee, B., Kim, T., Lee, W., and Paek, Y. HDFI: Hardware-Assisted Data-Flow Isolation. In Proceedings of the 37th IEEE Symposium on Security and Privacy (Oakland)
  • [39] Song, C., Zhang, C., Wang, T., Lee, W., and Melski, D. Exploiting and Protecting Dynamic Code Generation. In Proceedings of the 2015 Annual Network and Distributed System Security Symposium (NDSS)
  • [40] Twitter. twemperf, 2018. https://github.com/twitter-archive/twemperf.
  • [41] Vahldiek-Oberwagner, A., Elnikety, E., Garg, D., and Druschel, P. ERIM: Secure and Efficient In-process Isolation with Memory Protection Keys. arXiv preprint arXiv:1801.06822 (2018).
  • [42] Wahbe, R., Lucco, S., Anderson, T. E., and Graham, S. L. Efficient Software-based Fault Isolation. In Proceedings of the 14th ACM Symposium on Operating Systems Principles (SOSP) (Asheville, NC, Dec. 1993).
  • [43] Wang, J., Xiong, X., and Liu, P. Between Mutual Trust and Mutual Distrust: Practical Fine-grained Privilege Separation in Multithreaded Applications. In Proceedings of the 2015 USENIX Annual Technical Conference (ATC) (Santa Clara, CA, July 2015).
  • [44] Yee, B., Sehr, D., Dardyk, G., Chen, J. B., Muth, R., Ormandy, T., Okasaka, S., Narula, N., and Fullagar, N. Native Client: A Sandbox for Portable, Untrusted x86 Native Code. In Proceedings of the 30th IEEE Symposium on Security and Privacy (Oakland) (Oakland, CA, May 2009).
  • [45] Zhang, C., Song, C., Chen, K. Z., Chen, Z., and Song, D. VTint: Protecting Virtual Function Tables Integrity. In Proceedings of the 2015 Annual Network and Distributed System Security Symposium (NDSS)
  • [46] Zhang, M., Sahita, R., and Liu, D. eXecutable-Only-Memory-Switch (XOM-Switch): Hiding Your Code From Advanced Code Reuse Attacks in One Shot. In Black Hat Asia Briefings (Black Hat Asia) (Singapore, Mar. 2018).
  • [47] Zhou, Y., Wang, X., Chen, Y., and Wang, Z. ARMlock: Hardware-based Fault Isolation for ARM. In Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS) (Scottsdale, Arizona, Nov. 2014).