Commodity OS kernels are known to have broad attack surfaces due to their large code bases and numerous features such as device drivers. For a real-world use case (e.g., an Apache server), many kernel services are unused and only a small amount of kernel code is used. Within the used code, a certain part is invoked only at the runtime phase while the rest executes at the startup and/or shutdown phases of the kernel's lifetime. In this paper, we propose a reliable and practical system, named KASR, which transparently reduces attack surfaces of commodity OS kernels at the runtime phase without requiring their source code. The KASR system, residing in a trusted hypervisor, achieves the attack surface reduction through two reliable steps: (1) it deprives unused code of executable permissions, and (2) it transparently segments used code and selectively activates it according to its phase. KASR also prevents unknown code from executing in the kernel space, and thus it is able to defend against all kernel code injection attacks. We implement a prototype of KASR on the Xen-4.8.2 hypervisor and evaluate its security effectiveness on Linux kernel-4.4.0-87-generic. Our evaluation shows that KASR reduces the kernel attack surface by 64% and prevents 66% of drivers (including their related CVEs) from executing. Besides, KASR successfully detects and blocks all 6 real-world kernel rootkits. We measure its performance overhead with three benchmark tools (i.e., SPECINT, httperf and bonnie++). The experimental results indicate that KASR imposes less than 1% performance overhead on average compared to an unmodified Xen hypervisor.
In order to satisfy various requirements from individuals to industries, commodity OS kernels have to support numerous features, including various file systems and numerous peripheral device drivers. These features inevitably result in a broad attack surface, and this attack surface becomes broader and broader as more services are consolidated into the kernel every year. As a consequence, the current kernel attack surface gives an adversary numerous chances to compromise the OS kernel and exploit the whole system. Although we have moved into the virtualization and cloud era, these security threats are not being addressed; instead, they become even worse with the introduction of additional software stacks, e.g., a hypervisor layer. Recent years have witnessed many proposed approaches that recognize the severity of this issue and make an effort to reduce the attack surface of the virtualized system. Specifically, schemes like NoHype (Szefer et al., 2011), XOAR (Colp et al., 2011), HyperLock (Wang et al., 2012) and Min-V (Nguyen et al., 2012) are able to significantly reduce the attack surface of the hypervisor. In addition, several other schemes have been proposed to reduce the huge kernel attack surface, which we summarize into the following three categories.
Build from Scratch. The first category attempts to build a micro-kernel with a minimal attack surface (Accetta et al., 1986; Herder et al., 2006b; Herder et al., 2006a; Klein et al., 2010). Although such micro-kernel schemes retrofit security, they are incompatible with legacy applications.
Re-Construction. The second category chooses to make significant changes to the current monolithic kernel. Nooks (Swift et al., 2002), LXFI (Mao et al., 2011) and SUD (Boyd-Wickizer and Zeldovich, 2010) isolate buggy device drivers in order to reduce the attack surface to the core kernel. Considering that the core kernel is still large, Nested Kernel (Dautenhahn et al., 2015) places a small isolated kernel inside the monolithic kernel, further reducing the attack surface. Besides, strict access-control policies (Cook, 2013; Liakh et al., 2010; Smalley et al., 2001) and system call restrictions (Seccomp, 2005) also contribute substantially. A common limitation of these approaches is that they all rely on modifications of the kernel source code, which is often unavailable in practice.
Customization. The last category tailors the kernel attack surface. Specifically, Tartler (Tartler et al., 2012) and Kernel Tailoring (Kurmus et al., 2013) patch kernel configurations to satisfy a particular workload and then recompile the kernel source code. Likewise, Lock-in-Pop (Li et al., 2017) patches and recompiles OS core libraries (i.e., glibc) to restrict selected applications' access to certain kernel code. These approaches lack distribution support because they require recompiling the source code. Ktrim (Kurmus et al., 2011) and KRAZOR (Kurmus et al., 2014) break kernel code integrity by binary-instrumenting kernel functions. They have to rely on kernel-specific features such as kprobes, a debugging mechanism that varies across kernel versions. Face-Change (Gu et al., 2014) is a Virtual Machine Introspection (VMI)-based technique that profiles the kernel for a target application. Based on the profiling results, it can provide a minimal kernel code base for the application. However, it supports neither Kernel Address Space Layout Randomization (KASLR) (Cook, 2013) nor multiple vCPUs in a single Virtual Machine (VM). On top of that, it induces a high worst-case performance overhead. These disadvantages impede its deployment in practice.
Overview. In this paper, we propose a reliable and practical virtualized system, named KASR, which is able to transparently reduce the attack surface of a commodity OS kernel at runtime.
Consider a specified application workload (e.g., an Apache server), whose operations do not need all kernel services. Instead, only a subset of the services is invoked to support both the target Apache process and the kernel. For example, both of them always require code blocks related to memory management (e.g., kmalloc, kfree, get_page) and synchronization mechanisms (e.g., _spin_lock). Apart from that, certain used kernel functions are only used during a specific period of the kernel's lifetime and remain unused for the rest of the time. For instance, the initialization (e.g., kernel_init) and power-off actions (e.g., kernel_power_off) are only taken when the kernel starts up and shuts down, respectively. In contrast to this used kernel code, many other kernel services are never executed. We call them unused kernel code in this paper. The unused kernel code resides in main memory, contributing a large portion of the kernel attack surface. For example, a typical kernel vulnerability in the perf_event_open system call can be exploited via a crafted system call that is unused, i.e., never invoked, in the Apache workload.
With that observation, KASR achieves the kernel attack surface reduction in two steps. The first step is to reliably deprive unused code of executable permissions. Commodity OS kernels are designed and implemented to support all kinds of use cases (e.g., the Apache server and Network File System service), and therefore there will be a large portion of kernel code (e.g., system call handlers) unused for a given use case. By doing so, this step could effectively reduce a large portion of the attack surface. The second step transparently segments used code and selectively activates it according to the specific execution demands of the given use case. This segmentation is inspired by the observation that certain kernel code blocks (e.g., kernel_init) only execute in a particular period, and never execute beyond that period.
As a consequence, KASR dramatically reduces the attack surface of a running OS kernel. In addition, as KASR guarantees that only used code could execute, it defends against all kernel-level code injection attacks. Besides, it is friendly to KASLR (Cook, 2013), and helps facilitate some other security features, e.g., by reducing the size of the control flow graph, so as to make Control-Flow Integrity (CFI) (Abadi et al., 2005) more efficient in invariant enforcement.
We implement a KASR prototype on a private cloud platform, with Xen as the hypervisor and Ubuntu Server LTS as the commodity OS. The OS kernel is an unmodified Linux 4.4.0-87-generic. KASR adds only a small number of SLoC to the hypervisor code base. We evaluate its security effectiveness under the given use cases (e.g., a Linux, Apache, MySQL and PHP (LAMP)-based server). The experimental results indicate that KASR reduces a large fraction of the kernel attack surface at the granularity of code pages, trims off a substantial share of in-memory Common Vulnerabilities and Exposures (CVEs) and of system calls, and prohibits most on-disk device drivers (including their CVEs) from being loaded into memory. In addition, KASR successfully detects and blocks all 6 real-world kernel rootkits. We also measure the performance overhead using several popular benchmark tools as given use cases, i.e., SPECint, httperf and bonnie++. The overall performance overhead is below 1% on average.
Contributions. In summary, we make the following contributions:
Propose a new approach to reliably and practically reduce the attack surfaces of commodity OS kernels.
Design and implement a practical KASR system on a private cloud platform.
Evaluate the security effectiveness of the KASR system with real-world kernel rootkits.
Measure the performance overhead of the KASR system using several popular benchmark tools.
Organization. The rest of the paper is structured as follows. In Section 2, we briefly describe our system goals and the threat model. In Section 3, we present the kernel attack surface, its measurement and the rationale of its reduction. We introduce the system architecture of KASR in detail in Section 4. Section 5 and Section 6 present the primary implementation of KASR and its performance evaluation. In Section 7 and Section 8, we discuss limitations and future work, and compare our system with existing works, respectively. Finally, we conclude this paper in Section 9.
Before we describe our design, we specify the threat model and the design goals.
In this paper, we focus on reducing the attack surfaces of commodity OS kernels in a virtualized environment. Currently, most personal computers, mobile phones and even embedded devices are equipped with virtualization support, such as Intel VT-x (Intel, Inc., 2011), AMD-V (AMD, Inc., 2005) and ARM virtualization extensions (ARM, Inc., 2005b). Thus, our system can work on such devices.
We assume a hypervisor or a Virtual Machine Monitor (VMM) working beneath the OS kernel. The hypervisor is trusted and secure as the root of trust. Although some existing hypervisors have vulnerabilities, we can leverage additional security services to enhance their integrity (Wang and Jiang, 2010; Cheng and Ding, 2013; Azab et al., 2010) and reduce their attack surfaces (Szefer et al., 2011; Colp et al., 2011). As our system relies on a training-based approach, we also assume the system is clean and trusted during the training stage, but it can be compromised at any time after that.
We consider threats coming from both remote adversaries and local adversaries. A local adversary resides in user applications, such as browsers and email clients. The kernel attack surface exposed to the local adversary includes system calls and exported virtual file systems (e.g., the Linux proc file system) available to user applications. A remote adversary stays outside and communicates with the OS kernel via hardware interfaces, such as a NIC. The kernel attack surface for the remote adversary usually refers to device drivers.
Our goal is to design a reliable, transparent and efficient system to reduce the attack surfaces of commodity OS kernels.
G1: Reliable. The attack surface should be reliably and persistently reduced. Even if kernel rootkits can compromise the OS kernel, they cannot enlarge the reduced attack surface to facilitate subsequent attacks.
G2: Transparent. The system should work transparently with commodity OS kernels. In particular, it neither relies on the source code nor breaks kernel code integrity through binary instrumentation. Requiring source code makes a system difficult to adopt in practice, and breaking code integrity raises compatibility issues with security mechanisms such as the Integrity Measurement Architecture (IMA).
G3: Efficient. The system should minimize the performance overhead, e.g., keep the overall performance overhead below 1% on average.
Among these goals, G1 provides the security guarantee, while the other two goals (G2 and G3) make the system practical. Every existing approach has one or more weaknesses: it is either unreliable (e.g., Lock-in-Pop (Li et al., 2017)), or depends on source code (e.g., seL4 (Klein et al., 2010)), or breaks kernel code integrity (e.g., Ktrim (Kurmus et al., 2011)), or incurs a high performance overhead (e.g., Face-Change (Gu et al., 2014)). Our KASR system achieves all the above goals at the same time.
We first present how to measure the attack surface of a commodity OS kernel, and then illustrate how to reliably and practically reduce it.
To measure the kernel attack surface, we need a security metric that reflects the system security. Generally, the attack surface of a kernel is measured by counting its source lines of code (SLoC). This metric is simple and widely used. However, it takes into account all the static source code of a kernel, regardless of whether that code is actually compiled into the kernel binary. To provide a more accurate security measurement, Kurmus et al. (Kurmus et al., 2013) propose a fine-grained generic metric, named GENSEC, which only counts effective source code compiled into the kernel. More precisely, in the GENSEC metric, the kernel attack surface is composed of the entire running kernel, including all Loadable Kernel Modules (LKMs).
However, the GENSEC metric only works with the kernel source code, rather than the kernel binary. Thus it is not suitable for a commodity OS shipped only as a kernel binary, which is made up of a kernel image and numerous module binaries. To bridge this gap, we apply a new KASR security metric. Specifically, instead of counting source lines of code, the KASR metric counts executable instructions.
Similar to prior schemes that commonly use SLoC as the metric of the attack surface, the KASR metric uses the Number of Instructions (NoI). It naturally works well with instruction sets where all the instructions have an equal length (e.g., ARM instructions (ARM, Inc., 2005a)). However, with a variable-length instruction set (e.g., x86 instructions (Intel, Inc., 2011)), it is hard to count instructions accurately. In order to address this issue on such platforms, we use the Number of Instruction Pages (NoIP). NoIP is reasonable and accurate due to the following reasons. First, it is consistent with the paging mechanism that is widely deployed by all commodity OS kernels. Second, the kernel instructions are usually contiguous and organized in a page-aligned way. Finally, it could smoothly address the issue introduced by variable-length instructions without introducing any explicit security and performance side-effects. In this paper, the KASR metric depends on NoIP to measure the kernel attack surface.
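As a concrete illustration of the NoIP metric, the following Python sketch (our own illustration, not part of KASR) counts the distinct pages touched by a list of (start address, length) instruction pairs, assuming 4 KiB pages; an instruction that straddles a page boundary counts toward both pages.

```python
PAGE_SIZE = 4096  # 4 KiB pages, as used by commodity OS kernels

def noip(instructions):
    """Number of Instruction Pages (NoIP): count the distinct pages
    touched by any byte of any executable instruction.

    `instructions` is an iterable of (start_address, length) pairs;
    variable-length (x86) and fixed-length (ARM) encodings are handled
    uniformly, since only page membership matters.
    """
    pages = set()
    for start, length in instructions:
        first = start // PAGE_SIZE
        last = (start + length - 1) // PAGE_SIZE
        pages.update(range(first, last + 1))
    return len(pages)

# The 4-byte instruction at 0x1ffe crosses into the next page,
# so three distinct pages are counted in total.
print(noip([(0x1000, 5), (0x1ffe, 4), (0x3000, 1)]))  # 3
```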
In a hardware-assisted virtualization environment, there are two levels of page tables. The first-level page table, i.e., Guest Page Table (GPT), is managed by the kernel in the guest space, and the other one, i.e., Extended Page Table (EPT), is managed by the hypervisor in the hypervisor space. The hardware checks the access permissions at both levels for a memory access. If the hypervisor removes the executable permission for a page in the EPT, then the page can never be executed, regardless of its access permissions in the GPT. These mechanisms have been widely supported by hardware processors (e.g., Intel (Intel, Inc., 2011), AMD (AMD, Inc., 2005), and ARM (ARM, Inc., 2005b)) and commodity OSes.
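The two-level permission check can be modeled with a toy sketch (hypothetical page numbers; a simulation of the rule, not the actual EPT data structures): a page executes only if both tables grant the permission, so revoking the execute bit in the EPT is authoritative.

```python
# Toy model of two-level permission checks: a guest page executes only
# if BOTH the guest page table (GPT) and the hypervisor's extended page
# table (EPT) grant the execute permission.

gpt_exec = {0: True, 1: True}   # guest kernel marks both pages executable
ept_exec = {0: True, 1: True}   # hypervisor's view of the same pages

def can_execute(page):
    # The hardware checks permissions at both translation levels.
    return gpt_exec.get(page, False) and ept_exec.get(page, False)

ept_exec[1] = False             # hypervisor revokes execute in the EPT
print(can_execute(0), can_execute(1))  # True False: the guest cannot override
```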
With the help of the EPT, we propose to reduce the attack surface by transparently removing the executable permissions of certain kernel code pages. This approach achieves all system goals listed before. First, it is reliable (achieving G1) since an adversary in the guest space does not have the capability of modifying the EPT configurations. Second, the attack surface reduction is transparent (achieving G2), as the page-table based reduction is enforced in the hypervisor space, without requiring any modifications (e.g., instruction instrumentation) of the kernel binary. Finally, it is efficient (achieving G3) as all instructions within pages that have executable permissions are able to execute at a native speed.
We first elaborate on the design of the KASR system. As depicted in Figure 1, the general workflow of KASR proceeds in two stages: an offline training stage followed by a runtime enforcement stage. In the offline training stage, a trusted OS kernel Kern runs beneath a use case (e.g., a user application) within a virtual machine. The KASR offline training processor, residing in the hypervisor space, monitors the kernel's lifetime run, records its code usage and generates a corresponding database. The generated kernel code usage database is trusted, as the system in the offline training stage is clean. Once the generated database becomes stable and ready to use, the offline training stage is done.
In the runtime enforcement stage, the KASR module, running in the same virtual machine, loads the generated database and reduces the attack surface of Kern. The kernel attack surface is made up of the kernel code from the kernel image as well as loaded LKMs. A large part of the kernel attack surface is reliably removed (the dotted square in Figure 1), while the remaining part (the solid shaded square in Figure 1) is still able to support the running of the use case. The attack surface reduction is reliable, as the hypervisor can use virtualization techniques to protect itself and the KASR system, meaning that no code from the virtual machine can revert the enforcement.
Commodity OSes are designed and implemented to support various use cases. However, for a given use case, only certain code pages within the kernel (e.g., Kern) are used while other code pages are unused. Thus, the KASR offline training processor can safely extract the used code pages from the whole kernel, the so-called used code extraction. On top of that, the used code pages can be segmented into three phases (i.e., startup, runtime and shutdown). The code segmentation technique is inspired by the observation that some used code pages are only used in a particular time period. For instance, the init functions are only invoked when the kernel starts up and thus belong to the startup phase. In contrast, certain functions, e.g., kmalloc and kfree, are used during the kernel's whole lifetime and belong to all three phases. The KASR offline training processor uses the used code extraction technique (Section 4.1.1) to extract the used code pages, and leverages the used code segmentation technique (Section 4.1.2) to segment used code into different phases. All the recorded code usage information is saved into the kernel code usage database, as shown in Figure 2.
The database becomes stable quickly after the KASR offline processor repeats the above steps several times. This observation has also been confirmed by other research work (Kurmus et al., 2013). For instance, for the LAMP use case, a typical httperf (Mosberger and Jin, 1998) training run of about ten minutes is sufficient to detect all required features, even though httperf does not cover all possible paths. This observation is reasonable for two reasons. First, people do not update the OS kernel frequently, so it stays stable for a relatively long period (e.g., a year). Second, although user-level operations are complex and diverse, the invoked kernel services (e.g., system calls) are relatively stable, e.g., the kernel code that handles network packets and system files is constantly the same.
A key requirement of this technique is to collect all used pages for a given workload. This means the collection should cover the whole lifetime of an OS kernel, from the very beginning of the startup phase to the last operation of the shutdown phase. A straightforward solution is to use the trace service provided by the OS kernel. For instance, the Linux kernel provides the ftrace feature to trace kernel-level function usage. However, all existing integrated tracing schemes fail to cover the whole life cycle: they miss either the beginning of the startup phase before they are turned on, or the end of the shutdown phase after they are turned off. For example, ftrace always misses the code usage of the startup phase (Kurmus et al., 2013) before it is enabled. Extending the trace feature requires modifying the kernel source code, which conflicts with our system goals (e.g., G2). To avoid kernel code modification and cover the whole life cycle of the OS kernel, we propose a hypervisor-based KASR offline training processor. The offline training processor, working in the hypervisor space, starts to run before the kernel initializes and remains operational after the kernel exits.
In the following, we will discuss how to trace and identify the used code pages in the kernel image and loaded LKMs.
Kernel Image Tracing. Before the kernel starts to run, the offline training processor removes the executable permissions of all code pages of the kernel image. By doing so, every code execution within the kernel image raises an exception, driving the control flow to the offline training processor. In the hypervisor space, the offline training processor maintains a database recording the kernel code usage status. When getting an exception, the offline training processor updates the corresponding record, indicating that a kernel code page is used. To avoid this kernel code page triggering any unnecessary exceptions later, the offline training processor sets it to executable. As a result, it is guaranteed that only newly executed kernel code pages raise exceptions while the kernel continues running, thus covering the lifetime used code pages of the kernel image. Note that the offline training processor filters out user-space code pages by checking where the exception occurs (i.e., the value of the Instruction Pointer (IP) register).
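The trap-and-record loop described above can be sketched as a small simulation (hypothetical page numbers and execution trace; not the actual hypervisor code): every page starts non-executable, and each page faults exactly once before being recorded and re-enabled.

```python
# Sketch of exception-driven tracing: every kernel code page starts
# non-executable; the first execution of a page "traps" to the training
# processor, which records it as used and re-enables it so that each
# page faults at most once.

class OfflineTrainer:
    def __init__(self, kernel_pages):
        self.executable = {p: False for p in kernel_pages}
        self.used = []  # code-usage database (order of first execution)

    def execute(self, page):
        if not self.executable[page]:      # permission fault -> trap
            self.used.append(page)         # update the usage record
            self.executable[page] = True   # avoid further exceptions

trainer = OfflineTrainer(kernel_pages=[0, 1, 2, 3])
for page in [0, 2, 0, 2, 1]:  # a hypothetical execution trace
    trainer.execute(page)
print(trainer.used)           # [0, 2, 1]; page 3 is never used
```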
Kernel Modules Tracing. The above tracing mechanism works smoothly with the kernel image, but not with newly loaded LKMs. All LKMs can be dynamically installed and uninstalled into/from memory at runtime, and the newly installed kernel modules may re-use the code pages that have already been released by other modules. Subsequently, such code pages are missed if we still follow the kernel image tracing mechanism as the pages do not trigger any execution exceptions.
To this end, a sliding-window-based approach is proposed to capture all loaded modules. The sliding window has page granularity and its size is configurable; for instance, the window in Figure 3 holds two pages. The code pages within the sliding window are executable and all other code pages are non-executable. This guarantees that the sliding window captures every newly executed code page through an execution exception. The window update policy is also configurable and is First-In, First-Out by default. When the first two exceptions occur, the offline training processor stores the page identifications of the corresponding code pages (i.e., Page 1 and Page 2) in the window, and sets them to executable. When a third code page (i.e., Page 3) executes, the offline training processor evicts the oldest page (i.e., Page 1) out of the window, sets it to non-executable, and pushes Page 3 into the window, as shown in Figure 3. The end user is allowed to customize the window size and the update policy on demand (see Section 5.2). The sliding-window-based tracing approach is also suitable for kernel image tracing.
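The FIFO sliding window can be sketched as follows (a two-page window and hypothetical page IDs, mirroring the Figure 3 example): because an evicted page becomes non-executable again, a reused page frame faults again and is re-recorded.

```python
from collections import deque

# Sliding-window tracing for loadable modules: only pages inside the
# window are executable; a new page evicts the oldest one (FIFO), so
# reused page frames fault again and are re-recorded.

class SlidingWindowTracer:
    def __init__(self, size=2):
        self.window = deque()
        self.size = size
        self.trace = []          # every execution exception observed

    def execute(self, page):
        if page in self.window:  # already executable: no exception
            return
        self.trace.append(page)  # record the newly executed page
        if len(self.window) == self.size:
            self.window.popleft()  # evict oldest, mark non-executable
        self.window.append(page)   # newly executable page

tracer = SlidingWindowTracer(size=2)
for page in ["P1", "P2", "P3", "P1"]:
    tracer.execute(page)
print(list(tracer.window))  # ['P3', 'P1']: P1 was evicted and re-captured
print(tracer.trace)         # ['P1', 'P2', 'P3', 'P1']
```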
Page Identification. The traced information is saved in a database, and the database reserves a unique identity for each code page. It is relatively easy to identify all code pages of the kernel image when its address space layout is unique and constant every time the kernel starts up; in that case, a Page Frame Number (PFN) can be used as the identification. However, if the OS kernel enables the KASLR technology (Cook, 2013), the PFN of a code page is no longer constant. The same issue arises with kernel modules, whose pages are dynamically allocated at runtime, so each time the kernel may assign a different set of PFNs to the same module. A possible approach is to hash each page's content as its identity. This works for most code pages but fails for those containing instructions with dynamically determined operands, e.g., a call instruction needs a relative offset as its destination address, and this offset may differ each time.
To address this issue, we propose a multi-hash-value approach. Specifically, instead of computing one hash value over a whole code page, we pick n short code blocks out of a page, compute a hash value for each of them and store all the values in an array as the page's identity. If m out of the n (e.g., more than one half) hash values in two pages match, the two pages are considered the same. False positives occur if the ratio of m to n is too low, while page identification may fail if the ratio is too high. Based on our experiments, we resolve these issues by choosing a proper ratio of m to n. Note that the code blocks do not have to cover a whole page as long as they can identify a page. Besides, if the code blocks' starting page offsets were predefined, attackers could craft a malicious code page where the chosen blocks remain intact and the remaining blocks are attack payloads. In that case, the KASR module would regard the page as a used kernel page within the code-usage database and allow it to execute in the runtime enforcement stage (see Section 4.2). To mitigate this attack, the offsets at which the code blocks are picked within a page are a secret value, which users determine before the offline training stage starts and which is then stored in the hypervisor space.
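A minimal sketch of the multi-hash identity follows. The block offsets and the m-of-n threshold here are hypothetical (the paper's actual values are chosen experimentally); the point is that a page whose relocated call/jump offsets differ in one block still matches its recorded identity.

```python
import hashlib

PAGE_SIZE = 4096
BLOCK_SIZE = 32
# Hypothetical secret block offsets, chosen before training (n = 4).
SECRET_OFFSETS = [0x040, 0x3a0, 0x7f0, 0xc10]
MATCH_THRESHOLD = 3  # hypothetical m = 3: m-of-n block hashes must agree

def page_identity(page_bytes):
    """Hash n short blocks at the secret offsets; the array is the ID."""
    return [hashlib.sha256(page_bytes[off:off + BLOCK_SIZE]).hexdigest()
            for off in SECRET_OFFSETS]

def same_page(id_a, id_b):
    """Two pages match if at least m of the n block hashes agree, which
    tolerates blocks whose relocated call/jump offsets differ."""
    return sum(a == b for a, b in zip(id_a, id_b)) >= MATCH_THRESHOLD

page = bytearray(PAGE_SIZE)
reloaded = bytearray(page)
reloaded[0x042] ^= 0xFF  # one relocated offset differs after reinstall
print(same_page(page_identity(bytes(page)),
                page_identity(bytes(reloaded))))  # True: 3 of 4 blocks match
```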
This technique is used to segment the used code into several appropriate phases. By default, there are three phases: startup, runtime, and shutdown, indicating in which phases the used code has been executed. When the kernel is executing within one particular phase, the offline training processor marks the corresponding code pages with that phase. After the kernel finishes its execution, the offline training processor has marked all used code pages and saves their records into the database. To detect phase switches, the offline training processor captures phase-switch events. For the switch between startup and runtime, we choose the event when the first user application starts to run; for the switch between runtime and shutdown, we choose the execution of the reboot system call as the switch event.
When the offline training stage is done and a stable database has been generated, KASR is ready for runtime enforcement. As shown in Figure 4, the KASR module loads the generated database for a specific workload, and reduces the kernel attack surface in two steps:
Lifetime Segmentation. It aims to further reduce the kernel attack surface upon the permission deprivation. As shown in Figure 4, it transparently allows the used kernel code pages of a particular phase to execute while setting the remaining pages to non-executable.
All instructions within the executable pages can execute at a native speed, without any intervention from the KASR module. When the execution enters the next phase, the KASR module needs to revoke the executable permissions from the pages of the current phase, and grant executable permissions to the pages of the next phase. To reduce the switch cost, the KASR module performs two optimizations. First, if a page is also executable within the successive phase, the KASR module skips its permission revocation and keeps it executable. Second, the KASR module updates the page permissions in batch, rather than updating them individually.
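The two optimizations reduce a phase switch to two set differences, as this sketch shows (hypothetical page names; the real permissions live in the EPT): pages shared by both phases appear in neither set and keep their permissions untouched.

```python
# Phase switch as batched set operations: revoke execute only on pages
# not needed in the next phase, grant only on newly needed ones; pages
# in both phases are skipped entirely.

def phase_switch(current_exec, next_exec):
    revoke = current_exec - next_exec  # stale permissions, dropped in batch
    grant = next_exec - current_exec   # shared pages stay executable
    return revoke, grant

startup = {"init", "alloc", "lock"}
runtime = {"alloc", "lock", "syscall", "net"}
revoke, grant = phase_switch(startup, runtime)
print(sorted(revoke))  # ['init']
print(sorted(grant))   # ['net', 'syscall']
```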
As system calls are the major runtime interface that an adversary may utilize to launch attacks against the OS kernel, it is worthwhile for the KASR system to apply a particular optimization to system calls. The optimization aims to completely remove all unused system call handlers and selectively enable the used handlers according to the phase switch above. However, the previous two techniques are page-based and cannot be directly applied to this system call optimization, because unused and used system call handlers may coexist on the same code pages.
Fortunately, all system call entries are located in a system call table, which is easy to manipulate. The KASR system extracts all system call entries from this table, removes unused ones and segments used ones into different phases. Specifically, in the offline training stage, the offline training processor records which system call entries are invoked; whenever an entry is invoked, it marks the corresponding system call as used. Finally, the offline training processor produces the system call usage information. In the runtime enforcement stage, the KASR module relies on this usage information to generate fake system call tables, retaining all used system call entries and removing the others. When the phase switches (e.g., from the startup phase to the runtime phase), the KASR module switches in the corresponding system call table. This optimization is efficient (i.e., it introduces no extra performance overhead to the KASR system) and effective (i.e., it achieves a significant reduction of the system-call interface).
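A sketch of per-phase fake system call tables follows (illustrative handler names and system call numbers, not the kernel's actual table layout): entries not observed in training are rewired to a stub that rejects the call.

```python
import errno

# Sketch of per-phase fake system call tables: the real table is copied
# and filtered so that unused entries point to a reject stub, mirroring
# the kernel's own sys_ni_syscall convention.

def sys_read(*args):  return "read"
def sys_write(*args): return "write"
def sys_perf_event_open(*args): return "perf_event_open"

def sys_ni_syscall(*args):
    return -errno.ENOSYS  # stub: "function not implemented"

real_table = {0: sys_read, 1: sys_write, 298: sys_perf_event_open}
usage = {"runtime": {0, 1}}  # perf_event_open never observed in training

def build_table(phase):
    return {nr: (h if nr in usage[phase] else sys_ni_syscall)
            for nr, h in real_table.items()}

runtime_table = build_table("runtime")
print(runtime_table[0]())    # 'read': a used entry is retained
print(runtime_table[298]())  # -errno.ENOSYS: the unused entry is blocked
```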
This section presents the implementation details of the KASR database, including database in-memory data structure, database population as well as database saving and loading operations.
Basically, the database consists of two singly linked lists, which manage the pages of the kernel image and of loaded modules, respectively. Both lists have their own list lock to support concurrent updates. Each node of a list represents a page and is composed of a page ID, a status flag, a node lock and a node pointer pointing to the next node. The page ID identifies a page, especially during database updates. The status flag indicates the phases (e.g., startup and runtime) of a page, and the node lock avoids race conditions while allowing other nodes to be processed in parallel (so-called fine-grained locking).
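The node and list layout can be sketched in Python (the actual KASR database lives in the hypervisor; the names and bit assignments here are illustrative):

```python
import threading

# In-memory sketch of the code-usage database: two singly linked lists
# (kernel image vs. modules), a list lock guarding insertions, and a
# per-node lock so unrelated nodes can be updated in parallel.

class PageNode:
    def __init__(self, page_id):
        self.page_id = page_id      # PFN or multi-hash identity
        self.status = 0             # phase bit flags (illustrative)
        self.lock = threading.Lock()
        self.next = None

class PageList:
    def __init__(self):
        self.head = None
        self.list_lock = threading.Lock()

    def insert(self, page_id):
        with self.list_lock:        # guard concurrent insertions
            node = PageNode(page_id)
            node.next = self.head
            self.head = node
            return node

    def find(self, page_id):
        node = self.head
        while node and node.page_id != page_id:
            node = node.next
        return node

image_pages, module_pages = PageList(), PageList()
n = image_pages.insert(0x1000)
with n.lock:                        # fine-grained: only this node is held
    n.status |= 0b001               # mark as used in the startup phase
print(image_pages.find(0x1000).status)  # 1
```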
In our implementation, the generic OS kernel does not enable kernel-level randomization, and thus it is relatively simple to identify a code page of the kernel image. Specifically, the base-address range of the kernel image remains the same every time the kernel starts up, and the relationship between linear addresses and physical addresses is a constant one-to-one mapping. Thus, the database can use the PFN as the page ID.
For kernel modules, their base addresses differ every time they are installed, making the relative offsets of instructions such as call and jump variable. As a result, these varied operands make the page content different each time, even for the same page. To choose proper page IDs for such code pages, we use the multi-hash-value-based page ID. As stated in Section 4, the multi-hash-value page ID has two factors (m and n), and adjusting them directly affects performance and accuracy. If the ratio of m to n is too small, KASR gains better performance but may suffer from false positives; if it is too large, we pay additional performance overhead for identifying a page. Thus, reasonable values for both m and n are needed to balance performance and accuracy. In practice, we enumerate the possible configurations and select values of m and n for which there are no false positives and the performance overhead is low.
The status flag uses three bits to reserve three values (, , ), corresponding to the startup, runtime and shutdown phases. The status flag is initialized to and becomes whenever the kernel boots up (i.e., the startup phase). Once the kernel switches from the startup phase to the runtime phase, or from the runtime phase to the shutdown phase, appropriate events are triggered so that the offline training processor updates the status flag accordingly. In our implementation, all code pages of the guest OS are deprived of executable permissions. Once the OS starts to boot, it raises numerous EPT exceptions. In the hypervisor space, a handler (i.e., ept_handle_violation) responds to these events, so the offline training processor can mark the start of the runtime phase by intercepting the first execution of user-space code pages, and its end by intercepting the execution of the reboot system call.
The KASR offline training processor relies on a sliding window based approach to build up its database. To avoid missing any code page, an intuitive approach is to make the sliding window contain only one page. However, we find that the one-page sliding window halts the system at runtime. The reason is that the x86 instructions have variable lengths. As a consequence, an instruction may cross a page boundary, meaning that the first part of the instruction is at the end of a page, while the rest is in the beginning of the next page. Under such situations, the instruction fetching will result in infinite loops (i.e., trap-and-resume loops) in the one-page-based sliding window mechanism. Note that this corner case will never occur on the ARM platform as all ARM instructions are of the same length and thus no instruction crosses the page boundary.
To address this problem, we set the minimum size of the sliding window to . Besides addressing the cross-page-boundary issue, a larger sliding window also accelerates tracing. However, it could miss certain pages; e.g., if a loadable kernel module occupies only one page (smaller than the window size), we may miss collecting that module's page. Fortunately, we observed that all existing legacy kernel modules are larger than pages.
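The cross-page-boundary problem can be illustrated with a small check: does an instruction that starts near the end of a window still fit entirely inside it? The page size below is the usual x86 4 KB; the function name and the window sizes exercised are illustrative (the paper's exact minimum window size is not preserved in this text).

```python
PAGE_SIZE = 4096  # 4 KB pages, as on x86

def instruction_fits(window_pages, insn_offset, insn_len):
    """True if an instruction starting at byte `insn_offset` of the
    sliding window fits entirely within `window_pages` pages."""
    return insn_offset + insn_len <= window_pages * PAGE_SIZE

# An x86 instruction has variable length (up to 15 bytes) and may
# start one byte before a page boundary. A one-page window can
# therefore never contain its tail: re-fetching faults on the same
# page again, producing the trap-and-resume loop described above.
# A window of at least two pages always covers such an instruction.
```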
To this end, running the offline training stage just once is far from enough to obtain all used pages. Thus, it is necessary to repeat this stage for multiple rounds until the database size becomes stable. In our experiments, rounds are enough to obtain a stable database (see Section 6).
The database is generated in the hypervisor space and stored on the hard disk for reuse. There are two possible ways to save and load the database. The straightforward approach is to add a disk and file-system driver to the hypervisor so that KASR can directly write/read the database, which would add hundreds of extra SLoC to the hypervisor space and might introduce potential vulnerabilities. To minimize the size of the hypervisor, we instead re-use the existing drivers and exported interfaces of the privileged domain (e.g., domain 0 of Xen), which eliminates unnecessary security risks and is more flexible. We developed a tiny tool in the privileged domain that explicitly saves the database to the domain's disk after the offline training stage, and loads the existing database into the hypervisor space at the runtime enforcement stage.
We have implemented a KASR prototype on our private cloud platform, a Dell Precision T PC with eight CPU cores (i.e., Intel Xeon-E) running at GHz. The Intel VT-x feature is enabled and supports pages at a granularity of KB. The hypervisor is Xen version , and the Hardware-assisted Virtual Machine (HVM) runs Ubuntu Server LTS with an unmodified Linux kernel of version --generic, four virtual CPU cores and GB of physical memory. KASR adds only around K SLoC to Xen.
In the rest of this section, we evaluate its effectiveness through the reduction rates of the kernel attack surface, in-memory Common Vulnerabilities and Exposures (CVEs) and system calls. The use cases we choose are SPECint, httperf, bonnie++, LAMP (i.e., Linux, Apache, MySQL and PHP) and NFS (i.e., Network File System). Furthermore, we test and analyze its security against real-world kernel rootkits. Also, we measure the performance overhead introduced by KASR through the selected use cases above. The experimental results demonstrate that KASR effectively reduces the kernel attack surface by , in-memory CVE vulnerabilities by , and system calls by , prohibits % and % of on-disk device drivers and their related CVEs, respectively, from executing, safeguards the kernel against popular kernel rootkits, and imposes negligible (less than ) performance overhead on all use cases.
Table 1. Kernel attack surface reduction per use case. Columns: Cases; Original Kernel: Page (#); Permission Deprivation: Used Page (#), Reduction (%); Lifetime Segmentation: Runtime Used Page (#), Reduction (%).
In the runtime enforcement stage, we measure the kernel attack surface reduction through three representative benchmark tools, namely, SPECint, httperf and bonnie++ and two real-world use cases (i.e., LAMP and NFS).
SPECint (Standard Performance Evaluation Corporation, 2006) is an industry standard benchmark intended for measuring the performance of the CPU and memory. In our experiment, the tool has sub-benchmarks in total and they are all invoked with a specified configuration file (i.e., linux64-ia32-gcc43+.cfg).
On top of that, we measure the network I/O of the HVM using httperf (Mosberger and Jin, 1998). The HVM runs an Apache Web server, and Dom tests its I/O performance at request rates from to requests per second ( connections in total).
Also, we test the disk I/O of the guest by running bonnie++ (Bonnie, 1999) with its default parameters. For instance, bonnie++ by default creates a file, whose size is twice the memory size, in a specified directory.
Besides, we run a LAMP-based web server inside the HVM. First, we use the standard benchmark ApacheBench to continuously access a static PHP-based website for five minutes. Then the Web server scanner Nikto (Sullo, 2012) runs to test the Web server for insecure files and outdated server software, and to perform generic and server-type-specific checks. This is followed by launching Skipfish (Michal et al., 2010), an active web application security reconnaissance tool, which operates in an extensive brute-force mode to carry out comprehensive security checks. Running these tools against the LAMP server aims to cover as many kernel code paths as possible.
Lastly, the other comprehensive application is NFS: the HVM is configured to export a shared directory via NFS. To stress the NFS service, we again use bonnie++ to issue read and write accesses to the directory.
All results are displayed in Table 1. Note that the average results for SPECint are computed over its sub-benchmark tools. We observe two interesting properties of the kernel attack surface from this table. First, the attack surface reduction after each step is significant and stable across use cases. Generally, the attack surface is reduced by roughly % and % after permission deprivation and lifetime segmentation, respectively, indicating that less than half of the kernel code suffices to serve all provided use cases. Second, the complicated applications (i.e., LAMP and NFS) occupy more kernel code pages than the benchmarks, indicating that they invoke more kernel functions.
Kernel vulnerabilities are a serious practical security problem, so we also characterize the reduced kernel attack surface by the number of removed CVEs. Although some vulnerable kernel functions (e.g., architecture-specific code) contain CVE vulnerabilities, they are never loaded into memory during the kernel's lifetime and hence do not contribute to the attack surface. As a result, we only consider the CVEs whose code exists in kernel memory and can be triggered.
We investigate CVE bugs from the last two years that provide a link to the Git repository commit, and identify CVE-related functions that exist in the kernel memory of all five use cases, making exploitation possible. It can be seen from Figure 5 that KASR has removed % of in-memory CVEs. Specifically, some vulnerable kernel functions within unused kernel code pages are deprived of executable permissions in the first step. For example, the ecryptfs_privileged_open function in CVE-2016-1583 before Linux kernel- is unused, and is thus eliminated. After the second step, some other vulnerable functions are also removed (e.g., icmp6_send in CVE--).
As shown in Figure 5, KASR has successfully trimmed more than half of the system calls (abbreviated to syscalls in Linux) for all use cases. The original number of syscalls located in the syscall table is in total. After KASR is enabled, an average of syscalls is removed at runtime.
Specifically, the first step completely eliminates all unused syscalls. It removes non-standard Linux extension syscalls (e.g., process_vm_readv), obsolete syscalls (e.g., olduname), and rarely used syscalls (e.g., vm86). After the second step, syscalls used only in the startup and shutdown phases are removed. One representative syscall in the startup phase is chroot, which sets the root directory during system bootup. In the shutdown phase, a familiar syscall is reboot, which restarts or powers off the system.
As device drivers are buggier than the core kernel (Swift et al., 2002), we also measure them separately in every use case. Device drivers (Kadav and Swift, 2012) are categorized into four classes (i.e., Char, Block, Net and Others). The class Others refers to particular drivers such as the virtual device drivers of Xen within the HVM. All drivers are located in the directory /lib/modules/--generic and are loaded into memory on demand. KASR dictates that only drivers that have been traced are permitted to execute, while all other drivers are strictly prohibited.
In Table 2, the number of device driver modules (#) and their related CVEs (#) in every sub-class are listed for two groups. Take the class Char as an example: the original group has driver modules containing CVEs, while the KASR group has only CVE, contained in driver modules. The original group indicates that all on-disk drivers can be loaded into memory, whereas the KASR group allows only the traced drivers to be loaded. As shown in Table 2, the drivers loaded in the KASR group, in every use case, account for only % of those (i.e., ) in the original group, indicating that % of the drivers are unnecessary and never invoked. Correspondingly, % (i.e., out of ) of the CVEs contained in the drivers cannot be triggered.
Table 2. Device driver modules (#) and related CVEs (#) per class, for the Original and KASR groups.
Even though the kernel attack surface is largely reduced by KASR, vulnerabilities may still exist in the kernel and could be exploited by rootkits. We demonstrate the effectiveness of KASR against real-world kernel rootkits. Specifically, we selected popular real-world kernel rootkits from a previous work (Riley et al., 2008) and the Internet. These rootkits work on typical Linux kernel versions ranging from to , representing state-of-the-art kernel rootkit techniques. All of them launch attacks by inserting a loadable module, and their attacks can be divided into three steps:
inject malicious code into kernel allocated memory;
hook the code on target kernel functions (e.g., original syscalls);
transfer kernel execution flow to the code.
KASR is able to prevent the third step from being executed. Specifically, rootkits could succeed at Step-1 and Step-2, since they can utilize exposed vulnerabilities to modify critical kernel data structures, inject their code and perform target-function hooking so as to redirect the execution flow. However, they cannot execute the code in Step-3, because KASR decides whether a kernel page is granted executable permission. Recall that KASR reliably dictates that unused kernel code (i.e., code with no record in the database) has no right to execute in the kernel space, including run-time code injected by rootkits. Therefore, when the injected code starts to run in Step-3, EPT violations will inevitably occur and be caught by KASR.
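The enforcement decision above can be modeled as a lookup in the trained database: a page recorded for the current phase is executable, anything else raises a violation. The function name, phase constants and return strings are ours; the real check happens inside the hypervisor's EPT-violation handler, and the two non-allow outcomes correspond to the "stop the guest" and "log the event" responses discussed later in this paper.

```python
STARTUP, RUNTIME, SHUTDOWN = 1, 2, 3  # illustrative phase values

# Trained database: page ID -> phase in which the page may execute.
database = {
    0x100: STARTUP,   # e.g., boot-only kernel code
    0x200: RUNTIME,   # e.g., syscall handlers used by the workload
}

def on_exec_attempt(page_id, current_phase, strict=True):
    """Hypervisor-side decision on an EPT execute violation.

    Pages recorded for the current phase are allowed to run.
    Unknown pages (e.g., rootkit-injected code) are never granted
    executable permission: stop the guest in strict mode, or log
    the event for later forensics otherwise.
    """
    if database.get(page_id) == current_phase:
        return "allow"
    return "stop-guest" if strict else "log-and-continue"
```

Because injected rootkit code by definition has no database record, it can never reach the "allow" branch, regardless of how Step-1 and Step-2 were accomplished.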
The experimental results from Table 3 clearly show that KASR effectively defends against all rootkits. In fact, KASR prevents any unauthorized code from running, and all the above rootkits have to inject malicious (unauthorized) code to launch attacks. Note that hardware-assisted VMBRs (short for Virtual-Machine Based Rootkits) such as Blue Pill (Rutkowska, 2006) and Vitriol (Zovi, 2006) are no exception, because such rootkits still require unauthorized loadable modules to load and execute. As a result, KASR is able to defend against all (known and unknown) kernel rootkits that rely on code-injection attacks.
In this section, we evaluate the performance impact of KASR on CPU computation, network I/O and disk I/O, respectively. Each benchmark is conducted for two groups (i.e., original and KASR), and the measurements are shown in Tables 4, 5 and 6, respectively.
Table 4. SPECint results. Columns: Programs, Original (), KASR (), Overhead.
Table 5. httperf results. Columns: Request Rate, Net I/O (Original), Net I/O (KASR), Overhead.
Table 4 lists the SPECint benchmark results. The performance overhead caused by KASR within every program is quite small and stable. In particular, the maximum performance overhead is % while the average performance overhead is % for the overall system. Table 5 illustrates the network I/O results in different request rates. The overhead ranges from % to % and the average is only %. The experimental results in Table 6 are generated based on two test settings, i.e., sequential input and sequential output. For each setting, the read, write and rewrite operations are performed and their results indicate that KASR only incurs a loss of % on average.
We take the LAMP server as an example to illustrate the offline training efficiency, i.e., how fast a stable database can be constructed for a given workload. Specifically, we repeat the offline training stage for several rounds to build the LAMP database from scratch. After the first round, we obtain code pages, % of the final page number. Successive offline training rounds are then completed one by one, each updating the database incrementally based on the previous one, ensuring that the final database contains all necessary pages. As Figure 6 shows, the database as a whole becomes steady after multiple rounds (i.e., in our experiments). This observation is also confirmed in the other cases.
In fact, building a particular database from scratch is still time-consuming. To further accelerate this process, we start the offline training stage from an existing database. In our experiments, we merge the databases generated for SPECint, httperf and bonnie++ into a larger one, and generate the LAMP database based on it. Interestingly, we find that only rounds are enough to generate a stable database for LAMP, as shown in Figure 7, significantly improving the offline training efficiency.
In this section, we will discuss known issues about KASR, which are summarized below.
Similar to Ktrim (Kurmus et al., 2011), KRAZOR (Kurmus et al., 2014) and Face-Change (Gu et al., 2014), KASR uses a training-based approach. As such an approach might miss some corner cases, it may cause KASR to mark certain pages that should be used as unused, resulting in an incomplete offline training database. Theoretically speaking, such situations are possible; however, in practice, we have never observed them in our experiments so far. Interestingly, Kurmus et al. (Kurmus et al., 2013) found that a small offline training set is usually enough to cover all used kernel code for a given use case, implying that the corner cases usually do not increase the kernel code coverage. If the generated database is incomplete, EPT violations will be triggered at runtime. For such situations, KASR has two possible responses. One is to directly stop the execution of the guest, which is suitable for security-sensitive environments where any violation may be treated as a potential attack. The other is to generate a log message, which is friendly to applications with high-availability requirements. The generated log contains the execution context and the corresponding memory content, facilitating further analysis, e.g., system forensics.
Although its current implementation targets the x86 platform, the design of KASR can be ported to ARM. The virtualization technique is available on the ARM platform, and we do not see any technical barrier to the migration. We even identify several advantages on ARM; e.g., ARM instructions are of fixed length, which naturally avoids the trouble introduced by variable-length instructions (see Section 3.1). Besides, unlike the existing works in Section 1, KASR requires no kernel source code to achieve an efficient kernel attack surface reduction. Thus, it provides a generic approach to enhancing the security of commodity OS kernels (e.g., Windows).
By default, we have three segmented phases. In fact, the whole lifecycle could be segmented into more phases, corresponding to different working stages of a user application. Intuitively, a more fine-grained segmentation achieves a better kernel attack surface reduction. Nonetheless, more phases introduce more performance overhead, such as additional phase switches. In addition, they increase the complexity of the KASR offline training processor, and consequently the trusted computing base (TCB). Finally, the KASR module has to deal with potential security attacks, e.g., malicious phase switches. To prevent such attacks, a state machine graph of phases should be provided, where the predecessor, successor and switch condition of each phase are clearly defined. At runtime, the KASR module loads this graph and enforces its integrity: only the phase switches existing in the graph are legal, and any other switch is rejected. This solution is similar to Control-Flow Integrity (CFI).
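A phase state machine of this kind could be enforced as sketched below. The transition set shown (startup → runtime → shutdown) mirrors the three default phases; the class and attribute names are our own, and a real deployment would load the graph from the trained configuration rather than hard-code it.

```python
# Allowed phase switches, i.e., the state machine graph loaded by
# the KASR module: each entry is a (predecessor, successor) pair.
ALLOWED_SWITCHES = {
    ("startup", "runtime"),
    ("runtime", "shutdown"),
}

class PhaseMachine:
    def __init__(self):
        self.phase = "startup"

    def switch(self, target):
        """Permit only transitions present in the graph; any other
        switch (e.g., a malicious jump back to startup) is rejected,
        analogous to CFI for phase transitions."""
        if (self.phase, target) in ALLOWED_SWITCHES:
            self.phase = target
            return True
        return False
```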
Besides the code-injection attacks, KASR may alleviate other types of kernel rootkits: (1) code reuse attacks, and (2) Data-Oriented Programming attacks (Hu et al., 2015).
Code reuse attacks such as ROP (Prandini and Ramilli, 2012) and JOP (Bletsch et al., 2011) do not inject any code; instead, they select useful blocks (i.e., gadgets) of existing code to launch attacks. As KASR removes more than half of the code pages, the number of gadget candidates is correspondingly reduced. Consequently, it raises the bar for launching code reuse attacks due to the lack of expected gadgets.
The DOP attack (Hu et al., 2015) mainly relies on chained gadgets and gadget dispatchers to launch memory exploits, which strictly conforms to the CFI (Abadi et al., 2005). Thus, such attacks cannot be detected by CFI. Fortunately, both the gadgets and the dispatchers are still composed of existing code blocks in the kernel space. It is thus possible for KASR to alleviate the attack. Demonstrating KASR against code-reuse/DOP rootkits is our future work.
In this section, we make a comparison of different approaches that reduce the kernel attack surface. As discussed before, in order to reduce the kernel attack surface, existing approaches either build a security-enhanced kernel from scratch (i.e., a new kernel architecture), or re-construct current kernel architecture, or customize an OS kernel.
Build From Scratch. Designing a kernel as small and modular as possible has always been the primary goal of micro-kernel architectures (Accetta et al., 1986; Herder et al., 2006b; Herder et al., 2006a; Klein et al., 2010). Mach (Accetta et al., 1986), one of the earliest micro-kernels, implements a small but extensible system kernel, achieving a minimal trusted computing base (TCB). However, compared to a native monolithic UNIX kernel, Mach has serious performance issues. seL4 (Klein et al., 2010) is the world's first OS kernel that achieves a high degree of assurance through formal verification, and it is also significantly faster than other micro-kernels on the supported processors. Note that KASR works transparently for the guest OS and is thus complementary to this category of approaches.
Re-Construction. As device drivers are buggier and more vulnerable than the core kernel (Swift et al., 2002), Nooks (Swift et al., 2002), LXFI (Mao et al., 2011) and SUD (Boyd-Wickizer and Zeldovich, 2010) attempt to isolate device drivers to protect the core kernel from being compromised. A common drawback of these techniques is that they trust the core kernel. In contrast, Nested Kernel (Dautenhahn et al., 2015) does not trust the whole monolithic kernel; instead, it introduces a small isolated kernel, removing the monolithic kernel from the TCB. Alternatively, there are efforts to enforce access control within the monolithic kernel (Cook, 2013; Liakh et al., 2010; Smalley et al., 2001) or to restrict the use of system calls (Seccomp, 2005).
Customization. This category makes no changes to the kernel and can be further divided from two perspectives (i.e., kernel level and hypervisor level). Kernel customizations (Tartler et al., 2012; Kurmus et al., 2013; Heinloth, 2014) present automatic approaches for trimming kernel configurations adapted to specific use cases, so that the tailored configurations can be applied to re-compile the kernel source code, largely reducing the kernel attack surface. Similarly, Lock-in-Pop (Li et al., 2017) modifies and recompiles OS core libraries (i.e., glibc) to restrict an application's access to certain kernel code. In contrast, both Ktrim (Kurmus et al., 2011) and KRAZOR (Kurmus et al., 2014) insert a loadable kernel module that relies on kprobes to trim off unused kernel functions and prevent them from being executed.
In the virtualization environment, both SecVisor (Seshadri et al., 2007) and NICKLE (Riley et al., 2008) only protect the original kernel TCB and do nothing to reduce it. Going a step further, Face-Change (Gu et al., 2014) profiles the kernel code for each target application and uses the Virtual Machine Introspection (VMI) technique to detect process context switches, thus providing a minimized kernel TCB for each application. However, Face-Change has three disadvantages: (1) Its worst-case runtime overhead for httperf testing the Apache web server is %, whereas our worst overhead is % (see Table 5), making it impractical in the cloud environment. (2) Its design inherently does not support KASLR, an important kernel security feature that defends against code-reuse attacks and has been merged into the Linux kernel mainline since version 3.14; in contrast, KASR is friendly to this feature. (3) While multi-vCPU support is critical to system performance in the cloud environment, Face-Change only supports a single vCPU within a guest VM, whereas KASR allocates four vCPUs to the VM.
Commodity OS kernels provide a large number of features to satisfy various demands from different users, exposing a huge attack surface to remote and local attackers. In this paper, we have presented a reliable and practical approach, named KASR, which transparently reduces the attack surface of commodity OS kernels at runtime without relying on kernel source code. We implemented KASR on the Xen hypervisor and evaluated it on Ubuntu with an unmodified Linux kernel. The experimental results showed that KASR efficiently reduced the kernel attack surface by , in-memory CVE vulnerabilities by , and system calls by , and prevented and % of on-disk device drivers and their related CVEs, respectively, from running in all given use cases (i.e., SPECint, httperf, bonnie++, LAMP and NFS). In addition, KASR defeated all tested real-world rootkits and incurred low performance overhead (i.e., less than on average) on the whole system.
In the near future, our primary goals are to apply KASR to the kernel attack surface reduction of a Windows OS and to port KASR to the ARM platform.