General-purpose operating systems (OSes) have bloated in size over time. While this growth is driven by the need to support a diverse set of applications and usage scenarios, a significant amount of kernel code is typically not required for any given application. For example, Linux supports more than system calls and contains code for supporting different filesystems, network protocols, hardware drivers, etc., not all of which are needed for every application or deployment. While a minimal off-the-shelf install of Ubuntu 16.04 (running kernel 4.4.1) produces a kernel binary with an text section, many of the applications that we profiled (refer to § 6 for more details) use only about of it.
In addition to performance issues, unused kernel code (when mapped into an application’s process memory) represents an attack surface – especially if vulnerabilities exist in the unused parts of the kernel code. Such vulnerabilities, while less common than those in applications, are still found with regular frequency (over vulnerabilities have been found in the Linux Kernel since 2010; see https://nvd.nist.gov/). Since OS kernels are often considered to be a part of the trusted computing base (TCB) for many systems, this attack surface poses a significant risk. Today, there exist many known exploits that take advantage of kernel vulnerabilities (e.g., CVE-2017-16995, https://access.redhat.com/security/cve/cve-2017-16995).
Researchers have explored different techniques to reduce kernel code (e.g., [1, 2, 3, 4, 5]): (a) building application-specific unikernels, (b) tailoring kernels through build configuration editing [2, 3], or (c) providing specialized kernel views for each application [4, 5], among others. These approaches either need application-level changes, or need expert knowledge about (and manual intervention in) the selection of configurations; they also sacrifice the amount of kernel reduction achieved in order to support multiple applications [2, 3], incur significant performance overheads, or can only specialize kernels at a coarse page-level granularity. Note: “granularity”, in this context, refers to the sizes of the code chunks that are considered for elimination; some techniques eliminate kernel code at the page level while others do so at the basic block level. We show that our framework can evaluate systems with different levels of code reduction granularity, with the expected result that a finer granularity of code reduction allows a greater amount of kernel code to be eliminated (§ 6.1).
In this paper we introduce MultiK (§ 2), a flexible, automated framework for (a) orchestrating multiple kernels (full or specialized) at (b) a user-chosen code reduction granularity with near-zero runtime overheads – without the need for virtualization. In addition, MultiK does not introduce additional overhead when a finer granularity is chosen. To demonstrate this, we evaluate MultiK at three levels of granularity using two specialization tools (§ 4): (i) D-Kut, a run-time profiling-based tool that can tailor the Linux kernel at different granularities (we perform basic block- and function/symbol-granularity tailoring in this paper) and (ii) S-Kut, a syscall-based tool that tailors the kernel based on the system calls used by the application. MultiK is able to simultaneously run multiple applications that either (i) have their kernels specialized with any one of these tools or (ii) use the full, unmodified kernel. Note that the two approaches (D-Kut and S-Kut) are complementary – they can be used independently or in conjunction with other tools/frameworks (in fact, MultiK is not beholden to these tools; system designers may use their favorite profiling methods/tools, and the resulting kernel profiles can be easily integrated with the MultiK framework).
Figure 1 shows the high-level architecture of our MultiK framework. A vanilla kernel is profiled (Stage 1) using the application that is supposed to run on it. Note that in contrast to existing approaches, the system designer can choose the level of granularity and the code reduction techniques (Stage 2). At runtime (Stage 3), the MultiK framework launches the application on its specialized kernel (one application per specialized kernel, though this can be generalized as explained in § 7). The entire framework executes on a “base” (vanilla) kernel. The details of each of these stages/components are explained in the rest of the paper.
MultiK imposes very small, almost negligible, runtime overheads () regardless of the granularity chosen (§ 6). User programs run unmodified and natively. In fact, the entire framework is transparent to application developers. Our framework can also easily integrate with container-based systems such as Docker (https://www.docker.com; see our experiments in Table 9) – which are popular for deploying applications.
In summary, this paper makes the following contributions:
We design and implement MultiK (§ 3), a flexible framework for dispatching multiple kernels (specialized, unmodified or a combination of both) on the same machine without the need for virtualization.
We present an evaluation of attack surface reduction and performance overheads; the proposed framework shows virtually no runtime overhead while significantly reducing the attack surface, both in terms of the amount of eliminated code and the number of CVEs.
2 Background and Design Goals
Kernel specialization for attack surface reduction has been studied in multiple previous works. Those prior works all aimed to (i) identify unused parts of a bloated kernel and (ii) either remove the identified parts statically or disable them at runtime. Debloating kernels in this way removes software vulnerabilities present in unused kernel code, reducing the attack surface and thereby contributing to a system’s security. In the following, we discuss limitations of existing kernel attack surface reduction approaches and then set out our threat model and design goals. Finally, we give an overview of how MultiK achieves these design goals.
2.1 Limitations of existing approaches
Complexity in handling Kconfig. One of the prominent approaches to de-bloating the Linux kernel is to use the kernel build configuration system, kconfig. However, working with those configurations is a complex job, not only because there are over 300 feature groups and more than 20,000 individual configuration options to choose from, but also because most of these options have many dependencies that further contribute to the complexity of the system. While approaches to automate kconfig to tailor the Linux kernel have been proposed [2, 3], they often require (manual) maintenance of whitelist and blacklist configuration options — these lists quickly become irrelevant as applications and the kernel evolve.
Specializing a kernel for running many applications. When running multiple applications on a system, those approaches [2, 3] tailor the kernel for the combined set of applications. This negatively affects the attack surface reduction because the kernel must contain code for serving requests from all such applications, not just a single application. For instance, orthogonal applications (e.g., Apache and ImageMagick) only share about of the kernel code for their usage. Even similar applications (e.g., Apache and vsftpd) share about of the kernel and often leave as much as not shared. Hence, there is a high likelihood that unused code stays in the final kernel.
Use of virtualization and specialization granularity. To address some of these concerns, the Face-Change system proposed the customization of kernels for each application. However, their system is implemented with a hypervisor component, which makes applications and their kernels run in a virtual machine (with the associated performance penalties). Further, due to the use of a VMI-based approach for determining the appropriate kernel “view”, Face-Change incurs additional runtime overheads (about for Apache and around for Unixbench).
The KASR system eliminates the performance overhead (keeps it within ) but still requires applications to run in virtual machines. Moreover, its kernel specialization is limited to a coarse page-level granularity that can still allow unnecessary and potentially vulnerable kernel code to remain in the system. With MultiK we aim to overcome some of these limitations, as we discuss in § 2.3.
Application kernel profile completeness. A precondition for using any of these kernel reduction systems is an accurate and complete profile of the kernel facilities that an application depends on. If this profile is incomplete, then a benign, uncompromised application may try to execute kernel code that is not part of the profile. Unfortunately, executing code that is absent from the customized kernel due to an incomplete profile is indistinguishable from a compromised application trying to execute code that the original application cannot invoke. The need for a complete profile is a limitation of all kernel reduction systems [2, 3, 4, 5], including MultiK.
2.2 Threat Model
In MultiK, we assume the following for attackers:
Local, user-level execution is given without physical access to the machine. We assume our attackers are limited to launching local/remote attacks on the kernel from user privilege (i.e., ring 3) without having any physical access to the machine.
Firmware, BIOS and processor are trusted. Attacks on the kernel originating from components at levels lower than the kernel are out of scope for this paper.
Hardware devices attached to the machine are trusted. Similarly, DMA attacks and other attacks from hardware devices are out of scope.
This threat model covers general kernel exploit scenarios, i.e., launching an attack from user level to interfere with kernel-level execution. For instance, the following are examples of valid attacks on MultiK:
Privilege escalation attacks from user to kernel.
Control-flow hijacking attacks (arbitrary code execution) in kernel.
Information leaks (arbitrary read) from kernel to user.
Unauthorized kernel data overwriting (arbitrary write) originating from user.
2.3 Design Goals
The overarching goal of MultiK is to reduce the kernel attack surface of a system by generating a minimal kernel for running a set of user applications. To achieve this goal, we aim to build a system that can do this in an efficient (i.e., no runtime overhead) and transparent (i.e., no application changes) manner. We elaborate on the design goals of MultiK next.
Flexible and fine-grained attack surface reduction. The first design goal of MultiK is to only permit the kernel code identified as necessary in an application’s profile to run when the application is running. Some prior work customizes kernels at feature granularity, by minimizing the build-time configuration options required for supporting a specific application (e.g., if access to USB mass storage is not required, CONFIG_USB_STORAGE can be disabled to exclude the code that interfaces with USB storage devices). Such features can be big, and because the feature is the minimal granularity, the result often contains more code than required when only parts of a feature are used by the application. Similarly, KASR customizes kernels at page granularity by dynamically removing the executable permission for specific kernel code pages if the code in those pages is not being used by the application. However, this approach overestimates the required kernel code by including a whole page (4 KB) even if the application requires only part of it. In this regard, we design MultiK to support granularity down to the basic block level. We note that, in theory, MultiK can even support byte-level granularity; in practice, however, it would not make sense on current CPU architectures to permit a subset of the instructions in a basic block to execute without permitting all of them, so we only evaluate MultiK down to the basic block granularity. Further, prior works could often only customize at one specific granularity (e.g., page level for KASR, feature level for Tailor). Our goal is to design MultiK as a framework that can orchestrate kernels specialized at different granularities.
Fine-grained security domain for customization. Another design goal for MultiK is to customize the kernel at a fine-grained security domain level. A security domain can be a single process or an instance of a container.
Previous works customized the kernel for a whole system (i.e., all applications or an application stack running on the machine) or for a whole virtual machine. Such customized kernels include the union of the kernel code required by every application running on the machine. Customizing kernels for all applications together does not minimize the attack surface, as each application will have more kernel code mapped than it requires. As a result, MultiK aims to support specialized kernels for each process so that every specialized kernel contains only the code necessary for that process.
Efficiency. MultiK should minimize performance overhead. Specifically, we aim for near-zero (<1%, as shown in § 6.4) run-time performance overhead and no interference with application execution.
Transparency. MultiK should not require application source code, application code instrumentation or application changes. Further, customized kernels must be compatible with the target application and should be able to support the application’s regular use cases and workloads. To this end, we design MultiK to interact only with the kernel space to maximize compatibility and to support user-level applications transparently.
Flexibility in sharing system resources. Some applications work and interact with each other closely (e.g., Apache + git for GitLab and Apache + ImageMagick for MediaWiki). For such applications, MultiK should allow interaction through system resources (e.g., IPC, locks, etc.) to maximize the flexibility of application interactions. Employing virtual machine based solutions and address space isolation techniques (SFI, XFI, etc.) would reduce the flexibility and ease of such interactions. Ideally, MultiK should allow normal interactions among applications as if they were all running on a machine with a single kernel; running multiple applications in Docker containers on the same machine exemplifies this goal. In other words, we would like our design to provide customized kernels for each application (or Docker container) to strike a balance between the isolation of attack surfaces and flexibility (i.e., allowing normal application interactions).
2.4 Our Approach
MultiK: Multi-Kernel Orchestration. We present MultiK (Figure 2), a kernel runtime that deploys tailored kernel code for each application. MultiK customizes the kernel code before launching an application by consulting the kernel profile of the target application. Specifically, MultiK (a) makes a copy of the entire kernel text in a new physical memory region, (b) removes the unused parts from this kernel copy using the application’s kernel profile and (c) deploys the application with this tailored kernel. To transparently and efficiently deploy tailored kernel code for each application, the page table entries for the kernel text are updated. That is, MultiK alters virtual-to-physical page mappings to point the application to the new customized kernel. By doing this, we can automatically switch the kernel between applications (with near-zero performance overhead) because the page table is switched as part of a context switch.
MultiK can work with application kernel profiles produced at different granularities and using different profiling techniques. We use two specialization techniques (presented below) to demonstrate and evaluate MultiK.
D-Kut: Dynamic tracing. D-Kut (Figure 6) is a tool to identify the kernel code necessary for an application. We leverage dynamic tracing of kernel execution while the system is running the target application with its use cases. This approach has also been used in previous work (e.g., [2, 3, 4, 5]). The purpose of this step is to determine which parts of the kernel code to remove before application deployment; in particular, we remove any code that is not included in the dynamic execution traces. Our tracer collects and stores executed parts of the kernel at basic-block granularity (the smallest granularity) to maximize attack surface reduction. For tracing to be effective and for the customized kernel to not impact application execution, tracing requires a good set of use cases that are representative of an application’s kernel usage. Previous work has found that a modest number of test cases and tracing runs is sufficient to get good coverage of an application’s kernel profile. Here we assume that such test cases are supplied by application developers or deployers (e.g., from unit tests, benchmarks, workloads, etc.). Generating the workloads or test cases is not in scope for this work because MultiK does not limit how the kernel is profiled. Since different applications exercise different parts of the kernel, we carry out the tracing for each application and store the result as that application’s kernel profile. This collected kernel profile is used by MultiK when it generates a tailored kernel for an application at runtime. Note that MultiK works independently of the kernel customization technique and can use a kernel profile generated by any existing or future customization technique.
S-Kut: Syscall Analysis. S-Kut is an approach to expand the kernel profile and enhance reliability. More specifically, S-Kut tries to include rarely-triggered exception/error handling code while still being able to remove a large portion of the kernel. We use compiler features to analyze the kernel source code, which gives us precise information about what code should be included. For this approach to be effective, the deployer must have a list of the system calls that the application uses, obtained by either static (e.g., symbolic execution) or dynamic (e.g., strace) methods. MultiK does not address or limit how the list of system calls is generated.
Benefits. An immediate benefit of removing large portions of unused kernel code is the resulting attack surface reduction. Vulnerabilities (both known and unknown) present in the removed code will never exist in the resulting runtime kernels. Our evaluation results show that MultiK successfully removes large portions of kernel code (e.g., unused kernel functions, system calls and loadable kernel modules) from the application’s memory space. In many instances we achieve more than reduction in kernel code, i.e., .text (see Table 2 in § 6.1) and consequently eliminate many vulnerabilities. For instance, MultiK eliminates 19 out of 23 vulnerabilities (listed as CVEs) in Linux Kernel 4.4.1 when tailored for the Apache web server (see § 6.1). Although MultiK does not aim to detect and eliminate specific vulnerabilities, it reduces the system’s overall risk of compromise and could serve as one layer of a defense-in-depth strategy.
3 MultiK: Orchestrating Specialized Kernels
MultiK is a kernel mechanism to orchestrate customized kernels efficiently and transparently for each application at runtime. The goal of MultiK is to provide each application access to only kernel code that is customized to that application. By doing so, MultiK will reduce the kernel attack surface by removing a large part of unused kernel code from the virtual memory of the application processes.
Overview. Figure 3 illustrates how MultiK works with multiple applications running on the system. During the launch of each application, MultiK maps the customized kernel for the application into a new physical memory region. MultiK then updates the page table entries (for the kernel code) of that application process to redirect all further kernel execution to the customized kernel. This update completely removes the original (full) kernel from the application’s virtual memory view and switches the view to the customized one. Additionally, it guarantees that a CPU running the application can only work with the customized kernel code: process context switching in the operating system switches the page table automatically, so virtual addresses in the CPU will only refer to the customized view of the kernel code.
This approach is promising because once the page table is updated during application launch (a one-time cost), no runtime intervention is required to switch to the customized kernel, which significantly improves runtime performance. In contrast, prior work either relies on changing extended page table (EPT) permissions at runtime (KASR), which is limited to page-level permission removal, or relies on virtual machine introspection (VMI) to customize page table entries at each context switch (Face-Change). Both incur nontrivial runtime performance overheads.
Although our design is conceptually simple, MultiK must overcome a few challenges:
Sharing system resources (§ 3.2): Running multiple applications on the same system benefits from resource sharing among the applications. For instance, processes can communicate efficiently via the inter-process communication (IPC) mechanisms available in the system, such as locks, signals, pipes, message queues, shared memory, sockets, etc., and processes can also share hardware devices attached to the system. Such sharing is transparent by design when applications share the same kernel (i.e., code) and the same kernel memory space (i.e., kernel data). However, running a different customized kernel for each application could interfere with such transparent sharing. For instance, running a customized kernel in a virtual machine requires virtualizing hardware and shared resources among applications, which introduces compatibility and performance issues. Additionally, having a different memory layout for each application’s kernel would make data sharing even harder: a data structure prepared in one application could not be used in a different application and might even require transformation.
Handling hardware interrupts (§ 3.3): A hardware interrupt can occur at any time, regardless of the current customized kernel view. For instance, while running a customized kernel for a non-networking application (such as gzip), the CPU could encounter a hardware interrupt from the network interface card (NIC). In such a case, the execution will be redirected to the interrupt handler. If the customized kernel is not equipped to handle the interrupt, then the system could become unstable or important events could be missed. An easy workaround would be to include all hardware interrupt handlers in the customized kernel, as Face-Change does. However, such an approach would increase the kernel attack surface by adding unnecessary kernel code for an application that does not need access to the particular hardware.
3.2 Deploying Customized Kernels
When deploying customized kernels in MultiK, applications should only have a view of the kernel customized for them. At the same time, we would like to ensure that customized kernels running in one system can share system resources, e.g., the memory space, available IPC mechanisms, etc. To achieve these goals in a manner transparent to the application, we update all of an application’s page table entries corresponding to kernel text to point to the customized kernel, and we do this when launching the application (i.e., at execve()).
Deploying the customized kernel by updating kernel text page table entries gives us the following benefits. First, sharing system resources among multiple kernels becomes straightforward: each application shares the kernel data memory space because we only alter the page table entries for kernel text, so sharing system resources via shared memory, direct memory access, or shared kernel data structures is transparent. Second, loading and switching between kernels is simple and efficient. Loading a custom kernel only incurs the cost of mapping a new code section and updating page table entries, which is simple to do as part of the execve system call. Switching between kernels is equally simple because a typical context switch involves changing the current page table pointer (the CR3 register on x86) to that of the newly scheduled application, which automatically switches the kernel text too. In short, the goal of providing applications access to only customized kernel code is met in a transparent manner: MultiK always restricts the application’s kernel view to the one customized for it.
Figure 4 shows how MultiK switches the page table entries during application launch. To intercept the launching of an application, MultiK places a hook in the execve system call and maps the customized kernel in that hook. In particular, MultiK allocates a new physical memory region and copies the customized kernel code to that region. The customized kernel code could either be generated on the fly using a pre-learned kernel profile for the application, or could be pre-generated and stored to reduce application launch overhead. We follow the former approach since it only incurs a small one-time launch delay of . Next, MultiK updates the application’s page table entries for kernel text to point to this new physical memory region. After this point, the application can only access the customized kernel text because the original full kernel image never exists in its virtual memory space. Then, MultiK gives control back to the execve system call to handle the loading and linking of the user space, and finally lets execution continue in user space.
The deployment of the customized kernel and the application is finished at this stage because kernel switching for each application is handled automatically by the regular context switching mechanism, as previously discussed. For instance, Figure 3 shows a case of three applications (Apache, php-fpm and MySQL) running in a MultiK system. When a processor (CPU) runs Apache in userspace, because this application’s page table maps kernel code to the kernel customized for Apache, the CPU can interact only with that customized kernel and cannot access the original full kernel code. When a context switch happens, say to php-fpm, the regular context switching mechanism changes the value of the CR3 register to the page table of php-fpm. By design, the page table for php-fpm redirects all accesses to kernel code to the kernel customized for php-fpm. No matter how a context switch happens in the system, MultiK ensures that an application can only have a view of the kernel code that is customized for it, thus reducing the potential kernel attack surface available to any application.
Sharing system resources. We design MultiK to allow sharing of system resources such as hardware devices, kernel data structures (e.g., task_struct), inter-process communication (IPC) mechanisms (e.g., pipes, UNIX domain sockets, mutexes), shared memory, etc., to maximize a system’s flexibility and compatibility. More specifically, we design MultiK to work with existing container mechanisms (such as Docker), which requires the sharing of system resources while running customized kernels for each container. MultiK achieves this by (i) fully sharing kernel data among customized kernels and (ii) having the same memory layout for all customized kernel text.
First, kernel data and its memory space can be shared as is because MultiK does not modify either. In particular, any resource sharing that does not involve kernel text (such as passing kernel data structures or device access via DMA) is not affected by the deployment of MultiK. However, resource sharing that involves kernel text – for instance, a data structure that refers to kernel functions through function pointers (e.g., file operations) – would break if any customized kernel had a different memory layout. To resolve this issue, MultiK requires customized kernels to have the same memory layout as the original kernel. That is, a customized kernel is mapped at exactly the same virtual address space as the original kernel, but the customized kernel text (code) masks the parts of the kernel not required by that specific application. Although this approach creates some memory overhead ( for each customized kernel), it allows MultiK to share function pointers among multiple customized kernels. As we show (§ 6.4.1), this memory overhead does not limit how many applications we can run in parallel in practice.
3.3 Handling Hardware Interrupts
To handle hardware interrupts, we exploit deferred interrupt handling to keep the customized kernels small. Hardware interrupts are problematic for customized kernels if the corresponding handler does not exist in the customized kernel. Because hardware interrupts can be delivered at any time, even an application that does not utilize the associated hardware could receive hardware interrupt requests. Missing hardware interrupt handlers in customized kernels cause such interrupt requests to fail, and the failure to handle them can make the system unstable. One way to work around this issue is to include all interrupt handling routines in all customized kernels; however, such an approach would unnecessarily increase the potential attack surface of the customized kernels.
Figure 5 illustrates how we handle such interrupts in our framework. MultiK includes, by whitelisting, only the top-half hardware interrupt handlers that are compiled into the kernel (in all customized kernels). The top-half handlers are small because their job is to transform a hardware interrupt request into a software interrupt request (softirq). Consequently, our customized kernels only deal with the top half of any hardware interrupt and delegate the actual handling to the kernel threads run by ksoftirqd. Hence, when hardware interrupts (e.g., timer, network or block device interrupts) arrive, our customized kernel runs the top-half handler, which stores a softirq vector to delegate the interrupt handling to ksoftirqd. The bottom half of the interrupt is then handled by ksoftirqd when it runs.
Kernel0. To handle the bottom half of interrupts, we run ksoftirqd on a general-purpose regular kernel (one that includes all parts that the system requires), which we refer to as Kernel0. An example of Kernel0 is a kernel from a distribution’s package without any customization, such as linux-image-4.15.0-39-generic in Ubuntu 18.04 LTS. This kernel not only handles hardware interrupts but also takes care of system-wide events such as booting, shutdown, etc. Kernel0 also serves as the baseline template for kernel customization because it contains the entire kernel code required by the system. In MultiK, customized application kernels are generated by cloning the kernel .text region of Kernel0 and masking the parts of the kernel that are not needed by the application, based on the application’s kernel profile.
4 Generating Application Kernel Profiles
Application kernel profiles identify which parts of the kernel code are used by an application and which parts are not. We use “granularity” as a unit to quantify the precision of the specialization profiles. The following granularity levels are used in the paper: (i) Basic block level: a basic block is a set of instructions that are always executed as a unit without any jumps or branches in between. The CPU either executes all the instructions in the basic block or executes none, making this one of the most precise levels of specialization. (ii) Symbol level: kernel interfaces are exported as symbols using EXPORT_SYMBOL(). At this level of granularity, all the instructions that make up an interface are included in the profile even if only certain code paths of the interface are actually used. (iii) System call level: given that syscalls are the main interface through which user-space applications interact with the kernel, we can enumerate the syscalls that are used by the application and eliminate those that are not [10, 11]. (iv) Page level: when a binary is loaded into memory, the smallest unit of memory that can be referenced is a page (4 KB or more). Tracing methods that track instructions being executed in memory can only do so at a page level. This results in an entire page of instructions being included in the profile even if only a single instruction was run from that page. (v) Feature level: the kernel configuration system, kconfig, determines the kernel features that are built into the kernel based on the state of certain configuration options. With this, code can be included or eliminated only at a feature level, producing a much coarser profile, and was used in . MultiK is flexible enough to use application kernel profiles created at different profiling granularities in order to deploy customized kernels.
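The effect of granularity on profile size can be made concrete with a small sketch. The addresses and block sizes below are illustrative, not taken from a real kernel trace; it simply contrasts how many bytes of kernel text are retained at block-level versus page-level granularity for the same execution trace.

```python
# Sketch: how profiling granularity affects how much kernel text is kept.
# Addresses and sizes are hypothetical, not from a real kernel trace.

PAGE_SIZE = 4096

def kept_bytes_block(trace_blocks):
    """Block granularity: keep exactly the traced basic blocks."""
    return sum(size for _addr, size in trace_blocks)

def kept_bytes_page(trace_blocks):
    """Page granularity: keep every 4 KB page touched by a traced block."""
    pages = set()
    for addr, size in trace_blocks:
        for page in range(addr // PAGE_SIZE, (addr + size - 1) // PAGE_SIZE + 1):
            pages.add(page)
    return len(pages) * PAGE_SIZE

# Two small basic blocks that happen to sit on different pages.
trace = [(0x1000, 32), (0x2F00, 64)]
print(kept_bytes_block(trace))  # 96 bytes kept
print(kept_bytes_page(trace))   # 8192 bytes kept (two full pages)
```

Even this toy trace shows the page-level profile retaining two orders of magnitude more text than the block-level one, which is the intuition behind the finer-granularity results in § 6.1.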
We present and evaluate one trace-based (D-Kut) and one syscall-based approach (S-Kut) to demonstrate application kernel profile generation.
4.1 D-Kut: Dynamic Instruction Tracing
D-Kut (see Figure 6) profiles the kernel at a basic-block-level granularity. This is in contrast to KASR, a recently proposed approach for specializing the Linux kernel binary that profiles the kernel usage of applications at page-level ( default page size) granularity. When profiled at page-level granularity, unused instructions present in the neighborhood of used instructions within the same page also get included in the final in-memory kernel binary.
Tracing setup. We use QEMU, a full-system emulator, to run the user application along with the vanilla kernel we need to trace. With QEMU we trace every instruction that the system executes by recording the program counter (pc) register via the exec_tb trace event. This captures the addresses of the instructions that are executed.
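The first post-processing step is extracting executed PC values from the raw trace. A minimal sketch is shown below; the line format is an assumption for illustration, since QEMU's actual trace output format depends on the trace backend configured.

```python
import re

# Assumed line format for illustration; real QEMU trace output may differ.
PC_RE = re.compile(r"pc[=:](0x[0-9a-fA-F]+)")

def pcs_from_trace(lines):
    """Extract the set of executed program-counter values from trace lines."""
    pcs = set()
    for line in lines:
        m = PC_RE.search(line)
        if m:
            pcs.add(int(m.group(1), 16))
    return pcs

sample = [
    "exec_tb tb:0x7f00 pc=0xffffffff81000100",
    "exec_tb tb:0x7f08 pc=0xffffffff81000200",
]
print(sorted(hex(p) for p in pcs_from_trace(sample)))
```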
Segmentation. The trace from above includes instructions from (1) system boot, (2) application runs, and (3) system shutdown. The instructions from (1) and (3) are clearly not required by the application. To separate these out from the trace, we use the mmap syscall to taint/touch a specific memory address right before the application is run and right after the application is terminated. This marks the start and end of our application and helps filter out the boot/shutdown instructions from the traces.
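The segmentation step can be sketched as a filter over an event stream; the marker addresses below are hypothetical placeholders for the addresses that the wrapper mmap()s around the application run.

```python
# Sketch of D-Kut's segmentation step: keep only the trace entries recorded
# between the start and end markers. Marker addresses are hypothetical.
START_MARKER = 0x00DEAD00
END_MARKER   = 0x00BEEF00

def segment(events):
    """events: list of (kind, value); kind is 'mmap' or 'exec'."""
    inside = False
    app_pcs = []
    for kind, value in events:
        if kind == "mmap" and value == START_MARKER:
            inside = True          # application is about to run
        elif kind == "mmap" and value == END_MARKER:
            inside = False         # application has terminated
        elif kind == "exec" and inside:
            app_pcs.append(value)  # instruction executed during the app run
    return app_pcs

events = [("exec", 1), ("mmap", START_MARKER), ("exec", 2),
          ("exec", 3), ("mmap", END_MARKER), ("exec", 4)]
print(segment(events))  # [2, 3]: boot (1) and shutdown (4) are dropped
```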
Background processes / daemons. During a typical boot, the kernel starts a number of background processes/daemons that provide useful services. These processes run alongside the application process and can add noise to the traces that we obtain. To filter these out, we use a custom init script that only starts the daemons required by the application we are interested in. This eliminates noise from unnecessary background processes in the traces.
Invariants in execution. Kernel behavior is not deterministic when running an application. Execution paths in the kernel can change because not all inputs are invariant (e.g., network conditions, CPU frequencies, and time). In addition, QEMU occasionally drops trace events when its buffer is full. Therefore, to capture all the possible code paths the kernel might take for a given application, we repeat the tracing process multiple times. In our experiments we observed that it takes between 10 and 15 runs before the code paths stabilize and the trace can be used with confidence, i.e., no new code paths are added for a given set of inputs. This need for multiple trace runs has also been reported in prior work [3, 5].
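The stabilization check amounts to taking the union of per-run traces until a fixpoint is reached. A minimal sketch:

```python
def merge_until_stable(run_traces):
    """Union per-run PC sets; report the last run that added new PCs."""
    profile, stable_at = set(), 0
    for i, trace in enumerate(run_traces, start=1):
        before = len(profile)
        profile |= set(trace)
        if len(profile) > before:
            stable_at = i  # this run still contributed new code paths
    return profile, stable_at

# Toy example: runs 3 and 4 add nothing, so the profile is stable after run 2.
runs = [{1, 2}, {2, 3}, {1, 3}, {1, 2, 3}]
profile, stable_at = merge_until_stable(runs)
print(sorted(profile), stable_at)  # [1, 2, 3] 2
```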
Granularity. QEMU produces a trace with basic-block-level granularity. We can post-process basic-block-level traces to obtain a symbol-level trace by including the entire symbol corresponding to each block. A symbol-level trace requires fewer runs because the post-processing makes the trace more inclusive. In our experiments we observe that even at symbol granularity, the trace obtained is much smaller than the one produced by the page-level tracing used by KASR (see Table 2).
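The block-to-symbol expansion can be sketched as a lookup of each traced PC in a symbol table; the table entries below are hypothetical stand-ins for what would come from System.map or the kernel's symbol information.

```python
import bisect

# Hypothetical symbol table: (start_address, size, name), sorted by start.
SYMBOLS = [
    (0x1000, 0x200, "vfs_read"),
    (0x1200, 0x100, "vfs_write"),
    (0x1300, 0x400, "do_sys_open"),
]
STARTS = [s for s, _, _ in SYMBOLS]

def symbols_for_pcs(pcs):
    """Map each traced PC to its enclosing symbol; keep whole symbols."""
    hit = set()
    for pc in pcs:
        i = bisect.bisect_right(STARTS, pc) - 1  # last symbol starting <= pc
        if i >= 0:
            start, size, name = SYMBOLS[i]
            if start <= pc < start + size:
                hit.add(name)
    return hit

# Three traced PCs collapse into two whole symbols.
print(sorted(symbols_for_pcs({0x1010, 0x1050, 0x1250})))
```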
Compatibility with other specialization techniques. D-Kut only requires the final bootable kernel binary to produce a trace. Thus a kernel produced by any other specialization technique can be further specialized with our tool; D-Kut can complement and take advantage of other profiling techniques.
4.2 S-Kut: Syscall Analysis
S-Kut is a syscall-based analysis technique that increases the reliability of the application kernel profile. For every system call made by an application, we track all functions that the call can possibly use. This list of functions is built by analyzing the register transfer language (RTL) dumped by GCC when compiling the kernel with -fdump-rtl-expand. We obtain an approximate list of the system calls that an application issues by using strace to intercept and record all system calls made in that execution context. This approach does not guarantee a complete list of system calls if some system calls are not triggered during tracing. We then expand the application profile by combining the functions called by possible system calls with the original profile. In our experiments, this technique increases the coverage of the profile generated by D-Kut at symbol granularity. The results in § 6.1 show that more than of the kernel can be reduced and of the CVEs are mitigated when including functions used by system calls in the profile.
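The expansion step can be sketched as a reachability computation over a call graph; the edges below are hypothetical stand-ins for what would be recovered from the GCC RTL dumps.

```python
from collections import deque

# Hypothetical call edges (caller -> callees), as would be recovered from
# GCC's -fdump-rtl-expand output.
CALL_GRAPH = {
    "sys_read":  ["vfs_read"],
    "vfs_read":  ["rw_verify_area", "__vfs_read"],
    "sys_write": ["vfs_write"],
}

def reachable(entry_points):
    """All functions transitively callable from the given syscall entries."""
    seen, queue = set(entry_points), deque(entry_points)
    while queue:
        for callee in CALL_GRAPH.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# strace showed the app only issuing read(); expand the D-Kut profile with
# everything reachable from that syscall's entry function.
profile = {"some_traced_fn"} | reachable({"sys_read"})
print(sorted(profile))
```

Functions reachable only from untraced syscalls (sys_write here) stay out of the profile and remain masked.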
| Component | Lines of code |
|---|---|
| D-Kut – QEMU | 3 LoC of C |
| D-Kut – Scripts | 272 LoC of Python |
| S-Kut – Make | 1 LoC of Makefile |
| S-Kut – Scripts | 26 LoC of Python |
| S-Kut – Scripts | 96 LoC of Golang |
| MultiK – Linux Kernel 4.4.1 | 477 LoC of C |
We implement MultiK on Linux 4.4.1 (we choose an older kernel to demonstrate MultiK’s capability in reducing vulnerabilities by listing affected CVEs; note that MultiK can be applied to the newest kernels as well to remove potential vulnerabilities) running on an Intel Core i7-8086K (4.00 GHz) CPU. We use the procfs interface to provide an application identifier and the corresponding application kernel profile for each application. We generate application kernel profiles in an offline step using the D-Kut and S-Kut tools (see § 4). We hook the execve syscall so that, when it is invoked to launch an application that has a corresponding application kernel profile, we generate and deploy a specialized kernel as described in § 3. Table 1 lists the total amount of code for MultiK: we built MultiK with lines of C, D-Kut with lines of code, and S-Kut with lines of code.
Our customized kernels are masked in a special manner – by overwriting them with the special one-byte instruction 0xcc (instruction int3). This acts as a fall-back mechanism for detecting (and reporting) when an application tries to execute code not available in its customized kernel. We choose int3 not only because this instruction raises a software interrupt (that our kernel can intercept to thwart unexpected execution) but also because it is a one-byte instruction whose semantics cannot be changed by arbitrary code execution resulting from attacks. Note that other techniques overwrite the kernel binary with a sequence of two-byte UD2 instructions, 0x0f 0x0b. If execution lands in the middle of such a sequence, the bytes are read as 0x0b 0x0f, which does not decode to UD2 and would not raise an interrupt. With the int3 instruction, we can detect unexpected execution in a more reliable fashion.
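The alignment argument can be illustrated with a small sketch. This is pure byte-pattern checking, not actual instruction decoding: in a region filled with 0xcc, any starting offset yields int3, whereas in a region filled with the UD2 pattern, odd offsets see 0x0b 0x0f, which is not UD2.

```python
# Why a one-byte filler is safer than a two-byte one for masking code.
INT3, UD2 = b"\xcc", b"\x0f\x0b"

def every_offset_traps(region, trap):
    """Does decoding from every byte offset yield the trap pattern?"""
    return all(region[i:i + len(trap)] == trap
               for i in range(0, len(region) - len(trap) + 1))

int3_fill = INT3 * 8
ud2_fill = UD2 * 4
print(every_offset_traps(int3_fill, INT3))  # True: traps at any offset
print(every_offset_traps(ud2_fill, UD2))   # False: odd offsets misalign
```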
For instance, consider a simple case where an application’s requested kernel execution hits a masked function or masked basic blocks (e.g., a branch not taken or a system call not used during tracing). In such cases, execution immediately hits an int3 instruction, and the kernel knows that the code pointed to by the instruction pointer for the software interrupt is missing. At that point, one can choose to kill the process or follow other strategies, depending on the policies picked by the system designer.
We evaluated MultiK to answer the following questions:
Evaluation setup. We specialize application kernels at three different granularities: (a) block and (b) symbol (both with D-Kut), and (c) syscall (with S-Kut) (§ 4). We refer to block, symbol, and syscall granularity as B, S, and SC, respectively, in the rest of the paper. We performed all experiments in a KVM virtual machine with 2 vcpus and 8 GB RAM running on an Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz. Note that we use KVM for the convenience of testing; MultiK does not require a virtual machine to run specialized kernels. For specializing kernels to applications, we choose Apache, STREAM, perf, Redis, and Gzip.
6.1 Attack Surface Reduction
| CVE | Description | Type | Mitigation (V/P/E) |
|---|---|---|---|
| CVE-2018-11508 | An information leak in compat_get_timex() | Leak | V V V V V V V V V |
| CVE-2018-10881 | An out-of-bounds access in ext4_get_group_info() | DoS | V V V V V V V V V |
| CVE-2018-10880 | A stack-out-of-bounds write in ext4_update_inline_data() | DoS | V V V V V V V V V |
| CVE-2018-10879 | A use-after-free bug in ext4_xattr_set_entry() | DoS | V V V V V V V V V |
| CVE-2018-10675 | A use-after-free bug in do_get_mempolicy() | DoS | V V V V V V V V V |
| CVE-2018-7480 | A double-free bug in blkcg_init_queue() | DoS | V V V V V V V V V |
| CVE-2018-6927 | An integer overflow bug in futex_requeue() | DoS | E E E V V E V V V |
| CVE-2018-1120 | A flaw in proc_pid_cmdline_read() | DoS | V V V V V V V V V |
| CVE-2017-18270 | A flaw in key_alloc() | DoS | V V E V V E V V E |
| CVE-2017-18255 | An integer overflow in perf_cpu_time_max_percent_handler() | DoS | V V V V V V V V V |
| CVE-2017-18208 | A flaw in madvise_willneed() | DoS | V V V V V V V V V |
| CVE-2017-18203 | A race condition between dm_get_from_kobject() and __dm_destroy() | DoS | V V V V V V V V V |
| CVE-2017-18174 | A double free in pinctrl_unregister() called by amd_gpio_remove() | DoS | V V V V V V V V V |
| CVE-2017-18079 | A null pointer dereference in i8042_interrupt(), i8042_start(), and i8042_stop() | DoS | V V V V V V V V V |
| CVE-2017-17807 | Lack of permission check in request_key_and_link() and construct_get_dest_keyring() | Priv | V V E V V E V V E |
| CVE-2017-17806 | Lack of validation in hmac_create() and shash_no_setkey() | DoS | V V V V V V V V V |
| CVE-2017-17053 | A use-after-free bug in init_new_context() | DoS | E E E V V E E E E |
| CVE-2017-17052 | A use-after-free bug in mm_init() | DoS | E E E V V E E E E |
| CVE-2017-15129 | A use-after-free bug in get_net_ns_by_id() | DoS | V V V V V V V V V |
| CVE-2017-2618 | Lack of input check in selinux_setprocattr() | DoS | V V V V V V V V V |
| CVE-2016-0723 | A use-after-free in tty_ioctl() | DoS | E E E E E E E E E |
| CVE-2015-8709 | A flaw in ptrace_has_cap() and ptrace_may_access() | Priv | P P P V V V V V V |
| CVE-2015-5327 | An out-of-bounds access in x509_decode_time() | DoS | V V V V V V V V V |
We first tackle the question of how much kernel attack surface MultiK can reduce. We do this by measuring how much kernel text (code) is reduced by specialization. Table 2 shows the percentage of kernel text reduction w.r.t. the vanilla kernel, and B, S, and SC indicate the granularity of specialization.
The first row shows the percentage of reduced kernel code. The reduction for each application depends on the application’s kernel usage. With block-level granularity, MultiK can reduce of the kernel text for Apache (I/O intensive) and of the kernel text for STREAM.
More than of the kernel text can still be removed even when the granularity is as coarse as symbol or syscall. Our work outperforms KASR and Tailor, which can reduce and of kernel code, respectively. The second and third rows present the fully and partially removed kernel functions, respectively. Because a block is smaller than a function body, we can remove parts of functions. With block granularity, more than of the text is excluded.
We observed that the implementation of all system calls in the Linux kernel (we count all functions called by system call entry functions) only takes up of the text. Restricting access to system calls (e.g., AppArmor, seccomp-bpf) would (a) not remove any code (leaving the kernels vulnerable) and (b) have a lesser impact than the techniques discussed in this paper. This shows the limitations of focusing only on the whitelisting of system calls.
6.2 CVE Case-Study
We analyze all CVEs present in Linux 4.4.1 by looking at the patch for each one and detecting which functions are vulnerable. The kernel is compiled with configuration 4.4.0-87-generic. A number of vulnerabilities are excluded because they are not present in the kernel binary since they target loadable kernel modules and we do not load modules. We find that 23 of 72 CVEs exist in the kernel binary. A CVE might involve multiple functions.
We separate the results into three categories to indicate the different levels of mitigation: (1) V refers to the case where all functions associated with a vulnerability are removed, (2) P refers to the case where some functions associated with a vulnerability are removed, and (3) E refers to the case where no functions associated with a vulnerability are removed. Table 3 shows the result for each of the CVEs. On average, (out of ) CVEs are mitigated for both block and symbol granularity, and out of CVEs are mitigated for the system-call granularity.
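The V/P/E classification can be expressed as a small set computation over the functions a CVE depends on and the functions removed by specialization (function names below are taken from the CVEs discussed in this section):

```python
def mitigation_class(vuln_fns, removed_fns):
    """Classify per the paper: V = all removed, P = some removed, E = none."""
    removed = vuln_fns & removed_fns
    if removed == vuln_fns:
        return "V"
    return "P" if removed else "E"

removed = {"ptrace_has_cap"}  # functions masked out by specialization
print(mitigation_class({"ptrace_has_cap", "ptrace_may_access"}, removed))  # P
print(mitigation_class({"mm_init"}, removed))                              # E
print(mitigation_class({"ptrace_has_cap"}, removed))                       # V
```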
If a CVE is located in a popular code path, it is more likely that an application exercises it during the offline profiling phase. Therefore, such CVEs (e.g., CVE-2017-17053 and CVE-2016-0723) are likely to remain for all applications. Conversely, if a CVE (e.g., CVE-2018-11508 and CVE-2017-15129) is on an unpopular code path, there is a good chance of removing it. § 6.2 shows an example of a CVE whose vulnerability we remove by partially removing one of the functions required to form an exploit chain. The vulnerability allows attackers to create a malicious user namespace and wait for a root process to enter ptrace_has_cap() to gain root privileges. To exploit the CVE, attackers need two functions: ptrace_has_cap() and ptrace_may_access(). Hence, removing one of these functions mitigates this CVE.
```c
static int ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
{
        /* ... */
        return has_ns_capability(current, ns, CAP_SYS_PTRACE);
}

static int ptrace_may_access(struct task_struct *task, unsigned int mode)
{
        /* ... */
        if (ptrace_has_cap(tcred->user_ns, mode))
                goto ok;
        /* ... */
}
```
B (block) granularity. We elaborate on each category, using CVEs for the Apache web server as examples at block granularity.
V, the vulnerability is entirely removed. CVE-2017-17807 is a vulnerability resulting from an omitted access-control check when adding a key to the current task’s default request-keyring via the system call request_key(). Two functions, construct_key_and_link() and construct_get_dest_keyring(), are required to realize the vulnerability. Since both are eliminated, attackers have no way to form the chain needed to exploit this CVE.
P, the vulnerability chain is partially removed. CVE-2015-8709, depicted in § 6.2, is a flaw that can allow a privilege escalation. The exploit chain requires invoking both ptrace_has_cap() and ptrace_may_access(). Because our kernel specialization partially removes one of these functions (ptrace_has_cap()), attackers can no longer exploit this CVE.
E, the vulnerability chain remains. CVE-2017-17052 allows attackers to achieve a use-after-free because the function mm_init() does not null out the ->exe_file member of a new process’s mm_struct. Attackers can exploit this because none of the functions involved have been removed.
6.3 Offline Profile Generation Performance
We trace applications for 10 iterations at symbol granularity and 15 at block granularity because we observed that the workload tended to be stable after this many iterations. The profiling time depends on how long the workload runs. Table 4 shows the time needed to profile each of the applications. If the workload (STREAM in Table 4) only takes a short amount of time to finish, the profiling is quick, and vice versa (Apache and perf in Table 4).
6.4 Performance Evaluation
In this section we evaluate (i) application performance with the Apache web server benchmark, (ii) context switches with perf (we run the command: perf bench sched all), and (iii) memory bandwidth with STREAM. All experiments were performed in a KVM virtual machine with 2 vcpus and 8 GB RAM running on an Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz. Positive % indicates improvements, and negative % indicates degradation of application performance. We evaluate performance with application benchmarks: the Apache web server; STREAM, a memory microbenchmark; and perf, a scheduling microbenchmark. Due to page limits, we place some of the benchmark results in Appendix § A.2, including Redis and Gzip with D-Kut at symbol and block granularity, since the results are similar.
Apache Web Server. We ran the Apache web server, version 2.4.25, on a specialized kernel and the client program on Kernel0. The client program, Apache benchmark, sends 100,000 requests in total from 100 concurrent clients. Table 5 shows that the Apache web server running on specialized kernels (regardless of trace granularity) has very similar performance, in terms of requests served per second, compared to running on a vanilla kernel; the performance is within of the baseline.
STREAM. We evaluate memory performance with STREAM, version 5.10. STREAM has four metrics – copy, scale, add, and triad – which refer to the corresponding vector operations: copy is a[i] = b[i]; scale is a[i] = const*b[i]; add is a[i] = b[i]+c[i]; triad is a[i] = b[i]+const*c[i]. Copy and scale take two memory accesses, while add and triad take three. Table 6 shows that STREAM running on specialized kernels, regardless of granularity, performs close to the baseline; the difference is less than for all operations.
Perf. We evaluate context-switch overheads with the perf scheduling benchmark, which is composed of a messaging microbenchmark and a piping microbenchmark. The messaging microbenchmark has 200 senders that dispatch messages through sockets to 200 receivers concurrently. The piping microbenchmark has 1 sender and 1 receiver process executing 1,000,000 pipe operations. Table 7 shows that perf running on specialized kernels (regardless of trace granularity) takes the same amount of time to complete the message and pipe tasks as on the vanilla (unmodified) kernel. The performance difference for both tasks is less than .
6.4.1 Memory Effect
Every specialized kernel takes up approximately of additional memory space. Kernel memory is not swappable; therefore, system-wide memory pressure increases as we create new specialized kernels. We evaluate this memory pressure by measuring memory bandwidth with the STREAM benchmark. Under higher memory pressure, the system swaps memory pages in and out more frequently, resulting in lower memory bandwidth. We run STREAM together with multiple specialized kernels on a KVM virtual machine with 2 vcpus and 8 GB RAM running on an Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz. Swappiness is a kernel parameter ranging from 0 to 100 that controls the degree of swapping; the higher the value, the more frequently virtual memory swaps. We conduct the experiment with the default value, 60. In order to exclude the factor of the CPU (i.e., busy CPUs cause slower memory bandwidth), we run the command sleep on the specialized kernels. Figure 7 shows that memory operations start to slow down due to more frequent memory swaps when there are more than 750 coexisting kernels. More specifically, the add and triad operations take twice as many memory accesses as the others, so they take twice as long to finish. We also evaluated the memory effect on a machine with 4 GB RAM running on an Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz; in this case the impact is observed after coexisting specialized kernels (refer to § A.3). The results indicate that memory overhead is not an issue for most practical deployments.
Kernel Specialization for Containers. MultiK’s method for dispatching multiple kernels can seamlessly assign and run a specialized kernel for each security domain, i.e., a container such as Docker. This is done by sharing one specialized kernel among multiple applications. In particular, we can configure MultiK to have one specialized kernel for a set of applications (running in a container) by profiling kernel traces while running all such applications together. In our experiments, we integrated MultiK with Docker containers trivially and show that MultiK affects Docker containers’ performance by less than (§ A.1). This approach can also reduce memory usage by using fewer kernels. It aligns well with cloud deployment patterns where containers from different organizations may share the same hardware.
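Building a shared per-container profile reduces, in essence, to taking the union of the per-application profiles; the function names below are illustrative placeholders, not an actual profile.

```python
def container_profile(app_profiles):
    """One shared kernel profile for all apps in a container: the union."""
    merged = set()
    for profile in app_profiles:
        merged |= profile
    return merged

# Hypothetical per-application symbol-level profiles.
apache = {"tcp_sendmsg", "vfs_read"}
redis  = {"tcp_sendmsg", "epoll_wait"}
print(sorted(container_profile([apache, redis])))
# ['epoll_wait', 'tcp_sendmsg', 'vfs_read']
```

The union keeps every symbol any co-located application needs, trading some per-application reduction for a single shared kernel and lower memory use.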
Kernel Page-Table Isolation.Kernel Page-Table IsolationKernel Page-Table Isolation. Kernel page-table isolation (KPTI)  is a feature that mitigates the Meltdown  security vulnerability. KPTI uses two page tables for user and kernel modes respectively so that user-mode applications can only access a limited range of kernel memory space such as entry/exit functions and interrupt descriptors. Although KPTI hides the kernel from the user space, it does not mitigate the vulnerabilities because attackers can still call system calls with carefully crafted parameters to enter the kernel mode and exploit the system.
Kernel CFI. MultiK is orthogonal to other kernel attack surface reduction techniques, such as control-flow integrity (CFI), and can work with them concurrently. Kernel CFI can indirectly achieve attack surface reduction by restricting the call/jump targets available at a large number of control-flow transfer points (that would otherwise serve as attack surfaces). Such an approach complements MultiK: MultiK can run a kernel reinforced with KCFI and further trim it to achieve an overall better attack surface reduction. MultiK can be seamlessly combined with such techniques because it incurs extremely low performance overheads.
AppArmor and SELinux. AppArmor and SELinux (https://www.nsa.gov/What-We-Do/Research/SELinux) are Linux security modules that aim to achieve mandatory access control. In particular, after identifying the necessary resources and capabilities, both approaches apply a profile to enable/disable them via white/blacklisting. The drawback of AppArmor and SELinux (compared to MultiK) is that they only remove access to syscalls that are an entry point to a certain code path by limiting POSIX capabilities. The code is still available to an attacker who can bypass this protection (https://www.cvedetails.com/cve/CVE-2017-6507). In MultiK we explicitly remove code paths that are not required by the application, thus preventing the attacker from accessing them altogether even if other security measures are compromised. In addition, MultiK can further reduce code within a system call if we apply a specialization granularity smaller than a symbol, e.g., basic-block granularity.
Arbitrary Kernel Execution. If an attacker is able to execute arbitrary code within the kernel space, e.g., by inserting kernel modules, then the attacker can modify the page tables for applications and bypass the kernel view imposed by MultiK. We prohibit kernel module insertion for specialized kernels. If a kernel module is needed, it can be inserted from Kernel0, and it is then visible to all specialized kernels.
Kernel Data-Only Attacks. As the underlying kernel data structures are shared among all the multiplexed customized kernels, MultiK alone cannot prevent kernel data corruption attacks (e.g., [18, 19, 20, 21]), although it can make them harder to exploit. However, it can be integrated with existing kernel data protection mechanisms to improve the overall security of the system.
8 Related Work
In this section we discuss and compare different kernel reduction approaches. They can be broadly classified into: (i) configuration based specialization, (ii) compiler based specialization, (iii) binary specialization, or (iv) kernel re-architecture.
Configuration based specialization. The Linux kernel provides the kconfig mechanism for configuring the kernel. However, the complexity of kconfig makes it hard to tailor a kernel configuration for a given application. Kurmus et al. tried to automate kconfig-based kernel customization by obtaining a runtime kernel trace for a target application, mapping the trace back to source lines, and using those source lines along with configuration dependencies to arrive at an optimal configuration. While they were able to achieve a 50-80% reduction in kernel size, their approach still requires some manual effort for creating predefined blacklists and whitelists of configurations. Further, when multiple applications need to be run, their approach creates one customized kernel for all of the applications together, thereby limiting the effectiveness of the attack surface reduction achieved. More recently, LightVM tries to address bloat in the kernel by implementing a tool called TinyX that starts from Linux’s tinyconfig and iteratively adds options from a bigger set of configuration options. This involves maintaining a manually produced white-list or black-list. In MultiK, everything is automated and requires no manual intervention.
The Linux kernel tinification project focuses on reducing the size of the kernel by making every core kernel feature configurable, thereby allowing developers to build just the minimum set of features required to run on embedded devices. Although configuration-based specialization can be effective, it remains a manual and tedious process. MultiK completely circumvents the configuration process and specializes the kernel binary directly. We are also able to produce a finer-grained reduction compared to the coarse-grained reduction produced by configuration specialization.
Compiler based specialization. Modern compilers are much better at code optimization than humans. A series of LWN.net articles [25, 26, 27, 28] discusses various cutting-edge compiler and link-time techniques being developed in the Linux community that can eliminate a significant amount of dead code and perform various other code optimizations. Most of the work in this area is experimental, does not yet produce a working kernel, and exists as out-of-tree patches. The main challenge in applying these techniques to the Linux kernel arises from the complexities of the kernel itself. Handwritten assembly, non-contiguous layout of functions, etc. do not make the kernel a good candidate for compiler-based optimization/specialization as is; it requires manually going through pieces of code and making careful changes without causing unexpected side effects. The LLVM community has in recent years produced a suite of advanced compiler tools. The Linux community hasn’t been able to take full advantage of these due to its tight coupling with the gcc toolchain, which makes it hard to use other compilers such as clang from the LLVM toolchain to build the kernel; these issues are being fixed one patch at a time. Our approach works with binary instructions and does not depend on a specific compiler toolchain. Moreover, compilers can only produce reductions via static analysis; there is no room for reduction based on run-time behavior. We take run-time behavior into consideration and make further reductions.
Binary specialization. Binary specialization techniques do not require reconfiguring or rebuilding the kernel; they work on the final kernel binary as is. KASR specializes the kernel binary using a VM-based approach: it traces all the pages in memory that are used by an application, for a few iterations until the trace no longer changes. This data is used to mark the unused pages in the extended page tables as non-executable, making those memory regions unavailable to the application. Face-Change and shadow kernels create specialized kernel text areas for each target application and switch between them using a hypervisor to support multiple applications running together; their performance is limited by that of the hypervisor. Face-Change reported performance overheads of approx. for I/O benchmarks and does not support multithreading. KASR customizes the kernel at a page-level granularity. MultiK has near-native performance while customizing the kernel at basic-block-level granularity and supporting multithreading.
Kernel redesign. An orthogonal direction to specializing general-purpose kernels for attack surface reduction is to use unikernels or microkernels that define a completely new architecture. Unikernels get rid of protection rings and place the application code and the kernel in a single ring to reduce the performance overhead arising from context switches. However, this leaves kernel code required for the entire system, including boot and termination, available to the application. Microkernels such as Mach design the kernel in a very modular manner so that the kernel TCB is minimal; this comes at the cost of performance overhead from context switching. Moreover, both of these approaches require the application to be rebuilt for the respective architectures. NOOKS redesigns the kernel to isolate device drivers from the kernel core, to protect against vulnerable device drivers. It still leaves a major part of the kernel (for boot, shutdown, and other OS tasks) active.
MultiK does not aim to provide the same level of isolation as virtual machines (VMs). Instead, it is a framework that runs specialized kernel code without losing the flexibility to integrate existing security mechanisms against different cyber-threats, e.g., data-only attacks. Our evaluation shows that MultiK can effectively reduce the kernel attack surface while multiplexing multiple commodity/specialized kernels with negligible overhead. MultiK can be easily integrated with container-based software deployment to achieve per-container kernels with no changes to the application.
-  Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David J. Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. Unikernels: library operating systems for the cloud. In Proceedings of the 18th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, TX, March 2013.
-  Reinhard Tartler, Anil Kurmus, Bernhard Heinloth, Valentin Rothberg, Andreas Ruprecht, Daniela Dorneanu, Rüdiger Kapitza, Wolfgang Schröder-Preikschat, and Daniel Lohmann. Automatic OS kernel TCB reduction by leveraging compile-time configurability. In Proceedings of the Eighth USENIX Conference on Hot Topics in System Dependability, HotDep’12, pages 3–3, Berkeley, CA, USA, 2012. USENIX Association.
-  Anil Kurmus, Reinhard Tartler, Daniela Dorneanu, Bernhard Heinloth, Valentin Rothberg, Andreas Ruprecht, Wolfgang Schröder-Preikschat, Daniel Lohmann, and Rüdiger Kapitza. Attack surface metrics and automated compile-time OS kernel tailoring. In Proceedings of the 20th Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA, February 2013.
-  Z. Gu, B. Saltaformaggio, X. Zhang, and D. Xu. FACE-CHANGE: Application-Driven Dynamic Kernel View Switching in a Virtual Machine. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, June 2014.
-  Zhi Zhang, Yueqiang Cheng, Surya Nepal, Dongxi Liu, Qingni Shen, and Fethi Rabhi. KASR: A reliable and practical approach to attack surface reduction of commodity OS kernels. In Proceedings of the 21st International Symposium on Research in Attacks, Intrusions and Defenses (RAID), Heraklion, Crete, Greece, September 2018.
-  Robert Wahbe, Steven Lucco, Thomas E. Anderson, and Susan L. Graham. Efficient software-based fault isolation. In Proceedings of the Fourteenth ACM Symposium on Operating System Principles, SOSP 1993, The Grove Park Inn and Country Club, Asheville, North Carolina, USA, December 5-8, 1993, pages 203–216, 1993.
-  Úlfar Erlingsson, Martín Abadi, Michael Vrable, Mihai Budiu, and George C. Necula. XFI: software guards for system address spaces. In 7th Symposium on Operating Systems Design and Implementation (OSDI ’06), November 6-8, Seattle, WA, USA, pages 75–88, 2006.
-  Joint Staff of Washington, DC. Information Assurance Through Defense in Depth. Technical report, February 2000.
-  Matthew Wilcox. I’ll do it later: Softirqs, tasklets, bottom halves, task queues, work queues and timers. In Linux. conf. au, 2003.
-  Crispin Cowan, Steve Beattie, Greg Kroah-Hartman, Calton Pu, Perry Wagle, and Virgil D. Gligor. Subdomain: Parsimonious server security. In LISA, pages 355–368, 2000.
-  Chris Wright, Crispin Cowan, Stephen Smalley, James Morris, and Greg Kroah-Hartman. Linux security modules: General security support for the Linux kernel. In Proceedings of the 11th USENIX Security Symposium, San Francisco, CA, USA, August 5-9, 2002, pages 17–31, 2002.
-  Fabrice Bellard. QEMU, a Fast and Portable Dynamic Translator. In Proceedings of the 2005 USENIX Annual Technical Conference (ATC), Anaheim, CA, April 2005.
-  Taesoo Kim and Nickolai Zeldovich. Practical and effective sandboxing for non-root users. In Andrew Birrell and Emin Gün Sirer, editors, 2013 USENIX Annual Technical Conference, San Jose, CA, USA, June 26-28, 2013, pages 139–144. USENIX Association, 2013.
-  John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pages 19–25, 1995.
-  Jonathan Corbet. KAISER: hiding the kernel from user space. https://lwn.net/Articles/738975, 2017.
-  Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading kernel memory from user space. In 27th USENIX Security Symposium, USENIX Security 2018, Baltimore, MD, USA, August 15-17, 2018., pages 973–990, 2018.
-  Martín Abadi, Mihai Budiu, Úlfar Erlingsson, and Jay Ligatti. Control-flow integrity. In Proceedings of the 12th ACM Conference on Computer and Communications Security, CCS 2005, Alexandria, VA, USA, November 7-11, 2005, pages 340–353, 2005.
-  Jamie Butler. DKOM (Direct Kernel Object Manipulation). In Black Hat USA Briefings (Black Hat USA), Las Vegas, NV, July 2004.
-  Jidong Xiao, Hai Huang, and Haining Wang. Kernel data attack is a realistic security threat. In Bhavani Thuraisingham, XiaoFeng Wang, and Vinod Yegneswaran, editors, Security and Privacy in Communication Networks, pages 135–154. Springer International Publishing, 2015.
-  Hong Hu, Shweta Shinde, Sendroiu Adrian, Zheng Leong Chua, Prateek Saxena, and Zhenkai Liang. Data-oriented programming: On the expressiveness of non-control data attacks. In Proceedings of the 37th IEEE Symposium on Security and Privacy (Oakland), San Jose, CA, May 2016.
-  Hong Hu, Zheng Leong Chua, Sendroiu Adrian, Prateek Saxena, and Zhenkai Liang. Automatic generation of data-oriented exploits. In Proceedings of the 24th USENIX Security Symposium (Security), Washington, DC, August 2015.
-  Jonathan Corbet. LWN: A different approach to kernel configuration. https://lwn.net/Articles/733405/, 2018.
-  Filipe Manco, Costin Lupu, Florian Schmidt, Jose Mendes, Simon Kuenzer, Sumit Sati, Kenichi Yasukata, Costin Raiciu, and Felipe Huici. My VM is lighter (and safer) than your container. In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP), Shanghai, China, October 2017.
-  Linux Kernel Tinification. https://tiny.wiki.kernel.org/.
-  Nicolas Pitre. LWN: Shrinking the kernel with link-time garbage collection. https://lwn.net/Articles/741494/, 2018.
-  Nicolas Pitre. LWN: Shrinking the kernel with link-time optimization. https://lwn.net/Articles/744507/, 2018.
-  Nicolas Pitre. LWN: Shrinking the kernel with a hammer. https://lwn.net/Articles/748198/, 2018.
-  Nicolas Pitre. LWN: Shrinking the kernel with an axe. https://lwn.net/Articles/746780/, 2018.
-  Andi Kleen. Linux Kernel LTO patch set. https://github.com/andikleen/linux-misc, 2014.
-  Vikram Adve and Will Dietz. ALLVM Research Project. https://publish.illinois.edu/allvm-project/ongoing-research/, 2017.
-  Jake Edge. LWN: Building the kernel with Clang. https://lwn.net/Articles/734071/, 2017.
-  Jonathan Corbet. Variable-length arrays and the max() mess. https://lwn.net/Articles/749064/, 2018.
-  Zhongshu Gu, Brendan Saltaformaggio, Xiangyu Zhang, and Dongyan Xu. FACE-CHANGE: application-driven dynamic kernel view switching in a virtual machine. Atlanta, GA, June 2014.
-  Oliver R. A. Chick, Lucian Carata, James Snee, Nikilesh Balakrishnan, and Ripduman Sohan. Shadow kernels: A general mechanism for kernel specialization in existing operating systems. In Proceedings of the 6th Asia-Pacific Workshop on Systems (APSys), Tokyo, Japan, July 2015.
-  Richard Draves. Mach: A new kernel foundation for UNIX development. In Proceedings of the Workshop on Micro-kernels and Other Kernel Architectures, Seattle, WA, April 1992.
-  Michael M. Swift, Steven Martin, Henry M. Levy, and Susan J. Eggers. Nooks: an architecture for reliable device drivers. ACM, 2002.
-  Jonathan Salwan. ROPgadget - Gadgets finder and auto-roper. http://shell-storm.org/project/ROPgadget/, 2011.
Appendix A
A.1 Docker Performance
Table 9 shows the performance evaluation when MultiK is integrated with Docker containers.
A.2 Redis and Gzip Benchmarks
Redis. We ran redis-server (version 3.2.6) on the specialized kernel and exercised redis-benchmark with its default configuration on Kernel0. The benchmark sends 100,000 requests for each Redis command with 50 concurrent clients and measures the number of requests serviced per second. Table 10 shows the requests per second for each command: the Redis server running on both the symbol- and block-granularity kernels achieves performance similar to the vanilla kernel for all tested commands.
Gzip. We measure the time spent in both kernel and user modes to compress a randomly generated 512 MB file with Gzip 1.6. Table 8 shows that gzip running on the specialized kernels, regardless of the trace granularity, has performance similar to the vanilla kernel.
A.3 Memory Effect
A.4 ROP Gadget Attacks
We use two open-source tools, ROPgadget and ROPPER (https://github.com/sashs/Ropper), to measure the unique ROP gadgets found. Table 11 shows the reduced number of ROP gadgets reported by the two tools; ROPPER consistently finds more gadgets than ROPgadget. A reduction in the number of gadgets is not, by itself, a good indicator for ROP attack evaluation, because attackers can still exploit the system if they can form the chains they need from the remaining gadgets. We report these reductions (Table 11) as a guide to system designers, but they should be taken with a grain of salt. MultiK can also be coupled with CFI to improve the system's protection against ROP attacks.
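The core enumeration step of such tools can be illustrated with a deliberately simplified sketch (this is not ROPgadget or ROPPER code; real tools disassemble with a full instruction decoder): scan a code blob for x86 RET opcodes and collect every byte sequence of bounded length that ends at one, then count the unique sequences.

```python
# Toy gadget enumeration (illustrative only, not a real gadget finder):
# collect every byte sequence of length <= max_len that ends in 0xC3 (RET).
def unique_ret_gadgets(code: bytes, max_len: int = 5) -> set:
    gadgets = set()
    for i, b in enumerate(code):
        if b == 0xC3:  # x86 RET opcode
            for start in range(max(0, i - max_len + 1), i + 1):
                gadgets.add(code[start:i + 1])
    return gadgets

# Synthetic blob: pop rbp; ret; pop rax; ret
blob = bytes([0x5D, 0xC3, 0x58, 0xC3])
print(len(unique_ret_gadgets(blob)))  # 5 unique candidate gadgets
```

Because every byte offset before a RET yields a candidate, even small amounts of removed kernel code can eliminate many reported gadgets, which is why raw gadget counts overstate the security benefit.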
| Benchmark | Vanilla | Symbol-granularity | Δ | Block-granularity | Δ |
|---|---|---|---|---|---|
| Apache (req per sec) | 13999.090 | 13988.560 | -0.08% | 13994.800 | -0.03% |
| gzip a 512 MB file (sec) | 11.062 | 11.052 | +0.01% | 11.022 | +0.36% |
| perf bench sched messaging (sec) | 0.192 | 0.192 | 0.00% | 0.193 | -0.52% |
| perf bench sched pipe (sec) | 16.063 | 16.098 | -0.22% | 15.953 | +0.68% |
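The relative differences reported above can be reproduced with a small helper (a sketch; the function name and the sign convention are our own, inferred from the reported values): for throughput the delta is computed against the vanilla value directly, while for elapsed time the sign is flipped so that a positive delta always means the specialized kernel performed better.

```python
# Reproduce the percentage deltas in the benchmark table.
# Positive delta = specialized kernel did better (sign convention assumed).
def delta_pct(vanilla: float, specialized: float, higher_is_better: bool) -> float:
    diff = (specialized - vanilla) if higher_is_better else (vanilla - specialized)
    return round(100.0 * diff / vanilla, 2)

print(delta_pct(13999.090, 13988.560, True))   # Apache req/s: -0.08
print(delta_pct(16.063, 15.953, False))        # perf sched pipe: 0.68
```
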