Lightweight, Multi-Stage, Compiler-Assisted Application Specialization

09/06/2021 ∙ by Mohannad Alhanahnah, et al.

Program debloating aims to enhance performance and reduce the attack surface of bloated applications. Several techniques have recently been proposed to specialize programs. These approaches are based on either unsound strategies or demanding techniques, leading to unsafe results or a high-overhead debloating process. In this paper, we address these limitations by applying partial-evaluation principles to generate specialized applications. Our approach relies on a simple observation: an application typically consists of configuration logic, followed by the main logic of the program. The configuration logic specifies what functionality in the main logic should be executed. LMCAS performs partial interpretation to capture a precise program state of the configuration logic based on the supplied inputs. LMCAS then applies partial-evaluation optimizations to generate a specialized program by propagating the constants in the captured partial state, eliminating unwanted code, and preserving the desired functionalities. Our evaluation of LMCAS on commonly used benchmarks and real-world applications shows that it successfully removes unwanted features while preserving the functionality and robustness of the debloated programs, runs faster than prior tools, and reduces the attack surface of specialized programs. LMCAS runs 1500x, 4.6x, and 1.2x faster than the state-of-the-art debloating tools CHISEL, RAZOR, and OCCAM, respectively; achieves a 25% reduction in binary size; reduces the attack surface of code-reuse attacks by removing 51.7% of the total gadgets; and eliminates known CVE vulnerabilities.


1 Introduction

The software stack is becoming increasingly bloated. This software growth decreases performance and increases security vulnerabilities. Software debloating is a mitigation approach that downsizes programs while retaining certain desired functionality. Although static program debloating can thwart unknown possibilities for attack by reducing the attack surface [attackSurface], prior work has generally not been effective, due to the overapproximation of static program analysis: a lot of bloated code remains because these tools statically determine the set of functions to be removed using a combination of analysis techniques, such as unreachable-function analysis and global constant propagation [BlankIt]. More aggressive debloating approaches (e.g., RAZOR [RAZOR] and Chisel [Chisel]) can achieve more reduction; however, they involve demanding techniques: the user needs to define a comprehensive set of test cases to cover the desired functionalities, generate many traces of the program, and perform extensive instrumentation. The computational expense of these steps leads to a high-overhead debloating process. The work on RAZOR acknowledges the challenge of generating test cases that cover all code, and incorporates heuristics to address this challenge. Furthermore, these aggressive approaches often break soundness, which can lead the debloated programs to crash or execute incorrectly. These issues make such approaches unsafe and impractical [BlankIt].

Partial evaluation is a promising program-specialization technique. It has been applied in prior work [OCCAM, TRIMMER], which, however, suffers from the overapproximation inherent in static program analysis. In particular, existing implementations rely only on the command-line arguments to drive the propagation of constants; they do not precisely capture the set of variables that are affected by the supplied inputs. Constant propagation is performed only for global variables that have one of the base types (int, char), and no attempt is made to handle compound datatypes (struct). Therefore, this approach leaves a substantial amount of unwanted code in the "debloated program", which reduces the security benefits, and it fails to preserve the behavior of the program after debloating due to unsound transformations (i.e., incorrect constant folding [TRIMMER]). (Our evaluation in §5 shows that multiple programs debloated by OCCAM crash or behave unexpectedly.)

In this paper, we present Lightweight Multi-Stage Compiler-Assisted Application Specialization (LMCAS), a new software-debloating framework. LMCAS relies on the observation that, in general, programs consist of two components: (a) configuration logic, in which the inputs are parsed, and (b) main logic, which implements the set of functionalities provided by the program. We call the boundary between the two divisions the neck. LMCAS captures a partial state of the program by interpreting the configuration logic based on the supplied inputs. The partial state comprises concrete values of the variables that are influenced by the supplied inputs. LMCAS then applies partial-evaluation optimizations to generate the specialized program. These optimizations involve converting the influenced variables at the neck into constants, applying constant propagation, and performing multiple stages of standard and customized compiler optimizations.

LMCAS makes significant and novel extensions to make debloating much safer in a modern context. The extensions involve optimizing the debloating process and improving its soundness. Specifically, we optimize the debloating process by introducing the neck concept, which eliminates demanding techniques that require: (1) executing the whole program, (2) generating many traces, (3) performing extensive instrumentation, and (4) obtaining a large set of tests. We demonstrate the soundness of our approach by validating the functionality of programs after debloating and under various settings. The achieved soundness is driven by capturing a precise partial state of the program, supporting various data types, and performing guided constant conversion and clean-up. Our evaluation demonstrates that LMCAS is quite effective: on average, LMCAS achieves substantial reductions in the binary size and in the number of functions in the specialized applications. LMCAS reduces the attack surface of code-reuse attacks by removing a large fraction of the total gadgets, and eliminates known CVE vulnerabilities. On average, LMCAS runs 1500x, 4.6x, and 1.2x faster than the state-of-the-art debloating tools CHISEL, RAZOR, and OCCAM, respectively. Hence, LMCAS strikes a favorable trade-off between functionality, performance, and security.

The contributions of our work are as follows:

  1. We propose the novel idea of dividing programs into configuration logic and main logic to reduce the overhead of the debloating process.

  2. We develop a neck miner to enable the identification of this boundary with high accuracy, substantially reducing the manual effort required to identify the neck.

  3. We apply the principles of partial evaluation to generate specialized programs based on supplied inputs. A partial program state is captured by partially interpreting the program, i.e., by executing the configuration logic according to the supplied inputs. The partial state is then enforced by applying compiler optimizations to the main logic to generate the specialized program.

  4. We develop the LMCAS prototype based on LLVM. LMCAS harnesses symbolic execution to perform partial interpretation and applies a set of LLVM passes to perform the compiler optimizations. We also carried out an extensive evaluation based on real-world applications to demonstrate the low overhead, size reduction, security, and practicality of LMCAS.

  5. We will make the LMCAS implementation and its artifacts available to the community.

2 Motivation and Background

In this section, we present a motivating example, and review necessary background material on program analysis and software debloating used in the remainder of the paper. Finally, we discuss debloating challenges illustrated by the motivating example, and describe our solutions for addressing these challenges.

2.1 Motivating Example

Listing 1 presents a scaled-down version of the UNIX word-count utility wc. It reads a line from the specified stream (i.e., stdin), counts the number of lines and/or characters in the processed stream, and prints the results. Thus, this program implements the same action that would be obtained by invoking the UNIX word-count utility with the -lc flag (i.e., wc -lc). Although Listing 1 merely supports two counting options, it is still bloated if the user is only interested in enabling the functionality that counts the number of lines.

 1 struct Flags {
 2   char count_chars;
 3   int count_lines; };
 4 int total_lines = 0;
 5 int total_chars = 0;
 6 int main(int argc, char** argv){
 7   struct Flags *flag;
 8   flag = malloc(sizeof(struct Flags));
 9   flag->count_chars = 0;
10   flag->count_lines = 0;
11   if (argc >= 2){
12     for (int i = 1; i < argc; i++) {
13       if (!strcmp(argv[i], "-c")) flag->count_chars = 1;
14       if (!strcmp(argv[i], "-l")) flag->count_lines = 1; }}
15   char buffer[1024];
16   while (fgets(buffer, 1024, stdin)){
17     if (flag->count_chars) total_chars += decodeChar(buffer);
18     if (flag->count_lines) total_lines++;}
19   if (flag->count_chars) printf("#Chars = %d", total_chars);
20   if (flag->count_lines) printf("#Lines = %d", total_lines); }
Listing 1: A scaled-down version of the wc utility. Highlighted statements are eliminated after debloating with "wc -l".

Put another way, Listing 1 goes against the principle of least privilege [plp, Cimplifier]: code that implements unnecessary functionality can contain security vulnerabilities that an attacker can use to obtain control or deny service; bloated code may represent an opportunity for privilege escalation. For instance, the character-count functionality "wc -c" processes and decodes the provided stream of characters via the function decodeChar (Line 17 of Listing 1). An attacker might force it to process special characters that decodeChar cannot handle [wc-bug]. More broadly, attackers can supply malicious code as specially crafted text that injects shellcode and thereby bypasses input restrictions [EnglishShellcode]. Downsizing the program is a way to reduce its attack surface, because the "wc -l" functionality does not require the character-processing code used in "wc -c": the call to the function decodeChar is completely absent from the specialized version for "wc -l". To achieve this goal, debloating should be performed safely, without applying high-demand techniques. The next section describes our approach for handling these challenges.

2.2 Background

Partial evaluation [partialEvaluation, slicing, tutorialPE] is an optimization and specialization technique that precomputes program expressions in terms of the known static input, i.e., a subset of the input that is supplied ahead of the remainder of the input. The result is another program, known as the residual program, which is a specialization of the original program. To create the residual program, a partial evaluator performs optimizations such as loop unrolling, constant propagation and folding, function inlining, etc. [partialEvaluation].
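As a small illustration (ours, not from the paper), consider specializing a power function with respect to the static input exp = 3; a partial evaluator unrolls the loop and folds away every expression that depends only on exp:

/* Original program: both arguments are dynamic. */
int power(int base, int exp) {
    int r = 1;
    for (int i = 0; i < exp; i++)
        r *= base;
    return r;
}

/* Residual program for the static input exp = 3: the loop has been
 * unrolled and every reference to exp has been precomputed away. */
int power_3(int base) {
    return base * base * base;
}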

Partial evaluation and symbolic execution [SymbolicExecution] both generalize standard interpretation of programs, but there are key differences between them. Specifically, symbolic execution interprets a program using symbolic input values, while partial evaluation precomputes program expressions and simplifies code based on the supplied inputs. Unlike partial evaluation, one result of symbolic execution is a set of expressions for the program’s output variables. Because their capabilities are complementary, the two techniques have been employed together to improve the capabilities of a software-verification tool [Interleaving].
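To make the contrast concrete (our illustration, not from the paper): for the same function, symbolic execution produces path constraints and output expressions, whereas partial evaluation produces a residual program.

/* Subject program. */
int sign(int x) {
    if (x > 0) return 1;
    return -1;
}

/* Symbolic execution with a symbolic x explores both paths and yields
 * constraint/output pairs:
 *   path 1:  (x > 0)  => result = 1
 *   path 2: !(x > 0)  => result = -1
 * Partial evaluation with the static input x = 5 instead resolves the
 * test and emits a residual program: */
int sign_5(void) { return 1; }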

LLVM [LLVM:CGO04] provides a robust compiler infrastructure for popular programming languages, including C and C++, and supports a wealth of compiler analyses and optimizations that make it suitable for developing new compiler transformations. LLVM operates on its own low-level code representation, known as the LLVM intermediate representation (LLVM IR). LLVM is widely used in both academia and industry.

2.3 Challenges and Solutions

In this section, we formalize the program-specialization problem illustrated in Listing 1. In general, there is (i) a program P that provides a set of functionalities F, and (ii) an input space I that contains a set of values that enable certain functionalities in F. Typically, one or more functionalities are enabled based on a set of supplied inputs I_s ⊆ I, which are provided as part of a command-line argument or configuration file. Generating a specialized program based on the set of supplied inputs I_s requires identifying a set of variables V_s that are influenced by the supplied inputs, and a corresponding set of constant values C_s. The relationship between V_s and C_s is bijective, and |V_s| = |C_s|. To generate a specialized program P_s that (i) retains the required functionalities based on the supplied inputs I_s, and (ii) removes irrelevant functionalities, we need to address the challenges discussed below:

Challenge 1: How can the debloating process be optimized so that high-demand techniques are avoided?

Solution. To address this challenge, we propose to interpret the program partially, up to a certain point, instead of executing the whole program. We can achieve this partial interpretation by relying on the observation that, in general, programs consist of two components: (a) configuration logic, in which the inputs (from the input space I) are parsed, and (b) main logic, which implements the set of functionalities F. We call the boundary point the neck. The partial interpreter needs only the part of the program state that becomes available by executing the program up to the neck. By this means, we optimize the debloating process, yet obtain a precise characterization of the set of variables that are influenced by the supplied arguments. We then convert these variables to constants, based on the constant values identified by the partial interpreter. These values are then propagated to other parts of the program via partial evaluation.

Consider Listing 1 again. The program wc provides two functionalities, and these functionalities can be activated through the two inputs -l and -c. To generate the specialized program that retains the line-counting functionality (i.e., "wc -l") based on the supplied input -l, we interpret the program up to the neck (i.e., Line 15) to identify the set of influenced variables V_s (e.g., flag->count_lines and total_lines) and the corresponding constant values C_s (the partial state of P). We supply this information to the partial evaluator to generate the specialized program.

Challenge 2: How can the program be simplified sufficiently, while ensuring that it operates correctly and that its functionality and soundness are preserved?

Solution. The combination of partial interpretation followed by partial-evaluation optimizations holds the promise of achieving significant debloating. To achieve this promise and preserve the program semantics, it is necessary to handle various data types and complex data structures (e.g., strings, pointers, and structs). By using a precise model of the programming language's semantics, more information about variables and their values is made available, which in turn enables more optimizations to be carried out during program specialization. Therefore, we need to capture a broad spectrum of variables. For instance, the scaled-down word-count in Listing 1 contains a stack variable (flag) and two global variables (total_lines and total_chars). Various data types need to be supported as well: the global variables are integers, whereas the stack variable flag is a pointer to a struct that consists of two fields (count_lines and count_chars). Supporting these various kinds of variables gives LMCAS the ability to perform safe debloating and maintain soundness.

3 LMCAS Framework

Figure 1: LMCAS Workflow.

This section introduces LMCAS, a lightweight debloating framework that uses sound analysis techniques to generate specialized programs. Figure 1 illustrates the architecture of LMCAS. The debloating pipeline of LMCAS receives as input the whole-program LLVM bitcode of a program P, and performs the following major phases to generate a specialized program P_s as bitcode, which is ultimately converted into a binary executable.

  • Neck Miner (Section 3.1). Receives the program to be specialized and modifies it by adding a special function call that marks the neck. Section 3.1 describes our approach to identifying the neck based on heuristic and structural analysis.

  • Partial Interpretation (Section 3.2). Interprets the program up to the neck (i.e., terminates after executing the special function call inserted by the neck miner), based on the supplied inputs that control which functionality should be supported by the specialized application. The output of this phase provides a precise partial state of the program at the neck. This partial state comprises the variables that have been initialized during the partial interpretation and their corresponding values.

  • Constant Conversion (Section 3.3). Incorporates into the program the partial state captured by the partial interpretation. It converts the variables and their corresponding values captured in the partial state (i.e., V_s and C_s) to settings of constants at the neck. This phase also provides the opportunity to boost the degree of subsequent optimization steps by supporting the conversion of multiple kinds of variables to constants.

  • Multi-Stage Simplification (Section 3.4). Applies selected standard and customized LLVM passes for optimizing the program and removing unnecessary functionality. These optimization steps are arranged and tailored to take advantage of the values introduced by the constant-conversion phase.

3.1 Neck Miner

Figure 2: The neck miner selects Line 15 as the splitting point to partition the motivating example in Listing 1 into configuration logic and main logic. The splitting point is called the neck.

We developed a neck miner to recommend potential neck locations. To illustrate the neck idea, consider the example in Figure 2, which represents the motivating example of Listing 1, but with the split between configuration logic and main logic emphasized. The configuration-logic part consists of Lines 4-14, which include the declaration and initialization of the global variables total_lines and total_chars, and the first part of function main. The rest of main (Lines 15-20) represents the main logic, which contains the functionalities of counting lines and counting characters. We call the point at the boundary between the two components the neck. Because the neck is located at the end of the configuration logic, where core arguments are parsed, the neck location is independent of the values of the supplied inputs. Thus, neck identification needs to be conducted only a single time for each program, and the neck location can be reused for different inputs, i.e., for different invocations of the debloater on that program. In the motivating example (Listing 1), the same neck location at Line 15 can be used for debloating whether the supplied input is "wc -l" or "wc -c".

The neck miner uses two analyses: heuristic analysis and structural analysis, as described in Algorithm 1. The heuristic analysis relies on various patterns, corresponding to command-line and configuration-file programs, to identify a location from which to start the structural analysis, which identifies the neck properly.

Input: CFG, EntryPoint, programCategory, fileParsingAPIs
Output: NeckLocation

/* Heuristic Analysis */
if programCategory is Command-Line then
    for each instruction inst in CFG that uses argv do
        if inst is inside a loop-structure then
            distanceToInst[inst] = computeDistance(inst, EntryPoint, CFG)
else if programCategory is Config-File then
    for each instruction inst in CFG do
        if inst calls one of fileParsingAPIs then
            distanceToInst[inst] = computeDistance(inst, EntryPoint, CFG)
startingPointForStructuralAnalysis = InstAtShortestDistance(distanceToInst)

/* Structural Analysis */
for each instruction inst in CFG do
    if inst is after startingPointForStructuralAnalysis in CFG then
        if inst satisfies the control-flow properties from §3.1.2 then
            distanceToNeckLoc[inst] = computeDistance(inst, EntryPoint, CFG)
NeckLocation = InstAtShortestDistance(distanceToNeckLoc)
Add a special function call before NeckLocation to mark the neck

Algorithm 1: Neck Miner Algorithm

3.1.1 Heuristic Analysis

This step guides the structural analysis. It identifies a single location from which the structural analysis can be conducted, and relies on a set of patterns that apply to two categories of programs. These patterns are described as follows:

Command-Line-Program Patterns (the first branch of the heuristic analysis in Algorithm 1): the inputs are provided to this category of programs via command-line arguments. In C/C++ programs, command-line arguments are passed to main() via the parameters argc and argv: argc holds the number of command-line arguments, and argv[] is a pointer array whose elements point to the different arguments passed to the program. Consequently, this analysis step tracks the uses of the argument argv. Specifically, because argv is a pointer array that, in general, points to multiple arguments, the analysis identifies the uses of argv that occur inside a loop.

Configuration-File-Program Patterns (the second branch of the heuristic analysis in Algorithm 1): this category of programs relies on configuration files to identify the required functionalities. As a representative example, consider how the neck is identified in Nginx, a web-server program that supports different configuration options [EuroSec19]. Listing 2 presents a simple Nginx configuration file. The gzip directive at line 8 is associated with the libz.so library. In some cases, multiple directives have to be defined to enable a certain capability, such as the SSL-related directives at lines 7, 9, and 10 of Listing 2. The heuristic analysis identifies the first location where the configuration file is parsed by certain APIs. Identifying such APIs is simple because programs use system-call APIs to read files. For instance, Nginx uses the Linux system call pread (https://man7.org/linux/man-pages/man2/pwrite.2.html) to read the configuration file.

Finally, after identifying the set of statements that match the various patterns, the heuristic analysis returns the statement that is closest to the CFG's entry point; ties are broken arbitrarily. In the motivating example (Listing 1), the heuristic analysis selects the statement at Line 13 because it is the closest location to the entry point that matches the command-line patterns.

 1 worker_processes 1;
 2 events { worker_connections 1024; }
 3 http {
 4   charset UTF_8;
 5   keepalive_timeout 65;
 6   server {
 7     listen 443 ssl;               # libssl.so
 8     gzip on;                      # libz.so
 9     ssl_certificate cert.pem;     # libssl.so
10     ssl_certificate_key cert.key; # libssl.so
11 } }
Listing 2: Nginx configuration file

3.1.2 Structural Analysis

This step identifies the neck location by analyzing the program's statements, starting from the location specified by the heuristic analysis (the structural-analysis phase of Algorithm 1). It identifies the statements that satisfy a certain set of control-flow properties, discussed below. Because several statements may match, the statement closest to the entry point is selected. (Ties are broken arbitrarily.) The closest statement is determined by computing the shortest distances in the CFG from the entry point to the neck candidates. The remainder of this section formalizes the aforementioned control-flow properties.

A program P is a 4-tuple (vars, stmts, entry, exit), where vars is the set of variables, stmts is the set of statements, entry ∈ stmts is the entry point of the program, and exit ∈ stmts is the exit of the program. As defined in Section 2.3, we assume that there is a set V_s ⊆ vars, which we call the set of influenced variables (e.g., command-line parameters of a utility). Note that vars \ V_s is the set of "internal" or non-influenced variables. The location of a statement s is denoted by loc(s). For simplicity, we assume that Val is the set of values that the vars in P can take.

Let A_p : V_s → Val ∪ {⊥} be a partial assignment to the set of influenced variables (we assume that if A_p(v) = ⊥, then v has not been assigned a value). An assignment A is consistent with partial assignment A_p iff for all v ∈ V_s, if A_p(v) ≠ ⊥, then A(v) = A_p(v). A statement n ∈ stmts is a neck for a program P and a partial assignment A_p (denoted by neck(P, A_p)) if the following conditions hold:

  • Given any assignment A consistent with A_p, execution of P always reaches the neck n, and the statement corresponding to the neck is executed exactly once. This condition rules out the following possibilities: given A, the execution of P (i) might never reach the neck (the intuition here is that we do not want to miss some program statements, which would cause debloating to be unsound), or (ii) the statement corresponding to the neck is inside a loop.

  • Let stmts_after ⊆ stmts be the set of statements defined as follows: s ∈ stmts_after iff loc(s) appears after loc(n). Then n can be identified as an articulation point of the CFG of P, such that one of the resulting connected components over-approximates stmts_after. Another structural condition could be defined as follows: stmts_after is the set of all statements that are dominated by the neck.

The neck miner is fully automated, except for the structural-analysis matching step in Algorithm 1 (identifying the statements that satisfy the control-flow properties), which currently requires manual intervention. We argue that this effort is manageable; moreover, it is a one-time effort for each program. Once the developer identifies the statements that satisfy the control-flow properties, they are fed to the neck miner, which selects the statement that is closest to the entry point in the CFG. Finally, a special function call (which serves to label the neck location) is inserted before the identified neck location.

Consider Listing 1 again. The developer iterates over the program code, starting from Line 13 (specified by the heuristic analysis), to identify the locations that satisfy the control-flow properties. The developer ignores Lines 13 and 14 because they violate the control-flow properties: they are not articulation points, and they are inside a loop, hence not executed exactly once. Line 15 satisfies the control-flow properties because the statement at this location is executed only once, is an articulation point, and dominates all subsequent statements.
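The following sketch shows Listing 1 after neck mining; the marker name lmcas_neck is our illustration (the paper says only that a special function call is inserted before the identified neck location):

#include <stdlib.h>
#include <string.h>

struct Flags { char count_chars; int count_lines; };

/* Empty marker: partial interpretation terminates once this call
 * has executed (see Section 3.2). */
void lmcas_neck(void) {}

int main(int argc, char **argv) {
  struct Flags *flag = malloc(sizeof(struct Flags));
  flag->count_chars = 0;
  flag->count_lines = 0;
  if (argc >= 2)                        /* configuration logic (Lines 11-14) */
    for (int i = 1; i < argc; i++) {
      if (!strcmp(argv[i], "-c")) flag->count_chars = 1;
      if (!strcmp(argv[i], "-l")) flag->count_lines = 1;
    }
  lmcas_neck();   /* the neck: executed exactly once, an articulation
                     point, and a dominator of the main logic */
  /* ... main logic of Listing 1 (Lines 15-20), unchanged ... */
  return 0;
}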

3.2 Partial Interpretation (PI)

Partial interpretation is a supporting phase whose goal is to identify, at the neck, the set of variables (and their values) that contribute to the desired functionality. Partial interpretation is performed by running a symbolic-execution engine, starting at the program entry, and executing the program (using partial program states, starting with the supplied concrete values) along a single path to the neck. After partial interpretation terminates, the partial state is saved, and the values of all variables are extracted. Different types of variables are extracted, including base types (e.g., int, char) and constructed/compound types (e.g., enum, pointer, array, and struct).

Consider a network-monitoring tool, such as tcpdump, that supports multiple network interfaces and takes as input a parameter that specifies which network interface to use. A specialization scenario might require monitoring a specific interface (e.g., Ethernet) and capturing a specific number of packets: for tcpdump, the command would be "tcpdump -i ens160 -c 100" (see the inputs listed in Table 5, Section 5). The first argument identifies the Ethernet interface ens160, while the second argument specifies that 100 packets should be captured. The former argument is a string (treated as an array of characters in C); the latter is an int.

Returning to the example from §2 (Listing 1), Figure 2 illustrates the location of the neck. Suppose that the desired functionality is to count the number of lines (i.e., wc -l). Table 1 shows a subset of the variables and their corresponding values that will be captured and stored in LMCAS’s database after partial interpretation finishes.

Variable            Type   Scope    Value
total_lines         int    Global   0
total_chars         int    Global   0
flag->count_lines   int    Local    1
flag->count_chars   char   Local    0

Table 1: Partial program state that contains the set of captured variables and the corresponding values obtained at the neck after partial interpretation of the program in Listing 1.

3.3 Constant Conversion (CC)

This phase aims to propagate constants in the configuration logic to enable further optimizations. For instance, this phase contributes to removing input arguments that were not enabled during symbolic execution, and thus allows tests that check those inputs to be eliminated.

Constant conversion is a non-standard optimizing transformation because optimization is performed upstream of the neck: uses of variables in the configuration logic, which comes before the neck, are converted to constants, based on values captured at the neck. The transformations performed during this phase enforce that the state at the neck in the debloated program is consistent with the partial state of constants at the neck that was captured at the end of partial interpretation. Standard dataflow analyses (e.g., def-use chains) [Storm] for global and stack variables are used to replace all occurrences of the variables with their corresponding constant values in the program code before the neck. Because some of the program statements become dead after constant conversion, the replacement is performed for all occurrences (i.e., accesses) of the variables obtained after partial interpretation.

The CC phase receives as input the bitcode of the whole program generated using WLLVM (https://github.com/SRI-CSL/whole-program-llvm), as well as a dictionary (similar to Table 1) that maps the set of variables V_s captured after symbolic execution to their constant values C_s. The set V_s involves global and stack variables (base-type, struct, and pointer variables). The CC phase then iterates over the IR instructions to identify the locations where the variables are accessed, which is indicated by load instructions. It then replaces the loaded value with the corresponding constant value. This approach works for global variables and stack variables with base types. However, for pointers to base variables, it is necessary to identify the locations where the pointer modifies a base variable (by looking for store instructions whose destination-operand type is a pointer to a base type). The source operand of the store operation is modified to use the constant value corresponding to the actual base variable pointed to by the pointer.

For stack variables that are structs or pointers to structs, we first need to identify the memory address that is pointed to by these variables, which facilitates tracing back to the corresponding struct and pointer-to-struct variables. We then iterate over the uses of the identified memory addresses to determine the store operations that modify the variable (corresponding to the memory addresses). Finally, we convert the source operand of the store operations to the appropriate constant. We also use the element index recorded during symbolic execution to identify which struct element should be converted.

For string variables, we identify the instructions that represent string variables, create an array, and assign the string to the created array. Finally, we identify store instructions that use the string variable as their destination operand, and override the store instruction's source operand to use the constant string value.
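As a rough sketch of the basic load-replacement step (LMCAS's actual passes are written against the LLVM C++ API; here we use the LLVM-C bindings, and lookup_captured_constant is a hypothetical helper that maps a variable's address to the constant captured at the neck, or NULL if none was captured):

#include <llvm-c/Core.h>
#include <stddef.h>

/* Hypothetical: returns the constant captured for var in the partial
 * state, or NULL if var was not captured. */
extern LLVMValueRef lookup_captured_constant(LLVMValueRef var);

void convert_loads_to_constants(LLVMModuleRef mod) {
  for (LLVMValueRef fn = LLVMGetFirstFunction(mod); fn;
       fn = LLVMGetNextFunction(fn))
    for (LLVMBasicBlockRef bb = LLVMGetFirstBasicBlock(fn); bb;
         bb = LLVMGetNextBasicBlock(bb))
      for (LLVMValueRef inst = LLVMGetFirstInstruction(bb); inst; ) {
        LLVMValueRef next = LLVMGetNextInstruction(inst);
        /* a load reads a variable; its single operand is the address */
        if (LLVMIsALoadInst(inst)) {
          LLVMValueRef k = lookup_captured_constant(LLVMGetOperand(inst, 0));
          if (k) {
            LLVMReplaceAllUsesWith(inst, k);  /* uses now see the constant */
            LLVMInstructionEraseFromParent(inst);
          }
        }
        inst = next;
      }
  /* The real pass additionally restricts replacement to occurrences
   * before the neck, and handles stores through pointers, struct
   * fields, and strings as described above. */
}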

In wc (Listing 1), no replacements are performed for global variables total_lines and total_chars before the neck: there are no such occurrences. Replacements are performed for referents of the pointer-to-struct flag: the occurrences of flag->count_chars and flag->count_lines at lines 13 and 14 are replaced with the corresponding values listed in Table 1.

3.4 Multi-Stage Simplification (MS)

This phase begins with the result of constant conversion, and performs whole-program optimization to simplify and remove unnecessary code. In this phase, we used existing LLVM passes, as well as one pass that we wrote ourselves. In particular, LMCAS uses the standard LLVM pass for constant propagation to perform constant folding; it then uses another standard LLVM pass to simplify the control flow. Finally, it applies an LLVM pass we implemented to handle the removal of unnecessary code.

Constant Propagation. This optimization step folds variables that hold known values by invoking the standard LLVM constant-propagation pass. Constant folding allows instructions to be removed.

Simplifying the CFG. LMCAS benefits from the previous step of constant propagation to make further simplifications by invoking a standard LLVM pass, called simplifycfg. This pass determines whether the conditions of branch instructions are always true or always false: unreachable basic blocks are removed, and basic blocks with a single predecessor are merged.
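Putting the stages together, a minimal sketch of the MS pipeline using the LLVM-C bindings shipped with LLVM 6.0 (the version the prototype targets); run_lmcas_cleanup stands in for the custom clean-up pass of Algorithm 2 below, whose code is not shown in the paper:

#include <llvm-c/Core.h>
#include <llvm-c/Transforms/Scalar.h>

extern void run_lmcas_cleanup(LLVMModuleRef mod);  /* Algorithm 2; assumed */

void multi_stage_simplification(LLVMModuleRef mod) {
  LLVMPassManagerRef pm = LLVMCreatePassManager();
  LLVMAddConstantPropagationPass(pm); /* fold constants introduced by CC */
  LLVMAddCFGSimplificationPass(pm);   /* drop dead branches, merge blocks */
  LLVMRunPassManager(pm, mod);
  LLVMDisposePassManager(pm);
  run_lmcas_cleanup(mod);   /* remove unused functions/globals/stack vars */
}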

Input: P, visitedFunc
Output: P'

/* Remove unused functions */
CG = constructCallGraph(P)
for each function func in P with func ∉ visitedFunc do
    if func is not an operand of other instructions then
        remove func from P and CG
for each function func in CG that is not reachable from main do
    if func is not an operand of other instructions then
        remove func from P and CG
        remove func's descendant nodes from P and CG if they are not reachable from main

/* Remove unused global variables */
for each global variable var in P do
    if var is not an operand of other instructions then
        remove var from P

/* Remove unused stack variables */
for each function func in P do
    for each instruction inst in func do
        if inst is an allocation instruction then
            if inst is not an operand of other instructions then
                remove inst from func
            else if inst is a destination operand of only one store instruction then
                remove the store instruction from func
                remove inst from func

Algorithm 2: LMCAS Clean-up

Clean Up. In the simplification pass, LMCAS removes useless code (i.e., code whose results no operation uses [cooper2011engineering]) and unreachable code, including dead stack and global variables and uncalled functions. Although LLVM provides passes that perform aggressive optimization, we wrote a targeted LLVM pass that gives us more control in prioritizing the clean-up of unneeded code, as described in Algorithm 2. The pass receives the modified program after the CC phase and the list of functions visited during the Partial Interpretation phase (visitedFunc).

The first priority is to remove unused functions. The goal is to remove two categories of functions: (i) those that are called only from call-sites before the neck, but were not called during symbolic execution, and (ii) those that are never called from the set of functions transitively reachable from main, including via indirect call-sites. Function removal is performed after constructing the call graph. To handle indirect call-sites, Algorithm 2 also checks the number of uses of a function before removing it. This check prevents the removal of a function invoked via a function pointer.

The focus then shifts to simplifying the remaining functions. For removing global variables, we iterate over the list of global variables and remove the unused ones. Finally, we remove stack variables, including initialized-but-unused variables, by iterating over the remaining functions and erasing unused allocation instructions. (In general, standard LLVM simplifications do not remove a stack variable that is initialized but not otherwise used, because the function contains a store operation that uses the variable. Our clean-up pass removes an initialized-but-unused variable by deleting the store instruction, and then the allocation instruction.)
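A minimal sketch of the function- and global-removal steps in LLVM-C (the real pass is a C++ pass; is_reachable_from_main is a hypothetical call-graph query). The LLVMGetFirstUse check is the guard described above that protects functions invoked only through function pointers:

#include <llvm-c/Core.h>
#include <stddef.h>

extern int is_reachable_from_main(LLVMValueRef fn);  /* assumed helper */

void cleanup_functions_and_globals(LLVMModuleRef mod) {
  for (LLVMValueRef fn = LLVMGetFirstFunction(mod); fn; ) {
    LLVMValueRef next = LLVMGetNextFunction(fn);
    /* no uses anywhere (including as a function-pointer operand) and
     * not transitively callable from main => safe to delete */
    if (LLVMGetFirstUse(fn) == NULL && !is_reachable_from_main(fn))
      LLVMDeleteFunction(fn);
    fn = next;
  }
  for (LLVMValueRef g = LLVMGetFirstGlobal(mod); g; ) {
    LLVMValueRef next = LLVMGetNextGlobal(g);
    if (LLVMGetFirstUse(g) == NULL)   /* global never loaded or stored */
      LLVMDeleteGlobal(g);
    g = next;
  }
}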

In wc (Listing 1), after the CC phase both the count_chars and count_lines fields of the struct pointed to by stack variable flag are replaced by the constants 0 and 1, respectively (see Table 1). The simplification steps remove the tests at lines 18 and 20 because the values of their conditions are always true. Because the values of the conditions in the tests at lines 17 and 19 are always false, control-flow simplification removes both the tests and the basic blocks of their true branches. Furthermore, the removal of these basic blocks removes all uses of the global variable total_chars, and thus the clean-up step removes it as an unused variable.
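The net effect on the example is the following residual program (a sketch of ours; the paper reports which statements are eliminated but does not print the final code):

#include <stdio.h>

int total_lines = 0;   /* total_chars has been removed as unused */

int main(int argc, char **argv) {   /* argv is no longer consulted */
  char buffer[1024];
  while (fgets(buffer, 1024, stdin))
    total_lines++;                  /* test on flag->count_lines folded away */
  printf("#Lines = %d", total_lines);
  return 0;
}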

4 Implementation

Neck Miner. This component is implemented as an LLVM analysis pass. In command-line programs, we use LLVM's def/use API to track the use of argv. For configuration-file programs, we iterate over the LLVM IR code to identify call-sites of the pre-identified file-parsing APIs. The developer has the responsibility of identifying the program locations that satisfy the structural properties from §3.1.2 (i.e., locations that are executed only once and dominate the main logic). This task is relatively easy because the developer can rely on existing LLVM analysis passes to compute the dominator tree and verify the structural properties. We argue that such efforts are manageable; more importantly, they are one-time efforts. (Such a semi-automated approach has also been used in prior work [mpi], and a completely manual approach in [Temporal].) Finally, the neck location is marked by adding a special function call to the program being analyzed.

Partial Interpretation. Our implementation uses KLEE [KLEE] to perform the partial interpretation because it (1) models memory with bit-level accuracy, and (2) can handle interactions with the outside environment—e.g., with data read from the file system or over the network—by providing models designed to explore various possible interactions with the outside world. We modified KLEE 2.1 to stop and capture the set of affected variables and their corresponding values after the neck is reached. In essence, KLEE is being used as an execution platform that supports “partial concrete states.” For LMCAS, none of the inputs are symbolic, but only part of the full input is supplied. In the case of word-count, the input is “wc -l”, with no file-name supplied. Only a single path to the neck is followed, at which point KLEE returns its state (which is a partial concrete state). The second column in Tables 5 and 6 describes the inputs supplied to KLEE for the different examples.

Multi-Stage Simplification. We also developed two LLVM passes, using LLVM 6.0, to perform constant conversion (CC) and the clean-up step of the MS phase. We implemented these passes because no existing LLVM passes provide these functionalities. We tried existing LLVM passes, such as global dead-code elimination (DCE), to remove unused code; however, global DCE handles only global variables (and even some global variables cannot be removed). We also noticed that not all stack variables are removed, so in our clean-up pass we employ def-use information to identify stack variables that are loaded but not used. The removal of indirect calls is also not provided by LLVM. To prevent the removal of functions invoked via a function pointer, our clean-up pass checks that the number of uses of a function is zero before removing the function.

5 Evaluation

This section presents our experimental evaluation of LMCAS. We address the following research questions:

  • Effectiveness of Neck Miner: How accurate is the neck miner in identifying the neck location? (5.1)

  • Optimizing Debloating Process: Does LMCAS speed up the debloating process w.r.t. running time? (5.2)

  • Functionality Preservation and Robustness: Does LMCAS produce functional programs, and how robust are the debloated programs produced by LMCAS? (5.3)

  • Code Reduction: What is the debloating performance of LMCAS w.r.t. the amount that programs are reduced in size? (5.4)

  • Security: Can LMCAS reduce the attack surface? (5.5)

  • Scalability: How scalable is LMCAS in debloating large apps? (5.6)

Experimental Setup. Our evaluation relies on three datasets, as shown in Table 2. Benchmark_1 contains 15 programs from GNU Coreutils v8.32 (see Table 6). Benchmark_2 contains six programs obtained from ChiselBench (https://github.com/aspire-project/chisel-bench; see Table 4). Benchmark_3 consists of three programs (see Table 5). The selection of programs in Benchmark_1 was motivated by their use in prior papers on software debloating. We used Benchmark_2 because it provides a list of CVE vulnerabilities and the corresponding apps; this dataset facilitates our evaluation of CVE removal, and allows us to compare against the CVE-removal performance of CHISEL and RAZOR.

All experiments were conducted on an Ubuntu machine with a 2.3GHz Quad-Core Intel i7 processor and 16GB RAM, except the fuzzing experiment, for which we used an Ubuntu machine with a 3.8GHz Intel(R) Core(TM) i7-9700T CPU and 32GB RAM.

Source                   Label         # of apps
GNU Coreutils 8.32       Benchmark_1   15
CHISEL Benchmark         Benchmark_2   6
Tcpdump & GNU Binutils   Benchmark_3   3

Table 2: Benchmark sets used in the evaluation.

Compared tools and approaches. To evaluate the effectiveness of LMCAS, we compared with the following tools and approaches:

  • Baseline. We establish the baseline by compiling each app’s LLVM bitcode at the -O2 level of optimization. This baseline approach was used in prior work [TRIMMER].

  • OCCAM [OCCAM]. The system most comparable to the approach used by LMCAS. However, OCCAM does not perform constant propagation, and thus omits a major component of partial evaluation.

  • CHISEL [Chisel]. It requires the user to identify wanted and unwanted functionalities, and uses reinforcement learning to perform the reduction process.

  • RAZOR [RAZOR]. Similar to CHISEL, RAZOR relies on test cases to drive the debloating but incorporates heuristic analysis to improve soundness. RAZOR performs debloating for binary code, while the others operate on top of the LLVM IR code.

We considered CHISEL and RAZOR because they represent state-of-the-art tools that apply aggressive debloating techniques, and we selected OCCAM because it is a state-of-the-art partial-evaluation tool, and thus the tool closest to LMCAS. Comparing against these tools helps verify the capabilities and effectiveness of LMCAS.

5.1 Effectiveness of the Neck Miner

In this experiment, we measured the effectiveness of the neck miner in facilitating neck identification. Our evaluation involved the programs specified in Table 2. These programs belong to various projects: Coreutils, Binutils, Diffutils, Nginx and Tcpdump. For all programs, neck mining was successful, and the identified neck location was used to perform debloating.

For some programs, such as GNU wc and date, there were multiple candidate neck locations before the shortest-distance criterion of Algorithm 1 was applied. (Table 8 in Appendix B contains the full set of results.)

The neck location is inside the main function for the majority of the programs, the exceptions being readelf and Nginx. With the help of the neck miner, the one manual step of Algorithm 1 (identifying the statements that satisfy the control-flow properties) took only a few minutes per program. More specifically, for each program, the automated heuristic analysis ran in a matter of seconds on average, and the manual part of the structural analysis took a few minutes. This amount of time is acceptable, given that neck identification is performed only once per program.

As mentioned in Section 3.1, the neck is identified only once for each program: the same neck can be used, regardless of what arguments are supplied. To verify this property, we debloated various programs based on different supplied inputs. For example, we debloated sort and wc under several input settings each, and (for each program) the same neck location was used in all debloating settings. Similarly, a single neck location is used for multiple debloatings of each of the programs listed in Tables 5 and 9 (Appendix B).

5.2 Optimizing Debloating Process

We compared the running time of LMCAS against those of CHISEL, RAZOR, and OCCAM on Benchmark_2. We used this benchmark because it was used by both CHISEL and RAZOR, so the required test cases are available; otherwise, we would have needed to construct a set of test cases, which is not trivial. The debloating settings for this experiment are listed in Table 7. As depicted in Figure 3, the running times of LMCAS and OCCAM are significantly lower than those of the aggressive debloating techniques CHISEL and RAZOR. On average, LMCAS runs 1500x, 4.6x, and 1.2x faster than CHISEL, RAZOR, and OCCAM, respectively. This result illustrates that LMCAS not only substantially speeds up the debloating process in contrast to aggressive debloating tools, but also slightly outperforms partial-evaluation-based debloating techniques.

Figure 3: Running times of LMCAS, CHISEL, RAZOR, and OCCAM based on Benchmark_2 (ChiselBench).

5.3 Functionality Preservation and Robustness

In this experiment, we ran the binaries before and after debloating against given test cases to understand their robustness. The majority of the programs in Benchmark_2 debloated using RAZOR and CHISEL suffer from run-time issues, including crashes, infinite loops, and unexpected operations. These issues are reported and discussed in [RAZOR]. In our experiment, we found that all of the programs debloated by CHISEL exhibit such issues. Among the programs in Benchmark_2, all but one of the OCCAM-debloated programs work correctly, and all of the LMCAS-debloated applications run correctly.

Since LMCAS and OCCAM have comparable results on Benchmark_2, we extended our evaluation by debloating the programs in Benchmark_1 according to the settings in Table 6. Five of the 15 OCCAM-debloated programs crash (i.e., segmentation fault) or generate inaccurate results, as reported in Table 3. In contrast, all of the LMCAS-debloated programs run correctly (i.e., LMCAS preserves the programs' behavior).

We further assessed the robustness of the debloated programs using fuzzing, which has previously been used to verify the robustness of debloated programs [Chisel]. The aim was to test whether programs debloated by LMCAS function correctly and do not crash. We used AFL (version 2.56b), a state-of-the-art fuzzing tool [afl], to perform this experiment. We used AFL's black-box mode because our analysis is performed on LLVM bitcode; therefore, we could not use AFL to instrument the source code. We ran AFL on the debloated programs created from Benchmark_1 and Benchmark_2 for six days. AFL did not bring out any failures or crashes of the debloated programs in either dataset. This experiment provides additional confidence in the correctness of the debloated programs created using LMCAS.

Program    OCCAM   LMCAS
basename   ✓       ✓
basenc     ✓       ✓
comm       ✓       ✓
date       ✓       ✓
du         W       ✓
echo       ✓       ✓
fmt        ✓       ✓
fold       ✓       ✓
head       L       ✓
id         ✓       ✓
kill       W       ✓
realpath   ✓       ✓
sort       L       ✓
uniq       ✓       ✓
wc         C       ✓

Table 3: Evaluation of functionality preservation after debloating by LMCAS and OCCAM for programs in Benchmark_1. ✓ means functionality is correctly preserved; otherwise: crashing (C), infinite loop (L), or wrong operation (W).

5.4 Code Reduction

We used Benchmark_1 to compare the performance of LMCAS against the baseline and OCCAM. Figure 4 shows the average reduction in size, using four different size metrics, achieved by the baseline, OCCAM, and LMCAS. All size measures, except binary size, are taken from the LLVM Intermediate Representation (IR). For computing the binary-size metric, we compiled all debloated apps with gcc and ran size. We report the sum of the sizes of all sections in the binary file (text + data + bss) because this quantity reflects the outcome of our simplifications across all sections. LMCAS achieved a significantly higher reduction rate (roughly double) in comparison with the baseline and OCCAM.

This result is due to the fact that the clean-up step of LMCAS can remove nodes in the call-graph that correspond to functions in binary libraries that are not used. Although the geometric-mean binary-size reductions of LMCAS and OCCAM are close, some of the specialized programs generated by OCCAM are not reliable (as discussed in Section 5.3).

Figure 4: Average reduction in size achieved through baseline, OCCAM, and LMCAS, using four different size metrics. (Higher numbers are better.)

Although the baseline shows a higher average reduction rate at the instruction and basic-block levels, its average reduction in binary size is the worst. Indeed, it increases the binary size for two programs (basenc and kill), as depicted in Figure 5 (Appendix C presents extended results), which compares the binary-size reduction that each tool achieved for each app in Benchmark_1.

Figure 5: Binary size reduction achieved through the baseline, OCCAM, and LMCAS. (Higher numbers are better.)

5.5 Security Benefits of LMCAS

We evaluated the capabilities of LMCAS to reduce code-reuse attacks and to remove known vulnerabilities. To that end, we conducted the following three experiments: (i) we attempted to reproduce (by executing the app) each vulnerability after debloating, to see whether it had been eliminated; (ii) we measured the reduction in the opportunity for code-reuse attacks by counting the number of eliminated gadgets in compiled versions of the original and reduced programs; and (iii) we compared the degree of gadget reduction achieved by LMCAS and LLVM-CFI.

Vulnerability Removal. To test the ability to mitigate vulnerabilities, we used the six programs in Benchmark_2 because this benchmark contains a set of known CVEs. Table 4 presents a comparison between LMCAS, RAZOR, OCCAM, and CHISEL. LMCAS and OCCAM removed the CVEs from most of the six programs; CHISEL and RAZOR removed CVEs from some of the programs as well. However, the sort-8.16 debloated by OCCAM did not behave correctly at run-time. Likewise, although OCCAM removed the CVE from rm-8.4, the debloated rm-8.4 shows unexpected infinite-loop behavior; we suspect that OCCAM may remove loop-condition checks. LMCAS could not remove the vulnerability in date-8.21 because the bug is located in the core functionality of this program. When undesired functionality is too intertwined with the core functionality (e.g., multiple quantities are computed by the same loop), LMCAS may not be able to remove some undesired functionality because, to the analysis phases of LMCAS, it does not appear to be useless code. In such cases, LMCAS retains the desired functionality. In contrast, CHISEL tries to remove all undesired functionality.

App CVE ID RAZOR CHISEL OCCAM LMCAS
chown-8.2 CVE-2017-18018
date-8.21 CVE-2014-9471
gzip-1.2.4 CVE-2015-1345
rm-8.4 CVE-2015-1865
sort-8.16 CVE-2013-0221
uniq-8.16 CVE-2013-0222
  • The CHISEL and RAZOR CVE-removal results are taken from the corresponding publications.

  • Although OCCAM removed the CVE in rm-8.4, the debloated version suffers from an infinite loop at run-time.

Table 4: Vulnerabilities after debloating by RAZOR, CHISEL, OCCAM, and LMCAS. ✓ means the CVE vulnerability is eliminated; ✗ means that it was not removed.

Gadget Elimination. Code-reuse attacks leverage existing code snippets in the executable, called gadgets, as the attack payload [CSET19]. Those gadgets are often categorized—based on their last instruction—into Return-Oriented Programming (ROP), Jump-Oriented Programming (JOP), and syscall (SYS) [Piece-Wise, CARVE]. Code-reuse attacks can be mitigated using software diversification, which can be achieved through application specialization. Software debloating is a specialization technique that can both reduce the number of exploitable gadgets and change the locations of gadgets, thus diversifying the binary.

In this experiment, we used Benchmark_1 and observed noticeable reductions in the total gadget count (occurrences of ROP, SYS, and JOP gadgets), as illustrated in Figure 6. We used ROPgadget [ROPGadget] to count the number of gadgets. The average reduction (arithmetic mean) in the total number of unique gadgets achieved by LMCAS is 51.7%, with the maximum reduction achieved for date; OCCAM also reduces the total number of unique gadgets on average, with its maximum reduction achieved for echo. For one program, sort, OCCAM increases the total number of gadgets. With LMCAS, the number of SYS gadgets is reduced to zero for most of the applications. LMCAS caused an increase in the number of SYS gadgets in one application (sort), but still produced an overall decrease when ROP and JOP gadgets are taken into account. A similar increase was also observed with TRIMMER in three applications [TRIMMER].

Figure 6: Reduction in the total number of unique gadget occurrences (SYS, ROP, and JOP) between LMCAS and OCCAM for Benchmark_1. (Higher numbers are better.)

Control-Flow Integrity (CFI). CFI is a prominent mechanism for reducing a program's attack surface by preventing control-flow hijacking attacks: CFI confines the program to a specific set of CFG paths, and prevents the kinds of irregular transfers of control that take place when a program is attacked. Although CFI does not specifically aim to reduce the number of gadgets, others have observed empirically that CFI reduces the number of unique gadgets in programs [enforce_CFI, Fine-CFI, Ancile]. Thus, we compared the degree of gadget reduction achieved by LMCAS and LLVM-CFI (a state-of-the-art static CFI mechanism) [Ancile]. We compiled the LLVM bitcode of our suite of programs using clang, with the flags -fsanitize=cfi -fvisibility=default. Among the programs analyzed, LMCAS outperformed LLVM-CFI on 60% of the programs by creating a program with a smaller total number of unique gadgets. (Table 10 in Appendix E contains the full set of results.) The last column in Table 10 shows that a significant reduction in unique gadgets, beyond what either LMCAS or LLVM-CFI is capable of alone, is obtained by first applying LMCAS and then LLVM-CFI.

5.6 Scalability

We evaluated the capability of LMCAS to handle large-scale programs that process complex input formats, such as object files, binaries, and network packets.

We used Benchmark_3 in this experiment. The programs in Benchmark_3 have been used in prior work to evaluate scalability [OCCAM, SymCC], including the scalability of KLEE [moklee]. Accordingly, we use the following applications to show the scalability of LMCAS. (Because deciding which lines of code belong to an individual application is difficult, lines of code (LOC) are given for the whole application suites from which our benchmarks are taken; we used scc (https://github.com/boyter/scc) to report LOC.)

  • tcpdump [tcpdump] (version 4.10.0; 77.5k LOC) analyzes network packets. We link against its accompanying libpcap library (version 1.10.0; 44.6k LOC).

  • readelf and objdump from GNU Binutils [Binutils] (version 2.33; 78.3k LOC plus 964.4k LOC of library code, counting only the binutils folder and the dependencies libbfd, libctf, libiberty, and libopcodes). readelf displays information about ELF files, while objdump displays information about object files.

We debloated these programs using different inputs to illustrate the capability of LMCAS to handle and debloat programs based on single and multiple inputs. For example, we debloated readelf based on one argument (-S) and on nine arguments (-h -l -S -s -r -d -V -A -I). Table 5 breaks down the analysis time in terms of Partial Interpretation (third column) and the combination of Constant Conversion and Multi-Stage Simplification (fourth column). For these programs, the time for symbolic execution is lower than that for the LLVM-based simplifications. This situation is expected because these programs contain a large number of functions, so the LLVM simplification steps need more time. The inclusion of third-party libraries diminishes the reduction rate in binary size, which is clearly illustrated by the reduction rate achieved for tcpdump.

Program   Supplied Inputs              PI (sec)   CC & MS (sec)   Binary Size Reduction Rate
tcpdump   -i ens160                    48.1       173.1           2%
          -i ens160 -c 5               48.2       201.7           2%
readelf   -S                           10.6       41.7            4.8%
          -h -l -S -s -r -d -V -A -I   20.15      72.4            4.71%
objdump   -x                           40.84      246.17          5.65%
          -h -f -p                     48.07      320.11          5.71%

Table 5: Scalability analysis of large applications, based on various inputs.

6 Discussion and Limitations

Generality of the neck concept. This concept applies to various types of programs. Our evaluation involves command-line programs, in which the neck can be easily identified; we did not have to discard any program because neck identification was impossible. We also inspected Nginx and observed that all directives in the configuration file are read before reaching the main logic (as in command-line programs). This partitioning into configuration logic and main logic cannot be applied to event-driven programs, which require constant interaction with the user to handle the functionalities the user requests; in such programs, the configuration of program features is performed at various locations. However, we foresee that our partitioning approach applies to event-driven programs that use a server architecture, whose life-cycle is divided into initialization and serving phases [Temporal]. Our future work will consider evaluating such programs.

Incorrect neck identification. Misidentifying the neck may lead to incorrect debloating. However, the neck miner incorporates a set of heuristics and structural features to recommend accurate neck locations. The heuristic analysis aids neck identification by pinpointing the starting point from which the neck miner establishes its analysis. For instance, the GNU Coreutils programs use a particular idiom for parsing command-line parameters; in principle, a special-purpose algorithm could be designed to identify the neck for programs that use that idiom. The neck miner then applies a set of structural requirements to constrain the neck candidates: for example, the neck must be executed only once, and the neck must be an articulation point of the control-flow graph. These structural properties reflect the nature of the neck definition; the sketch below illustrates the articulation-point requirement.
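One way to check the articulation-point requirement (a simplified sketch under our own assumptions, not the neck miner's actual implementation) is to test whether removing the candidate block disconnects the CFG exit from the entry, i.e., whether every entry-to-exit path passes through the candidate:

    #include <stack>
    #include <vector>

    /* Returns true if every entry->exit path in the CFG (given as an
       adjacency list of basic-block successors) passes through candidate. */
    bool onEveryEntryToExitPath(const std::vector<std::vector<int>> &cfg,
                                int entry, int exit, int candidate) {
        std::vector<bool> visited(cfg.size(), false);
        visited[candidate] = true;          /* treat the candidate as removed */
        std::stack<int> work;
        if (entry != candidate) { visited[entry] = true; work.push(entry); }
        while (!work.empty()) {
            int n = work.top(); work.pop();
            if (n == exit) return false;    /* exit reachable while avoiding it */
            for (int succ : cfg[n])
                if (!visited[succ]) { visited[succ] = true; work.push(succ); }
        }
        return true;                        /* no entry->exit path avoids it */
    }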

Precision of Constant Conversion. LMCAS relies on converting a subset of the variables in the captured concrete state into constants. If one of the variables could have been converted to a constant, but was not identified by LMCAS as being convertible (and therefore no occurrences of the variable are changed), no harm is done: that situation just means that some simplification opportunities may be missed. On the other hand, if some variable occurrences are converted to constants unsoundly, the debloated program may not work correctly.

We mitigate this issue in two ways: (1) we avoid converting certain variables (e.g., argv) to constants, because such variables carry the remaining inputs (i.e., delayed inputs), which differ from those supplied during debloating: for instance, with wc the file name is not supplied during partial interpretation, but is supplied to the debloated program; (2) we leverage existing LLVM APIs that track the uses of variables (i.e., getUser) to capture the final constant values at the end of partial interpretation. This approach handles situations where a pointer indirectly updates the value of a location, and ensures that the pre-neck constant-conversion step operates on updated, accurate constant values.
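As a concrete illustration, the following LLVM sketch (a minimal approximation of the constant-conversion step, assuming the value of an integer global was captured at the neck; it is not LMCAS's actual pass) replaces loads of the global with the captured constant, after which standard LLVM passes propagate the constant and prune dead branches:

    #include "llvm/ADT/STLExtras.h"
    #include "llvm/IR/Constants.h"
    #include "llvm/IR/GlobalVariable.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/Module.h"

    using namespace llvm;

    /* Replace every load of the named integer global with the constant
       value captured by partial interpretation at the neck. */
    static void convertGlobalToConstant(Module &M, StringRef Name,
                                        uint64_t CapturedValue) {
        GlobalVariable *GV = M.getGlobalVariable(Name);
        if (!GV || !GV->getValueType()->isIntegerTy())
            return;                       /* this sketch handles only integers */
        Constant *C = ConstantInt::get(GV->getValueType(), CapturedValue);
        for (User *U : make_early_inc_range(GV->users())) {
            if (auto *LI = dyn_cast<LoadInst>(U)) {
                LI->replaceAllUsesWith(C); /* uses now see the constant */
                LI->eraseFromParent();
            }
        }
    }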

Reducing the Attack Surface. LMCAS reduces the attack surface in various ways, such as by removing some known CVEs and eliminating some code-reuse attacks. Our future work will enforce stronger security properties: specializing control flow by restricting the set of allowed indirect control-flow transfers in the remaining code via CFI techniques [Ancile, enforce_CFI], and disabling security-critical system calls that are unused in the main logic [Temporal].

7 Related Work

A variety of software-debloating techniques have been developed in the research community, mainly over the last three years [EuroSec19, pldi10, stochasticOptimization, Temporal, RAZOR, BinaryTrimming, BinRec, CARVE, wedDebloat, RedDroid, Nibbler, ESORICS2019, Cozart, Unikernel]. In this section, we discuss various lines of research on software debloating and partial evaluation that are related to our work.

Program Partitioning. Our work is peripherally related to prior work on program partitioning. MPI [mpi] reduces the overhead of provenance tracking and audit logging by partitioning a program's execution based on annotated data structures. Privtrans [Privtrans] applies program partitioning to integrate privilege separation: programs are divided into two components, the monitor and the slave, which run as separate processes but cooperate to perform the same function as the original program. Glamdring [Glamdring] partitions applications at the source-code level, minimizing the amount of code placed inside an enclave. Our use of partitioning is for a different purpose: we create exactly two partitions, separated by the neck, and the partition before the neck is used to identify constant values that are then used to specialize the partition that comes after the neck.

In the debloating domain, Ghavamnia et al. [Temporal] propose a debloating approach to reduce the attack surface in server applications. This approach partitions the execution life-cycle of server programs into initialization and execution phases, and then reduces the number of system calls available in the execution phase. However, the approach requires manual intervention from the developer to identify the boundary between the two phases, without providing specifications to guide the identification process. In contrast, LMCAS performs specialization of the main logic and incorporates a neck miner to suggest a possible neck location. The neck miner provides semi-automatic support for the partitioning process and identified the neck correctly in the programs we evaluated.

Partial Evaluation has been used in numerous domains, including software debloating [TRIMMER], software verification [Interleaving], and test-case generation [Albert2010PETAP]. Bubel et al. [Interleaving] use a combination of partial evaluation and symbolic execution: because their capabilities are complementary, the two techniques are employed together to improve the capabilities of a software-verification tool. However, the goals and modes of interaction are different from ours: in the work of Bubel et al., partial evaluation is used to speed up repeated execution of code by a symbolic-execution engine; in our work, symbolic execution is in service to partial evaluation, by finding values that hold at the neck.

Application Specialization. For debloating Java programs, JShrink [JShrink] applies static and dynamic analysis. Storm [Storm] is a general framework for reducing probabilistic programs. For debloating C/C++ programs, TRIMMER [TRIMMER] and OCCAM [OCCAM] use partial evaluation. TRIMMER overcomes some limitations of OCCAM by adding loop unrolling and constant propagation; however, in both tools constant propagation is performed only for global variables, so TRIMMER and OCCAM miss specialization opportunities that involve local variables, which makes the debloating process unsafe. In contrast, LMCAS can accurately convert the elements of struct variables into constants. Furthermore, our analysis considers pointers, both to base types and to struct types, which boosts the reliability of LMCAS. The sketch below illustrates the difference.
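The following illustrative C fragment (our own example, not drawn from the evaluated benchmarks) shows the kind of specialization opportunity that requires tracking struct fields and pointer updates; a tool that folds only base-type globals cannot simplify the branch in main:

    #include <stdio.h>

    struct options { int show_lines; int show_words; };
    static struct options opts;             /* compound-type global */

    static void enable_lines(struct options *p) {
        p->show_lines = 1;                  /* field updated through a pointer */
    }

    int main(void) {
        enable_lines(&opts);                /* configuration logic */
        /* neck: opts is now fixed. Converting opts.show_lines to the
           constant 1 lets the else-branch below be removed. */
        if (opts.show_lines)
            puts("counting lines");
        else
            puts("feature disabled");
        return 0;
    }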

Aggressive debloating tools like CHISEL [Chisel] and RAZOR [RAZOR] can achieve a significantly higher reduction rate in the size of specialized applications; however, these tools are prone to run-time issues (e.g., crashing, infinite loops). Furthermore, the debloating process takes a long time because these tools apply burdensome techniques, based on extensive program instrumentation, and require users to provide a comprehensive set of test cases. RAZOR uses a best-effort heuristic approach to overcome the challenge of generating test cases that cover all code. In contrast, LMCAS applies lightweight techniques based on program partitioning, and the specialized programs generated by LMCAS do not suffer from run-time issues.

Function Specialization. Saffire [Saffire] specializes call-sites of sensitive methods to handle certain parameters based on the calling context. Quach et al. [PieceWise] propose a tool called Piecewise for debloating libraries. Piecewise constructs accurate control-flow-graph information at compilation and linking time according to an application's usage.

8 Conclusion

In this paper, we present LMCAS, a practical and lightweight debloating approach for generating specialized applications. To speed up our analysis, LMCAS introduces the neck concept: a splitting point where the “configuration logic” of a program hands off control to the “main logic” of the program. We develop a neck miner to alleviate the amount of manual effort required to identify the neck. LMCAS applies partial interpretation only up to the neck; the main logic is then optimized according to the values obtained from analyzing the configuration logic. LMCAS thereby eliminates the overhead of demanding techniques and boosts the safety of the debloating process. LMCAS achieves a substantial reduction in the size of programs, and also reduces their attack surface.

References

Appendix A Benchmark Characteristics and Debloating Settings

Table 6 lists the programs in Benchmark_1 and the supplied inputs for debloating, and reports various size metrics for the original programs. Table 7 provides the list of supplied inputs that we used for debloating the programs in Benchmark_2; this list of inputs is obtained from [RAZOR].

Program | Supplied Inputs | # IR Inst. | # Func. | # Basic Blocks | Binary Size
basename | --suffix=.txt | 4,083 | 96 | 790 | 26,672
basenc | base64 | 8,398 | 156 | 1,461 | 44,583
comm | -12 | 5,403 | 110 | 972 | 32,714
date | -R | 29,534 | 166 | 6,104 | 89,489
du | -h | 50,727 | 466 | 8,378 | 180,365
echo | -E | 4,095 | 89 | 811 | 27,181
fmt | -c | 5,732 | 115 | 1,095 | 79,676
fold | -w30 | 4,623 | 100 | 893 | 29,669
head | -n3 | 6,412 | 119 | 1,175 | 37,429
id | -G | 5,939 | 125 | 1,172 | 36,985
kill | -9 | 4,539 | 96 | 898 | 31,649
realpath | -P | 8,092 | 155 | 1,419 | 41,946
sort | -u | 25,574 | 329 | 3,821 | 116,119
uniq | -d | 5,634 | 115 | 1,092 | 37,159
wc | -l | 7,076 | 130 | 1,219 | 41,077

Total binary size obtained via the GNU size utility.

Table 6: Characteristics of the original benchmarks in Benchmark_1.
Program Supplied Inputs
chown -h, -R
date -d, --rfc-3339, -utc
gzip -c
rm -f, -r
sort -r, -s, -u, -z
uniq -c, -d, -f, -i, -s, -u, -w
Table 7: Input settings for the programs in Benchmark_2 (obtained from [RAZOR]).

Appendix B Neck Miner Evaluation

Table 8 presents the neck miner evaluation results. The second column indicates whether multiple neck locations matched the control-flow properties; the third column indicates whether the selected neck location is inside the main function.

Table 9 shows that LMCAS performed debloating under various debloating settings while using the same neck location identified in each program. This experiment shows that the neck location is independent of the input arguments.

Program | Multiple Neck Locations | Inside main
basename 8.32
basenc 8.32
comm 8.32
date 8.32
du 8.32
echo 8.32
fmt 8.32
fold 8.32
head 8.32
id 8.32
kill 8.32
realpath 8.32
sort 8.32
uniq 8.32
wc 8.32
chown 8.2
date 8.21
rm 8.4
sort 8.16
uniq 8.16
gzip 1.2.4
tcpdump 4.10.0
objdump 2.33
readelf 2.33
diff 2.8
Nginx 1.19.0
Table 8: Neck miner results. The second column indicates whether there are multiple neck locations to select from; the third column indicates whether the identified neck location is inside the main function.
App | Supplied Inputs | Required Functionality | #Func. Reduction | Binary Size Reduction | Total Gadgets Reduction
du | -b | show the number of bytes | 23% | 15% | 46%
du | -b --time | show the time of the last modification and the number of bytes | 22% | 14% | 45%
sort | -c | check whether the given file is already sorted | 34% | 28% | 54%
sort | -n | sort a file numerically | 31% | 25% | 51%
sort | -un | sort a file numerically and remove duplicates | 31% | 25% | 51%
wc | -c | character count | 42% | 21% | 41%
wc | -w | word count | 42% | 21% | 41%
wc | -lc | line and character count | 43% | 22% | 42%
wc | -wc | word and character count | 42% | 21% | 42%
Table 9: Debloating a subset of apps from Benchmark_1 based on various input arguments, using the same neck location identified in each program.

Appendix C Code-Reduction Comparison with Other Tools

This section provides a detailed code-reduction comparison with two additional debloating approaches.

  • Debugger-guided manual debloating. We developed a simple but systematic protocol to perform debloating manually, which we state as Algorithm 3. The goal of this manual approach is to approximate the maximum level of reduction that an average developer could achieve.

  • Nibbler [Nibbler2]. Nibbler is a state-of-the-art tool for debloating binary code. It does not generate specialized apps; rather, it focuses only on reducing the size of shared libraries.

Input: App A, input I
Output: Debloated app A´
1  executed ← the set of statements that GDB reports are executed, given input I
2  A´ ← A
3  repeat
4      for each statement stmt in A´ do
5          if stmt ∉ executed then
6              remove stmt from A´
7              if stmt is a call site then add the callee to funcToRemove
8              if stmt references a variable then add the variable to varToRemove
9      while funcToRemove ≠ ∅ or varToRemove ≠ ∅ do
10         for each func in funcToRemove do
11             remove func from funcToRemove
12             if no occurrence of func exists then remove func from A´
13         for each var in varToRemove do
14             remove var from varToRemove
15             if no occurrence of var exists then remove var from A´
16     if A´ does not build correctly then
17         put back the removed code (undo the removal from A´)
18 until no more removals of statements
Algorithm 3 Debugger-guided Manual Debloating Protocol

We used Benchmark_1 to compare the performance of LMCAS against manual debloating, the baseline, Nibbler, and OCCAM. Figure 7 shows the comparison results, based on the reduction in binary size that each tool achieved for each app in Benchmark_1. For the binary-size metric, we compiled all debloated apps with gcc -O2 and ran size.

Figure 7: Binary size reduction achieved through manual debloating, baseline, Nibbler, OCCAM, and LMCAS. (Higher numbers are better.)

Appendix D LMCAS Running Time

We measured the running time of LMCAS. Figure 8 shows the breakdown of running time for Benchmark_1 between (i) Partial Interpretation (PI), and (ii) Constant Conversion (CC) plus Multi-stage Simplification (MS).

The average total running time is seconds; the maximum total running time is seconds for analyzing sort; and the lowest total analysis time is seconds for analyzing basename. Notably, the time for Constant Conversion and Multi-stage Simplification is low: on average, the time for constant conversion and multi-stage simplification is seconds, while the average time for Partial Interpretation is seconds.

Figure 8: The time required for partial interpretation (PI) and the partial-evaluation steps (Constant Conversion and Multi-stage Simplifications) for Benchmark_1.

Appendix E CFI Experiment

Program | Original | LLVM-CFI | LMCAS | LMCAS + LLVM-CFI
basename | 1794 | 964 | 841 | 261
basenc | 3063 | 1805 | 1309 | 793
comm | 2095 | 1145 | 964 | 794
date | 12119 | 3654 | 3381 | 1592
du | 15094 | 7874 | 8503 | 5873
echo | 1835 | 446 | 876 | 442
fmt | 2496 | 1403 | 1230 | 1158
fold | 2094 | 1168 | 1015 | 769
head | 2671 | 1398 | 1366 | 932
id | 2514 | 1214 | 1183 | 801
kill | 1924 | 1147 | 919 | 1054
realpath | 3073 | 1610 | 1664 | 1658
sort | 7558 | 4804 | 4185 | 3699
uniq | 2516 | 1280 | 1121 | 776
wc | 2225 | 1611 | 1320 | 973
objdump | 115587 | 103241 | 107985 | 80156
readelf | 58186 | 50512 | 56519 | 45320
tcpdump | 82682 | 53205 | 67417 | 50809
chown | 2890 | 2280 | 2529 | 1998
rm | 3068 | 2316 | 2579 | 2083
Table 10: Total unique ROP-gadget counts for the original binaries, binaries compiled with LLVM-CFI, binaries debloated with LMCAS, and binaries debloated with LMCAS and then compiled with LLVM-CFI.