V-Fuzz: Vulnerability-Oriented Evolutionary Fuzzing

01/04/2019 ∙ by Yuwei Li, et al. ∙ Zhejiang University ∙ Georgia Institute of Technology

Fuzzing is a technique for finding bugs by repeatedly executing a program with a large number of abnormal inputs. Most existing fuzzers treat all parts of a program equally and pay too much attention to improving code coverage. This is inefficient, as vulnerable code makes up only a tiny fraction of the entire code. In this paper, we design and implement a vulnerability-oriented evolutionary fuzzing prototype named V-Fuzz, which aims to find bugs efficiently and quickly in a limited time. V-Fuzz consists of two main components: a neural network-based vulnerability prediction model and a vulnerability-oriented evolutionary fuzzer. Given a binary program, the vulnerability prediction model gives a prior estimate of which parts of the software are more likely to be vulnerable. Then, guided by the prediction result, the fuzzer leverages an evolutionary algorithm to generate inputs that tend to arrive at the vulnerable locations. Experimental results demonstrate that V-Fuzz can find bugs more efficiently than state-of-the-art fuzzers. Moreover, V-Fuzz has discovered 10 CVEs, 3 of which are newly discovered. We reported the new CVEs, and they have been confirmed and fixed.

1 Introduction

Fuzzing is an automated vulnerability discovery technique that feeds manipulated random or abnormal inputs to a software program [1]. With the rapid improvement of computer performance, fuzzing has been widely used by software vendors such as Google [7] and Microsoft [8] for detecting bugs in software. However, the efficiency of fuzzing still badly needs improvement, especially for detecting bugs in large or complex software.

It is time-consuming for fuzzers to discover bugs, as they usually work blindly. This blindness is mainly reflected in the following aspects. First, fuzzers generate inputs blindly. The input generation strategies are usually based on simple evolutionary algorithms, or merely mutate the seed inputs at random without any feedback information, especially in blackbox fuzzers. Second, fuzzers are blind to the fuzzed software, especially blackbox and greybox fuzzers, i.e., they know little about the fuzzed program. Blackbox fuzzers [22] [6] treat a program as a black box and are unaware of its internal structure [5]. They simply generate inputs at random. Thus, discovering a bug is like looking for a needle in a haystack and depends largely on luck. Greybox fuzzers are also unaware of the program's source code. They usually perform some reverse analysis of binary programs. However, these analyses are just simple binary instrumentation, mainly used to measure code coverage.

Although whitebox fuzzers [26] [27] can analyze the source code of a program to increase code coverage or to reach certain critical program locations, many problems remain in applying them to real programs. Additionally, whitebox fuzzers are still blind in some respects. They usually leverage symbolic execution or similar techniques to generate inputs that try to go through as many paths as possible. Nevertheless, symbolic execution-based input generation is largely ineffective against large programs [37]. For instance, Driller is a guided whitebox fuzzer that combines AFL [9] with concolic execution [43]. It was benchmarked on 126 DARPA CGC [4] binaries, yet its concolic engine could generate valid inputs for only 13 out of 41 binaries. Therefore, it is not practical to apply symbolic execution to explore meaningful paths, especially in large software. Moreover, as it is hard to generate enough valid inputs, the whitebox strategy of blindly pursuing high code coverage is unwise and may waste a lot of time and energy.

As indicated by Sanjay et al. [57], the crux of a fuzzer is its ability to generate bug-triggering inputs. However, most state-of-the-art fuzzers, e.g., coverage-based fuzzers, mainly focus on improving code coverage. Intuitively, fuzzers with higher code coverage can potentially find more bugs. Nevertheless, it is not appropriate to treat all code in a program as equal, for the following reasons. First, it is difficult to achieve full code coverage for real programs. As real programs may be very large, generating inputs that satisfy some sanity checks is still extremely hard [62]. Second, achieving high code coverage requires a great deal of time and computing resources. Third, vulnerable code usually makes up a tiny fraction of the entire code. For example, Shin et al. [68] found that only 3% of the source code files in Mozilla Firefox have vulnerabilities. Although higher code coverage can enable a fuzzer to find more bugs, most of the covered code may not be vulnerable. Thus, a balance is needed between fuzzing efficiency and the cost in time and computing resources. Therefore, fuzzers should prioritize the code that has a higher probability of being vulnerable.

To this end, we develop a vulnerability-oriented fuzzing framework, V-Fuzz, for quickly finding more bugs in binary programs in a limited time. In particular, we mainly focus on detecting vulnerabilities in binary programs since, in most cases, we cannot get access to the source code of software, especially Commercial Off-The-Shelf (COTS) software. Nevertheless, our approach is also suitable for open source software, which can be compiled into binaries. V-Fuzz consists of two components: a vulnerability prediction model and a vulnerability-oriented evolutionary fuzzer. The prediction model is built on a neural network. Given a binary, the prediction model analyzes which components of the software are more likely to have vulnerabilities. Then, the evolutionary fuzzer leverages the prediction information to generate high-quality inputs that tend to arrive at these potentially vulnerable components. Based on the above strategies, V-Fuzz reduces the waste of computing resources and time, and significantly improves the efficiency of fuzzing. In addition, V-Fuzz still pays a relatively small amount of attention to the other parts of the binary program, which have a lower probability of being vulnerable. Therefore, it can mitigate the false negatives of the vulnerability prediction model. In summary, our contributions are as follows:

1) We analyze the limitations of existing coverage-based fuzzers. They usually simply treat all the components of a program as equal, in hope of achieving a high code coverage, which is inefficient.

2) To improve fuzzing efficiency, we propose V-Fuzz, a vulnerability-oriented evolutionary fuzzing prototype. V-Fuzz significantly improves fuzzing performance by leveraging two main strategies: deep learning-based vulnerable component prediction and prediction information-guided evolutionary fuzzing.

3) To examine the performance of V-Fuzz, we conduct extensive evaluations on 10 popular Linux applications and three programs from the popular fuzzing benchmark LAVA-M. The results demonstrate that V-Fuzz is efficient in discovering bugs in binary programs. For instance, compared with three state-of-the-art fuzzers (VUzzer, AFL and AFLFast), V-Fuzz finds the most unique crashes in 24 hours. Moreover, we discovered 10 CVEs with V-Fuzz, 3 of which are newly discovered. We reported the new CVEs, and they have been confirmed and fixed.

2 Background

2.1 Motivation of V-Fuzz

 1  #define SIZE 1000
 2  int main(int argc, char **argv){
 3      unsigned char source[SIZE];
 4      unsigned char dest[SIZE];
 5      char *input=ReadData(argv[1]);
 6      int i;
 7      i=atoi(argv[2]);
 8      /* magic byte check */
 9      if(input[0]!='*')
10          return ERROR;
11      if(input[1]==0xAB && input[2]==0xCD){
12          printf("Pass 1st check!\n");
13          /* some nested condition */
14          if(strncmp(&input[6],"abcde",5)==0){
15              printf("Pass 2nd check!\n");
16              /* some common codes without vulnerabilities */
17
18          }
19          else{
20              printf("Not pass the 2nd check!");
21          }
22      }
23      else{
24          printf("Not pass the 1st check!");
25      }
26      /* A buffer overflow vulnerability */
27      func_v(input, source, dest);
28      return 0;
29  }
Fig. 1: A motivation example.

Fuzzers can be categorized into several types from different perspectives. Based on different input generation methods, fuzzers can be categorized as generation-based and mutation-based. A generation-based fuzzer [2] [3] [6] generates inputs according to the input model designed by users. A mutation-based fuzzer [9] [10] [11] generates inputs based on mutating a corpus of seed files during fuzzing. In this paper, we focus on mutation-based fuzzing.

Based on the exploration strategy, fuzzers can be classified as directed and coverage-based. A directed fuzzer [21] [27] [33] [35] [53] generates inputs that aim to arrive at some specific regions. A coverage-based fuzzer [9] [10] [49] [63] aims to generate inputs that can traverse as many paths as possible.

Figure 1 shows a highly simplified scenario that illustrates our motivation. In this example, the function main calls the function func_v in line 27. The main function contains magic byte checks and error checks from line 9 to line 25, some of which are nested condition statements. Generating inputs that can reach the code behind these checks is known to be difficult. However, the code for these checks does not contain any vulnerability, whereas the function func_v called in line 27 has a buffer overflow vulnerability.

For the example in Figure 1, most state-of-the-art coverage-based fuzzers do not work well. For instance, AFL [9] is a security-oriented fuzzer that employs an evolutionary algorithm to generate inputs that can trigger new internal states in the targeted binary. More specifically, if an input discovers a new path, it is selected as a seed input. Then, AFL generates new inputs by mutating the seed inputs. AFL simply regards any input that discovers a new path as a good input. Once a path is difficult to explore, AFL gets "stuck" and chooses another path that is easier to reach. In this example, AFL will spend a lot of time trying to generate inputs that can bypass these checks, and will easily get "stuck". Eventually, AFL may find new paths or crashes when it reaches the function func_v. Nevertheless, the previous effort is barely useful in this case. Another state-of-the-art fuzzer is VUzzer [57], an application-aware evolutionary fuzzer. VUzzer is designed to pay more attention to generating inputs that can arrive at deeper paths. For this example, VUzzer will give more weight to fuzzing the code nested in the condition statements. However, code on deeper paths does not necessarily have a higher probability of being vulnerable. Therefore, the fuzzing strategy of VUzzer does not work well either.
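The new-path criterion described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not AFL's actual implementation: the function name `select_new_path_seeds` and the trace format (a tuple of basic-block identifiers per execution) are assumptions made for this example.

```python
def select_new_path_seeds(executions):
    """Keep every input whose execution path has not been seen before.

    `executions` is a list of (input_bytes, path) pairs, where `path` is a
    tuple of basic-block identifiers (a hypothetical trace format).
    """
    seen_paths = set()
    seeds = []
    for data, path in executions:
        if path not in seen_paths:
            # A new path is the only criterion: nothing here asks whether
            # the newly covered code is likely to be vulnerable.
            seen_paths.add(path)
            seeds.append(data)
    return seeds
```

Every new path counts equally under this criterion, which is the behavior the motivating example argues against.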

For this example, the most important part of the program, which fuzzers should address first, is the function func_v. If a static analysis tool or vulnerability detection model can warn that func_v may have vulnerabilities, a fuzzer can give more weight to generating inputs that tend to arrive at func_v. This will make the fuzzer more efficient in discovering bugs.

Based on the above discussion, we propose V-Fuzz, which aims to find bugs quickly in a limited time. V-Fuzz leverages an evolutionary algorithm to generate inputs that tend to arrive at the vulnerable code of the program, assisted by vulnerability prediction information. It needs to be emphasized that V-Fuzz is neither coverage-based nor directed. V-Fuzz differs from most coverage-based fuzzers, which regard all code as equal, since it pays more attention to the code that has a higher probability of being vulnerable. In addition, unlike directed fuzzers such as AFLGo [53], which generate inputs with the sole objective of reaching a given set of target program locations, V-Fuzz still assigns relatively small weights to code that is unlikely to be vulnerable. This is because the vulnerability prediction model may not be fully accurate, and components that are predicted to be safe may still be vulnerable. Therefore, V-Fuzz combines the advantages of vulnerability prediction and evolutionary fuzzing while reducing their disadvantages.

2.2 Binary Vulnerability Prediction

In order to implement a vulnerability-oriented fuzzer, a vulnerability prediction module is expected, which performs a pre-analysis of which components are more likely to be vulnerable. There are two main approaches for vulnerability prediction. One is using traditional static vulnerability detection methods. The other is leveraging machine learning or deep learning techniques. Among the two approaches, we choose to leverage deep learning to build the vulnerability prediction model due to the following reasons.

First, most traditional static analysis methods use pattern-based approaches to detect vulnerabilities. The patterns are manually defined by security experts, which is difficult, tedious and time-consuming. For example, there are many open source vulnerability detection tools such as ITS4 [23], and commercial tools such as Checkmarx [12]. These tools often have high false positive or false negative rates [65]. In addition, these tools detect vulnerabilities in source code. As V-Fuzz focuses on fuzzing binary programs, these tools are not suitable for V-Fuzz.

Second, deep learning has been applied successfully to defect prediction [38] [47], program analysis [39] [40] [56] and vulnerability detection [65]. In these applications, deep learning has several advantages over pattern-based methods: (1) Deep learning methods do not need experts to define the features or patterns, which greatly reduces overhead. (2) Even for a single type of vulnerability, it is hard to define an accurate pattern that describes all of its forms, whereas deep learning methods learn from examples. (3) Pattern-based methods can usually detect only one specific type of vulnerability, while deep learning methods have been shown to detect several types of vulnerabilities simultaneously [65]. Therefore, we choose to leverage deep learning methods to build our vulnerability prediction model.

Moreover, it is worth noting that, to the best of our knowledge, no existing approach leverages deep learning to detect or predict vulnerabilities in binary programs. In order to build our prediction model, several questions need to be considered.

How to represent a binary program? It is inappropriate to analyze binary code directly, as the binary itself does not carry sufficient syntactic or semantic information. Therefore, the binary needs to be transformed into some intermediate form, such as assembly language, that carries enough meaningful characteristic information. Following this idea, we choose to analyze the Control Flow Graph (CFG) of a binary program, as the CFG contains rich semantic and structural information. In addition, in order to conveniently train a deep learning model, the binary program needs to be represented as numerical vectors. Therefore, we leverage the Attributed Control Flow Graph (ACFG) [42] to describe and represent a binary program with numerical vectors.

Which granularity is suitable for analysis? Granularity is also an important factor. The possible granularities of a binary program include file, function, basic block, etc. A proper granularity can be neither too coarse nor too fine: too coarse a granularity decreases the precision of the model, while too fine a granularity makes it hard to collect sufficient labeled data for training a meaningful model. Therefore, we seek a tradeoff and choose the function as the granularity of analysis.

Which neural network model is appropriate for the vulnerability prediction problem? One important advantage of neural networks is that they can learn features automatically. Since we leverage the ACFG to represent a binary program, we build our model on a graph embedding network [44], which has been successfully applied to extract valid features from structural data. Moreover, Xu et al. [56] applied this approach to detect similar cross-platform binary code. Therefore, it should be feasible to leverage this model to predict the vulnerable components of a program. As the aim of our model is not to detect similar code, the graph embedding network must be adapted to make it suitable for our problem. The detailed description will be presented later.

3 V-Fuzz: System Overview

In this section, we introduce the main components and workflow of V-Fuzz. Figure 2 shows the architecture of V-Fuzz, which consists of two modules: a Neural Network-based Vulnerability Prediction Model and a Vulnerability-Oriented Evolutionary Fuzzer.

Fig. 2: The architecture of V-Fuzz.

Neural Network-based Vulnerability Prediction Model. For a binary program, the vulnerability prediction model predicts which components are more likely to be vulnerable. More specifically, a component is a function of the binary program, and the prediction is the probability of that function being vulnerable. We leverage deep learning to build this model, and the core structure of the model is a graph embedding network. The detailed structure of the model is shown in Section 4. In order to enable the model to predict vulnerabilities, we train it with labeled data (the label is "vulnerable" or "secure"). In addition, the model is able to predict several types of vulnerabilities when it is trained with sufficient data related to these vulnerabilities. Then, the prediction result is sent to the fuzzer to assist it in finding bugs.

Vulnerability-Oriented Evolutionary Fuzzer. Based on the vulnerability prediction result, the fuzzer assigns more weight to the functions that have higher vulnerable probabilities. For convenience, we use "vulnerable probability" to denote "the probability of being vulnerable" in this paper. The process is as follows: for each function of the binary program with a vulnerable probability, V-Fuzz gives each basic block in the function a Static Vulnerable Score (SVS), which represents the importance of the basic block. The detailed scoring method is described in Section 5. Then, V-Fuzz starts to fuzz the program with some initial inputs provided by users. It leverages an evolutionary algorithm to generate proper inputs. For each executed input, V-Fuzz computes a fitness score, which is the sum of the SVS of all the basic blocks on its execution path. Then, the inputs that have higher fitness scores or cause crashes are selected as new seed inputs. Finally, new inputs are continuously generated by mutating the seed inputs. In this way, V-Fuzz tends to generate inputs that are more likely to arrive at the vulnerable regions.

4 Vulnerability Prediction

4.1 Problem Formalization

In this subsection, we formalize the vulnerability prediction problem. We denote the vulnerability prediction model as $M$. Given a binary program $p$, suppose it has $\tau$ functions $F = \{f_1, f_2, \ldots, f_\tau\}$. Any function $f_i \in F$ is an input of $M$, and the corresponding output $PV_{f_i}$ denotes the vulnerable probability of $f_i$, i.e.,

$$PV_{f_i} = M(f_i) \qquad (1)$$

A function that has a high vulnerable probability should receive more attention when fuzzing the program. In order to build such an $M$, there are three aspects to be considered: the representation of input data (i.e., the data preprocessing approach), the model structure, and how to train and use the model. We introduce the details of the three aspects below.

4.2 Data Preprocessing

As discussed in Section 2, to build and train the prediction model $M$, we need a method to transform binary program functions into numerical vectors. Moreover, the vectors should carry enough information for training. To this end, we choose to leverage the Attributed Control Flow Graph (ACFG) [42] to represent a binary function.

An ACFG is a directed graph $g = \langle V, E, \phi \rangle$, where $V$ is the set of vertices, $E$ is the set of edges, and $\phi$ is a mapping function. In an ACFG, a vertex represents a basic block, an edge represents the connection between two basic blocks, and $\phi: V \to \Sigma$ maps a basic block in $V$ to a set of attributes $\Sigma$.

It is common to use the Control Flow Graph (CFG) to find bugs [41] [46]. However, a CFG is not a numerical vector, which means it cannot be used to train a deep learning model directly. Fortunately, an ACFG is another form of the CFG that describes it with a number of basic-block level attributes. In an ACFG, each basic block is represented by a numerical vector, where each dimension of the vector denotes the value of a specific attribute. In this way, the whole binary function can be represented as a set of vectors. Therefore, the ACFG meets our requirements for representing a binary function.

Fig. 3: The Workflow of data preprocessing.
Type | Attributes | Number of attributes
Instructions | The num of call instruction | 244
Operand | The num of void operand | 8
 | The num of general register operand |
 | The num of direct memory reference operand |
 | The num of operand that consists of a base register or an index register |
 | The num of operand that consists of registers and displacement value |
 | The num of immediate operand |
 | The num of operand that is accessing immediate far addresses |
 | The num of operand that is accessing immediate near addresses |
Other | The num of string "malloc" | 3
 | The num of string "calloc" |
 | The num of string "free" |
All attributes | | 255
TABLE I: The used attributes of basic blocks.

Now, we show how to vectorize a binary program. Figure 3 shows the workflow of data preprocessing. First, we disassemble the binary program to get the CFGs of its functions. Then, we extract attributes for basic blocks and transform each basic block into a numerical vector. The attributes are used to characterize a basic block, and they can be statistical, semantic and structural. Here, we extract only the statistical attributes, for the following reasons. The first reason is efficiency: as indicated in [42], extracting semantic features such as I/O pairs of basic blocks is too expensive. Second, the graph embedding network can learn the structural attributes automatically. We extract 255 attributes in total. Table I summarizes the 255 attributes, and all the instruction type-related attributes can be found in Section 5.1 of [20].

There are mainly three kinds of attributes: instruction-related attributes, operand-related attributes and string-related attributes. Then, each basic block can be represented by a 255-dimensional vector, and the binary program now is represented by a set of 255-dimensional vectors.
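As a rough illustration of this vectorization step, the following Python sketch counts a small subset of statistical attributes for one basic block. It is an assumption-laden toy: the instruction format (a mnemonic plus a list of `(operand_type, text)` pairs), the operand type names `reg`, `mem`, `imm`, and the 7-dimensional layout are hypothetical stand-ins for the 255 attributes used in the paper.

```python
# Hypothetical reduced attribute layout:
# [calls, reg operands, mem operands, imm operands, "malloc", "calloc", "free"]
OPERAND_TYPES = ("reg", "mem", "imm")
STRINGS = ("malloc", "calloc", "free")


def block_to_vector(instructions):
    """Turn one basic block into a vector of attribute counts.

    `instructions` is a list of (mnemonic, operands) pairs, where `operands`
    is a list of (operand_type, text) pairs produced by some disassembler
    front end (assumed, not specified by the source).
    """
    vec = [0] * (1 + len(OPERAND_TYPES) + len(STRINGS))
    for mnemonic, operands in instructions:
        if mnemonic == "call":
            vec[0] += 1  # instruction-related attribute
        for op_type, text in operands:
            if op_type in OPERAND_TYPES:
                vec[1 + OPERAND_TYPES.index(op_type)] += 1  # operand-related
            for k, name in enumerate(STRINGS):
                if name in text:
                    vec[1 + len(OPERAND_TYPES) + k] += 1  # string-related
    return vec
```

Applying `block_to_vector` to every basic block of a function yields the per-vertex attribute vectors of that function's ACFG.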

4.3 Model Structure

Based on the discussion in Section 2, we choose to adapt Graph Embedding Network [44] [56] as the core of our vulnerability prediction model. Firstly, we give a brief introduction of the graph embedding network. Then, we detail the design of our model.

Fig. 4: The structure of the vulnerability prediction model. $p_v$ is the output of the model, which represents the vulnerable probability of a binary function.

Graph embedding is an efficient approach to solving graph-related problems [67] such as node classification and recommendation. It transforms a graph into an embedding vector that contains sufficient information about the graph for solving the corresponding problem. In our scenario, the embedding vector of a binary function should contain sufficient features for vulnerability prediction. In addition, graph embedding can be considered as a mapping $\lambda$, which maps a function's ACFG $g$ into a vector $\lambda(g)$.

We leverage a neural network to approximate the mapping $\lambda$, and train the model with vulnerable and secure binary functions to enable the graph embedding network to learn the features related to vulnerabilities. As the vulnerability prediction model is required to output the vulnerable probability of a binary function, we combine the graph embedding network with a pooling layer and a softmax layer. The pooling layer transforms the embedding vector into a 2-dimensional vector $Z$, and the softmax layer maps the 2-dimensional vector $Z$ of arbitrary real values into another 2-dimensional vector $Q$, where the value of each dimension is in the range $[0, 1]$. The first dimension represents the vulnerable probability, denoted by $p_v$. The second dimension represents the secure probability, whose value is naturally $1 - p_v$. The whole model is trained end-to-end with labeled data, and the parameters of the model are learned by minimizing a loss function.

Notation | Meaning
$a$ | The number of attributes of a basic block
$g$ | The ACFG of a binary function
$V$ | The set of vertices
$E$ | The set of edges
$v$ | A vertex in $V$
$x_v$ | The attribute vector of vertex $v$
$d$ | The dimension of an embedding vector
$\mu_v$ | The embedding vector of vertex $v$
$\mu_g$ | The graph embedding vector of ACFG $g$
TABLE II: Notations.

Below is the formalization of the model. Table II lists all the notations related to the model, and the structure of the model is presented in Figure 4. The input of the model is the ACFG of a binary program function, $g = \langle V, E, \phi \rangle$. Each basic block $v \in V$ has an attribute vector $x_v$, which is constructed from all selected attributes. The number of attributes for each basic block is $a$; thus, $x_v$ is an $a$-dimensional vector. For each basic block $v$, the graph embedding network computes an embedding vector $\mu_v$, which incorporates the topology information of the graph. The dimension of the embedding vector $\mu_v$ is $d$. Let $N(v)$ be the set of neighboring vertices of $v$. Since the ACFG is a directed graph, $N(v)$ can be considered as the set of precursor vertices of $v$. Then the embedding vector $\mu_v$ can be computed by $\mu_v = F\big(x_v, \sum_{j \in N(v)} \mu_j\big)$, where $F$ is a non-linear function such as $\tanh$. The embedding vector $\mu_v$ is computed over $T$ iterations. In each iteration $t$, the temporary embedding vector is obtained by $\mu_v^{(t)} = F\big(W_1 x_v + \sigma\big(\sum_{j \in N(v)} \mu_j^{(t-1)}\big)\big)$, where $x_v$ is an $a \times 1$ vector and $W_1$ is a $d \times a$ matrix. The initial embedding vector $\mu_v^{(0)}$ is set to zero. Here, $\sigma$ is an $n$-layer fully-connected neural network with parameters $P_1, \ldots, P_n$. Let $\mathrm{ReLU}(\cdot) = \max\{0, \cdot\}$ be a rectified linear unit. We have $\sigma(x) = P_1 \, \mathrm{ReLU}(P_2 \cdots \mathrm{ReLU}(P_n x))$. After $T$ iterations, we obtain the final embedding vector $\mu_v^{(T)}$ for each vertex $v \in V$. Then the graph embedding vector $\mu_g$ of the ACFG $g$ is given by the summation of the embedding vectors of all basic blocks, i.e., $\mu_g = W_2 \big(\sum_{v \in V} \mu_v^{(T)}\big)$, where $W_2$ is a $d \times d$ matrix. To compute the vulnerable probability of the function, we map the graph embedding vector into a 2-dimensional vector $Z = (z_0, z_1)$, i.e.,

$$Z = W_3 \, \mu_g \qquad (2)$$

where $W_3$ is a $2 \times d$ matrix. Then, we use a softmax function to map the values of $Z$ into the vector $Q = (p_v, 1 - p_v)$, $p_v \in [0, 1]$, i.e.,

$$Q = \mathrm{Softmax}(Z) \qquad (3)$$

The value of $p_v$ is the output of the model, which represents the vulnerable probability of the binary program function $g$.
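The forward pass described above (iterative vertex embedding over predecessor vertices, summation into a graph vector, projection to two dimensions, softmax) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function and parameter names (`predict`, `w1`, `p_layers`, `w2`, `w3`) are hypothetical, the non-linear function is assumed to be tanh, and the weights are passed in as nested lists rather than learned.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(z):
    """Numerically stable softmax over a list of real values."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(attrs, preds, w1, p_layers, w2, w3, iterations):
    """Return the vulnerable probability of one function.

    attrs: {vertex: attribute vector}; preds: {vertex: predecessor list}.
    w1 is d x a, each matrix in p_layers is d x d, w2 is d x d, w3 is 2 x d.
    """
    d = len(w1)
    emb = {v: [0.0] * d for v in attrs}  # initial embeddings are zero
    for _ in range(iterations):
        new_emb = {}
        for v in attrs:
            # Sum the previous-iteration embeddings of the predecessors.
            agg = [0.0] * d
            for u in preds.get(v, []):
                agg = [a + b for a, b in zip(agg, emb[u])]
            # Fully connected layers: the last matrix in p_layers is applied
            # first, with ReLU after every layer except the outermost one.
            h = agg
            for k, layer in enumerate(reversed(p_layers)):
                h = matvec(layer, h)
                if k < len(p_layers) - 1:
                    h = [max(0.0, x) for x in h]
            proj = matvec(w1, attrs[v])
            new_emb[v] = [math.tanh(a + b) for a, b in zip(proj, h)]
        emb = new_emb
    # Graph embedding: transformed sum of all vertex embeddings.
    total = [0.0] * d
    for e in emb.values():
        total = [a + b for a, b in zip(total, e)]
    z = matvec(w3, matvec(w2, total))
    return softmax(z)[0]  # first dimension: vulnerable probability
```

With an all-zero projection matrix `w3`, both logits are zero and the sketch returns 0.5, which is a convenient sanity check before plugging in trained weights.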

4.4 Train and Use the Model

In order to predict the vulnerable probability, the model needs to be trained with labeled data, where the label is either "vulnerable" or "secure". For the ACFG $g$ of a function, the label $l$ of $g$ is 0 or 1, where $l = 1$ means the function has at least one vulnerability, and $l = 0$ means the function is secure.

The model's training process is similar to that of a classification model. The parameters of $M$ can be learned by optimizing the following equation:

$$\min_{W_1, P_1, \ldots, P_n, W_2, W_3} \sum_{i=1}^{m} H(Q_i, l_i) \qquad (4)$$

where $m$ is the number of training samples, $Q_i$ and $l_i$ are the model output and the label of the $i$-th sample, and $H$ is a cross-entropy loss function. We optimize Equation (4) with a stochastic gradient descent method. The vulnerability prediction model can be trained in this way.
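The training objective sums a cross-entropy loss over the labeled functions. A minimal Python sketch of that loss is given below; the names `cross_entropy` and `total_loss` are illustrative, the label convention (1 for vulnerable, 0 for secure) is an assumption, and the gradient descent step itself is omitted.

```python
import math


def cross_entropy(p_vulnerable, label):
    """Cross-entropy between the predicted vulnerable probability and a label.

    Assumed convention: label 1 means vulnerable, label 0 means secure.
    """
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p_vulnerable, eps), 1.0 - eps)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def total_loss(predictions, labels):
    """Sum of per-sample losses, i.e., the quantity training minimizes."""
    return sum(cross_entropy(p, l) for p, l in zip(predictions, labels))
```

Predictions that agree with the labels give a smaller total than predictions that contradict them, which is what drives the parameter updates.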

Although the model’s training process is similar to training a classification model, when using the model, the classification result that whether a binary function is vulnerable or not is too coarse-grained and not suitable for the subsequent fuzzing. Therefore, we choose to use the vulnerable probability as the output of the model.

5 Vulnerability-Oriented Fuzzing

Based on the result from the prediction model, the vulnerability-oriented fuzzer will pay more attention to the functions with high probabilities. Figure 5 shows the workflow of vulnerability-oriented fuzzing, where V-Fuzz leverages an evolutionary algorithm to generate inputs which tend to arrive at the vulnerable components. Specifically, for a binary program, V-Fuzz uses the data process module to disassemble the binary to get the ACFG of each function, which is the input of the vulnerability prediction model. Then, the prediction model will give each binary program function a Vulnerability Prediction (VP). Based on the VP result, each basic block in the program is given a Static Vulnerable Score (SVS), which will be used later to evaluate the executed test cases.

Fig. 5: Vulnerability-Oriented Fuzzing. DBI: Dynamic Binary Instrumentation, SVS: Static Vulnerable Score, and VP: Vulnerability Prediction.

The fuzzing test is a cyclic process. Like most mutation-based evolutionary fuzzers, V-Fuzz maintains a seed pool, which is used to save high-quality inputs as seeds. V-Fuzz starts to execute the binary program with some initial inputs that are provided by users. Meanwhile, it uses Dynamic Binary Instrumentation (DBI) to track the execution information of the program such as basic block coverage. Based on SVS and the execution information, V-Fuzz will calculate a fitness score for each executed testcase. The testcases with high fitness scores are considered as high-quality inputs, and will be sent to the seed pool. In addition, the executed testcases which trigger crashes will also be sent to the seed pool, regardless of their fitness scores. The detailed method for calculating fitness score will be given in Section 5.2. Next, V-Fuzz generates the next generation testcases by mutating the seeds in the seed pool. In this way, V-Fuzz continues to execute the program with new generated inputs until the end conditions are met. Below, we elaborate the workflow of vulnerability-oriented fuzzing.

5.1 Static Vulnerable Score

Based on the VP result, V-Fuzz gives each basic block a Static Vulnerable Score (SVS). For a function $f$, we assume its vulnerable probability is $p_v$ and that it has $k$ basic blocks $b_1, b_2, \ldots, b_k$. For each $b_i \in f$, its SVS, denoted by $SVS(b_i)$, is calculated by the following equation:

$$SVS(b_i) = \kappa \cdot p_v + \omega \qquad (5)$$

where $\kappa$ and $\omega$ are constant parameters obtained from fuzzing experiments. Hence, the basic blocks that belong to the same function have the same SVS values. To choose $\kappa$, we fuzz 64 Linux programs (e.g., some programs in binutils), 20 of which crash. We then fuzz these 20 programs individually over a range of values of $\kappa$, identify the interval in which the fuzzing test performs best, and set the default value of $\kappa$ within that interval. The parameter $\omega$ prevents $SVS(b_i) = 0$ when a function has a very low vulnerable probability ($p_v \to 0$). If $SVS(b_i) = 0$, the basic block $b_i$ carries no weight in fuzzing and becomes trivial, which is against the design principle of V-Fuzz. Therefore, we set $\omega$ to a small positive value, which keeps $SVS(b_i) > 0$ at all times.

With this approach to calculating the SVS, V-Fuzz assigns more weight to the basic blocks that are more likely to be vulnerable, which in turn helps the fuzzer generate inputs that are more likely to cover these basic blocks.
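The scoring step can be sketched as follows. The sketch assumes a linear form, a scale constant times the function's vulnerable probability plus a small positive offset, consistent with the description of two constants of which the second keeps the score above zero. The function name and the concrete constants in the usage note are illustrative and are not the values used in the paper.

```python
def static_vulnerable_scores(function_blocks, vulnerable_prob, scale, offset):
    """Assign every basic block the score of the function it belongs to.

    function_blocks: {function name: list of basic-block ids}
    vulnerable_prob: {function name: predicted vulnerable probability}
    scale, offset:   hypothetical constants; offset > 0 keeps scores positive
    """
    scores = {}
    for func, blocks in function_blocks.items():
        score = scale * vulnerable_prob[func] + offset
        for block in blocks:
            scores[block] = score  # same score for all blocks of a function
    return scores
```

For example, with `scale=10.0` and `offset=0.5`, the blocks of a function predicted at 0.9 receive 9.5, while the blocks of a function predicted at 0.0 still receive 0.5 and therefore keep a small weight.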

5.2 Seed Selection Strategy

Algorithm 1 shows the seed selection strategy of V-Fuzz. Specifically, V-Fuzz leverages an evolutionary algorithm to select seeds which are more likely to arrive at the vulnerable components.

After giving every basic block an SVS, V-Fuzz enters the fuzzing loop. In each iteration, V-Fuzz monitors the program for exceptions such as crashes. If an input causes a crash, the input is added to the seed pool. Once an execution has completed, V-Fuzz records the execution path of the input. The fitness score of the input is the sum of the SVS values of the basic blocks on the execution path. Figure 6 shows an example of fitness score calculation. Assume there are two inputs $i_1$ and $i_2$ in this generation, whose execution paths are $path_1$ and $path_2$, respectively, each a sequence of basic blocks. The fitness scores $f_1$ and $f_2$ of inputs $i_1$ and $i_2$ are the sums of the SVS values of the basic blocks on $path_1$ and $path_2$, respectively. The input with the larger fitness score is selected as a seed. Note that any input that causes a crash is sent to the seed pool, no matter how low its fitness score is.

In this way, V-Fuzz not only utilizes the information of the vulnerability prediction model, but also considers the actual situation. Therefore, V-Fuzz can mitigate the potential weakness of the vulnerability prediction model.

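Input: the binary program; the set of all basic blocks of the program; the set of initial inputs; the seed pool; the set of testcases
  Initialize the set of testcases with the initial inputs
  while in the fuzzing loop do
     for each testcase in the set of testcases do
         Execute the program with the testcase and record its execution path
         Compute the fitness score of the testcase as the sum of the SVS values of the basic blocks on the path
         if the testcase causes a crash then
            Add the testcase to the seed pool
         end if
     end for
     Select the testcases with the highest fitness scores
     Add the selected testcases to the seed pool
     Generate the next set of testcases by mutating the seeds
  end while
Algorithm 1 Seed Selection Algorithm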
Fig. 6: An example for fitness score calculation.
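
The fitness calculation and seed selection described in this subsection can be sketched in Python as follows. This is only a minimal illustration under our own assumptions, not the code of V-Fuzz: the function names, the `keep` parameter, and the SVS values and execution paths in the example are all hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

def fitness(path: Sequence[str], svs: Dict[str, float]) -> float:
    """Sum the SVS of every basic block on one execution path."""
    return sum(svs[block] for block in path)

def select_seeds(
    runs: List[Tuple[str, Sequence[str], bool]],
    svs: Dict[str, float],
    keep: int = 1,
) -> Set[str]:
    """Pick the seeds of one generation.

    Each run is (input_id, execution_path, crashed). The `keep` inputs with
    the highest fitness become seeds, and every crashing input is kept
    regardless of its fitness score.
    """
    ranked = sorted(runs, key=lambda run: fitness(run[1], svs), reverse=True)
    seeds = {input_id for input_id, _, _ in ranked[:keep]}
    seeds |= {input_id for input_id, _, crashed in runs if crashed}
    return seeds

# Hypothetical SVS values and execution paths, for illustration only.
svs = {"b1": 2.0, "b2": 1.0, "b3": 5.0, "b4": 0.5}
runs = [
    ("i1", ["b1", "b2", "b4"], False),  # fitness 3.5
    ("i2", ["b1", "b3"], False),        # fitness 7.0
    ("i3", ["b4"], True),               # fitness 0.5, but it crashed
]
chosen = select_seeds(runs, svs, keep=1)
```

In this example, `i2` is selected because it has the highest fitness score, and `i3` is kept only because it caused a crash.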

5.3 Mutation Strategy

V-Fuzz is a mutation-based evolutionary fuzzing system, which generates new inputs by mutating the seeds. As in most mutation-based fuzzers, the mutation operations include flipping bits/bytes, inserting "interesting" bits/bytes, changing some bits/bytes, and selecting bytes from several seeds and splicing them together.
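
To make these operations concrete, the following Python sketch shows three of them: a bit flip, the insertion of an "interesting" byte, and splicing. The function names and the contents of `INTERESTING` are our own assumptions for illustration; they are not taken from the V-Fuzz or VUzzer implementations.

```python
import random

# Boundary values often used as "interesting" bytes (an assumed list).
INTERESTING = [0x00, 0x7F, 0x80, 0xFF]

def flip_bit(data: bytes, rng: random.Random) -> bytes:
    """Flip one randomly chosen bit of the input."""
    if not data:
        return data
    buf = bytearray(data)
    pos = rng.randrange(len(buf) * 8)
    buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)

def insert_interesting(data: bytes, rng: random.Random) -> bytes:
    """Insert one "interesting" byte at a random position."""
    buf = bytearray(data)
    buf.insert(rng.randrange(len(buf) + 1), rng.choice(INTERESTING))
    return bytes(buf)

def splice(first: bytes, second: bytes, rng: random.Random) -> bytes:
    """Join a prefix of one seed with a suffix of another seed."""
    cut_first = rng.randrange(len(first) + 1)
    cut_second = rng.randrange(len(second) + 1)
    return first[:cut_first] + second[cut_second:]

rng = random.Random(0)
flipped = flip_bit(b"abx", rng)
grown = insert_interesting(b"abx", rng)
spliced = splice(b"abx", b"xyz", rng)
```

A fixed random seed is used here only to make the sketch reproducible; a real fuzzer would draw fresh randomness for every mutation.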

The design of the mutation strategy is very important, as an appropriate strategy can help the fuzzer generate good inputs that find new paths or crashes. For example, Figure 7 gives the CFG of a simple program. Assume that one seed string covers some basic blocks of the program, while another seed string covers no basic block of the program. The new inputs mutated from the first seed are clearly more likely to cover new basic blocks than those mutated from the second.

It is worth noting that if we want to get "ab*" by mutating "abx", the mutation must be slight, i.e., it must change only a small part of the original seed. However, if the fuzzer has spent too much time on slight mutation without making progress, it should change its mutation strategy and pay more attention to other paths. In this example, if the fuzzer gets "stuck" for a long time while performing slight mutation on "abx", it would be better to choose heavy mutation, which changes more of the original seed and may help the fuzzer find basic blocks on other paths. Therefore, the fuzzer should dynamically adjust its mutation strategy according to the actual fuzzing state.

We classify the mutation strategies into slight mutation and heavy mutation. To help the fuzzer choose between them, we define the Crash Window (CW), a threshold on the number of generations whose inputs have not found any new path or crash. If this number exceeds CW, the fuzzer should select the heavy mutation strategy. Furthermore, we propose the Crash Window Jump Algorithm to adjust the value of CW adaptively.

The main idea of the Crash Window Jump Algorithm is as follows. CW has an initial value, a maximum value and a minimum value. The value of CW starts from its initial value, and the fuzzer selects slight mutation as its initial mutation strategy. During the fuzzing process, if the number of generations without a new path or crash exceeds CW, the fuzzer changes its mutation strategy to heavy mutation and doubles the value of CW. Once an input finds a new path or a crash, this number is reset to zero and CW is set to half of its former value. Algorithm 2 shows the pseudo-code of the Crash Window Jump Algorithm.

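Input: the binary program; the initial Crash Window; the maximum Crash Window; the minimum Crash Window; the current Crash Window (CW); the seed pool; the set of testcases; the mutation strategy; the number of generations without a new path or crash
  Set CW to the initial Crash Window
  Set the mutation strategy to slight mutation
  Set the number of generations without a new path or crash to 0
  while in the fuzzing loop do
     for each testcase in the set of testcases do
         Execute the program with the testcase
         if find crash then
            Record that a crash was found
         end if
         if find new basic block then
            Record that a new path was found
         end if
     end for
     if no crash or new path was found in this generation then
         Increase the number of generations without a new path or crash by 1
         if this number exceeds CW then
            Set the mutation strategy to heavy mutation and double CW
            if CW exceeds the maximum Crash Window then
               Set CW to the maximum Crash Window
            end if
         end if
     else
         Reset the number of generations without a new path or crash to 0
         Set CW to half of its former value
         if CW is below the minimum Crash Window then
            Set CW to the minimum Crash Window
         end if
     end if
     Select seeds from the testcases and add them to the seed pool
     Generate new testcases by mutating the seeds with the current mutation strategy
  end while
Algorithm 2 Crash Window Jump Algorithm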
Fig. 7: A simple CFG.
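
The Crash Window bookkeeping described above can be sketched as a small Python class. This is a hedged illustration rather than the actual implementation: the class and attribute names are hypothetical, integer halving is assumed, and returning to slight mutation once progress is made is our own assumption, since the text only states when heavy mutation is selected.

```python
class CrashWindow:
    """Track the Crash Window (CW) and choose a mutation strategy."""

    def __init__(self, initial: int, minimum: int, maximum: int) -> None:
        self.minimum = minimum
        self.maximum = maximum
        self.cw = initial
        self.stalled = 0          # generations without a new path or crash
        self.strategy = "slight"  # initial mutation strategy

    def update(self, progress: bool) -> str:
        """Call once per generation; `progress` means a new path or crash."""
        if progress:
            self.stalled = 0
            # Halve CW, but never go below the minimum.
            self.cw = max(self.minimum, self.cw // 2)
            self.strategy = "slight"  # assumption, see the lead-in
        else:
            self.stalled += 1
            if self.stalled > self.cw:
                self.strategy = "heavy"
                # Double CW, but never exceed the maximum.
                self.cw = min(self.maximum, self.cw * 2)
        return self.strategy

window = CrashWindow(initial=2, minimum=1, maximum=8)
history = [window.update(progress=False) for _ in range(3)]
```

With an initial CW of 2, the third generation without progress pushes the counter past CW, so the strategy switches to heavy mutation and CW doubles to 4.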

6 Implementation and Settings

6.1 Vulnerability Prediction

The vulnerability prediction module consists of two main components: the ACFG extractor and the vulnerability prediction model. We implement the ACFG extractor as a plug-in for the well-known disassembly tool IDA Pro [14]. We implement the vulnerability prediction model based on PyTorch [15], which is a popular deep learning framework.

We train the vulnerability prediction model on a server which is equipped with two Intel Xeon E5-2640v4 CPUs (40 cores in total) running at 2.40GHz, 4 TB HDD, 64 GB memory, and one GeForce GTX 1080 TI GPU card.

6.2 Vulnerability-Oriented Fuzzing

For vulnerability-oriented fuzzing, we implement the fuzzer based on VUzzer [57], an application-aware evolutionary fuzzer which focuses on fuzzing binary programs.

We conduct the fuzzing tests on a virtual machine with Ubuntu 14.04 LTS. The virtual machine is configured with a 32-bit single-core 4.2 GHz Intel CPU and 4 GB RAM. During the fuzzing tests, we observe that the fuzzing process takes less than 1 GB of memory.

7 Evaluation

In this section, we evaluate the performance of V-Fuzz. Since V-Fuzz consists of two components, we will present the evaluation results in two parts: the vulnerability prediction and the vulnerability-oriented fuzzing.

7.1 Vulnerability Prediction Evaluation

7.1.1 Data Selection

The dataset used for training and testing the vulnerability prediction model is published by the National Institute of Standards and Technology (NIST) [13], and it has been widely used in vulnerability-related work [65] [69]. We use the code of Juliet Test Suite v1.3 [16] as our training and testing data. It is a collection of test cases in the C/C++ language, and each function in this dataset is labeled as "good" or "bad". A function labeled as "good" does not have flaws, while one labeled as "bad" has at least one flaw. Each "bad" example has a Common Weakness Enumeration IDentifier (CWE ID). Juliet Test Suite v1.3 has examples of 118 different CWEs in total. As fuzzing is suitable for discovering bugs related to memory, we select the CWE samples related to memory errors from Juliet Test Suite v1.3, which are shown in Table III.

As Table III shows, we collect 111,540 labeled function samples in total, which include 78,511 secure samples and 33,029 vulnerable samples. The top three types of CWEs are Integer Overflow (26,982 samples), Heap Based Buffer Overflow (18,522 samples) and Stack Based Buffer Overflow (15,403 samples). These three CWE types account for more than half of all the data (54.6%). In addition, these 3 types of vulnerabilities are the most common ones in the real world. Thus, the distribution of our selected dataset is similar to the distribution of real-world vulnerabilities.

From the samples in Table III, we randomly select 40,000 samples as the training data TRAIN-DATA. Then, we randomly select another 4,000 samples as the testing data TEST-DATA, which has no overlap with TRAIN-DATA. Table IV presents the information of the 3 datasets.

| CWE | Type | #Secure | #Vulnerable | Total |
|---|---|---|---|---|
| 121 | Stack Based Buffer Overflow | 10,187 | 5,216 | 15,403 |
| 122 | Heap Based Buffer Overflow | 12,263 | 6,259 | 18,522 |
| 124 | Buffer Underwrite | 4,183 | 2,031 | 6,214 |
| 126 | Buffer Over Read | 3,019 | 1,376 | 4,395 |
| 127 | Buffer Under Read | 4,183 | 2,031 | 6,214 |
| 134 | Uncontrolled Format String | 6,833 | 2,357 | 9,190 |
| 190 | Integer Overflow | 20,187 | 6,795 | 26,982 |
| 401 | Memory Leak | 6,756 | 2,303 | 9,059 |
| 415 | Double Free | 4,204 | 1,448 | 5,652 |
| 416 | Use After Free | 1,760 | 470 | 2,230 |
| 590 | Free Memory Not On The Heap | 4,049 | 2,250 | 6,299 |
| 761 | Free Pointer Not At Start | 887 | 493 | 1,380 |
| Total | | 78,511 | 33,029 | 111,540 |

TABLE III: The Types of CWE.

| Dataset | #Secure | #Vulnerable | Total | Remarks |
|---|---|---|---|---|
| ALL-DATA | 78,511 | 33,029 | 111,540 | All data in Table III |
| TRAIN-DATA | 20,000 | 20,000 | 40,000 | Selected from ALL-DATA |
| TEST-DATA | 2,000 | 2,000 | 4,000 | Selected from ALL-DATA |

TABLE IV: Datasets.
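
A balanced, non-overlapping split like the one in Table IV could be drawn as in the following Python sketch. The function name `balanced_split` and the toy samples are hypothetical, and the exact sampling procedure is our own assumption, since the text only states that the samples are selected randomly without overlap.

```python
import random
from typing import List, Tuple

def balanced_split(
    vulnerable: List[str],
    secure: List[str],
    n_train_per_class: int,
    n_test_per_class: int,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Draw disjoint, class-balanced training and testing sets."""
    rng = random.Random(seed)
    needed = n_train_per_class + n_test_per_class
    # Sampling without replacement guarantees that no sample is reused.
    vul = rng.sample(vulnerable, needed)
    sec = rng.sample(secure, needed)
    train = vul[:n_train_per_class] + sec[:n_train_per_class]
    test = vul[n_train_per_class:] + sec[n_train_per_class:]
    return train, test

# Toy stand-ins for the labeled functions (hypothetical identifiers).
vulnerable = [f"bad_{i}" for i in range(50)]
secure = [f"good_{i}" for i in range(100)]
train, test = balanced_split(vulnerable, secure, 20, 2)
```

For the sizes in Table IV, the call would use 20,000 training samples and 2,000 testing samples per class.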

7.1.2 Pre Experiments

First, we conduct some pre-training experiments to determine the default parameters of the model. We use stochastic gradient descent as the optimization algorithm and set the learning rate equal to 0.0001. Based on the results of the pre-training experiments, we set the depth of the network as 5, the embedding size as 256, and the number of iterations as 3. We take the above setting as the default of our model. Then, we use TRAIN-DATA to train the vulnerability prediction model, and use TEST-DATA to test the model.

7.1.3 Evaluation Metrics

We use three metrics to evaluate the performance of the model: accuracy, recall and loss. The first two metrics reflect the model’s capability of predicting vulnerable functions. Additionally, we evaluate whether the model converges by observing the value of loss. Next, we show the method to calculate these three metrics and show the performance of the model using these metrics.

Suppose the testing set contains a given number of samples, some of which are labeled as "vulnerable" while the others are labeled as "secure". Based on the prediction outputs, we sort the testing samples in descending order of their predicted values (i.e., vulnerable probability). Then, we select the top-K testing samples. Among the top-K samples, some are labeled as "vulnerable" and the rest are labeled as "secure". The accuracy of the top-K testing data is the fraction of the top-K samples that are labeled as "vulnerable". When the threshold K equals the number of samples labeled as "vulnerable" in the testing set, the same fraction gives the recall. Finally, we leverage the cross-entropy loss function to calculate loss.
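
The two ranking-based metrics can be sketched in Python as follows. This reflects our reading of the definitions above and is not the evaluation code of the paper: the function names and the example scores are hypothetical, and labels are assumed to be encoded as 1 for "vulnerable" and 0 for "secure".

```python
from typing import Sequence

def top_k_accuracy(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of the k highest-scored samples that are labeled vulnerable."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / k

def recall(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Top-K accuracy with K set to the number of vulnerable samples."""
    return top_k_accuracy(scores, labels, k=sum(labels))

# Hypothetical predicted probabilities and ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [1, 0, 1, 1, 0]
accuracy_top_2 = top_k_accuracy(scores, labels, k=2)
recall_value = recall(scores, labels)
```

In this toy example, one of the top-2 samples is vulnerable, and two of the three vulnerable samples appear in the top 3.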

Figure 8 shows the performance of the model evaluated by the three metrics. Figure 8(a) presents the prediction accuracy when setting the threshold K to different values, from which we can observe that the accuracy of the model is high (greater than 80%). Figure 8(b) presents the prediction recall. From the figure, we can observe that the prediction recall rises from 53% to 66% during the training process. Its value increases quickly and becomes stable within 10 training epochs. Figure 8(c) presents the value of loss, from which we can observe that its value drops very quickly in the first 6 epochs and remains stable after that.

From the above results, we can see that this model is capable of predicting vulnerable functions.

(a) Accuracy.
(b) Recall.
(c) Loss.
Fig. 8: The performance of the model.

7.1.4 Hyperparameters Analysis

In this part, we analyze the hyperparameters: the depth, the embedding size and the number of iterations. Figure 9 shows the impact of different hyperparameters. We select the accuracy of top-K (K=600) as the indicator to show the effectiveness of different hyperparameters.

Embedding Size: Figure 9(a) shows the impact of different embedding sizes. Here, we test the impact by setting the embedding size to 128, 256, 512 and 1024 respectively. From the figure, we can observe that the accuracy is the highest when the embedding size is 1024, and the accuracy of the other embedding sizes is almost the same. However, as a large embedding size increases the overhead, we balance overhead and performance by choosing an embedding size of 256 instead of 1024.

Depth: Figure 9(b) shows the impact of different depths. Here, we test the depth at 3, 4, 5, 6, 7 and 8. From the figure, we can observe that the accuracy is the highest when the depth is 8. However, the accuracy does not have a strictly positive correlation with the depth. For instance, the accuracy is the lowest when the depth is 6. Since training a neural network model with more layers takes more time and computing resources, we choose to set the depth of our model to 5, which also has a high accuracy.

Iterations: The number of iterations T means that the embedding vector of one vertex contains the information of its T-hop neighborhood vertices. Figure 9(c) shows the impact of different numbers of iterations. We test the model with different values of T. From the figure, we can observe that the accuracy is the highest when T=3. We conjecture that the reason is that the size of our test samples is not very big. For a basic block, the information of its 3-hop neighborhood is enough for the model to learn the features of the graph. Therefore, we choose to set T=3 in our model. It is worth noting that the number of iterations can be adjusted to make it appropriate for specific training or testing datasets.

(a) Accuracy of different embedding sizes.
(b) Accuracy of different depths.
(c) Accuracy of different numbers of iterations.
Fig. 9: Impact of different hyperparameters.
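
The relation between the number of iterations and the T-hop neighborhood can be illustrated with a small Python sketch. It does not implement the embedding network; it only tracks, for each vertex, which vertices could have influenced it after a given number of neighbor-aggregation rounds. The function name, the chain-shaped graph and the use of an undirected adjacency list are our own assumptions.

```python
from typing import Dict, List, Set

def receptive_field(adj: Dict[str, List[str]], rounds: int) -> Dict[str, Set[str]]:
    """Vertices whose features can reach each vertex after `rounds` rounds."""
    reach = {vertex: {vertex} for vertex in adj}
    for _ in range(rounds):
        # One round: every vertex merges what its neighbors already hold.
        reach = {
            vertex: reach[vertex].union(*(reach[n] for n in adj[vertex]))
            for vertex in adj
        }
    return reach

# A hypothetical chain of five basic blocks: a - b - c - d - e.
adj = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
after_three = receptive_field(adj, rounds=3)
```

After three rounds, block `a` has received information from blocks up to three hops away (`b`, `c` and `d`), but not from `e`.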

7.1.5 Efficiency

We evaluate the efficiency of the vulnerability prediction from two aspects: (1) ACFG extraction time and (2) training time with different model structures.

ACFG Extraction Time. We evaluate the ACFG extraction time from two aspects: the number of functions in a binary and the size of the binary. We collect a large number of binary samples and measure their ACFG extraction time. Figure 10 shows the ACFG extraction time with different numbers of functions and file sizes. Moreover, we test the debugging binaries and the released binaries separately. From the figure, we have the following observations. (1) The extraction time of the debugging binaries is much less than that of the released binaries. The debugging binaries carry more information, such as symbol tables, which speeds up the disassembly analysis, while the released binaries do not have this information and may take more time to disassemble. (2) The extraction time has a positive linear correlation with the number of functions. (3) The extraction time has a positive linear correlation with the file size. (4) The ACFG extraction time is short. For most of the debugging binaries, the extraction time is within 2.5 seconds. For most of the released binaries, the extraction time is within 100 seconds. Thus, we can extract the ACFG of a binary program efficiently.

(a) The number of functions.
(b) File Size.
Fig. 10: ACFG extraction time.

Training Time. Figure 11 shows the training time with different parameters. For this model, the parameters that affect the training time the most are the depth and the embedding size. From this figure, we can observe that the training time has a positive correlation with both the depth and the embedding size. Table V presents the statistical results of the training time for 50 training epochs with different depths and embedding sizes. From the table, we can see that the average time for training the model for 1 epoch is about 20 minutes. As the model can converge within 10 epochs, we only need about 200 minutes to train a valid vulnerability prediction model, which is a modest one-time cost. In addition, as the vulnerability prediction model can be trained offline, it does not affect the time overhead of fuzzing.

| Depth | Embedding Size | Training Time for 50 epochs (min) | Average Training Time for 1 epoch (min) |
|---|---|---|---|
| 3 | 128 | 907.70 | 18.15 |
| 3 | 256 | 909.14 | 18.18 |
| 3 | 512 | 934.99 | 18.69 |
| 4 | 128 | 984.99 | 19.69 |
| 4 | 256 | 988.98 | 19.77 |
| 4 | 512 | 1,009.09 | 20.18 |
| 5 | 128 | 1,010.24 | 20.20 |
| 5 | 256 | 1,013.58 | 20.27 |
| 5 | 512 | 1,049.14 | 20.98 |

TABLE V: The training time with different parameters.
Fig. 11: Training time with different model parameters: depth and embedding size.

7.2 The Evaluation of Fuzzing

In this section, we evaluate the performance of V-Fuzz in fuzzing. To this end, we conduct a number of fuzzing tests on 13 different applications, as shown in Table VI. These applications are popular real-world open source Linux applications and fuzzing benchmarks. The real-world Linux applications include audio processing software (e.g., MP3Gain), PDF transformation tools (e.g., pdftotext) and an XPS document library (libgxps). The fuzzing benchmarks are three programs (uniq, base64 and who) of LAVA-M [45]. The reasons for selecting these applications are as follows: (1) The selected real-world Linux applications are very popular and widely used. (2) As these applications are open source, their security may have a significant impact on other applications that are developed based on them. (3) LAVA-M [45] is a well-known benchmark for evaluating the performance of vulnerability detection tools, and many advanced fuzzing studies [57] [63] [64] test their tools with the programs of LAVA-M. In addition, we compare V-Fuzz with several state-of-the-art fuzzers: VUzzer [57], AFL [9] and AFLFast [48].

It should be emphasized that all of the fuzzing experiments are conducted based on the following principles: (1) All the running environments are the same. Every single fuzzing test is conducted on a virtual machine equipped with a 32-bit single-core 4.2 GHz Intel CPU and 4 GB RAM, running Ubuntu 14.04 LTS. (2) All the initial inputs for fuzzing are the same. (3) The running time of all the fuzzing evaluations is the same. Therefore, all the fuzzers are compared under identical conditions.

There are two main aspects to evaluate a fuzzer's capability of finding bugs: unique crashes and identified vulnerabilities. We first demonstrate the performance of V-Fuzz from these two aspects. Then, we evaluate the code coverage.

7.2.1 Unique Crashes

The capability of finding unique crashes is an important factor to evaluate a fuzzer’s performance. Although a unique crash is not necessarily a vulnerability, in most cases, if a fuzzer can find more crashes, it can find more vulnerabilities. More specifically, a fuzzer with a better capability of finding unique crashes usually has the following characteristics: (1) In a limited time, it can find more unique crashes. (2) For finding a fixed number of unique crashes, it can find them fast.

Thus, we will demonstrate V-Fuzz’s performance in finding unique crashes by answering the following two questions.

Can V-Fuzz find more unique crashes in a limited time? We fuzz the 13 programs for 24 hours and compare V-Fuzz with 3 state-of-the-art fuzzers: VUzzer, AFL and AFLFast. Table VI presents the information of the fuzzed programs and the number of unique crashes found in 24 hours. From Table VI, we have the following observations. (1) For all the 13 programs, V-Fuzz finds the most unique crashes (the average number of unique crashes for one program is 1,114) and is much better than the other three fuzzers. (2) Compared with VUzzer, the average number of unique crashes found by V-Fuzz is improved by 35.8%. In addition, for the program cflow, VUzzer did not find any crash while V-Fuzz found one crash. (3) There are five programs (uniq, base64, who, pdf2svg and cflow) on which AFL did not find any crash while V-Fuzz did. (4) There are also five programs (uniq, base64, who, pdffonts and cflow) on which AFLFast did not find any crash while V-Fuzz did.

| Application | Version | V-Fuzz | VUzzer | AFL | AFLFast |
|---|---|---|---|---|---|
| uniq | LAVA-M | 659 | 321 | 0 | 0 |
| base64 | LAVA-M | 128 | 100 | 0 | 0 |
| who | LAVA-M | 117 | 92 | 0 | 0 |
| pdftotext | xpdf-2.00 | 209 | 59 | 12 | 108 |
| pdffonts | xpdf-2.00 | 581 | 367 | 13 | 0 |
| pdftopbm | xpdf-2.00 | 50 | 25 | 37 | 35 |
| pdf2svg + libpoppler | pdf2svg-0.2.3, libpoppler-0.24.5 | 3 | 2 | 0 | 1 |
| MP3Gain | 1.5.2 | 217 | 34 | 103 | 110 |
| mpg321 | 0.3.2 | 321 | 184 | 40 | 17 |
| xpstopng | libgxps-0.2.5 | 3,222 | 2,195 | 2 | 2 |
| xpstops | libgxps-0.2.5 | 4,157 | 3,044 | 3 | 3 |
| xpstojpeg | libgxps-0.2.5 | 4,828 | 4,243 | 4 | 4 |
| cflow | 1.5 | 1 | 0 | 0 | 0 |
| Total | | 14,493 | 10,666 | 214 | 280 |
| Average | | 1,114 | 820 | 16 | 21 |

TABLE VI: The number of unique crashes found in 24 hours.
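
The summary rows of Table VI and the 35.8% figure can be checked with a few lines of Python. The sketch below assumes that the averages in the table are rounded down to whole numbers, which is consistent with the reported values; the variable names are our own.

```python
# Totals of unique crashes over the 13 programs, taken from Table VI.
totals = {"V-Fuzz": 14493, "VUzzer": 10666, "AFL": 214, "AFLFast": 280}
programs = 13

# Per-program averages, rounded down as in the table.
averages = {name: total // programs for name, total in totals.items()}

# Relative improvement of V-Fuzz over VUzzer, in percent.
improvement = (
    (averages["V-Fuzz"] - averages["VUzzer"]) / averages["VUzzer"] * 100
)
```

The computed improvement lies between 35.8% and 35.9%, in line with the figure reported in the text.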

Can V-Fuzz find unique crashes quickly? Figure 12 shows the growth curves of the number of discovered unique crashes within 24 hours. From the figure, we can observe that V-Fuzz finds unique crashes more quickly than the other state-of-the-art fuzzers. From the above results, we can see that V-Fuzz performs well in discovering unique crashes and outperforms the state-of-the-art fuzzers.

Fig. 12: The growth curves of the number of unique crashes.

7.2.2 Vulnerability Discovery

In this part, we show V-Fuzz's capability of discovering vulnerabilities. During the fuzzing tests of V-Fuzz, we collect the inputs which cause crashes. For the three programs of LAVA-M, we run the programs again with the crash inputs and verify the bugs they trigger. Table VIII shows the number of bugs found by V-Fuzz and VUzzer. Each injected bug in LAVA-M has a unique ID, and the corresponding ID is printed when the bug is triggered. There are two kinds of bugs in LAVA-M: listed and unlisted. The listed bugs are those that the LAVA-M authors were able to trigger when creating the LAVA-M programs, and the unlisted bugs are those that the LAVA-M authors were not able to trigger. From Table VIII, we can observe that V-Fuzz can trigger more bugs than VUzzer. In addition, V-Fuzz is able to trigger several unlisted bugs and performs better than VUzzer in this case too. Table VII shows the IDs of the bugs triggered by V-Fuzz on the three programs (uniq, base64 and who) of the LAVA-M dataset.

| Application | IDs of Bugs |
|---|---|
| uniq | 112, 130, 166, 169, 170, 171, 215, 222, 227, 293, 296, 297, 321, 322, 346, 347, 368, 371, 372, 393, 396, 397, 443, 446, 447, 468, 471, 472 |
| base64 | 1, 222, 235, 253, 255, 274, 276, 278, 284, 386, 521, 526, 556, 558, 562, 576, 583, 584, 784, 790, 804, 805, 806, 813, 832, 841, 843 |
| who | 1, 2, 3, 5, 6, 9, 14, 20, 22, 24, 56, 58, 60, 62, 75, 77, 79, 81, 83, 87, 89, 109, 116, 124, 127, 129, 131, 133, 137, 139, 143, 149, 151, 152, 153, 155, 157, 161, 177, 179, 197, 1151, 1171, 1250, 1272, 1276, 1280, 1291, 1721, 1783, 1880, 1888, 1908, 2486, 2947, 2957, 2979, 3201, 3240, 3780, 3923, 3979 |

TABLE VII: The IDs of bugs triggered by V-Fuzz on LAVA-M.
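
Collecting the distinct bug IDs from repeated runs can be sketched in Python as follows. The sketch assumes that a triggered LAVA-M bug prints a line of the form "Successfully triggered bug <ID>, crashing now!"; the function name and the sample outputs are hypothetical, and the pattern should be adapted if the actual message differs.

```python
import re
from typing import Iterable, Set

# Assumed format of the message printed when an injected bug is triggered.
BUG_PATTERN = re.compile(r"Successfully triggered bug (\d+), crashing now!")

def triggered_bug_ids(outputs: Iterable[str]) -> Set[int]:
    """Collect the distinct bug IDs reported across program outputs."""
    ids: Set[int] = set()
    for text in outputs:
        ids.update(int(match) for match in BUG_PATTERN.findall(text))
    return ids

# Hypothetical outputs of three runs with crash inputs.
outputs = [
    "Successfully triggered bug 112, crashing now!\n",
    "Successfully triggered bug 112, crashing now!\n",
    "Successfully triggered bug 130, crashing now!\n",
]
found = triggered_bug_ids(outputs)
```

Because the IDs are stored in a set, many crash inputs that trigger the same injected bug are counted only once.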

For the real-world Linux applications, in order to verify the vulnerabilities found by V-Fuzz, we recompile the fuzzed programs with AddressSanitizer [18], which is a memory error detector for C/C++ programs. Then, we execute these programs with the collected crash inputs. AddressSanitizer can give the detailed information of the vulnerabilities. Based on the information from AddressSanitizer, we search the related information on the official CVE website [19] and validate the vulnerabilities we found.
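
This verification step can be sketched in Python as follows. The sketch assumes that the target has already been rebuilt with AddressSanitizer and that it takes the input file as its only command-line argument; the function names, the directory layout and the timeout value are hypothetical. It relies only on the "ERROR: AddressSanitizer: <type>" line that AddressSanitizer writes to standard error.

```python
import re
import subprocess
from pathlib import Path
from typing import Dict, Optional

# AddressSanitizer reports start with "ERROR: AddressSanitizer: <type>".
ASAN_PATTERN = re.compile(r"ERROR: AddressSanitizer: ([\w-]+)")

def asan_error_type(stderr_text: str) -> Optional[str]:
    """Extract the error type from an AddressSanitizer report, if any."""
    match = ASAN_PATTERN.search(stderr_text)
    return match.group(1) if match else None

def triage(binary: str, crash_dir: str, timeout: float = 10.0) -> Dict[str, Optional[str]]:
    """Replay every crash input and record the reported error type."""
    results: Dict[str, Optional[str]] = {}
    for path in sorted(Path(crash_dir).iterdir()):
        try:
            proc = subprocess.run(
                [binary, str(path)],  # assumed command-line convention
                capture_output=True,
                text=True,
                errors="replace",
                timeout=timeout,
            )
            results[path.name] = asan_error_type(proc.stderr)
        except subprocess.TimeoutExpired:
            results[path.name] = None  # a hang rather than a memory error
    return results

sample = "==4242==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000011\n"
kind = asan_error_type(sample)
```

Grouping the crash inputs by the reported error type gives a first view of how many distinct problems they trigger, before the reports are compared with known CVE entries.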

Table IX shows the detailed information of the CVEs that we found. We have found 10 CVEs in total, of which 3 (CVE-2018-10767, CVE-2018-10733 and CVE-2018-10768) are newly found by us. Moreover, most of the fuzzed applications are shown to have CVEs. The crash inputs found when fuzzing the programs of xpdf-2.00 can also trigger the vulnerability in xpdf-3.01. Finally, most of the CVEs are buffer-related errors, which is reasonable as fuzzing is good at finding this type of vulnerability.

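| Application | V-Fuzz Listed | V-Fuzz Unlisted | V-Fuzz Total | VUzzer Listed | VUzzer Unlisted | VUzzer Total |
|---|---|---|---|---|---|---|
| uniq | 27 | 1 | 28 | 26 | 1 | 27 |
| base64 | 24 | 3 | 27 | 23 | 2 | 25 |
| who | 57 | 9 | 62 | 52 | 7 | 59 |

TABLE VIII: The number of bugs found on LAVA-M.
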
| Application | Version | CVE | Vulnerability Type |
|---|---|---|---|
| pdftotext, pdffonts, pdftopbm | xpdf3.01 | CVE-2007-0104 | Buffer errors |
| mpg321 | 0.3.2 | CVE-2017-11552 | Buffer errors |
| MP3Gain | 1.5.2 | CVE-2017-14406 | NULL pointer dereference |
| MP3Gain | 1.5.2 | CVE-2017-14407 | Stack-based buffer over-read |
| MP3Gain | 1.5.2 | CVE-2017-14409 | Buffer overflow |
| MP3Gain | 1.5.2 | CVE-2017-14410 | Buffer over-read |
| MP3Gain | 1.5.2 | CVE-2017-12912 | Buffer errors |
| libgxps | 0.3.0 | CVE-2018-10767 (new) | Buffer errors |
| libgxps | 0.3.0 | CVE-2018-10733 (new) | Stack-based buffer over-read |
| libpoppler | 0.24.5 | CVE-2018-10768 (new) | NULL pointer dereference |

TABLE IX: The CVEs found by V-Fuzz.

7.2.3 Code Coverage

V-Fuzz is neither a coverage-based fuzzer nor a directed fuzzer. The goal of V-Fuzz is to find more bugs in a shorter period of time. In this part, we show that V-Fuzz can achieve its goal without decreasing its code coverage. Different fuzzers calculate code coverage in different ways. AFL and AFLFast utilize static instrumentation with a compact bitmap to track edge coverage, while VUzzer leverages the dynamic binary instrumentation tool PIN [24] to track basic block coverage. Additionally, as indicated in [63], most fuzzers cannot calculate code coverage accurately. Therefore, for evaluating the code coverage, we only compare V-Fuzz with VUzzer, as they use the same approach to calculate the basic block coverage.
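
The difference between the two ways of measuring coverage can be illustrated with a short Python sketch. The edge bitmap follows the scheme documented for AFL, in which the map index is the current block ID XORed with the previous block ID shifted right by one bit. The function names, block IDs and traces below are hypothetical, and the sketch models neither the real instrumentation nor PIN.

```python
from typing import Sequence

MAP_SIZE = 1 << 16  # AFL uses a 64 kB map by default

def edge_bitmap(trace: Sequence[int]) -> bytearray:
    """Count edge hits in a compact bitmap, in the style of AFL."""
    bitmap = bytearray(MAP_SIZE)
    prev = 0
    for cur in trace:
        index = (cur ^ prev) % MAP_SIZE
        bitmap[index] = (bitmap[index] + 1) & 0xFF  # 8-bit hit counter
        prev = cur >> 1  # the shift keeps A->B distinct from B->A
    return bitmap

def block_coverage(trace: Sequence[int], total_blocks: int) -> float:
    """Fraction of basic blocks executed at least once."""
    return len(set(trace)) / total_blocks

# Two hypothetical block IDs visited in opposite orders.
A, B = 0x1234, 0x5678
forward, backward = [A, B], [B, A]
same_blocks = block_coverage(forward, 4) == block_coverage(backward, 4)
same_edges = edge_bitmap(forward) == edge_bitmap(backward)
```

Both traces execute the same basic blocks, so their basic block coverage is identical, while their edge bitmaps differ. This is one reason why coverage numbers from different fuzzers are not directly comparable.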

Figure 13 shows the growth curves of the code coverage within 24 hours for V-Fuzz and VUzzer. From this figure, we have the following observations: (1) For most of the programs, the code coverage of V-Fuzz and VUzzer is almost the same. (2) There is only one program (mpg321) on which the code coverage of VUzzer (19%) is slightly larger than that of V-Fuzz (17%). (3) There are several programs (e.g., MP3Gain) on which V-Fuzz has a higher code coverage, especially the program xpstops.

In conclusion, although code coverage is not the main objective of V-Fuzz, the experiments show that V-Fuzz can achieve its main goal of finding crashes without decreasing code coverage much. Moreover, maintaining code coverage can help compensate for the false negatives of the vulnerability prediction model.

Fig. 13: Code coverage rate.

8 Discussion

8.1 Vulnerability-Oriented Fuzzing

The blindness of fuzzing can be decreased by providing it with more information or by combining fuzzing with other techniques. For example, Driller [43] is a popular fuzzer which combines fuzzing with symbolic execution. When the fuzzer gets "stuck", symbolic execution can compute a valid input for it. However, as symbolic execution does not work well for real programs, it is hard to apply this idea in the real world.

The idea of vulnerability-oriented fuzzing is to improve a fuzzer’s efficiency of finding vulnerabilities by assisting it with static vulnerability analysis techniques. The static vulnerability analysis can provide a fuzzer with more information, which can reduce the blindness of it. It is worth noting that static vulnerability analysis can be conducted by traditional static vulnerability detection tools or AI techniques (e.g., deep learning or machine learning) based models. Therefore, the vulnerability-oriented fuzzing can be implemented by many static analysis techniques. In addition, the static analysis just gives an auxiliary information to a fuzzer, which does not affect its extensibility.

Most static analysis tools or models suffer from high false positive or false negative rates. This problem has been plaguing security researchers for many years, and there has been no good solution yet. However, static analysis can be combined with fuzzing to mitigate this weakness. Moreover, compared with combining fuzzing with symbolic execution, this combination has better extensibility and effectiveness.

8.2 Limitations and Future Works

Although V-Fuzz has improved the capability of fuzzing in detecting vulnerabilities, it still has several limitations. First, we only use the dataset from NIST [13] to train and test the vulnerability prediction model. The number of labeled samples may not be sufficient, which might affect the performance of the model. In the future, we will pay more attention to collecting and labeling more data to improve the model. Second, our model is mainly designed for binary programs. Although predicting vulnerabilities for binary programs might be more difficult than for source code, it is also necessary to design a model which can be applied to source code directly. In addition, as the instrumentation methods of most state-of-the-art fuzzers are based on compilers (e.g., gcc or clang), they need the source code of the fuzzed programs. In order to combine our model with these fuzzers conveniently, we will study source-code based prediction models and combine these models with more state-of-the-art fuzzers in the future.

9 Related Work

We introduce the related work from two perspectives: vulnerability prediction and fuzzing test.

9.1 Vulnerability Prediction

Here we introduce the related work about vulnerability prediction. It needs to be emphasized that most of the previous work on vulnerability prediction is dealing with source code, which is different from us as we deal with binaries.

9.1.1 Machine Learning based Vulnerability Prediction

Machine learning based vulnerability prediction models are designed based on software features which are manually selected or constructed by experts. Shin et al. [70] leveraged complexity, code churn, and developer activity metrics to discriminate between vulnerable and neutral files. Hovsepyan et al. [71] presented a novel approach for vulnerability prediction that relies on the analysis of raw source code as text. Gegick et al. [72] designed an attack-prone prediction model with some code-level metrics: static analysis tool alert density, code churn and count of the lines of code. Neuhaus et al. [73] proposed an approach to mine existing vulnerability databases and version archives automatically to map past vulnerabilities to components. Shar et al. [74] leveraged some static and dynamic attributes to predict SQL injection and cross site scripting vulnerabilities. Walden et al. [75] examined vulnerability prediction models for web applications, and compared the performance of prediction models based on software metrics with that of models based on text mining.

9.1.2 Deep Learning based Vulnerability Prediction

Machine learning based approaches focus on manually designing features to represent a program, which requires considerable manual effort. Therefore, some deep learning based vulnerability prediction models have been proposed. One of the biggest advantages of these models is that they can learn features automatically according to different types of programs. Dam et al. [76] described a new approach built upon the Long Short Term Memory (LSTM) model to automatically learn both semantic and syntactic features in code. Wang et al. [77] leveraged a Deep Belief Network (DBN) to automatically learn semantic representations of programs from source code. Yang et al. [38] leveraged deep learning techniques to learn features for defect prediction. Dam et al. [78] presented an end-to-end generic framework based on LSTM for modeling software and its development process to predict future risks. White et al. [39] demonstrated that deep software language models have better performance than n-gram based models. Gu et al. [79] proposed a deep learning based approach to generate API usage sequences for a given natural language query. Huo et al. [80] proposed a convolutional neural network based model to learn unified features from natural language and source code in programs for locating potentially buggy source code.

9.2 Fuzzing Test

9.2.1 Symbolic Execution based Fuzzing.

Most of whitebox fuzzers leverage symbolic execution or concolic execution (combines symbolic execution and concrete execution) to generate inputs that can execute new paths. SAGE [26] is a whitebox fuzzer, which uses symbolic execution to gather path constraints of conditional statements. CUTE [25] is a unit testing engine for C programs by combining symbolic and concrete execution. ZESTI [31] is a software testing tool that takes a lightweight symbolic execution mechanism to execute regression test. Dowser [33] is a guided fuzzer that combines taint tracking, program analysis and symbolic execution to detect buffer overflow vulnerabilities. Driller [43] combines AFL with concolic execution to generate inputs that can trigger deeper bugs. SmartFuzz [28] focuses on discovering integer bugs for x86 binaries using symbolic execution.

9.2.2 Taint Analysis based Fuzzing.

Another common technique that is leveraged by fuzzing is taint analysis, especially dynamic taint analysis. BuzzFuzz [27] is an automated whitebox fuzzer. It employs dynamic taint tracing to locate the regions of the original inputs that influence values used at key attack points. TaintScope [29] is a fuzzing system that uses dynamic taint analysis and symbolic execution to fuzz x86 binary programs.

9.2.3 Heuristic Algorithm based Fuzzing.

Most mutation-based fuzzers employ heuristic algorithms, such as evolutionary algorithms, to guide the selection and generation of high-quality inputs. AFL [9] is the most popular mutation-based fuzzer, which leverages a simple evolutionary algorithm to generate inputs. AFLFast [48] is a coverage-based greybox fuzzer. It leverages a Markov chain model to generate inputs that tend to arrive at the “low-frequency” paths. VUzzer [57] is an application-aware fuzzer. It leverages an evolutionary algorithm to generate inputs that can discover bugs in deep paths. AFLGo [53] is a directed greybox fuzzer, which generates inputs with the aim of reaching specific locations by using a simulated annealing-based power schedule. SlowFuzz [55] is a domain-independent framework for automatically finding algorithmic complexity vulnerabilities.
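The evolutionary loop shared by these fuzzers can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of AFL, VUzzer or V-Fuzz: the `target` function is a hypothetical program under test that reports the branches it executed, the only mutation is a single-byte replacement, and the fitness criterion is simply whether an input reaches a branch that has not been seen before.

```python
import random


def target(data: bytes) -> set:
    """Hypothetical program under test: returns the IDs of executed branches."""
    covered = {0}
    if len(data) > 0 and data[0] == ord("F"):
        covered.add(1)
        if len(data) > 1 and data[1] == ord("U"):
            covered.add(2)
            if len(data) > 2 and data[2] == ord("Z"):
                covered.add(3)
    return covered


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Replace one randomly chosen byte with a random value."""
    buf = bytearray(data)
    pos = rng.randrange(len(buf))
    buf[pos] = rng.randrange(256)
    return bytes(buf)


def evolve(seed: bytes, iterations: int, rng: random.Random):
    """Keep every mutated input that executes a previously unseen branch."""
    population = [seed]
    seen = set(target(seed))
    for _ in range(iterations):
        parent = rng.choice(population)
        child = mutate(parent, rng)
        coverage = target(child)
        if not coverage <= seen:  # the child executed at least one new branch
            seen |= coverage
            population.append(child)
    return population, seen


population, seen = evolve(b"AAAA", 20000, random.Random(0))
print(len(population), sorted(seen))
```

Because inputs that reach new branches are retained as parents, the nested checks are solved one byte at a time instead of requiring all bytes to be guessed at once. A vulnerability-oriented fuzzer would replace the "new branch" criterion with a fitness score that favors the regions predicted to be vulnerable.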

9.2.4 Machine Learning based Fuzzing.

With the development of artificial intelligence techniques, some studies have begun to explore how to apply these techniques to fuzzing. Godefroid et al. [59] propose a novel fuzzing method, which uses a learned input probability distribution to intelligently guide where to fuzz inputs. Rajpal et al. [60] present a learning approach that uses neural networks to learn patterns in the input files to guide future fuzzing. Nichols et al. [61] propose a method that uses Generative Adversarial Network (GAN) models to reinitialize the seed files to improve the performance of fuzzing. Böttinger et al. [66] formalize fuzzing as a reinforcement learning problem by using the concept of Markov decision processes.

9.2.5 Other Fuzzing Research.

kAFL [50] is a hardware-assisted feedback fuzzer that focuses on fuzzing x86-64 kernels. IMF [52] leverages an inferred dependence model to fuzz commodity OS kernels. Skyfire [51] leverages the knowledge of existing samples to generate well-distributed inputs for fuzzing. CollAFL [63] is a coverage-sensitive fuzzing solution that mitigates path collisions by providing more accurate coverage information. T-Fuzz [62] leverages a lightweight dynamic tracing-based approach to infer all checks that could not be passed and generates mutated programs where the checks are negated. Angora [64] is a mutation-based fuzzer that increases branch coverage by solving path constraints without symbolic execution.

10 Conclusion

In this paper, we design and implement V-Fuzz, a vulnerability-oriented evolutionary fuzzing framework. By combining vulnerability prediction with evolutionary fuzzing, V-Fuzz can generate inputs that tend to arrive at the potentially vulnerable regions. We evaluate V-Fuzz on popular benchmark programs (e.g., uniq) of LAVA-M [45], and a variety of real-world Linux applications including audio processing software (e.g., MP3Gain), PDF transformation tools (e.g., pdftotext) and XPS document libraries (e.g., libgxps). Compared with state-of-the-art fuzzers, the experimental results demonstrate that V-Fuzz can find more vulnerabilities quickly. In addition, V-Fuzz has discovered 10 CVEs, and 3 of them are newly discovered. We reported the new CVEs, and they have been confirmed and fixed. In the future, we plan to study how to leverage more advanced program analysis techniques to assist the fuzzer in discovering vulnerabilities.

References

  • [1] M. Sutton, A. Greene, and P. Amini, Fuzzing: Brute Force Vulnerability Discovery.   Addison-Wesley Professional, 2007.
  • [2] J. Röning, M. Laakso, A. Takanen, and R. Kaksonen, ”Protos-systematic approach to eliminate software vulnerabilities,” Invited presentation at Microsoft Research, 2002.
  • [3] D. A, ”An introduction to spike, the fuzzer creation kit,” 2002.
  • [4] D. CGC, ”Darpa cyber grand challenge,” https://github.com/CyberGrandChallenge/, 2017.
  • [5] Wikipedia, ”Fuzzing,” https://en.wikipedia.org/wiki/Fuzzing/, 2018.
  • [6] M. E, ”Peach fuzzer,” https://www.peach.tech, 2018.
  • [7] Google, ”Oss-fuzz - continuous fuzzing for open source software,” https://github.com/google/oss-fuzz, 2018.
  • [8] Microsoft, ”Microsoft security development lifecycle,” https://www.microsoft.com/en-us/sdl/process/verification.aspx, 2018.
  • [9] M. Zalewski, ”american fuzzy lop,” http://lcamtuf.coredump.cx/afl/, 2017.
  • [10] Google, ”honggfuzz,” https://google.github.io/honggfuzz/, 2017.
  • [11] C. Labs, http://caca.zoy.org/wiki/zzuf/, 2017.
  • [12] Checkmarx, ”Checkmarx,” https://www.checkmarx.com/, 2017.
  • [13] NVD, http://nvd.nist.gov/, 2017.
  • [14] Hex-Rays, ”The ida pro disassembler and debugger,” https://www.hex-rays.com/products/ida/, 2015.
  • [15] PyTorch, ”Tensors and dynamic neural networks in python with strong gpu acceleration,” http://pytorch.org/, 2017.
  • [16] N. C. for Assured Software, ”Juliet test suite for c/c++,” https://samate.nist.gov/SRD/testsuite.php, 2017.
  • [17] ”Sanitizercoverage: Clang documentation,” https://clang.llvm.org/docs/SanitizerCoverage.html, 2018.
  • [18] chefmax, ”Addresssanitizer,” https://github.com/google/sanitizers/wiki/AddressSanitizer, 2017.
  • [19] NVD, ”Cve: Common vulnerabilities and exposures,” https://cve.mitre.org/, 2018.
  • [20] Intel, ”Intel 64 and ia-32 architectures software developer manuals,” https://software.intel.com/en-us/articles/intel-sdm, 2018.
  • [21] P. Godefroid, N. Klarlund, and K. Sen, ”Dart:directed automated random testing,” ACM SIGPLAN Notices, vol. 40, no. 6, pp. 213–223, 2005.
  • [22] A. Takanen, J. Demott, and C. Miller, Fuzzing for Software Security Testing and Quality Assurance.   Artech House, 2008.
  • [23] J. Viega, J. T. Bloch, Y. Kohno, and G. Mcgraw, ”Its4: A static vulnerability scanner for c and c++ code,” in Computer Security Applications, 16th Annual Conference, 2000, pp. 257–267.
  • [24] C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Reddi, and K. Hazelwood, ”Pin: building customized program analysis tools with dynamic instrumentation,” in ACM Sigplan Conference on Programming Language Design and Implementation, 2005, pp. 190–200.
  • [25] K. Sen, D. Marinov, and G. Agha, ”Cute: a concolic unit testing engine for c,” in ACM SIGSOFT Software Engineering Notes, vol. 30, no. 5.   ACM, 2005, pp. 263–272.
  • [26] P. Godefroid, M. Levin, and D. Molnar, ”Automated whitebox fuzz testing,” in Network and Distributed System Security Symposium, 2008.
  • [27] V. Ganesh, T. Leek, and M. Rinard, ”Taint-based directed whitebox fuzzing,” in Proceedings of the 31st International Conference on Software Engineering.   IEEE Computer Society, 2009, pp. 474–484.
  • [28] D. Molnar, X. Li, and D. Wagner, ”Dynamic test generation to find integer bugs in x86 binary linux programs,” in Usenix Conference on Security Symposium, 2009, pp. 67–82.
  • [29] T. Wang, T. Wei, G. Gu, and W. Zou, ”Taintscope: A checksum-aware directed fuzzing tool for automatic software vulnerability detection,” in Proceedings of the 31st IEEE Symposium on Security and Privacy.   IEEE, 2010, pp. 497–512.
  • [30] R. Mcnally, K. Yiu, D. Grove, and D. Gerhardy, ”Fuzzing: The state of the art,” Fuzzing the State of the Art, 2012.
  • [31] P. Marinescu and C. Cadar, ”make test-zesti: A symbolic execution solution for improving regression testing,” in Proceedings of the 34th International Conference on Software Engineering.   IEEE, 2012, pp. 716–726.
  • [32] M. Woo, K. Sang, S. Gottlieb, and D. Brumley, ”Scheduling black-box mutational fuzzing,” in Proceedings of the 2013 ACM Sigsac Conference on Computer and Communications Security, 2013, pp. 511–522.
  • [33] I. Haller, A. Slowinska, M. Neugschwandtner, and H. Bos, ”Dowsing for overflows: a guided fuzzer to find buffer boundary violations,” in Usenix Conference on Security Symposium, 2013, pp. 49–64.
  • [34] A. Rebert, K. Sang, G. Grieco, and D. Brumley, ”Optimizing seed selection for fuzzing,” in Usenix Conference on Security Symposium, 2014, pp. 861–875.
  • [35] M. Neugschwandtner, P. Milani Comparetti, I. Haller, and H. Bos, ”The borg: Nanoprobing binaries for buffer overreads,” pp. 87–97, 2015.
  • [36] K. Sang, M. Woo, and D. Brumley, ”Program-adaptive mutational fuzzing,” in Proceedings of the 36th IEEE Symposium on Security and Privacy, 2015, pp. 725–741.
  • [37] X. Wang, L. Zhang, and P. Tanofsky, ”Experience report: how is dynamic symbolic execution different from manual testing? a study on klee,” in International Symposium on Software Testing and Analysis, 2015, pp. 199–210.
  • [38] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, ”Deep learning for just-in-time defect prediction,” in IEEE International Conference on Software Quality, Reliability and Security, 2015, pp. 17–26.
  • [39] M. White, C. Vendome, M. Linares-Vásquez, and D. Poshyvanyk, ”Toward deep learning software repositories,” in Mining Software Repositories, 2015, pp. 334–345.
  • [40] E. Shin, D. Song, and R. Moazzezi, ”Recognizing functions in binaries with neural networks,” in Usenix Conference on Security Symposium, 2015, pp. 611–626.
  • [41] J. Pewny, B. Garmany, R. Gawlik, C. Rossow, and T. Holz, ”Cross-architecture bug search in binary executables,” in Proceedings of the 36th IEEE Symposium on Security and Privacy.   IEEE, 2015, pp. 709–724.
  • [42] Q. Feng, R. Zhou, C. Xu, Y. Cheng, B. Testa, and H. Yin, ”Scalable graph-based bug search for firmware images,” in ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 480–491.
  • [43] N. Stephens, J. Grosen, C. Salls, A. Dutcher, R. Wang, J. Corbetta, S. Yan, C. Kruegel, and G. Vigna, ”Driller: Augmenting fuzzing through selective symbolic execution,” in Network and Distributed System Security Symposium, 2016.
  • [44] H. Dai, B. Dai, and L. Song, ”Discriminative embeddings of latent variable models for structured data,” in International Conference on Machine Learning, 2016, pp. 2702–2711.
  • [45] B. Dolan-Gavitt, P. Hulin, E. Kirda, T. Leek, A. Mambretti, W. Robertson, F. Ulrich, and R. Whelan, ”Lava: Large-scale automated vulnerability addition,” in Proceedings of the 37th IEEE Symposium on Security and Privacy, 2016, pp. 110–121.
  • [46] S. Eschweiler, K. Yakdan, and E. Gerhards-Padilla, ”discovre: Efficient cross-architecture identification of bugs in binary code,” in Network and Distributed System Security Symposium, 2016.
  • [47] S. Wang, T. Liu, and L. Tan, ”Automatically learning semantic features for defect prediction,” in IEEE/ACM International Conference on Software Engineering.   ACM, 2016, pp. 297–308.
  • [48] V. Pham and A. Roychoudhury, ”Coverage-based greybox fuzzing as markov chain,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 1032–1043.
  • [49] K. Serebryany, ”Continuous fuzzing with libfuzzer and addresssanitizer,” in Cybersecurity Development, 2017, pp. 157–157.
  • [50] Schumilo, Aschermann, and Gawlik, ”kafl: Hardware-assisted feedback fuzzing for os kernels,” in Usenix Conference on Security Symposium, 2017, pp. 167–182.
  • [51] J. Wang, B. Chen, L. Wei, and Y. Liu, ”Skyfire: Data-driven seed generation for fuzzing,” in Proceedings of the 38th IEEE Symposium on Security and Privacy.   IEEE, 2017, pp. 579–594.
  • [52] H. Han and S. Cha, ”Imf: Inferred model-based fuzzer,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security.   ACM, 2017, pp. 2345–2358.
  • [53] M. Böhme, V. Pham, M. Nguyen, and A. Roychoudhury, ”Directed greybox fuzzing,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 2329–2344.
  • [54] H. Han and K. Sang, ”Imf: Inferred model-based fuzzer,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 2345–2358.
  • [55] T. Petsios, J. Zhao, A. Keromytis, and S. Jana, ”Slowfuzz: Automated domain-independent detection of algorithmic complexity vulnerabilities,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.
  • [56] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, ”Neural network-based graph embedding for cross-platform binary code similarity detection,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security.   Dallas: ACM, 2017, pp. 363–376.
  • [57] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida, and H. Bos, ”Vuzzer: Application-aware evolutionary fuzzing,” in Network and Distributed System Security Symposium, 2017.
  • [58] S. Wang, T. Liu, and L. Tan, ”Automatically learning semantic features for defect prediction,” in IEEE/ACM International Conference on Software Engineering, 2017, pp. 297–308.
  • [59] P. Godefroid, H. Peleg, and R. Singh, ”Learn&fuzz: Machine learning for input fuzzing,” in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering.   IEEE, 2017, pp. 50–59.
  • [60] M. Rajpal, W. Blum, and R. Singh, ”Not all bytes are equal: Neural byte sieve for fuzzing,” arXiv preprint arXiv:1711.04596, 2017.
  • [61] N. Nichols, M. Raugas, R. Jasper, and N. Hilliard, ”Faster fuzzing: Reinitialization with deep neural models,” arXiv preprint arXiv:1711.02807, 2017.
  • [62] H. Peng, Y. Shoshitaishvili, and M. Payer, ”T-fuzz: fuzzing by program transformation,” in Proceedings of the 39th IEEE Symposium on Security and Privacy, 2018.
  • [63] S. Gan, C. Zhang, X. Qin, X. Tu, K. Li, Z. Pei, and Z. Chen, ”Collafl: Path sensitive fuzzing,” in Proceedings of the 39th IEEE Symposium on Security and Privacy, 2018.
  • [64] P. Chen and H. Chen, ”Angora: Efficient fuzzing by principled search,” in Proceedings of the 39th IEEE Symposium on Security and Privacy, 2018.
  • [65] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, ”Vuldeepecker: A deep learning-based system for vulnerability detection,” in Network and Distributed System Security Symposium, 2018.
  • [66] K. Böttinger, P. Godefroid, and R. Singh, ”Deep reinforcement fuzzing,” arXiv preprint arXiv:1801.04589, 2018.
  • [67] H. Cai, V. Zheng, and K. Chang, ”A comprehensive survey of graph embedding: problems, techniques and applications,” IEEE Transactions on Knowledge and Data Engineering, 2018.
  • [68] Y. Shin, and L. Williams, “Can traditional fault prediction models be used for vulnerability prediction?”, Empirical Software Engineering, 2013, 18(1): 25-59.
  • [69] W. Han, B. Joe, B. Lee, C. Song, and I. Shin, “Enhancing Memory Error Detection for Large-Scale Applications and Fuzz Testing,” Network and Distributed System Security Symposium, 2018.
  • [70] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities,” IEEE Transactions on Software Engineering, 2011, 37(6): 772-787.
  • [71] A. Hovsepyan, R. Scandariato, W. Joosen, and J. Walden, “Software vulnerability prediction using text analysis techniques,” Proceedings of the 4th international workshop on Security measurements and metrics, ACM, 2012: 7-10.
  • [72] M. Gegick, L. Williams, J. Osborne, and M. Vouk, “Prioritizing software security fortification through code-level metrics,” Proceedings of the 4th ACM workshop on Quality of protection, ACM, 2008: 31-38.
  • [73] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, “Predicting vulnerable software components,” Proceedings of the 14th ACM conference on Computer and communications security, ACM, 2007: 529-540.
  • [74] L. K. Shar, H. B. K. Tan, and L. C. Briand, “Mining SQL injection and cross site scripting vulnerabilities using hybrid program analysis,” Proceedings of the 2013 International Conference on Software Engineering, IEEE Press, 2013: 642-651.
  • [75] J. Walden, J. Stuckman, and R. Scandariato, “Predicting vulnerable components: Software metrics vs text mining,” Proceedings of the 25th International Symposium on Software Reliability Engineering (ISSRE), IEEE, 2014: 23-33.
  • [76] H. K. Dam, T. Tran, T. Pham et al., “Automatic feature learning for vulnerability prediction,” arXiv preprint arXiv:1708.02368, 2017.
  • [77] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features for defect prediction,” Proceedings of the 38th International Conference on Software Engineering, ACM, 2016, pp. 297–308.
  • [78] H. K. Dam, T. Tran, J. Grundy, and A. Ghose, “DeepSoft: A vision for a deep model of software,” Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ACM, 2016.
  • [79] X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep API learning,” Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ACM, 2016, pp. 631–642.
  • [80] X. Huo, M. Li, and Z.-H. Zhou, “Learning unified features from natural and programming languages for locating buggy source code”, Proceedings of the 25th International Joint Conference on Artificial Intelligence, 2016, pp. 1606–1612.