SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities

07/18/2018 ∙ by Zhen Li, et al. ∙ Huazhong University of Science & Technology and The University of Texas at San Antonio

The detection of software vulnerabilities (or vulnerabilities for short) is an important problem that has yet to be solved effectively, as manifested by the many vulnerabilities reported on a daily basis. This calls for machine learning methods to automate vulnerability detection. Deep learning is attractive for this purpose because it does not require human experts to manually define features. Despite the tremendous success of deep learning in other domains, its applicability to vulnerability detection is not systematically understood. In order to fill this void, we propose the first systematic framework for using deep learning to detect vulnerabilities. The framework, dubbed Syntax-based, Semantics-based, and Vector Representations (SySeVR), focuses on obtaining program representations that can accommodate the syntax and semantic information pertinent to vulnerabilities. Our experiments with 4 software products demonstrate the usefulness of the framework: we detect 15 vulnerabilities that are not reported in the National Vulnerability Database. Among these 15 vulnerabilities, 7 are unknown and have been reported to the vendors, and the other 8 have been "silently" patched by the vendors when releasing newer versions of the products.

I Introduction

Software vulnerabilities (or vulnerabilities for short) are a fundamental reason for the prevalence of cyber attacks. Despite academic and industrial efforts at improving software quality, vulnerabilities remain a big problem. This is evidenced by the large number of vulnerabilities reported each year in Common Vulnerabilities and Exposures (CVE) [1].

Given that vulnerabilities are inevitable, it is important to detect them as early as possible. Source code-based static analysis is an important approach to detecting vulnerabilities, including code similarity-based methods [2, 3] and pattern-based methods [4, 5, 6, 7, 8, 9, 10]. Code similarity-based methods can detect vulnerabilities that are incurred by code cloning, but incur high false-negative rates when vulnerabilities are not caused by code cloning [11]. Pattern-based methods may require human experts to manually define features for representing vulnerabilities, which makes them error-prone and laborious. Therefore, an ideal method should be able to detect vulnerabilities caused by a wide range of reasons while imposing as little reliance on human experts as possible.

Deep learning — including Recurrent Neural Networks (RNN) [12, 13, 14, 15, 16], Convolutional Neural Networks (CNN) [17, 18], and Deep Belief Networks (DBN) [19, 20] — has been successful in image and natural language processing. While it is tempting to use deep learning to detect vulnerabilities, we observe that there is a “domain gap”: deep learning is born to cope with data that have natural vector representations (e.g., pixels of images); in contrast, software programs do not have such vector representations. Recently, we proposed VulDeePecker [11], the first deep learning-based vulnerability detection system capable of pinning down the locations of vulnerabilities. While demonstrating the feasibility of using deep learning to detect vulnerabilities, VulDeePecker has a range of weaknesses: (i) it considers only vulnerabilities that are related to library/API function calls; (ii) it leverages only the semantic information induced by data dependency; (iii) it considers only one particular RNN, the Bidirectional Long Short-Term Memory (BLSTM); and (iv) it makes no effort to explain the causes of false-positives and false-negatives.

Our contributions. In this paper, we propose the first systematic framework for using deep learning to detect vulnerabilities. The framework centers on answering the following question: How can we represent programs as vectors that accommodate the syntax and semantic information suitable for vulnerability detection? In order to answer this question, we introduce and define the notions of Syntax-based Vulnerability Candidates (SyVCs) and Semantics-based Vulnerability Candidates (SeVCs), and design algorithms for computing them. Intuitively, SyVCs reflect vulnerability syntax characteristics, and SeVCs extend SyVCs to accommodate the semantic information induced by data dependency and control dependency. Correspondingly, the framework is called Syntax-based, Semantics-based, and Vector Representations, or SySeVR for short. As a byproduct, this study overcomes the aforementioned weaknesses (i)-(iv) of [11].

In order to evaluate SySeVR, we produce a dataset of 126 types of vulnerabilities caused by various reasons from the National Vulnerability Database (NVD) [21] and the Software Assurance Reference Dataset (SARD) [22], while noting that the dataset published by [11] is not sufficient for our purpose because it contains only 2 types of vulnerabilities. Our dataset is of independent value, and has been made publicly available at https://github.com/SySeVR/SySeVR.

Equipped with the new dataset, we show that SySeVR can make deep learning detect vulnerabilities. Some findings are: (1) SySeVR can make multiple kinds of deep neural networks detect various kinds of vulnerabilities. SySeVR makes bidirectional RNNs, especially the Bidirectional Gated Recurrent Unit (BGRU), more effective than CNNs, and makes CNNs more effective than DBNs. Moreover, SySeVR renders deep neural networks (especially BGRU) much more effective than the state-of-the-art vulnerability detection methods. (2) In terms of explaining the causes of false-positives and false-negatives, we find that the effectiveness of BGRU is substantially affected by the training data. If some syntax elements (e.g., tokens) often appear in vulnerable (vs. not vulnerable) pieces of code, then these syntax elements may cause false-positives (correspondingly, false-negatives). (3) It is better to use deep neural networks tailored to specific kinds of vulnerabilities than to use a single deep neural network to detect various kinds of vulnerabilities. (4) The more semantic information accommodated for learning deep neural networks, the higher the vulnerability detection capability of the learned neural networks. For example, the semantic information induced by control dependency can reduce the false-negative rate by 19.6% on average. (5) By applying SySeVR-enabled BGRU to 4 software products (Libav, Seamonkey, Thunderbird, and Xen), we detect 15 vulnerabilities that have not been reported in NVD [21]. Among these 15 vulnerabilities, 7 are unknown to exist in these products (despite that similar vulnerabilities are known to exist in other software); for ethical reasons, we do not release their precise locations, but we have reported them to the respective vendors. The other 8 vulnerabilities have been “silently” patched by the vendors when releasing newer versions of the products.

Paper outline. Section II presents the SySeVR framework. Section III describes experiments and results. Section IV discusses limitations of the present study. Section V reviews related prior work. Section VI concludes the paper.

II The SySeVR Framework

II-A The Domain Gap

Fig. 1: (a) The notion of region proposal in image processing. (b) The SySeVR framework inspired by the notion of region proposal and centered at obtaining SyVC, SeVC, and vector representations of programs.

Deep learning is successful in image processing and other application domains, which are however different from the domain of vulnerability detection. In order to clearly see the gap between these domains (i.e., the “domain gap”), let us consider the example of using deep learning to detect humans in an image. As illustrated in Fig. 1(a), this can be achieved by using the notion of region proposal [23, 24] and leveraging the structural representation of images (e.g., texture, edge, and color). Multiple region proposals can be extracted from one image, and each region proposal can be treated as a “unit” for training neural networks to detect objects (i.e., humans in this example). When using deep learning to detect vulnerabilities, there is no natural structural representation of programs analogous to what region proposals are to images. This means that deep learning cannot be directly used for vulnerability detection.

II-B The Framework

Overview. In order to bridge the domain gap, one may suggest treating each function in a program as a region proposal in image processing. However, this is too coarse-grained because vulnerability detectors not only need to tell whether a function is vulnerable or not, but also need to pin down the locations of vulnerabilities. That is, we need finer-grained representation of programs for vulnerability detection. On the other hand, one may suggest treating each line of code (i.e., statement — the two terms are used interchangeably in this paper) as a unit for vulnerability detection. This treatment has two drawbacks: (i) most statements in a program do not contain any vulnerability, meaning that few samples are vulnerable; and (ii) multiple statements that are semantically related to each other are not considered as a whole.

Inspired by the notion of region proposal in image processing, we propose dividing a program into smaller pieces of code (i.e., a number of statements), which may exhibit the syntax and semantics characteristics of vulnerabilities. This explains, as highlighted in Fig. 1(b), why the framework seeks SyVC, SeVC, and vector representations of programs.

Fig. 2: An example illustrating SyVC, SeVC, and vector representations of a program, where SyVCs are highlighted by boxes and one SyVC may be part of another SyVC. The SyVC→SeVC transformation is elaborated in Fig. 3.

Running example. In order to help understand the details of SySeVR, we use Fig. 2 to highlight how SySeVR extracts SyVCs, SeVCs, and vector representations of SeVCs. At a high level, a SyVC, highlighted by a box in Fig. 2, is a code element that may or may not be vulnerable according to some syntax characteristics of known vulnerabilities. A SeVC extends a SyVC to include statements (i.e., lines of code) that are semantically related to the SyVC, where semantic information is induced by control dependency and/or data dependency; this SyVC→SeVC transformation is fairly involved and therefore elaborated in Fig. 3. Each SeVC is encoded into a vector for input to deep neural networks.

II-B1 Extracting SyVCs

We observe that most vulnerabilities exhibit some simple syntax characteristics, such as function call and pointer usage. Therefore, we propose using syntax characteristics to identify SyVCs, which serve as a starting point (i.e., SyVCs are not sufficient for training deep learning models because they accommodate no semantic information of vulnerabilities). In order to define SyVC, we first define:

Definition 1 (program, function, statement, token)

A program P is a set of functions f_1, f_2, ..., f_η, denoted by P = {f_1, f_2, ..., f_η}. A function f_i, where 1 ≤ i ≤ η, is an ordered set of statements s_{i,1}, s_{i,2}, ..., s_{i,m_i}, denoted by f_i = {s_{i,1}, s_{i,2}, ..., s_{i,m_i}}. A statement s_{i,j}, where 1 ≤ i ≤ η and 1 ≤ j ≤ m_i, is an ordered set of tokens t_{i,j,1}, t_{i,j,2}, ..., t_{i,j,w_{i,j}}, denoted by s_{i,j} = {t_{i,j,1}, t_{i,j,2}, ..., t_{i,j,w_{i,j}}}. Note that tokens can be identifiers, operators, constants, and keywords, and can be extracted by lexical analysis.

Defining SyVCs. Given a function f_i of a program P, there are standard routines for generating its abstract syntax tree, which is denoted by T_i. On T_i, the root corresponds to function f_i, a leaf node corresponds to a token t_{i,j,k} (1 ≤ k ≤ w_{i,j}), and an internal node corresponds to a statement s_{i,j} or multiple consecutive tokens of s_{i,j}. Intuitively, a SyVC is one token (corresponding to a leaf node) or consists of multiple consecutive tokens (corresponding to an internal node). Formally,

Definition 2 (SyVC)

Consider a program P = {f_1, f_2, ..., f_η}, where f_i = {s_{i,1}, s_{i,2}, ..., s_{i,m_i}} with 1 ≤ i ≤ η, and s_{i,j} = {t_{i,j,1}, t_{i,j,2}, ..., t_{i,j,w_{i,j}}} with 1 ≤ j ≤ m_i. Given a set of vulnerability syntax characteristics, denoted by H = {h_1, ..., h_β} where β is the number of syntax characteristics, a code element e_{i,j,z} is composed of one or multiple consecutive tokens of s_{i,j}, namely e_{i,j,z} = (t_{i,j,u}, ..., t_{i,j,v}) where 1 ≤ u ≤ v ≤ w_{i,j}. A code element e_{i,j,z} is called a SyVC if it matches some vulnerability syntax characteristic h_k (1 ≤ k ≤ β).

Note that different kinds of vulnerabilities would have different syntax characteristics. For example, vulnerabilities related to library/API function calls have the following syntax characteristic: a node on the abstract syntax tree T_i is a “callee” (indicating a function call). In Section III, we will show how to extract vulnerability syntax characteristics and determine whether a code element matches a syntax characteristic or not.


Input: A program P = {f_1, f_2, ..., f_η}; a set H of vulnerability syntax characteristics

Output: A set Y of SyVCs

1:  Y ← ∅;
2:  for each function f_i ∈ P do
3:      Generate an abstract syntax tree T_i for f_i;
4:      for each code element e_{i,j,z} in T_i do
5:          for each h_k ∈ H do
6:              if e_{i,j,z} matches h_k then
7:                  Y ← Y ∪ {e_{i,j,z}};
8:              end if
9:          end for
10:      end for
11:  end for
12:  return Y; {the set of SyVCs}
Algorithm 1 Extracting SyVCs from a program

Algorithm for computing SyVCs. Given a program P and a set H of vulnerability syntax characteristics, Algorithm 1 extracts SyVCs from P as follows. First, Algorithm 1 uses a standard routine to generate an abstract syntax tree T_i for each function f_i. Then, Algorithm 1 traverses T_i to identify SyVCs, namely the code elements that match some h_k ∈ H. We defer the details to Section III-B1, because different vulnerability syntax characteristics need different matching methods.
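For concreteness, the following minimal Python sketch mirrors the structure of Algorithm 1. The helper names (build_ast, iter_code_elements) and the representation of syntax characteristics as (name, predicate) pairs are assumptions for illustration rather than the authors' implementation; in practice an AST exporter such as Joern would supply the tree.

    # Sketch of Algorithm 1; build_ast() and iter_code_elements() are hypothetical helpers.
    def extract_syvcs(program_functions, syntax_characteristics):
        """syntax_characteristics: list of (name, match_fn) pairs, where match_fn
        decides whether a code element matches that vulnerability characteristic."""
        syvcs = []                                        # Y <- empty set
        for func in program_functions:                    # for each function f_i in P
            ast = build_ast(func)                         # standard routine (e.g., via Joern)
            for element in iter_code_elements(ast):       # leaf tokens and internal nodes
                for name, match_fn in syntax_characteristics:
                    if match_fn(element, ast):            # e_{i,j,z} matches h_k
                        syvcs.append((func, element, name))
        return syvcs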

Running example. In the second column of Fig. 2, we use boxes to highlight all of the SyVCs that are extracted from the program source code using the vulnerability syntax characteristics that will be described in Section III-B1. We will elaborate how these SyVCs are extracted. It is worth mentioning that one SyVC may be part of another SyVC. For example, there are three SyVCs that are extracted from line 18 because they are extracted with respect to different vulnerability syntax characteristics.

II-B2 Transforming SyVCs to SeVCs

In order to detect vulnerabilities, we propose transforming SyVCs to SeVCs, or the SyVC→SeVC transformation, to accommodate the statements that are semantically related to the SyVCs in question. For this purpose, we propose leveraging the program slicing technique to identify the statements that are semantically related to SyVCs. In order to use the program slicing technique, we need to use the Program Dependency Graph (PDG). This requires us to use data dependency and control dependency, which are defined over the Control Flow Graph (CFG). These concepts are reviewed below.

Definition 3 (CFG [25])

For a program P = {f_1, f_2, ..., f_η}, the CFG of function f_i is a graph G_i = (V_i, E_i), where V_i = {n_{i,1}, n_{i,2}, ..., n_{i,c_i}} is a set of nodes with each node representing a statement or control predicate, and E_i = {ε_{i,1}, ε_{i,2}, ..., ε_{i,d_i}} ⊆ V_i × V_i is a set of direct edges with each edge representing the possible flow of control between a pair of nodes.

Definition 4 (data dependency [25])

Consider a program P = {f_1, f_2, ..., f_η}, the CFG G_i = (V_i, E_i) of function f_i, and two nodes n_{i,j} and n_{i,k} in G_i, where 1 ≤ j, k ≤ c_i and j ≠ k. If a value computed at n_{i,j} is used at n_{i,k}, then n_{i,k} is data-dependent on n_{i,j}.

Definition 5 (control dependency [25])

Consider a program P = {f_1, f_2, ..., f_η}, the CFG G_i = (V_i, E_i) of function f_i, and two nodes n_{i,j} and n_{i,k} in G_i, where 1 ≤ j, k ≤ c_i and j ≠ k. It is said that n_{i,k} post-dominates n_{i,j} if all paths from n_{i,j} to the end of the program traverse through n_{i,k}. If there exists a path starting at n_{i,j} and ending at n_{i,k} such that (i) n_{i,k} post-dominates every node on the path (excluding n_{i,j} and n_{i,k}), and (ii) n_{i,k} does not post-dominate n_{i,j}, then n_{i,k} is control-dependent on n_{i,j}.

Based on data dependency and control dependency, we can define PDG.

Definition 6 (PDG [25])

For a program P = {f_1, f_2, ..., f_η}, the PDG of function f_i is denoted by G'_i = (V_i, E'_i), where V_i is the same as in the CFG G_i, and E'_i ⊆ V_i × V_i is a set of direct edges with each edge representing a data dependency or control dependency between a pair of nodes.

Given PDGs, we can extract the program slices of SyVCs, which may go beyond the boundaries of individual functions. We consider both forward and backward slices because (i) the SyVC may affect some subsequent statements, which may therefore contain a vulnerability; and (ii) the statements affecting the SyVC may render the SyVC vulnerable. Formally,

Definition 7 (forward, backward, and program slices of a SyVC)

Consider a program P = {f_1, f_2, ..., f_η}, the PDGs G'_i = (V_i, E'_i) of its functions f_i, and a SyVC e_{i,j,z} of statement s_{i,j} in f_i.

  • The forward slice of SyVC e_{i,j,z} in f_i, denoted by fs_{i,j,z}, is defined as an ordered set of nodes {n_{i,x_1}, ..., n_{i,x_p}} ⊆ V_i, where each n_{i,x_q}, 1 ≤ q ≤ p, is reachable from e_{i,j,z} in G'_i. That is, the nodes in fs_{i,j,z} are from all paths in G'_i starting at e_{i,j,z}.

  • The interprocedural forward slice of SyVC e_{i,j,z} in program P, denoted by fs'_{i,j,z}, is a forward slice going beyond the boundaries of function f_i (caused by function calls).

  • The backward slice of SyVC e_{i,j,z} in f_i, denoted by bs_{i,j,z}, is defined as an ordered set of nodes {n_{i,y_1}, ..., n_{i,y_r}} ⊆ V_i, where each n_{i,y_q}, 1 ≤ q ≤ r, is a node from which e_{i,j,z} is reachable in G'_i. That is, the nodes in bs_{i,j,z} are from all paths in G'_i ending at e_{i,j,z}.

  • The interprocedural backward slice of SyVC e_{i,j,z} in program P, denoted by bs'_{i,j,z}, is a backward slice going beyond the boundaries of function f_i (caused by function calls).

  • Given an interprocedural forward slice fs'_{i,j,z} and an interprocedural backward slice bs'_{i,j,z}, the (interprocedural) program slice of SyVC e_{i,j,z}, denoted by ps_{i,j,z}, is defined as an ordered set of nodes (belonging to the PDGs of the functions in P) obtained by merging fs'_{i,j,z} and bs'_{i,j,z} at the SyVC e_{i,j,z}.

Defining SeVCs. Having extracted the program slices of SyVCs, we can transform them to SeVCs according to:

Definition 8 (SeVC)

Given a program P and a SyVC e_{i,j,z} in statement s_{i,j} of function f_i, the SeVC corresponding to SyVC e_{i,j,z}, denoted by δ_{i,j,z}, is defined as an ordered subset of statements in P, denoted by {s_{a_1,b_1}, ..., s_{a_v,b_v}}, where a data dependency or control dependency exists between statement s_{a_u,b_u} (1 ≤ u ≤ v) and the SyVC e_{i,j,z}. In other words, a SeVC δ_{i,j,z} is an ordered set of statements that correspond to the nodes of the (interprocedural) program slice ps_{i,j,z}.


Input: A program P = {f_1, f_2, ..., f_η}; a set Y of SyVCs generated by Algorithm 1

Output: The set C of SeVCs

1:  C ← ∅;
2:  for each f_i ∈ P do
3:      Generate a PDG G'_i for f_i;
4:  end for
5:  for each SyVC e_{i,j,z} ∈ Y do
6:      Generate forward slice fs_{i,j,z} & backward slice bs_{i,j,z} of e_{i,j,z};
7:      Generate interprocedural forward slice fs'_{i,j,z} by interconnecting fs_{i,j,z} and the forward slices from the functions called by f_i;
8:      Generate interprocedural backward slice bs'_{i,j,z} by interconnecting bs_{i,j,z} and the backward slices from both the functions called by f_i and the functions calling f_i;
9:      ps_{i,j,z} ← fs'_{i,j,z} ∪ bs'_{i,j,z};   {throughout this algorithm, “set ∪” means ordered set union; see text for explanations}
10:      for each statement s_{i,x} appearing in ps_{i,j,z} as a node do
11:          δ_{i,j,z} ← δ_{i,j,z} ∪ {s_{i,x}}, according to the order of the appearance of s_{i,x} in f_i;
12:      end for
13:      for two statements s_{a,x} and s_{b,y} (a ≠ b) appearing in ps_{i,j,z} as nodes do
14:          if f_a calls f_b then
15:              Order δ_{i,j,z} such that s_{a,x} precedes s_{b,y};
16:          else
17:              Order δ_{i,j,z} such that s_{b,y} precedes s_{a,x};
18:          end if
19:      end for
20:      C ← C ∪ {δ_{i,j,z}};
21:  end for
22:  return C; {the set of SeVCs}
Algorithm 2 Transforming SyVCs to SeVCs
Fig. 3: Elaborating the SyVC→SeVC transformation in Algorithm 2 for the SyVC “data”, where solid arrows (i.e., directed edges) represent data dependency, and dashed arrows represent control dependency.

Algorithm for computing SeVCs and running examples. Algorithm 2 summarizes the preceding discussion in three steps: generating PDGs; generating program slices of the SyVCs output by Algorithm 1; and transforming program slices to SeVCs. In what follows we elaborate these steps and use Fig. 3 to illustrate a running example. Specifically, Fig. 3 elaborates the SyVC→SeVC transformation of the SyVC “data” (related to pointer usage) while accommodating semantic information induced by data dependency and control dependency.

Step 1 (lines 2-4 in Algorithm 2). This step generates a PDG for each function. For this purpose, there are standard algorithms (e.g., [26]). As a running example, the second column of Fig. 3 shows the PDGs of the two functions involved in the example of Fig. 2, where each number represents the line number of a statement.

Step 2 (lines 6-9 in Algorithm 2). This step generates the program slice ps_{i,j,z} for each SyVC e_{i,j,z}. The interprocedural forward slice fs'_{i,j,z} is obtained by merging fs_{i,j,z} and the forward slices from the functions called by f_i. The interprocedural backward slice bs'_{i,j,z} is obtained by merging bs_{i,j,z} and the backward slices from both the functions called by f_i and the functions calling f_i. Finally, fs'_{i,j,z} and bs'_{i,j,z} are merged into a program slice ps_{i,j,z}.

As a running example, the third column in Fig. 3 shows the program slice of the SyVC “data”, where the backward slice stays within the function containing the SyVC and the forward slice also extends into the function it calls. It is worth mentioning that for obtaining the forward slice of a SyVC, we leverage only data dependency, for two reasons: (i) statements affected by a SyVC via control dependency would not be vulnerable in most cases, and (ii) utilizing statements that have a control dependency on a SyVC would involve many statements that have little to do with vulnerabilities. On the other hand, for obtaining the backward slice of a SyVC, we leverage both data dependency and control dependency.

Step 3 (lines 10-19 in Algorithm 2). This step transforms program slices to SeVCs as follows. First, the algorithm transforms the statements belonging to a function f_i and appearing in ps_{i,j,z} as nodes to a SeVC, while preserving the order of these statements in f_i. As a running example shown in Fig. 3, 13 statements belong to the function containing the SyVC, and 3 statements belong to the function it calls. According to the order of these statements in the two functions, we obtain two ordered sets of statements: lines {7, 9, 10, 11, 12, 14, 16, 18, 22, 23, 24, 25, 26} and lines {1, 3, 4}.

Second, the algorithm transforms the statements belonging to different functions to a SeVC. For statements s_{a,x} and s_{b,y} (a ≠ b) appearing in ps_{i,j,z} as nodes, if f_a calls f_b, then the statements follow the order of the function call, that is, s_{a,x} precedes s_{b,y}; otherwise, s_{b,y} precedes s_{a,x}. As a running example shown in Fig. 3, the SeVC is {7, 9, 10, 11, 13, 14, 16, 18, 22, 23, 24, 25, 26, 1, 3, 4}, in which the statements of the calling function appear before the statements of the called function because the former calls the latter. The fourth column in Fig. 3 shows the SeVC corresponding to the SyVC “data”, namely an ordered set of statements that are semantically related to the SyVC “data”.
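For readers who want to prototype this step, the sketch below shows the intra-procedural core of Algorithm 2, assuming each PDG is a networkx.DiGraph whose nodes are statement identifiers (e.g., line numbers) and whose edges carry a 'kind' attribute set to 'data' or 'control'; these representation choices are illustrative, and the interprocedural merging of lines 7-8 is omitted.

    # Assumed PDG encoding: networkx.DiGraph with edge attribute kind in {"data", "control"}.
    def forward_slice(pdg, syvc_node):
        # Forward slice: follow only data-dependency edges out of the SyVC (see discussion above).
        reached, stack = {syvc_node}, [syvc_node]
        while stack:
            n = stack.pop()
            for _, succ, kind in pdg.out_edges(n, data="kind"):
                if kind == "data" and succ not in reached:
                    reached.add(succ)
                    stack.append(succ)
        return reached

    def backward_slice(pdg, syvc_node):
        # Backward slice: follow both data- and control-dependency edges into the SyVC.
        reached, stack = {syvc_node}, [syvc_node]
        while stack:
            n = stack.pop()
            for pred, _, kind in pdg.in_edges(n, data="kind"):
                if pred not in reached:
                    reached.add(pred)
                    stack.append(pred)
        return reached

    def sevc_statements(pdg, syvc_node):
        # Merge the two slices and order statements by their appearance in the function
        # (approximated here by sorting line numbers); cross-function ordering is handled
        # separately, as described in Step 3.
        return sorted(forward_slice(pdg, syvc_node) | backward_slice(pdg, syvc_node))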

II-B3 Encoding SeVCs into Vectors

Algorithm 3 encodes SeVCs into vectors in three steps.

Step 1 (lines 2-6 in Algorithm 3). Each SeVC is transformed to a symbolic representation. For this purpose, we propose removing non-ASCII characters and comments, then mapping variable names to symbolic names (e.g., “V1”, “V2”) in a one-to-one fashion, and finally mapping function names to symbolic names (e.g., “F1”, “F2”) in a one-to-one fashion. Note that different SeVCs may have the same symbolic representation.


Input: A set C of SeVCs generated by Algorithm 2; a threshold θ

Output: The set R of vectors corresponding to the SeVCs

1:  R ← ∅;
2:  for each SeVC δ ∈ C do
3:      Remove non-ASCII characters in δ;
4:      Map variable names in δ to symbolic names;
5:      Map function names in δ to symbolic names;
6:  end for
7:  for each SeVC δ ∈ C do
8:      R_δ ← ∅;
9:      Divide δ into a set of symbols S;
10:      for each symbol α ∈ S in order do
11:          Transform α to a fixed-length vector v(α);
12:          R_δ ← R_δ ⊕ v(α), where ⊕ means concatenation;
13:      end for
14:      if R_δ is shorter than θ then
15:          Zeroes are padded to the end of R_δ;
16:      else if the portion of R_δ corresponding to the forward slice is shorter than θ then
17:          Delete the leftmost portion of R_δ to make its length θ;
18:      else if the portion of R_δ corresponding to the backward slice is shorter than θ then
19:          Delete the rightmost portion of R_δ to make its length θ;
20:      else
21:          Delete almost the same length from the leftmost portion and the rightmost portion of R_δ to make its length θ;
22:      end if
23:      R ← R ∪ {R_δ};
24:  end for
25:  return R; {the set of vectors corresponding to SeVCs}
Algorithm 3 Transforming SeVCs to vectors

Step 2 (lines 8-13 in Algorithm 3). This step is to encode the symbolic representations into vectors. For this purpose, we propose dividing the symbolic representation of a SeVC (e.g., “V1=V2-8;”) into a sequence of symbols via a lexical analysis (e.g., “V1”, “=”, “V2”, “-”, “8”, and “;”). We transform a symbol to a fixed-length vector. By concatenating the vectors, we obtain a vector for each SeVC.

Step 3 (lines 14-22 in Algorithm 3). Because (i) the numbers of symbols, and therefore the lengths of the vectors representing SeVCs, may differ and (ii) neural networks take vectors of the same length as input, we use a threshold θ as the length of the vectors input to the neural network. When a vector is shorter than θ, zeroes are padded to its end. When a vector is longer than θ, we consider three scenarios: (i) if the portion of the vector corresponding to the forward slice is shorter than θ, we delete the leftmost portion of the vector to make its length θ; (ii) if the portion of the vector corresponding to the backward slice is shorter than θ, we delete the rightmost portion of the vector to make its length θ; (iii) otherwise, we delete almost the same length from the leftmost portion and the rightmost portion of the vector to make its length θ.
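The sketch below illustrates Steps 1 and 3 of Algorithm 3: identifier symbolization and length normalization. The helper names and the way the backward/forward split is passed in are simplifications for illustration, not the authors' exact procedure.

    def symbolize(tokens, user_vars, user_funcs):
        # Step 1: map user-defined variable/function names to "V1", "V2", ... / "F1", "F2", ...
        var_map = {v: "V%d" % (i + 1) for i, v in enumerate(user_vars)}
        fun_map = {f: "F%d" % (i + 1) for i, f in enumerate(user_funcs)}
        return [var_map.get(t, fun_map.get(t, t)) for t in tokens]

    def fit_to_length(vec, theta, backward_len):
        # Step 3: pad with zeroes, or truncate while trying to keep the SyVC region.
        if len(vec) <= theta:
            return vec + [0.0] * (theta - len(vec))
        excess = len(vec) - theta
        forward_len = len(vec) - backward_len
        if forward_len < theta:       # forward part is short: drop from the left
            return vec[excess:]
        if backward_len < theta:      # backward part is short: drop from the right
            return vec[:theta]
        left = excess // 2            # otherwise drop roughly equally from both ends
        return vec[left:left + theta]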

II-B4 Labeling SeVCs and Corresponding Vectors

In order to learn a deep neural network, we label the vectors (i.e., the SeVCs they represent) as vulnerable or not as follows: A SeVC (i.e., the vector representing it) containing a known vulnerability is labeled as “1” (i.e., vulnerable), and “0” otherwise (i.e., not vulnerable). A learned deep neural network encodes vulnerability patterns and can detect whether given SeVCs are vulnerable or not.

II-C Evaluation Metrics

The effectiveness of vulnerability detectors can be evaluated by the following 5 widely-used metrics [27]: false-positive rate (FPR), false-negative rate (FNR), accuracy (A), precision (P), and F1-measure (F1). Let TP denote the number of vulnerable samples that are detected as vulnerable, FP denote the number of samples that are not vulnerable but are detected as vulnerable, TN denote the number of samples that are not vulnerable and are detected as not vulnerable, and FN denote the number of vulnerable samples that are detected as not vulnerable. Table I summarizes their definitions.

Metric Formula Meaning
False-positive rate FPR = FP / (FP + TN) The proportion of false-positive samples in the total samples that are not vulnerable.
False-negative rate FNR = FN / (TP + FN) The proportion of false-negative samples in the total samples that are vulnerable.
Accuracy A = (TP + TN) / (TP + FP + TN + FN) The correctness of all detected samples.
Precision P = TP / (TP + FP) The correctness of detected vulnerable samples.
F1-measure F1 = 2 · P · (1 - FNR) / (P + (1 - FNR)) The overall effectiveness considering both precision and false-negative rate.
TABLE I: Evaluation metrics.
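For reference, the five metrics can be computed directly from the four counts as follows (a small helper; it assumes nonzero denominators):

    def metrics(tp, fp, tn, fn):
        fpr = fp / (fp + tn)                    # false-positive rate
        fnr = fn / (tp + fn)                    # false-negative rate
        acc = (tp + tn) / (tp + fp + tn + fn)   # accuracy
        p = tp / (tp + fp)                      # precision
        tpr = 1.0 - fnr                         # recall
        f1 = 2 * p * tpr / (p + tpr)            # F1-measure
        return fpr, fnr, acc, p, f1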

III Experiments and Results

III-A Research Questions and Dataset

Research questions. Our experiments are geared towards answering the following Research Questions (RQs):

  • RQ1: Can SySeVR make BLSTM detect multiple kinds (vs. single kind) of vulnerabilities?

  • RQ2: Can SySeVR make multiple kinds of deep neural networks detect multiple kinds of vulnerabilities? Can we explain their (in)effectiveness?

  • RQ3: Can accommodating control-dependency make SySeVR more effective, and by how much?

  • RQ4: How much more effective are SySeVR-based methods than the state-of-the-art methods?

In order to answer these questions, we implement the following 6 deep neural networks in Python using Theano [28]: CNN [29], DBN [30], and RNNs (including LSTM, GRU, BLSTM, and BGRU [31, 32, 33]). The computer running the experiments has an NVIDIA GeForce GTX 1080 GPU and an Intel Xeon E5-1620 CPU running at 3.50GHz.

Vulnerability dataset. We produce a vulnerability dataset from two sources: NVD [21] and SARD [22]. NVD contains vulnerabilities in production software and possibly diff files describing the difference between a vulnerable piece of code and its patched version. SARD contains production, synthetic and academic programs (also known as test cases), which are categorized as “good” (i.e., having no vulnerabilities), “bad” (i.e., having vulnerabilities), and “mixed” (i.e., having vulnerabilities whose patched versions are also available).

For NVD, we focus on 19 popular C/C++ open source products (same as in [11]) and their vulnerabilities that are accompanied by diff files, which are needed for extracting vulnerable pieces of code. As a result, we collect 1,592 open source C/C++ programs, of which 874 are vulnerable. For SARD, we collect 14,000 C/C++ programs, of which 13,906 programs are vulnerable (i.e., “bad” or “mixed”). In total, we collect 15,592 programs, of which 14,780 are vulnerable; these vulnerable programs contain 126 types of vulnerabilities, where each type is uniquely identified by a Common Weakness Enumeration IDentifier (CWE ID) [34]. The 126 CWE IDs are published with our dataset.

III-B Experiments

The experiments follow the SySeVR framework, with elaborations when necessary.

III-B1 Extracting SyVCs

In what follows we will elaborate the two components in Algorithm 1 that are specific to different kinds of vulnerabilities: the extraction of vulnerability syntax characteristics and how to match them.

Extracting vulnerability syntax characteristics. In order to extract syntax characteristics of known vulnerabilities, it would be natural to extract the vulnerable lines of code from the vulnerable programs mentioned above, and analyze their syntax characteristics. However, this is an extremely time-consuming task, which prompts us to leverage the C/C++ vulnerability rules of a state-of-the-art commercial tool, Checkmarx [6], to analyze vulnerability syntax characteristics. As we will see, this alternate method is effective because it covers 93.6% of the vulnerable programs collected from SARD. It is worth mentioning that we choose Checkmarx over open-source tools (e.g., Flawfinder [4] and RATS [5]) because the latter have simple parsers and imperfect rules [35].

Our manual examination of Checkmarx rules leads to the following 4 syntax characteristics (each accommodating many vulnerabilities).

  • Library/API Function Call (FC for short): This syntax characteristic covers 811 library/API function calls, which are published with our dataset. These 811 function calls correspond to 106 CWE IDs.

  • Array Usage (AU for short): This syntax characteristic covers 87 CWE IDs related to arrays (e.g., issues related to array element access, array address arithmetic).

  • Pointer Usage (PU for short): This syntax characteristic covers 103 CWE IDs related to pointers (e.g., improper use in pointer arithmetic, reference, address transfer as a function parameter).

  • Arithmetic Expression (AE for short): This syntax characteristic covers 45 CWE IDs related to improper arithmetic expressions (e.g., integer overflow).

Fig. 4 shows that these 4 syntax characteristics overlap with each other in terms of the CWE IDs they cover (e.g., 39 CWE IDs exhibit all of the 4 syntax characteristics).

Fig. 4: Venn diagram of FC, AU, PU, and AE in terms of the CWE IDs they cover, where FC, AU, PU, and AE cover 106, 87, 103, and 45 CWE IDs, respectively.

Matching syntax characteristics. In order to use Algorithm 1 to extract SyVCs, we need to determine whether or not a code element e_{i,j,z}, which is on the abstract syntax tree T_i of function f_i in program P, matches a vulnerability syntax characteristic. Note that T_i can be generated by using Joern [36]. The following method, as illustrated in Fig. 5 via the example program shown in Fig. 2, can automatically decide whether or not a code element matches a syntax characteristic (a small code sketch of these rules follows the list).

Fig. 5: Examples for illustrating the matching of syntax characteristics, where a highlighted node matches some vulnerability syntax characteristic and therefore is a SyVC.
  • As illustrated in Fig. 5(a), we say code element e_{i,j,z} (i.e., “memset”) matches the FC syntax characteristic if (i) e_{i,j,z} on T_i is a “callee” (i.e., the function is called), and (ii) e_{i,j,z} is one of the 811 function calls mentioned above.

  • As illustrated in Fig. 5(b), we say code element e_{i,j,z} (i.e., “source”) matches the AU syntax characteristic if (i) e_{i,j,z} is an identifier declared in an identifier declaration statement (i.e., IdentifierDeclStatement) node and (ii) the IdentifierDeclStatement node contains the characters ‘[’ and ‘]’.

  • As illustrated in Fig. 5(c), we say code element e_{i,j,z} (i.e., “data”) matches the PU syntax characteristic if (i) e_{i,j,z} is an identifier declared in an IdentifierDeclStatement node and (ii) the IdentifierDeclStatement node contains the character ‘*’.

  • As illustrated in Fig. 5(d), we say code element e_{i,j,z} (“data=dataBuffer-8”) matches the AE syntax characteristic if (i) e_{i,j,z} is an expression statement (i.e., ExpressionStatement) node and (ii) e_{i,j,z} contains the character ‘=’ and has one or more identifiers on the right-hand side of ‘=’.
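The sketch below shows how such matching rules can be automated over exported AST nodes. The node fields (type, code, is_callee) mimic a Joern-style export and are illustrative assumptions, as is the shortened FC list (the real FC rule checks against the 811 library/API function calls); the AU and PU rules are simplified to inspect the declaration statement of the identifier.

    import re

    FC_CALLS = {"memset"}   # stands in for the 811 library/API function calls

    def matches_fc(node):
        # FC: the node is a callee and the called function is on the list.
        return node.is_callee and node.code in FC_CALLS

    def matches_au(decl_stmt):
        # AU: an IdentifierDeclStatement node whose code contains '[' and ']'.
        return decl_stmt.type == "IdentifierDeclStatement" and \
               "[" in decl_stmt.code and "]" in decl_stmt.code

    def matches_pu(decl_stmt):
        # PU: an IdentifierDeclStatement node whose code contains '*'.
        return decl_stmt.type == "IdentifierDeclStatement" and "*" in decl_stmt.code

    def matches_ae(stmt):
        # AE: an ExpressionStatement containing '=' with an identifier on its right-hand side.
        if stmt.type != "ExpressionStatement" or "=" not in stmt.code:
            return False
        rhs = stmt.code.split("=", 1)[1]
        return re.search(r"[A-Za-z_]\w*", rhs) is not None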

Extracting SyVCs. Now we can use Algorithm 1 to extract SyVCs from the 15,592 programs. Corresponding to the 4 syntax characteristics, we extract 4 kinds of SyVCs:

  • FC-kind SyVCs: We extract 6,304 from NVD and 58,099 from SARD, or 64,403 in total.

  • AU-kind SyVCs: We extract 9,776 from NVD and 32,453 from SARD, or 42,229 in total.

  • PU-kind SyVCs: We extract 73,856 from NVD and 217,985 from SARD, or 291,841 in total.

  • AE-kind SyVCs: We extract 5,264 from NVD and 16,890 from SARD, or 22,154 in total.

Putting them together, we extract 420,627 SyVCs, which cover 13,016 (out of the 13,906, or 93.6%) vulnerable programs collected from SARD; this coverage validates our idea of using Checkmarx rules to derive vulnerability syntax characteristics. Note that we can compute the coverage 93.6% because SARD gives the precise location of each vulnerability; in contrast, we cannot compute the coverage with respect to NVD because it does not give precise locations of vulnerabilities. The average time for extracting a SyVC is 270 milliseconds.

III-B2 Transforming SyVCs to SeVCs

When using Algorithm 2 to transform SyVCs to SeVCs, we use Joern [36] to extract PDGs. Corresponding to the 420,627 SyVCs extracted from Algorithm 1, Algorithm 2 generates 420,627 SeVCs (while recalling that one SyVC is transformed to one SeVC). In order to see the effect of semantic information, we actually use Algorithm 2 to generate two sets of SeVCs: one set accommodating semantic information induced by data dependency only, and the other set accommodating semantic information induced by both data dependency and control dependency. In either case, the second column of Table II summarizes the numbers of SeVCs categorized by the kinds of SyVCs from which they are transformed. In terms of the efficiency of the SyVC→SeVC transformation, on average it takes 331 milliseconds to generate a SeVC accommodating data dependency and 362 milliseconds to generate a SeVC accommodating data dependency and control dependency.

Kind of SyVCs #SeVCs #Vul. SeVCs #Not vul. SeVCs
FC-kind 64,403 13,603 50,800
AU-kind 42,229 10,926 31,303
PU-kind 291,841 28,391 263,450
AE-kind 22,154 3,475 18,679
Total 420,627 56,395 364,232
TABLE II: The number of SeVCs, vulnerable SeVCs, and not vulnerable SeVCs from the 15,592 programs.

III-B3 Encoding SeVCs into Vector Representation

We use Algorithm 3 to encode SeVCs into vector representation. For this purpose, we adopt [37] to encode the symbols of each SeVC into fixed-length vectors. Then, each SeVC is represented by the concatenation of the vectors representing its symbols. We set each SeVC to have 500 symbols (padding or truncating if necessary, as discussed in Algorithm 3) and the length of each symbol vector to 30, meaning that each SeVC is represented by a vector of 500 × 30 elements.
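As an illustration of this encoding step, the sketch below assumes a word2vec-style token embedding (here via gensim ≥ 4; the specific embedding model is whatever [37] prescribes) in which each symbol is mapped to a 30-dimensional vector and the per-symbol vectors are concatenated for each SeVC. The toy corpus and the simple head-truncation are illustrative; the SyVC-preserving truncation rule is the one in Algorithm 3.

    from gensim.models import Word2Vec

    # corpus: symbolized SeVCs already split into symbols (Step 2 of Algorithm 3); toy example here.
    corpus = [["V1", "=", "V2", "-", "8", ";"], ["F1", "(", "V1", ")", ";"]]
    model = Word2Vec(sentences=corpus, vector_size=30, min_count=1)   # 30-dimensional symbol vectors

    def encode_sevc(symbols, n_symbols=500, dim=30):
        vecs = [model.wv[s] if s in model.wv else [0.0] * dim for s in symbols[:n_symbols]]
        flat = [float(x) for v in vecs for x in v]
        flat += [0.0] * (n_symbols * dim - len(flat))   # zero-pad to the fixed input length
        return flat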

III-B4 Generating Ground Truth Labels of SeVCs

For SeVCs extracted from NVD, we examine the vulnerabilities whose diff files involve line deletion; we do not consider the diff files that only involve line additions because in these cases the vulnerable statements are not given by NVD. We generate the ground truth labels in 3 steps. Step 1: Parse a diff file to mark the lines that are prefixed with a “-” and are deleted/modified, and the lines that are prefixed with a “-” and are moved (i.e., deleted at one place and added at another place). Step 2: If a SeVC contains at least one deleted/modified statement that is prefixed with a “-”, it is labeled as “1” (i.e., vulnerable); if a SeVC contains at least one moved statement prefixed with a “-” and the detected file contains a known vulnerability, it is labeled as “1”; otherwise, a SeVC is labeled as “0” (i.e., not vulnerable). Step 3: Check the SeVCs that are labeled as “1” because Step 2 may mislabel some SeVCs that are not vulnerable as “1” (while noting that it is not possible to mislabel a vulnerable SeVC as “0”). Among the preceding 3 steps, the first two are automated, but the last one is manual.
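A rough sketch of the automated part of this procedure (Steps 1 and 2), assuming unified-diff input; hunk parsing and the handling of moved lines are simplified, and the manual check of Step 3 is of course not shown:

    def minus_lines(diff_text):
        # Step 1: collect the contents of lines prefixed with "-" (deleted or modified).
        return [line[1:].strip()
                for line in diff_text.splitlines()
                if line.startswith("-") and not line.startswith("---")]

    def label_sevc(sevc_statements, removed):
        # Step 2 (simplified): tentatively label a SeVC "1" if it contains a "-" line, else "0".
        return 1 if any(s.strip() in removed for s in sevc_statements) else 0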

For SeVCs extracted from SARD, a SeVC extracted from a “good” program is labeled as “0”; a SeVC extracted from a “bad” or “mixed” program is labeled as “1” if the SeVC contains at least one vulnerable statement, and “0” otherwise.

In total, 56,395 SeVCs are labeled as “1” and 364,232 SeVCs are labeled as “0”. The third and fourth columns of Table II summarize the number of vulnerable vs. not vulnerable SeVCs corresponding to each kind of SyVCs. The ground-truth label of the vector corresponding to a SeVC is the same as the ground truth label of the SeVC.

III-C Experimental Results

For learning a deep neural network, we randomly select 80% of the programs from NVD and from SARD, respectively, for training (training programs), and use the remaining 20% for testing (testing programs).
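The split is performed at the granularity of programs, so that SeVCs derived from the same program never appear in both sets; a minimal sketch (the variable names nvd_programs, sard_programs, and sevc_vectors are illustrative), applied to the NVD and SARD programs separately:

    from sklearn.model_selection import train_test_split

    def split_programs(programs, seed=0):
        # 80% training programs, 20% testing programs.
        return train_test_split(programs, test_size=0.2, random_state=seed)

    nvd_train, nvd_test = split_programs(nvd_programs)
    sard_train, sard_test = split_programs(sard_programs)
    train_vectors = [v for p in nvd_train + sard_train for v in sevc_vectors[p]]
    test_vectors = [v for p in nvd_test + sard_test for v in sevc_vectors[p]]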

III-C1 Experiments for Answering RQ1

In this experiment, we use BLSTM as in [11] and the SeVCs accommodating semantic information induced by data and control dependencies. We randomly choose 30,000 SeVCs extracted from the training programs as the training set (12.7% of which are vulnerable and the rest 87.3% are not) and 7,500 SeVCs extracted from the testing programs as the testing set (12.2% of which are vulnerable and the rest 87.8% are not). Both sets contain SeVCs corresponding to the 4 kinds of SyVCs, proportional to their amount as shown in the second column of Table II. For fair comparison with VulDeePecker [11], we also randomly choose 30,000 SeVCs corresponding to the FC-kind SyVCs extracted from the training programs (22.8% of which are vulnerable and the rest 77.2% are not) as the training set, and randomly choose 7,500 SeVCs corresponding to the FC-kind SyVCs extracted from the testing programs as the testing set (22.0% of which are vulnerable and the rest 78.0% are not). These SeVCs only accommodate semantic information induced by data dependency (as in [11]).

The main parameters for learning BLSTM are: dropout is 0.2; batch size is 16; number of epochs is 20; output dimension is 256; minibatch stochastic gradient descent together with ADAMAX [38] is used for training with a default learning rate of 0.002; dimension of hidden vectors is 500; and number of hidden layers is 2.
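One plausible reading of these hyperparameters, expressed as a Keras sketch (the authors' implementation uses Theano directly; the stacking of two bidirectional layers, the mapping of "output dimension" to 256 LSTM units, and the (500, 30) input shape from Section III-B3 are our illustrative assumptions):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Bidirectional, LSTM, Dropout, Dense
    from tensorflow.keras.optimizers import Adamax

    model = Sequential([
        # 500 time steps (symbols), each a 30-dimensional symbol vector
        Bidirectional(LSTM(256, return_sequences=True), input_shape=(500, 30)),
        Dropout(0.2),
        Bidirectional(LSTM(256)),
        Dropout(0.2),
        Dense(1, activation="sigmoid"),        # vulnerable vs. not vulnerable
    ])
    model.compile(optimizer=Adamax(learning_rate=0.002),
                  loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(x_train, y_train, batch_size=16, epochs=20)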

Kind of SyVC FPR (%) FNR (%) A (%) P (%) F1 (%)
VulDeePecker w/ FC-kind 4.1 21.7 92.0 84.0 81.0
SySeVR-BLSTM w/ FC-kind 3.8 9.6 94.9 87.3 88.8
SySeVR-BLSTM w/ AU-kind 6.3 11.7 92.4 82.5 85.3
SySeVR-BLSTM w/ PU-kind 2.1 13.6 96.9 80.2 83.2
SySeVR-BLSTM w/ AE-kind 1.5 12.5 97.3 87.5 87.5
SySeVR-BLSTM w/ all-kinds 2.9 12.1 95.9 82.5 85.2
TABLE III: Effectiveness of VulDeePecker [11] vs. effectiveness of BLSTM in the SySeVR framework.

Table III summarizes the results. We observe that SySeVR-enabled BLSTM (or SySeVR-BLSTM) with FC-kind SyVCs leads to the lowest FNR (9.6%), but its FPR is higher than that of PU- and AE-kind SyVCs. The other 3 types of SyVCs lead to, on average, a FPR of 3.3% and a FNR of 12.6%. Overall, SySeVR makes BLSTM achieve a lower FPR (0.3% lower) and a lower FNR (12.1% lower) than VulDeePecker for the same kind of vulnerabilities. This leads to:

Insight 1

SySeVR-BLSTM detects various kinds of vulnerabilities, and can reduce FNR by 12.1%.

III-C2 Experiments for Answering RQ2

In order to answer RQ2, we train 1 CNN, 1 DBN, and 4 RNNs (i.e., LSTM, GRU, BLSTM, and BGRU) using the same dataset as in Section III-C1. Table IV summarizes the results. We observe that when compared with unidirectional RNNs (i.e., LSTM and GRU), bidirectional RNNs (i.e., BLSTM and BGRU) can reduce FNR by 3.1% on average, at the price of increasing FPR by 0.3% on average. This phenomenon might be caused by the following: Bidirectional RNNs can accommodate more information about the statements that appear before and after the statement in question. In summary,

Insight 2

SySeVR-enabled bidirectional RNNs (especially BGRU) are more effective than CNNs, which in turn are more effective than DBNs. Still, their FNRs are consistently much higher than their FPRs.

Neural network FPR (%) FNR (%) A (%) P (%) F1 (%)
CNN 2.1 16.3 95.9 86.5 85.0
DBN 11.0 83.6 78.8 19.4 17.8
LSTM 2.5 15.9 95.7 83.7 83.9
GRU 2.5 14.7 95.9 84.9 85.1
BLSTM 2.9 12.1 95.9 82.5 85.2
BGRU 2.7 12.3 96.0 84.1 85.9
TABLE IV: Comparison between 6 neural networks.

Towards explaining the effectiveness of BGRU in vulnerability detection. It is important, but an outstanding open problem, to explain the effectiveness of deep neural networks. Now we report our initial effort along this direction. In what follows we focus on BGRU because it is more effective than the others.

In order to explain the effectiveness of BGRU, we review its structure in Fig. 6. For each SeVC and each time step, there is an output (belonging to [0, 1]) at the activation layer. The output of BGRU is the output of the last time step at the activation layer; the closer this output is to 1, the more likely the SeVC is classified as vulnerable. For the classification of a SeVC, we identify the tokens (i.e., the symbols representing them) that play a critical role in determining its classification. This can be achieved by looking at the activation-layer outputs at consecutive time steps: we find that if the activation-layer output at a time step is substantially (e.g., 0.6) greater (vs. smaller) than the activation-layer output at the preceding time step, then the token at that time step plays a critical role in classifying the SeVC as vulnerable (correspondingly, not vulnerable). Moreover, we find that many false-negatives are caused by the token “if” or the token(s) following it, because these tokens often appear in SeVCs that are not vulnerable. We also find that many false-positives are caused by the token(s) related to library/API function calls and their arguments, because these tokens often appear in SeVCs that are vulnerable. In summary, we have

Fig. 6: The structure of BGRU.
Insight 3

The effectiveness of BGRU is substantially influenced by the training data. If some syntax elements often appear in SeVCs that are vulnerable (vs. not vulnerable), then these elements may cause false-positives (correspondingly, false-negatives).

Using tailored or universal vulnerability detectors? In practice, we often deal with multiple kinds of vulnerabilities (or SyVCs). This raises a new question: should we use multiple neural networks (one for each kind of SyVCs) or a single neural network (accommodating 4 kinds of SyVCs) to detect vulnerabilities? In order to answer this question, we conduct the following experiment. For FC-, AU- and PU-kind SyVCs, we randomly choose 30,000 SeVCs extracted from the training programs as the training set and 7,500 SeVCs extracted from the testing programs as the testing set, where the SeVCs accommodate semantic information induced by data dependency and control dependency. For the much fewer AE-kind vulnerabilities, we use all of the 20,336 SeVCs extracted from the training programs for training (15.9% of which are vulnerable and the rest 84.1% are not), and all of the 1,818 SeVCs extracted from the testing programs for testing (12.0% of which are vulnerable and the rest 88.0% are not).

Kind of SyVC FPR (%) FNR (%) A (%) P (%) F1 (%)
FC-kind 3.1 7.6 95.9 89.5 90.9
AU-kind 3.0 10.2 95.2 90.6 90.2
PU-kind 1.7 22.7 96.2 83.2 80.1
AE-kind 1.4 3.8 98.2 93.7 94.9
All-kinds 2.7 12.3 96.0 84.1 85.9
TABLE V: Effectiveness of tailored vs. universal BGRU.

Table V summarizes the experimental results. We observe that using a BGRU specific to each kind of SyVCs is more effective, except for PU, than using a single BGRU to detect vulnerabilities related to all 4 kinds of SyVCs. This does not hold for PU likely because, as shown in Table II, the ratio of vulnerable vs. not vulnerable SeVCs for the PU-kind SyVCs is about 1:9, which is much smaller than the average ratio of about 1:4 for the other 3 kinds of SyVCs. As a consequence, the resulting BGRU would be biased toward SeVCs that are not vulnerable, leading to a high FNR. In summary,

Insight 4

It is better to use a BGRU tailored to a specific kind of vulnerabilities than to use a single BGRU for detecting multiple kinds of vulnerabilities.

III-C3 Experiments for Answering RQ3

We conduct experiments to compare the effectiveness of (i) the 6 neural networks learned from the SeVCs that accommodate semantic information induced by data dependency and (ii) the 6 neural networks learned from the SeVCs that accommodate semantic information induced by data dependency and control dependency. In either case, we randomly choose 30,000 SeVCs extracted from the training programs as the training set and 7,500 SeVCs extracted from the testing programs as the testing set. All of these training and testing sets correspond to the 4 kinds of SyVCs, proportional to their amounts as shown in the second column of Table II.

Neural network Kind of SeVC FPR (%) FNR (%) A (%) P (%) F1 (%)
CNN DD 1.8 38.8 93.2 83.8 70.7
DDCD 2.1 16.3 95.9 86.5 85.0
DBN DD 10.9 86.3 79.0 16.3 14.9
DDCD 11.0 83.6 78.8 19.4 17.8
LSTM DD 5.3 39.6 90.2 64.0 62.2
DDCD 2.5 15.9 95.7 83.7 83.9
GRU DD 2.5 37.4 92.8 79.2 69.9
DDCD 2.5 14.7 95.9 84.9 85.1
BLSTM DD 2.9 38.5 92.3 76.7 68.3
DDCD 2.9 12.1 95.9 82.5 85.2
BGRU DD 3.1 31.9 93.0 77.2 72.3
DDCD 2.7 12.3 96.0 84.1 85.9
TABLE VI: Effectiveness of semantic information induced by data dependency (“DD” for short) vs. semantic information induced by data dependency and control dependency (“DDCD” for short).

Table VI summarizes the results. We observe that accommodating semantic information induced by data dependency and control dependency can improve the vulnerability detection capability in almost every scenario, and reduce the FNR by 19.6% on average. This can be explained by the fact that control dependency accommodates extra information useful to vulnerability detection.

Insight 5

The more semantic information is accommodated for learning neural networks, the higher the vulnerability detection capability of the learned neural networks.

III-C4 Experiments for Answering RQ4

We consider BGRU learned from the 341,536 SeVCs corresponding to the 4 kinds of SyVCs extracted from the training programs and the 79,091 SeVCs extracted from the testing programs, while accommodating semantic information induced by data dependency and control dependency. We compare our most effective model, BGRU, with the commercial static vulnerability detection tool Checkmarx [6] and the open-source static analysis tools Flawfinder [4] and RATS [5], because (i) these tools arguably represent the state-of-the-art static analysis for vulnerability detection; (ii) they are widely used for detecting vulnerabilities in C/C++ source code; (iii) they directly operate on the source code (i.e., no need to compile the source code); and (iv) they are available to us. We also consider the state-of-the-art system VUDDY [2], which is particularly suitable for detecting vulnerabilities incurred by code cloning. We further consider VulDeePecker [11]; for SySeVR, we consider all 4 kinds of SyVCs and accommodate both data dependency and control dependency.

Method FPR (%) FNR (%) A (%) P (%) F1 (%)
Flawfinder 21.6 70.4 69.8 22.8 25.7
RATS 21.5 85.3 67.2 12.8 13.7
Checkmarx 20.8 56.8 72.9 30.9 36.1
VUDDY 4.3 90.1 71.2 47.7 16.4
VulDeePecker 2.5 41.8 92.2 78.0 66.6
SySeVR-BGRU 1.4 5.6 98.0 90.8 92.6
TABLE VII: Comparing BGRU in the SySeVR framework and state-of-the-art vulnerability detectors.
Target product | CVE ID | Vulnerable product reported | Vulnerability release date | Vulnerable file in the target product | Kind of SyVC | 1st patched version of target product
Libav 10.3 CVE-2013-**** Ffmpeg **/**/2013 libavcodec/**.c AU-kind
CVE-2013-7020 Ffmpeg 12/09/2013 libavcodec/ffv1dec.c PU-kind Libav 10.4
CVE-2013-**** Ffmpeg **/**/2013 libavcodec/**.c PU-kind
CVE-2014-**** Ffmpeg **/**/2015 libavcodec/**.c PU-kind
CVE-2014-**** Ffmpeg **/**/2014 libavcodec/**.c PU-kind
Libav 9.10 CVE-2014-9676 Ffmpeg 02/27/2015 libavformat/segment.c PU-kind Libav 10.0
Seamonkey 2.32 CVE-2015-4511 Firefox 09/24/2015 …/src/nestegg.c AU-kind Seamonkey 2.38
Seamonkey 2.35 CVE-2015-**** Firefox **/**/2015 …/gonk/**.cpp FC-kind
Thunderbird 38.0.1 CVE-2015-4511 Firefox 09/24/2015 …/src/nestegg.c AU-kind Thunderbird 43.0b1
CVE-2015-**** Firefox **/**/2015 …/gonk/**.cpp FC-kind
Xen 4.4.2 CVE-2013-4149 Qemu 11/04/2014 …/net/virtio-net.c PU-kind Xen 4.4.3
CVE-2015-1779 Qemu 01/12/2016 ui/vnc-ws.c PU-kind Xen 4.5.5
CVE-2015-3456 Qemu 05/13/2015 …/block/fdc.c PU-kind Xen 4.5.1
Xen 4.7.4 CVE-2016-4453 Qemu 06/01/2016 …/display/vmware_vga.c AE-kind Xen 4.8.0
Xen 4.8.2 CVE-2016-**** Qemu **/**/2016 …/net/**.c PU-kind
TABLE VIII: The 15 vulnerabilities, which are detected by BGRU but not reported in the NVD, include 7 unknown vulnerabilities and 8 vulnerabilities that have been “silently” patched.
Method | Degree of automation | Vulnerability cause | Kind of SyVC | Model | Granularity | Information used
Code similarity-based methods | Semi-automatic | Code cloning | | None | Fine/coarse | Code representation
Open source tools (e.g., Flawfinder, RATS) | Manual | Any | | None | Fine | Lexical information
Checkmarx | Manual | Any | | None | Fine | Data-dependency
Feature-based machine learning | Semi-automatic | Any | | Machine learning | Coarse | Code metrics
VulDeePecker | More automatic | Any | FC-kind | BLSTM | Fine | Data-dependency
SySeVR (our work) | More automatic | Any | FC-, AU-, PU-, AE-kind | 6 deep neural networks | Fine | Data-dependency & control-dependency
TABLE IX: Summary of methods for detecting various kinds of vulnerabilities.

Table VII summarizes the experimental results. We observe that SySeVR-enabled BGRU substantially outperforms the state-of-the-art vulnerability detection methods. The open-source Flawfinder and RATS have high FPRs and FNRs. Checkmarx is better than Flawfinder and RATS, but still has high FPRs and FNRs. VUDDY is known to trade a high FNR for a low FPR, because it can only detect vulnerabilities that are nearly identical to the vulnerabilities in the training programs. SySeVR-enabled BGRU is much more effective than VulDeePecker because VulDeePecker cannot cope with kinds of SyVCs other than FC and cannot accommodate semantic information induced by control dependency. Moreover, BGRU learned from a larger training set (i.e., 341,536 SeVCs) is more effective than BGRU learned from a smaller training set (30,000 SeVCs; see Table IV), reducing the FNR by 6.7%. In summary,

Insight 6

SySeVR-enabled BGRU is much more effective than the state-of-the-art vulnerability detection methods.

III-C5 Applying BGRU to Detect Vulnerabilities in Software Products

In order to show the usefulness of SySeVR in detecting vulnerabilities in real-world software products, we apply SySeVR-enabled BGRU to detect vulnerabilities in 4 products: Libav, Seamonkey, Thunderbird, and Xen. Each of these products contains multiple target programs, from which we extract their SyVCs, SeVCs, and vectors. For each product, we apply SySeVR-enabled BGRU to its 20 versions so that we can tell whether some vulnerabilities have been “silently” patched by the vendors when releasing a newer version.

As highlighted in Table VIII, we detect 15 vulnerabilities that are not reported in NVD. Among them, 7 are unknown (i.e., their presence in these products was not known until now) and are indeed similar (upon our manual examination) to the vulnerabilities corresponding to the CVE IDentifiers (CVE IDs) mentioned in Table VIII. We do not give the full details of these vulnerabilities for ethical considerations, but we have reported these 7 vulnerabilities to the vendors. The other 8 vulnerabilities have been “silently” patched by the vendors when releasing newer versions of the products in question.

IV Limitations

The present study has several limitations. First, we focus on detecting vulnerabilities in C/C++ program source code, meaning that the framework may need to be adapted to cope with other programming languages and/or executables. Second, our experiments cover 4 kinds of SyVCs; future research needs to accommodate more kinds of SyVCs. Third, the algorithms for generating SyVCs and SeVCs could be improved to accommodate more syntactic/semantic information for vulnerability detection. Fourth, we detect vulnerabilities at the SeVC granularity (i.e., multiple lines of code that are semantically related to each other), which could be improved to more precisely pin down the line of code that contains a vulnerability. Fifth, we truncate the vectors transformed from SeVCs when they are longer than a threshold; future research needs to investigate how to cope with vectors of varying lengths without losing information to truncation. Sixth, our experiments show that some deep neural networks are more effective than the state-of-the-art vulnerability detection methods. Although we have gained some insights into explaining the “why” part, more investigations are needed to explain the success of deep learning in this context and beyond [39].

V Related Work

Prior studies related to vulnerability detection. There are two approaches to source code-based static vulnerability detection: code similarity-based and pattern-based. Since code similarity-based detectors can only detect vulnerabilities incurred by code cloning and the present study is a pattern-based method, we review prior studies of the latter approach. Table IX summarizes the comparison between SySeVR and previous vulnerability detection methods, which are divided into three categories based on their degree of automation. (i) Manual methods: Vulnerability patterns are manually generated by human experts (e.g., Flawfinder [4], RATS [5], Checkmarx [6]). These tools often incur high false-positives and/or high false-negatives [35], as also confirmed by our experiments (Section III-C4). (ii) Semi-automatic methods: Human experts are needed to manually define features (e.g., imports and function calls [8]; complexity, code churn, and developer activity [40]; API symbols and subtrees [10]; and system calls [7]) for traditional machine learning models, such as support vector machines and k-nearest neighbors. Different vulnerabilities are often described by different features (e.g., format string vulnerabilities [41], information leakage vulnerabilities [42], missing check vulnerabilities [9], and taint-style vulnerabilities [43, 44]). These methods often detect vulnerabilities at a coarse granularity (e.g., program [7], component [8], file [40], or function [10]), meaning that the locations of vulnerabilities cannot be pinned down. (iii) More automatic methods: Human experts do not need to define features. Lin et al. [15] presented a method for automatically learning high-level representations of functions (i.e., coarse-grained and not able to pin down locations of vulnerabilities). VulDeePecker [11] is the first system showing the feasibility of using deep learning to detect vulnerabilities while being able to pin down the locations of vulnerabilities. SySeVR is the first systematic framework for using deep learning to detect vulnerabilities.

Prior studies related to deep learning. Deep learning has been used for program analysis. CNNs have been used for software defect prediction [17] and for detecting malicious URLs, file paths, and registry keys [45]; DBNs have been used for software defect prediction [19, 20]; RNNs have been used for vulnerability detection [11, 15], software traceability [12], code clone detection [13], and recognizing functions in binaries [14]. The present study offers the first systematic framework for using deep learning to detect vulnerabilities.

VI Conclusion

We presented the SySeVR framework for using deep learning to detect vulnerabilities. Based on a large vulnerability dataset we collected, we drew a number of insights, including an explanation of the effectiveness of deep learning in vulnerability detection. Moreover, we detected 15 vulnerabilities that were not reported in the NVD. Among these 15 vulnerabilities, 7 are unknown and have been reported to the vendors, and the other 8 have been “silently” patched by the vendors when releasing newer versions.

There are many interesting problems for future research. In particular, it is important to address the limitations discussed in Section IV.

Acknowledgment

We thank Sujuan Wang and Jialai Wang for collecting the vulnerable programs from the NVD and the SARD datasets. This paper is supported by the National Basic Research Program of China (973 Program) under grant No.2014CB340600 and the National Science Foundation of China under grant No. 61672249. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the funding agencies.

References

  • [1] “CVE,” 2018, http://cve.mitre.org/.
  • [2] S. Kim, S. Woo, H. Lee, and H. Oh, “VUDDY: A scalable approach for vulnerable code clone discovery,” in 2017 IEEE Symposium on Security and Privacy, San Jose, CA, USA, 2017, pp. 595–614.
  • [3] Z. Li, D. Zou, S. Xu, H. Jin, H. Qi, and J. Hu, “VulPecker: An automated vulnerability detection system based on code similarity analysis,” in Proceedings of the 32nd Annual Conference on Computer Security Applications, Los Angeles, CA, USA, 2016, pp. 201–213.
  • [4] “FlawFinder,” 2018, http://www.dwheeler.com/flawfinder.
  • [5] “Rough Auditing Tool for Security (RATS),” 2014, https://code.google.com/archive/p/rough-auditing-tool-for-security/.
  • [6] “Checkmarx,” 2018, https://www.checkmarx.com/.
  • [7] G. Grieco, G. L. Grinblat, L. C. Uzal, S. Rawat, J. Feist, and L. Mounier, “Toward large-scale vulnerability discovery using machine learning,” in Proceedings of the 6th ACM on Conference on Data and Application Security and Privacy, New Orleans, LA, USA, 2016, pp. 85–96.
  • [8] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, “Predicting vulnerable software components,” in Proceedings of the 2007 ACM Conference on Computer and Communications Security, Alexandria, Virginia, USA, 2007, pp. 529–540.
  • [9] F. Yamaguchi, C. Wressnegger, H. Gascon, and K. Rieck, “Chucky: Exposing missing checks in source code for vulnerability discovery,” in 2013 ACM SIGSAC Conference on Computer and Communications Security, Berlin, Germany, 2013, pp. 499–510.
  • [10] F. Yamaguchi, M. Lottmann, and K. Rieck, “Generalized vulnerability extrapolation using abstract syntax trees,” in 28th Annual Computer Security Applications Conference, Orlando, FL, USA, 2012, pp. 359–368.
  • [11] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, “VulDeePecker: A deep learning-based system for vulnerability detection,” in Proceedings of the 25th Annual Network and Distributed System Security Symposium, San Diego, California, USA, 2018.
  • [12] J. Guo, J. Cheng, and J. Cleland-Huang, “Semantically enhanced software traceability using deep learning techniques,” in Proceedings of the 39th International Conference on Software Engineering, Buenos Aires, Argentina, 2017, pp. 3–14.
  • [13] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, Singapore, 2016, pp. 87–98.
  • [14] E. C. R. Shin, D. Song, and R. Moazzezi, “Recognizing functions in binaries with neural networks,” in 24th USENIX Security Symposium, Washington, D.C., USA, 2015, pp. 611–626.
  • [15] G. Lin, J. Zhang, W. Luo, L. Pan, and Y. Xiang, “POSTER: Vulnerability discovery with function representation learning from unlabeled projects,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 2017, pp. 2539–2541.
  • [16] B. Alsulami, E. Dauber, R. E. Harang, S. Mancoridis, and R. Greenstadt, “Source code authorship attribution using long short-term memory based networks,” in Proceedings of the 22nd European Symposium on Research in Computer Security, Oslo, Norway, 2017, pp. 65–82.
  • [17] J. Li, P. He, J. Zhu, and M. R. Lyu, “Software defect prediction via convolutional neural network,” in 2017 IEEE International Conference on Software Quality, Reliability and Security, Prague, Czech Republic, 2017, pp. 318–328.
  • [18] Q. Geng, Z. Zhou, and X. Cao, “Survey of recent progress in semantic image segmentation with CNNs,” SCIENCE CHINA Information Sciences, vol. 61, no. 5, pp. 051101:1–051101:18, 2018.
  • [19] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features for defect prediction,” in Proceedings of the 38th International Conference on Software Engineering, Austin, TX, USA, 2016, pp. 297–308.
  • [20] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, “Deep learning for just-in-time defect prediction,” in 2015 IEEE International Conference on Software Quality, Reliability and Security, Vancouver, BC, Canada, 2015, pp. 17–26.
  • [21] “NVD,” 2018, https://nvd.nist.gov/.
  • [22] “Software assurance reference dataset,” 2018, https://samate.nist.gov/SRD/index.php.
  • [23] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • [24] A. Shrivastava, A. Gupta, and R. B. Girshick, “Training region-based object detectors with online hard example mining,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 761–769.
  • [25] F. Tip, “A survey of program slicing techniques,” J. Prog. Lang., vol. 3, no. 3, 1995.
  • [26] J. Ferrante, K. J. Ottenstein, and J. D. Warren, “The program dependence graph and its use in optimization,” ACM Trans. Program. Lang. Syst., vol. 9, no. 3, pp. 319–349, 1987.
  • [27] M. Pendleton, R. Garcia-Lebron, J. Cho, and S. Xu, “A survey on systems security metrics,” ACM Comput. Surv., vol. 49, no. 4, pp. 62:1–62:35, 2017.
  • [28] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, and R. Pascanu, “Theano: A CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
  • [29] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: A convolutional neural-network approach,” IEEE Trans. Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.
  • [30] G. E. Hinton, “Deep belief networks,” Scholarpedia, vol. 4, no. 5, p. 5947, 2009.
  • [31] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [32] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proceedings of the 8th Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 2014, pp. 103–111.
  • [33] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
  • [34] “Common Weakness Enumeration,” 2018, http://cwe.mitre.org/.
  • [35] F. Yamaguchi, “Pattern-based vulnerability discovery,” Ph.D. dissertation, University of Göttingen, 2015.
  • [36] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and discovering vulnerabilities with code property graphs,” in 2014 IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 2014, pp. 590–604.
  • [37] “word2vec,” 2018, http://radimrehurek.com/gensim/models/word2vec.html.
  • [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
  • [39] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” CoRR, vol. abs/1703.00810, 2017.
  • [40] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities,” IEEE Transactions on Software Engineering, vol. 37, no. 6, pp. 772–787, 2011.
  • [41] U. Shankar, K. Talwar, J. S. Foster, and D. A. Wagner, “Detecting format string vulnerabilities with type qualifiers,” in 10th USENIX Security Symposium, Washington, D.C., USA, 2001.
  • [42] M. Backes, B. Köpf, and A. Rybalchenko, “Automatic discovery and quantification of information leaks,” in Proceedings of the 30th IEEE Symposium on Security and Privacy, Oakland, California, USA, 2009, pp. 141–153.
  • [43] F. Yamaguchi, A. Maier, H. Gascon, and K. Rieck, “Automatic inference of search patterns for taint-style vulnerabilities,” in 2015 IEEE Symposium on Security and Privacy, San Jose, CA, USA, 2015, pp. 797–812.
  • [44] L. K. Shar, L. C. Briand, and H. B. K. Tan, “Web application vulnerability prediction using hybrid program analysis and machine learning,” IEEE Trans. Dependable Sec. Comput., vol. 12, no. 6, pp. 688–707, 2015.
  • [45] J. Saxe and K. Berlin, “eXpose: A character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys,” CoRR, vol. abs/1702.08568, 2017.