Towards Making Deep Learning-based Vulnerability Detectors Robust

Automatically detecting software vulnerabilities in source code is an important problem that has attracted much attention. In particular, deep learning-based vulnerability detectors, or DL-based detectors, are attractive because they do not need human experts to define features or patterns of vulnerabilities. However, such detectors' robustness is unclear. In this paper, we initiate the study in this aspect by demonstrating that DL-based detectors are not robust against simple code transformations, dubbed attacks in this paper because these transformations may be leveraged for malicious purposes. As a first step towards making DL-based detectors robust against such attacks, we propose an innovative framework, dubbed ZigZag, which is centered on (i) decoupling feature learning and classifier learning and (ii) using a ZigZag-style strategy to iteratively refine them until they converge to robust features and robust classifiers. Experimental results show that the ZigZag framework can substantially improve the robustness of DL-based detectors.


1 Introduction

The problem of detecting software vulnerabilities is yet to be solved, as evidenced by the large number of vulnerabilities reported in the Common Vulnerabilities and Exposures (CVE) [1]. The wide reuse of open-source software and the increasing complexity of software supply chains make the problem even more pressing [2], as illustrated by the Heartbleed vulnerability [3] and the supply-chain attack on the open-source npm package event-stream [4]; both highlight the importance of detecting vulnerabilities in source code. The importance of the problem has motivated many studies, which fall into two categories: static analysis, which analyzes the software's source code, and dynamic analysis, which executes the software and observes its behavior (e.g., fuzzing [5, 6, 7, 8]). In this paper, we focus on static analysis-based vulnerability detectors. These detectors analyze source code via one of three approaches: code similarity-based [9, 10, 11], rule-based [12, 13, 14, 15, 16, 17, 18], and machine learning-based [19, 20, 21, 22]. A recent development in machine learning-based detection is the use of Deep Learning (DL). DL-based detectors are attractive because they do not need human experts to define features or patterns to represent vulnerabilities, while achieving high effectiveness [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35].

However, the robustness of DL-based vulnerability detectors is unclear, which motivates our study. Although DL models are known to suffer from adversarial examples in many domains, such as image processing [36, 37], speech recognition [38], malware detection [39, 40], program analysis [41, 42, 43], and code authorship attribution [44], it is unknown whether or not this robustness problem equally holds for vulnerability detectors. This is so because “adversarial vulnerability examples” must be compiled and executed while preserving semantics and vulnerabilities, a requirement having no counterpart in the aforementioned domains.

Our contributions. We initiate the study of the robustness of DL-based vulnerability detectors by making three contributions. First, to understand the robustness of these detectors, we leverage semantics-preserving code transformation techniques, dubbed "attacks", to show that four representative DL-based detectors suffer from adversarial vulnerability examples. These detectors are representative because they operate at different granularities, use different vector representations, and employ different neural networks. For instance, some attacks against a state-of-the-art DL-based detector [24] can cause its false-positive rate to increase from 7.0% to 19.9% and its false-negative rate to increase from 9.9% to 68.1%.

Second, to make DL-based detectors robust against adversarial vulnerability examples, we propose an innovative framework, dubbed “ZigZag”. The key insight is to decouple feature learning and classifier learning and make the resulting features and classifiers robust against code transformations. Specifically, ZigZag iteratively employs two classifiers, which have different decision boundaries but offer similar prediction results. In each iteration, the feature learning phase aims to extract robust features, which characterize input examples well and thus lead to similar, if not the same, predictions by the two different classifiers. Then, the classifier learning phase aims to train two robust classifiers, which have largely discrepant decision boundaries and few classification errors. In other words, the classifier learning phase optimizes two classifiers by increasing the discrepancy between their decision boundaries, whereas the feature learning phase optimizes features by reducing the two classifiers’ prediction discrepancy. This procedure is iterated as many times as needed, explaining the term “ZigZag”. When the iterative process converges, we obtain features and classifiers robust against code transformations. To show the effectiveness of ZigZag, we apply it to the aforementioned detector [24]. Experimental results show that when compared with the original detector, the hardened detector’s false-positive rate (8.4%) and false-negative rate (19.2%) are much lower than the original 19.9% and 68.1% incurred by adversarial examples, respectively.

Third, our experiments are based on a new dataset we collected, which might be of independent value. This dataset is derived from the National Vulnerability Database (NVD) [45] and the Software Assurance Reference Dataset (SARD) [46]. It contains 6,803 programs and their variants, leading to 50,562 vulnerable examples and 80,043 non-vulnerable examples at the function level. We have made the dataset and source code of ZigZag publicly available at https://github.com/ZigZagframework/zigzag_framework.

Paper organization. We analyze the robustness of existing DL-based detectors in Section 2. Then, we present the design of our framework ZigZag and evaluate its robustness in Section 3 and Section 4, respectively. Further, Section 5 discusses the limitations and future work, and Section 6 reviews the related prior work. In the end, Section 7 concludes the paper. Table I summarizes the main notations.

Notation Meaning
T The set of all semantics-preserving code transformations
T_a The subset of T available to the attacker
T_d The subset of T available to the defender; the subset of transformations used in a specific experiment
X A set of training programs
X+ The set of programs expanded from X by including the programs that are transformed from the ones in X via some code transformations in T_d
M A DL-based detector learned from the programs in X
M+ A ZigZag-enabled detector learned from the programs in X+
p, p' Program p has a vulnerability that can be detected by M; p' is a semantics-preserving transformation of p and has a vulnerability that cannot be detected by M
Z, Z' Z is a set of original test programs; Z' is a set of target programs composed of the programs in Z and their manipulated programs
Pr(M, p) The probability that detector M predicts program p as vulnerable
τ A threshold probability, whereby M predicts p as vulnerable if Pr(M, p) > τ and as non-vulnerable otherwise
D1 = {(x_i, y_i)} The set of vectors corresponding to all examples generated from the programs in X, where x_i (1 ≤ i ≤ n1) is the vector of an example with label y_i
D2 = {(x'_j, y'_j)} The set of all examples generated from the manipulated programs in X+, where x'_j (1 ≤ j ≤ n2) is the vector of an example with label y'_j
Dh = {(x''_k, y''_k)} The set of hard examples (i.e., false positives and false negatives) in D2, where x''_k (1 ≤ k ≤ n3) is the vector of an example with label y''_k
G The feature generator used in M+
C1, C2 The two classifiers used in M+; C1' and C2' denote their updated versions during training
p1(x), p2(x), p1'(x), p2'(x) The probability that C1 / C2 / C1' / C2' predicts an example x as vulnerable while using feature generator G
Table I: Main notations used in the paper

2 Robustness of DL-based Detectors

A DL-based detector M is trained from a set X of training programs in source code. The defender uses M to determine whether a given target program in source code is vulnerable or not. Fig. 1 highlights the basic idea: the attacker attempts to manipulate a program, while preserving its semantics, to cause M to classify (i) a manipulated program containing no vulnerability as "vulnerable" or (ii) a manipulated program containing a vulnerability as "non-vulnerable". In this paper, we call manipulated programs adversarial examples regardless of whether the degree of code manipulation is small or not.

2.1 Attack Requirements

The first attack requirement is to preserve the semantics of a program. This is important because the manipulated program should be as useful as the original program. The second attack requirement is not to use obfuscation techniques, because users may refuse to use obfuscated code from a third party that is not known to be trustworthy (for fear of malicious code). The third attack requirement is the preservation of the vulnerability itself. This means that given a piece of vulnerable code, where the vulnerability can be detected by some existing DL-based detectors, the manipulated code remains vulnerable but the vulnerability it contains can evade those detectors. This is important because the attacker's goal is to make vulnerabilities evade vulnerability detectors.

Figure 1: Illustration of an attack against a DL-based detector

2.2 Attack Experiments

2.2.1 Selecting DL-based Detectors

Since DL-based detectors can be characterized by the granularity (e.g., function [32, 27, 28] vs. program slice [23, 24, 25]), the vector representation (e.g., sequence-based [23, 24, 25] vs. Abstract Syntax Tree or AST-based [26, 29, 31]), and the neural network (e.g., Bidirectional Gated Recurrent Unit or BGRU [24] vs. Bidirectional Long Short-Term Memory or BLSTM [23, 29] vs. Convolutional Neural Network or CNN [34]), we consider the following four DL-based detectors.

Program Slice + Sequence + BGRU. It can be instantiated as the detector SySeVR [24], which is an extended version of VulDeePecker [23], is publicly available, and operates at a fine granularity in that each program is represented by multiple program slices. A program slice is composed of a small number of program statements that are semantically related to each other. A slice is parsed as a sequence of tokens (e.g., identifiers, operators, constants, and keywords) and transformed into a vector. This vector representation is used to train a BGRU model for classifying program slices as vulnerable or not.
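To make this pipeline concrete, the following is a minimal PyTorch sketch of a BGRU-based slice classifier in the spirit of this detector; the class and parameter names (SliceBGRU, vocab_size, hidden_dim, and so on) are our own illustrative choices and do not reproduce the published SySeVR implementation.

import torch
import torch.nn as nn

class SliceBGRU(nn.Module):
    """Toy bidirectional-GRU classifier over the token sequence of a program slice."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bgru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)   # single logit: vulnerable vs. not

    def forward(self, token_ids):
        x = self.embed(token_ids)                  # (batch, seq_len, embed_dim)
        _, h = self.bgru(x)                        # h: (2, batch, hidden_dim), one state per direction
        h = torch.cat([h[0], h[1]], dim=-1)        # concatenate forward and backward states
        return torch.sigmoid(self.head(h)).squeeze(-1)   # probability of "vulnerable"

# Example: classify a batch of two already tokenized and padded slices of 80 tokens each.
model = SliceBGRU(vocab_size=5000)
probs = model(torch.randint(1, 5000, (2, 80)))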

Function + Sequence + CNN. It can be instantiated as the detector presented in [34], which operates at a coarse granularity in that each function is treated as a unit. Specifically, each program is divided into multiple functions; each function is interpreted as a sequence of tokens and transformed into a vector; and these vectors are used for training a CNN model, which classifies the functions as vulnerable or not.

Function + Sequence + BLSTM. It can be instantiated as the detector presented in [32]. It operates at a coarse granularity by dividing a program into multiple functions, representing each function as a sequence of tokens, and training a BLSTM model to classify functions as vulnerable or not.

Function + AST + BLSTM. It can be instantiated as the detector that extends and enhances the DL-based detector [32] to capture more syntactic and structural information of program source code, by replacing its sequence-based representation with AST-based representation [33] and using the code2vec tool [47, 48] to aggregate multiple AST paths into a vector.

2.2.2 Preparing Dataset

The present study needs a dataset that satisfies the following requirements: (i) the dataset can support the generation of examples at different granularities; (ii) the programs in the dataset can be compiled for code transformation purposes; and (iii) the dataset should contain vulnerabilities in real-world software for training, because our goal is to detect real-world software vulnerabilities, which may be different from synthetic vulnerabilities. Because existing datasets [32, 28, 34, 24, 23, 49] do not satisfy the preceding three requirements, we create a new dataset by considering two vulnerability sources: NVD [45] and SARD [46]. For NVD, we collect: (i) vulnerable program files that are reported before 2017 and belong to open-source software written in C; and (ii) their patches, which can be obtained from the software vendors' websites. The rationale for (i) is that we conduct experiments on real-world open-source software to detect vulnerabilities reported from 2017 to 2019 (Section 2.2.4 and Section 4.1). For SARD, each program is labeled as good (not vulnerable), bad (vulnerable), or mixed (vulnerable functions and their patched versions). In total, we collect 6,803 programs, each of which is vulnerable or patched. We take vulnerable (i.e., positive) examples and non-vulnerable (i.e., negative) examples at the function level as the ground truth, because each vulnerability can be mapped to a function and each function has at most one vulnerability in our dataset. The 6,803 original programs include 6,865 vulnerable examples and 10,843 non-vulnerable examples. The 6,803 programs and their variants generated by applying code transformations include 50,562 vulnerable examples and 80,043 non-vulnerable examples in total.

Figure 2: A vulnerable program (CVE-2012-0849) and its manipulated version obtained by applying the code transformation CT-6

2.2.3 Attack Methods

To demonstrate the feasibility of the attacks, we leverage real-world code transformation tools to launch attacks because they are designed to preserve program semantics. There are multiple real-world code transformation tools [50, 51, 52, 53]; we choose Tigress [50] because it provides various code transformations without obfuscating code. Table II describes the 8 code transformations, denoted CT-1 to CT-8, which are selected from those offered by Tigress. We apply each of the 8 code transformations to each of the original programs to generate manipulated programs. Fig. 2(a) illustrates a vulnerable program containing an integer overflow vulnerability CVE-2012-0849 (vulnerable Line 6). Fig. 2(b) shows its manipulated version obtained by applying the code transformation CT-6 (i.e., SplitTop). CT-6 splits the original vulnerable function ff_j2k_dwt_init into two new functions _1_ff_j2k_dwt_init_ff_j2k_dwt_init_split_1 and _1_ff_j2k_dwt_init_ff_j2k_dwt_init_split_2. The code in the dashed box highlighted with 1 in Fig. 2(a) corresponds to the code in the dashed box highlighted with 4 in Fig. 2(b); the code in the dashed box highlighted with 2 in Fig. 2(a) corresponds to the code in the dashed box highlighted with 5 in Fig. 2(b). CT-6 also replaces the for loop with a while loop (i.e., the code in the dashed box highlighted with 3 in Fig. 2(a) and the code in the dashed box highlighted with 6 in Fig. 2(b)), replaces arrays with pointers (e.g., Line 15 in Fig. 2(a) and Line 35 in Fig. 2(b)), and replaces macro definition identifiers with static values (e.g., Line 6 in Fig. 2(a) and Line 11 in Fig. 2(b)).
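For intuition, here is a toy Python analogue of the SplitTop idea (the paper applies it to C code via Tigress): the top-level phases of a function are split into separate functions and the for loop is rewritten as a while loop, while the observable behavior is unchanged. The function names below are hypothetical and merely mimic the naming pattern shown in Fig. 2(b).

# Original: one function with two top-level phases.
def dwt_init(coeffs, n):
    total = 0
    for c in coeffs[:n]:                      # phase 1: accumulate
        total += c
    return [c / total for c in coeffs[:n]]    # phase 2: normalize

# SplitTop-flavoured rewrite: each phase becomes its own function and the
# for loop is re-expressed as a while loop; the semantics are preserved.
def _dwt_init_split_1(coeffs, n):
    total, i = 0, 0
    while i < n:
        total += coeffs[i]
        i += 1
    return total

def _dwt_init_split_2(coeffs, n, total):
    return [c / total for c in coeffs[:n]]

def dwt_init_transformed(coeffs, n):
    return _dwt_init_split_2(coeffs, n, _dwt_init_split_1(coeffs, n))

assert dwt_init([1, 2, 3, 4], 3) == dwt_init_transformed([1, 2, 3, 4], 3)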

No. Name Description
CT-1 EncodeStrings Replace the literal strings with calls to the function that generates these literal strings.
CT-2 RndArgs Reorder function arguments and/or add bogus arguments.
CT-3 Flatten Remove some control flows from a function.
CT-4 MergeSimple Merge multiple functions into one without control-flow flattening.
CT-5 MergeFlatten Merge multiple functions into one with control-flow flattening.
CT-6 SplitTop Split top-level statements into multiple functions.
CT-7 SplitBlock Split a basic block into multiple functions.
CT-8 SplitRecursive Split a basic block into multiple functions, and split the calls to split functions.
Table II: The 8 code transformations we use for attacks

2.2.4 Experimental Results

Our attacks satisfy the aforementioned attack requirements as follows. The requirement of semantics preservation is assured by Tigress, which is used to conduct the code transformations and is designed to preserve program semantics. The no-obfuscation requirement is satisfied by choosing 8 code transformations that do not involve any obfuscation operations. The requirement of vulnerability preservation is assured by manual examination. To check whether the semantics-preserving transformations preserve vulnerabilities, we add a flag to each vulnerable line of code in the original program to trace its corresponding line(s) of code in the manipulated program. We randomly select 200 vulnerable programs and manually check whether the manipulated programs are still vulnerable. It takes domain experts about 105 hours to confirm that the manipulated programs contain the same vulnerabilities as the original programs.

Program Slice + Sequence + BGRU Function + Sequence + CNN Function + Sequence + BLSTM Function + AST + BLSTM
CT FPR FNR F1 FPR FNR F1 FPR FNR F1 FPR FNR F1
n/a 7.0 9.9 88.1 10.5 18.8 82.1 7.3 12.0 88.2 7.3 12.4 88.0
CT-1 15.5 66.4 39.4 38.8 45.7 50.7 35.0 40.4 55.8 34.3 40.3 56.1
CT-2 16.7 67.9 37.3 24.8 39.4 61.2 21.8 35.2 65.5 19.9 34.5 67.0
CT-3 22.9 75.4 27.8 44.5 47.3 47.8 40.3 43.5 51.8 39.2 43.5 52.2
CT-4 24.0 77.7 25.2 43.6 67.0 34.8 41.0 64.9 37.3 38.5 64.9 37.9
CT-5 24.0 78.6 24.2 42.6 70.4 32.0 40.8 68.2 34.4 40.2 67.9 34.8
CT-6 23.9 77.5 25.4 58.3 57.6 35.9 55.3 54.7 38.7 54.1 53.8 39.6
CT-7 23.7 76.8 26.2 59.2 60.3 34.5 57.1 57.7 36.9 55.9 56.7 38.0
CT-8 24.0 77.8 25.1 56.5 59.7 35.6 54.8 58.3 37.0 54.0 57.1 38.1
Total 19.9 68.1 35.7 41.6 49.9 47.0 38.7 46.3 50.6 37.7 45.9 51.3
Table III: Experimental results showing that the 4 DL-based detectors lack robustness against code transformations (metrics unit: %)

Evaluation of vulnerability detection evasion. To show the lack of robustness of DL-based detectors, we conduct attacks against the four DL-based detectors. We randomly choose 80% of the 6,803 programs for training and use the rest for testing. The target programs comprise the original test programs and their manipulated versions produced by the 8 code transformations, all of which are available to the attacker. At the function level, the training programs contain 4,079 vulnerable examples and 6,530 non-vulnerable examples; the target programs contain 17,516 vulnerable examples and 28,206 non-vulnerable examples.

Let TP denote true positives, FP false positives, TN true negatives, and FN false negatives. We use three standard metrics for evaluation [54]: (i) the False-Positive Rate FPR = FP / (FP + TN); (ii) the False-Negative Rate FNR = FN / (TP + FN); and (iii) the overall effectiveness or F1-measure F1 = 2 · P · R / (P + R), where precision P = TP / (TP + FP) and recall R = TP / (TP + FN) = 1 − FNR. We train the four DL-based detectors and choose the hyperparameters that lead to the highest F1. Table III summarizes the results. Compared with the function-level detectors, the program slice-level detector achieves better results for the original test programs, with a 1.4% lower FPR, a 4.5% lower FNR, and a 2.0% higher F1 on average, but it achieves a 19.4% lower FPR, a 20.7% higher FNR, and a 13.9% lower F1 for the target programs on average. This indicates that the program slice-level detector misses many more vulnerabilities in manipulated programs. We speculate that this is because a program slice has a finer granularity than a function, so a detector operating at the program-slice granularity is more sensitive to changes in vulnerable code; in contrast, the coarser-grained function-level detectors can accommodate more changes in both vulnerable and non-vulnerable code, and therefore produce more false positives and fewer false negatives for manipulated programs. In addition, each detector exhibits similar phenomena across different code transformations. Take the "Program Slice + Sequence + BGRU" detector for instance: on the manipulated programs it incurs a high FPR of 21.8%, a high FNR of 74.8%, and a low F1 of 28.8% on average, which indicates that DL-based detectors can easily be made to err by manipulating programs.
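The three metrics can be computed directly from the confusion-matrix counts; the helper below is our own illustration and simply encodes the definitions above.

def detection_metrics(tp, fp, tn, fn):
    """Compute FPR, FNR, and F1 (in %) from confusion-matrix counts."""
    fpr = fp / (fp + tn)
    fnr = fn / (tp + fn)
    precision = tp / (tp + fp)
    recall = 1.0 - fnr                  # recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return {"FPR": 100 * fpr, "FNR": 100 * fnr, "F1": 100 * f1}

# Example with made-up counts (not taken from the paper's experiments):
print(detection_metrics(tp=900, fp=70, tn=930, fn=100))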

DL-based detector Software product CVE ID Code transformations
Program Slice+ Sequence+ BGRU FFmpeg 2.8.2 CVE-2017-9608 CT-3, CT-5, CT-6
FFmpeg 2.8.2 CVE-2018-14395 CT-3, CT-5, CT-6
FFmpeg 2.8.2 CVE-2018-14394 CT-3, CT-4, CT-5
FFmpeg 2.8.2 CVE-2018-1999010 CT-1, CT-3, CT-4, CT-5, CT-6, CT-7
FFmpeg 2.8.2 CVE-2019-12730 CT-1, CT-2
Wireshark 2.0.5 CVE-2017-6467 CT-1, CT-2, CT-3
Wireshark 2.0.5 CVE-2017-6468 CT-1, CT-2, CT-3, CT-5, CT-6
Wireshark 2.0.5 CVE-2017-6469 CT-2, CT-3, CT-4, CT-5, CT-8
Wireshark 2.0.5 CVE-2017-6474 CT-4, CT-5, CT-6
Wireshark 2.0.5 CVE-2017-7702 CT-1, CT-2
Wireshark 2.0.5 CVE-2017-9345 CT-3, CT-4, CT-6, CT-7, CT-8
Wireshark 2.0.5 CVE-2017-11410 CT-4, CT-6, CT-7, CT-8
Wireshark 2.0.5 CVE-2017-11411 CT-4, CT-6, CT-7, CT-8
Wireshark 2.0.5 CVE-2017-13767 CT-1, CT-3, CT-4, CT-7
OpenSSL 1.1.0 CVE-2017-3730 CT-1, CT-2, CT-6, CT-7
OpenSSL 1.1.0 CVE-2017-3733 CT-1, CT-2, CT-3, CT-4
OpenSSL 1.1.0 CVE-2018-0732 CT-3, CT-4, CT-5, CT-6, CT-7
OpenSSL 1.1.0 CVE-2019-1543 CT-1, CT-2, CT-3, CT-4
OpenSSL 1.1.0 CVE-2019-1563 CT-1, CT-2, CT-3, CT-4
Function+ Sequence+ CNN FFmpeg 2.8.2 CVE-2018-9996 CT-1, CT-3
FFmpeg 2.8.2 CVE-2018-14395 CT-3, CT-4
FFmpeg 2.8.2 CVE-2018-1999010 CT-1, CT-3, CT-5
Wireshark 2.0.5 CVE-2017-6467 CT-2, CT-3
Wireshark 2.0.5 CVE-2017-6474 CT-4, CT-5, CT-6
Wireshark 2.0.5 CVE-2017-9344 CT-6, CT-7
OpenSSL 1.1.0 CVE-2017-3733 CT-1, CT-2
OpenSSL 1.1.0 CVE-2018-0735 CT-4, CT-5
Function+ Sequence+ BLSTM FFmpeg 2.8.2 CVE-2017-9608 CT-3, CT-5
FFmpeg 2.8.2 CVE-2018-14394 CT-3, CT-4, CT-5
FFmpeg 2.8.2 CVE-2018-1999010 CT-1, CT-3, CT-5
Wireshark 2.0.5 CVE-2017-6467 CT-2, CT-3
Wireshark 2.0.5 CVE-2017-6474 CT-4, CT-5, CT-6
Wireshark 2.0.5 CVE-2017-9345 CT-3, CT-5, CT-6
Wireshark 2.0.5 CVE-2017-11411 CT-4, CT-5, CT-6
OpenSSL 1.1.0 CVE-2017-3733 CT-1, CT-2, CT-3
OpenSSL 1.1.0 CVE-2018-0735 CT-3, CT-4, CT-5
OpenSSL 1.1.0 CVE-2018-0737 CT-3, CT-4, CT-5
Function+ AST+ BLSTM FFmpeg 2.8.2 CVE-2017-9608 CT-3, CT-5, CT-6
FFmpeg 2.8.2 CVE-2018-14394 CT-3, CT-4, CT-5
FFmpeg 2.8.2 CVE-2017-9996 CT-1, CT-3, CT-5
FFmpeg 2.8.2 CVE-2019-12730 CT-1, CT-2
Wireshark 2.0.5 CVE-2017-6468 CT-2, CT-3
Wireshark 2.0.5 CVE-2017-6474 CT-4, CT-5, CT-6
Wireshark 2.0.5 CVE-2017-9345 CT-3, CT-4, CT-6, CT-7, CT-8
Wireshark 2.0.5 CVE-2017-11410 CT-4, CT-6, CT-7, CT-8
Wireshark 2.0.5 CVE-2017-13766 CT-3, CT-4, CT-6
OpenSSL 1.1.0 CVE-2017-3733 CT-1, CT-2, CT-3
OpenSSL 1.1.0 CVE-2018-0734 CT-1, CT-2, CT-3
OpenSSL 1.1.0 CVE-2019-1543 CT-1, CT-2, CT-3,CT-4
Table IV: The vulnerabilities in the three software products that evade the DL-based detectors

To show the feasibility of the attack against real-world open-source software, we test it against three open-source software products to detect the vulnerabilities reported in the NVD from 2017 to 2019, while recalling that these detectors are trained using the vulnerabilities reported prior to 2017. We use the four DL-based detectors to detect vulnerabilities in three software products and their manipulated versions. We observe that the program slice-level detectors can detect more vulnerabilities than the function-level detectors and some vulnerabilities detected by different detectors are the same. Table IV summarizes the vulnerabilities in three software products that can evade the DL-based detectors. We observe that there are 19 vulnerabilities, 8 vulnerabilities, 10 vulnerabilities, and 12 vulnerabilities that can respectively evade the “Program Slice + Sequence + BGRU”, the “Function + Sequence + CNN”, the “Function + Sequence + BLSTM”, and the “Function + AST + BLSTM” detectors. Considering the “Program Slice + Sequence + BGRU” detector, we observe that 5 vulnerabilities in FFmpeg 2.8.2, 9 vulnerabilities in Wireshark 2.0.5, and 5 vulnerabilities in OpenSSL 1.1.0 are missed; these vulnerabilities are listed in Table IV. In summary,

Insight 1

DL-based vulnerability detectors are not robust against evasion.

3 ZigZag Framework

3.1 Characterizing DL-based Detectors

Fig. 3(a) and (c) depict the training phase and detection phase of a DL-based detector. The training phase consists of Steps 1, 2, and 3; the detection phase consists of Steps 1, 2, and 4. These steps are elaborated below.

Step 1: generating code fragments. In the training phase, training programs are used to train a DL-based detector. In the detection phase, the DL-based detector is used to detect whether the target programs contain vulnerabilities or not. A detector operates on code fragments at a certain granularity, such as function [32, 27, 28] and program slice [23, 24, 25]. This step extracts code fragments from each training program and each target program at the desired granularity. A code fragment extracted from a training program is labeled as vulnerable if it contains vulnerable statements and labeled as non-vulnerable otherwise.

Step 2: mapping code fragments to vectors. This step maps each code fragment into an appropriate form of representation (e.g., a sequence of tokens [23, 24, 25] or an abstract syntax tree [26, 29, 31, 33]), depending on the specifics of the DL-based detector. The code representation is then embedded into a vector by, for example, concatenating the vectors corresponding to its tokens.
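As a concrete illustration of Step 2, the sketch below maps a token sequence to a fixed-length list of token ids that an embedding layer (such as the one in the earlier BGRU sketch) can consume; the vocabulary handling is our own simplification of what a real detector would do.

def build_vocab(token_sequences):
    """Assign an integer id to every token; id 0 is reserved for padding, id 1 for unknown tokens."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in token_sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, max_len=80):
    """Map a code fragment (list of tokens) to a fixed-length list of token ids."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

fragment = ["if", "(", "len", ">", "BUF_SIZE", ")", "return", ";"]
vocab = build_vocab([fragment])
vector = encode(fragment, vocab)   # ready to be fed to an embedding layer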

Step 3: training a DL-based detector. This step only applies to the training phase. It uses the vectors corresponding to the code fragments and their labels to train a deep neural network, such as BGRU [24], BLSTM [23, 29], or CNN [34].

Step 4: detecting vulnerabilities. This step only applies to the detection phase. It uses the trained DL-based detector to determine whether a vector, which corresponds to a code fragment extracted from a target program, is vulnerable or not.

3.2 System and Threat Model

In the system model, we consider a defender. The defender trains a DL-based detector M from a set X of training programs. Let T be the set of all possible code transformation methods whereby one can modify or manipulate one program into another program such that the two programs have the same semantics; these methods are dubbed semantics-preserving code transformations. Let Pr(M, p) denote the probability that program p is vulnerable according to detector M. For a given threshold probability τ, M predicts p as vulnerable if Pr(M, p) > τ and as non-vulnerable otherwise.

In the threat model, the attacker has access to a set T_a ⊆ T of code transformation methods by which the attacker can manipulate a program p into a new program p' via semantics-preserving code transformations. The attacker's objective is to induce mistakes from the defender's detector M, namely achieving

Pr(M, p') > τ when p' is not vulnerable, or Pr(M, p') ≤ τ when p' is vulnerable, (1)

where the former means a false positive and the latter means a false negative. The attacker may be particularly interested in causing false negatives, which is known as the evasion attack [55].

The design objective is to harden M into M+ such that M+ can detect the vulnerability in p', namely achieving

Pr(M+, p') > τ for a vulnerable p'. (2)
Figure 3: Overview of the ZigZag framework. The ZigZag framework attempts to “compile” a DL-based detector into a new vulnerability detector robust against code transformations. The new steps (i.e., Steps 0 and 3’) introduced by the ZigZag framework are highlighted with shaded boxes. Since the two data flows share Steps 1 and 2, we use blue arrows and red arrows to distinguish the inputs to Step 3’.

3.3 The ZigZag Framework

To achieve the design objective, it is intuitive to allow the defender to extend the set X of training programs into a new set X+ by mimicking what the attacker would do to evade M. This enhanced set is leveraged to harden M into M+. To produce X+, the defender needs to use some semantics-preserving code transformation methods. Let T_d ⊆ T denote the set of code transformation methods that are available to the defender. The goal of the defender is to harden M into M+ so that it can detect the vulnerability in p', which is produced by applying some code transformation methods in T_a to p.

Figure 4: Illustrating the decision boundaries of (a) the detector obtained via conventional adversarial training and (b) the ZigZag-enabled detector

Why is conventional adversarial training incompetent? It may sound intuitive to use the examples corresponding to the programs in X+ as input to train a detector, which is the conventional adversarial training. However, our experiments show that the effectiveness of such a detector is far from satisfactory (see Fig. 6 in Section 4.1). This is because the training process tries to make the distribution of the examples corresponding to the programs in X and the distribution of the examples corresponding to the manipulated programs in X+ similar, leaving many examples close to the decision boundary, where small perturbations cause misclassifications, as shown in Fig. 4(a).

Basic idea. The ZigZag framework can be seen as a "compiler" that compiles an input DL-based detector, which uses a single classifier, into a robust detector, which uses two classifiers. The key insight is that adversarial vulnerability examples often reside near the boundary of a classifier. Since it is possible that there are always some examples residing near the boundary of any given classifier, the ZigZag framework leverages two classifiers with distant decision boundaries and assures that a successful adversarial example must "fool" both classifiers, which is harder to achieve when the two classifiers are required to predict consistently. This intuition can be enforced by decoupling feature learning and classifier learning when training a DL-based detector as follows: (i) feature learning aims at optimizing the feature representation such that the two classifiers use different decision boundaries but predict consistently, which implies robust features; and (ii) classifier learning aims at optimizing the two classifiers by "pushing" their boundaries away from each other, as illustrated in Fig. 4(b), where the ZigZag-enabled detector consists of classifiers C1 and C2 with distant decision boundaries. Putting the preceding (i) and (ii) together, when the training process converges, a ZigZag-enabled detector is hard to evade because an adversarial example must "fool" both classifiers.

Fig. 3 highlights the ZigZag framework. A ZigZag-enabled detector differs from a DL-based detector in the training phase, as shown in Fig. 3(b) vs. Fig. 3(a). Note that as depicted in Figure 3(c), they share the same detection phase. The training of a DL-based detector has three steps, namely Steps 1, 2, and 3. For training a ZigZag-enabled detector, a new step (Step 0) is introduced and Step 3 is extended to what is called Step 3’.

Step 0: augmenting source code. The input includes (i) the source code of the set X of training programs and (ii) the set T_d of code transformations that are available to the defender. This step generates a set of programs, denoted by X+, which includes all programs in X and the programs that are transformed from the programs in X via the code transformations in T_d, meaning X ⊆ X+. This step mimics the way a defender generates additional examples for adversarial training [55].
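Step 0 can be realized with a small driver that applies every transformation available to the defender to every training program; the sketch below assumes a user-supplied transform(program_path, ct_name) wrapper around the chosen transformation tool and deliberately does not reproduce Tigress's command-line syntax.

from pathlib import Path
from typing import Callable, Iterable, List

def augment_training_set(
    training_programs: Iterable[Path],
    defender_transforms: List[str],            # e.g., ["CT-2", "CT-7", "CT-8"]
    transform: Callable[[Path, str], Path],    # wrapper around the transformation tool (assumed)
) -> List[Path]:
    """Build X+ = X plus all programs obtained by applying each transformation in T_d to X."""
    originals = list(training_programs)
    augmented = list(originals)                # keep the original programs
    for program in originals:
        for ct in defender_transforms:
            augmented.append(transform(program, ct))   # labels are inherited from the original program
    return augmented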

Step 1: generating code fragments. This is the same as in training a DL-based detector.

Step 2: mapping code fragments into vectors. This is the same as in training a DL-based detector.

Step 3’: training a ZigZag-enabled detector. The goal of this step is to learn vulnerability features robust against code transformations. Denote the set of code fragment vectors by for all code fragments (i.e., examples) generated from the training programs in , where () is the vector of a code fragment with label ( where “0” is non-vulnerable and “1” is vulnerable). We denote the set of all code fragments generated from the manipulated programs in by , where () is the vector of a code fragment with label (). This step has three substeps.

Step 3’.1: training a feature generator and two classifiers. This substep aims to train a feature generator and two classifiers and to correctly classify almost all examples from programs in . Let and respectively be the probability that and predict as vulnerable when using , where . Classifiers and are the same as classifier except that they use different initial parameter values. We initialize and differently at the beginning of training. We use all examples in to train , , and to minimize the classification loss, which is the sum of cross entropies of and :

(3)

where denotes the statistical expectation in . Fig. 5(a) illustrates an instance of classification results of and after Step 3’.1. Almost all examples from programs in are correctly classified, while many examples from manipulated programs in are incorrectly classified.
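A minimal PyTorch rendering of the Step 3'.1 objective (Eq. 3) could look as follows, assuming a feature generator G and two classifier heads C1 and C2 that output vulnerability probabilities; all module and variable names are illustrative rather than those of the released implementation.

import torch
import torch.nn as nn

bce = nn.BCELoss()   # binary cross entropy over predicted probabilities

def step_3_1_loss(G, C1, C2, x, y):
    """Eq. (3): sum of the cross entropies of the two classifiers on a mini-batch from D1.

    y is a float tensor of labels in {0.0, 1.0}.
    """
    feats = G(x)                      # shared feature representation
    p1, p2 = C1(feats), C2(feats)     # each: probability of "vulnerable", shape (batch,)
    return bce(p1, y) + bce(p2, y)

# One optimization step over a mini-batch (x, y) drawn from D1:
# params = list(G.parameters()) + list(C1.parameters()) + list(C2.parameters())
# opt = torch.optim.Adam(params, lr=2e-3)
# loss = step_3_1_loss(G, C1, C2, x, y); opt.zero_grad(); loss.backward(); opt.step()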

Step 3’.2: training two classifiers while attempting to maximize their prediction discrepancy. Because the classifiers and obtained from Step 3’.1 may be similar, this substep aims to train and to maximize their prediction discrepancy. The initial values of and are and obtained from Step 3’.1. For the examples in , we follow the training process of Step 3’.1 to minimize the classification loss, because Step 3’.1 is able to ensure that there are hardly any incorrectly classified examples in ; for the examples in , we transform the misclassification of examples into the large prediction discrepancy of two classifiers.

Let p1'(x) and p2'(x) respectively be the probability that C1' and C2' predict example x as vulnerable when using feature generator G. For the examples in D1, the classification loss is the sum of the cross entropies of C1' and C2':

L_cls(D1) = E_{(x, y) ∈ D1} [ −y log p1'(x) − (1 − y) log(1 − p1'(x)) − y log p2'(x) − (1 − y) log(1 − p2'(x)) ]. (4)

From D2, we identify the examples for which C1' or C2' makes an incorrect prediction; we call them hard examples and denote them by a set Dh. Hard examples satisfy (i) C1' and C2' make different predictions, meaning that one of them makes a wrong prediction, or (ii) both classifiers make wrong predictions. We denote the set of hard examples by Dh = {(x''_k, y''_k) | 1 ≤ k ≤ n3}, where x''_k is the vector of a code fragment with label y''_k, and Dh is obtained as follows:

Dh = Φ(D2) = { (x', y') ∈ D2 : I(p1'(x')) ≠ y' or I(p2'(x')) ≠ y' }, (5)

where I(·) is a function that maps a probability to a label "1" or "0" and Φ(·) is a function that outputs the hard examples satisfying I(p1'(x')) ≠ y' or I(p2'(x')) ≠ y'. If the probability that an example is predicted as vulnerable is greater than the threshold τ, the prediction is "1"; otherwise, the prediction is "0". The discrepancy loss of the two classifiers for the hard examples is the absolute value of the difference between the probabilities that the two classifiers predict them as vulnerable:

L_dis(Dh) = E_{(x'', y'') ∈ Dh} [ |p1'(x'') − p2'(x'')| ]. (6)

In summary, we train C1' and C2' while using a fixed feature generator G to minimize the classification loss for the examples in D1 (i.e., minimize L_cls(D1)) and maximize the discrepancy loss of the two classifiers for the examples in Dh (i.e., minimize −L_dis(Dh)), which can be represented as

min_{C1', C2'} L_cls(D1) − L_dis(Dh). (7)

Fig. 5(b) illustrates the classification of C1^(1) and C2^(1) after the first iteration of Step 3'.2. The transition from Fig. 5(a) to Fig. 5(b) shows the training of the new classifiers, denoted by C1^(1) and C2^(1), where "(1)" indicates the first iteration, which continues the training but starts at C1 and C2, respectively (rather than from scratch). This explains the changes of the two classification boundaries in Fig. 5(b). Note that the training of the classification functions enlarges the discrepancy between the classification of C1^(1) and that of C2^(1) on the hard examples. From Fig. 5(a) to Fig. 5(b), the positions of the examples remain unchanged and the decision boundaries change because the two classifiers change.
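The Step 3'.2 objective (Eqs. 4-7) can be sketched as follows: on D1 the updated classifiers keep minimizing the classification loss, while on the hard examples selected from D2 their prediction discrepancy is maximized; the feature generator is held fixed. The names and the threshold handling are our own illustrative choices.

import torch
import torch.nn as nn

bce = nn.BCELoss()

def step_3_2_loss(G, C1, C2, x1, y1, x2, y2, tau=0.5):
    """Eq. (7): classification loss on a D1 batch minus discrepancy loss on hard examples of a D2 batch."""
    with torch.no_grad():                 # the feature generator is frozen in this substep
        f1, f2 = G(x1), G(x2)
    p1_d1, p2_d1 = C1(f1), C2(f1)
    cls_loss = bce(p1_d1, y1) + bce(p2_d1, y1)                               # Eq. (4)

    p1_d2, p2_d2 = C1(f2), C2(f2)
    hard = ((p1_d2 > tau).float() != y2) | ((p2_d2 > tau).float() != y2)     # Eq. (5): hard examples
    if hard.any():
        disc_loss = (p1_d2[hard] - p2_d2[hard]).abs().mean()                 # Eq. (6)
    else:
        disc_loss = torch.zeros_like(cls_loss)
    return cls_loss - disc_loss                                              # Eq. (7), minimized over C1, C2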

Figure 5: The ZigZag framework iteratively tunes the classifiers and the feature generator, where a positive (negative) example indicates a vulnerable (non-vulnerable) code fragment.

Step 3’.3: training a feature generator while attempting to minimize the two classifiers’ prediction discrepancy. To generate robust vulnerability features, this substep aims to train the feature generator to learn the features that minimize the prediction discrepancy of and . The initial value of is obtained from Step 3’.1. We use all examples in and fixed and from Step 3’.2 to train . The objective is to minimize the discrepancy loss of and :

(8)

Fig. 5(c) illustrates an instance of classification results of and after the first iteration of Step 3’.3. Fig. 5 (b)(c) corresponds to learning new features, denoted by where “” indicates the first iteration and the training continues, resuming from (rather than from scratch). Note that the classification boundaries, namely and , remain unchanged. For Fig. 5(b)(c), the positions of the examples are changed and the decision boundaries remain unchanged because the feature generator is changed.

The preceding Steps 3’.2 and 3’.3 are iterated, where each step guides the tuning of the other in an iterative fashion. Fig. 5 (c)(d) corresponds to training new classifiers, denoted by and where “” indicates the second iteration, by continuing the training process but respectively starting at and . This explains the changes in the two classification boundaries in Fig. 5(d). Note that the training of classification functions penalizes the discrepancy between the classifications of and . Fig. 5 (d)(e) corresponds to learning new features, denoted by where “” indicates the second iteration. Training continues, resuming from . Note that the classification boundaries, namely and , remain unchanged. This ZigZag learning process proceeds by simultaneously making (i) converge and (ii) and converge. Intuitively, this joint convergence indicates, in a sense, the identification of features and classifications that are insensitive to changes, which leads to robustness against adversarial examples.

4 Evaluation of ZigZag-enabled Robustness

Program Slice + Sequence + BGRU Function + Sequence + CNN Function + Sequence + BLSTM Function + AST + BLSTM
CT FPR FNR F1 FPR FNR F1 FPR FNR F1 FPR FNR F1
n/a 2.2 10.9 92.0 9.8 17.9 83.1 7.5 16.5 85.5 7.5 14.5 86.6
Known code transformations (in T_d)
CT-2 6.4 15.7 85.2 15.2 30.4 72.3 19.3 19.5 76.8 17.1 18.1 78.9
CT-7 8.7 19.6 81.2 22.1 33.9 66.2 22.6 20.1 74.5 19.2 21.6 75.5
CT-8 8.9 19.5 81.2 23.0 35.7 64.5 20.2 23.2 73.9 19.7 22.4 74.7
Unknown code transformations (not in T_d)
CT-1 9.8 19.1 80.3 23.5 25.1 70.9 21.3 20.9 74.6 17.9 18.7 77.8
CT-3 9.7 21.2 79.5 23.3 26.7 70.2 14.9 28.1 73.9 15.8 24.3 75.8
CT-4 9.4 21.7 79.5 32.4 34.5 63.1 19.5 29.2 72.3 19.7 27.1 73.4
CT-5 10.3 21.6 78.8 33.4 35.5 62.1 20.8 29.5 71.4 19.2 25.4 74.8
CT-6 9.5 21.7 79.4 22.6 31.7 66.8 19.6 22.5 74.2 19.7 21.8 74.6
Total 8.4 19.2 81.7 21.4 29.5 69.5 17.8 22.7 75.8 16.8 21.1 77.3
Table V: Effectiveness of the ZigZag-enabled detectors when using T_d = {CT-2, CT-7, CT-8} (metrics unit: %)
(a) FPR
(b) FNR
(c) F1
Figure 6: Comparing the effectiveness of the original detector trained from X, the detector obtained via conventional adversarial training on X+, and the ZigZag-enabled detector trained from X+ with respect to the four DL-based detectors, where X+ is derived from X via CT-2, CT-7, and CT-8 in T_d.

We conduct experiments on a computer with two NVIDIA GeForce TITAN RTX GPUs and an Intel i9-9900X CPU running at 3.50GHz, and focus on answering the following three Research Questions (RQs):

  • RQ1: Are ZigZag-enabled detectors robust against code transformations? (Section 4.1)

  • RQ2: Does the robustness of ZigZag-enabled detectors depend on the defender’s choices of code transformations? (Section 4.2)

  • RQ3: Are ZigZag-enabled detectors more effective than other widely-used vulnerability detectors? (Section 4.3)

4.1 Robustness against Code Transformation Attacks (RQ1)

To evaluate the robustness of ZigZag-enabled detectors against the code transformation attacks, we set T_d = {CT-2, CT-7, CT-8} as a concrete set of code transformation methods for the defender; we will discuss the impact of different choices of T_d later. The input training programs are the programs in X+; the input target programs are composed of the original test programs and their manipulated programs produced with the 8 code transformations. The programs in X+ are generated in Step 0 and contain 26,181 vulnerable examples and 40,994 non-vulnerable examples. We train the four ZigZag-enabled detectors and choose the hyperparameters that lead to the highest F1. Take the "Program Slice + Sequence + BGRU" detector for instance. The main hyperparameters are as follows: the batch size is 64; the number of hidden layers is 2; the dimension of hidden vectors is 900; the dropout rate is 0.2; the output dimension is 512; the learning rate is 0.002; the number of iterations of Steps 3'.2 and 3'.3 is 8; and the probability threshold is 0.4.
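For reference, the hyperparameters reported above for the ZigZag-enabled "Program Slice + Sequence + BGRU" detector can be collected in a single configuration object; the key names below are our own, while the values are those stated in the text.

BGRU_ZIGZAG_CONFIG = {
    "batch_size": 64,
    "num_hidden_layers": 2,
    "hidden_dim": 900,
    "dropout": 0.2,
    "output_dim": 512,
    "learning_rate": 0.002,
    "zigzag_iterations": 8,        # iterations of Steps 3'.2 and 3'.3
    "probability_threshold": 0.4,
}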

Table V summarizes the experimental results. Compared with the four DL-based detectors, the four ZigZag-enabled detectors respectively improve the FPR, FNR, and F1 by 18.4%, 29.4%, and 29.9% on average for the target programs. This means that the ZigZag framework can significantly improve the robustness of DL-based detectors, especially program slice-level ones. When compared with the three ZigZag-enabled function-level detectors, the ZigZag-enabled program slice-level detector achieves a 10.3% lower FPR, a 5.2% lower FNR, and a 7.5% higher F1 on average, which can be attributed to its finer granularity. In addition, the ZigZag-enabled detectors exhibit a similar degree of effectiveness. Taking the "Program Slice + Sequence + BGRU" detector for instance, we observe that the ZigZag-enabled program slice-level detector, when applied to the manipulated programs, significantly improves the effectiveness in terms of all three metrics. We also observe that the effectiveness with respect to the known code transformations is a little higher than the effectiveness with respect to the unknown code transformations, with a 1.7% lower FPR, a 2.8% lower FNR, and a 3.0% higher F1 on average. This can be explained as follows: the examples that correspond to target programs and are generated via known code transformations are more similar to the examples that correspond to the training programs in X+ and are generated via the same code transformations in T_d.

Comparing the effectiveness of the original, adversarially trained, and ZigZag-enabled detectors. For each of these detectors, we consider four options: "Program Slice + Sequence + BGRU" vs. "Function + Sequence + CNN" vs. "Function + Sequence + BLSTM" vs. "Function + AST + BLSTM". Fig. 6 compares the FPR, FNR, and F1 of the original detector trained from X, the detector trained from X+ via conventional adversarial training, and the ZigZag-enabled detector trained from X+. Consider the "Program Slice + Sequence + BGRU" instances of these detectors. When compared with the original detector, the conventionally adversarially trained detector achieves a 6.9% lower FPR, a 12.9% lower FNR, and an 11.9% higher F1 for the target programs. When compared with the conventionally adversarially trained detector, the ZigZag-enabled detector achieves a 3.9% lower FPR, a 35.6% lower FNR, and a 34.1% higher F1. The effectiveness gained by the ZigZag framework can be understood as follows. Conventional adversarial training makes the features of the programs in X and those of their manipulated counterparts in X+ similar, which causes the examples close to the decision boundary to be misclassified. In contrast, the ZigZag-enabled detector makes the decision boundaries of its two classifiers, C1 and C2, largely discrepant while reducing their classification errors.

For the computational time cost of the "Program Slice + Sequence + BGRU" detectors, the training and test times of the original detector are about 2.5 hours and 5.3 hours, respectively; those of the conventionally adversarially trained detector are about 4.7 hours and 5.3 hours, respectively; and those of the ZigZag-enabled detector are about 11.5 hours and 5.8 hours, respectively. The test time is relatively long because of the large number of test examples, but the average test time is only 0.029 seconds per test example for the first two detectors and 0.032 seconds for the ZigZag-enabled detector. The extra cost of the ZigZag-enabled detector is justifiable by the enhanced robustness.

Effectiveness of the ZigZag framework when applied to real-world open-source software. As discussed in Section 2.2.4, we can respectively manipulate 19, 8, 10, and 12 vulnerabilities in the aforementioned three real-world open-source software products to evade the four DL-based detectors. Table VI lists the vulnerabilities detected in these three software products by the four ZigZag-enabled detectors. We observe, for instance, that the ZigZag-enabled "Program Slice + Sequence + BGRU" detector detects all of the 19 vulnerabilities under at least one code transformation, but there is still much room for improvement because for 12 of these vulnerabilities at least one transformed version still evades the detector. Another open problem is to explain why some code transformations can evade detection while others cannot. This leads to:

Insight 2

ZigZag-enabled detectors are substantially more robust than the original DL-based detectors and the detectors obtained via conventional adversarial training.

DL-based detector Software product CVE ID Detected code transformations Missed code transformations
Program Slice+ Sequence+ BGRU FFmpeg 2.8.2 CVE-2017-9608 CT-3, CT-5, CT-6 None
FFmpeg 2.8.2 CVE-2018-14395 CT-3 CT-5, CT-6
FFmpeg 2.8.2 CVE-2018-14394 CT-3, CT-4 CT-5
FFmpeg 2.8.2 CVE-2018-1999010 CT-1, CT-3, CT-5, CT-6 CT-4, CT-7
FFmpeg 2.8.2 CVE-2019-12730 CT-1, CT-2 None
Wireshark 2.0.5 CVE-2017-6467 CT-1, CT-2 CT-3
Wireshark 2.0.5 CVE-2017-6468 CT-1, CT-2, CT-3 CT-5, CT-6
Wireshark 2.0.5 CVE-2017-6469 CT-2, CT-4, CT-5, CT-8 CT-3
Wireshark 2.0.5 CVE-2017-6474 CT-4, CT-5, CT-6 None
Wireshark 2.0.5 CVE-2017-7702 CT-1, CT-2 None
Wireshark 2.0.5 CVE-2017-9345 CT-3, CT-6, CT-7, CT-8 CT-4
Wireshark 2.0.5 CVE-2017-11410 CT-4, CT-6, CT-7, CT-8 None
Wireshark 2.0.5 CVE-2017-11411 CT-4, CT-7, CT-8 CT-6
Wireshark 2.0.5 CVE-2017-13767 CT-1, CT-4 CT-3, CT-7
OpenSSL 1.1.0 CVE-2017-3730 CT-1, CT-2, CT-6, CT-7 None
OpenSSL 1.1.0 CVE-2017-3733 CT-1, CT-2 CT-3, CT-4
OpenSSL 1.1.0 CVE-2018-0732 CT-3, CT-4, CT-5 CT-6, CT-7
OpenSSL 1.1.0 CVE-2019-1543 CT-1, CT-2, CT-3, CT-4 None
OpenSSL 1.1.0 CVE-2019-1563 CT-1, CT-2 CT-3, CT-4
Function+ Sequence+ CNN FFmpeg 2.8.2 CVE-2018-9996 CT-1 CT-3
FFmpeg 2.8.2 CVE-2018-14395 CT-3 CT-4
FFmpeg 2.8.2 CVE-2018-1999010 CT-1, CT-3 CT-5
Wireshark 2.0.5 CVE-2017-6467 CT-2 CT-3
Wireshark 2.0.5 CVE-2017-6474 CT-4, CT-5 CT-6
Wireshark 2.0.5 CVE-2017-9344 CT-6, CT-7 None
OpenSSL 1.1.0 CVE-2017-3733 CT-1, CT-2 None
OpenSSL 1.1.0 CVE-2018-0735 CT-4 CT-5
Function+ Sequence + BLSTM FFmpeg 2.8.2 CVE-2017-9608 CT-3, CT-5 None
FFmpeg 2.8.2 CVE-2018-14394 CT-3 CT-4, CT-5
FFmpeg 2.8.2 CVE-2018-1999010 CT-1, CT-3 CT-5
Wireshark 2.0.5 CVE-2017-6467 CT-2 CT-3
Wireshark 2.0.5 CVE-2017-6474 CT-4, CT-5 CT-6
Wireshark 2.0.5 CVE-2017-9345 CT-5, CT-6 CT-3
Wireshark 2.0.5 CVE-2017-11411 CT-4 CT-5, CT-6
OpenSSL 1.1.0 CVE-2017-3733 CT-1, CT-2 CT-3
OpenSSL 1.1.0 CVE-2018-0735 CT-3, CT-4 CT-5
OpenSSL 1.1.0 CVE-2018-0737 CT-3 CT-4, CT-5
Function+ AST+ BLSTM FFmpeg 2.8.2 CVE-2017-9608 CT-3, CT-5 CT-6
FFmpeg 2.8.2 CVE-2018-14394 CT-3 CT-4, CT-5
FFmpeg 2.8.2 CVE-2017-9996 CT-1, CT-3 CT-5
FFmpeg 2.8.2 CVE-2019-12730 CT-1 CT-2
Wireshark 2.0.5 CVE-2017-6468 CT-2 CT-3
Wireshark 2.0.5 CVE-2017-6474 CT-4, CT-5 CT-6
Wireshark 2.0.5 CVE-2017-9345 CT-6, CT-7, CT-8 CT-3, CT-4
Wireshark 2.0.5 CVE-2017-11410 CT-7, CT-8 CT-4, CT-6
Wireshark 2.0.5 CVE-2017-13766 CT-3, CT-4 CT-6
OpenSSL 1.1.0 CVE-2017-3733 CT-1, CT-2, CT-3 None
OpenSSL 1.1.0 CVE-2018-0734 CT-1, CT-2 CT-3
OpenSSL 1.1.0 CVE-2019-1543 CT-1, CT-2, CT-3 CT-4
Table VI: The code transformations under which the manipulated vulnerabilities in the three real-world software products are detected or missed by the four ZigZag-enabled detectors

4.2 Dependence on Defender’s Code Transformations (RQ2)

(a) FPR
(b) FNR
(c) F1
Figure 7: Comparing the FPRs, FNRs, and F1s among the 6 instances of T_d for the ZigZag-enabled "Program Slice + Sequence + BGRU" detector

To show whether the robustness of ZigZag-enabled detectors depends on the code transformations used by the defender, namely T_d, we consider the following 6 instances of T_d: T_d,1 = ∅ (no code transformations), T_d,2 = {CT-2, CT-7, CT-8}, T_d,3 = {CT-3, CT-4, CT-6}, T_d,4 = {CT-2, CT-4, CT-5, CT-6}, T_d,5 = {CT-1, CT-2, CT-3, CT-4, CT-6, CT-7}, and T_d,6 = {CT-1, ..., CT-8} (all 8 code transformations). Since the ZigZag-enabled function-level detectors achieve similar results, we consider the ZigZag-enabled "Program Slice + Sequence + BGRU" detector with respect to the 6 instances of T_d.

Fig. 7 shows the comparison results. We observe that the ZigZag-enabled detectors using T_d,2 to T_d,6 achieve an 11.5% lower FPR, a 47.4% lower FNR, and a 44.1% higher F1 on average, compared with using no code transformations (i.e., T_d,1). This indicates that introducing known code transformations during training can significantly improve the robustness of DL-based detectors. We also observe that the ZigZag-enabled detectors using T_d,2 to T_d,5 are very close to the one using T_d,6. Specifically, the former detectors only achieve a 0.4% higher FPR, a 0.9% higher FNR, and a 0.7% lower F1 than the latter on average. This indicates that using several code transformations already achieves high effectiveness.

Insight 3

The code transformations available to the defender can significantly improve the robustness of DL-based detectors against code transformations. Several known code transformations can make the ZigZag-enabled detector achieve high effectiveness for manipulated programs.

4.3 Comparison with Other Vulnerability Detectors (RQ3)

We compare the effectiveness of ZigZag-enabled detectors with other widely-used vulnerability detectors: a similarity-based detector VUDDY [9], an open-source rule-based tool Flawfinder [12], a commercial rule-based tool Checkmarx [15], a DL-based function-level detector [32], and a program slice-level detector SySeVR [24]. We choose them because they are widely used to detect vulnerabilities in C programs and they are available to us. We use T_d = {CT-2, CT-7, CT-8} for the defender and use the target programs for testing.

Table VII summarizes the experimental results. We observe that VUDDY [9] has a very high FNR and a low F1 because it can only detect vulnerable functions that are nearly identical to the vulnerable functions in the training programs; therefore, most code transformations can cause VUDDY to miss vulnerabilities. Flawfinder [12] achieves a high FPR, a high FNR, and a low F1 because it does not use data flow analysis, which causes it to detect vulnerabilities inaccurately. Although Checkmarx [15] adopts data flow analysis, its rules, which are defined by human experts, are far from perfect, resulting in low effectiveness. The DL-based function-level detector [32] and SySeVR [24] are effective for the original test programs, but their effectiveness drops significantly when applied to the manipulated programs. However, the ZigZag framework can improve their effectiveness significantly. In particular, ZigZag-enabled SySeVR achieves an 8.4% FPR, a 19.2% FNR, and an 81.7% F1 when using T_d = {CT-2, CT-7, CT-8}, which outperforms all other vulnerability detectors in our experiments. This leads to:

Insight 4

The ZigZag-enabled detectors are much more effective than other kinds of vulnerability detectors when detecting vulnerabilities in manipulated programs.

Detector FPR (%) FNR (%) F1 (%)
VUDDY [9] 1.9 93.2 12.4
Flawfinder [12] 62.8 77.1 23.4
Checkmarx [15] 38.6 58.5 44.9
DL-based function-level detector [32] 38.7 46.3 50.6
SySeVR [24] 19.9 68.1 35.7
ZigZag-enabled function-level detector 17.8 22.7 75.8
ZigZag-enabled SySeVR 8.4 19.2 81.7
Table VII: Comparing the effectiveness of ZigZag-enabled detectors and the vulnerability detectors presented in the literature

5 Limitations and Future Work

The present study has some limitations. First, we focus on detecting vulnerabilities in C programs, but the methodology can be adopted or adapted to cope with other programming languages; experiments need to be conducted for other languages. Second, our experiments only consider 8 code transformations from Tigress, which are sufficient for demonstrating the feasibility of the attack and the effectiveness of the ZigZag framework. Future studies should investigate a broader set of code transformations. Third, since existing datasets do not serve our purposes, the effectiveness evaluation is conducted on the dataset we collect from the NVD and SARD, which may raise an external validity issue. We have made our dataset publicly available so that third parties can repeat and validate our experiments. Fourth, the attack against vulnerability detectors incurs a large degree of manipulation of the source code. It is an open problem whether or not ZigZag is effective against adversarial examples generated via small manipulations (assuming such examples are possible). Fifth, the ZigZag framework uses a pair of classifiers. It is an open problem to investigate whether or not using three or more classifiers would make the resulting detectors more robust.

6 Related Work

Prior studies on detecting vulnerabilities. Our vulnerability detector leverages the static analysis of source code, which is complementary to the dynamic analysis approach [5, 6, 7, 8]. Static analysis-based detectors can be further divided into code similarity-based, rule-based, and machine learning-based detectors. Code similarity-based detectors aim to detect vulnerabilities caused by code clones [9, 10, 11]. Rule-based detectors use expert-defined rules to detect vulnerabilities [56], including open-source tools [12, 13, 14], commercial tools [15, 16], and academic efforts [17, 18]. Machine learning-based detectors [33] use models learned from expert-defined feature representations of vulnerabilities [19, 20, 21, 22] or use DL models that do not require expert-defined feature representations [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]. DL-based detectors have received much attention recently; however, their robustness had not been studied until now.

Prior studies on adversarial examples and training. Adversarial examples have attracted much attention in many domains, such as image processing [36, 37], speech recognition [38], malware detection [39, 40], program analysis [41, 42, 43], and code authorship attribution [44]. To the best of our knowledge, this paper is the first to study adversarial examples in the field of vulnerability detection, which make DL-based detectors miss the vulnerabilities in manipulated programs. On the other hand, adversarial training is an important method for improving the robustness of DL-based models in fields such as image processing [57, 58, 59], natural language processing [60, 61], malware detection [62, 63], and source code processing [64, 65, 66]. The ZigZag framework proposed in this paper is an innovative adversarial training method for vulnerability detection.

Prior studies on code transformations. Code transformations are widely used for compiler-based optimizations [67, 68, 69], improving program readability and maintainability [70, 71, 72], intellectual property protection [73, 74], and the evaluation of program analysis tasks [75, 76]. Code transformation methods are divided into three classes: semantics-preserving vs. semantics-approximating vs. semantics-changing [76]. There are several code obfuscators and program transformation tools for C programs [50, 51, 52, 53]. In this paper, we leverage semantics-preserving transformations to attack DL-based detectors.

7 Conclusion

We studied the robustness of DL-based vulnerability detectors by using experiments to demonstrate that simple attacks can make vulnerabilities evade them. We presented an innovative ZigZag framework to enhance the robustness of DL-based vulnerability detectors. The key insight underlying the framework is to decouple feature learning and classifier learning and make the resulting features and classifiers robust against code transformations. Experimental results show that the ZigZag framework can substantially improve the robustness of DL-based detectors. The limitations discussed in Section 5 offer open problems for future research.

Acknowledgment

The authors from Huazhong University of Science and Technology were supported in part by the National Natural Science Foundation of China under Grant No. U1936211. S. Xu was supported in part by ARO Grant #W911NF-17-1-0566 as well as NSF Grants #1814825 and #1736209. Any opinions, findings, conclusions or recommendations expressed in this work are those of the authors and do not reflect the views of the funding agencies in any sense.

References

  • [1] “Common Vulnerabilities and Exposures,” http://cve.mitre.org/, 2020.
  • [2] “Open Source Software Supply Chain Security,” https://linuxfoundation.org/wp-content/uploads/oss_supply_chain_security.pdf, 2020.
  • [3] Synopsys, “The heartbleed bug,” https://heartbleed.com/, 2014.
  • [4] “Compromised npm package: event-stream,” https://medium.com/intrinsic/compromised-npm-package-event-stream-d47d08605502, 2018.
  • [5] V. J. M. Manès, H. Han, C. Han, S. K. Cha, M. Egele, E. J. Schwartz, and M. Woo, “The art, science, and engineering of fuzzing: A survey,” IEEE Trans. Software Eng., 2019.
  • [6] S. Gan, C. Zhang, X. Qin, X. Tu, K. Li, Z. Pei, and Z. Chen, “CollAFL: Path sensitive fuzzing,” in Proceedings of 2018 IEEE Symposium on Security and Privacy (S&P), San Francisco, California, USA, 2018, pp. 679–696.
  • [7] M. Böhme, V. Pham, M. Nguyen, and A. Roychoudhury, “Directed greybox fuzzing,” in Proceedings of 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), Dallas, TX, USA, 2017, pp. 2329–2344.
  • [8] K. K. Ispoglou, D. Austin, V. Mohan, and M. Payer, “Fuzzgen: Automatic fuzzer generation,” in Proceedings of the 29th USENIX Security Symposium, 2020, pp. 2271–2287.
  • [9] S. Kim, S. Woo, H. Lee, and H. Oh, “VUDDY: A scalable approach for vulnerable code clone discovery,” in Proceedings of 2017 IEEE Symposium on Security and Privacy (S&P), San Jose, CA, USA, 2017, pp. 595–614.
  • [10] Z. Li, D. Zou, S. Xu, H. Jin, H. Qi, and J. Hu, “VulPecker: An automated vulnerability detection system based on code similarity analysis,” in Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC), Los Angeles, CA, USA, 2016, pp. 201–213.
  • [11] J. Jang, A. Agrawal, and D. Brumley, “ReDeBug: Finding unpatched code clones in entire OS distributions,” in Proceedings of 2012 IEEE Symposium on Security and Privacy (S&P), San Francisco, California, USA, 2012, pp. 48–62.
  • [12] “Flawfinder,” http://www.dwheeler.com/flawfinder, 2019.
  • [13] “Rough Audit Tool for Security,” https://code.google.com/archive/p/rough-auditing-tool-for-security/, 2019.
  • [14] J. Viega, J. T. Bloch, Y. Kohno, and G. McGraw, “ITS4: A static vulnerability scanner for C and C++ code,” in Proceedings of the 16th Annual Computer Security Applications Conference (ACSAC), New Orleans, Louisiana, USA, 2000, pp. 257–267.
  • [15] “Checkmarx,” https://www.checkmarx.com/, 2019.
  • [16] “Coverity,” https://scan.coverity.com/, 2019.
  • [17] D. Gens, S. Schmitt, L. Davi, and A. Sadeghi, “K-Miner: Uncovering memory corruption in linux,” in Proceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS), San Diego, California, USA, 2018.
  • [18] B. Shastry, F. Yamaguchi, K. Rieck, and J. Seifert, “Towards vulnerability discovery using staged program analysis,” in Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), San Sebastián, Spain, 2016, pp. 78–97.
  • [19] F. Yamaguchi, M. Lottmann, and K. Rieck, “Generalized vulnerability extrapolation using abstract syntax trees,” in Proceedings of the 28th Annual Computer Security Applications Conference (ACSAC), Orlando, FL, USA, 2012, pp. 359–368.
  • [20] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, “Predicting vulnerable software components,” in Proceedings of 2007 ACM Conference on Computer and Communications Security (CCS), Alexandria, Virginia, USA, 2007, pp. 529–540.
  • [21] G. Grieco, G. L. Grinblat, L. C. Uzal, S. Rawat, J. Feist, and L. Mounier, “Toward large-scale vulnerability discovery using machine learning,” in Proceedings of the 6th ACM on Conference on Data and Application Security and Privacy (CODASPY), New Orleans, LA, USA, 2016, pp. 85–96.
  • [22] F. Yamaguchi, A. Maier, H. Gascon, and K. Rieck, “Automatic inference of search patterns for taint-style vulnerabilities,” in Proceedings of 2015 IEEE Symposium on Security and Privacy (S&P), San Jose, CA, USA, 2015, pp. 797–812.
  • [23] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, “VulDeePecker: A deep learning-based system for vulnerability detection,” in Proceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS), San Diego, California, USA, 2018.
  • [24] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen, “SySeVR: A framework for using deep learning to detect software vulnerabilities,” IEEE Trans. Dependable Sec. Comput., doi: 10.1109/TDSC.2021.3051525, 2021.
  • [25] D. Zou, S. Wang, S. Xu, Z. Li, and H. Jin, “μVulDeePecker: A deep learning-based system for multiclass vulnerability detection,” IEEE Trans. Dependable Sec. Comput., doi: 10.1109/TDSC.2019.2942930, 2019.
  • [26] G. Lin, J. Zhang, W. Luo, L. Pan, and Y. Xiang, “POSTER: Vulnerability discovery with function representation learning from unlabeled projects,” in Proceedings of 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), Dallas, TX, USA, 2017, pp. 2539–2541.
  • [27] X. Duan, J. Wu, S. Ji, Z. Rui, T. Luo, M. Yang, and Y. Wu, “VulSniper: Focus your attention to shoot fine-grained vulnerabilities,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 2019, pp. 4665–4671.
  • [28] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks,” in Proceedings of 2019 Annual Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 2019, pp. 10197–10207.
  • [29] G. Lin, J. Zhang, W. Luo, L. Pan, Y. Xiang, O. Y. de Vel, and P. Montague, “Cross-project transfer representation learning for vulnerable function discovery,” IEEE Trans. Industrial Informatics, vol. 14, no. 7, pp. 3289–3297, 2018.
  • [30] V. Nguyen, T. Le, T. Le, K. Nguyen, O. DeVel, P. Montague, L. Qu, and D. Q. Phung, “Deep domain adaptation for vulnerable code function identification,” in Proceedings of 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 2019, pp. 1–8.
  • [31] S. Liu, G. Lin, L. Qu, J. Zhang, O. De Vel, P. Montague, and Y. Xiang, “CD-VulD: Cross-domain vulnerability discovery based on deep domain adaptation,” IEEE Trans. Dependable Sec. Comput., doi: 10.1109/TDSC.2020.2984505, 2020.
  • [32] G. Lin, W. Xiao, J. Zhang, and Y. Xiang, “Deep learning-based vulnerable function detection: A benchmark,” in Proceedings of the 21st International Conference on Information and Communications Security (ICICS), Beijing, China, 2019, pp. 219–232.
  • [33] T. Sonnekalb, “Machine-learning supported vulnerability detection in source code,” in Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Tallinn, Estonia, 2019, pp. 1180–1183.
  • [34] R. L. Russell, L. Y. Kim, L. H. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. M. Ellingwood, and M. W. McConley, “Automated vulnerability detection in source code using deep representation learning,” in Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 2018, pp. 757–762.
  • [35] H. Wang, G. Ye, Z. Tang, S. H. Tan, S. Huang, D. Fang, Y. Feng, L. Bian, and Z. Wang, “Combining graph-based learning with automated data collection for code vulnerability detection,” IEEE Trans. Inf. Forensics Secur., vol. 16, pp. 1943–1958, 2021.
  • [36] I. J. Goodfellow, “Defense against the dark arts: An overview of adversarial example security research and future research directions,” CoRR, vol. abs/1806.04169, 2018.
  • [37] X. Yuan, P. He, Q. Zhu, and X. Li, “Adversarial examples: Attacks and defenses for deep learning,” IEEE Trans. Neural Networks Learn. Syst., vol. 30, no. 9, pp. 2805–2824, 2019.
  • [38] Y. Qin, N. Carlini, G. W. Cottrell, I. J. Goodfellow, and C. Raffel, “Imperceptible, robust, and targeted adversarial examples for automatic speech recognition,” in Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, California, USA, 2019, pp. 5231–5240.
  • [39] D. Li, Q. Li, Y. Ye, and S. Xu, “SoK: Arms race in adversarial malware detection,” CoRR, vol. abs/2005.11671, 2020.
  • [40] D. Li and Q. Li, “Adversarial deep ensemble: Evasion attacks and defenses for malware detection,” IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 3886–3900, 2020.
  • [41] M. R. I. Rabin, K. Wang, and M. A. Alipour, “Testing neural program analyzers,” in Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA, 2019.
  • [42] H. Zhang, Z. Li, G. Li, L. Ma, Y. Liu, and Z. Jin, “Generating adversarial examples for holding robustness of source code processing models,” in Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 2020, pp. 1169–1176.
  • [43] N. Yefet, U. Alon, and E. Yahav, “Adversarial examples for models of code,” CoRR, vol. abs/1910.07517, 2019.
  • [44] E. Quiring, A. Maier, and K. Rieck, “Misleading authorship attribution of source code using adversarial learning,” in Proceedings of the 28th USENIX Security Symposium (USENIX Security), Santa Clara, CA, USA, 2019, pp. 479–496.
  • [45] “National Vulnerability Database,” https://nvd.nist.gov, 2020.
  • [46] “Software Assurance Reference Dataset,” https://samate.nist.gov/SRD/index.php, 2020.
  • [47] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: Learning distributed representations of code,” Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 40:1–40:29, 2019.
  • [48] V. Kovalenko, E. Bogomolov, T. Bryksin, and A. Bacchelli, “PathMiner: A library for mining of path-based representations of code,” in Proceedings of the 16th International Conference on Mining Software Repositories (MSR), Montreal, Canada, 2019, pp. 13–17.
  • [49] J. Harer, O. Ozdemir, T. Lazovich, C. P. Reale, R. L. Russell, L. Y. Kim, and S. P. Chin, “Learning to repair software vulnerabilities with generative adversarial networks,” in Proceedings of 2018 Annual Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada, 2018, pp. 7944–7954.
  • [50] “Tigress,” https://tigress.wtf/, 2020.
  • [51] “Stunnix C/C++ obfuscator,” http://stunnix.com, 2019.
  • [52] “Sourceformatx,” http://www.sourceformat.com/obfuscate-code-cpp.htm, 2019.
  • [53] “Coccinelle,” http://coccinelle.lip6.fr/, 2019.
  • [54] M. Pendleton, R. Garcia-Lebron, J. Cho, and S. Xu, “A survey on systems security metrics,” ACM Comput. Surv., vol. 49, no. 4, pp. 62:1–62:35, 2017.
  • [55] L. Huang, A. D. Joseph, B. Nelson, B. I. P. Rubinstein, and J. D. Tygar, “Adversarial machine learning,” in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence (AISec), Chicago, IL, USA, 2011, pp. 43–58.
  • [56] F. Yamaguchi, “Pattern-based vulnerability discovery,” Ph.D. dissertation, University of Göttingen, 2015.
  • [57] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 2018.
  • [58] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. P. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein, “Adversarial training for free!” in Proceedings of 2019 Annual Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 2019, pp. 3353–3364.
  • [59] X. Gao, R. K. Saha, M. R. Prasad, and A. Roychoudhury, “Fuzz testing based data augmentation to improve robustness of deep neural networks,” in Proceedings of the 42nd International Conference on Software Engineering (ICSE), Seoul, South Korea, 2020, pp. 1147–1158.
  • [60] J. Li, S. Ji, T. Du, B. Li, and T. Wang, “TextBugger: Generating adversarial text against real-world applications,” in Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS), San Diego, California, USA, 2019.
  • [61] Y. Zhang, A. Albarghouthi, and L. D’Antoni, “Robustness to programmable string transformations via augmented abstract training,” in Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual Event, 2020, pp. 11023–11032.
  • [62] Y. Chen, S. Wang, D. She, and S. Jana, “On training robust PDF malware classifiers,” in Proceedings of the 29th USENIX Security Symposium (USENIX Security), 2020, pp. 2343–2360.
  • [63] D. Li and Q. Li, “Enhancing robustness of deep neural networks against adversarial malware samples: Principles, framework, and application to AICS’2019 challenge,” in Proceedings of the AAAI-19 Workshop on Artificial Intelligence for Cyber Security (AICS), Honolulu, Hawaii, USA, 2019.
  • [64] H. Zhang, Z. Li, G. Li, L. Ma, Y. Liu, and Z. Jin, “Generating adversarial examples for holding robustness of source code processing models,” in Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 2020, pp. 1169–1176.
  • [65] P. Bielik and M. T. Vechev, “Adversarial robustness for code,” in Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual Event, 2020, pp. 896–907.
  • [66] N. Yefet, U. Alon, and E. Yahav, “Adversarial examples for models of code,” Proc. ACM Program. Lang., vol. 4, no. OOPSLA, pp. 162:1–162:30, 2020.
  • [67] D. Whitfield and M. L. Soffa, “An approach for exploring code-improving transformations,” ACM Trans. Program. Lang. Syst., vol. 19, no. 6, pp. 1053–1084, 1997.
  • [68] C. Kartsaklis, O. R. Hernandez, C. Hsu, T. Ilsche, W. Joubert, and R. L. Graham, “HERCULES: A pattern driven code transformation system,” in Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, Shanghai, China, 2012, pp. 574–583.
  • [69] J. M. P. Cardoso, J. G. de Figueiredo Coutinho, and P. C. Diniz, Embedded Computing for High Performance: Efficient Mapping of Computations Using Customization, Code Transformations and Compilation.   Morgan Kaufmann, 2017.
  • [70] G. Kniesel and H. Koch, “Static composition of refactorings,” Sci. Comput. Program., vol. 52, pp. 9–51, 2004.
  • [71] M. Fowler, Refactoring: Improving the Design of Existing Code.   Addison-Wesley Professional, 2018.
  • [72] J. A. Dallal and A. Abdin, “Empirical evaluation of the impact of object-oriented code refactoring on quality attributes: A systematic literature review,” IEEE Trans. Software Eng., vol. 44, no. 1, pp. 44–69, 2018.
  • [73] S. Ko, J. Choi, and H. Kim, “COAT: Code obfuscation tool to evaluate the performance of code plagiarism detection tools,” in Proceedings of 2017 International Conference on Software Security and Assurance (ICSSA), Altoona, PA, USA, 2017, pp. 32–37.
  • [74] O. M. Mirza, “Style analysis for source code plagiarism detection,” Ph.D. dissertation, University of Warwick, Coventry, UK, 2018.
  • [75] C. K. Roy and J. R. Cordy, “A mutation/injection-based automatic framework for evaluating code clone detection tools,” in Proceedings of the 2nd International Conference on Software Testing Verification and Validation (ICST), Denver, Colorado, USA, 2009, pp. 157–166.
  • [76] K. Wang and M. Christodorescu, “COSET: A benchmark for evaluating neural program embeddings,” CoRR, vol. abs/1905.11445, 2019.