Ranking Warnings of Static Analysis Tools Using Representation Learning

10/07/2021
by Kien-Tuan Ngo, et al. (VNU)

Static analysis tools are frequently used to detect potential vulnerabilities in software systems. However, an inevitable problem of these tools is their large number of warnings with a high false positive rate, which consume time and effort to investigate. In this paper, we present DeFP, a novel method for ranking static analysis warnings. Based on the intuition that warnings which have similar contexts tend to have similar labels (true positive or false positive), DeFP is built with two BiLSTM models to capture the patterns associated with the contexts of labeled warnings. After that, for a set of new warnings, DeFP can calculate their likelihoods to be true positives (i.e., actual vulnerabilities) and rank them accordingly. Our experimental results on a dataset of 10 real-world projects show that using DeFP, by investigating only 60% of the warnings, developers can find +90% of actual vulnerabilities. Moreover, DeFP improves the state-of-the-art approach by about 30% in both Precision and Recall.



I. Introduction

In order to guarantee the quality of software, many techniques such as code review, automatic static analysis, and testing have been applied during the software development life cycle. In particular, static analysis [1, 27] plays an important role in detecting vulnerabilities in the early phases. Without executing programs, static analysis (SA) tools analyze the source code to identify violations of pre-defined rules and recommendations. These rules and recommendations are often defined by coding standards such as the SEI CERT Coding Rules [11] or MISRA [36].

However, SA tools often report a large number of warnings (a.k.a. alarms) [3]. In particular, a warning indicates the statement containing the potential vulnerability, the vulnerability type, and often additional meta-information [10]. In practice, developers have to manually inspect all the reported warnings and address them if necessary. However, due to the conservative over-approximation of program behaviors/properties in SA, many warnings are incorrectly reported by the tools (i.e., false positive warnings). Among the reported warnings, true positive warnings or true positives (TPs) are actual vulnerabilities, while false positive warnings or false positives (FPs) are positions which do not in fact violate the checked rules. Investigating FPs consumes time and effort but does not bring any benefit; therefore, the high FP rate reduces the productivity of developers [15, 16] and harms the usability of SA tools [15, 35]. Consequently, it is necessary to reduce the number of FPs that developers need to verify.

In previous studies, sophisticated techniques such as model checking and symbolic execution have been applied to eliminate FPs [32, 18, 29]. Although these approaches obtain high precision in identifying FPs, they generally face scalability and time-consumption issues because of state-space explosion [26].

With the growth in size and complexity of source code, machine learning (ML) techniques have recently been leveraged to build models for discovering patterns associated with TPs/FPs. In general, detecting TPs/FPs among SA warnings can be considered a standard binary classification problem. To build an SA warning classification model, there are two main methods for extracting features from the warnings: one uses a pre-defined set of features and the other encodes the features with ML models. Then, from the extracted features, the ML models predict whether an SA warning is a TP or an FP.

In several existing studies [37, 10, 4], fixed sets of hand-engineered features based on static code metrics and warning information have been derived from the source code and then fed to classifiers. The effectiveness of these approaches depends on the quality of the selected set of features. Moreover, the features in these approaches are manually defined by experts for certain kinds of warnings, so it is difficult to extend them to handle different ones.

Meanwhile, instead of using a fixed pre-defined set of features, Lee et al. [17] proposed a model to learn the lexical patterns of the statements around the warnings at the source code token level. They used Word2vec [23] to embed code tokens into vector form and then trained a Convolutional Neural Network (CNN) classifier. Specifically, by investigating their projects, they defined the number of surrounding statements that need to be extracted to reflect the contexts of the warnings; these numbers of extracted statements differ across their six proposed checkers. However, it is challenging to apply this approach to different projects and/or different kinds of warnings, since it requires expert knowledge and careful manual investigation to decide how many statements in each function are enough to capture the contexts of the checked warnings. Moreover, not every statement surrounding a warning is related to its violation, nor are all statements equally important for TP/FP detection. Also, warning-unrelated statements would negatively affect the performance of the ML models.

In this paper, we propose DeFP, a novel method to prioritize SA warnings. Instead of classifying SA warnings, DeFP predicts their likelihoods to be TPs and then ranks them so that the top entries are more likely to be TPs (actual vulnerabilities) and the last entries are more likely to be FPs. Indeed, ranking SA warnings rather than classifying them provides the following three benefits.

First, for developers, DeFP helps shorten the cycle of developing and releasing products, especially for critical systems. The reason is that critical systems are highly required to be safe, secure, and reliable. Therefore, all potential vulnerabilities (warnings) need to be addressed. In other words, if a warning is eliminated because it is falsely classified as an FP, it could leave the system dangerously exposed in the future. Instead of directly eliminating any warnings classified as FPs, DeFP ranks SA warnings according to their likelihoods of being vulnerable. Under the pressure to release high-quality software on time, focusing on the top-ranked warnings first helps developers find more actual vulnerabilities in a fixed duration. Then, they can spend time and effort on the low-ranked warnings later.

Second, for SA tool builders, DeFP suggests case studies that they can examine to improve the quality of their tools. Indeed, to better serve the market, SA tool builders need to frequently improve their tools by not only increasing the TP rate but also decreasing the FP rate. Among the huge number of warnings reported across various projects, DeFP suggests an effective order for investigation. Specifically, SA tool builders can directly concentrate on addressing warnings which are more likely to be FPs, i.e., warnings at the bottom of the resulting lists, to find the patterns which tend to be incorrectly reported.

Third, for researchers, DeFP helps make the data collection process for building datasets of real-world warnings more efficient. In practice, this field still lacks public datasets for evaluating approaches, and researchers often have to manually investigate warnings to label them, which is extremely time-consuming. From the ranked lists of DeFP, researchers can effectively collect warnings by selectively labeling top-ranked and bottom-ranked warnings.

Our key idea is based on the intuition that warnings which have similar contexts tend to have similar labels (TP or FP). For each warning, DeFP exploits both the syntax and semantics of the warning's context and then determines its likelihood to be a TP. In order to capture the context of a warning, DeFP extracts all of the statements in the program which impact and are impacted by the statement containing the warning (the reported statement). After that, to better represent the general patterns of warnings, identifiers and literals, which are specific to functions/files/projects and could bias the ML models towards the training source code, are replaced by abstract names. Next, the reported statements and their contexts are vectorized and used to train neural network models. One of the models represents the information specifically included in the reported statements, while the other extracts critical information from the warning contexts. Then, the high-level features encoded by these models are utilized to estimate the likelihoods of the corresponding warnings being TPs. Lastly, the SA warnings are ranked according to their predicted scores.

To the best of our knowledge, there is still no public real-world dataset for widely evaluating approaches that post-handle SA warnings. In existing studies [37, 10, 4], ML models are often trained and tested on synthetic datasets, such as Juliet [31] and SARD [30]. However, Chakraborty et al. [5] have demonstrated that these datasets are too simple to estimate the performance of ML models on real-world data. Thus, to address this data shortage, we propose a dataset containing 6,620 warnings of 10 real-world projects.

Our experiments show that about 60% of actual vulnerabilities are ranked by DeFP in the Top-20% of warnings. Moreover, +90% of actual vulnerabilities can be found by investigating only 60% of the total warnings. Meanwhile, by using the state-of-the-art approach [17], with the same numbers of examined warnings, developers can find only 46% and 82% of the TPs, respectively.

In summary, our contributions in this paper are:

  • A novel approach to rank SA warnings, which does not require feature engineering and can flexibly be extended to different kinds of warnings.

  • A public dataset of 6,620 warnings collected from 10 real-world projects, which can be used as a benchmark for evaluating related work.

  • An extensive experimental evaluation showing the improvement of DeFP over the state-of-the-art approach [17].

II. Motivating Example and Guiding Principles

II-A. Motivating Example

Fig. 1: An FP reported by Flawfinder at line 24

Fig. 1 shows a simplified version of function aoc_s_event in project Asterisk (https://github.com/asterisk/asterisk); the complete version can be found on our website [8]. In this example, a warning related to Buffer Overflow (BO) is reported at line 24 by Flawfinder [9], a static analysis tool. The reason is that strcat appends the string pointed to by rate_str to the end of prefix. This could cause the size of the resulting string stored in prefix to be greater than prefix's allocated size (line 15).

However, via inter-procedural analysis, we can conclude that this warning is an FP. Specifically, in Fig. 1, prefix is first set to R(i), where i is the index of the loop, so even when i reaches the largest int value (i.e., 2,147,483,647), the length of prefix after this statement is small and bounded. prefix is then appended a single character (i.e., "/") and, at line 24, the string pointed to by rate_str, which has a fixed, short length (rate_str = "Duration"). As a result, after line 24, the maximum length of prefix is still much smaller than its allocated size. Therefore, in order to determine whether a warning is a TP or an FP, one needs to conduct not only intra-procedural but also inter-procedural analysis. In other words, simply approximating the context of a warning by its surrounding statements or by its containing function could be ineffective.

II-B. Guiding Principles

In order to determine whether an SA warning is a TP or an FP, analyzing only the reported statement is not enough. It requires investigating the context of the warning as well. For example, to conclude that the warning at line 24 in Fig. 1 is an FP, not only that statement but also related statements such as those at lines 18, 19, and 20 need to be examined. Therefore, to build an ML model which can effectively predict the likelihood of a warning being a TP/FP, for each warning, we need to extract its appropriate context in the program. From the extracted contexts, the model can capture patterns associated with the warnings. Also, statements unrelated to the warning, which might cause noise and negatively affect the performance of the model, should be excluded from the warning context. In this paper, we propose the following principles for the problem of ranking SA warnings by representation learning.

P1. The warning contexts can be semantically captured by the statements in the program which impact and are impacted by the reported statements. In practice, to determine whether a warning indicates an actual violation of a specific vulnerability type or not, it is necessary to investigate all of the feasible execution paths containing the reported statement. In other words, we need to examine the control flows and data flows of the program which contain the reported statement. Besides, functions in a program do not work independently; they often execute with invocations of the others. Thus, inter-procedural analysis is essential to effectively capture the contexts of the warnings. Simply considering all the statements in the functions/programs as the warning contexts is possible, but this method can produce unnecessarily large contexts and negatively affect the ML model's predictive performance. Also, not all of the statements in the functions/programs are actually relevant to the warnings. For example, the statement at line 17 in Fig. 1 does not affect the decision about whether the warning at line 24 is a TP or an FP. Including such irrelevant statements may prevent the ML model from learning the actual patterns associated with TPs/FPs. Therefore, inter-procedural slicing techniques [13] can be applied to capture the warning contexts by the statements semantically related to the reported statements and to eliminate the irrelevant ones.

P2. The reported statements should be highlighted compared to the other statements in the warning contexts. Intuitively, not all the statements in the program slices are equally important with regard to the considered warnings. The reported statements are where the vulnerabilities are potentially exposed; thus, they should be highlighted compared to the other statements in their contexts. For example in Fig. 1, the statements that modify the value of prefix are all necessary for investigating the warning. However, according to the report of Flawfinder, the BO vulnerability is potentially exposed at line 24, which appends a string of unknown size, i.e., a string returned by another function, to prefix. Intuitively, this statement should be emphasized compared to the other statements modifying prefix. Moreover, in practice, a program slice could contain several warnings; therefore, to distinguish warnings, not only the program slices (i.e., the contexts of the warnings) but also the reported statements need to be featurized.

P3. Identifiers should be abstracted. The reason is that identifiers such as variables, function names, and constants are project-specific (or even file-specific/function-specific) and vary considerably with developers' coding styles. By learning such specific information, the ML model could not capture general patterns of the warnings. Also, this could make the models simply learn connections between specific identifiers and warning labels (TP/FP). Consequently, the models could accurately predict the warnings of several training programs, but their performance might decrease dramatically on different programs. Therefore, to build a general ML model which can work well and stably across programs, the identifiers should be abstracted into symbolic names, for example, VAR1, FUNC1, etc. Moreover, without abstraction, the number of identifiers could be virtually infinite, so the ML model could have to deal with a vocabulary explosion problem.

III. Static Analysis Warning Ranking with Representation Learning

Fig. 2: Our proposed approach for ranking SA warnings

Fig. 2 illustrates our SA warning ranking approach. Particularly, from the source code and the set of warnings of the analyzed program, we extract the reported statements and their program slices associated with warnings. For each warning, the reported statement and the corresponding program slice are converted into vectors and then fed to the BiLSTM models to predict its likelihood to be TP. After that, all of the warnings of the program are ranked according to their predicted scores.

III-A. Program Slice Extraction

In this work, to capture the context of each warning, we extract all the statements in the program which impact and are impacted by the corresponding reported statement. Specifically, starting from the reported statement, we employ Joern [14] to conduct both backward and forward inter-procedural slicing in the program. For instance, the context of the warning at line 24 in Fig. 1 is captured by the program slice shown in Fig. 3. By this approach, not only are a large number of irrelevant statements in the program removed, but the warning contexts are also precisely captured via control/data dependency relationships throughout the program.
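To make this step concrete, the sketch below assembles a warning context from a program dependence graph (PDG). It is an illustrative stand-in using networkx rather than the actual Joern queries, and the PDG object, its edge direction, and the node identifiers are assumptions.

```python
import networkx as nx

def extract_warning_context(pdg: nx.DiGraph, reported_stmt):
    """Backward slice (statements the warning depends on) plus forward slice
    (statements depending on it), assuming PDG edges point from a statement to
    the statements that are control- or data-dependent on it."""
    backward = nx.ancestors(pdg, reported_stmt)     # statements impacting the warning
    forward = nx.descendants(pdg, reported_stmt)    # statements impacted by the warning
    return backward | forward | {reported_stmt}
```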

III-B. Input Vectorization

Program slices and reported statements are sequences of lexical source code tokens, while neural network models require their inputs to be formalized as numeric vectors. Therefore, we need to represent the model input data in a suitable form. In this section, we describe the steps used to represent the program slice and the reported statement of each warning: identifier abstraction, tokenization, padding and truncation, and vectorization.

III-B1. Identifier Abstraction

In general, programs often contain a huge number of identifiers, and their naming conventions and styles are diverse. This can make it difficult for ML models to capture the general patterns of the warnings [5]. Besides, ML models could simply learn characteristics of identifiers in certain projects, or simply map specific identifiers to the corresponding warning labels. In order to avoid this problem, DeFP abstracts all the identifiers before feeding them to the models. In particular, variables, function names, and constants in the extracted program slices are replaced by common symbolic names. For example, the function name aoc_s_event is replaced by FUNC1, the array prefix is replaced by VAR5, and its allocated size is replaced by NUMBER_LIT. The details of our rules for abstracting identifiers can be found on our website [8].
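As a rough illustration of these abstraction rules, the sketch below renames variables and user-defined functions consistently and collapses constants into shared symbols. It assumes each token's syntactic role is already known (e.g., from the parser's output) and that library calls such as strcat are kept unchanged, as the example above suggests; STRING_LIT is an assumed symbol, and the authoritative rules are those published on our website [8].

```python
def abstract_tokens(tokens):
    """tokens: list of (text, kind) pairs produced by a lexer/parser, with kind in
    {'var', 'func', 'num', 'str', 'other'} (a simplifying assumption)."""
    var_map, func_map, out = {}, {}, []
    for text, kind in tokens:
        if kind == "var":
            out.append(var_map.setdefault(text, f"VAR{len(var_map) + 1}"))
        elif kind == "func":                 # user-defined functions; library calls such
            out.append(func_map.setdefault(  # as strcat would be tagged 'other' and kept
                text, f"FUNC{len(func_map) + 1}"))
        elif kind == "num":
            out.append("NUMBER_LIT")         # all numeric constants share one symbol
        elif kind == "str":
            out.append("STRING_LIT")         # assumed symbol for string literals
        else:
            out.append(text)                 # keywords, operators, punctuation unchanged
    return out
```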

Fig. 3: The program slice of the warning at line 24 in Fig. 1

III-B2. Tokenization

To be represented as numeric vectors for the neural network models, both the extracted program slices and the reported statements are tokenized into sequences of tokens. In this work, we use lexical analysis to break down each code statement into code tokens, including identifiers, keywords, punctuation marks, and operators. For instance, the statement at line 24 in Fig. 1, strcat(prefix, rate_str);, is abstracted as strcat(VAR8, VAR11); and then tokenized into seven separate tokens: “strcat”, “(”, “VAR8”, “,”, “VAR11”, “)” and “;”.
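A minimal regex-based tokenizer that reproduces this example is sketched below; DeFP's actual tokenizer is built on NLTK (Section IV-B1), so this is only an approximation of the lexical analysis described here.

```python
import re

# Single regex alternation: identifiers, numbers, two-character operators, then any
# other single non-space character (punctuation, braces, single-character operators).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+\.?\d*|==|!=|<=|>=|->|&&|\|\||[^\sA-Za-z_0-9]")

def tokenize(statement: str):
    return TOKEN_RE.findall(statement)

print(tokenize("strcat(VAR8, VAR11);"))
# -> ['strcat', '(', 'VAR8', ',', 'VAR11', ')', ';']  (the seven tokens above)
```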

III-B3. Padding and Truncation

In practice, the numbers of tokens in different slices can differ significantly. For example, in our experiments, the sequence lengths vary from 5 to 9,566 code tokens. Therefore, to ensure that all the sequences have the same fixed length before being input to the neural network, we use padding and truncation techniques. To achieve the best performance, the fixed length is carefully selected via multiple experiments.

Particularly, for sequences shorter than the fixed length, we add one or more special (pad) tokens at the end. For sequences longer than the fixed length, we truncate them to fit it.

In practice, the positions of the reported statements in their corresponding slices vary significantly. They can appear at the beginning or at the end of the slices. Thus, truncating from either the beginning or the end of the sequences could lead to cases where the statements containing warnings are missing from the truncated sequences. Therefore, in order to guarantee that the truncated sequences always contain the reported statements, we take these statements as the center for truncation. Specifically, from the reported statements, we extend to both sides of the sequences until reaching the fixed length. Importantly, to capture the correct semantics of each code statement, we ensure that a statement is either fully included in the truncated sequence or not included at all. That is, when only some tokens of a statement fall inside the truncated sequence and the rest are cut off due to the length limitation, we replace all of that statement's tokens in the truncated sequence with the (pad) token.
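The sketch below illustrates this centered truncation under a few stated assumptions: statement boundaries are available as token-index ranges, the reported statement itself is shorter than the fixed length, and all helper names are hypothetical.

```python
PAD = "(pad)"

def pad_or_truncate(tokens, stmt_bounds, reported_idx, max_len):
    """tokens: token sequence of one program slice; stmt_bounds: list of (start, end)
    token-index ranges, one per statement (end exclusive); reported_idx: index of the
    reported statement in stmt_bounds; max_len: the fixed sequence length."""
    if len(tokens) <= max_len:
        return tokens + [PAD] * (max_len - len(tokens))        # padding case

    # Truncation case: expand a window around the reported statement (assumed to be
    # shorter than max_len) until the window reaches the fixed length.
    left, right = stmt_bounds[reported_idx]
    while right - left < max_len:
        if left > 0:
            left -= 1
        if right < len(tokens) and right - left < max_len:
            right += 1

    window = tokens[left:right]
    # A statement cut by the window boundary is fully masked with (pad) tokens, so
    # every remaining statement keeps its complete meaning.
    for start, end in stmt_bounds:
        if start < left < end or start < right < end:
            for i in range(max(start, left), min(end, right)):
                window[i - left] = PAD
    return window
```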

Fig. 4: The proposed LSTM-based representation learning model for ranking SA warnings

III-B4. Vectorization

In this step, the token sequences are embedded into fixed-length numeric vectors. In practice, besides the structural information of the sequences, the relationships between the tokens also need to be encoded. The reason is that code tokens have to appear together in a certain order to make the program grammatically and syntactically correct [22]. For this purpose, in the vectorization step, we use the Word2vec model [23] with the Continuous Bag-of-Words (CBOW) architecture.
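A minimal sketch of this step with gensim is shown below; the corpus here is a toy placeholder, and note that the paper uses gensim 3.6, where the vector_size parameter is named size.

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus of abstracted token sequences; in DeFP these are the
# full sequences produced by the previous steps.
token_sequences = [
    ["strcat", "(", "VAR8", ",", "VAR11", ")", ";"],
    ["VAR5", "[", "NUMBER_LIT", "]", ";"],
]

# sg=0 selects the CBOW architecture; 96 matches the embedding size reported in
# Section IV-B1 (in gensim 3.6 the first parameter is called `size`).
w2v = Word2Vec(sentences=token_sequences, vector_size=96, window=5, min_count=1, sg=0)
vector = w2v.wv["strcat"]   # the 96-dimensional embedding of one code token
```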

III-C. Representation Learning and Warning Ranking

The DeFP model architecture is shown in Fig. 4. Specifically, to effectively learn the contextual information that is crucial for revealing TP/FP code patterns, two Bidirectional Long Short-Term Memory networks (BiLSTM) [12] are employed and trained on the embedding vectors of the program slices and the reported statements. Afterwards, DeFP extracts the meaningful characteristics of the warnings by concatenating the outputs of these BiLSTM models and feeding them into Fully Connected layers. In particular, we consider the output of the model's final layer as the likelihood of each input warning being a TP/FP. All these scores are finally gathered by DeFP and ranked accordingly.

III-C1. Representation Learning

In this work, the warning contexts (i.e., program slices) and the reported statements are encoded by two BiLSTM networks. Since program slices contain multiple statements distributed across functions, the LSTM architecture can capture the relationships between code tokens. Additionally, by utilizing a gated mechanism, LSTM can handle long-term dependencies and focus on the most significant parts of the sequences.

However, a standard LSTM propagates information in only one direction through consecutive time steps. Meanwhile, the occurrence of a code token is usually related to the previous tokens, the subsequent tokens, or both. Thus, applying the bidirectional variant of LSTM helps the model build dependencies in both the forward and backward directions, which efficiently captures the general patterns of warnings.

DeFP also employs Global Max Pooling (GMP) layers to accumulate the output of each BiLSTM network. Specifically, a GMP layer computes maximum values over the LSTM's time steps, which helps reduce the output dimension and keeps only the most important elements from the LSTM cells. As a result, DeFP has two GMP layers, one following each BiLSTM network, and their outputs are concatenated into a unified feature map (Fig. 4).
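A sketch of this architecture in Keras is given below, using the hyper-parameter values reported in Section IV-B1 (sequence lengths 600 and 40, embedding size 96, 256 LSTM units, dropout 0.1, cross-entropy loss, Adamax with learning rate 0.002); the widths of the first two Dense layers are not stated in the paper and are assumed here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

slice_in = layers.Input(shape=(600, 96), name="program_slice")      # Word2vec vectors
stmt_in = layers.Input(shape=(40, 96), name="reported_statement")

# One BiLSTM + Global Max Pooling branch per input.
slice_feat = layers.GlobalMaxPooling1D()(
    layers.Bidirectional(layers.LSTM(256, return_sequences=True))(slice_in))
stmt_feat = layers.GlobalMaxPooling1D()(
    layers.Bidirectional(layers.LSTM(256, return_sequences=True))(stmt_in))

# Concatenated feature map fed into three Dense layers; the last one is softmax.
x = layers.concatenate([slice_feat, stmt_feat])
x = layers.Dropout(0.1)(x)
x = layers.Dense(256, activation="relu")(x)       # assumed width
x = layers.Dense(64, activation="relu")(x)        # assumed width
out = layers.Dense(2, activation="softmax")(x)    # likelihoods of FP and TP

model = Model([slice_in, stmt_in], out)
model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=0.002),
              loss="categorical_crossentropy")
```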

III-C2. Warning Ranking

After obtaining the learned representations of the warnings, DeFP distinguishes their patterns by feeding them into three subsequent Fully Connected (Dense) layers. Particularly, the final layer has only two hidden units activated by the Softmax function, which produces two scores summing to 1.0. These two values correspond to the likelihoods of each warning being a TP and an FP, respectively.

In the training phase, DeFP's neural network improves its predictions by finding the best hidden weights through estimating its errors. In other words, an objective function, cross-entropy, is applied to calculate the model's loss, and the weights are updated towards minimizing this error value. Consequently, for a TP warning, the model tends to push its TP score towards 1.0 and its FP score towards 0.0, and vice versa for an FP warning. In the ranking phase, given a list of warnings, DeFP directly calculates their TP scores and sorts them in descending order.
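A minimal sketch of the training and ranking phases described above, with hypothetical placeholder inputs; `model` is the network sketched in Section III-C1, and which softmax unit corresponds to the TP class depends on how the labels are encoded.

```python
import numpy as np
import tensorflow as tf

# Hypothetical placeholders standing in for the padded embedding sequences and the
# 0/1 (FP/TP) labels of the training warnings.
slice_vecs = np.random.rand(32, 600, 96).astype("float32")
stmt_vecs = np.random.rand(32, 40, 96).astype("float32")
labels = np.random.randint(0, 2, size=32)

labels_onehot = tf.keras.utils.to_categorical(labels, num_classes=2)
model.fit([slice_vecs, stmt_vecs], labels_onehot, batch_size=64, epochs=60)

# Ranking phase: assuming the second softmax unit is the TP class, use it as the
# TP likelihood and sort the warnings in descending order of this score.
tp_scores = model.predict([slice_vecs, stmt_vecs])[:, 1]
ranking = np.argsort(-tp_scores)   # warning indices, most TP-like first
```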

IV. Empirical Methodology

In order to evaluate DeFP, we seek to answer the following research questions:

  • RQ1: How accurate is DeFP in ranking SA warnings, and how does it compare to the state-of-the-art approach [17]?

  • RQ2: How does the extracted warning context affect DeFP’s performance? (P1)

  • RQ3: How does highlighting the reported statement impact the performance of DeFP? (P2)

  • RQ4: How does the identifier abstraction component impact the performance of DeFP? (P3)

IV-A. Dataset

In order to train and evaluate an ML model for ranking SA warnings, we need a set of warnings labeled as TPs or FPs. Currently, most approaches are trained and evaluated on synthetic datasets such as Juliet [31] and SARD [30]. However, these datasets only contain simple examples which are artificially created from known vulnerable patterns. Thus, the patterns which ML models capture from them may not reflect real-world scenarios [5]. To evaluate our solution and the others on real-world data, we construct a dataset containing 6,620 warnings in 10 open-source projects [38, 21]. Table I shows an overview of our dataset.

In these 10 real-world projects, functions were previously manually labeled as vulnerable or non-vulnerable [38, 21]. Our dataset is then constructed by the following steps:

  1. Collecting warnings: We pass the studied projects through three open-source SA tools, Flawfinder [9], CppCheck [6], and RATS [33], to collect a set of warnings. In practice, this set contains warnings of multiple kinds of vulnerabilities. However, we only collect the warnings related to Buffer Overflow (BO) and Null Pointer Dereference (NPD) since, for the other kinds, the number of reported warnings is too small for training and evaluating an ML model.

  2. Labeling warnings in the non-vulnerable functions: Since these functions are already marked as clean regarding BO and/or NPD, all the corresponding warnings in them are annotated as FPs.

  3. Labeling warnings in the vulnerable functions: Although these functions are marked as containing BO and/or NPD vulnerabilities, we do not know exactly how many vulnerabilities each function contains or where they are located in the source code. Therefore, we manually investigate each warning in these functions to label it as a TP or an FP.

No.  Project    Buffer Overflow          Null Pointer Dereference
                #W      #TP     #FP      #W      #TP     #FP
1    Asterisk   2049    63      1986     133     0       133
2    FFmpeg     1139    387     752      105     37      68
3    Qemu       882     396     486      72      39      33
4    OpenSSL    595     53      542      9       2       7
5    Xen        388     15      373      23      6       17
6    VLC        288     20      268      16      2       14
7    Httpd      250     45      205      17      0       17
8    Pidgin     250     13      237      242     0       242
9    LibPNG     83      9       74       2       0       2
10   LibTIFF    74      9       65       3       3       0
     Total      5998    1010    4988     622     89      533

  • #W, #TP, and #FP denote the total warnings, true positives, and false positives, respectively.

TABLE I: Overview of DeFP’s dataset

IV-B. Evaluation Setup, Procedure, and Metrics

IV-B1. Experimental Setup

We implemented the neural network models using Keras with the TensorFlow backend (version 2.5.0). The tokenizer was built upon the NLTK library (version 3.6.2), and the Word2vec embedding model was provided by the gensim package (version 3.6.0). All experiments were run on a server running Ubuntu 18.04 with an NVIDIA Tesla P100 GPU.

We adopt cross-validation to train several neural networks and select the parameter values that are most effective at predicting the likelihoods of warnings being TPs on the proposed dataset. Specifically, for DeFP, the embedding size is set to 96, the maximum lengths of each slice and each reported statement are fixed to 600 and 40, respectively, and they are learned by two BiLSTM networks, each of which has 256 hidden nodes. During the training phase, the dropout rate, batch size, and number of epochs are set to 0.1, 64, and 60, respectively. Also, the minibatch stochastic gradient descent ADAMAX optimizer is selected with a learning rate of 0.002.

Besides, the data is sampled into 5 stratified folds, with 4 folds used for training and the remaining fold for testing (a ratio of 8:2). We then run 5 experiments on the 5 pairs of training and testing data and average the results for the final assessment of the corresponding experiment.
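The protocol can be sketched as follows with scikit-learn's StratifiedKFold (the library is not named in the paper, and the feature and label arrays are placeholders):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder feature matrix and TP/FP labels standing in for the vectorized warnings.
features = np.random.rand(100, 640)
labels = np.random.randint(0, 2, size=100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(features, labels), start=1):
    # Train on the 4 folds in train_idx, evaluate on test_idx, then average over folds.
    print(f"fold {fold}: {len(train_idx)} training / {len(test_idx)} testing warnings")
```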

WN   Project    Method   # TP warnings found in Top-k% warnings
                         Top-5%             Top-10%            Top-20%            Top-50%            Top-60%
                         Precision  Recall  Precision  Recall  Precision  Recall  Precision  Recall  Precision  Recall
BO   Qemu       CNN      71.11%     8.09%   53.33%     12.13%  46.86%     20.72%  44.32%     49.25%  43.02%     57.57%
                DeFP     82.22%     9.34%   67.78%     15.40%  65.14%     28.78%  52.27%     58.08%  50.38%     67.43%
     FFmpeg     CNN      30.91%     4.40%   31.30%     9.30%   33.24%     19.64%  32.46%     47.80%  33.04%     58.39%
                DeFP     67.27%     9.56%   61.74%     18.34%  52.43%     31.00%  38.95%     57.37%  37.72%     66.66%
     Asterisk   CNN      11.00%     17.56%  8.78%      28.59%  7.56%      49.36%  4.49%      72.95%  3.82%      74.49%
                DeFP     34.00%     53.97%  18.54%     60.26%  10.73%     70.00%  5.18%      84.10%  4.56%      88.97%
     COMBINED   CNN      43.00%     12.77%  39.67%     23.56%  34.25%     40.69%  25.40%     75.45%  23.46%     83.56%
                DeFP     66.00%     19.60%  56.00%     33.27%  43.92%     52.18%  27.50%     81.68%  24.82%     88.42%
NPD  COMBINED   CNN      63.33%     21.37%  43.33%     29.15%  38.40%     53.99%  21.29%     74.25%  19.62%     82.09%
                DeFP     80.00%     26.93%  65.00%     43.66%  47.20%     66.14%  25.81%     89.74%  22.58%     94.25%

TABLE II: Performance of DeFP and the CNN model proposed by Lee et al. [17] in ranking SA warnings

IV-B2. Empirical Procedure

RQ1. We compare the performance of DeFP and the CNN model proposed by Lee et al. [17] for ranking warnings in the proposed dataset.

RQ2. We study the impact of the warning contexts on the performance of DeFP. Specifically, we compare the performance of DeFP in four scenarios for the warning contexts: (1) the raw code of the program, (2) the program slices on control dependencies, (3) the program slices on data dependencies, and (4) the program slices on both control and data dependencies.

RQ3. We study the impact of highlighting the reported statements on DeFP’s performance. We compare the ranking results of DeFP in two scenarios when the reported statements are and are not encoded for training the BiLSTM model.

RQ4. We study the impact of the identifier abstraction component by comparing the performance of DeFP when the inputs are embedded with and without this component.

For evaluation, we have two experimental settings as widely adopted in related studies [7, 21, 38]: the within-project setting and the combined-project setting. First, in the within-project setting, warnings from the same project are split into training and testing sets. Second, in the combined-project setting, we collect the warnings from all 10 projects and then split them into training and testing sets. In practice, in several projects, the number of warnings is quite small for training and testing an ML model, which could cause overfitting or underfitting problems. Thus, we only select the three projects with the largest numbers of BO warnings for evaluating RQ1 on that vulnerability type in the within-project setting. RQ1 for the NPD vulnerability type and the other research questions are only evaluated in the combined-project setting.

IV-B3. Evaluation Metrics

In order to evaluate DeFP and compare its performance with the state-of-the-art approach, we apply Top-K% Precision (P@K) and Top-K% Recall (R@K). These two metrics are widely used in related studies [21, 28], especially when the dataset is severely imbalanced. In this paper, P@K denotes the proportion of actual TP warnings in the Top-K% of warnings ranked by the model, and R@K refers to the proportion of actual TP warnings ranked in the Top-K% among the total actual TP warnings. In particular, P@K and R@K are calculated as P@K = |T ∩ W_K| / |W_K| and R@K = |T ∩ W_K| / |T|, where T is the set of actual TP warnings and W_K is the list of the Top-K% of warnings ranked first by the model.
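A small helper implementing these two metrics is sketched below; the function and variable names are ours, not from the paper.

```python
def precision_recall_at_k(scores, labels, k_percent):
    """P@K and R@K: `scores` are predicted TP likelihoods, `labels` are 1 for actual
    TPs and 0 for FPs, `k_percent` is K (e.g. 20 for the Top-20% of warnings)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, round(len(scores) * k_percent / 100))
    hits = sum(labels[i] for i in order[:k])      # actual TPs ranked in the Top-K%
    return hits / k, hits / max(1, sum(labels))   # (P@K, R@K)
```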

V. Experimental Results

V-A. Performance Comparison (RQ1)

Table II shows the performance of DeFP and the CNN model proposed by Lee et al. [17] for the Top-5%–Top-60% warnings of the ranked lists. Overall, DeFP improves their model by nearly 30% in Precision and Recall for both BO and NPD warnings. For example, in FFmpeg, Qemu, and Asterisk, with the Top-20% of warnings returned by DeFP, developers can find 23/79, 24/77, and 9/13 actual vulnerabilities, respectively. Meanwhile, by using the CNN model [17], the corresponding figures are only 16/79, 15/77, and 6/13. When the warnings of all the projects are combined, by investigating the 20% of warnings at the top of the ranked list, 105/202 actual BO vulnerabilities and 12/18 actual NPD vulnerabilities can be found by DeFP, while these figures for the CNN model are only 82/202 and 10/18, respectively. Interestingly, with the results of DeFP, developers can find +90% of actual vulnerabilities by investigating only 60% of the total warnings, which is 8% better than the CNN model.

Indeed, DeFP obtains better results because it concentrates on the statements which semantically describe the contexts of the warnings and is not negatively affected by statements unrelated to the warnings. For example, for the warning in Fig. 1, DeFP captures its context with the statements which impact and are impacted by the reported statement at line 24, as shown in Fig. 3. These statements are essential for semantically capturing the context of the warning because they show when and how the value of prefix, which is reported as potentially overflowing, is changed. Additionally, unlike the CNN model [17], DeFP ignores the statements at lines 17 and 26, which play no role in reflecting the violation at the reported statement although they are near it. Therefore, they could cause noise if they were encoded as part of the warning context.

In addition, thanks to inter-procedural analysis, DeFP does not miss important information for validating the violations at the reported statements. Specifically, statements which are essential for validating a warning could be located in multiple functions. For example, in Fig. 1, the statement at line 6 specifying the string concatenated to prefix is extremely important for determining whether the BO vulnerability could occur at line 24 or not. However, this statement is not in the same function as the reported statement but in another function, aoc_rate_type_str. This statement is captured by DeFP, but it would be missed if only intra-procedural analysis were considered.

Interestingly, among the studied projects, DeFP achieved the highest results in Qemu and the lowest results in Asterisk. Specifically, for the Top-20% warnings of Qemu, DeFP obtained 65.14% in Precision, whereas this figure for Asterisk is only 10.73%. The reason is that the models are impacted by the imbalance of the dataset. For instance, the numbers of TPs and FPs in Qemu are quite balanced, while they are greatly imbalanced in Asterisk. Moreover, Asterisk contains only 63 TPs, which is extremely small compared to its 1,986 FPs.

V-B. Impact of the Warning Context (RQ2)

Fig. 5: Impact of the extracted warning contexts on DeFP’s performance. RAW, CD, DD, and CD && DD denote the warning contexts which are captured by raw source code, program slices on control dependencies, program slices on data dependencies, and program slices on both control and data dependencies, respectively.

Fig. 5 shows the performance of DeFP when the contexts of the warnings are captured by different kinds of dependencies. DeFP obtains the best performance when the warning contexts are captured by both control and data dependencies on the PDG. The reason is that, by using slicing techniques on both of these dependencies, unrelated statements are removed from the warnings' contexts and only related statements are encoded and fed to the BiLSTM models. Therefore, the models can better capture the patterns of the warnings without being affected by noise caused by statements which are semantically unrelated to them. In particular, when program slices are computed on both control and data dependencies, the performance of DeFP on the two studied vulnerability types is 16% and 19% better than when the warning contexts are captured by the raw code of the containing functions. Interestingly, by slicing on both control and data dependencies, the performance of DeFP is significantly improved for the warnings which are ranked at the top of the lists. Specifically, compared to the raw code, for the Top-1%–Top-50% of the ranked warnings, DeFP's performance with this kind of program slice is improved by 42% for NPD warnings and 34% for BO warnings. In other words, among the Top-20% of warnings (243 warnings for BO and 25 warnings for NPD) in the resulting lists, DeFP correctly ranks 105/202 and 12/18 actual BO and NPD vulnerabilities, respectively. Meanwhile, when the warning context is captured by the raw code of the whole functions, these figures are only 79/202 and 9/18.

Importantly, for both BO and NPD vulnerabilities, program slices on only data dependencies capture the warning contexts better than the raw code of the functions, whereas program slices on only control dependencies do worse. The reason is that, for these two kinds of vulnerabilities, the information about data dependencies, which illustrates how the values of the variables are propagated, is more informative for reasoning about the warnings. For example, in Fig. 1, to determine whether the warning (line 24) is an FP, it is essential to analyze the statements on which it is data-dependent, such as those at lines 6, 15, and 19. Although the raw code may contain noise and unrelated statements, it still contains all of this information. However, this important information is missing from the program slices on control dependencies only. Therefore, the performance of DeFP with raw code is worse than with program slices on data dependencies, yet better than with program slices on control dependencies. Specifically, for the Top-1%–Top-60% of warnings, compared to the raw code, DeFP's results with program slices on data dependencies are 7% and 29% better for BO and NPD, respectively. Also, compared to the program slices on control dependencies, these figures are 8% and 46%, respectively.

In practice, validating warnings of different kinds of vulnerabilities could require control dependencies, data dependencies, or both. Therefore, to guarantee the best performance of DeFP, program slices on both control and data dependencies should be leveraged to capture the warning contexts.

V-C. Impact of the Reported Statements (RQ3)

Fig. 6: Impact of highlighting the reported statements on the performance of DeFP

As seen in Fig. 6, the performance of DeFP is slightly improved, by 4% and 7% for BO and NPD vulnerabilities, when the reported statements are highlighted by being encoded as a separate input of the BiLSTM model. For example, in the Top-20% of ranked warnings, by encoding the reported statements, developers can find 7 more actual BO vulnerabilities and 1 more actual NPD vulnerability. More details about the performance of DeFP in terms of P@K can be found on our website [8].

Indeed, highlighting the reported statements helps the neural network model not only capture the patterns associated with the warning contexts but also explicitly emphasize the positions of the warnings. Consequently, this would be considerably helpful when several warnings have similar contexts but are labeled differently (TP vs. FP). However, our dataset is built from a set of functions which are already classified as vulnerable or non-vulnerable. Thus, most of the warnings in a vulnerable function tend to have the same TP labels, and all of the warnings in a non-vulnerable function are labeled as FPs. That is why DeFP's performance is only slightly improved when the reported statements are encoded as an input of the representation model.

V-D. Impact of the Identifier Abstraction Component (RQ4)

Fig. 7: Impact of identifier abstraction on DeFP’s performance

Fig. 7 shows that by abstracting identifiers, DeFP can better capture the general patterns associated with the warnings. Specifically, with identifier abstraction, DeFP achieves about 7% and 12% better results for the two studied vulnerability types in the Top-1%–Top-60% of warnings. For instance, in the Top-20% of warnings, DeFP can find 105 actual BO vulnerabilities and 12 actual NPD vulnerabilities, which is about 52% and 66% of the total actual vulnerabilities, respectively. Meanwhile, without identifier abstraction, these numbers are only 96 and 11 vulnerabilities. More details about the performance of DeFP in terms of P@K can be found on our website [8].

Moreover, identifier abstraction decreases the Word2vec vocabulary size on the BO and NPD datasets from 37,170 to 512 and from 4,094 to 259 tokens, respectively. This helps the model deal with the vocabulary explosion problem, generalize better over rare identifiers, and avoid out-of-vocabulary tokens. In addition, Word2vec can use a smaller vector dimension to represent each code token, thus improving memory usage and shortening the training/prediction time.

VI. Threats to Validity

There are three main threats to the validity of this work: external validity, internal validity, and construct validity, which are discussed as follows.

VI-1. External Validity

Our dataset contains 10 open-source programs and warnings of only two vulnerability types. Therefore, our results may not generalize to all software projects and other kinds of vulnerabilities. To reduce this threat, we chose programs which are widely used in related work [38, 21] and two of the most popular vulnerability types [20]. Also, we plan to collect more data in future work.

VI-2. Internal Validity

For this paper, the internal validity mainly lies in the data used for the learning process. We manually labeled the warnings based on the labels of the functions, which were assigned by Zhou et al. [38] and Lin et al. [21]. The threat may come from incorrect labels at the function level or from our mislabeling at the warning level. To minimize this threat, we carefully investigated each warning before labeling it.

VI-3. Construct Validity

In this study, we adopt P@K and R@K for evaluating the performance of the ranking models. However, for the problem of handling SA warnings, evaluation in terms of other metrics may also be required in practice. We will conduct experiments using more evaluation measures in our future work.

VII. Related Work

Various approaches have been applied to detect source code vulnerabilities in the early phases of the software development process. In particular, using SA tools is an automatic and simple way to detect various kinds of vulnerabilities without executing the programs [1, 27]. Baca et al. [2] have demonstrated that SA tools are better than average developers at detecting vulnerabilities, especially security-related ones. However, the warnings generated by SA tools often have a high FP rate [15, 16]. Therefore, developers still waste a lot of time and effort investigating such FP warnings.

In order to improve the precision of SA tools, sophisticated program verification techniques such as model checking, symbolic execution, and deductive verification have been applied to reduce the number of FPs [24, 25, 29, 32, 18]. For instance, Muske et al. [24, 25] use model checking to eliminate FPs. Specifically, for each warning, they generate appropriate assertions and then use model checking to verify whether those assertions hold. Nguyen et al. [29] also generate proper annotations to describe the verified properties of the warnings and then prove them by deductive verification. These approaches can precisely discard a number of FPs. However, not all of the generated warnings can be formally proved to be FPs or TPs by these approaches. Additionally, model checking approaches suffer from enormous state spaces, which affects their performance and makes them non-scalable.

In addition, several studies have applied ML models to address SA warnings. Specifically, some works [37, 10, 4] propose sets of features capturing statistical information about the warnings and then build a model which learns these features to classify SA warnings. However, these features are manually defined based on the dataset and the SA tools used, and this process is error-prone even for experts. Meanwhile, instead of using a fixed set of features, Lee et al. [17] trained a CNN model to classify warnings based on features learned from lexical patterns in source code. However, they manually defined different contexts for different kinds of warnings based on their dataset, which limits the adaptation of their approach to other kinds of vulnerabilities and different datasets. In this paper, we propose an approach which can be fully automated and easily adapted to handle different warnings in different projects. Specifically, our models are trained to capture the patterns associated with the warnings in their corresponding contexts, which are extracted by inter-procedural slicing techniques.

Moreover, ML is also actively adopted in vulnerability detection. Particularly, to leverage the syntactic and semantic information presented in the Abstract Syntax Tree, Dam et al. [7] proposed a deep tree-based model to predict whether a source file is clean or defective. Besides, multiple studies propose token-based models [34, 19] or graph-based models [38, 5] to predict whether a function contains vulnerabilities. However, this research focuses on detecting vulnerabilities at the file level or function level, which is quite coarse-grained. Developers still need to investigate the whole source code of the detected files or functions to localize the vulnerabilities. In this research, our objective is more fine-grained: we focus on ranking the warnings which are reported by SA tools. With the resulting lists, developers can decide which vulnerabilities should be investigated and fixed in a given amount of time.

VIII. Conclusion

SA tools have demonstrated their usefulness in detecting potential vulnerabilities. However, these tools often report a large number of warnings containing both TPs and FPs, which makes post-handling the warnings time-consuming and affects the productivity of developers. In this paper, we introduce DeFP, a novel method for ranking SA warnings. Based on the reported statements and the corresponding warning contexts, we train two BiLSTM models to capture the patterns associated with TPs and FPs. After that, for a set of new warnings, DeFP can predict the likelihood of each warning being a TP and rank them according to the predicted scores. By using DeFP, more actual vulnerabilities can be found in a given time. In order to evaluate the effectiveness of DeFP, we conducted experiments on 6,620 warnings of 10 real-world projects. Our experimental results show that using DeFP, developers can find +90% of actual vulnerabilities by investigating only 60% of the total warnings.

Acknowledgment

This work has been supported by VNU University of Engineering and Technology under project number CN20.26.

In this work, Kien-Tuan Ngo was funded by Vingroup Joint Stock Company and supported by the Domestic Master/ PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), code VINIF.2020.ThS.04.

References

  • [1] N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix (2008) Using static analysis to find bugs. IEEE Software 25 (5), pp. 22–29.
  • [2] D. Baca, K. Petersen, B. Carlsson, and L. Lundberg (2009) Static code analysis to detect software security vulnerabilities - does experience matter?. In 2009 International Conference on Availability, Reliability and Security, pp. 804–810.
  • [3] M. Beller, R. Bholanath, S. McIntosh, and A. Zaidman (2016) Analyzing the state of static analysis: a large-scale evaluation in open source software. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1, pp. 470–481.
  • [4] M. Berman, S. Adams, T. Sherburne, C. Fleming, and P. Beling (2019) Active learning to improve static analysis. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1322–1327.
  • [5] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray (2021) Deep learning based vulnerability detection: are we there yet?. IEEE Transactions on Software Engineering.
  • [6] CppCheck.
  • [7] H. K. Dam, T. Pham, S. W. Ng, T. Tran, J. Grundy, A. Ghose, T. Kim, and C. Kim (2019) Lessons learned from using a deep tree-based model for software defect prediction in practice. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 46–57.
  • [8] DeFP.
  • [9] Flawfinder.
  • [10] L. Flynn, W. Snavely, D. Svoboda, N. VanHoudnos, R. Qin, J. Burns, D. Zubrow, R. Stoddard, and G. Marce-Santurio (2018) Prioritizing alerts from multiple static analysis tools, using classification models. In 2018 IEEE/ACM 1st International Workshop on Software Qualities and their Dependencies (SQUADE), pp. 13–20.
  • [11] C. S. C. Group. SEI CERT Coding Standards (wiki).
  • [12] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [13] S. Horwitz, T. Reps, and D. Binkley (1990) Interprocedural slicing using dependence graphs. ACM Transactions on Programming Languages and Systems (TOPLAS) 12 (1), pp. 26–60.
  • [14] Joern.
  • [15] B. Johnson, Y. Song, E. Murphy-Hill, and R. Bowdidge (2013) Why don’t software developers use static analysis tools to find bugs?. In 2013 35th International Conference on Software Engineering (ICSE), pp. 672–681.
  • [16] U. Koc, S. Wei, J. S. Foster, M. Carpuat, and A. A. Porter (2019) An empirical assessment of machine learning approaches for triaging reports of a Java static analysis tool. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), pp. 288–299.
  • [17] S. Lee, S. Hong, J. Yi, T. Kim, C. Kim, and S. Yoo (2019) Classifying false positive static checker alarms in continuous integration using convolutional neural networks. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), pp. 391–401.
  • [18] H. Li, T. Kim, M. Bat-Erdene, and H. Lee (2013) Software vulnerability detection using backward trace analysis and symbolic execution. In 2013 International Conference on Availability, Reliability and Security, pp. 446–454.
  • [19] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen (2021) SySeVR: a framework for using deep learning to detect software vulnerabilities. IEEE Transactions on Dependable and Secure Computing.
  • [20] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong (2018) VulDeePecker: a deep learning-based system for vulnerability detection. arXiv preprint arXiv:1801.01681.
  • [21] G. Lin, W. Xiao, J. Zhang, and Y. Xiang (2019) Deep learning-based vulnerable function detection: a benchmark. In International Conference on Information and Communications Security, pp. 219–232.
  • [22] G. Lin, J. Zhang, W. Luo, L. Pan, O. De Vel, P. Montague, and Y. Xiang (2019) Software vulnerability discovery via learning multi-domain knowledge bases. IEEE Transactions on Dependable and Secure Computing.
  • [23] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In ICLR.
  • [24] T. Muske, A. Datar, M. Khanzode, and K. Madhukar (2013) Efficient elimination of false positives using bounded model checking. In ISSRE, Vol. 15, pp. 2–5.
  • [25] T. Muske and U. P. Khedker (2015) Efficient elimination of false positives using static analysis. In 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE), pp. 270–280.
  • [26] T. Muske and A. Serebrenik (2016) Survey of approaches for handling static analysis alarms. In 2016 IEEE 16th International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 157–166.
  • [27] N. Nagappan and T. Ball (2005) Static analysis tools as early indicators of pre-release defect density. In Proceedings of the 27th International Conference on Software Engineering (ICSE 2005), pp. 580–586.
  • [28] H. N. Nguyen, S. Teerakanok, A. Inomata, and T. Uehara (2021) The comparison of word embedding techniques in RNNs for vulnerability detection. In International Conference on Information Systems Security and Privacy (ICISSP), pp. 109–120.
  • [29] T. T. Nguyen, P. Maleehuan, T. Aoki, T. Tomita, and I. Yamada (2019) Reducing false positives of static analysis for SEI CERT C coding standard. In 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP), pp. 41–48.
  • [30] National Institute of Standards and Technology. Software Assurance Reference Dataset.
  • [31] V. Okun, A. Delaitre, P. E. Black, et al. (2013) Report on the Static Analysis Tool Exposition (SATE) IV. NIST Special Publication 500, pp. 297.
  • [32] H. Post, C. Sinz, A. Kaiser, and T. Gorges (2008) Reducing false positives by combining abstract interpretation and bounded model checking. In 2008 23rd IEEE/ACM International Conference on Automated Software Engineering, pp. 188–197.
  • [33] RATS - Rough Auditing Tool for Security.
  • [34] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, and M. McConley (2018) Automated vulnerability detection in source code using deep representation learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 757–762.
  • [35] J. Ruthruff, J. Penix, J. Morgenthaler, S. Elbaum, and G. Rothermel (2008) Predicting accurate and actionable static analysis warnings. In 2008 ACM/IEEE 30th International Conference on Software Engineering, pp. 341–350.
  • [36] The Motor Industry Software Reliability Association (2012) Guidelines for the use of the C language in critical systems. ISBN 9781906400101.
  • [37] U. Yüksel, H. Sözer, and M. Şensoy (2014) Trust-based fusion of classifiers for static code analysis. In 17th International Conference on Information Fusion (FUSION), pp. 1–6.
  • [38] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu (2019) Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. arXiv preprint arXiv:1909.03496.