Identifying potential software vulnerabilities is a crucial step in defending against cyber attacks (cybersec). However, in large software systems it can be difficult and time-consuming for developers to determine which parts of the software contribute to vulnerabilities. Consequently, interest in more accurate and efficient automated software vulnerability detection (SVD) solutions has been increasing (svd2012survey; svd2017survey).
Automated SVD can be broadly classified into two categories: (1) traditional methods, which can include both static and dynamic analysis, and (2) data-driven solutions, which leverage data mining and machine learning to predict the presence of software vulnerabilities (svdtaxonomy). Traditional static tools are often rule-based, leveraging knowledge from security domain experts. This can result in inconsistent performance, either through a high number of false positive alerts or through completely missing more complex vulnerabilities (sastempirical). As it is challenging to define vulnerable patterns in an accurate and comprehensive way, data-driven solutions have become a promising alternative.
One primary reason for the increasing popularity of data-driven solutions is the ever-growing quantity of open-source vulnerability data. Security vulnerabilities in open-source software are regularly reported to, and updated in, sources like the National Vulnerability Database (nvd). Many efforts utilize this publicly available information to learn from the data and reduce the need for manually defined patterns (sajnani2016sourcerercc; kim2017vuddy). As a result, data-driven solutions have generally shown outstanding performance in comparison to traditional static application security testing tools (rolandsast). One potential reason is the ability of the models to learn latent information from the patterns in which vulnerabilities appear in source code, in addition to the underlying causes of the vulnerability.
Despite the success of current data-driven approaches in the identification of software vulnerabilities, they are often limited to a coarse level of granularity. The model outputs often present developers with limited information for validating and interpreting the prediction outcome, leading to extra effort when evaluating and mitigating the software vulnerabilities. Consequently, many proposed SVD solutions have transitioned to either function-level (devign; bgnn4vd; reveal; ivdetect) or slice-level (vuldeepecker; sysevr; vuldeelocator; deepwukong) predictions, which are a major improvement over file-level predictions (hovsepyan2016newer; scandariato2014predicting; du2019leopard). Some other works further leverage supplementary information, such as commit-level code changes with accompanying log messages, to build the prediction model (hoang2020cc2vec; pornprasit2021jitline). While the goal is to help practitioners prioritize the defective code, vulnerabilities can often be localized to a few key lines (duan2019vulsniper). Hence, reviewing large functions could still be a considerable burden. From a preliminary analysis on a large C/C++ vulnerability detection dataset (bigvul), we find that vulnerable functions in the dataset are on average 95 lines of code.
Fig. 1 shows an example of how statement-level SVD can be beneficial. To save space, we choose a vulnerability from a smaller function, which contains an integer overflow vulnerability from the Linux kernel (CVE-2018-12896) that can ultimately be exploited to cause denial-of-service. With explicit statement-level predictions, it can be easier to interpret why the function has been predicted as vulnerable (or alternatively, verify that the prediction was erroneous). Focusing the developer’s attention on the highlighted lines can help a developer narrow down the lines needing further inspection based on model confidence. In this case, a statement-level SVD model flags the addition assignment operation on line 22, which contains the vulnerable integer casting operation, as most suspicious, allowing a developer to more efficiently validate and mitigate the vulnerability.
Refining SVD granularity towards the statement level is still in its infancy; recent work by Li et al. (ivdetect) has indicated the possibility of leveraging an interpretable ML model, namely GNNExplainer, to derive the vulnerable statements as an interpretation of the learnt model. However, in our work, we find that its performance is insufficient when classifying and ranking the latent vulnerable statements. Instead, we aim to explore the feasibility and effectiveness of directly training and predicting on vulnerabilities at the statement level for SVD granularity refinement, which would allow data-driven solutions to directly utilize any available statement-level information in a fully supervised manner.
In this paper, we propose a novel framework for statement-level SVD, namely LineVD. We revisit the core components of data-driven SVD from the perspective of statement-level classification, which can allow developers to better interpret vulnerability predictions. We explore statement-level SVD thoroughly to find the data-driven architecture that achieves the best performance. By covering various feature extraction methods and model architectures to tackle the latent challenges in SVD, we show that LineVD provides sufficient capacity to incorporate contextual information for each statement in an efficient manner. We evaluate the model extensively in realistic scenarios, namely heavily imbalanced labels and cross-project testing. In summary, this paper makes the following contributions:
We propose a novel and efficient statement-level SVD approach, LineVD. Our investigation shows that the current state-of-the-art interpretation-based SVD model performs poorly at the statement level, whereas LineVD achieves a significant improvement in F1-score.
We investigate the performance effects of each stage of building a GNN-based statement-level SVD model, including the node embedding methods and GNN model selection. Based on these findings, LineVD is developed to substantially improve performance by learning from function-level and statement-level information simultaneously.
LineVD is the first approach to jointly learn from function-level and statement-level information via graph neural networks to enhance SVD performance; in the empirical evaluation it significantly outperforms traditional models that use only one type of information.
We publish our dataset, source code, and models with supporting scripts (package2022), which provide a ready-to-use implementation for future work with regard to benchmarking and comparison.
In this section, we introduce key concepts relating to source code embedding and graph neural networks (GNNs), which have played key roles in providing effective and explainable capabilities to SVD prediction models in the recent literature.
2.1. Source code embedding
To tackle a language-related task with machine learning models, it is necessary to transform the related textual corpora into vector representations. The corpora are usually specific to a given domain; one recently popular domain is source code. The process is commonly referred to as building a language model and extracting language embeddings, which can be at the word, sentence, or document level (word2vec). Source code modelling is the application of language modelling to source code, which can be considered a special type of structured natural language (trietdeepcode), and has demonstrated encouraging results for a wide range of downstream tasks. These include code completion (liu2020multi) and code clone detection (wang2020detecting), among others (wang2021mulcode).
For software vulnerability detection, prior approaches have utilized document embedding methods like Doc2Vec (doc2vec), or word embedding methods such as GloVe (glove) and Word2Vec (word2vec) to generate pre-trained vectors for singular tokens, which are then aggregated in some way. For example, Cao et al. (bgnn4vd) utilized averaged Word2Vec embeddings to transform raw code statements into vector representations.
Tokenization approaches for source code can vary significantly. For word embedding-based approaches, code tokens can be split by whitespace alongside punctuation, such as parentheses and semicolons (sysevr). Alternatively, punctuation can be completely removed (ivdetect). In addition, code identifiers are sometimes further tokenized according to common naming conventions, such as underscores or camel case. This helps reduce the number of out-of-vocabulary tokens, which can otherwise be significantly higher than in regular natural language because of how developers name identifiers. An alternative to manually defined tokenization rules is unsupervised subword tokenization. In CodeBERT, byte-pair encoding (BPE) (bpe) is used to tokenize the source code tokens. In particular, long variable names are split into subwords based on the BPE algorithm. For example, 'add_one' may be tokenized to 'add', '_', and 'one'. This is arguably a more consistent way of reducing out-of-vocabulary issues in source code identifiers, as it does not rely on pre-defined rules.
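As a minimal sketch of the rule-based alternative described above, the following splits identifiers on underscores and camel-case boundaries. The function name `split_identifier` and the regular expression are illustrative assumptions, not taken from any of the cited tools. Unlike BPE, this sketch drops the underscore rather than emitting it as a token.

```python
import re
from typing import List

# Hypothetical helper: rule-based identifier splitting (not CodeBERT's BPE).
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier: str) -> List[str]:
    """Split an identifier on underscores, then on camel-case boundaries."""
    parts: List[str] = []
    for chunk in identifier.split("_"):
        # Each chunk is broken into acronyms, capitalized words,
        # lowercase words, and digit runs.
        parts.extend(_WORD.findall(chunk))
    return parts


if __name__ == "__main__":
    for name in ["add_one", "parseHTTPHeader", "bufLen2"]:
        print(name, "->", split_identifier(name))
```

Hand-written rules like these need a new case for every naming convention, which is the maintenance cost that learned subword vocabularies avoid.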
2.2. Graph Neural Network
Recently, graph neural networks have demonstrated superior performance at mining graph data structures for social networks (bian2019network), spatial-temporal traffic networks (chen2019gated), and so on. For downstream source code modelling tasks, transformer-based models have demonstrated promising results (allamanis2018learning). However, the complex syntactic and semantic characteristics that are inherently present in programming languages are not explicitly leveraged.
Rather than solely representing source code as a sequence of tokens, the intrinsic structure information of source code can also be effectively modelled by representing a source code snippet (or program) as a graph, $G = (V, E)$, where $V$ is the set of nodes representing the program graph and $E$ is the edge matrix, allowing a model to more easily learn latent relationships within the source code. With different types of program-related structures, the edge matrix denotes the corresponding syntactic and semantic information of the source code. Using this graph-based representation in combination with graph neural networks has resulted in improved performance for function-level vulnerability detection, in which a single graph represents a single source code function (see Section 7 for further details). The program dependency graph (PDG) of a program (see Fig. 2) is an overall focus in this work, as software vulnerabilities often involve data and control flows (recurringvd; yamaguchi2014modeling; ivdetect).
3. The LineVD Framework
In this section, we present the LineVD framework by first summarizing the problem definition. The overall architecture is subsequently discussed, with details for its three main components. In particular, we demonstrate how we incorporate a large-scale pretrained model and develop a novel graph construction method.
3.1. Problem Definition
We formalize the identification of vulnerable statements in a function as a binary node classification problem, i.e., learning to predict which source code statements in a function are vulnerable. Let us define a sample of data as $\{(V_i, y_i) \mid V_i \in \mathcal{V}, y_i \in \mathcal{Y}\}$, $i \in \{1, 2, \dots, n\}$, where $\mathcal{V}$ is the set of all nodes, each representing a statement of code in the dataset, and $y_i \in \{0, 1\}$ is the statement-level label, where 1 is vulnerable and 0 is non-vulnerable. We collectively write $\mathcal{Y}$ for the set of labels $\{y_1, \dots, y_n\}$. $n$ is the number of nodes in the dataset.
For each $V_i$, we utilize the $k$-hop neighborhood graph $G_i = (\mathcal{N}_i, A_i, X_i)$ to encode $V_i$ with the contextual information from neighboring nodes. $\mathcal{N}_i$ indicates the neighborhood nodes for $V_i$, $A_i \in \{0, 1\}^{m \times m}$ represents the corresponding edges as an adjacency matrix, $X_i$ is the node feature matrix for $\mathcal{N}_i$, and $m$ is the number of nodes in $\mathcal{N}_i$. The goal of LineVD is to learn a mapping $f: G_i \mapsto y_i$ to determine the label of a given node; i.e., whether a statement is vulnerable or not. The prediction function of LineVD, which is denoted as $f$, can be learned by minimizing the following loss function:
$$\min \sum_{i=1}^{n} \mathcal{L}\big(f(G_i), y_i\big) \tag{1}$$
where $\mathcal{L}(\cdot)$ is the cross entropy loss function.
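The two ingredients of this problem definition, the $k$-hop neighborhood of a node and a summed cross-entropy objective over nodes, can be sketched as follows. This is a minimal illustration with hypothetical names (`k_hop_neighborhood`, `total_loss`) and a toy adjacency-list graph; it is not the implementation in the replication package.

```python
import math
from collections import deque
from typing import Dict, List, Set


def k_hop_neighborhood(adjacency: Dict[int, List[int]], node: int, k: int) -> Set[int]:
    """Return every node reachable from `node` within k edges, including `node`."""
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def cross_entropy(prob_vulnerable: float, label: int) -> float:
    """Binary cross entropy for one node; probabilities are clipped for stability."""
    eps = 1e-12
    p = min(max(prob_vulnerable, eps), 1.0 - eps)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def total_loss(probs: List[float], labels: List[int]) -> float:
    """Sum of per-node losses, mirroring the summed objective over all nodes."""
    return sum(cross_entropy(p, y) for p, y in zip(probs, labels))


if __name__ == "__main__":
    # Toy chain graph 0 - 1 - 2 - 3 (an assumption for illustration).
    chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(k_hop_neighborhood(chain, 0, 2))
    print(total_loss([0.9, 0.2], [1, 0]))
```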
3.2. Approach Overview
In this section, we present a GNN-based approach to identify vulnerable statements. One fundamental premise is that the identified data and control dependencies between statements can sufficiently serve as the contextual information for the statement-level SVD task. Furthermore, we propose a novel architecture to better leverage the semantics conveyed within and between statements; the overall framework is illustrated in Fig. 3. LineVD can be divided into three main components, described in the following sections.
3.2.1. Feature Extraction
Given a snippet of source code, obtaining an informative and comprehensive code representation is critical for subsequent model construction. LineVD first extracts code features using a transformer-based method, which is considered effective for source code related tasks owing to its self-supervised learning objectives.
LineVD takes a single function of source code as the raw input. After the function is processed and split into individual statements $V_1, \dots, V_n$, each sample is first tokenized via CodeBERT’s pretrained BPE tokenizer. Following the collection of $V_1, \dots, V_n$, the entire function and the individual statements comprising the function are passed into CodeBERT. Thus, the function-level and statement-level code representations can be acquired.
Specifically, LineVD embeds the function-level and statement-level code separately, rather than aggregating the statement-level embeddings to form the function-level embedding. CodeBERT is a bimodal model, meaning it was trained on both the natural language description of a function and the function code itself. As input, it uses a special separator token to distinguish the natural language description from the function code. As the natural language descriptions for the functions are not available, we follow the general practice in the literature and prepend each input with an additional separator token, leaving the description blank. For the output of CodeBERT, we utilize the embedding of the classification token, which is suited for code summarization tasks. This allows us to better leverage the powerful pretrained source code summarization capability of the CodeBERT model.
Overall, the feature extraction component of LineVD using CodeBERT produces $n + 1$ feature embeddings: one embedding for the overall function, and $n$ embeddings, one for each statement, which we denote as $x_f$ and $\{x_1, \dots, x_n\}$, respectively.
3.2.2. Graph Construction
In LineVD, we focus on the data and control dependency information, which we model using the graph attention network (GAT) (gat). As discussed in Sec. 2.2, graph neural networks (GNNs) learn from graph-structured data through an information diffusion mechanism rather than squashing the information into a flat vector: node states are updated according to the graph connectivity so as to preserve the important information, i.e., the topological dependency information (scarselli2008graph).
As shown in Fig. 3, a graph attention network is used to construct the model for learning topological dependency information from the graph. A GAT layer first takes the output statement embeddings from CodeBERT along with the edges between each node. The graph structure of the function, including the node and edge information, is extracted and provided to GAT. Self-loops are added to include a given node within the set of its neighborhood graph. We initialize the GAT layer state vector with the CodeBERT statement embeddings, $h_i^{(0)} = x_i$, where the superscript $l$ in $h_i^{(l)}$ indicates the current state. LineVD propagates the information by embedding the data and control dependencies (i.e., the program dependence graph) between neighboring statements in an incremental manner. Therefore, two graph attention networks are implemented in the LineVD architecture. Overall, GAT is defined by its use of attention over features of neighbors in its aggregation function in Eq. (2) - (5):
$$z_i^{(l)} = W^{(l)} h_i^{(l)} \tag{2}$$
$$e_{ij}^{(l)} = \mathrm{LeakyReLU}\big(\vec{a}^{(l)\top} (z_i^{(l)} \,\Vert\, z_j^{(l)})\big) \tag{3}$$
$$\alpha_{ij}^{(l)} = \frac{\exp(e_{ij}^{(l)})}{\sum_{u \in \mathcal{N}(i)} \exp(e_{iu}^{(l)})} \tag{4}$$
$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} z_j^{(l)}\Big) \tag{5}$$
where $l$ is the current state, $h_i^{(l)}$ is the node embedding vector at the current layer, $W^{(l)}$ is the learnable weight matrix, $\vec{a}^{(l)}$ is a learnable weight vector, and $\sigma$ is an activation function.
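To make the attention-based aggregation concrete, the following is a minimal single-head GAT layer in plain Python, following the standard GAT formulation (linear projection, LeakyReLU-scored attention over concatenated neighbor pairs, softmax normalization, weighted sum). The function name `gat_layer`, the choice of ReLU as the output activation, and the LeakyReLU slope of 0.2 are assumptions for illustration; the actual model would use a GNN library with learned parameters.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weight: Matrix, x: Vector) -> Vector:
    """Multiply a weight matrix by a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gat_layer(features: List[Vector], neighbors: Dict[int, List[int]],
              weight: Matrix, attn: Vector) -> List[Vector]:
    """One single-head GAT layer; a self-loop is added for every node."""
    z = [matvec(weight, h) for h in features]  # linear projection
    outputs: List[Vector] = []
    for i in range(len(features)):
        nbrs = sorted(set(neighbors.get(i, [])) | {i})  # include self-loop
        # Unnormalized attention score on the concatenation [z_i || z_j].
        scores = [leaky_relu(sum(a * v for a, v in zip(attn, z[i] + z[j])))
                  for j in nbrs]
        # Numerically stable softmax over the neighborhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of projected neighbor features.
        agg = [sum(alpha * z[j][d] for alpha, j in zip(alphas, nbrs))
               for d in range(len(z[i]))]
        outputs.append([max(0.0, v) for v in agg])  # ReLU as the activation (assumed)
    return outputs


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(gat_layer(feats, {0: [1], 1: [0]}, identity, [0.0, 0.0, 0.0, 0.0]))
```

With a zero attention vector every neighbor receives equal weight, so the layer reduces to mean aggregation; learned attention weights let the model emphasize the dependent statements that matter most.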
3.2.3. Classifier Learning
As multi-layer perceptron (MLP) models are consistently evaluated as among the top classifiers (lessmann2008benchmarking; naeem2020identifying), we leverage the superior learning capability of a deep neural network to better train the classifier. In this work, the goal is to train a model that can learn from the function-level and statement-level code simultaneously.
To achieve this, we consider that both the function-level and statement-level code snippets contribute equally to the prediction outcomes. Thus, we build a shared set of linear and dropout layers that take as input the function-level CodeBERT embedding and the statement embeddings obtained from the GAT layer. The ReLU (relu) function serves as the activation function. In addition, LineVD keeps its prediction outcomes consistent: it incorporates an element-wise multiplication between the output class of each statement and the output class of the function-level embedding, which can be either one or zero. While a single vulnerable statement is sufficient to make a function vulnerable, we build LineVD with the element-wise multiplication to further leverage the function-level information for training. Moreover, this operation reconciles conflicting outputs between the function-level and statement-level embeddings: if the output class of the function-level embedding is zero, then all statement-level outputs are also zero. The intuition is that a non-vulnerable function cannot have vulnerable lines. LineVD outputs the predictions corresponding to the statements in the input function. The cross-entropy loss function is used to train LineVD. More details of the implementation can be found in the replication package (package2022).
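The consistency step can be illustrated with a small sketch: the function-level output class gates every statement-level output class through element-wise multiplication. The name `combine_outputs` and the 0.5 decision threshold are assumptions made for this example only.

```python
from typing import List


def combine_outputs(function_prob: float, statement_probs: List[float],
                    threshold: float = 0.5) -> List[int]:
    """Gate statement-level classes by the function-level class.

    The 0.5 threshold is an illustrative assumption, not a value from the paper.
    """
    function_class = 1 if function_prob >= threshold else 0
    statement_classes = [1 if p >= threshold else 0 for p in statement_probs]
    # Element-wise multiplication: a function predicted as non-vulnerable
    # forces all of its statements to be predicted as non-vulnerable.
    return [function_class * s for s in statement_classes]


if __name__ == "__main__":
    print(combine_outputs(0.2, [0.9, 0.1, 0.7]))
    print(combine_outputs(0.8, [0.9, 0.1, 0.7]))
```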
4. Experimental Design and Setup
In this section, we report the experimental design details, including the research questions and evaluation process. In particular, we present the dataset details for the empirical evaluation. We further discuss the applied evaluation metrics, which are chosen to thoroughly quantify the model performance.
4.1. Research Questions
To explore statement-level vulnerability detection task and investigate the performance of LineVD, we answer and motivate the following research questions:
RQ1: How much performance improvement can LineVD achieve in comparison with the state-of-the-art interpretation-based SVD model? To evaluate the relative improvement of LineVD in the statement-level SVD task, we compare against the state-of-the-art interpretation-based model. The comparison uses two types of measures: binary classification metrics and ranked metrics for vulnerable statements.
RQ2: How do different code embedding methods affect statement-level vulnerability detection? Code embedding methods have not yet been explored for statement-level SVD compared to SVD at other levels of granularity.
RQ3: How do graph neural networks and function-level information contribute to LineVD performance? The effect of information propagation using graph neural networks on statement-level SVD has yet to be explored.
RQ4: How does LineVD perform in a cross-project classification scenario? While training on a dataset containing multiple projects already reduces misrepresentation of model generalisability, it is still possible for samples from the same project to appear in both the training and test set. Using a cross-project scenario can better represent how the model performs on a completely unseen project, rather than only unseen samples.
RQ5: Which statement types are best distinguished by LineVD for real-world data? Investigating the model prediction outcomes from the perspective of statement types, particularly for real-world data, can help to understand where the model performs best and where it fails, which can guide future work and improvements in statement-level SVD.
Recent research suggests that SVD models should be evaluated on data that could represent the distinct characteristics of real-world vulnerabilities (reveal). This means evaluating on source code extracted from real-world projects (i.e. non-synthetic) while maintaining an imbalanced ratio, which is inherent for vulnerabilities in software projects. The usage of datasets that do not satisfy these conditions would result in the inconsistency of model performance when applied in real world scenarios. Another dataset requirement is a sufficiently large number of samples, ideally spanning multiple projects in order to acquire a model that can generalize well to unseen code. The final requirement is access to ground truth labels at the statement level or traceability to the before-fix code, i.e., the original git commit.
Built by extracting code vulnerabilities from over 300 different open-source C/C++ GitHub projects, Big-Vul contains trustworthy source code vulnerabilities spanning 91 different vulnerability types, which are linked to the public Common Vulnerabilities and Exposures (CVE) database (bigvul). A substantial amount of manual effort has also been dedicated to ensuring the quality of the dataset. It also provides enriched information including CVE IDs, CVE severity scores and, in particular, the code changes, along with other metadata. To the best of our knowledge, Big-Vul is the best fit for the requirements of modelling code-centric vulnerability detection, containing approximately 10,000 vulnerable samples and 177,000 non-vulnerable samples. This large diversity in projects, rather than a focus on a limited number of projects, allows for better representation of all open-source C/C++ projects (see Table 5 for the top 10 most common projects in the dataset).
4.2.1. Ground-truth Labels
To obtain the ground-truth labels for the vulnerable and non-vulnerable lines, we follow the assertions from the literature (ivdetect; bigvul) rather than proposing our own heuristics: (1) removed lines in a vulnerability-fixing commit serve as an indicator for a vulnerable line, and (2) all lines that are control or data dependent on the added lines are also treated as vulnerable. The reasoning for the second point is that any added lines in a vulnerability-fixing commit were added to help patch the vulnerability. Hence, lines that were not modified in the vulnerability-fixing commit, but are related to these added lines, can be considered as related to the vulnerability.
To obtain labels corresponding to lines that are dependent on the added lines, we first obtain the code changes from the before and after version of the sample, where a sample in Big-Vul refers to a function-level code snippet. For the before version, we remove all the added lines, and for the after version, we remove all the deleted lines. In both cases, we keep blank placeholder lines to ensure line number consistency. The code graph extracted from the after version can be used to find all lines that are control or data dependent on the added lines, whose line numbers correspond to the before version. This set of lines can be combined with the set of deleted lines to obtain the final set of vulnerable lines for a single sample. Commented lines are excluded in the code graph, and hence are not used for training or prediction. This can be seen in Fig. 1 and 2; in this case, there is only a modified line, which is treated as both a deleted line (22) and added line (23). The control and data dependency edges in this case happen to be the same for both the before and after version, and hence we can use Fig. 2 to identify the lines that are control/data dependent on line 23, which are lines 3, 19, and 21.
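The labelling rule can be sketched as a set operation: the vulnerable lines are the deleted lines plus every line connected by a control or data dependency edge to an added line. In this minimal sketch the function name `vulnerable_lines` is hypothetical, edges are treated as undirected for simplicity (an assumption), and the edge list is invented so that it reproduces the lines named in the example above.

```python
from typing import Iterable, Set, Tuple


def vulnerable_lines(deleted: Iterable[int], added: Iterable[int],
                     dependency_edges: Iterable[Tuple[int, int]]) -> Set[int]:
    """Union of deleted lines and lines dependent on any added line."""
    added_set = set(added)
    dependent: Set[int] = set()
    for a, b in dependency_edges:
        # Treat each control/data dependency edge as undirected (assumption).
        if a in added_set and b not in added_set:
            dependent.add(b)
        if b in added_set and a not in added_set:
            dependent.add(a)
    return set(deleted) | dependent


if __name__ == "__main__":
    # Hypothetical edges chosen to mirror the example: line 22 is deleted,
    # line 23 is added, and lines 3, 19, and 21 depend on line 23.
    edges = [(3, 23), (19, 23), (21, 23), (3, 19)]
    print(sorted(vulnerable_lines({22}, {23}, edges)))
```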
4.2.2. Dataset Cleaning
We perform multiple filtering steps on the Big-Vul dataset. First, we remove all comments from the code. Second, we ignore code changes that are purely cosmetic, such as changes to whitespace, and consequently remove any functions with no non-cosmetic code changes. Third, we remove improperly truncated functions. A few samples in the original dataset were truncated incorrectly, resulting in an unparsable and invalid code sample. For example, a function that was originally 50 lines may be incorrectly truncated to 40 lines for no apparent reason. This may be due to an error in how the dataset was originally constructed; however, there were only 30 such samples in the whole dataset. We use a random training/validation/test split ratio of 80:10:10. For the training set, we undersample the non-vulnerable samples to produce an approximately balanced dataset at the function level, while the test and validation sets are left in the original imbalanced ratio. We choose to balance the samples at the function level as it is non-trivial to balance at the statement level while maintaining the contextual dependencies between statements within a function.
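A minimal sketch of the split-then-undersample procedure is shown below, assuming each sample is a hypothetical `(sample_id, is_vulnerable)` pair. Only the training portion is balanced; validation and test keep their original label ratio. The function name and the fixed seed are illustrative choices.

```python
import random
from typing import List, Tuple

Sample = Tuple[int, int]  # (sample_id, is_vulnerable), a hypothetical layout


def split_and_balance(samples: List[Sample], seed: int = 0):
    """Random 80:10:10 split, then undersample non-vulnerable training functions."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]

    vulnerable = [s for s in train if s[1] == 1]
    non_vulnerable = [s for s in train if s[1] == 0]
    # Keep as many non-vulnerable functions as there are vulnerable ones.
    kept = rng.sample(non_vulnerable, min(len(vulnerable), len(non_vulnerable)))
    balanced_train = vulnerable + kept
    rng.shuffle(balanced_train)
    return balanced_train, val, test


if __name__ == "__main__":
    data = [(i, 1 if i % 10 == 0 else 0) for i in range(1000)]
    train, val, test = split_and_balance(data)
    print(len(train), len(val), len(test))
```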
4.3. Evaluation Metrics
We report F1-score, precision, recall, area under the receiver operating characteristic curve (ROC-AUC), and area under the precision-recall curve (PR-AUC). While ROC-AUC is widely used to directly measure the predictive power of a model without choosing a specific threshold, it cannot fully reflect effectiveness on imbalanced datasets. Hence, in addition to reporting ROC-AUC for comparison with past literature, we also use PR-AUC, which is better suited for imbalanced problems.
While binary classification is the primary focus, as described in Section 3.1, we also report ranked metrics to evaluate the performance of the most confident predictions of the model. The ranked metrics include mean average precision (MAP), normalized discounted cumulative gain (nDCG), and mean first ranking (MFR). Here, we define first ranking as the rank of the first correctly predicted vulnerable statement. We care about how well the model performs on a certain number of its most confident lines, as this further limits the amount of code that needs to be reviewed by the developer. In this work, we choose $k = 5$ as the cut-off for the top-ranked predictions, following the recommendation of Li et al. (ivdetect). However, in practical usage, this threshold could be adjusted by the user. Finally, we report accuracy of interpretation given N nodes, where N is equal to 5 (denoted as N5 in Table 1). This can be interpreted as function-level accuracy given only the prediction results of the top five lines.
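The ranked metrics can be sketched for a single function as follows. Statements are sorted by model confidence, and the first ranking, average precision at $k$, and nDCG at $k$ are computed from the resulting label sequence; MFR and MAP are then the means of the first two over all functions. The average precision here is normalized by the number of hits within the top $k$, which is one of several common variants and an assumption of this sketch, as are all function names.

```python
import math
from typing import List, Optional


def rank_by_confidence(scores: List[float], labels: List[int]) -> List[int]:
    """Return labels ordered from most to least confident prediction."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def first_rank(ranked: List[int]) -> Optional[int]:
    """1-based rank of the first truly vulnerable statement, if any."""
    for position, label in enumerate(ranked, start=1):
        if label == 1:
            return position
    return None


def average_precision_at_k(ranked: List[int], k: int = 5) -> float:
    hits, total = 0, 0.0
    for position, label in enumerate(ranked[:k], start=1):
        if label == 1:
            hits += 1
            total += hits / position  # precision at this hit
    return total / hits if hits else 0.0


def ndcg_at_k(ranked: List[int], k: int = 5) -> float:
    def dcg(labels: List[int]) -> float:
        return sum(l / math.log2(p + 1) for p, l in enumerate(labels, start=1))
    ideal = dcg(sorted(ranked, reverse=True)[:k])
    return dcg(ranked[:k]) / ideal if ideal else 0.0


if __name__ == "__main__":
    ranked = rank_by_confidence([0.9, 0.2, 0.8, 0.1], [0, 0, 1, 1])
    print(ranked, first_rank(ranked), average_precision_at_k(ranked), ndcg_at_k(ranked))
```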
Due to the imbalanced nature of the dataset, we first find the best threshold for the F1-score using the validation set before calculating the F1-score on the test set. When determining the significance of improvements, we use the Wilcoxon signed-rank test (wilcoxon) on the F1-scores of ten runs with random seeds. The best models according to the automated hyper parameter tuning are used for the test.
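Threshold selection on the validation set can be sketched as a simple sweep: each candidate threshold is scored by F1 on validation predictions, and the best one is then reused unchanged on the test set. The candidate grid of 0.01 to 0.99 and the function names are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def f1_score(labels: List[int], preds: List[int]) -> float:
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(val_scores: List[float], val_labels: List[int],
                   candidates: Optional[List[float]] = None) -> Tuple[float, float]:
    """Return the (threshold, F1) pair maximizing F1 on the validation set."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in val_scores]
        score = f1_score(val_labels, preds)
        if score > best_f1:  # keep the lowest threshold among ties
            best_t, best_f1 = t, score
    return best_t, best_f1


if __name__ == "__main__":
    print(best_threshold([0.1, 0.2, 0.35, 0.4, 0.9], [0, 0, 1, 1, 1]))
```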
4.4. Hyper parameters
Hyper parameters of LineVD were tuned using Ray Tune (raytune) with randomized grid search. The hyper parameter details can be found in the publicly released source code. The scores are reported corresponding to the mean test results across ten runs using the best hyper parameters based on the loss of the validation set.
4.5. Experimental Design of Research Questions
In RQ1, we focus on the comparison between our proposed model and existing methods from the literature. For the SVD task, a practical experimental setting is that the vulnerability remains unknown for a given piece of code, whether at the function level or the file level. Thus, leveraging interpretable ML models for the validation and interpretation of the prediction results of SVD models is the dominant approach. In this work, we compare LineVD with the state-of-the-art interpretation-based SVD model, namely IVDetect (ivdetect). IVDetect is designed as an interpretable vulnerability detector, combining a learned model to detect vulnerabilities with an intelligent assistant that provides statement-level interpretations of those vulnerabilities. It has shown state-of-the-art detection and localization performance among graph-based software vulnerability detection and interpretation models. While IVDetect included an empirical evaluation against existing function-level deep learning-based approaches, we provide a full replication of IVDetect for comparison in RQ1 in terms of its performance in identifying vulnerable statements.
In RQ2, we investigate the impacts of utilizing pretrained embedding methods for statement-level vulnerability prediction. While vectorizing the code statement information (also node information in this work) is a critical step, we have included different feature embedding methods, including CodeBERT, Doc2Vec, and averaged GloVe embeddings to understand their impacts on prediction performance. These two baselines were chosen to represent the embedding methods usually seen in vulnerability detection models (sysevr; ivdetect; vuldeepecker; deepwukong; bgnn4vd). GloVe and Word2Vec embeddings usually perform similarly in the vulnerability detection context (glovew2vvd), and thus we only use GloVe, which was also the chosen word embedding technique used in IVDetect (ivdetect). For classification of the embeddings, we use multiple hidden layers, and freeze all parameters of the CodeBERT model during training.
In RQ3, we explore the effect of introducing GNN layers into the feature extraction component of the model. In this experiment, we compare the use of two popular GNN variants: Graph Convolution Networks (GCN) (gcn) and Graph Attention Networks (GAT) (gat). These two GNN types were chosen to explore the effect of different GNN architectures on statement-level vulnerability classification. The GNN layers are inserted after the feature extraction component and before the final hidden layers. We also test two different types of graphs extracted from the source code: Program Dependence Graphs (PDG) and Control Dependence Graphs (CDG). PDGs consist of both data dependency edges and control dependency edges. Data dependency edges describe how data flows between statements in the program (which statements are influenced by which variables), while control dependency edges describe the order in which statements execute, as well as whether they execute or not. We use both variants to explore whether GCNs and GATs can utilize data dependency information from related statements to better distinguish vulnerable statements. The control and data dependencies are extracted using the Joern program (yamaguchi2014modeling). We remove samples that either cannot be parsed by Joern, or are correctly parsed but do not contain control or data dependency edges. Since the nodes produced by Joern are not necessarily at the line-level, we group together nodes with the same line number, and remove nodes with no line numbers (e.g. metadata nodes).
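The node grouping step can be sketched as follows: nodes that share a line number are merged into one line-level node, nodes without a line number are dropped, and edges are re-expressed between lines. The input layout (a dict of node id to line number and a list of typed edges) is a hypothetical simplification and not Joern's actual export format; dropping edges that fall within a single line is also an assumption of this sketch.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[int, int, str]  # (source, destination, edge type)


def group_nodes_by_line(node_lines: Dict[int, Optional[int]],
                        edges: List[Edge]) -> Tuple[List[int], List[Edge]]:
    """Collapse fine-grained nodes into line-level nodes and edges."""
    # Drop nodes without a line number (e.g. metadata nodes).
    line_of = {nid: line for nid, line in node_lines.items() if line is not None}
    line_edges: Set[Edge] = set()
    for src, dst, edge_type in edges:
        if src in line_of and dst in line_of:
            a, b = line_of[src], line_of[dst]
            if a != b:  # ignore edges inside a single line (assumption)
                line_edges.add((a, b, edge_type))
    return sorted(set(line_of.values())), sorted(line_edges)


if __name__ == "__main__":
    nodes = {1: 3, 2: 3, 3: 5, 4: None}
    raw_edges = [(1, 3, "data"), (2, 3, "data"), (4, 1, "control"), (1, 2, "data")]
    print(group_nodes_by_line(nodes, raw_edges))
```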
In addition to the GNN component, we simultaneously explore how to incorporate function-level information, which may help LineVD reduce the false positive rate of statement-level classification. This is done by also training on the function-level label of the function, which is embedded using CodeBERT.
In RQ4, we explore how LineVD performs in a cross-project scenario; i.e. samples from the target project do not appear in the source projects which are used in model training. We simulate this by producing multiple splits where the test set is made of a single project, and the rest of the projects are used in the training and validation set. We choose the top 10 projects with the most vulnerable samples in the dataset to use as cross-project splits.
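The cross-project splits can be sketched as a leave-one-project-out loop over the projects with the most vulnerable samples. The sample layout (dicts with hypothetical `project` and `vulnerable` keys) and the function name are assumptions; the remaining projects would still need to be divided into training and validation sets.

```python
from collections import Counter
from typing import Dict, List, Tuple

Sample = Dict[str, object]  # hypothetical keys: "project", "vulnerable"


def cross_project_splits(samples: List[Sample],
                         top_n: int = 10) -> Dict[str, Tuple[List[Sample], List[Sample]]]:
    """For each top project, hold it out as the test set."""
    counts = Counter(s["project"] for s in samples if s["vulnerable"])
    splits = {}
    for project, _ in counts.most_common(top_n):
        test = [s for s in samples if s["project"] == project]
        train_val = [s for s in samples if s["project"] != project]
        splits[project] = (train_val, test)
    return splits


if __name__ == "__main__":
    toy = [
        {"project": "a", "vulnerable": True},
        {"project": "a", "vulnerable": True},
        {"project": "b", "vulnerable": True},
        {"project": "b", "vulnerable": False},
        {"project": "c", "vulnerable": False},
    ]
    for name, (train_val, test) in cross_project_splits(toy, top_n=2).items():
        print(name, len(train_val), len(test))
```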
In RQ5, we examine which statement types (e.g. if-statement, goto-statement) LineVD can correctly distinguish. We use the node types given by Joern. For the “Control Structure” node type, we replace it with the given control structure type (e.g. if, while, for). For “Function Call” node types, we split them into two different categories: “built-in” function calls, which are functions that appear in the C standard library, and “external” function calls, which are any other functions not found in the standard C library. This is an important distinction, as each sample in the dataset consists of a code snippet at the function level. This means the only information present in external function calls is in the identifier name itself, along with its arguments, rather than the contents of the external function. For “Operator” node types, we group them into the following categories: assignment, arithmetic, comparisons, access, and logical. Any operator node types that do not fall into these categories are grouped into “other”.
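As a rough sketch of the relabelling described above, the function below refines the two node types that are split further. The set of standard library names is a small hypothetical subset chosen for illustration, the function signature is an assumption, and the operator grouping is omitted because the exact operator names are not given here.

```python
# Hypothetical, deliberately incomplete subset of C standard library functions.
C_STDLIB_SUBSET = frozenset(
    {"memcpy", "memset", "strcpy", "strlen", "malloc", "free", "printf"}
)


def statement_type(node_type, name=None):
    """Map a node type (plus an optional name) to a finer statement type.

    node_type: type label such as "Control Structure" or "Function Call".
    name: the control keyword (e.g. "if") or the called function's name.
    """
    if node_type == "Control Structure" and name is not None:
        # Replace the generic label with the concrete structure type.
        return name
    if node_type == "Function Call":
        if name in C_STDLIB_SUBSET:
            return "Builtin Function Call"
        return "External Function Call"
    # All other node types are kept as given in this sketch.
    return node_type


print(statement_type("Control Structure", "while"))
print(statement_type("Function Call", "memcpy"))
print(statement_type("Function Call", "parse_header"))
```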
We run our experiments on a computing cluster utilizing multiple NVIDIA Tesla V100 GPUs and Xeon E5-2698v3 CPUs operating at 2.30 GHz.
5.1. RQ1: How much performance advantage can LineVD achieve in comparison with the state-of-the-art SVD model?
Table 1 and Table 2 summarize the performance comparison of LineVD with respect to IVDetect, a recently proposed fine-grained vulnerability detection approach, using various evaluation measures. We show that LineVD significantly outperforms the IVDetect method in all metrics. In Table 1, the ranked metrics are included to show the accuracy of different models. A higher accuracy means that the model can generate more precise vulnerability detection at the statement level. As seen in Table 1, LineVD largely outperforms IVDetect in all ranked metrics (). For ranked metrics, LineVD improves by in accuracy value from to , and increases MAP@5 by over IVDetect from to . For MFR, LineVD improves the performance by ranks, which significantly increases the efficiency of correctly locating the first vulnerable statement by 114.5%.
When examining the distribution of rankings for LineVD, we find that it is extremely skewed towards the lower end, with some highly incorrect predictions raising the MFR. The distribution of first-rank scores is plotted in Figure 4, which shows that the majority (89%) of first ranking scores are between 1 and 5. This suggests that LineVD may struggle with certain types of longer functions, but ranks the vulnerable statements correctly in most cases.
Table 2 provides the binary classification metrics, which are the most straightforward evaluation (whether or not a line is correctly predicted as a vulnerable statement), and the ones directly aligned with our problem definition. Overall, LineVD outperforms IVDetect by 104% () in F1-score. Specifically, for practical usage, we notice that the recall score has also been substantially improved from to , indicating that LineVD has boosted the ability to correctly determine and locate the vulnerable statements.
We also note that another advantage of our architecture is its efficiency in generating statement-level predictions compared to IVDetect. The use of an explanation model significantly increases the inference time for a single sample, as an entire model must be trained to find a subgraph that explains the model’s prediction, compared to the single forward pass required by LineVD. Using an NVIDIA Tesla V100, a single inference using GNNExplainer can take over a minute, depending on its configuration, while the single forward pass in LineVD takes less than a second.
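To make the ranked metrics concrete, the sketch below computes a first-rank score per function, the mean first rank (MFR), and a top-k accuracy. It assumes the commonly used definitions (rank of the highest-scored truly vulnerable statement, averaged over functions); the exact definitions used for Table 1 may differ, and the toy scores are invented for illustration.

```python
def first_rank(scores, labels):
    """Return the 1-based rank of the first truly vulnerable statement when
    statements are sorted by predicted score (descending), or None if the
    function has no vulnerable statement."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            return rank
    return None


def mean_first_rank(functions):
    """Average first rank over functions with at least one vulnerable line."""
    ranks = [first_rank(s, l) for s, l in functions]
    ranks = [r for r in ranks if r is not None]
    return sum(ranks) / len(ranks)


def top_k_accuracy(functions, k):
    """Fraction of functions whose first vulnerable line is ranked within k."""
    ranks = [first_rank(s, l) for s, l in functions]
    ranks = [r for r in ranks if r is not None]
    return sum(r <= k for r in ranks) / len(ranks)


# Toy data: two well-ranked functions and one badly ranked longer function.
toy_functions = [
    ([0.9, 0.1, 0.4], [1, 0, 0]),
    ([0.2, 0.8, 0.5], [0, 0, 1]),
    ([0.1, 0.2, 0.3, 0.9, 0.8, 0.7, 0.6], [1, 0, 0, 0, 0, 0, 0]),
]
print(mean_first_rank(toy_functions))
print(top_k_accuracy(toy_functions, k=5))
```

In the toy data, a single poorly ranked function pulls the mean first rank well above the typical first rank, which mirrors the skew described above.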
5.2. RQ2: How do different code embedding methods affect statement-level vulnerability detection?
We evaluate the effectiveness of CodeBERT relative to other code embedding methods. To keep the comparison fair and simple, as stated in Section 4.5.2, we use the multi-layer perceptron model and restrict the embedding process to statement-level information. In Table 3, CodeBERT outperforms the baseline feature embedding methods, Doc2Vec and GloVe, in the context of statement-level vulnerability prediction. CodeBERT increases the F1-score in comparison to Doc2Vec by 134% (), and GloVe by 16% (). This is to be expected, as the pre-trained CodeBERT model has over 125 million parameters, which gives this larger and deeper model an enhanced capability to encode richer information from code snippets compared to the other models with fewer layers and parameters.
We also note that, while CodeBERT has the advantage over GloVe and Doc2Vec of being pre-trained on a large corpus, it has the disadvantage of being trained on an external dataset consisting of non-C/C++ code in five other programming languages. In comparison, the GloVe and Doc2Vec models are trained directly on our C/C++ dataset. Although Doc2Vec might be expected to provide a better encoding of the contextual information about statements, it is GloVe, using the averaged aggregation method, that achieves the second-best performance. Nonetheless, CodeBERT is the best code embedding method in our comparative experiments with the other most popular code embedding methods for SVD to date.
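The averaged aggregation mentioned for GloVe can be illustrated as follows: a statement embedding is taken as the element-wise mean of its token vectors. This is a minimal sketch with a made-up two-dimensional vocabulary; the tokenization and the handling of unknown tokens (skipped here) are assumptions.

```python
def average_embedding(tokens, vectors, dim):
    """Embed a statement as the mean of its known token vectors.

    tokens: list of token strings for one statement.
    vectors: dict mapping token -> list of floats of length dim.
    Unknown tokens are skipped (an assumption of this sketch); a statement
    with no known tokens maps to the zero vector.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Made-up vectors, for illustration only.
toy_vectors = {"x": [1.0, 0.0], "=": [0.0, 1.0], "1": [1.0, 1.0]}
print(average_embedding(["x", "=", "1", ";"], toy_vectors, dim=2))
```

Because the mean ignores token order, this aggregation cannot distinguish statements that contain the same tokens in a different arrangement.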
5.3. RQ3: Can graph neural networks and function-level information benefit statement-level classification?
Table 4 shows the effect of different experiment settings involving GNN type, program graph type, and whether or not the function-level classification component is included in the model. As can be seen from Table 4, using a graph attention network to learn features from PDG information is the best fit, which we have hence included as the graph component in LineVD. When comparing this GNN feature extraction combination with the model variation without a GNN, we achieve an increase of 24% in F1 score () from 0.296 to 0.360, indicating that the presence of the GNN benefits the model’s learning capability. However, the performance of other graph-based combinations is generally comparable to using the model without a GNN. This suggests that the program graph type and GNN type non-trivially affect the performance of the model. In particular, only using control dependency edges generally results in lower performance than using both control and data dependencies, suggesting that data dependency edges are important in predicting statement-level vulnerabilities.
In addition, we find that GCN achieves worse performance in comparison to GAT for all model types, which is to be expected, as the feature aggregation in GCN treats all neighboring nodes equally, unlike GAT, which attends to certain neighbors. However, when only statement-level information is used for classification (i.e. no function-level information is involved), the use of GCN with either of the graph types, or of GAT with CDG, results in worse performance compared to using no GNN. Finally, we find that the enrichment of function-level information in the model significantly increases the statement-level performance, regardless of whether a GNN is used. Using the GAT+PDG combination, we achieve a performance increase of 140% ().
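The difference in neighbor aggregation can be illustrated with a deliberately simplified sketch. It omits the learned linear transforms, degree normalization, and nonlinearities of real GCN and GAT layers; the attention vector `attn` stands in for GAT's learned attention parameters and is set by hand here.

```python
import math


def mean_aggregate(features, neighbors):
    """GCN-style aggregation: every neighbor contributes equally."""
    dim = len(features[0])
    return [sum(features[j][d] for j in neighbors) / len(neighbors)
            for d in range(dim)]


def attention_aggregate(features, node, neighbors, attn):
    """GAT-style aggregation: neighbors are weighted by a softmax over
    attention scores computed from the concatenation [h_node || h_j].

    attn: hand-set vector of length 2 * dim (learned in a real GAT).
    """
    def leaky_relu(x, slope=0.2):
        return x if x > 0 else slope * x

    scores = [
        leaky_relu(sum(a * x for a, x in zip(attn, features[node] + features[j])))
        for j in neighbors
    ]
    # Numerically stable softmax over the neighbor scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(features[0])
    out = [sum(w * features[j][d] for w, j in zip(weights, neighbors))
           for d in range(dim)]
    return out, weights


toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_aggregate(toy_features, neighbors=[1, 2]))
agg, attn_weights = attention_aggregate(
    toy_features, node=0, neighbors=[1, 2], attn=[0.0, 0.0, 1.0, 0.0]
)
print(agg, attn_weights)
```

With the hand-set attention vector, the second neighbor receives a larger weight than the first, whereas the mean gives both the same weight of one half.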
5.4. RQ4: How does LineVD perform in a cross-project classification scenario?
From Table 5, we can see that performance is generally consistent throughout the different projects. Due to the nature of the various projects, there are varying numbers of vulnerable samples for each project. The top 10 software projects are reported in Table 5. While the performance varies, in general, the results are all slightly lower than those from the random splits in Table 2. This is expected, as even partial inclusion of within-project information such as variable and function names may assist the model in distinguishing vulnerable samples.
In Table 5, the investigated projects are ranked according to the number of vulnerabilities, which is indicated in the last column ‘Vuln’. We note that the Chromium and Linux splits contain a disproportionately large number of vulnerable samples. The Chromium split itself accounts for 30% of all vulnerable samples in the dataset. Despite this, the performance is competitive with the other splits, indicating that the model generalises well across different settings and achieves comparable performance even with a smaller training set, as shown in Table 5.
5.5. RQ5: Which statement types are best distinguished by LineVD?
The accompanying table shows the raw prediction results for different statement types. We report the confusion matrix values for each statement type and sort them by F1 score. We see that assignment operations and external function calls are the most commonly occurring vulnerable lines, and that the model most often correctly classifies Function Declaration statements. This could be due to the significantly higher level of information in function declarations compared to other statements, which can consist of the return type, function name, parameter types, and the parameter names. Operation-related statements generally perform similarly, except for logical operations, which LineVD can generally better distinguish, and access and cast operations, which achieve relatively poorer performance.
LineVD has higher performance with built-in function call statements compared to external function calls. This is somewhat expected, as the same built-in functions are likely to appear across more samples, unlike user-defined function names. In addition, due to the function-level nature of the dataset, the only information in external function call statements is from the function name itself and the arguments passed. Without knowledge of what occurs within an external function call, it can be difficult to find distinguishing patterns relating to their vulnerability.
|Builtin Function Call||142||246||4537||72||0.47|
|External Function Call||644||2212||63378||582||0.32|
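The F1 values in the two rows above can be checked against their counts. The column header is not shown here, so the sketch assumes that the first and third counts are true positives and true negatives, and that the second and fourth are the two error counts (false positives and false negatives, in either order). Since F1 is symmetric in the two error counts, the ordering does not affect the result.

```python
def f1_from_counts(tp, err_a, err_b):
    """F1 = 2*TP / (2*TP + FP + FN); symmetric in the two error counts."""
    denom = 2 * tp + err_a + err_b
    return 2 * tp / denom if denom else 0.0


# (first count, second count, third count, fourth count) as listed above;
# the interpretation of the columns is an assumption.
table_rows = {
    "Builtin Function Call": (142, 246, 4537, 72),
    "External Function Call": (644, 2212, 63378, 582),
}
for name, (tp, err_a, _tn, err_b) in table_rows.items():
    print(name, round(f1_from_counts(tp, err_a, err_b), 2))
```

Under this reading, the computed values round to the 0.47 and 0.32 listed in the rows.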
LineVD struggles most with continue, break, and goto statements. These are all control structure nodes that affect the control dependency graph, and do not contain any direct statement-level information that could distinguish them from each other. Hence, to distinguish the vulnerability of these statements, the model must rely on the patterns relating to the context in which they appear. Despite using a GNN to capture these relationships, these statements still cannot be classified correctly. Hence, while the use of a GNN does improve overall performance, these results indicate that neither GCN nor GAT is sufficient for effectively distinguishing these particular control statements. In contrast, statement types like function declarations, function calls, and for/while/if control structures often contain significantly more information within the statement, and hence can be distinguished more easily. This focus on more complex control structure statement types and built-in function calls is suitable for practical usage, as they are more likely to be directly involved in the vulnerability.
In this section, we discuss the threats to validity and limitations of our work, along with ideas for future work.
6.1. Threats to Validity
The first threat relates to potentially sub-optimal hyperparameter tuning. We rely on a random search over a grid of hyperparameter values, as it would be close to impossible to exhaustively test all combinations of hyperparameters. To mitigate this threat, we adopt best practices where possible when choosing the default range of values for the hyperparameter search, and tune each model variation with the same number of random parameter configurations.
Another threat concerns the dataset. We use a maintained C/C++ dataset built for vulnerability detection, in the same way as (ivdetect). The information in this dataset is largely intra-procedural. Most samples in current vulnerability detection datasets (devign; reveal; bigvul) lack inter-procedural information (including functions from other files), which makes it difficult for both learning-based approaches and regular static analysis tools to determine the underlying nature of a vulnerability. Certain requirements, such as handling macros and inter-procedural function calls, would need to be met to capture all the information needed to learn the underlying nature of the vulnerabilities. These requirements are non-trivial, so training a learning-based model is an extremely challenging task when looking only at source code. However, future work could explore the possibility of producing datasets with deeper contextual information for individual samples, which could then be used to more comprehensively test the capabilities of GNNs for solving these issues.
There has also been recent work on the effect of time-based validation on software defect prediction (timesplitsdefect), which could also be relevant to vulnerability detection. We mitigate this threat by also exploring cross-project vulnerability prediction, since time-based validation primarily applies to within-project prediction. Furthermore, any graph-based model should intuitively have similar capabilities with regard to learning patterns associated with vulnerabilities, and hence the performance loss (or gain) should be linear across the different baseline models. However, this is something that could be tested more thoroughly in future work.
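The tuning protocol described above, drawing the same number of random configurations for every model variation, could look roughly like this. The search space, its values, and the variant names are hypothetical placeholders, not the ranges used in our experiments.

```python
import random


def sample_configs(search_space, n_configs, seed=0):
    """Draw n_configs random configurations from a grid of candidate values.

    search_space: dict mapping hyperparameter name -> list of candidates.
    """
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in search_space.items()}
        for _ in range(n_configs)
    ]


# Hypothetical search space, for illustration only.
toy_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "hidden_dim": [64, 128, 256],
    "dropout": [0.1, 0.2, 0.3],
}

# Every model variation receives the same tuning budget.
budget = 5
trials = {
    variant: sample_configs(toy_space, budget, seed=i)
    for i, variant in enumerate(["no_gnn", "gcn_pdg", "gat_pdg"])
}
for variant, configs in trials.items():
    print(variant, len(configs))
```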
6.2. Limitations and Future Work
While LineVD outperforms the current state-of-the-art, the performance is still quite low, meaning there is still much room for improvement in the statement-level SVD task. A particular area of interest is finding the best way to produce better code embeddings. While CodeBERT can perform well, it is not trained on C/C++. The use of a large language model pretrained on the specific target language (e.g. C-BERT (cbert)) could improve the performance of downstream tasks. However, at the time of writing, there were no openly released C/C++-based large pretrained models.
Another limitation concerns the learning capability of the GNN layer, which inherently only propagates information from immediately neighboring nodes (i.e. 1-hop). To propagate information over a longer range, multiple graph neural network layers can be stacked. For example, two GNN layers would propagate information across a 2-hop neighborhood for each node. However, using more than just a few layers in most graph neural network architectures has been shown to result in over-squashing, where the exponentially growing information between each layer cannot be captured within a fixed-length vector (gnnbottleneck), resulting in bottlenecked performance. This is consistent with our findings; we tuned the number of GNN layers during experimentation but did not find any significant increase in performance beyond two layers.
Future work could explore novel GNN architectures that can better capture vulnerability patterns in program graphs. Moreover, among the two categories of data-driven SVD solutions, software metrics-based (hovsepyan2016newer; scandariato2014predicting; du2019leopard) and pattern-based (li2016vulpecker; vuldeepecker; vuldeelocator) approaches, a few may have the potential to deliver fine-grained prediction outcomes. In future work, a systematic comparison with these other tools would give a broader and deeper perspective for evaluation.
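The relationship between the number of stacked layers and the neighborhood each node can draw information from can be sketched without any learning at all. The helper below only tracks which nodes are reachable after a given number of message-passing rounds; the adjacency-dict layout and the toy chain graph are assumptions for illustration.

```python
def receptive_field(adjacency, node, num_layers):
    """Return the set of nodes whose information can reach `node` after
    num_layers rounds of 1-hop message passing.

    adjacency: dict mapping node -> iterable of neighboring nodes.
    """
    reached = {node}
    for _ in range(num_layers):
        # Each layer extends the reach by one additional hop.
        reached |= {nbr for n in reached for nbr in adjacency.get(n, ())}
    return reached


# Toy chain of statements: 0 - 1 - 2 - 3.
toy_chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
for layers in (1, 2, 3):
    print(layers, sorted(receptive_field(toy_chain, 0, layers)))
```

On denser graphs the reachable set grows much faster with each layer, which is the source of the over-squashing issue: the information from an expanding neighborhood must still fit into one fixed-length node vector.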
7. Related work
To recap the state of the art in software vulnerability detection research, we introduce the related work from two perspectives: 1) the application of GNNs to SVD, and 2) the explainability of machine learning models for SVD.
7.1. Software vulnerability detection with GNN
Detecting software vulnerabilities is an ongoing research topic in securing software systems against cyber attacks. Recent advances have advocated the application of GNNs and their variants to function-level vulnerability detection, where graphs are considered the best way of representing source code in the SVD context (plbart; devign), and have demonstrated enhanced performance over other approaches (bgnn4vd; ivdetect; duan2019vulsniper).
The introduction of GNNs for modelling vulnerabilities was originally inspired by the vulnerability discovery approach proposed by Yamaguchi et al. (yamaguchi2014modeling) using Code Property Graphs, a type of program graph incorporating program dependency edges (liu2006gplag; johnson2015exploring), control flow edges, and the abstract syntax tree of the program, which provide an additional source of information to learn from (cui2020vuldetector). Hence, the performance improvements using GNNs can primarily be attributed to leveraging the domain knowledge that lines of source code within a program have specific relationships to other lines; i.e., training using both semantic and syntactical information, rather than only syntactical information.
To achieve this goal, the lines of source code, which are also the nodes in the graph, are first vectorized according to their code tokens. The vectorized information is then concatenated as initial node information to allow GNNs to capture the semantic and syntactical information. This propagation of information between semantically relevant statements theoretically allows the model to make better use of relevant contextual lines. While GNNs have been shown to perform well in function-level vulnerability classification (graph classification) (devign; bgnn4vd; reveal; deepwukong; lin2021deep), the effects of vectorization methods and GNN models on statement-level vulnerability classification (node classification) have yet to be explored.
7.2. Interpretable machine-learning-based models for SVD
One way to improve machine-learning-based methods for SVD in practice is to develop explainable detection results, which could provide a fine-grained vulnerability prediction outcome. In particular, there have been works attempting to detect line-level information by leveraging explainable artificial intelligence for software engineering tasks, such as detecting source code lines for defect prediction (wattanakriengkrai2020predicting; pornprasit2022deeplinedp). This raises the importance of research into interpretable machine-learning-based models. Beyond novel machine-learning-based models that output function-level vulnerability detection, existing works are limited to providing partial information for explanation generation, i.e., tokens (svdheuristicxai) and intermediate code (vuldeelocator). Another approach to the statement-level SVD task is to localize the specific vulnerable statements under the assumption that the input function is already known to be vulnerable (ding2022velvet). However, this is a limitation, as vulnerability detection and localization are mostly required at the same time.
A recently explored benefit of GNNs for SVD is direct access to explainable GNN approaches (ivdetect). These can be attached to any GNN-based function-level SVD model to obtain fine-grained predictions; i.e., statement-level predictions if the nodes themselves represent statements. However, whether or not it is the best way to build a statement-level vulnerability classifier has yet to be explored. For example, one disadvantage of using a function-level detector as the base model for interpretation is the inability to directly leverage any statement-level information during the training process, in addition to the significantly longer inference times.
We introduce LineVD, a novel deep learning approach for statement-level vulnerability detection, which can allow developers to more efficiently evaluate potentially vulnerable functions. LineVD achieves a new state-of-the-art on statement-level vulnerability detection on real-world open source projects by leveraging graph neural networks and statement-level information during training. The significant improvement in comparison to the latest fine-grained machine-learning based model indicates the effectiveness of directly utilizing statement-level information for statement-level SVD. Finally, LineVD achieves reasonable cross-project performance, indicating its effectiveness and generalization capabilities even for completely unseen software projects. Future directions will include exploring alternate pretrained feature embedding methods and novel GNN architectures that can better accommodate the underlying nature of software source code and the vulnerabilities.
The work was supported by the Cyber Security Research Centre Limited whose activities are partially funded by the Australian Government’s Cooperative Research Centres Programme. This work was supported with supercomputing resources provided by the Phoenix HPC service at the University of Adelaide.