
ReGVD: Revisiting Graph Neural Networks for Vulnerability Detection

Identifying vulnerabilities in source code is essential to protect software systems from cybersecurity attacks. It is, however, a challenging step that requires specialized expertise in security and code representation. Inspired by the successful applications of pre-trained programming language (PL) models such as CodeBERT and graph neural networks (GNNs), we propose ReGVD, a general and novel graph neural network-based model for vulnerability detection. In particular, ReGVD views a given source code as a flat sequence of tokens and then examines two effective methods of utilizing unique tokens and indexes, respectively, to construct a single graph as an input, wherein node features are initialized only by the embedding layer of a pre-trained PL model. Next, ReGVD leverages a practical advantage of residual connection among GNN layers and explores a beneficial mixture of graph-level sum and max poolings to return a graph embedding for the given source code. Experimental results demonstrate that ReGVD outperforms the existing state-of-the-art models and obtains the highest accuracy on the real-world benchmark dataset from CodeXGLUE for vulnerability detection.


1 Introduction

Software vulnerabilities have grown rapidly in recent years, whether reported through publicly disclosed information-security flaws and exposures (CVEs) or exposed inside privately owned source code and open-source libraries. These vulnerabilities are a main cause of cyber attacks on software systems, inflicting substantial economic and social damage (Neuhaus et al., 2007; Zhou et al., 2019). Therefore, vulnerability detection is an essential yet challenging step in identifying vulnerabilities in source code to provide security solutions for software systems.

Early approaches (Neuhaus et al., 2007; Nguyen & Tran, 2010; Shin et al., 2010) carefully design hand-engineered features for machine learning algorithms to detect vulnerabilities. These early approaches, however, suffer from two major drawbacks. First, designing good features requires prior knowledge, hence domain experts, and is usually time-consuming. Second, hand-engineered features are impractical and not straightforward to adapt to the numerous vulnerabilities in libraries that evolve over time.

To reduce human effort on feature engineering, recent approaches (Li et al., 2018; Russell et al., 2018) treat each raw source code as a flat natural-language sequence and explore deep learning architectures from natural language processing (NLP), such as LSTMs (Hochreiter & Schmidhuber, 1997) and CNNs (Kim, 2014), for vulnerability detection. Besides, pre-trained language models such as BERT (Devlin et al., 2018) have recently emerged as a dominant learning paradigm, offering numerous successful applications in NLP. Inspired by the successes of BERT-style models, pre-trained programming language (PL) models such as CodeBERT (Feng et al., 2020) have also brought significant improvements to PL downstream tasks, including vulnerability detection. However, as noted in (Nguyen et al., 2019), the interactions among all positions in the input sequence inside the self-attention layer of a BERT-style model form a complete graph, i.e., every position has an edge to every other position. This limits learning the local structures within the source code that differentiate vulnerabilities.

Graph neural networks (GNNs) have recently become a central approach for embedding nodes and graphs into low-dimensional continuous vector spaces (Hamilton et al., 2017; Wu et al., 2019). GNNs provide fast and practical training, high accuracy, and state-of-the-art results for downstream tasks such as text classification (Yao et al., 2019). Inspired by this architecture, Devign (Zhou et al., 2019) utilizes GNNs for vulnerability detection, using a complex pre-process to extract multi-edged graph information such as the Abstract Syntax Tree (AST), data flow, and control flow from the source code. This complex pre-process, however, is difficult to apply to many programming languages and to numerous open-source codes and libraries.

In this paper, we propose a general and novel graph neural network-based model, named ReGVD, for vulnerability identification. In particular, we also treat programming language as natural language. Hence, ReGVD views a given source code as a flat sequence of tokens and leverages two effective graph construction methods to build a single graph. The first method considers unique tokens as nodes and co-occurrences between tokens (within a fixed-size sliding window) as edges. The second considers indexes as nodes and co-occurrences between indexes as edges. To make a fair comparison with pre-trained PL models such as CodeBERT, ReGVD employs only the embedding layer of the pre-trained PL model to initialize node feature vectors. ReGVD then examines GNNs with a novel use of residual connection among GNN layers. Next, ReGVD exploits the sum and max poolings and utilizes a beneficial mixture of these poolings to produce a graph embedding for the given source code. This graph embedding is finally fed to a single fully-connected layer followed by a softmax layer to predict code vulnerabilities. To sum up, our main contributions are as follows:


  • We are inspired by pre-trained programming language models and graph neural networks to introduce ReGVD – a novel GNN-based model for vulnerability detection.

  • ReGVD makes use of effective code representation through two graph construction methods to build a graph for each given source code, wherein node features are initialized only by the embedding layer of a pre-trained PL model. ReGVD then introduces a novel adaptation of residual connection among GNN layers and an advantageous mixture of the sum and max poolings to learn a better code graph representation.

  • Extensive experiments show that ReGVD significantly outperforms the existing state-of-the-art models and produces the highest accuracy of 63.69%, gaining absolute improvements of 1.61% and 1.39% over CodeBERT and GraphCodeBERT respectively, on the benchmark vulnerability detection dataset from CodeXGLUE (Lu et al., 2021).

2 The proposed ReGVD

2.1 Problem definition

We formalize vulnerability detection as an inductive binary classification problem for source code at the function level, i.e., we aim to identify whether a given function in raw source code is vulnerable or not (Zhou et al., 2019). We define a data sample as $\{(c_i, y_i)\}_{i=1}^{n}$ with $c_i \in \mathcal{C}$ and $y_i \in \mathcal{Y}$, where $\mathcal{C}$ represents the set of raw source codes, $\mathcal{Y} = \{0, 1\}$ denotes the label set with $1$ for vulnerable and $0$ otherwise, and $n$ is the number of instances. As graph neural networks (GNNs) provide fast and practical training, high accuracy, and state-of-the-art results for many downstream tasks (Kipf & Welling, 2017), we leverage GNNs for vulnerability detection. Therefore, we construct a graph $\mathcal{G}(\mathcal{V}, \mathbf{X}, \mathbf{A})$ for each given source code $c_i$, wherein $\mathcal{V}$ is the set of $m$ nodes in the graph; $\mathbf{X} \in \mathbb{R}^{m \times d}$ is the node feature matrix, wherein each node $v \in \mathcal{V}$ is represented by a $d$-dimensional real-valued vector $\mathbf{x}_v \in \mathbb{R}^d$; and $\mathbf{A} \in \{0, 1\}^{m \times m}$ is the adjacency matrix, where $\mathbf{A}_{v,u} = 1$ means there is an edge between node $v$ and node $u$, and $\mathbf{A}_{v,u} = 0$ otherwise. We aim to learn a mapping function $f: \mathcal{G}(\mathcal{V}, \mathbf{X}, \mathbf{A}) \mapsto \{0, 1\}$ to determine whether a given source code is vulnerable or not. The mapping function $f$ can be learned by minimizing the loss function with $L_2$ regularization on the model parameters $\theta$ as:

$$\min \sum_{i=1}^{n} \mathcal{L}\big(f(\mathcal{G}_i(\mathcal{V}, \mathbf{X}, \mathbf{A})), y_i \,|\, c_i\big) + \lambda \|\theta\|_2^2 \quad (1)$$

where $\mathcal{L}$ is the cross-entropy loss function and $\lambda$ is an adjustable regularization weight.
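As a minimal sketch of this objective in PyTorch (the function name and the default value of $\lambda$ are ours, for illustration only):

```python
import torch.nn.functional as F

def regularized_loss(logits, labels, model, lam=1e-4):
    """Cross-entropy loss with L2 regularization on the model parameters
    (Equation 1). `lam` plays the role of the adjustable weight lambda;
    the default value here is illustrative, not taken from the paper."""
    ce = F.cross_entropy(logits, labels)  # labels: 1 vulnerable, 0 otherwise
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return ce + lam * l2
```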

Figure 1: An illustration of the two graph construction methods with a fixed-size sliding window of length 3.

Note that Devign (Zhou et al., 2019) uses a complex pre-process to build a multi-edged graph for each given source code. Hence, it is impractical for many programming languages (PLs) and numerous open-source codes and libraries. It is also worth noting that, recently, pre-trained PL models such as CodeBERT (Feng et al., 2020) have significantly improved the performance of PL downstream tasks such as vulnerability detection. However, these BERT-style PL models limit learning the local and logical structures inside the source code that differentiate vulnerabilities. To address these issues, we propose ReGVD – a novel and general GNN-based model using effective code representation for vulnerability detection as follows: (i) ReGVD views a given raw source code as a flat sequence of tokens and transforms this sequence into a single graph. (ii) ReGVD examines GCNs (Kipf & Welling, 2017) and Gated GNNs (Li et al., 2016), with a novel use of residual connection among GNN layers. (iii) ReGVD utilizes a new and beneficial mixture of the sum and max poolings to produce a graph embedding for the given source code.

In what follows, we first introduce two effective methods to construct a graph for each raw source code in Section 2.2, then describe how ReGVD utilizes graph neural networks with residual connection in Section 2.3, and finally present a graph-level readout layer in Section 2.4 to obtain the graph embedding used for the classification task.

2.2 Graph construction

We consider a given source code as a flat sequence of tokens and illustrate the two graph construction methods in Figure 1; both aim to keep the local programming logic of the source code. Note that we omit self-loops in these two methods since self-loops did not improve performance in our pilot experiments. A possible reason is that source code is more structured than natural language, where self-loops can contribute useful graph information (Yao et al., 2019; Huang et al., 2019; Zhang et al., 2020).

Unique token-focused construction

We represent unique tokens as nodes and co-occurrences between tokens (within a fixed-size sliding window) as edges, and the obtained graph has the adjacency matrix:

$$\mathbf{A}_{v,u} = \begin{cases} 1 & \text{if tokens } v \text{ and } u \text{ co-occur within a sliding window and } v \neq u \\ 0 & \text{otherwise} \end{cases}$$

As the size of the graph is much smaller than the actual length of the source code, this method can consume less GPU memory.
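As a concrete sketch (the function and variable names are ours, not from the paper), the unique token-focused construction can be written as:

```python
import numpy as np

def unique_token_graph(tokens, window_size=3):
    """Build the unique token-focused graph. Nodes are unique tokens; an
    edge connects two distinct tokens co-occurring within a fixed-size
    sliding window. Self-loops are omitted."""
    vocab = sorted(set(tokens))                     # unique tokens as nodes
    idx = {tok: i for i, tok in enumerate(vocab)}
    adj = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for start in range(max(1, len(tokens) - window_size + 1)):
        window = tokens[start:start + window_size]
        for i, t_i in enumerate(window):
            for t_j in window[i + 1:]:
                if t_i != t_j:                      # no self-loops
                    adj[idx[t_i], idx[t_j]] = 1.0
                    adj[idx[t_j], idx[t_i]] = 1.0
    return vocab, adj
```

For example, `unique_token_graph(["if", "(", "x", ")", "x", "++"], window_size=3)` yields a 5-node graph, since the token `x` appears twice but is represented by a single node.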

Index-focused construction

Given a flat sequence of tokens $\{t_1, t_2, \ldots, t_m\}$, we represent all tokens as nodes, i.e., treating each index $i$ as a node representing token $t_i$; hence the number of nodes equals the sequence length. We also consider co-occurrences between indexes (within a fixed-size sliding window) as edges, and the obtained graph has the adjacency matrix:

$$\mathbf{A}_{i,j} = \begin{cases} 1 & \text{if indexes } i \text{ and } j \text{ co-occur within a sliding window and } i \neq j \\ 0 & \text{otherwise} \end{cases}$$
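A matching sketch for the index-focused construction (again, names are ours): two distinct indexes are linked iff they fall within one sliding window, i.e., $|i - j| <$ window size.

```python
import numpy as np

def index_graph(num_tokens, window_size=3):
    """Index-focused construction: node i stands for token t_i; edges link
    distinct indexes i, j with |i - j| < window_size. No self-loops."""
    adj = np.zeros((num_tokens, num_tokens), dtype=np.float32)
    for i in range(num_tokens):
        for j in range(i + 1, min(num_tokens, i + window_size)):
            adj[i, j] = adj[j, i] = 1.0
    return adj
```

Unlike the unique token-focused graph, the number of nodes here always equals the sequence length.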

Node feature initialization

To leverage the advantage of pre-trained PL models such as CodeBERT while keeping the comparison fair, we use only the embedding layer of the pre-trained PL model to initialize the node feature vectors.
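A sketch of this initialization, assuming the HuggingFace transformers library and the microsoft/codebert-base checkpoint; only CodeBERT's embedding lookup table is used, never its transformer layers:

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
codebert = RobertaModel.from_pretrained("microsoft/codebert-base")
word_emb = codebert.embeddings.word_embeddings   # the embedding layer only

code = "if (x) x++;"                             # a toy function body
token_ids = tokenizer(code, return_tensors="pt")["input_ids"][0]
with torch.no_grad():
    node_features = word_emb(token_ids)          # (sequence length, 768)
```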

2.3 Graph neural networks with residual connection

Figure 2: An illustration of our proposed ReGVD.

GNNs aim to update vector representations of nodes by recursively aggregating representations from their neighbours (Scarselli et al., 2009; Kipf & Welling, 2017). Mathematically, given a graph $\mathcal{G}(\mathcal{V}, \mathbf{X}, \mathbf{A})$, we formulate GNNs as:

$$\mathbf{h}_v^{(k+1)} = \mathrm{A\small{GGREGATION}}\Big(\big\{\mathbf{h}_u^{(k)}\big\}_{u \in \mathcal{N}_v \cup \{v\}}\Big) \quad (2)$$

where $\mathbf{h}_v^{(k)}$ is the vector representation of node $v$ at the $k$-th iteration/layer; $\mathcal{N}_v$ is the set of neighbours of node $v$; and $\mathbf{h}_v^{(0)} = \mathbf{x}_v$ is the node feature vector of $v$.

Many GNNs have been proposed in the recent literature (Wu et al., 2019), wherein Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017) are the most widely used, and Gated graph neural networks ("Gated GNNs" or "GGNNs" for short) (Li et al., 2016) are also suitable for our data structure. Our ReGVD leverages GCNs and GGNNs as the base models.

Formally, GCNs are given as:

$$\mathbf{h}_v^{(k+1)} = \phi\Big(\sum_{u \in \mathcal{N}_v} a_{v,u}\,\mathbf{W}^{(k)}\mathbf{h}_u^{(k)}\Big) \quad (3)$$

where $a_{v,u}$ is an edge constant between nodes $v$ and $u$ in the Laplacian re-normalized adjacency matrix $\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}$ (as we omit self-loops), wherein $\mathbf{D}$ is the diagonal node degree matrix of $\mathbf{A}$; $\mathbf{W}^{(k)}$ is a weight matrix; and $\phi$ is a nonlinear activation function such as $\mathrm{ReLU}$.
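A sketch of one such layer in PyTorch (the class name is ours; the normalization follows the description above):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer, Equation 3 (a sketch)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size, bias=False)  # W^(k)

    def forward(self, h, adj):
        # Laplacian re-normalization D^{-1/2} A D^{-1/2}; self-loops omitted.
        deg = adj.sum(dim=-1).clamp(min=1.0)
        d_inv_sqrt = deg.pow(-0.5)
        a_hat = d_inv_sqrt.unsqueeze(-1) * adj * d_inv_sqrt.unsqueeze(-2)
        return torch.relu(a_hat @ self.linear(h))  # phi = ReLU
```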

GGNNs adopt GRUs (Cho et al., 2014), unroll the recurrence for a fixed number of timesteps, and remove the need to constrain parameters to ensure convergence:

$$\mathbf{a}_v^{(k+1)} = \sum_{u \in \mathcal{N}_v} a_{v,u}\,\mathbf{h}_u^{(k)}$$
$$\mathbf{z}_v^{(k+1)} = \sigma\big(\mathbf{W}^z\mathbf{a}_v^{(k+1)} + \mathbf{U}^z\mathbf{h}_v^{(k)}\big)$$
$$\mathbf{r}_v^{(k+1)} = \sigma\big(\mathbf{W}^r\mathbf{a}_v^{(k+1)} + \mathbf{U}^r\mathbf{h}_v^{(k)}\big)$$
$$\widetilde{\mathbf{h}}_v^{(k+1)} = \tanh\big(\mathbf{W}\mathbf{a}_v^{(k+1)} + \mathbf{U}\big(\mathbf{r}_v^{(k+1)} \odot \mathbf{h}_v^{(k)}\big)\big)$$
$$\mathbf{h}_v^{(k+1)} = \big(1 - \mathbf{z}_v^{(k+1)}\big) \odot \mathbf{h}_v^{(k)} + \mathbf{z}_v^{(k+1)} \odot \widetilde{\mathbf{h}}_v^{(k+1)} \quad (4)$$

where $\mathbf{z}$ and $\mathbf{r}$ are the update and reset gates, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication.
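A compact sketch of one Gated GNN step; we assume torch.nn.GRUCell, whose update and reset gates mirror the structure of Equation 4 (up to minor internal differences in where the reset gate is applied):

```python
import torch.nn as nn

class GGNNLayer(nn.Module):
    """One Gated GNN step (a sketch): sum messages from neighbours,
    then apply a GRU-style gated update."""
    def __init__(self, hidden_size):
        super().__init__()
        self.gru = nn.GRUCell(hidden_size, hidden_size)

    def forward(self, h, adj):
        a = adj @ h           # a_v: aggregated neighbour states
        return self.gru(a, h) # update/reset gates and candidate state inside
```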

The residual connection (He et al., 2016) is used to incorporate information learned in the lower layers into the higher layers and, more importantly, to allow gradients to pass directly through the layers, avoiding the vanishing or exploding gradient problems. Residual connections are employed in many architectures in computer vision and NLP. Motivated by this, ReGVD presents a novel adaptation of residual connection among the GNN layers, fixing the same hidden size across the different layers. In particular, ReGVD redefines Equation 3 as:

$$\mathbf{h}_v^{(k+1)} = \mathbf{h}_v^{(k)} + \phi\Big(\sum_{u \in \mathcal{N}_v} a_{v,u}\,\mathbf{W}^{(k)}\mathbf{h}_u^{(k)}\Big) \quad (5)$$

Similarly, ReGVD redefines the final update in Equation 4 as:

$$\mathbf{h}_v^{(k+1)} = \mathbf{h}_v^{(k)} + \big(1 - \mathbf{z}_v^{(k+1)}\big) \odot \mathbf{h}_v^{(k)} + \mathbf{z}_v^{(k+1)} \odot \widetilde{\mathbf{h}}_v^{(k+1)} \quad (6)$$
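Combining the pieces, a sketch of the residual stack (reusing the GCNLayer / GGNNLayer sketches above; the wrapper name is ours):

```python
import torch.nn as nn

class ResidualGNN(nn.Module):
    """K GNN layers with residual connections (Equations 5 and 6).
    All layers share one hidden size so that h + layer(h, adj) is
    well-defined; `layer_cls` is GCNLayer or GGNNLayer from above."""
    def __init__(self, layer_cls, hidden_size, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [layer_cls(hidden_size) for _ in range(num_layers)])

    def forward(self, h, adj):
        for layer in self.layers:
            h = h + layer(h, adj)  # residual connection among GNN layers
        return h
```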

2.4 Graph-level readout pooling layer

The graph-level readout layer is used to produce a graph embedding for each input graph. This layer can be built with more complex poolings such as hierarchical pooling (Cangea et al., 2018), differentiable pooling (Ying et al., 2018), and Conv pooling (Zhou et al., 2019). As the simple sum pooling produces better results for graph classification (Xu et al., 2019), ReGVD leverages the sum pooling to obtain the graph embedding. Besides, ReGVD utilizes the max pooling to exploit more information from the key nodes. ReGVD defines a beneficial mixture of the sum and max poolings to produce the graph embedding as follows:

$$\mathbf{e}_v = \sigma\big(\mathbf{h}_v^{(K)}\mathbf{W}_1 + \mathbf{b}_1\big) \odot \tanh\big(\mathbf{h}_v^{(K)}\mathbf{W}_2 + \mathbf{b}_2\big) \quad (7)$$
$$\mathbf{e}_{\mathcal{G}} = \mathrm{MIX}\Big(\sum_{v \in \mathcal{V}} \mathbf{e}_v,\ \mathrm{MaxPool}\big\{\mathbf{e}_v\big\}_{v \in \mathcal{V}}\Big) \quad (8)$$

where $\mathbf{e}_v$ is the final vector representation of node $v$, wherein $\sigma\big(\mathbf{h}_v^{(K)}\mathbf{W}_1 + \mathbf{b}_1\big)$ acts as a soft attention mechanism over nodes (Li et al., 2016), and $\mathbf{h}_v^{(K)}$ is the vector representation of node $v$ at the last $K$-th layer; $\mathrm{MIX}$ denotes an arbitrary mixture function. ReGVD examines three functions, SUM, MUL, and CONCAT, defined over $\mathbf{e}_{\mathrm{sum}} = \sum_{v \in \mathcal{V}} \mathbf{e}_v$ and $\mathbf{e}_{\mathrm{max}} = \mathrm{MaxPool}\{\mathbf{e}_v\}_{v \in \mathcal{V}}$ as:

$$\mathrm{SUM:}\quad \mathbf{e}_{\mathcal{G}} = \mathbf{e}_{\mathrm{sum}} + \mathbf{e}_{\mathrm{max}} \quad (9)$$
$$\mathrm{MUL:}\quad \mathbf{e}_{\mathcal{G}} = \mathbf{e}_{\mathrm{sum}} \odot \mathbf{e}_{\mathrm{max}} \quad (10)$$
$$\mathrm{CONCAT:}\quad \mathbf{e}_{\mathcal{G}} = \big[\mathbf{e}_{\mathrm{sum}} \,\|\, \mathbf{e}_{\mathrm{max}}\big] \quad (11)$$

After that, ReGVD feeds $\mathbf{e}_{\mathcal{G}}$ to a single fully-connected layer followed by a softmax layer to predict whether a given source code is vulnerable or not:

$$\hat{y} = \mathrm{softmax}\big(\mathbf{e}_{\mathcal{G}}\mathbf{W} + \mathbf{b}\big) \quad (12)$$
Finally, ReGVD is trained by minimizing the cross-entropy loss function. We illustrate our proposed ReGVD in Figure 2 and briefly present the learning process in ReGVD in Algorithm 1.

Algorithm 1: The learning process in ReGVD.
    Input: a source code $c$ with its label $y$
    Construct the graph $\mathcal{G}(\mathcal{V}, \mathbf{X}, \mathbf{A})$ from $c$; initialize $\mathbf{h}_v^{(0)} = \mathbf{x}_v$
    for $k = 0, \ldots, K-1$ do
        Update $\mathbf{h}_v^{(k+1)}$ by Equation 5 (GCN) or Equation 6 (GGNN)
    Compute $\mathbf{e}_{\mathcal{G}}$ by Equations 7 and 8; predict $\hat{y}$ by Equation 12

3 Experimental setup and results

In this section, we evaluate the benefits of our proposed ReGVD and address the following questions:

Q1

How does ReGVD compare to other state-of-the-art vulnerability detection methods?

Q2

Can the graph-level readout pooling layer proposed in our ReGVD work better than the more complex Conv pooling layer employed in Devign (Zhou et al., 2019)?

Q3

What is the influence of the residual connection, the mixture function, and the sliding window size on the GNN performance?

Q4

Can ReGVD obtain satisfactory accuracy results even with limited training data?

3.1 Experimental setup

Dataset #Instances
Training set 21,854
Validation set 2,732
Test set 2,732
Table 1: Statistics of the dataset.

Dataset

We use the real-world benchmark dataset from CodeXGLUE (Lu et al., 2021) for vulnerability detection at the function level (https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection). The dataset was first created by Zhou et al. (2019) and includes 27,318 manually labeled vulnerable or non-vulnerable functions extracted from security-related commits in two large and popular C-language open-source projects (QEMU and FFmpeg) that are diversified in functionality. Since Zhou et al. (2019) did not provide official training/validation/test sets, Lu et al. (2021) combined the two projects and split the data into training/validation/test sets. Table 1 reports the statistics of this benchmark dataset.

Training protocol

We construct a 2-layer model, set the batch size to 128, and employ the Adam optimizer (Kingma & Ba, 2014) to train our model for up to 100 epochs. As mentioned in Section 2.3, we set the same hidden size ("hs") for the hidden GNN layers, varying the size in {128, 256, 384}. We vary the sliding window size ("ws") in {2, 3, 4, 5} and tune the Adam initial learning rate ("lr") over a fixed grid. The final accuracy on the test set is reported for the best model checkpoint, i.e., the one that obtains the highest accuracy on the validation set. Table 2 shows the optimal hyper-parameters for each setting in our ReGVD.
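Schematically, the protocol reads as the following sketch (the helper and loader names are our assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def evaluate(model, loader):
    """Accuracy of `model` over a data loader (assumed helper)."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for graphs, labels in loader:
            pred = model(graphs).argmax(dim=-1)
            correct += (pred == labels).sum().item()
            total += labels.numel()
    return correct / total

def train_and_select(model, train_loader, valid_loader, test_loader, lr):
    """Train up to 100 epochs with Adam; report test accuracy at the
    checkpoint that maximizes validation accuracy."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_valid, best_state = 0.0, None
    for _ in range(100):
        model.train()
        for graphs, labels in train_loader:   # batch size 128
            optimizer.zero_grad()
            loss = F.cross_entropy(model(graphs), labels)
            loss.backward()
            optimizer.step()
        valid_acc = evaluate(model, valid_loader)
        if valid_acc > best_valid:            # keep best checkpoint (a clone
            best_valid = valid_acc            # would be safer in practice)
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return evaluate(model, test_loader)
```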

Const  Init  Base  lr  ws  hs
Idx    CB    GGNN  —   2   —
Idx    CB    GCN   —   2   —
Idx    G-CB  GGNN  —   2   —
Idx    G-CB  GCN   —   2   —
UniT   CB    GGNN  —   2   —
UniT   CB    GCN   —   5   —
UniT   G-CB  GGNN  —   3   —
UniT   G-CB  GCN   —   5   —
Table 2: The optimal hyper-parameters on the validation set for each ReGVD setting. “Const” denotes the graph construction method, wherein “Idx” and “UniT” denote the index-focused construction and the unique token-focused one, respectively. “Init” denotes the feature initialization, wherein “CB” and “G-CB” denote using only the embedding layer of CodeBERT and GraphCodeBERT to initialize the node features, respectively.

Baselines

We compare our ReGVD with strong and up-to-date baselines as follows:


  • BiLSTM (Hochreiter & Schmidhuber, 1997) and TextCNN (Kim, 2014) are two well-known standard models applied for text classification. Since we consider a given source code as a flat sequence of tokens, we can adapt these two models for vulnerability detection.

  • RoBERTa (Liu et al., 2019) is built based on BERT (Devlin et al., 2018) by removing the next-sentence pre-training objective and training on a massive dataset with larger mini-batches and learning rates.

  • Devign (Zhou et al., 2019) builds a multi-edged graph from a raw source code, then uses Gated GNNs (Li et al., 2016) to update node representations, and finally utilizes a 1-D CNN-based pooling (“Conv”) to make a prediction.

  • CodeBERT (Feng et al., 2020) is a bimodal pre-trained model also based on BERT for 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go), using the objectives of masked language model (Devlin et al., 2018) and replaced token detection (Clark et al., 2020).

  • GraphCodeBERT (Guo et al., 2021) is a new pre-trained PL model, extending CodeBERT to consider the inherent structure of code data flow into the training objective.

We note that Zhou et al. (2019) did not release the official implementation of Devign. Thus, we re-implement Devign using our two graph construction methods and the same training protocol w.r.t. the optimizer, the window sizes, the initial learning rate values, and the number of training epochs.

Model Accuracy (%)
BiLSTM 59.37
TextCNN 60.69
RoBERTa 61.05
CodeBERT 62.08
GraphCodeBERT 62.30
Devign (Idx + CB) 60.43
Devign (Idx + G-CB) 61.31
Devign (UniT + CB) 60.40
Devign (UniT + G-CB) 59.77
ReGVD (GGNN + Idx + CB) 63.54
ReGVD (GGNN + Idx + G-CB) 63.29
ReGVD (GGNN + UniT + CB) 63.62
ReGVD (GGNN + UniT + G-CB) 62.41
ReGVD (GCN + Idx + CB) 62.63
ReGVD (GCN + Idx + G-CB) 62.70
ReGVD (GCN + UniT + CB) 63.14
ReGVD (GCN + UniT + G-CB) 63.69
Table 3: Vulnerability detection accuracies (%) on the test set. The best scores are in bold, and the second-best scores are underlined. The results of BiLSTM, TextCNN, RoBERTa, and CodeBERT are taken from (Lu et al., 2021); we report our own results for the other baselines.

3.2 Main results

To address Q1, Table 3 presents the accuracy results of the proposed ReGVD and the strong and up-to-date baselines on the real-world benchmark dataset from CodeXGLUE for vulnerability detection. We note that TextCNN and RoBERTa outperform Devign in all settings except Devign (Idx + G-CB). Both recent models, CodeBERT and GraphCodeBERT, obtain competitive performance and perform better than Devign, indicating the effectiveness of pre-trained PL models. More importantly, ReGVD gains absolute improvements of 1.61% and 1.39% over CodeBERT and GraphCodeBERT, respectively. This shows the benefit of ReGVD in learning the local structures inside the source code to differentiate vulnerabilities (while using only the embedding layer of the pre-trained PL model). Hence, our ReGVD significantly outperforms the up-to-date baseline models. In particular, ReGVD produces the highest accuracy of 63.69% – a new state-of-the-art result on the CodeXGLUE vulnerability detection dataset.

In our pilot studies, we achieved higher results for ReGVD by feeding the flat sequence of tokens into the pre-trained PL model to obtain contextualized embeddings, which were then used to initialize the node feature vectors. For a fair comparison, however, we use only the embedding layer of the pre-trained PL model to initialize the feature vectors.

(a) Accuracy with and without the residual connection.
(b) Accuracy w.r.t. the mixture functions.
Figure 3: Accuracy with different settings.

To address Q2, we look at Figure 3(a) to investigate whether the graph-level readout layer proposed in ReGVD performs better than the more complex Conv pooling layer utilized in Devign. Since Devign also uses Gated GNNs to update the node representations and gains its best accuracy of 61.31% with the setting (Idx + G-CB), we consider the ReGVD setting (GGNN + Idx + G-CB) without the residual connection for a fair comparison. Here, ReGVD achieves an accuracy of 63.51%, which is 2.20% higher than that of Devign. More generally, the results of the three remaining ReGVD settings (without the residual connection) lead to the same conclusion: the graph-level readout layer utilized in ReGVD outperforms the Conv pooling used in Devign.

(a) Index-focused graph construction.
(b) Unique token-focused graph construction.
Figure 4: Accuracy with varying window sizes.
Figure 5: Accuracy with different percents of the training set.

We analyze the influence of the residual connection, the mixture function, and the sliding window size to answer Q3. We first look back at Figure 3(a) for the ReGVD accuracies with and without the residual connection among the GNN layers. It demonstrates that the residual connection helps to boost the GNN performance on seven settings, with a maximum accuracy gain of 2.05% for the ReGVD setting (GCN + Idx + G-CB). Next, we look at Figure 3(b) for the ReGVD results w.r.t. the mixture functions. We find that ReGVD generally gains its highest accuracy on six settings with one mixture function and on the two remaining settings with another. It is worth noting that even the ReGVD setting (GGNN + Idx + CB) with its weakest mixture function obtains an accuracy of 62.59%, which is still higher than that of Devign, CodeBERT, and GraphCodeBERT. Then, we check the results shown in Figure 4 to explore the influence of the sliding window size on accuracy w.r.t. the graph construction method. Smaller window sizes produce better accuracies for the index-focused construction, whereas larger sizes work better for the unique token-focused construction. We also find that window size 3 gives stable results and the highest average accuracy of 62.67% over all eight settings.

We test the best-performing settings with different percentages of the training data regarding Q4. Figure 5 shows the test accuracies when training ReGVD with 20%, 40%, 60%, 80%, and the full training set. Our model achieves satisfactory performance with limited training data compared to the baselines using the full training data. For example, ReGVD obtains an accuracy of 61.68% with 60% of the training set, which is higher than that of BiLSTM, TextCNN, RoBERTa, and Devign. It also achieves an accuracy of 62.55% with 80% of the training set, which is better than that of CodeBERT and GraphCodeBERT.

4 Conclusion

We introduce ReGVD, a novel graph neural network-based model for detecting vulnerabilities in source code. ReGVD transforms each raw source code into a single graph to capture the local structures inside the source code through two effective graph construction methods, wherein ReGVD utilizes only the embedding layer of the pre-trained programming language model to initialize node feature vectors. ReGVD then makes novel use of residual connection among GNN layers and a beneficial mixture of the sum and max poolings to learn a better code graph representation. To demonstrate the effectiveness of ReGVD, we conduct extensive experiments comparing ReGVD with strong and up-to-date baselines on the benchmark vulnerability detection dataset from CodeXGLUE. Experimental results show that the proposed ReGVD is significantly better than the baseline models and obtains the highest accuracy of 63.69% on the benchmark dataset.

ReGVD can be seen as a general, unified, and practical framework. Future work can adapt ReGVD to similar classification tasks such as clone detection. Furthermore, we plan to combine ReGVD's index-focused graph construction with a BERT-style model to build an encoder for other programming language tasks such as code completion and code translation.

Acknowledgements

We would like to thank Anh Bui (tuananh.bui@monash.edu) for his kind help and support.
