
Multi-View Pre-Trained Model for Code Vulnerability Identification

08/10/2022
by Xuxiang Jiang, et al.

Vulnerability identification is crucial for cyber security in the software-related industry. Early identification methods required significant manual effort in crafting features or annotating vulnerable code. Although recent pre-trained models alleviate this issue, they overlook the multiple types of rich structural information contained in the code itself. In this paper, we propose a novel Multi-View Pre-Trained Model (MV-PTM) that encodes both sequential and multi-type structural information of the source code and uses contrastive learning to enhance code representations. Experiments conducted on two public datasets demonstrate the superiority of MV-PTM. In particular, MV-PTM improves on GraphCodeBERT by 3.36% on average in terms of F1 score.


1 Introduction

Code vulnerabilities are a major threat to the software-related industry. It is reported that the number of vulnerabilities has grown from 4,600 in 2010 to 175,477 by 2022 (http://cve.mitre.org/), and the number is still increasing rapidly.

Accordingly, the field of vulnerability identification is under intensive exploration in academia. Early-stage vulnerability identification methods can be categorized into three types: static analysis [22, 1], dynamic analysis [10], and machine learning methods based on hand-crafted features [5, 17]. However, a common drawback restrains their performance: they require vulnerability-related expertise and significant manual effort, which makes them hard to deploy and poorly scalable [24].

In later-stage research, deep learning was applied to address this drawback [13]. Some studies [12, 24] leveraged state-of-the-art deep learning techniques, e.g., LSTMs and GGRNs. A common feature of these methods is that they require large amounts of labeled data for supervised training in order to outperform conventional methods. Unfortunately, data annotated with vulnerability categories is currently scarce, and manually annotating data is labor-intensive, which hinders the further development of these methods for vulnerability identification.

The emergence of pre-training techniques alleviates this problem. Several pre-trained models for source code have been proposed, such as CodeBERT [3] and CodeT5 [21]. However, a significant disadvantage of these methods is that they ignore rich structural information such as abstract syntax and control flow. A natural research question therefore arises: how can multiple types of structural information be combined with pre-trained models for vulnerability identification?

To tackle this question, we propose a novel Multi-View Pre-Trained Model (MV-PTM). Built on a pre-trained model, MV-PTM encodes both the sequential information and the multi-type structural information of the source code in a unified framework. Specifically, it generates representations of the code under different structural constraints; we term these representations the multiple views of the source code. In this work, we use analysis tools to extract the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Graph (DFG) of the source code and represent them as adjacency matrices, which are taken as input by the Structure-Aware Self-Attention Encoder to produce views carrying different semantics. MV-PTM makes vulnerability predictions based on these views. In addition, MV-PTM uses a contrastive learning method [19] to enhance the representations of structural information.

Our contributions are listed as follows:

  • We propose a novel approach based on the pre-trained model that learns different structural information of the source code in a unified framework, which endows our model with the capability to represent the semantics of code more accurately.

  • We perform contrastive learning based on multiple views of code to improve code representation learning, which our experiments show better characterizes the code.

  • MV-PTM significantly outperforms state-of-the-art methods, with an average of 3.85% higher Accuracy and 6.80% higher F-1 Score.

Figure 1: The architecture of MV-PTM. In the adjacency matrix, 1 denotes that there is an edge between the corresponding code nodes, and 0 otherwise. The pre-trained model layers are used to obtain the source code embedding.

2 Methodology

Overview. Fig. 1 shows the overview of MV-PTM. First, we use Tree-sitter (https://github.com/tree-sitter/tree-sitter-c) to parse the source code and obtain the code structural graphs. We then convert these graphs into adjacency matrices that guide the generation of multi-view code representations through the Structure-Aware Self-Attention Encoder and the pre-trained model. Afterward, the multi-view representations are fed to a Pooling Layer and a Multi-Layer Perceptron (MLP) for identification.
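For concreteness, the parsing step might look as follows with the py-tree-sitter bindings. This is a minimal sketch only: the grammar path build/my-languages.so, the example function, and the exact loading calls are assumptions (the binding API differs across py-tree-sitter versions).

```python
# Minimal sketch of the Tree-sitter parsing step (py-tree-sitter < 0.22 API).
# Assumes the C grammar has been compiled to build/my-languages.so beforehand.
from tree_sitter import Language, Parser

C_LANGUAGE = Language("build/my-languages.so", "c")
parser = Parser()
parser.set_language(C_LANGUAGE)

source = b"int f(int a, int b) { if (a > 3) { b = a - b; } return b; }"
tree = parser.parse(source)

def walk(node, depth=0):
    # Each node exposes a syntactic type and a byte span in the source,
    # from which edges between tokens can later be derived.
    print("  " * depth, node.type, node.start_byte, node.end_byte)
    for child in node.children:
        walk(child, depth + 1)

walk(tree.root_node)
```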

2.1 Structural Information

As aforementioned, we first obtain the different structural graphs: CFG, AST, and DFG. Each node in these graphs represents a program statement, and each edge represents a certain type of structural information. We use an adjacency matrix $A \in \{0, 1\}^{n \times n}$ to represent a certain type of edge in the graph, where $n$ is the number of tokens. We set $A_{ij} = 1$ if the $i$-th node and the $j$-th node are connected in the graph; otherwise, $A_{ij} = 0$.

CFG is a graphical representation of the paths that may be traversed during the execution of a program. For example, as shown in Fig. 2 (b), when the program executes the "if (a > 3)" statement, it decides whether "b = a - b" is executed according to the variable "a". AST is a tree-structured representation of the syntactic structure of the source code, in which each node represents a syntactic construct. We use the subtrees of the AST to analyze each statement in the program; specifically, the tokens within the same statement are connected to each other. DFG tracks the use of variables during program execution, including accesses and modifications. Taking Fig. 2 (d) for instance, the variable "a" in "b = a - b" comes from "a > 3".
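To illustrate how one such view is materialized, the sketch below builds the token-level adjacency matrix $A$ described above from a list of (i, j) edges. The edge list itself would come from the AST/CFG/DFG analysis; the token indices in the example are hypothetical.

```python
import numpy as np

def build_adjacency(num_tokens: int, edges: list) -> np.ndarray:
    """Build a symmetric {0,1} adjacency matrix with A[i, j] = 1
    iff tokens i and j are connected by an edge of the given view."""
    A = np.zeros((num_tokens, num_tokens), dtype=np.int64)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1
    return A

# AST-style view: connect all tokens belonging to one statement,
# e.g., tokens 4..8 spanning "b = a - b" (indices are illustrative).
ast_edges = [(i, j) for i in range(4, 9) for j in range(4, 9) if i != j]
A_ast = build_adjacency(16, ast_edges)
```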

Figure 2: An example of different structural adjacency matrices.

2.2 Structure-Aware Self-Attention Encoder

We utilize the pre-trained CodeBERT as the backbone of our approach to generate contextualized token representations, though our approach is compatible with other pre-trained models. Given the token sequence of a source code snippet $C = \{c_1, c_2, \dots, c_n\}$, the representations are obtained by:

$H = \mathrm{CodeBERT}(C)$ (1)
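As a minimal sketch, the contextualized representations $H$ in Eq. (1) can be produced with the public CodeBERT checkpoint in HuggingFace Transformers; the exact preprocessing shown here (model name microsoft/codebert-base, truncation at 512 tokens) is an assumed setup consistent with the length limit discussed in Section 3.1.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "int f(int a, int b) { if (a > 3) { b = a - b; } return b; }"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    H = model(**inputs).last_hidden_state  # (1, seq_len, 768): Eq. (1)
```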

On top of the backbone, we further design the Structure-Aware Self-Attention Encoder (SASA) based on the self-attention mechanism proposed by [20]. SASA combines the structural information matrix with the scaled dot-product attention through an addition operation:

$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$ (2)

where $Q$, $K$, and $V$ denote the query, key, and value matrices, respectively, which are set by $Q = HW^{Q}$, $K = HW^{K}$, and $V = HW^{V}$. $d_k$ is the dimension of $K$, and $\mathrm{softmax}$ is the Normalized Exponential Function. $M$ is the attention mask generated from the adjacency matrix of a specific type of structural information ($M_{ij} = 0$ if $A_{ij} = 1$ and $M_{ij} = -\infty$ otherwise), and it constrains which tokens the $i$-th token can attend to when computing attention values. Under the constraints of the adjacency matrices, SASA can generate multiple views containing different structural information.
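A minimal PyTorch sketch of Eq. (2) is given below. Deriving the additive mask $M$ from the {0,1} adjacency matrix, and adding self-loops so that every token can at least attend to itself, are our reading of the formulation rather than details stated above.

```python
import torch
import torch.nn.functional as F

def sasa_attention(H, W_q, W_k, W_v, A):
    """Structure-aware scaled dot-product attention (Eq. 2).
    H: (n, d) token representations; A: (n, n) {0,1} adjacency matrix."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = K.size(-1)
    # Self-loops keep every softmax row finite even for isolated tokens.
    A = A.bool() | torch.eye(A.size(0), dtype=torch.bool, device=A.device)
    # Connected pairs add 0 to the logits; disconnected pairs add -inf,
    # so their attention weight becomes exactly 0 after the softmax.
    M = torch.zeros(A.shape, dtype=Q.dtype, device=A.device)
    M = M.masked_fill(~A, float("-inf"))
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5 + M
    return F.softmax(scores, dim=-1) @ V
```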

We observe that different types of structural information share some similar dependencies. Inspired by this, we let node representation learning on the different views share one self-attention head, in addition to a view-specific self-attention head per view. To fuse the representations learned from the shared head and the view-specific heads, we use a linear mapping to project them into the same space. As a whole, the SASA attention for one structural adjacency matrix (one view) is calculated as follows:

$H_{v} = W \left[ H_{\mathrm{shared}} \,\|\, H_{\mathrm{view}} \right]$ (3)

where $H_{\mathrm{shared}}$ and $H_{\mathrm{view}}$ correspond to the representations learned from the shared self-attention head and the view-specific self-attention head, both computed as in Eq. 2, $W$ is the linear projection, and $\|$ denotes the concatenation operation.
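A sketch of this fusion step follows; the module name and the choice of a single linear layer over the concatenation are assumptions consistent with Eq. (3).

```python
import torch
import torch.nn as nn

class ViewFusion(nn.Module):
    """Fuse shared-head and view-specific-head outputs (Eq. 3):
    concatenate along the feature dimension, then project back."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, h_shared: torch.Tensor, h_view: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([h_shared, h_view], dim=-1))
```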

2.3 Multi-view Contrastive Learning

To enhance the representations learned from different structural information, we regard each type of information as one view and perform Contrastive Learning. This is motivated by the fact that different views of the same piece of code have some correlations and tend to cluster together in the semantic space.

To realize contrastive learning, we consider different views of the same code as positive pairs and views of different codes as negative pairs. The contrastive loss is expressed as:

$\mathcal{L}_{cl} = \mathcal{L}_{ast} + \mathcal{L}_{cfg} + \mathcal{L}_{dfg}$ (4)

where $\mathcal{L}_{ast}$ takes the AST as the matched structural view, and is analogous to $\mathcal{L}_{cfg}$ and $\mathcal{L}_{dfg}$. Each term is the Normalized Temperature-Scaled Cross Entropy (NT-Xent) loss [2].
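The sketch below shows a simplified cross-view variant of the NT-Xent term used in Eq. (4): within a batch, the two views of the same code form the positive pair and all other codes serve as negatives. Pooling each view to a single vector beforehand and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """Cross-view NT-Xent: z1[i] and z2[i] are pooled representations of
    two views of code i (positive pair); other batch items are negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)      # match z1[i] to z2[i]
```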

Figure 3: Diamonds, triangles, pentagons, and squares correspond to code sequence, DFG, CFG, and AST, respectively. The circle denotes the semantic space.

2.4 Training Loss

We leverage the Cross Entropy loss $\mathcal{L}_{ce}$ for training the main task, i.e., vulnerability identification, and the total loss for fine-tuning MV-PTM is given by:

$\mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{cl}$ (5)

where $\lambda$ is a hyper-parameter whose value is fixed in the experiments.
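Assembling Eq. (5) in code is straightforward; the $\lambda$ value below is illustrative only, since the value used in the experiments is not restated here.

```python
import torch.nn.functional as F

def total_loss(logits, labels, contrastive_loss, lam: float = 0.1):
    """Eq. (5): cross-entropy on the vulnerability label plus a
    lambda-weighted contrastive term (lam is illustrative)."""
    return F.cross_entropy(logits, labels) + lam * contrastive_loss
```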

3 Experiments

3.1 Experimental Setup

Datasets. We evaluate our approach on two C-language datasets used in previous studies [24, 21], which contain manually-labeled functions collected from the open-source projects FFmpeg and QEMU.

Since some code snippets in the datasets exceed the length limit of CodeBERT, we discard functions longer than 512 tokens; for such long code segments, the recognition accuracy of MV-PTM is not ideal.
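A minimal sketch of this length filter using the CodeBERT tokenizer (the sample list is a placeholder for the dataset's function strings):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

def within_limit(code: str, max_tokens: int = 512) -> bool:
    """Keep only functions whose tokenized length fits CodeBERT's limit."""
    return len(tokenizer(code)["input_ids"]) <= max_tokens

samples = ["int f(void) { return 0; }"]  # placeholder function strings
filtered = [code for code in samples if within_limit(code)]
```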

                 FFmpeg    QEMU
Training Set       3958   10903
Validation Set      462    1378
Test Set            499    1319
Total              4919   13600
Average Length    274.5   325.3
Table 1: Statistics of the datasets.

Baselines. We choose the following six methods as the baselines since they represent the most up-to-date vulnerability identification mechanisms:

VulDeePecker [12]: It turns the source code into a token sequence. The initial embeddings of the tokens are trained via Word2Vec [14].

CNN [15]: It models the source code as natural language and applies a CNN to extract features from the code. Its embedding initialization is the same as that of VulDeePecker.

Devign [24]: It represents the source code with the code property graph (CPG), which integrates all syntax and dependency semantics. Based on the graph, it uses a Gated Graph Recurrent Network [11] for graph-level classification.

SELFATT [20]: Similar to [12], it treats the source code as a sequence and exploits the multi-head attention mechanism for code representation learning.

CodeBERT [3]: It is a pre-trained model for programming languages which has achieved acceptable performance on many code-related tasks such as code search and code documentation generation.

GraphCodeBERT [6]: It is the first pre-trained model that leverages code structure to learn code representation to improve code understanding.

3.2 Experimental Results

Performance comparison. As shown in Table 2, MV-PTM outperforms all baseline methods on both datasets. From the experimental results, we summarize the following findings:

The local and structural characteristics of code improve vulnerability identification performance. Comparing CNN with VulDeePecker, we find that Accuracy improves significantly on both datasets, implying that the local characteristics learned by the CNN are indeed helpful for vulnerability identification. GraphCodeBERT outperforms CodeBERT with an average of 1.64% higher Accuracy and 3.44% higher F-1 Score.

MV-PTM performs best among all methods. MV-PTM further improves on CodeBERT. It is noteworthy that the F-1 Scores of the baselines on the QEMU dataset are not ideal, whereas MV-PTM raises the F-1 Score to 0.7049.

Methods               FFmpeg                 QEMU
                      Accuracy   F-1 Score   Accuracy   F-1 Score
VulDeePecker [12]     0.5622     0.5923      0.5956     0.5644
CNN [15]              0.6032     0.6278      0.6482     0.3974
Devign [24]           0.5904     0.6015      0.6039     0.3244
SELFATT [20]          0.6152     0.6323      0.6361     0.3701
CodeBERT [3]          0.6353     0.6431      0.6907     0.6102
GraphCodeBERT [6]     0.6613     0.6724      0.6975     0.6497
MV-PTM                0.6874     0.6843      0.7149     0.7049
Table 2: Experimental results on the two datasets.

3.3 Ablation study

In this section, we verify the effectiveness of the three structural information and contrastive learning methods used in our work according to the results of the ablation study. Table 3 shows the experimental results.

Methods                   FFmpeg                 QEMU
                          Accuracy   F-1 Score   Accuracy   F-1 Score
MV-PTM                    0.6874     0.6843      0.7149     0.7049
MV-PTM w/o CFG            0.6553     0.6643      0.6983     0.6865
MV-PTM w/o AST            0.6573     0.6674      0.7036     0.6845
MV-PTM w/o DFG            0.6513     0.6683      0.6990     0.6892
MV-PTM w/o Contrastive    0.6693     0.6845      0.7005     0.6688
Table 3: Effect of structural information and contrastive learning.

Structural information improves the performance of the model. It can be observed from Table 3 that when any kind of structural information is removed, the model's performance on both datasets decreases significantly.

Contrastive learning yields a more accurate multi-view representation of the source code. We find that after removing the contrastive learning module, the model's performance also decreases to a certain extent. This implies that the proposed contrastive learning method helps the model learn the different types of code structure information more effectively.

4 Related Work

Vulnerability Identification. Existing vulnerability identification methods are usually rule-based or learning-based. Rule-based methods are widely explored in academia; SUTURE [23] is a static analysis method capable of identifying high-order vulnerabilities in OS kernels. Learning-based methods are a newer research direction that attracts much attention; VulDeePecker [12] is an LSTM-based model which represents the source code as vectors.

Pre-trained Models for Programming Languages. CodeBERT [3] is a bimodal model for programming language and natural language trained with Masked Language Modeling and Replaced Token Detection. GraphCodeBERT [6] considers the inherent structure of code via Edge Prediction and Node Alignment to support tasks like code clone detection [16, 8, 9]. Compared to CodeBERT, MV-PTM improves accuracy by an average of 3.81% at the cost of 15% more parameters and 25% more training time, which is within acceptable limits.

Contrastive Learning. Contrastive learning is usually conducted in an unsupervised manner by increasing the similarity between the representations of positive pairs and decreasing the similarity between the representations of negative pairs [7, 18]. Data augmentation is a commonly-used technique to construct positive pairs, including rotation, scaling, and cropping in computer vision [2] and dropout in NLP [4].

5 Conclusion

In this paper, we propose MV-PTM, a pre-training-based model that uses structural information, including the AST, DFG, and CFG, to obtain multiple views of the source code. In addition, we introduce an auxiliary task based on contrastive learning to improve code representation. Experiments on two datasets demonstrate that structural information and contrastive learning are effective for vulnerability identification. In the future, we plan to introduce knowledge graphs to generate reasonable explanations for identified vulnerabilities.

Acknowledgment

This study was supported in part by the National Key R&D Program of China (2019YFB2102600), the National Natural Science Foundation of China (62002067), and the Guangzhou Youth Talent Program of Science (QT20220101174).

References

  • [1] M. Chandramohan, Y. Xue, Z. Xu, Y. Liu, C. Y. Cho, and H. B. K. Tan (2016) BinGo: cross-architecture cross-OS binary search. In SIGSOFT FSE, pp. 678–689.
  • [2] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607.
  • [3] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou (2020) CodeBERT: a pre-trained model for programming and natural languages. In EMNLP, pp. 1536–1547.
  • [4] T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. In EMNLP, pp. 6894–6910.
  • [5] G. Grieco, G. L. Grinblat, L. C. Uzal, S. Rawat, J. Feist, and L. Mounier (2016) Toward large-scale vulnerability discovery using machine learning. In CODASPY, pp. 85–96.
  • [6] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng, C. B. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, and M. Zhou (2021) GraphCodeBERT: pre-training code representations with data flow. In ICLR.
  • [7] K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9726–9735.
  • [8] C. Huang, H. Zhou, C. Ye, and B. Li (2021) Code clone detection based on event embedding and event dependency. CoRR.
  • [9] X. Ji, L. Liu, and J. Zhu (2021) Code clone detection with hierarchical attentive graph embedding. Int. J. Softw. Eng. Knowl. Eng., pp. 837–861.
  • [10] Y. Li, B. Chen, M. Chandramohan, S. Lin, Y. Liu, and A. Tiu (2017) Steelix: program-state based binary fuzzing. In ESEC/SIGSOFT FSE, pp. 627–637.
  • [11] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel (2016) Gated graph sequence neural networks. In ICLR.
  • [12] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong (2018) VulDeePecker: a deep learning-based system for vulnerability detection. In NDSS.
  • [13] G. Lin, S. Wen, Q. Han, J. Zhang, and Y. Xiang (2020) Software vulnerability detection using deep neural networks: a survey. Proc. IEEE, pp. 1825–1848.
  • [14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NeurIPS, pp. 3111–3119.
  • [15] R. L. Russell, L. Y. Kim, L. H. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. M. Ellingwood, and M. W. McConley (2018) Automated vulnerability detection in source code using deep representation learning. In ICMLA, pp. 757–762.
  • [16] A. Sheneamer, S. Roy, and J. Kalita (2021) An effective semantic code clone detection framework using pairwise feature fusion. IEEE Access, pp. 84828–84844.
  • [17] Y. Shin, A. Meneely, L. A. Williams, and J. A. Osborne (2011) Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities. IEEE Trans. Software Eng., pp. 772–787.
  • [18] X. Tang, C. Dong, and W. Zhang (2022) Contrastive author-aware text clustering. Pattern Recognit. 130, pp. 108787.
  • [19] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008.
  • [21] Y. Wang, W. Wang, S. R. Joty, and S. C. H. Hoi (2021) CodeT5: identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In EMNLP, pp. 8696–8708.
  • [22] Z. Xu, T. Kremenek, and J. Zhang (2010) A memory model for static analysis of C programs. In ISoLA, pp. 535–548.
  • [23] H. Zhang, W. Chen, Y. Hao, G. Li, Y. Zhai, X. Zou, and Z. Qian (2021) Statically discovering high-order taint style vulnerabilities in OS kernels. In CCS, pp. 811–824.
  • [24] Y. Zhou, S. Liu, J. K. Siow, X. Du, and Y. Liu (2019) Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In NeurIPS, pp. 10197–10207.