A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python

08/04/2021
by Amirreza Bagheri, et al.

In the age of big data and machine learning, when software development techniques and methods are evolving rapidly, a problem has arisen: programmers can no longer detect all the security flaws and vulnerabilities in their code manually. To overcome this problem, developers can rely on automatic techniques, such as machine learning based prediction models, to detect such issues. An inherent property of these approaches is that they take numeric vectors (i.e., feature vectors) as input. The source code must therefore be transformed into such feature vectors, a step often referred to as code embedding. A popular approach to code embedding is to adapt natural language processing techniques, such as text representation, to derive the necessary features from the source code automatically. However, the suitability of different text representation techniques for solving Software Engineering (SE) problems is rarely studied or compared systematically. In this paper, we present a comparative study of three popular text representation methods, word2vec, fastText, and BERT, applied to the SE task of detecting vulnerabilities in Python code. Using a data mining approach, we collected a large volume of Python source code in both vulnerable and fixed forms, embedded it into vectors with word2vec, fastText, and BERT, and trained a Long Short-Term Memory (LSTM) network on the resulting vectors. Because the LSTM architecture was the same in every case, we could compare how well the different embeddings derive meaningful feature vectors. Our findings show that all three text representation methods are suitable for code representation in this particular task, but the BERT model is the most promising: it is the least time consuming, and the LSTM model based on it achieved the best overall accuracy (93.8%) in predicting Python source code vulnerabilities.
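The code embedding step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `tokenize_source`, `toy_embedding`, and `embed_snippet` are hypothetical, the sequence length and vector dimension are arbitrary, and a seeded pseudo-random lookup stands in for a trained word2vec, fastText, or BERT model.

```python
import io
import random
import tokenize
from typing import List

# Token types that carry layout rather than content; dropped before embedding.
_SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
         tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}


def tokenize_source(code: str) -> List[str]:
    """Split Python source into lexical tokens using the standard tokenizer."""
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    return [tok.string for tok in tokens if tok.type not in _SKIP]


def toy_embedding(token: str, dim: int = 8) -> List[float]:
    """ASSUMPTION: stand-in for a trained embedding model.

    Each token is mapped to a fixed pseudo-random vector by seeding the
    generator with the token text, so equal tokens get equal vectors.
    Unlike a trained model, these vectors carry no semantic information.
    """
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_snippet(code: str, seq_len: int = 10, dim: int = 8) -> List[List[float]]:
    """Turn a code snippet into a fixed-size (seq_len x dim) feature matrix."""
    vectors = [toy_embedding(t, dim) for t in tokenize_source(code)][:seq_len]
    # Pad short snippets with zero vectors so every sample has the same shape.
    padding = [[0.0] * dim for _ in range(seq_len - len(vectors))]
    return vectors + padding


# Hypothetical example input: string concatenation in an SQL query.
snippet = "query = 'SELECT * FROM users WHERE id = ' + user_id\n"
matrix = embed_snippet(snippet)
print(len(matrix), len(matrix[0]))
```

A matrix of this shape, one vector per token padded to a fixed sequence length, is the kind of input a sequence model such as an LSTM consumes. In a real setup, `toy_embedding` would be replaced by a lookup into a trained embedding model, and the label (vulnerable or fixed) would accompany each matrix during training.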

research
03/13/2023

Automated Vulnerability Detection in Source Code Using Quantum Natural Language Processing

One of the most important challenges in the field of software code audit...
research
06/01/2023

Feature Engineering-Based Detection of Buffer Overflow Vulnerability in Source Code Using Neural Networks

One of the most significant challenges in the field of software code aud...
research
12/20/2021

Vulnerability Analysis of the Android Kernel

We describe a workflow used to analyze the source code of the Android OS...
research
01/20/2022

VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for Python

Context: Identifying potential vulnerable code is important to improve t...
research
08/08/2017

Automatic feature learning for vulnerability prediction

Code flaws or vulnerabilities are prevalent in software systems and can ...
research
04/05/2019

A Literature Study of Embeddings on Source Code

Natural language processing has improved tremendously after the success ...
research
04/29/2021

A comparative study of neural network techniques for automatic software vulnerability detection

Software vulnerabilities are usually caused by design flaws or implement...
