Bug Prediction Using Source Code Embedding Based on Doc2Vec

10/11/2021
by Tamás Aladics, et al.

Bug prediction is a resource-demanding task that is hard to automate using static source code analysis. Machine learning has proven extremely useful for tasks like this in many fields of computer science; for it to work here, however, we need a way to use source code as input. We propose a simple but meaningful representation for source code based on its abstract syntax tree and the Doc2Vec embedding algorithm. This representation maps source code to a fixed-length vector that can be used for various downstream tasks, one of which is bug prediction. We measured the validity of this approach on its own and its effectiveness compared to bug prediction based solely on code metrics. We also experimented with numerous machine learning approaches to examine how different embedding parameters interact with different machine learning models. Our results show that this representation provides meaningful information: it improves bug prediction accuracy in most cases and is always at least as good as using only code metrics as features.
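As a rough illustration of the first step described above, the sketch below flattens a piece of source code into a sequence of AST node-type names, which could then serve as the "document" for a Doc2Vec-style embedding. The abstract does not state the programming language or the traversal order used, so Python's standard `ast` module, the pre-order traversal, and the helper name `ast_to_tokens` are all assumptions made for illustration only.

```python
import ast


def ast_to_tokens(source):
    """Flatten source code into a list of AST node-type names.

    Assumption: a pre-order, depth-first traversal; the paper's actual
    traversal and token vocabulary are not given in the abstract.
    """
    tree = ast.parse(source)
    tokens = []

    def visit(node):
        # Record the node type, then descend into its children in field order.
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return tokens


# Hypothetical input snippet, used only to exercise the helper.
snippet = "def add(a, b):\n    return a + b\n"
tokens = ast_to_tokens(snippet)
print(tokens)
```

A token sequence like this could then be wrapped in a tagged document and passed to a Doc2Vec implementation (for example gensim's `Doc2Vec`, with the vector size as one of the embedding parameters to vary) to obtain the fixed-length vector, which in turn would be fed to a classifier alongside or instead of code metrics.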

