Cross-Language Source Code Clone Detection Using Deep Learning with InferCode

05/10/2022
by Mohammad A. Yahya, et al.

Detecting software clones helps uncover security gaps and supports software maintenance, whether within one programming language or across several. Existing work on source code clone detection performs well, but only within a single programming language. When code with the same functionality is written in different programming languages, detecting it is harder because the languages differ in lexical structure. Moreover, most existing work relies on manual feature engineering. In this paper, we propose a deep neural network model based on source code AST embeddings that detects cross-language clones end to end, without a manual process to pinpoint similar features across programming languages. To overcome data shortage and reduce overfitting, a Siamese architecture is employed. The design methodology of our model is twofold: (a) it accepts AST embeddings as input for two different programming languages, and (b) it uses a deep neural network to learn abstract features from these embeddings to improve the accuracy of cross-language clone detection. An early evaluation of the model yields average precision, recall, and F-measure scores of 0.99, 0.59, and 0.80, respectively, which indicates that our model outperforms all available models in cross-language clone detection.
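As a rough illustration of the Siamese idea described in the abstract, the sketch below passes two embedding vectors through the same encoder (shared weights) and compares the encoded outputs. It is not the authors' model. The single dense layer, the cosine similarity, the 0.8 threshold, and the names `make_encoder`, `cosine_similarity`, and `is_clone` are all assumptions made for illustration, and the input vectors are hard-coded placeholders rather than real AST embeddings.

```python
import math
import random


def make_encoder(in_dim, out_dim, seed=0):
    """Return one dense tanh layer; the weights are random placeholders, not trained."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(vec):
        if len(vec) != in_dim:
            raise ValueError("embedding has the wrong dimension")
        return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]

    return encode


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_clone(encode, emb_a, emb_b, threshold=0.8):
    """Siamese comparison: the SAME encoder is applied to both inputs (hypothetical threshold)."""
    return cosine_similarity(encode(emb_a), encode(emb_b)) >= threshold


# Placeholder vectors standing in for AST embeddings of two code fragments.
java_embedding = [0.2, -0.5, 0.9, 0.1]
python_embedding = [0.25, -0.45, 0.85, 0.05]

encoder = make_encoder(in_dim=4, out_dim=3)
score = cosine_similarity(encoder(java_embedding), encoder(python_embedding))
print("similarity:", score, "clone:", is_clone(encoder, java_embedding, python_embedding))
```

Because both inputs go through one set of weights, the model has fewer parameters to fit than two separate encoders would, which is one common reason a Siamese setup is chosen when labelled pairs are scarce.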

research
01/19/2022

Cross-Language Binary-Source Code Matching with Intermediate Representations

Binary-source code matching plays an important role in many security and...
research
05/01/2022

Unified Abstract Syntax Tree Representation Learning for Cross-Language Program Classification

Program classification can be regarded as a high-level abstraction of co...
research
03/10/2023

Software Vulnerability Prediction Knowledge Transferring Between Programming Languages

Developing automated and smart software vulnerability detection models h...
research
03/02/2023

Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection

Software clones are often introduced when developers reuse code fragment...
research
03/07/2017

End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks

Detecting buffer overruns from a source code is one of the most common a...
research
03/29/2019

Using Structured Input and Modularity for Improved Learning

We describe a method for utilizing the known structure of input data to ...
research
09/18/2017

A Survey of Machine Learning for Big Code and Naturalness

Research at the intersection of machine learning, programming languages,...
