Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection

03/02/2023
by Subroto Nag Pinku, et al.

Software clones are often introduced when developers reuse code fragments to implement similar functionality in the same or different software systems. Many high-performing clone detection tools today are based on deep learning and mostly detect clones written in the same programming language, although tools for detecting cross-language clones are also emerging rapidly. The popularity of deep learning-based clone detection tools creates an opportunity to investigate how known strategies for boosting the performance of deep learning models could be leveraged to improve clone detection. In this paper, we investigate one such strategy, data augmentation, which has been explored for single-language clone detection but not yet for cross-language clone detection. We show how existing knowledge on transcompilers (source-to-source translators) can be used for data augmentation to boost the performance of cross-language clone detection models, as well as to adapt single-language clone detection models into cross-language clone detection pipelines. To demonstrate the performance boost from data augmentation, we use Transcoder, a pre-trained source-to-source translator. To show how single-language models can be extended to cross-language clone detection, we combine a popular single-language model, Graph Matching Network (GMN), with transcompilers. We evaluated our models on popular benchmark datasets. Our experimental results showed improvements in F1 scores (sometimes up to 3%) for cross-language clone detection models. Even when extending GMN for cross-language clone detection, the models built with data augmentation outperformed the baseline, with scores of 0.90, 0.92, and 0.91 for precision, recall, and F1 score, respectively.
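As a rough illustration of the idea, the sketch below shows one plausible way transcompiler-based augmentation could turn labeled same-language pairs into cross-language training pairs: the second fragment of each pair is translated and the original label is kept. This is not the authors' pipeline. The names `augment_cross_language`, `ClonePair`, and `toy_translate` are hypothetical, and `toy_translate` is a lookup-table stand-in for a real transcompiler, whose actual interface is not described in the abstract.

```python
from typing import Callable, List, Optional, Tuple

# (fragment_a, fragment_b, label); label 1 = clone, 0 = non-clone.
ClonePair = Tuple[str, str, int]


def augment_cross_language(
    pairs: List[ClonePair],
    translate: Callable[[str], Optional[str]],
) -> List[ClonePair]:
    """Build cross-language pairs by translating the second fragment of each pair.

    Assumption: a translated fragment keeps the clone label of its source.
    Pairs whose translation fails (None or blank) are skipped.
    """
    augmented: List[ClonePair] = []
    for fragment_a, fragment_b, label in pairs:
        translated = translate(fragment_b)
        if not translated or not translated.strip():
            continue
        augmented.append((fragment_a, translated, label))
    return augmented


def toy_translate(java_src: str) -> Optional[str]:
    """Hypothetical stand-in for a transcompiler: a fixed Java-to-Python lookup."""
    lookup = {
        "int add(int a, int b) { return a + b; }": "def add(a, b):\n    return a + b",
    }
    return lookup.get(java_src.strip())


java_pairs: List[ClonePair] = [
    ("int sum(int x, int y) { return x + y; }",
     "int add(int a, int b) { return a + b; }", 1),
    ("int sum(int x, int y) { return x + y; }",
     "void log(String m) { System.out.println(m); }", 0),
]

# Only the first pair survives, because the toy translator does not know the second fragment.
cross_pairs = augment_cross_language(java_pairs, toy_translate)
```

In a real setting the `translate` argument would wrap a pre-trained translator, and the resulting Java-Python pairs would be added to the training set of a cross-language clone detection model.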
