GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench

08/26/2023
by Ajmain Inqiad Alam, et al.

With the emergence of Machine Learning, there has been a surge in leveraging its capabilities for problem-solving across various domains. In the code clone realm, the identification of Type-4, or semantic, clones has emerged as a crucial yet challenging task. Researchers aim to use Machine Learning to tackle this challenge, often relying on the BigCloneBench dataset. However, BigCloneBench was not originally designed for semantic clone detection, and it has several limitations that hinder its suitability as a comprehensive training dataset for this purpose. Furthermore, the CLCDSA dataset lacks reusable examples that align with real-world software systems, rendering it inadequate for cross-language clone detection approaches. In this work, we present GPTCloneBench, a comprehensive semantic clone and cross-language clone benchmark built by exploiting SemanticCloneBench and OpenAI's GPT-3 model. In particular, using code fragments from SemanticCloneBench as sample inputs, along with appropriate prompt engineering for the GPT-3 model, we generate semantic and cross-language clones for these fragments and then apply a combination of extensive manual analysis, tool-assisted filtering, functionality testing, and automated validation to build the benchmark. From 79,928 clone pairs of GPT-3 output, we created a benchmark with 37,149 true semantic clone pairs, 19,288 false semantic pairs (Type-1/Type-2), and 20,770 cross-language clones across four languages (Java, C, C#, and Python). Our benchmark is 15-fold larger than SemanticCloneBench, has more functional code examples for software systems and broader programming language support than CLCDSA, and overcomes BigCloneBench's limitations in quality, quantification, and language variety.
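To illustrate why Type-1/Type-2 pairs count as false semantic pairs, the following is a minimal sketch of a token-normalization check. It is not the tooling used to build GPTCloneBench; the function names, the small keyword set, and the regex tokenizer are all assumptions made for illustration. Two fragments that differ only in whitespace, C-style comments, identifier names, or literal values normalize to the same token sequence, so they are syntactic rather than semantic clones.

```python
import re

# Hypothetical, deliberately small keyword set; a real tool would use a
# proper lexer for each language instead.
KEYWORDS = {"int", "float", "void", "return", "for", "while", "if", "else", "def"}

# Simplified tokenizer: string literals, numbers, identifiers, single-char operators.
TOKEN_RE = re.compile(r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<op>[^\s\w])
""", re.VERBOSE)

# Naive C-style comment stripper (assumption: no comment markers inside strings).
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def normalize(code: str) -> list:
    """Return the token sequence with identifiers and literals abstracted away."""
    tokens = []
    for match in TOKEN_RE.finditer(COMMENT_RE.sub("", code)):
        kind, text = match.lastgroup, match.group()
        if kind == "ident" and text not in KEYWORDS:
            tokens.append("ID")    # Type-2: renamed identifiers look identical
        elif kind in ("string", "number"):
            tokens.append("LIT")   # Type-2: changed literal values look identical
        else:
            tokens.append(text)    # keywords and operators are kept verbatim
    return tokens


def is_syntactic_clone(a: str, b: str) -> bool:
    """True if the fragments are identical after normalization (Type-1/Type-2)."""
    return normalize(a) == normalize(b)


original = "int sum(int a, int b) { return a + b; }"
renamed = "int add(int x, int y) {\n  return x + y; // add\n}"
rewritten = "int add(int x, int y) { while (y != 0) { x++; y--; } return x; }"

print(is_syntactic_clone(original, renamed))
print(is_syntactic_clone(original, rewritten))
```

In this sketch the renamed pair would be rejected as a semantic clone candidate, while the rewritten pair passes the syntactic check and would still need functionality testing or manual review to confirm that it behaves the same.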


