Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis

06/26/2023
by Li Ke, et al.

The ubiquitous adoption of Large Language Generation Models (LLMs) in programming has underscored the importance of differentiating between human-written code and code generated by intelligent models. This paper specifically aims to distinguish code generated by ChatGPT from code authored by humans. Our investigation reveals disparities in programming style, technical level, and readability between these two sources. Based on these disparities, we develop a discriminative feature set for differentiation and evaluate its efficacy through ablation experiments. Additionally, we devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the scarcity of datasets and to obtain high-quality, uncontaminated datasets. To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code. The main contributions of our research are: a discriminative feature set that yields high accuracy in the binary classification task of separating ChatGPT-generated code from human-authored code; methods for generating ChatGPT-generated code at scale; and a dataset cleansing strategy that extracts clean, high-quality code datasets from open-source repositories, thereby achieving high accuracy in code authorship attribution tasks.
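To make the idea of a discriminative feature set more concrete, the following is a minimal sketch of how a few surface-level style and readability features could be computed for a code snippet. The abstract does not list the paper's actual features, so the function name `extract_features` and the four features below (comment ratio, blank-line ratio, average line length, average token length) are illustrative assumptions rather than the authors' feature set. The sketch also assumes `#`-style line comments.

```python
import re
from statistics import mean


def extract_features(source: str) -> dict:
    """Compute a few illustrative surface features of a code snippet.

    These features are hypothetical examples of "style" and "readability"
    signals; they are NOT taken from the paper.
    """
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    # Assumption: comments are whole lines starting with '#'.
    comment_lines = [ln for ln in non_blank if ln.lstrip().startswith("#")]
    # Word-like tokens (identifiers, keywords, and words inside comments).
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)

    return {
        "comment_ratio": len(comment_lines) / len(non_blank) if non_blank else 0.0,
        "blank_ratio": (len(lines) - len(non_blank)) / len(lines) if lines else 0.0,
        "avg_line_length": mean(len(ln) for ln in non_blank) if non_blank else 0.0,
        "avg_token_length": mean(len(t) for t in tokens) if tokens else 0.0,
    }


if __name__ == "__main__":
    sample = "# add two numbers\ndef add(a, b):\n\n    return a + b\n"
    for name, value in extract_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature vectors of this kind could then be passed to any standard binary classifier. Which features and which classifier the authors actually use is described in the full text, not in this abstract.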


