The "Code" of Ethics: A Holistic Audit of AI Code Generators

05/22/2023
by Wanlun Ma, et al.

AI-powered programming language generation (PLG) models have gained increasing attention for their ability to generate source code within seconds from a plain-language program description. Despite their remarkable performance, many concerns have been raised over the potential risks of their development and deployment, such as copyright infringement arising from training on licensed code, and malicious consequences of the unregulated use of these models. In this paper, we present the first-of-its-kind study to systematically investigate the accountability of PLG models from the perspectives of both model development and deployment. In particular, we develop a holistic framework not only to audit the training data usage of PLG models, but also to identify neural code generated by PLG models and determine its attribution to a source model. To this end, we propose using membership inference to audit whether a given code snippet is in a PLG model's training data. In addition, we propose a learning-based method to distinguish between human-written code and neural code. For neural code attribution, we show through both empirical and theoretical analysis that a single code snippet cannot be reliably attributed to one model. We then propose two feasible alternatives: attributing a neural code snippet to one of several candidate PLG models, and verifying whether a set of neural code snippets can be attributed to a given PLG model. The proposed framework thoroughly examines the accountability of PLG models and is validated by extensive experiments. The implementations of our framework are encapsulated into a new artifact, named CodeForensic, to foster further research.
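The abstract's training-data audit rests on membership inference, which in its simplest form exploits the fact that language models tend to assign lower loss to sequences they were trained on. The sketch below illustrates that loss-threshold idea; it is a minimal illustration, not the paper's actual method, and the `token_logprob` callable stands in for a real model's per-token log-probabilities (a hypothetical interface assumed here for self-containment).

```python
import math
from typing import Callable, List

# A token_logprob(tokens, i) callable returns the model's log-probability
# of tokens[i] given its context -- a stand-in for a real PLG model's API.
TokenLogProb = Callable[[List[str], int], float]

def membership_score(tokens: List[str], token_logprob: TokenLogProb) -> float:
    """Average negative log-likelihood (per-token loss) of a snippet
    under the target model. Lower loss hints at training-set membership."""
    nlls = [-token_logprob(tokens, i) for i in range(len(tokens))]
    return sum(nlls) / len(nlls)

def infer_membership(tokens: List[str], token_logprob: TokenLogProb,
                     threshold: float) -> bool:
    """Loss-threshold membership inference: predict 'member' when the
    model's per-token loss on the snippet falls below the threshold."""
    return membership_score(tokens, token_logprob) < threshold
```

In practice the threshold would be calibrated on snippets with known membership status, and stronger attacks compare the target model's loss against a reference model's; this sketch shows only the core decision rule.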


Related research

07/18/2023  Is this Snippet Written by ChatGPT? An Empirical Study with a CodeBERT-Based Classifier
08/04/2023  Vulnerabilities in AI Code Generators: Exploring Targeted Data Poisoning Attacks
06/26/2023  Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis
04/17/2022  WhyGen: Explaining ML-powered Code Generation by Referring to Training Examples
08/22/2023  Open Set Synthetic Image Source Attribution
05/24/2023  Who Wrote this Code? Watermarking for Code Generation
10/10/2022  SimSCOOD: Systematic Analysis of Out-of-Distribution Behavior of Source Code Models
