The significance of user-defined identifiers in Java source code authorship identification

01/29/2021
by   Georgia Frantzeskou, et al.

When writing source code, programmers have varying levels of freedom in creating and using identifiers. Do they habitually use the same identifiers, names that differ from those used by others? Is it then possible to tell who the author of a piece of code is by examining these identifiers? If so, can we use the presence or absence of identifiers to assist in correctly classifying programs to authors? Is it possible to hide the provenance of programs by renaming identifiers? In this study, we assess the importance of three types of identifiers (class names, method names, and simple variable names) in source code author classification for two different Java program data sets. We do this through a sequence of experiments in which we disguise one type of identifier at a time. The experiments are performed using the Source Code Author Profiles (SCAP) method. The results show that, although identifiers examined as a whole do not seem to reflect program authorship for these data sets, when they are examined separately there is evidence that class names do signal the author of the program. In contrast, simple variables and method names used in Java programs do not appear to reflect program authorship. Rather, our analysis suggests that such identifiers are so common that they mask authorship. We believe that these results are relevant to the robustness of code plagiarism analysis and that the underlying methods could be valuable in cases of litigation arising from disputes over program authorship.
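The experimental idea in the abstract (disguise one kind of identifier, then see whether classification still works) can be illustrated with a small sketch. It is not the authors' implementation. It assumes a SCAP-style profile built from the most frequent character n-grams of an author's code, compared by the size of the profile intersection; the abstract does not spell out these details. The function names, the `ID` placeholder, the toy Java snippets, and the parameter values `n` and `size` are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def profile(source, n=3, size=500):
    """Return the `size` most frequent character n-grams of `source` as a set.

    Assumption: a simplified SCAP-style profile; `n` and `size` are illustrative.
    """
    grams = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in grams.most_common(size)}


def similarity(profile_a, profile_b):
    """Score two profiles by the number of n-grams they share."""
    return len(profile_a & profile_b)


def disguise(source, names, placeholder="ID"):
    """Replace whole-word occurrences of the given identifiers with a placeholder.

    A regex stand-in for real identifier renaming; a faithful experiment would
    use a Java parser to tell class, method, and variable names apart.
    """
    if not names:
        return source
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(x) for x in sorted(names)) + r")\b")
    return pattern.sub(placeholder, source)


def classify(sample, author_sources, n=3, size=500):
    """Attribute `sample` to the author whose profile shares the most n-grams with it."""
    sample_profile = profile(sample, n, size)
    return max(
        author_sources,
        key=lambda author: similarity(sample_profile, profile(author_sources[author], n, size)),
    )


if __name__ == "__main__":
    # Hypothetical toy corpus: two authors with distinctive class-name habits.
    authors = {
        "alice": "class AliceLedger { int total; } class AliceReport { int rows; }",
        "bob": "class BobQueue { int head; } class BobStack { int top; }",
    }
    sample = "class AliceInvoice { int sum; }"
    print(classify(sample, authors))
    # Disguising the class name removes the n-grams that carried the signal.
    print(disguise(sample, {"AliceInvoice"}))
```

Running the classification twice, once on the raw sources and once after passing every source through `disguise` for a single identifier type, gives the kind of before-and-after comparison the study describes: if accuracy drops when class names are hidden, class names were carrying authorship signal.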


