Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

01/30/2020
by Egor Bogomolov, et al.

Authorship attribution of source code has been an established research topic for several decades. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. In this study, we first introduce a language-agnostic approach to authorship attribution of source code. Two machine learning models based on our approach match or improve on state-of-the-art results, originally achieved by language-specific approaches, on existing datasets for code in C++, Python, and Java. We then discuss the limitations of existing synthetic datasets for authorship attribution and propose a data collection approach that yields datasets that better reflect aspects important for potential practical use in software engineering. In particular, we discuss the concept of work context and its importance for authorship attribution. Finally, we demonstrate that the high accuracy of authorship attribution models on existing datasets drops drastically when they are evaluated on more realistic data. We conclude the paper by outlining next steps in the design and evaluation of authorship attribution models that could bring research efforts closer to practical use.


