Towards a Dataset of Programming Contest Plagiarism in Java

03/19/2023
by Evgeniy Slobodkin, et al.

In this paper, we present the first dataset of source code plagiarism aimed specifically at contest plagiarism. The dataset contains 251 pairs of plagiarized solutions to competitive programming tasks in Java, as well as 660 non-plagiarized pairs; the described approach can also be used to extend the dataset in the future. Importantly, each pair comes in two versions: (a) "raw" and (b) with participants' repeated template code removed, which allows tools to be evaluated in different settings. We used the collected dataset to compare the available source code plagiarism detection tools, including state-of-the-art ones, specifically in their ability to detect contest plagiarism. Our results indicate that the tools perform significantly worse on contest plagiarism because of the template code and the presence of other misleadingly similar code. Of the tested tools, token-based ones demonstrated the best performance on both variants of the dataset.
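To illustrate why repeated template code can mislead a token-based comparison, the following is a minimal sketch, not the algorithm of any tool evaluated in the paper. It assumes a crude regex lexer and scores two sources by Jaccard similarity over token trigrams. The class name `TokenSimilaritySketch`, its methods, and the toy inputs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSimilaritySketch {
    // Assumption: a crude lexer that yields identifiers/keywords,
    // integer literals, or any single non-whitespace symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|\\d+|\\S");

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Collect the set of all runs of n consecutive tokens.
    static Set<String> shingles(List<String> tokens, int n) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    // Jaccard similarity of the two shingle sets: |A ∩ B| / |A ∪ B|.
    static double similarity(String a, String b, int n) {
        Set<String> sa = shingles(tokenize(a), n);
        Set<String> sb = shingles(tokenize(b), n);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Hypothetical toy data: one shared input-reading template,
        // followed by two unrelated solution bodies.
        String template = "static int readInt() { return in.nextInt(); } ";
        String solutionA = "int a = readInt(); out.println(a * 2);";
        String solutionB = "long s = 0; for (int i = 0; i < 9; i++) s += i;";

        double raw = similarity(template + solutionA, template + solutionB, 3);
        double stripped = similarity(solutionA, solutionB, 3);
        System.out.println("raw=" + raw + " stripped=" + stripped);
    }
}
```

In this toy input the two solution bodies share no token trigram, so the raw score comes entirely from the shared template. That is the kind of misleading similarity the template-free variant of each pair is meant to remove.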


