The Adverse Effects of Code Duplication in Machine Learning Models of Code

12/16/2018
by Miltiadis Allamanis, et al.

The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach was recently identified by Lopes et al. (2017), who found a large amount of code duplication on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this article, we study the effect of code duplication on machine learning models, showing that reported metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora, which more accurately represent how machine learning models of code are used by software engineers. We present an "errata" for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them, and release tools to help the community avoid this problem in future research.
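
To make the leakage problem concrete, the following is a minimal Python sketch of how one might measure the overlap between a training set and a test set. It is not the authors' tooling: the function names (`fingerprint`, `deduplicate`, `leaked_fraction`) are hypothetical, and the sketch only catches snippets that are identical after whitespace is collapsed. Near-duplicates (for example, copies with renamed variables) would need a similarity measure rather than an exact hash.

```python
import hashlib
import re


def fingerprint(source: str) -> str:
    """Hash a snippet after collapsing whitespace.

    Assumption: whitespace-only differences should count as duplicates.
    This normalisation is illustrative, not taken from the paper.
    """
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(snippets):
    """Keep the first occurrence of each fingerprint, preserving order."""
    seen = set()
    unique = []
    for snippet in snippets:
        fp = fingerprint(snippet)
        if fp not in seen:
            seen.add(fp)
            unique.append(snippet)
    return unique


def leaked_fraction(train, test):
    """Fraction of test snippets whose fingerprint also occurs in train."""
    if not test:
        return 0.0
    train_fps = {fingerprint(s) for s in train}
    leaked = sum(1 for s in test if fingerprint(s) in train_fps)
    return leaked / len(test)


if __name__ == "__main__":
    train = ["int add(int a, int b) { return a + b; }", "void f() {}"]
    test = ["int  add(int a, int b)  { return a + b; }", "int g() { return 1; }"]
    print(leaked_fraction(train, test))
    print(len(deduplicate(train + test)))
```

A non-zero `leaked_fraction` means some test examples were effectively seen during training, so metrics computed on that test set can overstate how well a model handles unseen code. De-duplicating the whole corpus before splitting it into training and test sets avoids this for exact copies.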


