Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

05/25/2022
by   Anh T. V. Dau, et al.

Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. One possible reason is noise in the source code corpora used to train them. In this paper, we adapt data-influence methods to detect such noisy samples. Data-influence methods are used in machine learning to evaluate how similar a target sample is to known-correct samples, and so to determine whether the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples in the training data of neural code models for classification-based tasks. This approach contributes to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for building source code models that are useful in practice.
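The idea of scoring a target sample against known-correct samples can be illustrated with a small sketch. The abstract does not name a specific data-influence method, so the code below is an assumption: a gradient-similarity score in the spirit of TracIn-style methods, applied to a toy logistic-regression classifier rather than a neural code model. The weights, the trusted samples, and the function names are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical fixed model weights; a real setup would use a trained model's checkpoints.
WEIGHTS = [1.0, 0.0]

# Hypothetical trusted (known-correct) samples: (features, label).
TRUSTED = [
    ([2.0, 0.5], 1),
    ([1.5, -0.5], 1),
    ([-2.0, 0.3], 0),
    ([-1.5, -0.2], 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient(weights, features, label):
    """Gradient of the logistic loss with respect to the weights for one sample."""
    prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
    return [(prediction - label) * x for x in features]


def influence_score(weights, target, trusted):
    """Mean dot product between the target's gradient and each trusted sample's gradient.

    A positive score means a gradient step on the target would also reduce the
    loss on the trusted samples; a negative score means the target pulls the
    model away from them, which is a hint that it may be noisy.
    """
    target_grad = loss_gradient(weights, *target)
    dots = [
        sum(a * b for a, b in zip(target_grad, loss_gradient(weights, *sample)))
        for sample in trusted
    ]
    return sum(dots) / len(dots)


# Same features, two candidate labels: one agrees with the trusted data, one does not.
clean_target = ([1.8, 0.1], 1)
noisy_target = ([1.8, 0.1], 0)

if __name__ == "__main__":
    print("clean:", influence_score(WEIGHTS, clean_target, TRUSTED))
    print("noisy:", influence_score(WEIGHTS, noisy_target, TRUSTED))
```

In this toy setting the sample whose label agrees with the trusted data gets a positive score and the mislabeled one gets a negative score. Ranking training samples by such a score and inspecting the lowest-ranked ones is one plausible way to surface candidates for noisy samples.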


