Preprocessing Source Code Comments for Linguistic Models

08/23/2022
by Sergey Matskevich, et al.

Comments are an important part of source code and a primary source of documentation. This has driven interest in using large bodies of comments to train or evaluate tools that consume or produce them, such as tools that generate oracles or even code from comments, or that automatically generate code summaries. Most of this work makes strong assumptions about the structure and quality of comments, such as assuming they consist mostly of proper English sentences. However, we know little about the actual quality of existing comments for these use cases. Comments often contain unique structures and elements that are not seen in other types of text, and filtering or extracting information from them requires extra care. This paper explores the contents and quality of Python comments drawn from the 840 most popular open source projects on GitHub and 8422 projects from the SriLab dataset, and the impact that naïve vs. in-depth filtering can have on the use of existing comments for training and evaluating systems that generate comments.
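
To make the difference between raw comments and usable prose concrete, the following is a minimal sketch of comment extraction and a naïve filter for Python sources. It is not the pipeline used in the paper: the function names `extract_comments` and `looks_like_prose`, the three-word threshold, and the list of tool directives are all assumptions chosen for illustration. Only the standard-library `tokenize`, `ast`, `io`, and `re` modules are used.

```python
import ast
import io
import re
import tokenize

# Assumed (illustrative) prefixes of non-prose comments: shebangs,
# encoding declarations, and linter or type-checker directives.
DIRECTIVE = re.compile(r"!|-\*-|noqa|type:|pylint:")


def extract_comments(source: str) -> list[str]:
    """Return the text of every `#` comment in a Python source string.

    Using the tokenizer (rather than searching for '#') avoids treating
    '#' characters inside string literals as comments.
    """
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string.lstrip("#").strip())
    return comments


def looks_like_prose(comment: str) -> bool:
    """Hypothetical naive filter: keep comments that resemble English text."""
    if DIRECTIVE.match(comment):
        return False
    # Require at least three alphabetic words (arbitrary threshold).
    if len(re.findall(r"[A-Za-z]+", comment)) < 3:
        return False
    # A comment that parses as Python is probably commented-out code.
    try:
        ast.parse(comment)
    except (SyntaxError, ValueError):
        return True
    return False


SAMPLE = '''#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Compute the total price including tax.
total = price * 1.2  # noqa: E501
# result = compute(price) + tax
url = "http://example.com/#anchor"  # TODO
'''

prose = [c for c in extract_comments(SAMPLE) if looks_like_prose(c)]
print(prose)
```

On this small sample, six comments are extracted but only one survives the filter, which illustrates how much of a raw comment corpus can be something other than English sentences. A more in-depth filter would also need to handle docstrings, license headers, and multi-line comment blocks, none of which this sketch attempts.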

