PENTACET data – 23 Million Contextual Code Comments and 500,000 SATD comments

03/24/2023
by Murali Sridharan, et al.

Most Self-Admitted Technical Debt (SATD) research relies on explicit SATD features such as 'TODO' and 'FIXME' for SATD detection. A closer look reveals that several SATD studies use simple ('Easy to Find') SATD code comments without their contextual data (the preceding and succeeding source code context). This work addresses this gap with the PENTACET (or 5C) dataset. PENTACET is a large dataset of Curated Contextual Code Comments per Contributor and the most extensive SATD dataset. We mine 9,096 Open Source Software Java projects totaling 435 million LOC. The outcome is a dataset with 23 million code comments, the preceding and succeeding source code context for each comment, and more than 500,000 comments labeled as SATD, including both 'Easy to Find' and 'Hard to Find' SATD. We believe PENTACET data will further SATD research using Artificial Intelligence techniques.
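To make the idea of a contextual comment concrete, here is a minimal sketch in Python. It is not the authors' mining pipeline: the `ContextualComment` record, the `extract_line_comments` function, the `window` parameter, and the two-keyword pattern are all hypothetical names and simplifications chosen for illustration. It pairs each whole-line `//` comment in a Java snippet with a few lines of preceding and succeeding code, and flags only explicit-keyword ('Easy to Find') SATD.

```python
import re
from dataclasses import dataclass

# Assumption: only the two explicit keywords named in the abstract are matched.
EXPLICIT_SATD = re.compile(r"\b(TODO|FIXME)\b", re.IGNORECASE)


@dataclass
class ContextualComment:
    """Hypothetical record: one comment plus its surrounding source lines."""
    comment: str
    preceding: list[str]
    succeeding: list[str]
    easy_to_find_satd: bool


def extract_line_comments(source: str, window: int = 2) -> list[ContextualComment]:
    """Collect whole-line `//` comments with up to `window` lines before and after.

    Simplification: block comments and trailing comments are ignored.
    """
    lines = source.splitlines()
    records = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped.startswith("//"):
            continue
        text = stripped[2:].strip()
        records.append(
            ContextualComment(
                comment=text,
                preceding=lines[max(0, i - window):i],
                succeeding=lines[i + 1:i + 1 + window],
                easy_to_find_satd=bool(EXPLICIT_SATD.search(text)),
            )
        )
    return records
```

A keyword match like this can only surface 'Easy to Find' SATD. 'Hard to Find' SATD carries no such marker, which is where the surrounding code context becomes useful to a learned model.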


