Pitfalls and Guidelines for Using Time-Based Git Data

09/09/2022
by Samuel W. Flint, et al.

Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data, however, time-based data is often dirty. To date, no studies quantify how frequently the software engineering research community uses such data, investigate the sources of dirty time-based data, or quantify how often it is dirty. Depending on the research task and method used, including such dirty data could affect the research results. This paper presents an extended survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Out of the 754 technical track and data papers published in MSR 2004–2021, we saw at least 290 (38%) papers that utilized time-based data. We also observed that most time-based data used in research papers comes in the form of Git commits, often from GitHub. Based on those results, we then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty Git timestamp data. Finally, we provide guidelines and best practices for researchers utilizing time-based data from Git repositories.
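
To make the idea of "dirty" Git timestamps concrete, the following is a minimal sketch of the kind of sanity check a researcher might run before using commit times. It is not the paper's method: the `Commit` record, the function names, and the three checks (a timestamp at or before the Unix epoch, a timestamp later than a reference time, and a commit dated earlier than one of its parents) are assumptions chosen for illustration. A flagged commit is only suspicious, not necessarily wrong, since clock skew and history rewriting can produce unusual but genuine values.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """Minimal commit record (hypothetical; not part of any Git library)."""
    sha: str
    timestamp: int                      # seconds since the Unix epoch
    parent_shas: Tuple[str, ...] = ()   # SHAs of the parent commits


def suspicious_reasons(commit: Commit,
                       by_sha: Dict[str, Commit],
                       now_ts: int) -> List[str]:
    """Return labels describing why a commit's timestamp looks suspicious."""
    reasons = []
    if commit.timestamp <= 0:
        reasons.append("epoch-or-earlier")
    if commit.timestamp > now_ts:
        reasons.append("in-the-future")
    for parent_sha in commit.parent_shas:
        parent = by_sha.get(parent_sha)
        # Parents missing from the data set are skipped, not flagged.
        if parent is not None and parent.timestamp > commit.timestamp:
            reasons.append("older-than-parent")
            break
    return reasons


def flag_commits(commits: Iterable[Commit],
                 now_ts: Optional[int] = None) -> Dict[str, List[str]]:
    """Map each suspicious commit's SHA to the reasons it was flagged."""
    commits = list(commits)
    if now_ts is None:
        now_ts = int(datetime.now(timezone.utc).timestamp())
    by_sha = {c.sha: c for c in commits}
    flagged = {}
    for c in commits:
        reasons = suspicious_reasons(c, by_sha, now_ts)
        if reasons:
            flagged[c.sha] = reasons
    return flagged


if __name__ == "__main__":
    history = [
        Commit("a", 1_600_000_000),
        Commit("b", 0, ("a",)),
        Commit("c", 1_600_000_100, ("a",)),
        Commit("d", 4_000_000_000, ("c",)),
    ]
    for sha, reasons in flag_commits(history, now_ts=1_700_000_000).items():
        print(sha, reasons)
```

One way to obtain such records is `git log --format='%H %ct %P'`, which prints each commit's hash, committer time as a Unix timestamp, and parent hashes. Whether to drop, correct, or merely report flagged commits depends on the research question.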
