How are Software Repositories Mined? A Systematic Literature Review of Workflows, Methodologies, Reproducibility, and Tools

by Adam Tutko, et al.

With the advent of open source software, a veritable treasure trove of previously proprietary software development data was made available. This opened the field of empirical software engineering research to anyone in academia. Data that is mined from software projects, however, requires extensive processing and needs to be handled with utmost care to ensure valid conclusions. Since software development practices and tools have changed over two decades, we aim to understand the state-of-the-art research workflows and to highlight potential challenges. We employ a systematic literature review by sampling over one thousand papers from leading conferences and by analyzing the 286 most relevant papers from the perspective of data workflows, methodologies, reproducibility, and tools. We found that an important part of the research workflow involving dataset selection was particularly problematic, which raises questions about the generality of the results in existing literature. Furthermore, we found that a considerable number of papers provide little or no reproducibility instructions – a substantial deficiency for a data-intensive field. In fact, 33% of papers provide no information on how their data was retrieved. Based on these findings, we propose ways to address these shortcomings via existing tools and also provide recommendations to improve research workflows and the reproducibility of research.


Improving Software Engineering in Biostatistics: Challenges and Opportunities

Programming is ubiquitous in applied biostatistics; adopting software en...

The Use of Public Data and Free Tools in National CSIRTs' Operational Practices: A Systematic Literature Review

Many CSIRTs, including national CSIRTs, routinely use public data, inclu...

Replicability Study: Corpora For Understanding Simulink Models Projects

Background: Empirical studies on widely used model-based development too...

Evaluation Methodologies in Software Protection Research

Man-at-the-end (MATE) attackers have full control over the system on whi...

The Evolution of Code Review Research: A Systematic Mapping Study

Code Review (CR) is a cornerstone for Quality Assurance within software ...

State of the Practice for GIS Software

We present a reproducible method to analyze the state of software develo...

Reproducible Domain-Specific Knowledge Graphs in the Life Sciences: a Systematic Literature Review

Knowledge graphs (KGs) are widely used for representing and organizing s...
