Fingerprinting and Building Large Reproducible Datasets

06/20/2023
by Romain Lefeuvre, et al.

Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large-scale mining tasks hinders the creation of new datasets. Moreover, data sources that change over time (e.g., code bases) and undocumented extraction processes make datasets difficult to reproduce. This threatens the quality and reproducibility of empirical studies. In this paper, we propose a tool-supported approach that facilitates the creation of large, tailored datasets while ensuring their reproducibility. We leverage all the sources feeding the Software Heritage append-only archive, which are accessible through a unified programming interface, to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint that characterizes a dataset and that, when provided to the extraction process, ensures that the same dataset will be extracted. We demonstrate the feasibility of our approach by implementing a prototype and show how it can help reduce the limitations researchers face when creating or reproducing datasets.
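To make the fingerprint idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes that a fingerprint can be modeled as a hash over the extraction query plus a cutoff timestamp on an append-only archive. The names `dataset_fingerprint`, `extract`, the record fields, and the in-memory `archive` list are all hypothetical; the sketch does not use the Software Heritage API.

```python
import hashlib
import json


def dataset_fingerprint(query, archive_timestamp):
    """Hash a canonical encoding of the extraction parameters (hypothetical scheme)."""
    payload = {"archive_timestamp": archive_timestamp, "query": query}
    # Sorting keys makes the encoding independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def extract(archive, query, archive_timestamp):
    """Return records added up to the cutoff whose fields match the query."""
    return [
        rec for rec in archive
        # ISO-8601 UTC strings in one fixed format compare correctly as text.
        if rec["added"] <= archive_timestamp
        and all(rec.get(key) == value for key, value in query.items())
    ]


# Toy stand-in for an append-only archive (illustrative data only).
archive = [
    {"origin": "repo-a", "language": "Java", "added": "2023-01-10T00:00:00Z"},
    {"origin": "repo-b", "language": "Python", "added": "2023-02-01T00:00:00Z"},
]
query = {"language": "Java"}
cutoff = "2023-03-01T00:00:00Z"

fingerprint = dataset_fingerprint(query, cutoff)
before = extract(archive, query, cutoff)

# The archive grows later; records are only ever appended, never modified.
archive.append(
    {"origin": "repo-c", "language": "Java", "added": "2023-06-20T00:00:00Z"}
)
after = extract(archive, query, cutoff)
```

Because the archive is append-only and the cutoff is part of the extraction parameters, re-running the extraction with the same query and cutoff selects the same records even after new content has been archived. This is the property the abstract attributes to the fingerprint, shown here under the stated assumptions.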


research
03/02/2021

Stop Building Castles on a Swamp! The Crisis of Reproducing Automatic Search in Evidence-based Software Engineering

The evidence-based approach has increasingly been employed to synthesize...
research
04/10/2018

Protocol and Tools for Conducting Agile Software Engineering Research in an Industrial-Academic Setting: A Preliminary Study

Conducting empirical research in software engineering industry is a proc...
research
11/16/2017

Software Metric Framework

Many researchers have criticized the field of Software Complexity metric...
research
09/01/2022

A large dataset of software mentions in the biomedical literature

We describe the CZ Software Mentions dataset, a new dataset of software ...
research
02/24/2022

Should I Get Involved? On the Privacy Perils of Mining Software Repositories for Research Participants

Mining Software Repositories (MSRs) is an evidence-based methodology tha...
research
05/19/2023

Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice

Software engineering techniques are increasingly relying on deep learnin...
