Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation

12/20/2020
by Michael F. Bosu, et al.

Data are a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study, we assess the quality of 13 datasets that have been used extensively in research on software effort estimation. The quality issues considered in this article draw on a taxonomy that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions are as follows: (1) an evaluation of the "fitness for purpose" of these commonly used datasets and (2) an assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template that could be used both to improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the availability and use of higher-quality datasets.

research
06/11/2021

A Taxonomy of Data Quality Challenges in Empirical Software Engineering

Reliable empirical models such as those used in software effort estimati...
research
05/23/2021

Data Quality in Empirical Software Engineering: A Targeted Review

Context: The utility of prediction models in empirical software engineer...
research
01/26/2021

Software Effort Estimation Accuracy Prediction of Machine Learning Techniques: A Systematic Performance Evaluation

Software effort estimation accuracy is a key factor in effective plannin...
research
07/06/2020

Incorrect Data in the Widely Used Inside Airbnb Dataset

Several recently published papers in Decision Support Systems discussed ...
research
11/20/2019

The Evolution of Code Review Research: A Systematic Mapping Study

Code Review (CR) is a cornerstone for Quality Assurance within software ...
research
01/31/2019

Methods to Evaluate Lifecycle Models for Research Data Management

Lifecycle models for research data are often abstract and simple. This c...
research
03/26/2019

The Personal Software Process, Experiences from Denmark

Software process improvement (SPI) research and practice is transforming...
