Data Quality in Empirical Software Engineering: A Targeted Review

05/23/2021
by Michael Franklin Bosu, et al.
Context: The utility of prediction models in empirical software engineering (ESE) relies heavily on the quality of the data used to build them. Data quality challenges such as noise, incompleteness, outliers and duplicate data points may all be relevant in this regard.

Objective: We investigate the reporting of three potentially influential elements of data quality in ESE studies: data collection, data pre-processing, and the identification of data quality issues. This enables us to establish how researchers view the topic of data quality and which mechanisms are being used to address it. Greater awareness of data quality should inform both the sound conduct of ESE research and the robust practice of ESE data collection and processing.

Method: We performed a targeted literature review of ESE studies covering the period January 2007 to September 2012. A total of 221 relevant studies met our inclusion criteria and were characterized in terms of their consideration and treatment of data quality.

Results: We obtained useful insights into how the ESE community considers these three elements of data quality. Only 23 of the 221 studies reported on all three elements of data quality considered in this paper.

Conclusion: Data collection procedures are not reported consistently in ESE studies. It would be useful for data collection challenges to be reported, in order to improve our understanding of why there are problems with software engineering data sets and the models developed from them. More generally, data quality should be given far greater attention by the community. The improvement of data sets through enhanced data collection, pre-processing and quality assessment should lead to more reliable prediction models, thus improving the practice of software engineering.
