Hazards in Deep Learning Testing: Prevalence, Impact and Recommendations

09/11/2023
by Salah Ghamizi, et al.

Much research on Machine Learning testing relies on empirical studies to evaluate and demonstrate the potential of the proposed techniques. In this context, however, empirical results are sensitive to a number of parameters that can adversely affect the experiments and potentially lead to wrong conclusions (Type I errors, i.e., incorrectly rejecting the null hypothesis). To this end, we survey the related literature and identify 10 commonly adopted empirical evaluation hazards that may significantly impact experimental results. We then perform a sensitivity analysis, against our hazard set, of 30 influential studies published in top-tier SE venues and demonstrate the criticality of these hazards. Our findings indicate that all 10 hazards we identify can invalidate experimental findings, such as those made in the related literature, and should be handled properly. Going a step further, we propose a set of 10 good empirical practices that have the potential to mitigate the impact of the hazards. We believe our work forms a first step towards raising awareness of the common pitfalls and good practices within the software engineering community, and hopefully contributes towards setting concrete expectations for empirical research in the field of deep learning testing.
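To make one such hazard concrete, below is a minimal, hypothetical sketch (all names and numbers are illustrative assumptions, not taken from the paper): two techniques with identical true performance whose measured scores vary with the random seed, showing how a single-seed comparison invites a Type I error while aggregating over repeated runs does not.

```python
import random
import statistics

# Hypothetical sketch: two techniques with *identical* true performance;
# each measured score depends on the random seed used for the run.

def run_technique(seed):
    """Stand-in for a full train/evaluate run; returns a noisy score."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.02)  # true score 0.80 plus seed noise

seeds = range(10)
scores_a = [run_technique(s) for s in seeds]        # "technique A" runs
scores_b = [run_technique(100 + s) for s in seeds]  # "technique B" runs

# Hazard: a single-seed comparison can suggest a difference that is pure noise.
print(f"single seed: A={scores_a[0]:.3f} vs B={scores_b[0]:.3f}")

# Mitigation: repeat over seeds and report mean +/- standard deviation
# before claiming that one technique outperforms the other.
print(f"A: {statistics.mean(scores_a):.3f} +/- {statistics.stdev(scores_a):.3f}")
print(f"B: {statistics.mean(scores_b):.3f} +/- {statistics.stdev(scores_b):.3f}")
```

With enough repetitions the aggregated scores converge and the apparent single-run gap disappears; the usual next step would be a statistical test over the two samples (e.g., scipy.stats.mannwhitneyu) before rejecting the null hypothesis.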
